Building confidence in climate model projections: an analysis of inferences from fit

Climate model projections are used to inform policy decisions and constitute a major focus of climate research. Confidence in climate projections relies on the adequacy of climate models for those projections. The question of how to argue for the adequacy of models for climate projections has not gotten sufficient attention in the climate modeling community. The most common way to evaluate a climate model is to assess in a quantitative way degrees of ‘model fit’; that is, how well model results fit observation‐based data (empirical accuracy) and agree with other models or model versions (robustness). However, such assessments are largely silent about what those degrees of fit imply for a model's adequacy for projecting future climate. We provide a conceptual framework for discussing the evaluation of the adequacy of models for climate projections. Drawing on literature from philosophy of science and climate science, we discuss the potential and limits of inferences from model fit. We suggest that support of a model by background knowledge is an additional consideration that can be appealed to in arguments for a model's adequacy for long‐term projections, and that this should explicitly be spelled out. Empirical accuracy, robustness and support by background knowledge neither individually nor collectively constitute sufficient conditions in a strict sense for a model's adequacy for long‐term projections. However, they provide reasons that can be strengthened by additional information and thus contribute to a complex non‐deductive argument for the adequacy of a climate model or a family of models for long‐term climate projections. WIREs Clim Change 2017, 8:e454. doi: 10.1002/wcc.454


INTRODUCTION
There is now broad scientific consensus that the Earth's climate has changed significantly over the last century, that much of the observed large-scale warming can be attributed to greenhouse gas emissions associated with human activities, and that these trends will continue in the near future. 1 Less clear is how exactly the climate will change in the more distant future on the global and on regional scales if greenhouse gas emissions increase, stabilize or decrease in particular ways. Climate predictions that are conditional on forcing scenarios are called climate projections. They tell us how one or more climate characteristics would evolve if greenhouse gas concentrations and other external forcings were to follow specified pathways in the future. Projections are currently a major focus of climate research, which is undertaken (and funded) in part with the aim of providing input to policy decisions for mitigation and adaptation to anthropogenic climate change. To project climate change, climate scientists inevitably have to rely on complex numerical climate models. But to what extent can we trust model-based climate projections?
Climate model projections cannot be directly evaluated. Most of them have forcing scenarios that are never realized; and in contrast to weather forecasts, they are generally too long term to allow repeated direct testing against observational data. Climate projections therefore need to be evaluated indirectly by assessing whether the models that generate them are adequate for this purpose. The question of how we can argue for the claim that a model is adequate for making climate projections of the desired kind has not gotten sufficient attention in the climate modeling community.
The most common way to evaluate a climate model (or at least the one most commonly reported) is to assess in a quantitative way degrees of 'model fit'; that is, how well model results fit observation-based data and how well they agree with results of other models or model versions within model intercomparison projects. However, such assessments of the empirical accuracy and robustness of model results are largely silent about what those instances of fit imply for the trustworthiness of model applications. Less than two of the roughly ninety pages of the Intergovernmental Panel on Climate Change (IPCC) chapter on model evaluation of the most recent report from Working Group I 2 deal with the implications of model fit for climate projections. It is often assumed that the fact that state-of-the-art climate models reproduce many important features of current and past climate reasonably well warrants increased confidence in the model's suitability for quantitative projections, particularly at continental scales and above; and the robustness of climate model projections is typically seen to warrant a further increase in confidence in the projected outcome. [2][3][4] However, the arguments for these assumptions are hardly ever made explicit, and successful instances of model fit are often uncritically interpreted as confirming the models as such. But so-called model performance metrics are simply numbers that quantify the agreement between model results and observation-based data, and it is an open question what they imply for the model's adequacy for a purpose. 5,6 It seems fair to say that it is the transition from statements about quantitative measures of model fit ('model performance metrics') to hypotheses about a model's adequacy for projecting future climate change ('model quality metrics') where climate science struggles. [7][8][9] Or, in simple words: which things need to be 'right' in a model, and how 'right,' so that we can rely on the model to predict a certain quantity of interest?
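Such performance metrics are easy to compute; the difficulty lies in what they license. The sketch below, with invented numbers for a hypothetical ten-year GMST hindcast, shows two common metrics, root-mean-square error and mean bias. The point is that the resulting numbers quantify agreement between model results and observation-based data without, by themselves, saying anything about adequacy for a purpose.

```python
import numpy as np

# Hypothetical annual GMST anomalies (degrees C): model output vs.
# observation-based data for a ten-year hindcast period (invented numbers).
model = np.array([0.12, 0.18, 0.15, 0.22, 0.25, 0.21, 0.30, 0.28, 0.33, 0.36])
obs   = np.array([0.10, 0.20, 0.13, 0.25, 0.22, 0.24, 0.27, 0.31, 0.30, 0.38])

def rmse(a, b):
    """Root-mean-square error: one common 'performance metric'."""
    return float(np.sqrt(np.mean((a - b) ** 2)))

def bias(a, b):
    """Mean error: another common performance metric."""
    return float(np.mean(a - b))

print(f"RMSE: {rmse(model, obs):.3f} C, bias: {bias(model, obs):+.3f} C")
```

Both numbers measure fit to the hindcast period only; nothing in their definition refers to the projection the model is ultimately meant to deliver.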
In this article, we provide a conceptual framework for discussing the evaluation of a model's adequacy for long-term climate projections. To develop this framework, we firstly argue that model evaluation must be specific to the purpose the model is intended to serve. We then discuss the potential and limits of inferences from model fit; that is, from empirical accuracy and from robustness of model results. Moreover, we suggest that support of a model by background knowledge is an additional consideration that we can appeal to in arguments for a model's adequacy for climate projections. Empirical accuracy, robustness and support by background knowledge neither individually nor collectively constitute sufficient conditions in a strict logical sense for a model's adequacy for long-term projections. However, they provide reasons that can be strengthened by additional information and thus contribute to a non-deductively strong argument for the adequacy of a climate model for long-term climate projections. Finally, we summarize the suggested framework (see Box 3 at the end of the article) and discuss how it can be applied to purposes other than long-term projections and areas outside climate. (Box 1 provides some terminological clarifications.)

BOX 1 EVALUATION, VALIDATION, CONFIRMATION, VERIFICATION
We use 'evaluation' as an umbrella category for all kinds of epistemic assessments of models, model parts and model or simulation results.
• The kind of model evaluation we are primarily interested in is the assessment of whether a model (or a family of models) is adequate for a particular purpose. The purposes we focus on are climate projections; model evaluation is the basis for assessing which degree of confidence (or belief) in these projections is warranted.
• 'Model validation' is sometimes used in the way we use 'model evaluation.' Winsberg, 10 for example, defines model validation as the process of determining whether a model is a good enough representation of its target system for the purpose of the model. Whereas climate modelers prefer 'evaluation,' hydrologists often use 'validation.' 11
• 'Model confirmation' is often restricted to the assessment of a model's empirical accuracy and thus to the question of how well a model is supported by data (Ref 12, p. 967). If confirmation is understood in this narrow sense, then one of our key claims is that model evaluation involves more than model confirmation, namely considerations of robustness and of coherence with background knowledge.
• In general philosophy of science, 'verification' of a hypothesis is the demonstration of its truth; the opposite is falsification. a In climate science, modeling communities and the philosophy of computer simulation, 'verification' is typically understood as the determination of whether the output of the simulation approximates the true solutions to the differential equations of the original model. 10 While verification in the first sense is out of reach for climate models and fundamentally for any model that describes an open system, 15 verification in the second sense is not our focus.
Evaluation, validation and confirmation come in degrees and can be expressed in different ways, for example as:
• Comparative evaluation: an increase in confidence in a model projection or a model's adequacy for such a projection is (not) warranted.
• Absolute evaluation: the warranted confidence in a model projection or a model's adequacy for such a projection is (not) above some threshold, or (not) sufficient, for example, for using the projection for a particular decision.


ADEQUACY FOR PURPOSE

The Need for Purpose-specific Evaluation
Climate models are representations of aspects of the climate system. They are based on fundamental physical laws (e.g., of fluid dynamics and conservation laws), which are expressed as mathematical equations. Since these equations cannot be solved analytically, high-performance computers are used to approximate solutions numerically on a spatial and temporal grid. To do so, the equations must be discretized and turned into computer code. Some processes (such as cloud formation) are important for the model results but cannot be explicitly represented, since they either occur on a smaller scale than can be resolved in the model, or are too complex to be modeled in detail, or are not well-enough understood physically. They are included via a parameterization that estimates the net effect of such unresolved processes from the large-scale quantities available. For example, cloud cover and its effect on radiation can be estimated from atmospheric humidity, vertical stability, temperature and the particles available for condensation, but without actually simulating how clouds form. The climate model consists of the computer code, or the discretized equations and assumptions it encodes. A model simulation consists of running the model within a particular computational environment and with specific initial and boundary conditions (Ref 4, p. 152). Initial conditions specify the state of the climate system at the beginning of the simulation. Boundary conditions describe the external forcings and factors that affect the climate system but are not directly simulated by the model. They are the drivers of climate change and include, for example, concentrations of greenhouse gases in the atmosphere at a given time, the amount of aerosols and the amount of solar radiation received by the Earth. 16 Like any numerical model of a complex system, climate models represent their target in an idealized way.
An idealization is a deliberate simplification of something complicated in order to make it more comprehensible and tractable. Two types of idealization have received much attention in philosophical debates: so-called Aristotelian and Galilean idealizations. 17,18 Aristotelian idealization amounts to abstracting from all properties that are believed not to make a difference to the essential character of the system (e.g., describing planets as objects only having shape and mass). Galilean idealization involves the deliberate introduction of distortions into models in order to make them mathematically and computationally tractable (e.g., describing objects as point masses moving on frictionless planes). 19 Both types of idealization are ubiquitous in climate modeling. They are involved in the construction of conceptual models that represent the relevant components of the climate system and their interaction in a qualitative way, as well as in the quantitative formulation of model equations, their discretization and their implementation on a computer. Even complex global circulation models (GCMs) that represent a wide range of atmospheric, oceanic, sea ice and land processes leave out many features of the climate system (e.g., biogeochemical, biological or ecological processes) and distort in different ways the processes and aspects they do represent in order to make them mathematically and computationally tractable. GCMs include simplifying assumptions at the level of basic physical principles, for example, the continuum assumption that is necessary to apply the most common mathematical equations of fluid dynamics. This assumption ignores the fact that physical matter is made up of individual particles that collide with one another, and assumes that properties such as pressure and temperature vary continuously. Most obvious are idealizations at the level of parameterizations, which account for the effect of processes that are too complex, too small, or not well-enough understood to be modeled in detail. In some cases, a theory is available but it would be too computationally expensive to fully incorporate it into the model (e.g., atmospheric radiative transfer); in other cases, no adequate general theory exists and the parameterizations need to be based on empirical measurements (Ref 4, p. 152). Aristotelian idealizations lead to incomplete representations, Galilean idealizations to models that partly misrepresent their target system. Galilean idealization is often seen as merely pragmatic and bound up with the expectation that advances in computational power, physical understanding and mathematical techniques should lead to de-idealization and more accurate models (Ref 18, p. 261; Ref 20, p. 641). There is indeed a tendency to work towards less idealized models by removing simplifying assumptions, adding back complexities and replacing parameterizations with explicit representations. However, such efforts run up against limits, since computer resources, physical understanding and mathematical techniques will always be finite. Even within these limits, models may get too expensive to operate, too tedious to maintain and too complicated to handle and to understand.
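To make the idea of a parameterization concrete, the toy scheme below diagnoses fractional cloud cover from grid-mean relative humidity alone, in the spirit of Sundqvist-type closures. The functional form and the critical humidity value are illustrative assumptions for this sketch, not any particular climate model's scheme.

```python
import math

def cloud_fraction(rh, rh_crit=0.8):
    """Diagnose fractional cloud cover from grid-mean relative humidity.

    A Sundqvist-type closure: no cloud below a critical humidity,
    full cover at saturation. The critical value rh_crit is a tunable
    parameter, illustrating why such schemes must be calibrated.
    """
    if rh <= rh_crit:
        return 0.0
    if rh >= 1.0:
        return 1.0
    return 1.0 - math.sqrt((1.0 - rh) / (1.0 - rh_crit))

# The scheme estimates the net effect of unresolved cloud processes from
# a resolved large-scale quantity, without simulating how clouds form.
for rh in (0.7, 0.85, 0.95, 1.0):
    print(f"RH = {rh:.2f} -> cloud fraction = {cloud_fraction(rh):.2f}")
```

The scheme replaces an unresolved process with a closed-form estimate from resolved quantities; the tunable parameter `rh_crit` is exactly the kind of poorly constrained quantity that calibration, discussed later, has to fix.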
While some idealizations are simply the best science can currently do, others are preferable to more accurate representations of their target, for example, because they enhance our understanding of the target by highlighting those factors that make a difference to the essential character of the system. [21][22][23] The aim in climate modeling, as in modeling in general, is thus not (and cannot be) to arrive at a complete representation of the climate system that is correct in all details. The aim is rather to construct models that represent processes of the climate system in ways that make the model adequate for specific purposes. 9,11,24 These purposes involve answering a limited range of questions about the target system, for example: under a given forcing scenario, would Earth's global mean surface temperature (GMST) in 2100 be more than 2°C warmer than it was in a preindustrial state? To what extent is the global warming of the last fifty years due to human causes? Does climate change affect extreme summer precipitation over the Alps? 25 Whether idealizations are adequate depends on the question about the target system we want to answer. The evaluation of climate models thus needs to be specific to the purposes for which they are used, because these purposes determine which processes and aspects a model has to represent, on what spatial and temporal scale, and how accurately.

How to Conceive of Adequacy for Purpose
The idea of assessing a model's adequacy for a given purpose has been advocated for models in general. 26,27 In connection with climate models, the idea is often mentioned (e.g., Ref 16); it has been defended by Parker 9 and Knutti 11 and criticized by Katzav. 28 But what does it mean for a model to be adequate for a purpose?
The purpose of a model is to provide information about the target system that allows model users to formulate hypotheses that satisfactorily answer their questions about the target system. These hypotheses are of different types (e.g., projection versus explanation), concern different climate variables (e.g., temperature versus precipitation) or events (e.g., storms versus sea-level rise) on different temporal (short-versus long-term) and spatial scales (global versus regional), are more or less specific (e.g., trends versus absolute numbers) and allow for different error margins.
A particular climate model can be adequate for some purposes, but not for others. For instance, a model may adequately predict within a specified error margin the GMST increase by 2100 relative to certain initial conditions under a certain forcing scenario, but it may not be adequate for predictions of changes in precipitation patterns in the Mediterranean area between 2050 and 2100. There is widespread consensus that projections are more reliable for temperature than for precipitation and most other quantities, for longer time averages, larger spatial averages and lower specificity, and, all other things being equal and for absolute changes, for shorter lead times (Ref 12, p. 970). The criteria just mentioned for specifying a purpose are crucial for determining whether a model that is adequate for one purpose is also adequate for another. This will be the case if the purposes are similar in the relevant respects.
A climate model can be adequate for a range of purposes, but often different purposes require different types of models. A simple energy balance model provides a good qualitative understanding of the greenhouse effect and a reasonable estimate of GMST trends; simple models are also used to explore scenarios and for probabilistic projections; 29,30 Earth system models of intermediate complexity are often used for paleoclimate simulations that extend over thousands of years; 31 high-resolution GCMs and Earth system models (ESMs) are used to simulate climate change over the 20th and 21st centuries; and regional climate models with an even higher resolution of 10 to 20 km (compared to the typical 100 to 300 km resolution of GCMs and ESMs) are used for climate projections on a regional or even local scale. 25,32,33 Moreover, even for a particular purpose, different models can be adequate.
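The simple energy balance models mentioned above fit in a few lines of code. The sketch below integrates the standard zero-dimensional anomaly model C dT/dt = F(t) − λT with forward Euler under a linear forcing ramp; the parameter values are illustrative round numbers of the right order of magnitude, not calibrated estimates.

```python
# Zero-dimensional energy balance model for the GMST anomaly T (in K):
#     C * dT/dt = F(t) - lambda_ * T
# where F is the radiative forcing (W m^-2) and lambda_ the climate
# feedback parameter (W m^-2 K^-1). Parameter values are illustrative.

SECONDS_PER_YEAR = 3.15e7
C = 8.36e8          # effective heat capacity, J m^-2 K^-1 (~200 m ocean)
lambda_ = 1.2       # feedback parameter, W m^-2 K^-1

def run_ebm(forcing, dt_years=1.0):
    """Integrate the EBM with forward Euler; returns yearly T anomalies."""
    t_anom, out = 0.0, []
    dt = dt_years * SECONDS_PER_YEAR
    for f in forcing:
        t_anom += dt * (f - lambda_ * t_anom) / C
        out.append(t_anom)
    return out

# Linear forcing ramp to 3.7 W m^-2 (roughly doubled CO2) over 100 years.
forcing = [3.7 * (year + 1) / 100 for year in range(100)]
temps = run_ebm(forcing)
print(f"warming after 100 years: {temps[-1]:.2f} K")
```

Even this trivial model illustrates adequacy for purpose: it gives a reasonable qualitative GMST trend, but it plainly cannot answer questions about, say, regional precipitation.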
A model is adequate for a purpose in the relevant sense if it conveys information about the target system that allows model users to formulate hypotheses that correctly answer their questions about the target, not by chance or luck, but because the model has properties that make it suitable for the purpose at issue (Ref 9, p 236, fn. 6). But what are those properties?
To be adequate for long-term projections, a climate model must represent sufficiently well (i.e., with sufficient precision and detail) all physical, chemical and biological processes of the climate system that are relevant for the evolution of the climate variables of interest at the space and time scales of interest. b This means that idealizations must be such that they do not substantially affect the results (relative to a perfect, complete model) for the purpose at issue. We do not require that the model actually resolves all relevant climate processes, but if it incorporates unresolved processes via parameterization, it becomes more questionable whether the model represents the processes accurately enough beyond the range where data are available. 11 Whether processes can be explicitly resolved or need to be parameterized depends (among other things) on the spatial resolution, which needs to be appropriate for the projection in question. However, a model may represent all climate processes underlying the projection of interest fairly well at an appropriate spatial resolution, but may still not be adequate to be used for this projection.
For instance, the model may be too computationally costly to be run on available computers and may thus not allow the model users to formulate hypotheses that answer their questions about the target system. Thus, besides properties of models that are important for epistemic reasons, there are also properties that meet practical requirements. When constructing a model, we therefore inevitably face trade-offs in the performance or quality of different aspects or properties of the model (Ref 37, p. 797). The relevance of each of these properties for the purpose of the model provides a reasonable criterion for how to decide on trade-offs.
In applied contexts, the hypotheses investigated serve to inform policy decisions and must thus provide the information needed by society in order to properly inform decision-making. This requires that idealizations in model construction account for social aims and values. 38 For instance, since considerations of justice count in decision-making, we need information about the spatial and temporal distribution of effects, rather than about aggregated effects. 39 Criteria that guide decisions on idealizations related to the various steps in model construction need to be made explicit. Otherwise, there is a threat of an opaque influence of political interests. 40 And even when researchers base their decisions on criteria such as how well they can handle each of the alternatives or how familiar they are with each of the options, 41 it is necessary to explicitly consider whether these decisions are appropriate for the purpose in question.
In the remaining parts, we focus on the epistemic requirement that a model needs to represent the relevant processes sufficiently well rather than on practical and societal requirements. The next two sections examine how assessments of model fit can be used to evaluate the adequacy of a climate model for long-term climate projections.

FIT TO DATA: EMPIRICAL ACCURACY
The most straightforward approach to evaluate models is to assess their empirical accuracy; that is, the degree to which model results fit to observed or observation-based data. Simulation results are routinely compared by means of statistical measures with the current mean state in different variables (e.g., the monthly rainfall at each location), the variability in different variables (e.g., the magnitude and timescale of the El Niño Southern Oscillation in the tropical Pacific), recent observed trends and patterns in many variables (e.g., the decline of Arctic sea ice), the response to specific perturbations (like large volcanic eruptions), or with more distant past climate states (e.g., the climate response to solar variations in the Holocene, or the climate of the last ice age, or periods even further back). 2

Reliability of Data and Fit to Data
Climate scientists extensively deal with the questions of how well climate model results fit to data and how reliable observation-based data are. Let us review some of the associated difficulties. To begin with, the fit of model results to data is far from perfect. Each of today's climate models is known to give many results that do not come close to matching data within observational uncertainty and variability (i.e., the differences one would expect from random unpredictable variations like weather). Moreover, the fit is not equally good for all variables and scales. It is, for example, less good for precipitation than for temperature, and for regional than for global quantities. [42][43][44] Furthermore, the various ways of obtaining data in the climate sciences raise a host of questions about the reliability of the data. Particularly important ones concern the theory-ladenness of observation and data, which comes in different forms (Ref 45, p. 956). First, since instruments are used to obtain data, the quality of the data depends on the reliability of the instruments and the theory behind them. The data can thus only be trusted if the working of the instruments has been independently tested and confirmed, which is indeed the case for many relevant observations, 46 but is, for example, much harder for satellites that, once deployed, are inaccessible. Second, the unprocessed raw data received directly from the measurement instruments often contain errors and gaps and are incomplete in many ways; models are therefore applied to filter and correct raw data, to infer quantities (e.g., temperature and vegetation cover from irradiance at particular wavelengths) and to extend them to gridded and reanalysis data sets that include values for chosen variables on a regular spatial grid and at regular time intervals.
That these general problems raise complex questions in the climate case is shown, for example, by the controversy about warming trends in the upper troposphere derived from satellite and radiosonde measurements [47][48][49][50] (for an overview see Ref 51). Model-filtered data 52,53 can be trusted to the extent to which the models used to correct and extend the data have been independently tested and confirmed. Third, since often no direct measurements are available, for example of surface temperature in the distant past, scientists gather proxy data from natural (or human) systems that are affected by temperature, such as ocean sediments, tree rings and ice cores, or tax records from grape harvests. The quality of the data depends on the availability of the proxy and on the reliability of the statistical methods used to process the raw proxy data and turn it into the data one is interested in, for example, to turn a tree-ring width or density into a temperature record. Finally, for many quantities there are only spatially sparse or short periods of observations, or none at all, because the quantity was or is too hard or too expensive to measure (e.g., the conditions at the ground of an ice sheet three kilometers thick), or because nobody cared about systematically measuring it at the time (e.g., the properties of seawater in the deep ocean before about 1950).
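The statistical step from raw proxy data to a climate record can be illustrated with a minimal sketch: a linear transfer function fitted by least squares to a hypothetical overlap period of tree-ring widths and instrumental temperatures, then applied to pre-instrumental rings. All numbers are invented, and real reconstructions use far more elaborate methods and uncertainty estimates.

```python
import numpy as np

# Hypothetical calibration data: tree-ring width indices and instrumental
# summer temperatures (degrees C) for an overlap period (invented numbers).
ring_width = np.array([0.80, 0.95, 1.10, 1.02, 1.25, 1.18, 1.30])
temperature = np.array([14.1, 14.6, 15.2, 14.9, 15.8, 15.5, 16.0])

# Fit a linear transfer function T = a * width + b by least squares.
a, b = np.polyfit(ring_width, temperature, deg=1)

def reconstruct(width):
    """Estimate temperature from a ring width via the transfer function."""
    return a * width + b

# Apply the transfer function to a pre-instrumental ring-width series.
old_rings = np.array([0.85, 0.92, 1.05])
print([round(reconstruct(w), 2) for w in old_rings])
```

The reconstruction inherits the reliability of the statistical method and of the assumption that the fitted relationship also held before the instrumental period, which is exactly where the worries discussed above enter.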
Despite these difficulties, it is safe to say that with each generation, models continue to better represent many aspects of the mean climate state and variability that we can observe, 54,55 and the amount and quality of data improves as well. But how can assessments of the empirical accuracy of models be used to argue for their adequacy for a particular purpose?

Determining Observational Consequences
Drawing inferences from the empirical accuracy of model results to the adequacy of a model for a purpose is less straightforward than it might first seem. This is so even if, for the moment, we ignore problems with observations. The reason is that a model need not be empirically accurate in every respect and degree to be adequate for a specific purpose. Instances of fit (or misfit), for example, do not support (or undermine) a model's adequacy for projecting the value of some variable if they concern a quantity that is unrelated to the variable of interest.
The challenge then is to determine which instances of fit do support and which instances of misfit do undermine an adequacy-for-purpose hypothesis. This involves the challenge of determining what the model is likely to indicate about the observable properties of the target system if the model is adequate for the purpose of interest. 9 What we are looking for, then, are conditional claims of the following form: If a model is adequate for purpose P, then it reliably indicates the values of variables V for time T with accuracy α. In order to evaluate the model's adequacy for P, we need to check how well what is actually observed fits with the observational consequences that are likely to follow if the model is adequate for P.
Determining the observational consequences that are likely to follow if a model is adequate for projecting the values of some variable X is straightforward for the quantity of interest for past and present. It is often reasonable to claim that if a model is adequate for projecting within a specified error margin the values of X for some period in the far future (e.g., GMST for each year of the 2050s within 0.3°C under a given forcing scenario), then the model reliably indicates values of X for recent past and present with at least that accuracy. Since climate quantities

From Deductive to Non-Deductive Inferences
The conditional claims mentioned in the previous section roughly say that reliably indicating the values of some variables for past and present is a necessary condition for a model to be adequate for projecting X for the far future. If this necessary condition is not fulfilled, it deductively follows that the model is not adequate for projecting X (see Box 2); hence, the adequacy-for-purpose hypothesis is falsified. 56 For example, if we learn that our climate model only rarely indicates GMST of the recent past within 0.3°C of those derived from observation-based data, we can infer that the model is not adequate for projecting GMST values for the 2050s within 0.3°C. However, if the necessary condition is fulfilled, it does not follow that the model is adequate for projecting X. For this, the condition would need to be sufficient. As we will argue below, reliably reproducing past and present climate is not a sufficient condition for a model to be adequate for projecting future climate. Hence, we can use conditionals of the mentioned types to construct a deductively sound argument against the hypothesis that a model is adequate for the projection at issue, but the suggested arguments in support of such a hypothesis are either deductively invalid or contain a false premise.
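The modus tollens test just described is easy to operationalize. The sketch below, with invented numbers, checks the necessary condition that the model indicates recent GMST within 0.3°C in at least a specified fraction of years. Failing the check falsifies the adequacy hypothesis; passing it does not establish adequacy.

```python
# Modus tollens check: if the model is adequate for projecting GMST
# within 0.3 C, it should reliably indicate recent GMST within 0.3 C.
# If it does not, the adequacy hypothesis is falsified.
# All numbers here are invented for illustration.

TOLERANCE = 0.3          # degrees C
RELIABILITY = 0.9        # minimal fraction of years within tolerance

model_gmst = [14.32, 14.41, 14.87, 14.55, 15.02, 14.70, 15.21, 14.95]
obs_gmst   = [14.40, 14.45, 14.50, 14.60, 14.58, 14.72, 14.80, 14.90]

hits = sum(abs(m - o) <= TOLERANCE for m, o in zip(model_gmst, obs_gmst))
fraction = hits / len(obs_gmst)

# Failing the necessary condition warrants rejecting adequacy;
# passing it does NOT warrant accepting adequacy (the converse is invalid).
adequacy_falsified = fraction < RELIABILITY
print(f"within tolerance: {fraction:.0%}; falsified: {adequacy_falsified}")
```

The asymmetry in the code mirrors the asymmetry in the argument: the check can only ever deliver the negative verdict deductively.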
Reliably indicating past and present climate cannot ensure that a model is adequate for long-term projections, but under certain conditions, it warrants an increased confidence in those projections. Claims about the empirical accuracy of model results should therefore be understood as premises of a non-deductive argument for the model's adequacy for projections of the desired kind. Non-deductive strength is non-monotonic, that is, adding premises can yield a stronger or weaker argument (see Box 2).

BOX 2 DEDUCTIVE VALIDITY AND NON-DEDUCTIVE STRENGTH
An argument is deductively valid if the truth of its premises guarantees the truth of its conclusion.
• Example: If a model is adequate for projecting X for the far future, then the model reliably indicates X for past and present. Model M does not reliably indicate X for past and present. Hence, M is not adequate for projecting X for the far future.
• Deductive validity is (a) incompatible with the conclusion being false while all premises are true, (b) an all-or-nothing matter, and (c) monotonic, that is, adding premises never turns a valid into an invalid argument.
An argument is non-deductively (e.g., inductively) strong if its premises provide a good reason for the truth of its conclusion.
• Example: Model M reliably indicates X and climate quantities upon which X depends for past and present. So probably, M is adequate for projecting X for the near future.
• Non-deductive (e.g., inductive) strength is (a) compatible with the conclusion being false even if all premises are true, (b) a matter of degree, and (c) non-monotonic, that is, adding premises can yield a stronger or weaker argument.
A deductively valid argument with true premises is called 'sound'; a non-deductively strong argument with true premises is usually called 'cogent'.
Hence, the evaluation of such arguments needs to assess whether all relevant information has been taken into account and the conditions for increased confidence are met. Whether this is indeed the case is often hard to decide. This makes the evaluation of non-deductive arguments difficult.

Non-deductive Inferences from Empirical Accuracy
Non-deductive arguments from fit of model results with data about past and present climate to a model's adequacy for long-term projections are confronted with a number of worries.
Besides issues related to the reliability of data and the fit to data, there is the general problem of induction going back to the philosopher David Hume. Hume's problem was roughly the worry about whether nature is uniform and things continue to go on in the same ways (Ref 57, 1.3.6). The specific worry in connection with arguments for a model's adequacy for long-term climate projections is whether the model's success in representing past and present climate provides good reasons to assume that it is adequate for projecting future climate. Here are two instances of this worry.
The first concerns calibration or tuning. The values of the parameters involved in parameterizations are often poorly constrained by an understanding of the underlying processes and need to be calibrated against observational data. Model calibration is unavoidable in climate modeling, and routinely done, but has so far rarely been discussed and documented systematically. 58,59 Model calibration consists in choosing a parameter configuration such that the model results better fit data about past and present climate. 60 The worry is that if the fit to data is due to calibration, then it does not provide a strong reason for the adequacy of the model for long-term projections. 3,4,9,49,61 One reason is that parameters are often not tuned to their 'correct' values; calibration makes it possible to compensate for structural errors by introducing compensating biases (e.g., in climate sensitivity and radiative forcing) during the calibration process. 39,62 The calibration of a model may thus guarantee success with respect to past and present climate irrespective of whether the model correctly accounts for those underlying processes that are relevant to the long-term evolution of the climate system. A second reason why model fit that is due to tuning does not provide a strong reason for a model's adequacy for projections is that the choice of parameters or model structure may be inconclusive given the data used in calibration. At least in some cases, there are different sets of parameter values that result in equally good fits with data. 63 Different models that agree in their performance on the dataset used in calibration can disagree with respect to out-of-sample applications and thus with respect to long-term projections. For example, calibrating a model to short GMST trends provides only weak constraints for projections of future climate. 24,64 Both reasons substantiate the worry that the performance of a model with respect to the future might not be similar to its performance with respect to the data to which the model is tuned. It is thus unclear whether model success can be extrapolated from past and present to the future. A priori we should not expect that it can, yet this is often done implicitly.
The familiar strategy to avoid this problem is to split the data and use one part of a dataset to calibrate a model and the other part to test it. 3 This has triggered a debate about whether data used in calibrating a model can nonetheless also be used in evaluating the model; that is, whether double-counting is legitimate or whether data used for evaluation should not be used in calibration and need in this sense to be use-novel (see Ref 65, for an overview). Climate scientists often declare double-counting illegitimate. d This is an overstatement if it implies that the accommodation of data through calibration provides no epistemic support for the model. Successful calibration can confirm a model at least to some extent, because it is far from trivial that a model can be successfully calibrated: climate models can be evaluated and calibrated on a large number of variables and scales, but calibration usually involves only a limited number of parameters (Ref 24, p. 174). Philosophers, on the other hand, have argued that from a Bayesian perspective, there is no difference between calibrating and confirming and thus no problem with using the same data to calibrate and confirm a model. For the Bayesian, calibration 'is simply the common practice of testing hypotheses against evidence' (Ref 66, p. 615). Frisch 24 has shown that this claim does not follow from the Bayesian formalism alone, and that in the case of complex climate models, fit to data not used in calibration is a better test of the adequacy of a model for long-term projections than fit to data used to calibrate the model. e Thus, to assess the strength of a non-deductive argument from model fit to the adequacy of the model for long-term projections, we need information about the extent to which the fit in question depended on tuning and whether tuned elements on which the fit depends have been tested out-of-sample (Ref 9, p. 245).
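The logic of data splitting can be illustrated with a toy calibration exercise; everything here (the synthetic record, the one-parameter trend model, the 70/30 split) is invented for illustration and is not a claim about how any actual climate model is tuned.

```python
import numpy as np

# Toy illustration (not a climate model): a single tunable parameter is
# calibrated against one part of a synthetic record, and the resulting fit
# is then evaluated on a held-out part, so that the evaluation data are
# use-novel in the sense discussed in the text.
rng = np.random.default_rng(0)
t = np.arange(100)
observed = 0.02 * t + rng.normal(0.0, 0.1, size=t.size)  # synthetic "record"

calib, heldout = t[:70], t[70:]  # calibrate on the early part, test on the rest

def model(trend, times):
    return trend * times  # one-parameter toy model

# Calibration: choose the trend that minimizes squared error on the
# calibration segment only.
candidates = np.linspace(0.0, 0.05, 501)
errors = [np.mean((model(c, calib) - observed[:70]) ** 2) for c in candidates]
tuned_trend = candidates[int(np.argmin(errors))]

# Evaluation: measure fit on the held-out segment, which played no role
# in choosing the parameter value.
heldout_rmse = np.sqrt(np.mean((model(tuned_trend, heldout) - observed[70:]) ** 2))
print(tuned_trend, heldout_rmse)
```

In this toy case a good held-out fit is unsurprising because the underlying process is stationary; the worry in the text is precisely that this assumption may fail when the climate system moves out of the observed regime.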
The strength of such an argument also depends on how independent the data used to calibrate the model are from the data used to test it, and on the extent to which data not used in explicit calibration guided model construction in some other way, for example, by influencing choices in model structure and design. The second worry is more fundamental. Even if climate model results fit data that were not used in calibration, this may not provide strong reasons for a model's adequacy for long-term projections. Long-term projections for high-forcing scenarios lie outside the range of boundary conditions previously observed in the instrumental record. At least prima facie, there is no reason to assume that successful performance under current boundary conditions is a good guide to successful performance under the future boundary conditions that describe high-forcing scenarios. We are confident that the physical principles on which climate models are based can be extrapolated beyond the range where they are evaluated. However, this is less clear for parameterizations that are empirically derived from observations mostly covering the narrow climate regime of the past century, and for the interaction between parameterizations and physical principles. For long-term projections, additional processes and feedbacks (e.g., methane emissions from thawing permafrost) may become relevant and take the system out-of-sample with respect to existing observations. If a model does not account for these processes and feedbacks, it could fit almost perfectly even to data about past and present climate not used for calibration but still be biased for projections. Success with respect to past and present climate alone is thus no assurance that the model will also be successful in projecting future climate (Ref 2, p. 828). 7,12,24,67 Some climate scientists conclude from this that it is hard to tell how relevant past data are, or that they are not relevant at all, for evaluating a model's adequacy for climate projections (Ref 68, p. 2146). This conclusion may be too hasty, but the considerations behind it show that to further strengthen an argument from model fit to the adequacy of the model for long-term projections, we need independent reason to assume that the model captures the relevant climate processes and feedbacks. f Worries related to calibration and missing feedbacks can also be mitigated by testing model results against data about paleoclimate epochs. 69-72 Paleoclimate states provide partly independent information not used in model development, and they were driven by forcings quite different from those of modern climate. However, the boundary conditions and the data are limited and derived from proxy data, which introduces large uncertainties; moreover, those types of data are increasingly used in the model development and evaluation process, which weakens the argument from an independent test.

FIT TO RESULTS OF OTHER MODELS: ROBUSTNESS
Climate model projections are too long-term to allow repeated direct comparison with data, but they can be compared with the projections of other models or model versions. This is what climate modelers extensively do in ensemble studies, because of uncertainty about how to represent the climate system in a way that leads to accurate projections of future climate.
In an ensemble study, each of several climate models or model versions is run with the same or similar initial and boundary conditions. There are two main types of such studies (Ref 67, p. 582). Perturbed physics (or parameter) ensemble studies employ different versions of the same model that differ in the values of their uncertain parameters; that is, they are effectively parameter sensitivity tests. In this way, the ensemble explores how climate projections are impacted by the uncertainty about the values that should be assigned to model parameters. Multimodel ensemble studies employ several models that differ in a number of ways, for example, in the number and complexity of processes included, parameterizations, spatiotemporal resolution, numerical methods and computing platforms. In this way, the ensemble explores how climate projections are impacted by structural and parametric uncertainty; that is, uncertainty about the form that modeling equations should take and how they should be solved computationally. The most ambitious multimodel ensemble study to date is the Coupled Model Intercomparison Project Phase 5 (CMIP5), which has collected results from about 60 models from nearly thirty modeling centres around the world. 73 Both types of ensemble studies often include a limited investigation of the impacts of initial condition uncertainty as well, by running multiple cases for the same experiments with different initial conditions. Ensembles help to deal with uncertainties either by producing robust projections or by providing estimates of uncertainty about future climate change. A model projection is robust if all or most models in the ensemble agree regarding the projection. If all models in an ensemble show more than a 4°C increase in GMST by 2100 when run under a certain forcing scenario, this projection is robust. In what follows, we focus on multimodel ensemble studies, but similar arguments can be made for perturbed physics ensembles.
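A perturbed-parameter ensemble can be sketched in miniature; the zero-dimensional energy-balance relation, the forcing value and the range of feedback parameters below are illustrative stand-ins, not the models or numbers used in actual ensemble studies.

```python
import numpy as np

# Minimal sketch of a perturbed-parameter ensemble. A zero-dimensional
# energy-balance relation dT = F / lambda stands in for a climate model;
# the feedback parameter lambda is the uncertain "physics" that the
# ensemble perturbs. All numbers are illustrative.
forcing = 3.7  # W m^-2, roughly a CO2 doubling (illustrative value)

rng = np.random.default_rng(42)
# Perturbed values of the feedback parameter (W m^-2 K^-1):
lambdas = rng.uniform(0.8, 1.6, size=20)

# Each ensemble member is the same model structure run with a different
# parameter value; the spread of outcomes reflects parameter uncertainty.
warming = forcing / lambdas  # equilibrium warming per member, in K

print(warming.min(), warming.max())  # spread across the ensemble
```

A multimodel ensemble would instead vary the model structure itself, which, as discussed below, cannot be reduced to sampling a well-defined space of numerical values.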
We discuss three inferences from the robustness of projections: to their likely truth, to the warranted confidence in the projections, and to the correctness of the underlying causal assumptions.

Inference to the Likely Truth of a Projection
An inference from robustness of projections to their likely truth is legitimate if we have reasons to assume that it is likely that at least one model in the multimodel ensemble correctly projects the quantity of interest within the specified error margin.
A premise to this effect could be justified in two different ways (Ref 67, p. 584-589). One is to cite the success of the models in simulating past and present climate to support the claim that it is likely that at least one simulation in the ensemble correctly projects the quantity of interest within the specified error margin. Considerations in the last section pointed out the limits of such an argument. A second way to justify the required premise refers to the construction of the models rather than to their performance. It argues that the multimodel ensemble samples enough of the current uncertainty about how to represent the climate system for the projection at issue that it is likely that at least one simulation correctly projects the quantity of interest within the specified error margin. The problem with this line of argument is that today's multimodel ensembles group together existing models and are thus 'ensembles of opportunity,' 'not designed to span an uncertainty range' (Ref 62, p. 2653). One of the main sources of uncertainty is the parameterization of subgrid processes such as cloud formation. Each state-of-the-art climate model includes some representation of clouds, but ensemble studies do not attempt to ensure that the ensemble as a whole adequately samples (or spans) current uncertainty about how clouds should be represented; the same holds for other subgrid processes (Ref 67, p. 585). Moreover, it is unclear how such a sampling could be achieved. In the case of parameter uncertainty, the space of possibilities in which plausible alternatives are to be identified is clear, since it is the space of numerical values (although it is computationally intractable due to its dimensionality); but in the case of structural uncertainty as addressed in multimodel ensemble studies, the space of possibilities is indeterminate, since it ranges over model structures (Ref 74, p. 216).
One may argue that in the presence of limited understanding and potential unknown unknowns it is fundamentally impossible to sample the uncertainty in how to build and calibrate a model since we do not really know what the uncertainty is.

Inference to the Warranted Confidence in a Projection
Similar difficulties beset sampling-based arguments from the robustness of model projections to the warranted confidence (or degree of belief) in these projections. Such arguments combine robustness considerations with additional criteria of adequacy. Suppose that S is the set of all theoretically possible models that meet basic criteria sufficiently well, such that each model in S has a significant chance of being adequate for projection P within some specified margin of error. For example, they simulate relevant aspects of past and present climate sufficiently well, include particular physical assumptions and have an appropriate spatiotemporal resolution. The models in S are currently considered the best theoretically possible models for P. In the absence of overriding evidence, the warranted confidence in P can then be identified with the fraction f of models in S whose simulations agree with respect to P within the specified error margin. Now, if the models in an ensemble constitute a random sample from S, then the fraction of models in the ensemble whose simulations agree with respect to P within the specified error margin provides a good estimator of f and thus of the warranted degree of confidence in P (cf. Ref 67). The problem with an argument along this line is that today's multimodel ensembles are not random samples from the set of all theoretically possible models that meet basic criteria of adequacy. It is unclear what the space of possible models that meet the required criteria actually is, and climate scientists do not select today's models from this space by randomized procedures. As ensembles of opportunity, today's ensembles are not the kind of sample from which statisticians would usefully estimate uncertainty, since their 'sampling is neither systematic nor random' (Ref 75, p. 2068).
Currently available multimodel ensembles such as CMIP5 are not designed to systematically explore the space of models that meet the required criteria, and the statistical interpretation of the ensemble is unclear. 76 All we have is a very limited space of practically possible models, and this space involves near duplicates, since models are used several times with only minor modifications. Even if these duplications were eliminated and the remaining model space randomly sampled, there would still be structural dependencies between the models. 54,77-79 The models are of course based on the same physical understanding and use the same basic equations, but they also partly use the same parameterizations, make similar simplifications, and use the same computational methods; in many cases, they even share large fractions of code. As a result, the models inevitably share common errors (e.g., in the simulation of the Inter-tropical Convergence Zone, ITCZ 80 ). Moreover, some climate processes that will significantly influence future climate change are not represented in any of today's models; some of these processes are recognized (e.g., the effect of methane hydrates), and perhaps some are not. Both points raise the worry that simulations from today's climate models might not so infrequently agree with respect to a projection, even though it is false (or biased), 6 because most models share similar deficiencies. Furthermore, the interdependency of models within today's small ensemble studies makes it likely that the models do not differ enough to provide a representative sample of the set of all theoretically possible models that meet the basic criteria of adequacy (whatever they exactly are). Given current uncertainty about how to represent the climate system adequately, the set of possible models that meet the basic criteria of adequacy is likely to include models that differ significantly from today's models.
If today's models differ from one another much less than random samples from S would, they are biased estimators of the fraction of models in S whose simulations agree with respect to P within some error margin, and thus of the warranted degree of confidence in the projection at issue (Ref 67, p. 594). The IPCC (Ref 1, ch. 12) acknowledges this and downgrades the probabilities based on the frequency of ensemble results. g The robustness of model projections cannot be directly translated into probabilities without making strong assumptions about the ensemble, about dependence and about criteria for adequate models. 84 But to the extent that the models of an ensemble are independent and thus differ in ways that are relevant for the projections at issue, their robustness warrants increased confidence in the projections and can thus figure as an additional premise in a non-deductive argument for a model's adequacy for those projections. Such an argument combines premises about the empirical accuracy of model results with premises about their robustness; that is, it combines premises about a model's success in reproducing (use-novel) data about past and present climate with premises about the agreement of its projections with the projections of other models of an ensemble that are equally successful in reproducing (use-novel) data about past and present climate. The strength of such a non-deductive argument depends on whether all relevant information has been taken into account. To assess the strength of the argument, we therefore need more information about the extent to which the models in the ensemble are independent from each other and thus differ in relevant ways, for example, regarding equations, parameterizations, parameter values, resolution, boundaries, and numerical coding. 85
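The fraction-based estimate discussed above can be made concrete in a few lines; the projection values and the error margin are invented, and the point of the sketch is precisely that the resulting fraction is only as meaningful as the assumption that the ensemble samples S representatively.

```python
import numpy as np

# Sketch of the fraction-based estimate discussed in the text: the share
# of ensemble members whose projection agrees with a reference value
# within an error margin. The numbers are invented. This fraction is a
# good estimator of the warranted confidence only if the ensemble were a
# random sample from the set S of adequate models, which ensembles of
# opportunity are not.
projections = np.array([3.9, 4.2, 4.4, 4.1, 3.6, 4.3, 4.0, 4.5])  # e.g., K by 2100
reference, margin = 4.0, 0.3

agree = np.abs(projections - reference) <= margin
fraction = agree.mean()  # naive "confidence" in the projection
print(fraction)
```

If the members share structural errors, this fraction can be high while every member is biased in the same direction, which is exactly the worry raised above.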

Inference to the Correctness of Causal Assumptions
Even if a sampling-based argument from robustness were sound, it would not prove that the models generate the robust projections because the models capture the relevant climate processes sufficiently well for this purpose. However, robustness of model results (combined with their empirical accuracy) is often seen as making it likely, or at least increasing our confidence, that the processes that determine these results are encapsulated sufficiently well in the models (Ref 3, p. 979-980; Ref 4, p. 160).
A framework for arguments to this effect is provided by Weisberg's conception of robustness analysis (Ref 86; Ref 35), which involves three steps. First, a set of models is examined in order to determine whether they all lead to the same result, the robust property. Second, the models are analyzed for a common structure that generates the robust property, and the two are linked together in a robust theorem: Ceteris paribus, if [common causal structure] obtains, then [robust property] will obtain. For climate models that approximately agree in their GMST outcomes, we roughly get: Ceteris paribus, if [greenhouse gases are causally related to the energy budget of the Earth] obtains, then [for increasing greenhouse gases, increasing GMST] will obtain (Ref 3, p. 980). In the third step, a stability analysis of the robust theorem is conducted in order to determine the limits of robustness and spell out the ceteris paribus clause. It needs to be determined how frequently the common structure shows up within 'a sufficiently heterogeneous set of models' (Ref 86, p. 739). And it needs to be investigated what defeats the core structure giving rise to the robust property, which is done by varying the values of parameters or by adding or removing mechanistic features from the models in multimodel ensemble studies. According to this proposal, we can establish a robust theorem by running simulations of a target system using models that differ systematically but share common causal assumptions. If the property shows up in most of the simulations, then the robust theorem is established. One can then confirm the common causal assumptions of the models by observing the relevant property (Ref 86, p. 739).
Thus, if the climate models within a multimodel ensemble are sufficiently heterogeneous and differ systematically, but share common causal assumptions about the climate system (e.g., a common causal structure of greenhouse gas causation), lead to the same robust property (e.g., GMST model outcomes for past and present approximately agree) and the robust property obtains in the target system (GMST has approximately increased as described by the models), then we can conclude that it is very likely that the common causal assumptions of the models are approximately true (i.e., greenhouse gas emissions are the relevant cause of global warming).
As already seen, the assumption that the models are sufficiently heterogeneous and differ systematically does not seem to be true for today's multimodel ensembles. More importantly, even if we can argue along the suggested line for the attribution of global warming up to now, we cannot argue in this way for the claim that the models of an ensemble represent sufficiently well the climate processes that are relevant for long-term climate projections. The reason is that confirming the common causal assumptions of models requires observation-based data for the robust property, but we only have such data for robust results about past and present climate. h Whether a model captures sufficiently well the climate processes that are relevant for the projections of interest must instead be shown with reference to background knowledge.

BEYOND FIT: BACKGROUND KNOWLEDGE
Climate scientists often stress that our confidence in climate model projections comes not only from the empirical accuracy and robustness of simulation results, but also from the foundation of the underlying models in accepted theoretical principles and in the physical, chemical or biological understanding of the processes behind the results (Ref 2, p. 745). 7,11,16,68 Process understanding or, more generally, coherence with background knowledge is indeed a key consideration in determining the adequacy of a climate model for specific long-term projections. The implications of model fit for long-term projections are limited; but if we can argue independently of model fit that the climate processes relevant for a particular purpose are well observed and likely to be understood beyond the range observed, or are even based on fundamental principles like conservation laws, and further that the model represents those processes sufficiently well, then the model's success in representing aspects of past and present climate provides good reasons to assume that it is adequate for projecting the quantity of interest in a future climate.
Background knowledge can be used to decide which processes a model needs to represent, and coherence with background theories and assumptions that embody a theoretical understanding of the processes at issue provides reasons that are independent of model fit. Claims about how well a climate model is supported by independently confirmed background theories and assumptions can thus figure as further premises in a non-deductive argument for the model's adequacy for the purpose of projecting future climate. But the supporting role of background knowledge is limited by the need for empirical parameterizations and by the epistemic opacity i of complex models.

Problem of Empirical Parameterization
It greatly contributes to our confidence in model projections that climate models are based on fundamental physical principles (e.g., conservation of mass and energy) and that many of their equations are derived from well-confirmed physical theories (e.g., fluid dynamics). However, the extent to which theory can guide climate model evaluation (and construction) is limited. Besides well-accepted physical principles and approximations to well-understood physics (in cases where a physical theory is available but it would be computationally too costly to incorporate it fully into the model), climate models also contain empirical parameterizations of unresolved processes for which no general theory exists. Parameterization is even essentially involved in representations of processes for which we possess basic equations, since the discretization of the equations requires grid-scale-dependent parameterization of unresolved subgrid processes (e.g., a parameterization of atmospheric convection). While physical principles can certainly be extrapolated beyond the range where the model is tested, this is less clear for parameterizations and their interactions with physical principles. 7 However, parameterizations like those for atmospheric convection can be independently tested: by comparing them with specifically targeted observation-based data covering different regions and times (e.g., from aircraft campaigns and from ground data measuring interesting weather situations), by conducting experiments, by varying parameter values, by incorporating the parameterizations into other types of models (e.g., weather models, which have daily verification) to test them in a different context and on a different time scale, and by basing them on, or evaluating them against, high-resolution models that explicitly resolve the processes involved (see Ref 88, for an overview).
Some of these strategies raise questions similar to those already discussed, but the hope is that submodels of individual processes are easier to evaluate because these processes only work on certain scales, are linked to a relatively small number of physical, chemical or biological processes, can be more directly constrained by observations and allow for experimental testing under controlled conditions. 11 Nonetheless, if parameterizations are empirically derived from data covering the last century, this independent support does not ensure that the parameterizations are inductively stable and can be extrapolated beyond the range where they were evaluated. For this, they need to be based on (or at least loosely inspired by) physical, chemical or biological principles, which requires better observations and an improved understanding of the (mostly subgrid-scale) processes in question (e.g., for cloud formation, of how aerosols affect clouds and of the physics surrounding precipitation and clouds). Improving this understanding is particularly important but also difficult for processes that may be barely detectable in the current climate state and within the limited time we have observed it (e.g., rapid calving of a large ice sheet, or the collapse of a marine ecosystem). In other words, parameterizing the unknown is both a dangerous and a hopeless exercise, yet leaving it out will result in all models being robustly wrong.
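The worry about extrapolating empirically derived parameterizations can be illustrated with a deliberately simple example; the 'true' process, the observed regime and the linear fit are all invented, and stand in for a parameterization fitted over the narrow regime in which a process has been observed.

```python
import numpy as np

# Toy illustration of inductive instability: an empirical
# "parameterization" is fitted over the narrow regime in which a process
# was observed and then extrapolated. The underlying process here is
# deliberately nonlinear, so the linear fit that looks fine in-sample
# degrades out of sample. All quantities are invented for illustration.
def process(x):
    return x + 0.3 * x**2  # the "true" (unknown) process

x_obs = np.linspace(0.0, 1.0, 50)  # narrow observed regime
y_obs = process(x_obs)

# Empirical parameterization: a straight line fitted to the observed regime.
slope, intercept = np.polyfit(x_obs, y_obs, 1)

def parameterization(x):
    return slope * x + intercept

in_sample_err = np.max(np.abs(parameterization(x_obs) - process(x_obs)))
out_sample_err = abs(parameterization(3.0) - process(3.0))  # outside the regime
print(in_sample_err, out_sample_err)
```

The small in-sample error gives no hint of the much larger out-of-sample error, which is why a physical or theoretical grounding of parameterizations matters for extrapolation.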

Problem of Epistemic Opacity
Support of a model by background knowledge in general is also limited because, in the case of complex climate models, the understanding of how model components (physical principles, parameterizations, initial and boundary conditions) contribute to the various metrics of performance of the model is limited. As a result, it is difficult or even impossible to say where the successes and failures of climate models in reproducing data come from, and what they imply for a particular projection. Lenhard and Winsberg 91 argue that two features of complex climate models hinder or even prevent such an analytic understanding of the models. First, the models are holistic in the sense that the effect of components and submodels can often not be tested in isolation: they are highly interactive, and the complexity of their interaction makes it often impossible to independently assess the merits or shortcomings of each component, submodel, process or parameterization. Second, the models contain 'kludges'; that is, unprincipled and complex pieces of programming that appear to work but are ill understood. Thus, even though climate models are based on well-established physical principles and theories that allow at least some understanding of the underlying climate processes, it is often impossible to fully understand the details of how model behavior emerges from the interaction of different model components (Ref 24, p. 178). This lack of analytic understanding of the models (and in fact of the target system as well) makes it difficult to assess whether a model's success in representing certain aspects of past and present climate is a good reason to assume that the model is also adequate for projecting certain aspects of future climate. As a consequence, there is often no consensus about which instances of fit with observation-based data provide support for a projection.
There are strategies to deal with this problem. The first assumes that the analytic opacity of climate models does not preclude evaluating representations of individual processes in the context of the full model as well as in isolation (Ref 2, p. 754). At least in certain cases, it seems possible, by comparing individual and combined runs, to achieve an understanding of the effect of components or sub-models within the model and to attribute success or failure to certain components or sub-models (Ref 12, p. 967). Unfortunately, a parameterization that performs better when tested individually does not necessarily improve the performance of the model as a whole. If a model is biased with respect to aerosol concentration or humidity, then an improved parameterization of cloud formation may lead to a poorer performance of the model as a whole. This raises the question whether model construction should give priority to the optimization of model fit of the whole model, or strive for a more faithful representation of climate processes even though this might in the near term lead to a deterioration in model performance.
The second strategy is to pursue an ensemble approach and experiment on models and explore model hierarchies in order to advance the understanding of the behavior of the model and its target system. Experimenting with models means investigating the contribution of different processes and parameters in producing the (simulated) phenomena of interest by eliminating or adding particular processes and varying the values of parameters and comparing the simulations produced with and without these interventions. For example, one could include a realistic topography in a climate model in order to learn whether this makes much difference to the projection one is interested in. 92 Held 22 suggests that climate scientists should more extensively explore hierarchies of models that range from simplified models that represent some key causal factors in highly idealized ways to more and more complex models which include representations of additional causal factors and/or more realistic representations of previously included factors. Studying such hierarchies and tracing certain behaviors across different types of models might provide a better understanding of the interaction of different model components and sub-models.
The third strategy is to evaluate climate model results against use-novel data that have not been used in calibrating the model, in the cases where such data exist at all. As Frisch 24 argues, in the case of complex climate models that allow only a limited analytic understanding, fit with use-novel data provides stronger reasons for a model's adequacy for projecting future climate than fit with data used in calibration. If a model is epistemically opaque and we do not know in detail how the more principled components of the model interact with parameterizations to produce the model's output, then successful simulation of use-novel data can be an indicator that the model adequately represents the relevant climate processes and does not posit spurious relations among data introduced through tuning.
A fourth and related strategy is to exploit an ensemble of models to find strong relationships between well-observed quantities and the projection of interest, so-called emergent constraints, that help identify which instances of fit matter (Ref 2, p. 826; Ref 11). However, for emergent constraints, coherence with background knowledge remains important in order to avoid interpreting spurious relationships. 93 It is also critical for use-novel data, since performance with respect to any amount of data from the past and present alone cannot, as discussed, show that the model captures all relevant processes and feedbacks that will significantly influence future climate change.
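The emergent-constraint idea can be sketched with synthetic numbers; the ensemble, the linear relationship and the observed value below are invented, and a real application would have to argue from background knowledge that the cross-ensemble relationship is physical rather than spurious.

```python
import numpy as np

# Sketch of the emergent-constraint idea: across an (invented) ensemble,
# a well-observed present-day quantity correlates with the projected
# quantity of interest; the observed value of the former then narrows
# the range of the latter. Whether such a relationship is physical or
# spurious is exactly what background knowledge must adjudicate.
rng = np.random.default_rng(1)
n = 15  # ensemble members (illustrative)

observable = rng.uniform(0.5, 1.5, size=n)  # present-day quantity per member
projection = 2.0 + 1.8 * observable + rng.normal(0, 0.1, size=n)  # K (toy link)

# Emergent relationship: regression of projection on observable across members.
slope, intercept = np.polyfit(observable, projection, 1)

observed_value = 1.0  # the real-world observation of the quantity
constrained = slope * observed_value + intercept  # constrained projection
print(constrained)
```

Here the relationship is built into the synthetic data by construction; with real ensembles, a correlation of similar strength could arise from shared model errors, which is why the text insists on coherence with background knowledge.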

CONCLUSION
From our argumentation, a conceptual framework for discussing the adequacy of a model or family of models for long-term climate projections emerges (see Box 3). The evaluation of models needs to be specific to the purpose they are intended to serve. Models are idealized representations, and it depends on the purpose whether the idealizations are adequate. To be adequate for long-term projections, a climate model needs to represent sufficiently well those processes that significantly shape the long-term evolution of the climate characteristics of interest. An argument for the hypothesis that a model or family of models is adequate for a long-term climate projection can draw on the assessment of models or their results with respect to three dimensions: the empirical accuracy of model results (i.e., their fit to data), the robustness of model results (i.e., their agreement with the results of other models or model versions), and the coherence of model components with background knowledge. A certain degree of empirical accuracy, robustness and coherence with background knowledge is necessary for a model's adequacy for long-term projections, but they neither individually nor collectively provide sufficient conditions for the adequacy of a model for such projections in a strict logical sense. The empirical accuracy of model results may to some degree be due to tuning rather than to the model's representation of the processes that shape past and present climate (although climate scientists would deny both that the degree of tuning undermines our confidence in model projections and that models are tuned to simulate a specific future projection). Even if the model represents these processes well enough, it may lack representations of feedbacks that will significantly influence future climate change at the space and time scales of interest. Robustness considerations do not solve this second difficulty, and they mitigate the first only to the extent that the models in question are sufficiently heterogeneous.
Background knowledge that supports model components provides independent reasons to assume that the model represents sufficiently well the processes relevant for the long-term climate projection at issue. But the role of background theories is limited because of empirical parameterizations, and support of model components by background knowledge more generally faces the problem of the epistemic opacity of complex models. While such background knowledge is implicitly considered when interpreting results and assessing confidence, and often provides a strong argument for model results, 11 its contribution is hard to express in numbers and is not as easily conveyed as 'beauty-contest-like' instances of fit. As a consequence, unfortunately, background knowledge is rarely discussed explicitly in scientific publications, and instances of model fit are often shown without an argument for how they support a conclusion.
Under certain conditions, empirical accuracy, robustness and coherence with background knowledge enhance our confidence in model projections and thus provide premises for a non-deductive argument for the claim that a model is adequate for those projections. To assess the strength of such an argument, more information is needed: for example, about the extent to which the fit to data depended on tuning, about the extent to which the models that agree in their projections are independent of each other and thus differ in relevant ways, and about the extent to which parameterizations are based on models that explicitly resolve the processes involved.
The suggested framework not only provides a basis for assessing evaluation arguments but also indicates how such arguments can be strengthened by future research. First, one can work towards better performance of models with respect to each of the three dimensions; that is, better fit of model results to data and to the results of other models, and better coherence of model components with further developed background knowledge. Since resources are limited and improved model components do not necessarily lead to better model fit, this may require deciding whether priority should be given to the optimization of model fit or to a more faithful representation of climate processes. Second, one can try to increase the significance that performance with respect to each dimension has for a model's adequacy for long-term projections. The significance of empirical accuracy can be increased by splitting data and testing the model against data that have not been used to develop and calibrate it, and by arguing independently that the model captures the relevant climate processes well enough. Only then does a model's success with respect to past and present climate provide reasons to assume that it will also be successful with respect to future climate. The significance of robustness can be enhanced by increasing model independence and diversity. The significance of coherence of model components with background knowledge can be enhanced by improving model transparency (i.e., the understanding of how model components contribute to model fit) through the exploration of model ensembles and model hierarchies.
The suggested framework can be applied to purposes other than long-term projection (e.g., short-term prediction, detection, or attribution) and in areas outside climate science. We hypothesize that the three dimensions with respect to which the adequacy of models can be evaluated are the same for different purposes and for different complex empirical domains in which predictions cannot directly be evaluated, but that the weight given to the different dimensions varies with the purpose. For example, whereas climate models are based on well-confirmed physical theories, there are no generally accepted basic theories in the social sciences. Even if this is harmless for short-term predictions, it hampers long-term predictions. Model fit may be decisive for short-term predictions that can repeatedly be tested against data, but support by background theories that embody an understanding of the underlying processes is important for long-term predictions, since we have reason to assume that extrapolating trends is not reliable and that calibrated parameterizations may no longer be valid on larger timescales. The challenge for scientists is, on the one hand, to borrow strengths from the three dimensions in a way that is optimal for their case, in order to maximize confidence in the projections made. On the other hand, it is to spell out explicitly how the fundamental theories, the assumptions, the empirical accuracy, the degree of robustness and the elements of background knowledge complement each other, in order to document transparently the line of argument supporting the projections and to convince those who rely on them and make decisions based on them.
NOTES
a These notions of verification and falsification play a key role in the philosophy of logical empiricism 13 and its criticism by critical rationalists. 14
In philosophy of science, this requirement is often expressed in terms of relevant similarities between the model system that the model equations describe and the target system. It then requires that the model system be, for the purpose of interest, in relevant respects and to a sufficient degree similar to the target system 26,27,34,35 (for a critical voice see Ref 36).
e See section 'Beyond Fit' below. On a charitable reading, this may also be what climate scientists who argue against double-counting (cf. previous endnote) want to claim.
f This is one reason why the slogan that a climate model must 'get the right effect for the right reasons' is potentially misleading. If 'getting the right effect' means reproducing past and present climate sufficiently well, then the requirement is too weak. A model can reproduce past and present climate for the right reason but still miss processes and feedbacks that are relevant for long-term projections. A second reason is that if 'for the right reason' implies that the model resolves all relevant processes (rather than incorporating them via parameterization), then the requirement is too demanding.
g An alternative approach regards the spread of ensemble results as a guide to possibility rather than probability. 81 Betz 82,83 and Katzav 28 have argued that a focus on probabilistic projection is misguided and that models ought to be used to show that certain scenarios are real possibilities.
h The same problem besets other argument forms, for example, inference to the best explanation.
Here we conclude that the causal assumptions shared by the models of an ensemble are approximately true because this is the best explanation for the fact that the results of the models agree and fit reasonably well with observational data if the amount of data is much larger than the degrees of freedom in the model. Such an argument faces the additional problem that there may be other equally good explanations of model fit: for example, the fit with data may be due to tuning, and the fit with results of other models may reflect a social convergence process among the institutions building models. 52 For a different line of criticism, see Ref 87. i The term 'epistemic opacity' is due to Humphreys. 89,90 While Humphreys focuses on limits in our understanding of the details of the computational process leading from the abstract model underlying a simulation to the output, we are primarily concerned with limits in our understanding of how the components of the underlying model interact and how the model relates to the target. This is also how other authors writing about climate modeling use the term (see Ref 24, p. 177; Ref 91, p. 258).