On predicting climate under climate change

Can today’s global climate model ensembles characterize the 21st century climate in their own ‘model-worlds’? This question is at the heart of how we design and interpret climate model experiments for both science and policy support. Using a low-dimensional nonlinear system that exhibits behaviour similar to that of the atmosphere and ocean, we explore the implications of ensemble size and two methods of constructing climatic distributions, for the quantification of a model’s climate. Small ensembles are shown to be misleading in non-stationary conditions analogous to externally forced climate change, and sometimes also in stationary conditions which reflect the case of an unforced climate. These results show that ensembles of several hundred members may be required to characterize a model’s climate and inform robust statements about the relative roles of different sources of climate prediction uncertainty.


Introduction
Multi-decadal climate model projections are key inputs to the Intergovernmental Panel on Climate Change (IPCC) assessment reports and to climate policy negotiations (IPCC 2007). Uncertainty in these projections has been categorized as forcing, initial condition (IC) and model uncertainty (Cox andStephenson 2007, Stainforth et al 2007). These represent, respectively, the consequences of uncertainty in future greenhouse gas concentrations, in the details of the current state of the system, and in our knowledge of and abilities to represent the Earth system in a computer model. Cox and Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Stephenson (2007) argued that on multi-decadal timescales IC uncertainty was relatively unimportant; an argument supported by subsequent analysis of the ensemble produced by the third Climate Model Intercomparison Project (CMIP3) (Hawkins and Sutton 2009). This implies that relatively small IC ensembles are appropriate for multi-decadal climate change prediction. Here, using results from a low-dimensional nonlinear system that exhibits climate-like behaviour, we demonstrate that this conclusion is ill-founded in both non-stationary conditions analogous to externally forced climate change, and also in stationary conditions analogous to an unforced climate. The results show that IC ensembles of several hundred members may be required to characterize a model's climate and inform robust statements about the relative role of different sources of climate prediction uncertainty. The IC ensembles in CMIP3 and CMIP5 are likely too small to reliably quantify their climate distributions and are therefore unable to characterize future climate even within their own 'model-worlds'.
Our focus is solely on IC uncertainty because there are fewer conceptual barriers in its quantification and it provides a lower bound on the uncertainty which must be assigned to the probabilities generated by multi-uncertainty assessments, such as the probabilistic climate predictions (e.g. Tebaldi et al 2005, Murphy et al 2009 used in many impact studies and in adaptation planning (e.g. Groves et al 2008, Manning et al 2009, DEFRA 2012. IC uncertainty provides a means of quantifying probabilities within a given model and a given forcing scenario. Its importance and the risks of its under-sampling are well understood within the meteorological (Epstein 1969, Palmer 1993, Collins 2002 and nonlinear systems (Grebogi et al 1987, Chu 1999 literatures but it is often taken to be relatively unimportant for decadal and multi-decadal climate prediction. If this source of uncertainty is not well quantified, however, then multi-model based probabilistic climate predictions will inevitably inherit the errors in the distributions representative of each model, implying that their probabilities are not reliable for use as such in societal planning (DEFRA 2012). Consider, for instance, the development of a river flood protection scheme using probabilistic predictions for the exceedance of specific rainfall thresholds. A lack of awareness of errors resulting from IC under-sampling could lead to overconfidence resulting in either over-adaptation, with significant additional costs, or under-adaptation leaving residual and unexpected risk. While there are other reasons to doubt the reliability of the probabilities generated in today's probabilistic predictions for impact assessments (Oreskes et al 2010, Frigg et al 2013, and there remain conceptual challenges in the assessment of model uncertainty (Stainforth et al 2007), nevertheless any future means of generating probabilities from multi-model ensembles will be flawed if the inputs do not adequately represent the underlying models. Thus for both the provision of information to society via climate services (WMO 2011) and for modelling experiments designed to help us better understand the climate system, the size of IC ensembles must be sufficient to adequately quantify the future climatic distributions within the model. This is true irrespective of the relative contribution of forcing, model and IC uncertainty and has significant implications for the assignment of computational resources in climate modelling experiments.
Herein we assess what size of IC ensemble may be adequate to evaluate the climate distributions within a particular model. As a context for the experiment we begin, in section 2, with a discussion of some conceptual issues in modelling climate, drawing on research in the nonlinear systems community to inform the role of IC uncertainty in climate prediction. With this as background section 2 then describes the layout of the rest of the letter.
2. Defining climate: conceptual approaches to modelling the climate system The atmosphere-ocean system is a nonlinear system (Rial et al 2004) whose long term behaviour has been conceptually associated with an attractor (Fraedrich 1986, Lorenz 1991, Sahay and Sreenivasan 1996, Palmer 1999. If the boundary conditions were fixed, then the 'climate' of the system could be considered as the variable distributions on this attractor. In reality the climate system is not stationary but subject to natural and anthropogenic forcing variations. Such a definition is therefore not directly relevant to the real world and can only be quantified for a model. Furthermore, and irrespective of the stationarity or non-stationarity of the system, when making multi-decadal climate predictions we do not want to consider all possible states of the system, only those consistent with current and recent observations. Consider, for instance, the current patterns of ocean circulation and land surface vegetation. These influence the climate of today, not just the weather, and constrain the climatic distributions of the future. This is the case despite the fact that we know that alternative patterns, and therefore alternative future distributions, would be consistent with the physical/biological processes and boundary conditions of the climate system. For the purposes of climate prediction, therefore, it is most useful to view climate as a distribution conditioned on our knowledge of the system's state at some point in time (Stainforth et al 2007). This knowledge is of course never perfect so within a model, 'climate' can be taken as the distribution resulting from an IC ensemble constructed using uncertainty in the system's state on a chosen reference date.
This approach raises conceptual issues. It is conceivable that multiple attractors coexist and the system is intransitive; that is to say the system can only pass through a subset of the possible system states-the subset being determined by the initial state (Lorenz 1970). For some intransitive nonlinear systems, a very small change in the initial state of the system does not just limit predictability due to chaos but can also change the attractor (and therefore climate) to which the system evolves (McDonald et al 1985). If dynamical system features of this nature, such as riddled basins of attraction (Viana et al 2009), were found in climate models then the model distributions from an IC ensemble could be dependent on the finest details of the chosen ICs (Lorenz 1968(Lorenz , 1976. The above interpretation of climate would then be ill-defined. We do not know if this is the case for Global Climate Models (GCMs), let alone the real climate system, so our starting point is to demonstrate a process for resolving this question by evaluating the distributions over long time periods in an analogous nonlinear system. Only distributions of single variables are considered because this approach could be equally well applied to variables from the very large state space of a GCM, where mapping the entire attractor is infeasible.
Taking this nonlinear dynamical systems perspective, in section 3 we introduce a climate-like system that, in section 4, we use to explore the process of quantifying climate in a climate prediction context. In section 4.1, we consider how to evaluate the distributions of individual variables of the system over long time periods, assuming no changes in forcing. The value of IC ensembles for this purpose is assessed. Next, in section 4.2, we explore how to use ensembles to quantify climate over shorter timeframes, conditioned on knowledge of the system state at some point in time. Finally, in section 4.3 we apply the same methodology under changing forcing conditions; providing an analogy with 21st century climate change. Section 5 contains a discussion of the implications of our results for climate prediction and the interpretation of climate model output in policy decisions.

The Lorenz-84/Stommel-61 coupled model
Our system consists of the Lorenz-84 model with sinusoidal seasonal forcing (Lorenz 1984, 1990, Broer et al 2002, 2003, coupled to the Stommel-61 ocean box model (Stommel 1961, Roebber 1995, Van Veen et al 2001-see the supplementary materials (available at stacks.iop.org/ERL/8/ 034021/mmedia). By comparison with GCMs which have >10 5 dimensional state spaces, this is a very simple climate model. Yet it is informative about the climate system because it is nonlinear, chaotic and includes high frequency variations representative of the atmosphere (X, Y, Z variables in figure 1) and low frequency variations representative of the ocean (T, S variables in figure 1). As pointed out by Smith (2002), 'it is unreasonable to expect solutions to low-dimensional problems to generalize to million dimensional spaces, (but) so too it is unlikely that problems identified in the simplified models will vanish in operational models'. In this case the variables within the model are representative of large scale circulation patterns in the atmosphere and ocean (see the supplementary materials), providing a direct physical link to climate. The model is therefore not only a tool with which it is practical to run extremely large ensembles but it is also relevant as an illustration of the behaviour which might credibly be expected in the climate system and the challenges likely to be encountered in studying the climate prediction problem with GCMs.
The evolution of model trajectories in the atmosphere and ocean components are shown in figure 1. The atmospheric behaviour is strongly influenced by the state of the ocean; both exhibit variability on multi-decadal timescales (see the supplementary materials and figure S5 available at stacks.iop. org/ERL/8/034021/mmedia). In our analysis we focus on the ocean variables for simplicity of presentation.

Determining the model climate from single trajectories
Distributions of the ocean variables are first determined from a 100 000 year simulation of the model (figures 2(a) and (b)). These distributions are then compared to distributions built up from an alternative trajectory over shorter simulation periods (figures 2(c) and (d)). The comparison is made using the Kolmogorov-Smirnov (KS) statistic, D (Massey 1951, Miller 1956, Press et al 1992 which is defined as the maximum difference in probability between two cumulative distribution functions; D = 0 if the distributions are identical and D = 1 if they are entirely different. In figure 2, D therefore represents the maximum error in the probability of exceedance of any threshold for the chosen variable, when using distributions constructed over shorter timeframes as opposed to the one constructed over 100 000 years. Thus, the greater D, the greater the error in quantifying the variable's probability distribution; if D = 0.2 then the probability of exceeding some threshold will be more or less than the estimated value by 0.2. While the two simulations do eventually converge towards similar distributions, figures 2(c) and (d) show that reliable estimates of the model's long term climate require a single simulation of tens of thousands of years. For timescales less than ∼1000 years, the distributions are substantially different (D > 0.1, median value) from the long term distribution. Even after 30 000 years there are distinguishable differences 5 (D > 0.030 (S), D > 0.018 (T); median values). Can an ensemble of simulations quantify the model's climate more quickly? Yes. Distributions constructed from the combination of three (red) or ten (blue) member IC ensembles are more likely to better represent the model's long term climate for any length of simulation. However, such ensembles are not more computationally efficient; the distributions built from ten (three) member ensembles are not more reliable than the distribution from a single simulation of ten (three) times the length and therefore the same computational requirement. Uncertainty in these comparisons is quantified by repeating the process one hundred times with different ICs centred on the same starting conditions (i.e. the same region of the model's state space-see the supplementary materials). That all the ensemble members converge to the same distribution suggests that this system is transitive as opposed to intransitive; that is to say it will  (d) show the difference, quantified by the KS statistic, between the (a) and (b) distributions and those generated using shorter simulations, for S (c) and T (d). The distributions are constructed from either single simulations (black) or by combining data up to the given time point from multiple simulations with different ICs; red-three simulations, blue-ten simulations. Uncertainty is quantified by repeating the process 100 times using different ICs distributed around the same location on the attractor (see figure 1 and the supplementary materials available at stacks. iop.org/ERL/8/034021/mmedia); error bars do not therefore represent uncertainty in our assessment of D but rather the distribution of values from which any particular simulation/ensemble may be considered a draw. Lines connect the median values at each time point; errors are the 10th-90th percentiles. pass through all of the possible system states over infinite time (Lorenz 1970). Simulations based on ICs centred on very different starting conditions (macroscopic initial condition uncertainty; Stainforth et al 2007) show the same result. Thus the model's climate appears not to be sensitive to the finest details of the chosen ICs and distributions constructed from sufficiently large IC ensembles are likely to be robust.

Determining the model climate from IC ensembles
Taking future climate as a distribution conditioned on our uncertain knowledge of the system's current state (Stainforth et al 2007), requires such IC ensembles to quantify this distribution within any given model. A 10 000 member IC ensemble is therefore run from a specific location on the attractor (figure 1); individual members are prescribed ICs randomly selected from Gaussian distributions around this location, as might result from observational uncertainty in the real-world climate (see the supplementary materials). The subsequent distributions of the ocean variables broaden rapidly, approaching the long term distribution after about 100 years (figures 3(a) and (b)). Over shorter periods the information in the ICs is evident in the distributions, not just in single trajectories; i.e. the probability of different states of the system (its climate), conditioned on an uncertain knowledge of the current state, is different from that experienced over very long timeframes. IC ensembles enable us to quantify these transient climate distributions. An important question for global climate modelling experiments is whether today's ensembles are sufficiently large to make such quantifications.
In contemporary climate modelling studies there may be only a single simulation or a small IC ensemble; a minimum of three is specified in CMIP5 (WCRP 2011). Climate variable distributions are therefore constructed over a time period; typically 30 years (WMO 1996, Burroughs 2003). An assumption is being made here that the distribution over time is representative of the distribution of possible states at an instant; we describe this as the kairodic assumption due to its similarity, in general terms, with the widely studied ergodic assumption 6 . In this model, even under stationary forcing conditions (constant parameters), such an assumption can be misleading. Using the 10 000 member IC ensemble distribution as the reference climatic distribution for each point in time, figures 4(a) and (b) show that small ensembles (≤9) using this approach provide substantially different results (D > 0.1 (T), D > 0.2 (S); median values). This implies that climate model ensembles of this size should not be interpreted probabilistically. Maintaining the kairodic assumption in larger ensembles better quantifies the distribution but can lead to convergence on the wrong distribution; an effect particularly evident in these results in the S variable at 30 years ( figure 4(b), towards D = 0.18). The instantaneous ensemble distributions converge towards the correct distribution, by design, but the convergence is slower because there are a factor of 30 fewer constituent data points. For accurate results instantaneous interrogation of a large IC ensemble is required, but if only small ensembles are available using a 30 year time-slice can sometimes improve the quantification of the distribution (e.g. compare the two methods for a 30 member ensemble in figure 4(a)). This is   figure 3. Plots show the KS statistic between a 10 000 member IC ensemble distribution constructed at 30 (blue) and 60 (red) years into a simulation with different methods of distribution construction. The right side of each subplot uses distributions constructed at the given time point (referred to as 'instantaneous' distributions), with varying ensemble size. The left side uses distributions constructed over a thirty year period around the given time point ('kairodic' distributions). Uncertainty is quantified by repeating the process 100 times using different ICs distributed around the same location on the attractor (see figure 1 and the supplementary materials available at stacks.iop.org/ERL/8/034021/mmedia). Dots represent the median; errors are the 10th-90th percentiles.
not the case, however, if the distribution itself is changing nonlinearly, as seen in the S variable at 30 years ( figure 4(b)).
Although from different simulations, in a transitive system we expect the instantaneous ensemble distributions to effectively be samples of the reference 10 000 ensemble member distributions at the same point in time. This appears to be the case as the right-hand side of the panels in figure 4 display values of D close to the statistically expected values given the size of the ensembles. In terms of quantifying the distribution of errors in probability for a given ensemble size, they provide a benchmark for the kairodic distributions shown on the left-hand side of the panels in figure 4. For example, the 3 member distribution using the kairodic assumption contains 90 constituent data points and can be contrasted with the 90 member instantaneous distribution; D = 0.09 (median value) for both variables in the latter case while D ≥ 0.16 for T and D ≥ 0.27 for S (median values) in the former. For both T and S the maximum errors in probability are greater in the kairodic distributions; by 0.07 and 0.18 respectively (median values). All kairodic distributions are different from the 10 000 member instantaneous distributions at the 95% significance level, where the critical value is D α = 0.14 for the 3 member kairodic and 90 member instantaneous distributions (see the supplementary materials for critical values at other ensemble sizes).

Determining the model climate under climate change
Under transient climate change we can no longer use an attractor to describe the system's behaviour but the use of IC ensembles remains informative (see Carvalho et al 2007 andChekroun et al 2011 for discussion of attractors in time-dependent dynamical systems). To represent this situation here, a linear trend is applied to one of the parameters in the model (F m , the mean value of a forcing term related to north-south temperature contrast). This takes the system from the attractor associated with F m = 7 (figures 1(a) and (b)) to that associated with F m = 8 (figures 1(c) and (d)) over a period of 100 years. Again a 10 000 member IC ensemble, conditioned on the same starting state, is used to quantify the changing climate over a 200 year simulation (figures 3(c) and (d)). Here, as in the stationary case above, the distributions derived using the kairodic assumption converge on distributions different from the 10 000 member instantaneous distributions (D > 0.13 in all cases). In most cases the differences are greater than in the stationary case, because the ensemble distributions determined using this method now incorporate data representative of the climate under a range of forcing conditions. This occurs despite the forcing change in this experiment being linear; a consequence of the model climate distributions responding nonlinearly. The instantaneous distributions from large (≥300 member) ensembles are more accurate than in the kairodic case and similarly accurate to that found in the stationary situation (D < 0.05; median value). With ensembles of thirty members similarly accurate/inaccurate results are achieved whichever interpretational method is applied despite the instantaneous distributions containing a factor of 30 fewer data points. Single simulations and small (≤9) ensembles again perform very poorly.
The above results represent a situation in which the 'forcing' is towards a more constrained attractor with lower variability on long timescales (see figures 1(c) and (d)). We can investigate the consequences of moving towards a less constrained attractor, with greater variability and regime behaviour present on long timescales, by inverting the trend and applying a linear decrease in F m from F m = 8 to 7 over 100 years. This scenario is more physically realistic since the polar amplification mechanism is expected to result in a decrease in the mean equator-to-pole temperature difference over the 21st century (Masson-Delmotte et al 2006). Our results (figures 3(e), (f), 4(e) and (f)), however, are not predictive; only illustrative of possible behaviour and the issues in ensemble interpretation. The same issues as in the previous case are clear but now, perhaps surprisingly, the kairodic approach gets worse as time progresses; for T at 60 years the instantaneous distribution from even a small, 30 member ensemble is likely to be closer to the model climate than a 900 member ensemble distribution derived using the kairodic approach ( figure 4(e)). This can be explained by the sudden change in distribution seen at about this point in time in figure 3(e). The existence of such sudden changes in distribution in GCMs would be of relevance to both science and policy but cannot be distinguished from internal variability if only small ensembles are available.
These findings do not change if the IC ensembles originate in a different region of the model state space (see supplementary figures S1 and S2 available at stacks.iop.org/ ERL/8/034021/mmedia).

Concluding remarks
On multi-decadal timescales we are increasingly interested in the tails of climatic distributions (Seneviratne et al 2012, Weitzman 2011; the unlikely but often most damaging events. If climate models are to be used to say anything about the probability of such events then the ensemble size and method of construction of climate distributions must be appropriate. Although this analysis has used a simple, low-dimensional climate model, wide distributions of regional responses have been found within GCM experiments using IC ensembles with ≥40 members (Stainforth et al 2007, Deser et al 2012 which suggests that similar results might plausibly be expected in more complicated models. If they were not found in such models it might be reasonable to conclude that small IC ensembles are sufficient to quantify their behaviour but it could also be taken as raising a concern that such models are overly constrained in their potential behaviour. GCMs are, of course, much more expensive to run than the model presented here but the impetus to provide probabilistic forecasts to guide climate change adaptation policy and to understand model behaviour to guide their future development, provides ample justification for many hundred member ensemble experiments. These results are likely to be particularly relevant at local and regional scales, where the influence of nonlinearities and modes of internal variability are often more pronounced, and therefore have consequences for impacts assessments and adaptation planning. At larger scales, however, the implications for quantifying the importance of low probability high impact events could have implications for mitigation policy. There is clearly a need to balance resources between the exploration of model uncertainty, the exploration of IC uncertainty and the desire to run models containing the widest possible range of processes at the highest feasible resolution given current computational facilities. These results imply that large ensembles of lower resolution, lower complexity GCMs, have the potential to be more informative for certain questions than small ensembles of high resolution, high complexity models. High complexity, high resolution models undoubtedly have benefits for some questions relating to both model development and their use as 'laboratories' (Cookson 2008) for understanding specific aspects of the climate system. But if the ensembles are too small to reliably quantify the changing climate within the model, they must also be considered inadequate for providing information about real-world probabilities; an aspect important both for providing climate services (WMO 2011) and for understanding many aspects of the system. The balance in computational resource allocation therefore depends on the aim of the ensemble experiment. These results suggest that existing judgements about the relative importance of different sources of uncertainty are founded on ensemble experiments which are unlikely to accurately reflect at least one of those sources, namely initial conditions, and are therefore unreliable as a basis for ensemble design. Finally, reliance on the kairodic assumption risks providing misleading probabilities for both science and policy, and, since complex models could feasibly show similar behaviour to the idealized model investigated herein, without larger IC ensembles model experiments cannot be assumed to reliably quantify the details of modelled climate under climate change.