Quantifying Sources of Uncertainty in Projections of Future Climate*

A simple statistical model is used to partition uncertainty from different sources in projections of future climate from multimodel ensembles. Three major sources of uncertainty are considered: the choice of climate model, the choice of emissions scenario, and the internal variability of the modeled climate system. The relative contributions of these sources are quantified for mid- and late-twenty-first-century climate projections, using data from 23 coupled atmosphere–ocean general circulation models obtained from phase 3 of the Coupled Model Intercomparison Project (CMIP3). Similar investigations have been carried out recently by other authors but within a statistical framework for which the unbalanced nature of the data and the small number (three) of scenarios involved are potentially problematic. Here, a Bayesian analysis is used to overcome these difficulties. Global and regional analyses of surface air temperature and precipitation are performed. It is found that the relative contributions to uncertainty depend on the climate variable considered, as well as the region and time horizon. As expected, the uncertainty due to the choice of emissions scenario becomes more important toward the end of the twenty-first century. However, for midcentury temperature, model internal variability makes a large contribution in high-latitude regions. For midcentury precipitation, model internal variability is even more important, and this persists in some regions into the late century. Implications for the design of climate model experiments are discussed.


Introduction
Currently, most projections of future global and regional climate are derived from the outputs of coupled atmosphere-ocean general circulation models (GCMs). Projections have historically been conditioned upon ''scenarios'' of greenhouse gas emissions, each associated with a particular ''storyline'' of economic and societal development worldwide throughout the twenty-first century (Nakićenović and Swart 2000). Although the most recent set of climate model runs has replaced these storylines with a set of representative concentration pathways (RCPs) that are not associated explicitly with socioeconomic storylines (van Vuuren et al. 2011), from a user perspective their role is similar. In particular, the choice of RCP, like the choice of storyline, will usually be a source of uncertainty in climate projections in the sense that different RCPs or storylines will lead to different projections. From here on, we refer to this uncertainty as ''scenario uncertainty.'' Other sources of uncertainty include the choice of GCM (different GCMs yield different projections for the same emissions scenario) and the choice of initial conditions for the GCM runs (different initial conditions yield different results). The extent to which results are dependent upon initial conditions can be regarded as a measure of internal variability within the modeled climate system.
The need to characterize uncertainty in projections of future climate is widely accepted, and this requires the use of multiple models, scenarios, and runs to explore the future climate response. However, it is expensive and time consuming to produce projections using a GCM, and it is therefore useful to identify which are the dominant sources of uncertainty in order to understand where to focus resources. If, for example, internal variability is relatively unimportant as a source of uncertainty, then it may be better to use resources to consider alternative scenarios and GCMs, rather than to produce many runs from the same GCM-scenario combination.

Denotes Open Access content.
* Supplemental information related to this paper is available at the Journals Online website: http://dx.doi.org/10.1175/JCLI-D-14-00265.s1.
The problem of partitioning uncertainty in climate projections has been considered by several authors, notably Hawkins and Sutton (2009), who characterized projection uncertainty using heuristic measures of variability in ensembles of projections. Yip et al. (2011) took a more formal approach, carrying out an analysis of variance (ANOVA) to partition variability into contributions from different sources. ANOVA is a standard statistical technique for this task and, for balanced data, the decomposition of variability is unique and uncontroversial. In the current context a simple way to create balance is to stipulate that there are equal numbers of runs at each GCM-scenario combination. However, when data are unbalanced it is not clear how best to implement traditional ANOVA since the usual decomposition of variability is not unique (Searle et al. 2006, section 2.3b) so that it can be difficult to identify which sources are dominant.
In this study, we use data from phase 3 of the World Climate Research Programme (WCRP) Coupled Model Intercomparison Project (CMIP3) multimodel dataset (Meehl et al. 2007), downloaded via the Program for Climate Model Diagnosis and Intercomparison (PCMDI) website (http://www-pcmdi.llnl.gov/ipcc/about_ipcc.php). Yip et al. (2011) used a subset of these data to partition uncertainty in projections of global temperature, based on a classical ANOVA. However, the full dataset is highly unbalanced. Yip et al. (2011) dealt with this by considering only the seven GCMs that have more than one run for each scenario and by choosing exactly two runs for each GCM-scenario combination: a lot of relevant information was therefore discarded in their analysis.
We take a different approach to the problems caused by lack of balance in the data. We use a random-effects ANOVA (Searle et al. 2006; Gelman 2005). The effects (deviations from an overall mean) of individual GCM (G), scenario (S), and GCM-scenario (GS) combinations are treated as being randomly sampled from probability distributions representing respective superpopulations of effects, an idea mentioned in section 2c of Yip et al. (2011). One possible interpretation of such a superpopulation is that it represents the collection of all potential GCMs that could be constructed using combinations of model components and modeling decisions that are consistent with our current understanding of the climate system [see Stephenson et al. (2012) for more discussion of this point]. The superpopulation standard deviation (SD) σ_G of this distribution summarizes variability of effects in the superpopulation of GCMs. We can also estimate the finite-population standard deviation s_G: that is, an SD summarizing variability across the particular GCMs in the ensemble at hand. Section 3b gives more details. Both types of SD are useful: the finite-population SDs summarize variability in the effects of the particular GCMs that have been included in the ensemble under consideration and are thus analogous to the estimates of uncertainty from a classical ANOVA approach (Yip et al. 2011; Hingray et al. 2007; Räisänen 2001). The superpopulation SDs represent variability among the wider population of (actual and notional) GCMs from which the ensemble at hand is considered to be drawn.
In a similar spirit, the random effect associated with a scenario represents variability in a notional population from which the three scenarios represented in the ensemble are considered to have been drawn: the use of a random effect acknowledges that alternative scenarios are possible. With only three scenarios available in the ensemble at hand, however, it is hard to estimate superpopulation quantities with any great precision; thus, we expect, for example, that estimation of the superpopulation SD σ_S here will be subject to much greater uncertainty than that of the finite-population SD s_S.
In section 2, we describe the data and the climate change indices we derive from them. In section 3, we outline some existing approaches to partitioning uncertainty in climate projections and we describe a random-effects ANOVA model. In section 4, we fit this model to indices of mid- and late-twenty-first-century global temperature change. In section 5, we repeat this analysis at a regional scale for each of 22 regions, considering changes both in surface temperature and in precipitation. In section 6, we discuss some implications of our findings for the design of climate model experiments. Computer code to implement this methodology is available online (at http://www.homepages.ucl.ac.uk/~ucakpjn/).

CMIP3 data
Since 1992, CMIP has coordinated several sets of climate model runs from modeling centers around the globe. Although the most recent set is phase 5 of the Coupled Model Intercomparison Project (CMIP5), released in 2013, in the work reported here we analyze the data from CMIP3 to provide direct comparability with the work of Yip et al. (2011). The generic issues are exactly the same for any ensemble of GCM runs.
The CMIP3 multimodel dataset provides twenty-first-century climate projections from 24 GCMs under three future emissions scenarios developed by the Intergovernmental Panel on Climate Change (IPCC) Special Report on Emissions Scenarios (SRES) (Nakićenović and Swart 2000). These scenarios are generally referred to as A1B, A2, and B1 and may be interpreted as relating to low (B1), moderate (A1B), and high (A2) emissions of greenhouse gases over the twenty-first century. Table 1 gives the number of runs available (for surface air temperature) for each GCM-scenario combination. A total of 145 runs are available.
The numbers of runs for each scenario reflect choices made by individual modeling groups. Some GCMs have multiple runs per scenario; some have none. As noted above, this lack of balance complicates analysis of the data: the presence of zero cells in Table 1, corresponding to GCMs that provided no runs for a particular scenario, is particularly problematic in this respect. To deal with this in their analysis, Yip et al. (2011) included only seven GCMs (GCMs 2, 17-21, and 24 in Table 1) and chose exactly two runs for each GCM-scenario combination. In section 3b, we consider how to account for the lack of balance and the sparseness of the data in order to utilize all the CMIP3 data.
We consider two climate variables, (surface air) temperature (in °C) and precipitation (converted to mm day⁻¹), because they are the most frequently studied (Giorgi and Francisco 2000a; Giorgi and Mearns 2002; Tebaldi and Knutti 2007). We define indices of change for each variable. For temperature, the indices are the changes in mean temperature in the periods 2020-49 (midcentury) and 2069-98 (late century), each relative to the mean temperature in 1970-99. We use 2098 rather than 2099 because some GCMs did not provide runs for the whole of 2099. The precipitation indices are defined similarly except that we use the percentage (rather than absolute) change from the baseline 1970-99 period. In almost all cases, to ensure that each of our change indices can be regarded as if it was derived from a single long run, we take the 1970-99 data from the twentieth-century climate simulation (20C3M) model runs that were used to initialize the corresponding twenty-first-century runs (see http://www-pcmdi.llnl.gov/ipcc/time_correspondence_summary.htm). There are two exceptions. The first is GCM 15, for which there was an error in the forcings for the 20C3M run used to initialize the scenario runs (here, we have used the corrected 20C3M run, noting that, according to the URL given above, the climate of the year 2000 is very similar in the two runs). The other exception is GCM 24, for which the twenty-first-century runs were not initiated from a twentieth-century run. For this reason (i.e., to avoid using these data inappropriately), we will not include any data from GCM 24 in our analyses. We do, however, include values from GCM 24 in Fig. 1, because it has some relevance to comparisons with the results of Yip et al. (2011) made at the end of section 3a.
In section 4, we consider indices relating to changes in global mean temperature. In section 5, we carry out regional-scale analyses separately for each of the 22 regions considered by Giorgi and Mearns (2002). Giorgi and Francisco (2000b) provide the definitions of 20 of these regions in terms of latitude and longitude. The definitions of the two remaining regions, northern Australia and southern Australia, are given by the IPCC Data Distribution Centre (http://www.ipcc-data.org/sres/scatter_plots/scatterplots_region.html).
The raw data are monthly averages generated on a coarse GCM-specific spatial grid. Following Giorgi and Mearns (2002), for each GCM the data for a given month are spatially interpolated onto a common 0.5° grid using the bicubic spline interpolation function interp() in the R library akima (Gebhardt et al. 2013), before being averaged over each region of interest with weights proportional to the cosine of the latitude of each grid point location. The monthly regional averages are then converted into averages over the time periods of interest, from which the respective indices of change are derived. Figure 1 summarizes the results of this procedure when applied to surface air temperature over the entire globe. For a given GCM the values under scenario B1 are generally lower than under scenarios A1B and A2. The exception is GCM 24, for which there are two unusually large values for 2020-49: as discussed above, the reason for this is that the twenty-first-century runs for this GCM were not initialized using the corresponding twentieth-century runs and are therefore not directly comparable. Overall, the modeled temperatures for scenarios A1B and A2 are similar for 2020-49, but A2 tends to produce larger values than A1B for 2069-98.
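The interpolation and area-weighting steps can be sketched as follows. This is a minimal Python stand-in (using scipy's RectBivariateSpline in place of R's akima::interp, which is what the paper actually uses), with a hypothetical coarse grid and temperature field chosen purely for illustration:

```python
import numpy as np
from scipy.interpolate import RectBivariateSpline

# Hypothetical coarse GCM grid and temperature field (all values assumed).
lat_c = np.linspace(-87.5, 87.5, 36)           # coarse latitudes (deg)
lon_c = np.linspace(0.0, 357.5, 144)           # coarse longitudes (deg)
field = 15.0 + 10.0 * np.cos(np.deg2rad(lat_c))[:, None] * np.ones((36, 144))

# Bicubic spline interpolation onto a common 0.5-deg grid
# (a scipy stand-in for the akima::interp call used in the paper).
spline = RectBivariateSpline(lat_c, lon_c, field, kx=3, ky=3)
lat_f = np.arange(-60.0, 60.0, 0.5)            # fine grid over a "region"
lon_f = np.arange(100.0, 150.0, 0.5)
fine = spline(lat_f, lon_f)                    # shape (240, 100)

# Regional mean weighted by cos(latitude), as in Giorgi and Mearns (2002),
# so equal-angle grid cells contribute in proportion to their true area.
w = np.cos(np.deg2rad(lat_f))
regional_mean = np.average(fine, axis=0, weights=w).mean()
```

On this smooth test field the cosine-weighted regional mean comes out near 23.5, lower than an unweighted mean would be, because the weighting correctly downweights the smaller high-latitude grid cells.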

Statistical models for partitioning variability
In the following, we let Y_ijk, i = 1, ..., n_G; j = 1, ..., n_S; k = 1, ..., K_ij, be an index of change for GCM i, scenario j, and run k. For the CMIP3 data, n_G = 23, n_S = 3, and K_ij varies between 0 and 8 (see Table 1).

a. A fixed-effects ANOVA model
Hawkins and Sutton (2009) proposed some heuristic ways of partitioning variability. Yip et al. (2011) put that work on a more formal footing using a statistical model: namely, a two-way fixed-effects ANOVA,

Y_ijk = μ + α_i + β_j + γ_ij + ε_ijk,   (1)

where μ is the overall mean change in the index, over all GCMs and scenarios; α_i is an adjustment for GCM i; β_j is an adjustment for scenario j; γ_ij is a scenario-specific additional adjustment for GCM i; and the error terms ε_ijk are independent identically distributed random variables with mean 0 and variance σ², representing residual variability between runs: this can be considered as variability that is internal to the modeled system. The terms {γ_ij} are referred to as GCM-scenario interaction effects and measure how variability over GCMs changes with scenario or, equivalently, how the variation between scenarios differs between GCMs: as, for example, with the mid-twenty-first-century projections of global temperature in Fig. 1, where one GCM seems to rank scenario B1 differently from most of the others, as discussed above.
Consider a balanced design with K_ij = K > 1 for all i, j (i.e., K runs for each GCM-scenario combination) and constraints Σ_i α_i = Σ_j β_j = 0, Σ_j γ_ij = 0 for i = 1, ..., n_G, and Σ_i γ_ij = 0 for j = 1, ..., n_S to avoid parameter redundancy. If K = 1 it is not possible to estimate both the interaction effects and the error variance σ² without making extra assumptions about the form of the interaction effects.
We define the overall mean Ȳ··· = (1/(n_G n_S K)) Σ_i Σ_j Σ_k Y_ijk. Using a similar ''bar-dot'' notation, where dots are used to indicate suffices over which averaging has taken place, GCM-specific means are Ȳ_i··, i = 1, ..., n_G; scenario-specific means are Ȳ_·j·, j = 1, ..., n_S; and means for specific GCM-scenario combinations are Ȳ_ij·, i = 1, ..., n_G, j = 1, ..., n_S. The least squares estimates of the quantities in (1) are then μ̂ = Ȳ···, α̂_i = Ȳ_i·· − Ȳ···, β̂_j = Ȳ_·j· − Ȳ···, γ̂_ij = Ȳ_ij· − Ȳ_i·· − Ȳ_·j· + Ȳ···, and ε̂_ijk = Y_ijk − Ȳ_ij· (Yip et al. 2011). The quantities used by Yip et al. (2011) to quantify uncertainty attributable to model, scenario, model-scenario interaction, and internal (between-run) variability are M = n_G⁻¹ Σ_i α̂_i², S = n_S⁻¹ Σ_j β̂_j², I = (n_G n_S)⁻¹ Σ_i Σ_j γ̂_ij², and V = (n_G n_S K)⁻¹ Σ_i Σ_j Σ_k ε̂_ijk², respectively. These quantities are proportional to the usual mean squares in an ANOVA framework. However, standard formulas for the expected values of these mean squares show that interpretation of these quantities is not as straightforward as it first appears. For example, to compare the relative contributions of scenario choice and internal variability to the total uncertainty under model (1), a natural measure is [n_S⁻¹ Σ_j β_j²]/σ², and it is tempting to estimate this as S/V. However, the results of Searle et al. (2006, section 4.3) show that the expected value of S is (n_S − 1)(n_G n_S K)⁻¹ σ² + n_S⁻¹ Σ_{j=1}^{n_S} β_j², and the expected value of V is (K − 1)K⁻¹ σ². It is clear that the ratio S/V will tend to overestimate the quantity of interest, therefore, because of the presence of (n_S − 1)(n_G n_S K)⁻¹ σ² in the expected value of S and the factor (K − 1)K⁻¹ in the expected value of V. Bias in the estimation of n_S⁻¹ Σ_j β_j² by S may be small if σ² is small. However, V may very substantially underestimate σ² if K is small (e.g., by a factor of 2 if K = 2), an issue noted by Déqué et al. (2007). Similar issues arise in the comparison of other measures of variability from model (1).
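The downward bias of V is easy to verify by simulation. The sketch below (parameter values are assumptions) draws balanced data with K = 2 runs per cell; the GCM, scenario, and interaction effects cancel within each cell, so only the run-level noise matters for V:

```python
import numpy as np

rng = np.random.default_rng(0)
n_g, n_s, K, sigma = 23, 3, 2, 1.0     # assumed values; K = 2 runs per cell

# V averages the squared deviations of each run from its GCM-scenario cell
# mean.  Any cell-level effects are removed by the centering, so drawing
# pure noise suffices to study V's expectation.
v_vals = []
for _ in range(2000):
    y = rng.normal(0.0, sigma, size=(n_g, n_s, K))
    resid = y - y.mean(axis=2, keepdims=True)   # deviations from cell means
    v_vals.append(np.mean(resid ** 2))
v_hat = np.mean(v_vals)
# E(V) = (K - 1) K^{-1} sigma^2, so with K = 2 the average of v_hat sits
# near 0.5 rather than sigma^2 = 1: V underestimates sigma^2 by a factor
# of two, as stated in the text.
```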
When the design is unbalanced, interpretation of M, S, I, and V is even less clear. This motivates use of a related framework under which these problems can be addressed without discarding data. We achieve this using a random-effects ANOVA model (see section 3b), defining explicitly the quantities to be inferred from data.
In Fig. 2, we compare the results from the methodology of Yip et al. (2011, their Figs. 3a and 4b) with the corresponding plots obtained using our approach described below. In reproducing the Yip et al. (2011) plots we have used a slightly different baseline period (1970-99 rather than 1971-2000), because for some of the models the SRES experiment data start in 2000; further, since they included runs from GCM 24, we have done the same here, selecting the two runs that did not result in obviously anomalous behavior. Although there are some differences in the relative contributions from internal variation, in this case σ² is small enough that its presence in the expected value of S has little impact. The most interesting difference is that in Yip et al. (2011) (and Hawkins and Sutton 2009) scenario uncertainty becomes more important than GCM uncertainty at approximately 2050, whereas under our approach this does not occur until approximately 2070. These differences are the result of using different statistical approaches and different data and will depend on the behavior of the datasets involved. However, the plots in the top row of Fig. 2 suggest that, in this particular case, the overall effect of including data from more models, as well as more runs from existing models, is to increase estimates of model uncertainty and (toward the end of the twenty-first century when the differences between scenarios become apparent) scenario uncertainty.

b. A random-effects ANOVA model
In the fixed-effects ANOVA model (1), standard analysis methods focus on the specific effects fa i g, fb j g, and fg ij g for the ensemble under consideration. By contrast, in a random-effects version of the same model (see, e.g., Searle et al. 2006) individual effects are not of direct interest but rather are considered to be sampled from some larger population, and it is the variability within this larger population that is the focus of the analysis.
Equation (1) still applies, now written as

Y_ijk = μ + α_i + β_j + γ_ij + ε_ijk,   i = 1, ..., n_G, j = 1, ..., n_S, k = 1, ..., K_ij,   (2)

but now α_i is considered to be a normally distributed random variable with mean zero and variance σ_G² [we write α_i ~ N(0, σ_G²), with similar notation for other random variables]; β_j ~ N(0, σ_S²); γ_ij ~ N(0, σ_GS²); ε_ijk ~ N(0, σ_R²); and all random variables are assumed to be independent.
The terms in (2) have the same interpretation as those in (1) but now, instead of focusing on the individual effects (the α, β, and γ), we are interested in (the relative magnitudes of) the superpopulation SDs σ_G, σ_S, σ_GS, and σ_R as representing a partitioning of uncertainty that acknowledges the potential for additional GCMs and emissions scenarios that are not represented in the data available. We also examine the relative magnitudes of the finite-population SDs s_G, s_S, s_GS, and s_R, where, for example, s_G is defined via s_G² = (1/22) Σ_{i=1}^{23} (α_i − ᾱ)² with ᾱ = (1/23) Σ_{i=1}^{23} α_i, as these summarize variabilities within the ensemble at hand. The presence of the error terms ε_ijk in (2) means that the effects appearing in the finite-population SDs cannot be observed exactly; they can only be estimated, and the estimates have some uncertainty associated with them. However, α_i, for example, may be estimated precisely if there are many runs under GCM i. For a discussion of superpopulation and finite-population effects, see Gelman and Hill (2007, section 21.2).
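The distinction between the two SDs can be illustrated numerically: the finite-population SD s_G is simply the sample SD of the 23 realized effects, and it varies from ensemble to ensemble around the superpopulation value σ_G. A small sketch, with an assumed value of σ_G:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_G = 0.4        # assumed superpopulation SD of the GCM effects
n_g = 23

# One "ensemble": draw 23 GCM effects and compute the finite-population SD
# s_G, i.e. the sample SD of the realized effects (divisor n_g - 1 = 22).
alpha = rng.normal(0.0, sigma_G, size=n_g)
s_G = alpha.std(ddof=1)

# Across many hypothetical ensembles s_G^2 is centered on sigma_G^2, but
# any single ensemble's s_G differs from sigma_G: the two SDs answer
# different questions (this ensemble vs. the wider population of GCMs).
s2 = [rng.normal(0.0, sigma_G, size=n_g).std(ddof=1) ** 2 for _ in range(20000)]
mean_s2 = np.mean(s2)
```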

c. Statistical inference for random-effects ANOVA
In situations where there are reasonably large numbers of groups (corresponding to GCMs and scenarios here) but where the data are unbalanced, random-effects models such as (2) are often fitted using restricted maximum likelihood (REML) estimation (Patterson and Thompson 1971; Harville 1977), with standard errors and confidence intervals computed using simulation (e.g., a 95% confidence interval for a particular parameter such as σ_G is obtained from the 2.5% and 97.5% sample percentiles of fits from simulated datasets). We illustrate this procedure below, using the R library lme4 (Bates et al. 2014) to perform REML estimation.
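The simulation-based interval construction can be sketched as follows. For simplicity this Python sketch uses a balanced one-way layout and a moment (ANOVA) estimator as a stand-in for REML, and holds the residual SD fixed when resampling; all parameter values are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigma_g_moment(y):
    """Moment (ANOVA) estimator of the group-level SD for balanced one-way
    data y of shape (n_groups, K): a simple stand-in for REML here."""
    n_groups, K = y.shape
    msb = K * y.mean(axis=1).var(ddof=1)                   # between-group MS
    msw = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum() / (n_groups * (K - 1))
    return np.sqrt(max((msb - msw) / K, 0.0))

# "Observed" data under the model, then a parametric bootstrap: refit on
# datasets simulated from the fitted model and read off percentile limits.
n_groups, K, sg, sr = 23, 4, 0.5, 0.3
y_obs = rng.normal(0, sg, (n_groups, 1)) + rng.normal(0, sr, (n_groups, K))
est = sigma_g_moment(y_obs)
boot = []
for _ in range(1000):
    # The residual SD sr is held fixed when resampling, for simplicity.
    y_b = rng.normal(0, est, (n_groups, 1)) + rng.normal(0, sr, (n_groups, K))
    boot.append(sigma_g_moment(y_b))
lo, hi = np.percentile(boot, [2.5, 97.5])    # 95% interval for sigma_G
```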
However, with only three scenarios there is little information in the data about σ_S. In such situations Gilmour and Goos (2009) argue against the use of REML because σ_S can be underestimated, estimates of zero may be produced, and estimates of uncertainty tend to underestimate the true uncertainty. In such cases, a Bayesian analysis may be preferable.
In Bayesian inference (Bernardo and Smith 2003) the parameter vector θ of a model, here θ = (μ, σ_G, σ_S, σ_GS, σ_R), is treated as a random variable. A prior distribution p(θ), representing uncertainty about θ in the absence of the data y, is specified. Let L(θ; y) denote the likelihood function: that is, the probability density of y as a function of θ. Then inference is based on the posterior distribution, which is proportional to L(θ; y)p(θ). In all but the simplest problems, an explicit expression for the posterior distribution is not available. However, samples from it may be obtained using Markov chain Monte Carlo (MCMC) techniques (Gilks et al. 1996; Gelman and Rubin 1992); by drawing sufficiently large samples, we may characterize any aspect of the posterior distribution (e.g., the mean, median, and percentiles) to any desired degree of accuracy. In particular, a 95% credible interval for any quantity of interest is determined by the 2.5% and 97.5% percentiles of its posterior distribution. In the work reported below, MCMC sampling is carried out using WinBUGS (Lunn et al. 2000) via the R library arm (Gelman et al. 2010).
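As an illustration of the MCMC approach, the sketch below runs a minimal random-walk Metropolis sampler (a toy stand-in for the WinBUGS machinery) for a simple normal model with a half-Cauchy prior on the standard deviation; the synthetic data and tuning constants are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.2, 0.8, size=50)   # synthetic data; all constants assumed
A = 1.0                             # half-Cauchy scale for the prior on sigma

def log_post(mu, sigma):
    """Log posterior for y ~ N(mu, sigma^2): flat prior on mu,
    half-Cauchy(A) prior on sigma (unnormalized)."""
    if sigma <= 0:
        return -np.inf
    loglik = -len(y) * np.log(sigma) - np.sum((y - mu) ** 2) / (2 * sigma ** 2)
    logprior = -np.log(1.0 + (sigma / A) ** 2)
    return loglik + logprior

# Random-walk Metropolis: accept each joint proposal with probability
# min(1, posterior ratio), then summarize the retained draws.
mu, sigma, lp = 0.0, 1.0, log_post(0.0, 1.0)
draws = []
for _ in range(20000):
    mu_p = mu + 0.2 * rng.normal()
    sigma_p = sigma + 0.2 * rng.normal()
    lp_p = log_post(mu_p, sigma_p)
    if np.log(rng.uniform()) < lp_p - lp:
        mu, sigma, lp = mu_p, sigma_p, lp_p
    draws.append((mu, sigma))
mu_s, sigma_s = np.array(draws[5000:]).T      # discard burn-in
ci = np.percentile(mu_s, [2.5, 97.5])         # 95% credible interval for mu
```

The 95% credible interval is read directly from the 2.5% and 97.5% percentiles of the retained draws, exactly as described in the text.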
Operationally, perhaps the most obvious difference between Bayesian and other methods of statistical inference is the incorporation of the prior distribution p(θ). Ideally, this represents the analyst's uncertainty about the model parameters θ in the absence of any data; often a noninformative prior is used in an attempt to ensure that the results are not influenced by what are seen as the analyst's subjective judgments. In situations such as that considered here, however, the data themselves may provide relatively little information about some model parameters such as σ_S. In such cases, it may be worth specifying a weakly informative prior (Gelman 2006) that encapsulates some basic constraints on the parameters but that otherwise allows the likelihood component to dominate the posterior distribution. If the prior is approximately flat over the range of θ values that are consistent with the data, then the results from a Bayesian analysis will be dominated by the contribution of the likelihood to the posterior in this range, so that the influence of the prior can be considered unimportant: in such situations, we expect good agreement between REML estimates and the mode of the posterior distribution [i.e., the value of θ for which the posterior density p(θ | y) is maximized]. In a Bayesian setting, however, estimation is usually based on the mean of the posterior distribution rather than its mode [the reasons for this are set out in Gilmour and Goos (2009)] and, if the posterior is highly skewed, its mean and mode can be very different. This in itself can account for differences between Bayesian and frequentist analyses. In the current context, the posterior distribution of σ_S is highly positively skewed (see Figs. 3 and 4), so the posterior mean is much greater than the posterior mode and a Bayesian analysis is probably preferable, as explained in Gilmour and Goos (2009).
Gelman (2006) considers what kind of prior distribution should be placed on a variance component σ when the available data provide only limited information about it: in the present context, this is the situation for σ_S because data are available for just three emissions scenarios. He shows that a weakly informative prior is necessary, to downweight the posterior probability of physically implausible values. Gelman (2006) and Polson and Scott (2012) demonstrate that a half-Cauchy(A) prior, with probability density function (pdf)

p(σ) = 2 / {π A [1 + (σ/A)²]},   σ > 0,   (3)

is more appropriate than the commonly used uniform, log-uniform, or inverse-gamma priors. For a suitable value of A the half-Cauchy prior encourages the posterior distribution for σ to place high probability on a realistic range. However, the prior has a ''heavy tail'' (i.e., the pdf decays slowly as σ increases), and this prevents the prior from having an undue influence if the data suggest a larger than anticipated value of σ. This behavior is demonstrated in the plots relating to σ_S in Figs. 3 and 4.
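A quick numerical check of this prior: scipy's halfcauchy distribution with scale set to A has exactly the half-Cauchy(A) density, so the tail probabilities underlying the scale choices used in section 4 (A = 0.5°C for midcentury, A = 1°C for late century) can be computed directly:

```python
from scipy.stats import halfcauchy

# scipy's halfcauchy with scale=A has density 2 / (pi * A * (1 + (x/A)^2)),
# i.e. the half-Cauchy(A) prior.  Tail probabilities behind the choices
# A = 0.5 (sigma > 2.5 deg C) and A = 1 (sigma > 5 deg C):
p_mid = halfcauchy.sf(2.5, scale=0.5)    # P(sigma > 2.5) under A = 0.5
p_late = halfcauchy.sf(5.0, scale=1.0)   # P(sigma > 5)   under A = 1
# Both equal 1 - (2/pi) * arctan(5) ~= 0.126, matching the "small (~0.13)
# prior probability" quoted in section 4.
```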

Global temperature change
In this section, we present the results of REML and Bayesian analyses of the global temperature data using model (2). For the latter, we use independent half-Cauchy priors for the variance components, choosing the scale parameter A to provide weak prior information. For the 2020-49 indices we use A = 0.5°C and for 2069-98 we use A = 1°C, based on the following reasoning: Suppose that two of the sources (GCM, scenario, and simulation run) of uncertainty are kept constant while the other is varied. It would be very unlikely that the resulting projections would have a range of more than, say, 10°C in their temperature projections by the midcentury and more than, say, 20°C by the end of the century. Under model (2), as a rule of thumb the range of the random effects from each source of uncertainty can be considered to correspond very roughly to four standard deviations (or 4σ); thus, for each source of uncertainty, we judge that the corresponding random-effects standard deviation σ does not exceed 2.5°C for the midcentury projections and 5°C for end-of-century projections. The chosen values of A place small (≈0.13) prior probability on these eventualities. We use the same prior distribution for all the σ parameters. We use a noninformative N(0, 10⁶) prior for the mean parameter μ. For most parameters, the data are highly informative: this can be inferred from the fact that the priors are virtually flat over intervals for which the posteriors are nonnegligible. This is not the case for σ_S: the half-Cauchy prior provides weak information to prevent very unrealistic values of σ_S from appearing in the posterior simulations. Table 2 summarizes inferences from the REML and Bayesian analyses. As anticipated from the discussion in section 3c, the REML point estimates of σ_S are much smaller than the Bayesian point estimates (and similar to the posterior modes in Figs. 3 and 4) and the REML-based confidence intervals are much narrower than their Bayesian equivalents.
As noted by Gilmour and Goos (2009), these features of the REML inferences are not desirable: they reflect a lack of information about σ_S rather than strong evidence that its value is small. A further consequence is that the Bayesian interval estimates for μ are wider than those from the REML analysis. For the other variance components there are only small differences between the REML and Bayesian inferences. Figure 5 summarizes the posterior distributions of the superpopulation and finite-population standard deviations. The posteriors of the superpopulation SDs are positively skewed: relatively small values of these quantities are inconsistent with the data, but relatively large values cannot be ruled out. As expected (section 3c), the finite-population SDs have smaller posterior medians and narrower interval estimates than the superpopulation SDs. This is especially pronounced for scenario (σ_S and s_S).

a. Results: Sources of uncertainty
In any Bayesian analysis, it is helpful to explore the sensitivity of results to a plausible range of prior distributions. Here, we have repeated the analysis above for different values of A in the prior (3). We find that only inferences about σ_S are sensitive to this choice. For example, for 2020-49, the estimated median and 95% credible interval for σ_S are 0.136°C and (0.054, 0.542)°C for

The superpopulation standard deviation estimates in Table 2 and Fig. 5 reveal the dominant sources of uncertainty in projections of global mean temperature. For the 2020-49 time horizon, σ_G has the largest posterior mean value, closely followed by σ_S. This suggests that over this time period the dominant source of uncertainty is the choice of GCM, followed by the choice of emissions scenario. Later in the century (2069-98), however, variability over scenarios is greater than variability over GCMs. The same general findings apply if we compare finite-population SDs, although the importance of scenario is reduced because the posterior medians of s_S are smaller than the respective posterior medians of σ_S. In both time periods, internal variability (represented by the residual standard deviation σ_R) is estimated to be much smaller than variability over GCMs and over scenarios. Internal variability is estimated to be smaller in 2069-98 than in 2020-49. It could be that global warming reduces variability in global temperature. By 2069-98, variability attributable to GCM-scenario interaction has overtaken internal variability.

b. Results: Individual GCMs and scenarios
The top plots in Figs. 6 and 7 summarize the posterior distributions of the effects of individual GCMs and scenarios on our climate indices: that is, (μ + α_1, ..., μ + α_23) and (μ + β_1, μ + β_2, μ + β_3), respectively. We can see that GCM 16 [MIROC3.2(hires)] gives atypically high projections of global temperature change and that projected temperature changes are greatest under scenario A1B in the midcentury and under scenario A2 in the late century. The plots on the bottom left in these figures show how the GCM-specific interaction effects [the terms {μ + γ_ij} in model (2)] vary with scenario; for clarity, we have plotted only the six GCMs considered by Yip et al. (2011) that remain after the exclusion of GCM 24.
The bottom right of Figs. 6 and 7 are normal quantile-quantile (QQ) plots of the posterior medians of the data-level errors ε_ijk = Y_ijk − (μ + α_i + β_j + γ_ij), i = 1, ..., 23, j = 1, 2, 3, k = 1, ..., K_ij: their purpose is to check the assumption in model (2) that these errors are normally distributed, since in this case the points on the QQ plots should lie roughly on a straight line. For 2020-49 (Fig. 6) the points lie remarkably close to the line. For 2069-98 (Fig. 7) the curvature might suggest a slight (positive) skew in the error distribution: this is driven by single runs with values that are greater than those of their counterparts (see, e.g., scenario A2 for GCM 21 and scenario A1B for GCM 19 in Fig. 1).

c. Model checking
We have also carried out posterior predictive checks (see, e.g., Gelman et al. 2003, chapter 6) to assess whether the model is consistent with the data. We compare the real data to 10 000 datasets simulated from the posterior predictive distribution under the model. The real data should not behave very differently from the simulated datasets. To examine this, we choose test summaries to reflect important aspects of the data. The test summaries we use are based on derived datasets containing (i) all responses; (ii) the 23 mean responses for each GCM; (iii) the 3 mean responses for each scenario; (iv) the 64 mean responses for each GCM-scenario combination; and (v) for GCM-scenario combinations with more than 1 run, residuals, defined as the differences between the responses and the corresponding mean value in (iv). The datasets (i)-(v) are chosen as summaries of total variability and of variability across GCMs, scenarios, GCM-scenario combinations, and runs, respectively. For each simulated derived dataset we calculate eight statistics: the minimum, the quartiles, the maximum, the interquartile range, the mean, and the standard deviation. We also calculate these statistics for the derived datasets based on the real data. For each statistic, we calculate the proportion of the simulated values that are greater than the corresponding statistic from the real data to give a posterior predictive p value. Formal treatment of these p values is complicated by the fact that if the model is true the p value is more likely to be near 0.5 than near 0 or 1 (Meng 1994), but values near 0 or 1 may highlight a potential discrepancy between model and data. We find (details are provided as supplementary material) that these checks indicate good agreement between model and data, lending support to the conclusions from the modeling exercise.
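The p-value computation can be sketched as follows (a simplified illustration: here `simulate` stands in for a draw from the fitted model's posterior predictive distribution, and the toy normal model is not model (2)):

```python
import numpy as np

def posterior_predictive_pvalue(observed, simulate, stat, n_sims=10_000, seed=0):
    """Proportion of simulated test statistics exceeding the observed one.
    `simulate(rng)` draws one replicate dataset from the posterior
    predictive distribution; `stat` maps a dataset to a scalar."""
    rng = np.random.default_rng(seed)
    t_obs = stat(observed)
    t_sim = np.array([stat(simulate(rng)) for _ in range(n_sims)])
    return float(np.mean(t_sim > t_obs))

# Toy usage: data and model agree, so the p value should be moderate.
data = np.random.default_rng(42).normal(0.0, 1.0, size=50)
p = posterior_predictive_pvalue(
    data,
    simulate=lambda rng: rng.normal(0.0, 1.0, size=50),
    stat=np.std,
    n_sims=2000,
)
```

In the checks described above, `stat` would range over the eight summary statistics and `observed` over the derived datasets (i)-(v).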

Regional analyses of temperature and precipitation
For many purposes, uncertainties in projections of global temperature change are less relevant than those in projections of regional changes; regional precipitation changes are also likely to be critically important. In this section, therefore, we repeat the analysis of the previous section for both temperature and precipitation changes, within each of the 22 regions outlined in section 2.

a. Regional temperature
We repeat within each region the Bayesian analysis of section 4, using the same half-Cauchy priors: A = 0.5 for 2020-49 and A = 1 for 2069-98. Figure 8 summarizes (using the posterior median and 50% central credible intervals) the estimated posterior distributions of the superpopulation standard deviations σ_G, σ_S, σ_GS, and σ_R, globally and in each region, for 2020-49 and 2069-98. We use 50% intervals to prevent the large uncertainty in σ_S from dominating the plots. Comparing the posterior medians of the superpopulation SDs, we find that, for 2020-49, variability over GCMs is greater than variability over scenarios and runs in every region, and variability over runs is greater than variability over scenarios in some regions, predominantly in the north. For 2069-98, the scenario is a greater source of variability than earlier in the century: it contributes at least as much variability as the GCM in most regions, and much more in many regions. The corresponding figures for the finite-population standard deviations s_G, s_S, s_GS, and s_R (not shown but available as supplementary material) provide the same general findings.

FIG. 8. Regional analyses of change in mean surface temperature from 1980-99 to (top) 2020-49 and (bottom) 2069-98. Posterior quartiles: median (dots) and central 50% credible intervals of the superpopulation standard deviations. The global analysis is summarized in the bottom left. From left to right, the ordering is GCMs, scenarios, GCM-scenario interaction, and runs. The vertical scales are different for the global and regional analyses.

b. Regional precipitation
We repeat the Bayesian analyses for the precipitation indices (percentage changes from the 1980-99 mean), using a half-Cauchy scale parameter of A = 2.5 for 2020-49 and A = 5.0 for 2069-98. These values are chosen using the same argument used for temperature in section 4: here we consider as very unlikely a range of 50 percentage points by midcentury and 100 percentage points by the end of the century.
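The prior mass these scale choices place on large standard deviations can be checked numerically. The sketch below is illustrative (the threshold value is an assumption for demonstration, not taken from the text); it compares a Monte Carlo estimate of P(σ > t) under a half-Cauchy prior with the closed-form tail probability 1 − (2/π) arctan(t/A):

```python
import numpy as np

def half_cauchy_tail_prob(scale, threshold, n=200_000, seed=0):
    """Monte Carlo estimate of P(sigma > threshold) when sigma has a
    half-Cauchy prior with the given scale parameter A."""
    rng = np.random.default_rng(seed)
    draws = np.abs(scale * rng.standard_cauchy(n))
    return float(np.mean(draws > threshold))

def half_cauchy_tail_exact(scale, threshold):
    """Closed-form tail probability of the half-Cauchy distribution."""
    return 1.0 - (2.0 / np.pi) * np.arctan(threshold / scale)

# E.g., with A = 2.5 (midcentury precipitation), the heavy Cauchy tail
# still puts noticeable prior mass well beyond the scale parameter.
p_mc = half_cauchy_tail_prob(2.5, 12.5)
p_exact = half_cauchy_tail_exact(2.5, 12.5)
```

The heavy tail is the point of this prior: it weakly regularizes σ toward plausible magnitudes without ruling out large values.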
There are four fewer runs for precipitation than for temperature: under scenario A1B, GCM 11 (GISS-ER) has three fewer runs and GCM 12 (FGOALS-g1.0) has one fewer. Figure 9 summarizes the estimated posterior distributions of the superpopulation standard deviations σ_G, σ_S, σ_GS, and σ_R, globally and in each region, for 2020-49 and 2069-98.
The findings are quite different from those for temperature. For 2020-49, globally, variability over GCMs is greatest, but there is relatively high variability over different runs from the same GCM. Regionally, the picture is similar, but in many areas variability over runs is comparable to variability over GCMs (in western North America the former is greater than the latter), and in many regions variability over scenarios seems relatively unimportant. For 2069-98, globally, the scenario is more important than in 2020-49, but in many regions it still seems relatively unimportant. A reviewer has pointed out that scenario uncertainty is very low in regions, such as Southeast Asia and South Africa, where large precipitation changes are projected but there is no consensus among the models on the sign of the change; the result is a multimodel mean that is close to zero under all scenarios. In such regions the uncertainty attributed to model-scenario interaction tends to be greater than scenario uncertainty, suggesting that uncertainty due to the model depends on the scenario. In contrast, in areas like Alaska and Greenland, all models indicate an increase in precipitation that grows with increasing greenhouse gas emissions (analogous to the situation that applies, in all regions, for temperature), leading to large scenario uncertainty. The corresponding figures for the finite-population standard deviations s_G, s_S, s_GS, and s_R (not shown but available as supplementary material) provide the same general findings.

FIG. 9. Regional analyses of percentage change in mean precipitation from 1980-99 to (top) 2020-49 and (bottom) 2069-98. Posterior quartiles: median (dots) and central 50% credible intervals of the superpopulation standard deviations. The global analysis is summarized in the bottom left. From left to right, the ordering is GCMs, scenarios, GCM-scenario interaction, and runs. The vertical scales are different for the global and regional analyses.
These results show that the relative contributions of GCM, scenario, and internal variability to climate projection uncertainty depend on the climate variable, the region, and the time horizon.

Discussion
Running climate simulations is a time-consuming exercise, so it is important to make the outputs as useful as possible. Statistical models, with parameters that relate to scientific questions of interest, can help to inform the design of future climate experiments. They can answer questions like the following: How can fixed computational resources be allocated in order to estimate parameters with the greatest precision? What data would be needed to estimate the parameters with a desired precision? One possible objection to models of the type we consider is that the uncertainties due to, for example, the scenario and the scenario-specific GCM run are fundamentally different in nature. However, our analysis does quantify the implications of making choices among different models, scenarios, and simulation runs.
For model (2) in section 3b, choosing a good design is difficult because optimal designs depend on the relative sizes of the superpopulation SDs, which are unknown (Khuri 2000). Thus, some prior information (perhaps based on the results in sections 4 and 5) or a design that adapts to incoming data is necessary. In the current context, for situations where internal variability is relatively unimportant, it is not worth running many simulations per GCM-scenario combination.
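This trade-off can be made explicit under a simple balanced random-effects reading of model (2) (an illustrative assumption; the actual CMIP3 ensemble is unbalanced). The variance of an ensemble grand mean decomposes so that adding runs per GCM-scenario combination shrinks only the internal-variability term:

```python
def projection_variance(sigma_g, sigma_s, sigma_gs, sigma_r,
                        n_gcm, n_scen, n_runs):
    """Variance of the grand mean over n_gcm GCMs, n_scen scenarios, and
    n_runs runs per GCM-scenario combination, for a balanced
    random-effects design: only the last term depends on n_runs."""
    return (sigma_g ** 2 / n_gcm
            + sigma_s ** 2 / n_scen
            + sigma_gs ** 2 / (n_gcm * n_scen)
            + sigma_r ** 2 / (n_gcm * n_scen * n_runs))

# Illustrative SDs with small internal variability (as for global
# temperature): going from 1 to 5 runs per combination barely helps.
v1 = projection_variance(0.3, 0.25, 0.05, 0.05, 10, 3, 1)
v5 = projection_variance(0.3, 0.25, 0.05, 0.05, 10, 3, 5)
```

When σ_R is instead comparable to σ_G (as for regional precipitation), the same formula shows that multiple runs per combination pay off.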
The results in this paper can be used to infer where the major sources of variation lie. For example, the analysis of global temperature in section 4 suggests that variability between runs, for a given GCM-scenario combination, is far smaller than between GCMs and scenarios. Therefore, it is more important to devote resources to quantifying variation over GCMs and scenarios than over such runs. For global temperature it is better to use multiple GCMs and scenarios than multiple runs at single GCM-scenario combinations. However, the analyses reported in section 4a show that for some regions variability over runs is greater than variability over scenarios, particularly for 2020-49; multiple runs for each GCM-scenario combination are therefore desirable to quantify uncertainty if interest lies in these regional quantities. In the precipitation analyses of section 5b we find that variability over runs is generally of greater importance than in the temperature analyses. In some instances it is the largest source of variability and in some regions it is a greater source of variability than scenarios even in 2069-98.
Thus, different climate variables can have competing design requirements and compromise may be necessary in designing climate experiments to meet several objectives. These results do not provide any clear guidance for something like the CMIP experiments, which have multiple potential uses. However, they do provide guidance for users who might want to select a small subset of CMIP runs to assess, for example, the potential impacts of climate change. Impacts studies often involve the selection of a relatively small number of GCM runs to drive their models: the methodology introduced here can provide guidance on how to ensure that the dominant sources of uncertainty are represented in such an exercise.