Sources of Uncertainty in Multimodel Large Ensemble Projections of the Winter North Atlantic Oscillation

Projections of the winter North Atlantic circulation exhibit large spread. Coupled Model Intercomparison Project archives typically provide only a few ensemble members per model, rendering it difficult to quantify reducible model structural uncertainty and irreducible internal variability (IV) in projections. We estimate using the Multimodel Large Ensemble Archive that model structural differences explain two‐thirds of the spread in late 21st century (2080–2099) projections of the winter North Atlantic Oscillation (NAO). This estimate is biased by systematic model errors in the forced NAO response and IV. Across the North Atlantic, the NAO explains a substantial fraction of the spread in mean sea level pressure (MSLP) projections due to IV, except in the central North Atlantic. Conversely, the spread in North Atlantic MSLP projections associated with model differences is largely unexplained by the NAO. Therefore, improving understanding of the NAO alone may not constrain the reducible uncertainty in North Atlantic MSLP projections.

This study aims to advance understanding of the roles of model structural error and IV in North Atlantic circulation projections. To achieve this, we use the recently available Multimodel Large Ensemble Archive (MMLEA, Deser et al., 2020) and data from CMIP5/6. We focus on the leading mode of variability in the North Atlantic circulation, the North Atlantic Oscillation (NAO), which is associated with changes in the strength and latitude of the eddy-driven jet (Woollings et al., 2010). To guide our investigation, we address the following questions: 1. What are the relative contributions of IV and model structural uncertainty to spread in NAO projections? 2. When do the forced NAO response and model differences in this response emerge from IV in the 21st century? 3. What is the minimum number of ensemble members required to separate the forced NAO response, and model differences in this response, from IV? 4. To what extent is spread in North Atlantic circulation projections explained by the NAO?
Addressing these questions will aid the interpretation of North Atlantic circulation projections, improving their utility, and provide guidance for designing future model experiments.

Datasets
The MMLEA contains large (16-100 members) initial-condition ensembles for seven comprehensive climate models (Table S1, Hazeleger et al., 2010;Jeffrey et al., 2013;Kay et al., 2015;Kirchmeier-Young et al., 2017;Maher et al., 2019;Rodgers et al., 2015;Schlunegger et al., 2019;Sun et al., 2018). We use historical and Representative Concentration Pathway (RCP) 8.5 simulations from the MMLEA models for the common period 1950-2099. RCP8.5 was chosen because only a small subset of the models is available for other RCPs. Since GFDL-ESM2M and GFDL-CM3 have similar atmosphere, ocean, sea-ice, and land components (Maher et al., 2021), and give similar results, we discard the smaller GFDL-CM3 ensemble from the MMLEA analysis. The winter North Atlantic circulation is described using monthly mean sea level pressure (MSLP) data averaged over December to February (DJF). Following Collins et al. (2013), the long-term climate response is computed as the 20-year epoch difference between a future period and a near-present-day period (updated to 1995-2014; year is for January).
We also use historical and RCP8.5 simulations from 39 CMIP5 models (Taylor et al., 2012), and historical and Shared Socioeconomic Pathway (SSP) 5-8.5 simulations from 36 CMIP6 models (Eyring et al., 2016, Table S2). The forcing scenarios changed in CMIP6, where SSP5-8.5 has the most similar total end-of-century radiative forcing to RCP8.5 (Meinshausen et al., 2020). However, there are differences in the mix of forcings between the RCP and SSP scenarios (Meinshausen et al., 2011(Meinshausen et al., , 2020 to be borne in mind when comparing results.
Generally, only a few ensemble members are available for the CMIP5/6 simulations, so we estimate IV using the preindustrial control (piControl) runs. Model drift is eliminated by subtracting each run's longterm linear trend. Various observation-based datasets are used to evaluate the spread in model projections against observed IV. Since multidecadal timescales are our focus, we use two centennial-scale reanalysis datasets: the NOAA-CIRES-DOE 20th Century Reanalysis version 3 (20CRv3; Compo et al., 2011;Slivinski et al., 2019) and the ECMWF 20th Century Reanalysis (ERA20C, Poli et al., 2016). A 1000 member "Observational Large Ensemble" (Obs LE; is also used, which contains synthetic historical trajectories produced by a statistical model based on observed climate statistics. We use the full extent of Obs LE (1921LE ( -2014, and the longer common period of 1900-2010 for 20CRv3 and ERA20C to minimize sampling issues. Forced trends in 20CRv3 and ERA20C are estimated and removed using linear least squares regression; Obs LE by construction has no forced MSLP trend .
All model and observation-based data were bilinearly interpolated onto a common 2° horizontal grid; this procedure does not alter our results.

NAO Definition
Following Stephenson et al. (2006) and Baker et al. (2018), the NAO index is defined as the difference in area-averaged MSLP between a southern box (90°W-60°E, 20°N-55°N) and a northern box (90°W-60°E, 55°N-90°N) in the North Atlantic. This index is less sensitive to differences in centers of action between observations and models than the station-based index (Hurrell et al., 2003;Stephenson et al., 2006) and is also less variable enabling easier detection of a forced NAO response from IV. Furthermore, it is less affected by issues of interpretability that occur when using a mathematically constructed empirical orthogonal function (EOF)-based index (Ambaum et al., 2001;Dommenget & Latif, 2002;Stephenson et al., 2006).
Each MMLEA model's historical NAO pattern ( Figure S1) is constructed from the regression slopes obtained by regressing historical  timeseries of DJF MSLP at each grid-point onto the NAO index timeseries. Note that using a future period gives similar results and that all timeseries are first linearly detrended. The pattern is defined separately for each ensemble member and then the ensemble mean is calculated (Simpson et al., 2020). The NAO-congruent part of an MSLP anomaly map is obtained by multiplying the historical NAO pattern by the NAO index anomaly. Figure S1 also shows observation-based and CMIP5/6 multimodel mean (MMM) historical NAO patterns; largely, the modeled and observation-based patterns are highly correlated.

Statistical Methods
In each MMLEA model, uncertainty due to IV is mainly estimated as the standard deviation across ensemble members (Deser et al., 2012). The externally forced response is estimated using the ensemble mean. The percentage variance contribution of IV (%U IV ) and of model structural differences (%U MD ) to the total uncertainty in MMLEA projections is quantified following Maher et al. (2021; Text S1).
A forced response is described as "robust" if it is statistically detectable from IV at the 95% confidence level. Two-sided confidence intervals for a forced response (μ) are calculated as P V r t N / (von Storch & Zwiers, 1999). t is the Student's t-distribution value for p = 0.025 and N − 1 degrees of freedom, σ is the intermember standard deviation of the epoch difference, and N is the ensemble size.
To estimate the minimum ensemble size (N min ) required to detect a robust forced NAO index response of a given magnitude (X) between any two 20-year epochs, we follow Screen et al. (2014) and rearrange a two-sided Student's t-test for a difference of means (Text S2): t c is for p = 0.025 and 2N min − 2 degrees of freedom, and σ is the standard deviation of 20-year epoch means due to IV. N min is calculated for a difference in forced response (X) where σ is for differences in 20-year means. Figure 1 shows winter NAO index anomalies between 2080-2099 and 1995-2014 in the CMIP6, CMIP5, and MMLEA models. For both CMIP5/6 ensembles, the MMM response in the NAO index is ∼1.5 hPa. However, the MMM responses are generally small compared to the spread across the individual models. While some models have large positive NAO anomalies exceeding their modeled range of IV, most modeled anomalies are smaller than IV. The range of NAO anomalies is 6 hPa in CMIP6 and 7 hPa in CMIP5, comparable to the observed range of NAO variability ( Figure 1, gray box), where 86% and 79% of models agree on sign respectively. Given many CMIP5/6 models only have one ensemble member available, it is impossible to separate the spread in projections into parts due to structural model differences and IV. Despite this limitation, uncertainty in projections is often examined using these models (e.g., Hawkins & Sutton, 2009). The MMLEA models suggest there are indeed substantial intermodel differences in the forced response of up to 5 hPa ( Figure 1, colored circles). Using Maher et al. (2021)'s uncertainty decomposition, we find that model structural differences and IV contribute to 66% and 34%, respectively, of the total uncertainty in MMLEA NAO projections. The following sections examine each source of uncertainty in detail.

Uncertainty From Internal Variability
In several MMLEA models, the forced winter NAO response is smaller than IV as measured by the ensemble spread ( Figure 1). Using the ensemble spread to assess the range of possible futures assumes that the models adequately represent observed NAO variability. However, as in previous studies (Bracegirdle et al., 2018;Kim et al., 2018;Kravtsov, 2017;Simpson, Deser, et al., 2018;Wang et al., 2017), we find that most CMIP5/6 and MMLEA models underestimate low frequency NAO variability compared to observation-based datasets (Figure 1, black whiskers vs. gray lines; Tables S1 and S2). The model projections may therefore be overconfident: that is, a larger part of the uncertainty in the future real-world NAO response may be from IV. When model-based estimates of IV are adjusted to an observation-based estimate (Text S1), IV and model structural differences each contribute to half of the total uncertainty in the adjusted MMLEA projections. These estimates also depend on the models simulating a realistic forced NAO response; Section 4 discusses this further. Now, we ask to what extent the NAO explains uncertainty in North Atlantic circulation projections due to IV. Figure 2 presents for each MMLEA model a decomposition of the total ensemble spread in MSLP (top row) into an NAO-congruent part (second row) and a residual (third row). The total uncertainty from IV is generally largest at high northern latitudes, extending from Greenland to Northern Europe, as well as in the central North Atlantic. There is also larger uncertainty from IV in north-eastern North America and continental Europe. The NAO contributes to a large proportion (>50%; Figure 2 circulation variability in the North Atlantic sector (Barnston & Livezey, 1987;Moore et al., 2011;Wallace & Gutzler, 1981).

Uncertainty in the Forced Response
Figure 1 shows structural differences in the late 21st century forced NAO response across the MMLEA models. Here, we ask: When do the forced NAO response and model structural differences in the response become detectable from IV? In the early-to-mid 21st century, most individual model responses are small and nonrobust (Figures 3a and 3b). GFDL-ESM2M is one exception, having a relatively large and robust positive NAO response by 2020-2039. By 2060-2079, most of the model responses become large enough to be detected from IV, except for EC-EARTH due to its smaller response and ensemble size (Figure 3c). Regarding detection of model differences in response, in the mid-21st century only GFDL-ESM2M is robustly distinguishable from the other models (Figure 3b). By 2060-2079, the only model with a negative NAO response (CanESM2; Böhnisch et al., 2020) becomes distinct from other models (Figure 3c). By 2080-2099, CSIRO-Mk3.6 and MPI-ESM-LR develop stronger positive responses and become distinct from CESM1-CAM5 and EC-EARTH (Figure 3d). In short, most of the models simulate a robust forced NAO response by 2060-2079. However, most intermodel differences in the forced response are only detectable by 2080-2099, when %U MD first dominates over %U IV (Figures 3a-3d). This largely holds when the model-based IV estimates are adjusted to an observation-based estimate ( Figure S3).
We now calculate the minimum ensemble size (N min ) required to robustly detect a forced NAO index response, and model differences in this response, given a certain magnitude of IV. First, note that N min is larger when identifying differences in forced response between models than when identifying a response of equivalent magnitude in one model (Figures 3e and 3f). This is consistent with intermodel differences in forced response emerging from IV later. An NAO index response of 0.5 hPa, typical of early-to-mid 21st century MMLEA responses (Figures 3a and 3b is ∼4 hPa in the observation-based datasets. N min is doubled to 20, 40, or 80 to detect a difference in NAO index response of 0.5 hPa between two models. N min for a high IV model is similar to N min calculated using observation-based IV estimates. All subsequent results use the high IV estimate and thus provide an upper bound on N min . To detect larger NAO responses of 1 hPa and 2 hPa, typical of late 21st century MMLEA responses (Figures 3c and 3d), at least 15 or 5 members are required, respectively. This becomes 30 or 10 members for a difference in response. The largest MMLEA model response, and difference in response, of ∼4 hPa in 2080-2099 ( Figure 3d) requires only three members to detect. N min is first minimized at 2 for a response of 5 hPa or a difference in response of 7 hPa. Therefore, when considering more realistic IV estimates, most NAO anomalies and model differences in Figure 1 are nonrobust in CMIP5/6 models with only one ensemble member.
Finally, we ask to what extent intermodel spread in the forced response of North Atlantic circulation projects onto the NAO structure and, therefore, reflects differences in the response of the NAO to external forcing. The forced MSLP response is rather different across the MMLEA models (Figure 4, top row). For example, in CSIRO-Mk3.6, GFDL-ESM2M, and MPI-ESM-LR there is a north-south dipole in pressure anomalies, which is not present in CanESM2, CESM1-CAM5, and EC-EARTH. This is associated with inter-model spread in the NAO-congruent MSLP response (Figure 4, middle row). However, while a substantial portion of the forced North Atlantic MSLP response is NAO-congruent in some models (e.g., 80% in GFDL-ESM2M), this is not true of other models (e.g., EC-EARTH), and there are large residuals in all models (Figure 4, bottom row). Besides limited regions at high latitudes and in Southern Europe, the MSLP residuals contribute to the majority of the intermodel spread in the forced MSLP response (e.g., see Greenland, eastern North America and central Europe; Figure 4, far-right column).
MCKENNA AND MAYCOCK 10.1029/2021GL093258 6 of 10 , N min required to detect a forced NAO response of a given magnitude at the 95% confidence level based on IV estimates from MMLEA models, CMIP5/6 models, and observation-based datasets (Text S2 and S3). (f), As in (e) but for detecting a difference in forced response; note different y-axis scale. Single CMIP5/6 models can be located within the gray plumes using Table S2.

Discussion and Conclusions
The results presented here have improved our understanding of North Atlantic circulation projections in various ways.
First, while the CMIP5/6 models under RCP8.5/SSP5-8.5 show a mean response in the winter NAO index of ∼1.5 hPa during the late 21st century (2080-2099) compared to near-present-day (1995)(1996)(1997)(1998)(1999)(2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014), the individual model responses span 6-7 hPa and less than 90% of models agree on the sign of response. The MMLEA models suggest that approximately two-thirds of the large intermodel spread in CMIP5/6 could be explained by potentially reducible model structural differences and one-third by irreducible uncertainty from IV. While previous studies have noted the large spread in North Atlantic circulation projections (Section 1), this study is the first to quantify these sources of uncertainty using large initial-condition ensembles performed by a subset of CMIP5 models. The real-world relevance of this separation relies on models correctly reproducing the observed magnitude of low frequency IV and forced NAO response. We find the former is generally underestimated in models as in previous studies, but note the latter may also be underestimated (see below).
Second, as expected from the relatively large IV of the winter NAO, we find a relatively long time horizon for detecting a forced NAO response. The MMLEA models suggest that the forced NAO response is only detectable from IV by 2060-2079 and that model structural uncertainty in the forced response is detectable by 2080-2099. Uncertainty in NAO projections is therefore largely irreducible for most of the 21st century. While individual MMLEA models have larger NAO responses that are distinct from IV and other models earlier, this is generally not the case. This highlights a benefit of using the new MMLEA archive here, whereas previous studies have been limited to using a single-model large initial-condition ensemble to quantify the time of emergence of a forced circulation response (Deser et al., 2012(Deser et al., , 2017. Third, we show that a relatively large ensemble size is required to robustly separate the forced NAO response, and model differences in this response, from IV. A typical response (or model difference) of 1-2 hPa over the 21st century requires at least 15-5 (30-10) ensemble members to detect based on realistic estimates of IV. Even for very large responses (model differences) of around 5 hPa (7 hPa), two members are required for detection, meaning the majority of model responses and differences are nonrobust in CMIP5/6 models with only one ensemble member. This result is relevant to the growing application of emergent constraint techniques for narrowing uncertainty in future projections, as this relies on knowledge of the forced response and differences in forced responses across ensembles of models. Future model intercomparison experiment designs should consider the required ensemble sizes for examining regional climate signals (e.g., Milinski et al., 2020).
MCKENNA AND MAYCOCK 10.1029/2021GL093258 7 of 10 Finally, we have examined the extent to which the spread in North Atlantic MSLP projections is NAO-congruent. Regarding spread from IV, this is large in most North Atlantic regions and surrounding land areas, where the NAO explains over 50% of the inter-member spread in individual MMLEA models at higher latitudes and up to 50% around the Mediterranean region. The residual spread in the central Atlantic and western Europe is largely explained by the EA pattern. That the spread in projections from IV is largely explained by dominant modes of atmospheric variability agrees with Deser et al. (2012). These results build on those of Deser et al. (2017), who only analyzed the NAO contribution to spread in projections from IV.
Regarding intermodel spread in the forced North Atlantic MSLP response, while this is largely NAO-congruent at high latitudes and in Southern Europe, the majority of the spread is not NAO-congruent. Therefore, improving understanding of the NAO alone may not constrain the reducible uncertainty in North Atlantic MSLP projections. This is surprising considering previous work demonstrating the resemblance of externally forced model responses to the dominant modes of IV (Deser et al., 2004(Deser et al., , 2012. The large residual uncertainty in the forced MSLP response over Greenland may be associated with local near-surface temperature changes over orography and/or the extrapolation of pressure to mean sea level. These results have some limitations. First, MSLP only provides one perspective of the circulation. When using the zonal wind at 850 hPa (U850), which is related to the meridional pressure gradient, we find a shift in the regions with large uncertainty from IV ( Figure S4). Furthermore, intermodel spread in the forced U850 response appears more NAO-congruent over Europe than for MSLP ( Figure S5).
Second, models appear to underestimate predictable forced NAO variations by a factor of 2 on seasonal timescales (Baker et al., 2018;Dunstone et al., 2016;Eade et al., 2014;Scaife & Smith, 2018;Scaife et al., 2014) and by a factor of 10 on decadal timescales (Smith et al., 2020). This issue may also affect multidecadal NAO projections, though given the limited temporal extent of the observational record this is difficult to assess. If it does, this implies an underestimation of model differences in the forced NAO response and therefore the contribution of the NAO to intermodel spread in the forced circulation response, as well as an overestimation of the time horizon and "true" ensemble size required to detect a forced NAO response from IV. A further limitation of our analysis is that the MMLEA models may not span the full range of forced NAO responses in the CMIP5/6 models. However, it is difficult to assess this given the small ensemble sizes for most CMIP5/6 models.
The dynamical mechanisms responsible for intermodel spread in the forced North Atlantic circulation response need to be understood to identify potential physical constraints on the spread. Oudar et al. (2020) identified various mechanisms within CMIP5/6 projections, but could not determine which are relevant for spread from IV and/or model differences. Harvey et al. (2020)'s results suggest that mean state biases in the North Atlantic jet do not provide a useful constraint. Future studies could utilize MMLEA to investigate the dynamical mechanisms further.