The Southern Annular Mode in 6th Coupled Model Intercomparison Project Models

I analyze trends in the Southern Annular Mode (SAM) in CMIP6 simulations. For the period 1957–2014, simulated linear trends are generally consistent with two observational references but, in some seasons, disagree with two other representations of the SAM. Using a regression analysis applied to model simulations with interactive ozone chemistry, a strengthening of the SAM in summer is attributed nearly completely to ozone depletion, because a further strengthening influence due to long-lived greenhouse gases (GHGs) is almost fully counterbalanced by a weakening influence due to stratospheric ozone increases associated with these greenhouse gas increases. Ignoring such ozone feedbacks would yield comparable contributions from these two influences, an incorrect result. In winter, trends are smaller but an influence of greenhouse gas-mediated ozone feedbacks is also identified. The regression analysis furthermore yields significant differences in the attribution of SAM changes to the two influences between models with and without interactive ozone chemistry: seasonally, ozone depletion plays a stronger role and GHG increases a weaker role in the chemistry models than in the no-chemistry ones.

The 6th Coupled Model Intercomparison Project (CMIP6) brings together the latest generation of climate models to produce simulations informing the upcoming 6th Assessment Report of the IPCC (Eyring et al., 2016). These models have generally undergone further development since CMIP5. There is also a much more diverse range of sensitivity experiments available than under CMIP5, targeting a large variety of forcing types and processes, aimed at better characterizing and understanding how models respond to forcings. Using three of these experiments and many "historical" all-forcings simulations, here I conduct a seasonally resolved attribution of trends in the SAM to the two leading influences, ozone-depleting substances (ODSs) and greenhouse gases (GHGs).

Definition of the SAM
While not perfectly "annular", the SAM is characterized by a large zonally symmetric component. Hence for simplicity, I here follow Gong and Wang (1999) and define the SAM index to be the difference in monthly and zonal-mean sea-level pressure (psl), or for above-surface features geopotential height on pressure levels (zg), between 40°S and 65°S. For every simulation the SAM index is smoothed with a 3-month boxcar filter (such that the SAM index represents the seasonal mean centered on a given month). The first and last months of each data set are invalidated.
The results are, for every model, month of the year, and simulation covered by that model, a timeline of seasonal SAM indices covering 1850-2014 (1851-2014 for DJF, 1850 for NDJ).
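The index construction described above can be sketched in a few lines. This is a minimal illustration only: the function names and the six months of pressure values are hypothetical, and the zonal averaging is assumed to have been done already.

```python
from statistics import fmean

def sam_index(psl_40s, psl_65s):
    """Monthly SAM index: zonal-mean pressure at 40S minus that at 65S (hPa)."""
    return [a - b for a, b in zip(psl_40s, psl_65s)]

def boxcar3(series):
    """Centered 3-month running mean; the first and last months are invalidated (None)."""
    out = [None] * len(series)
    for k in range(1, len(series) - 1):
        out[k] = fmean(series[k - 1:k + 2])
    return out

# Hypothetical zonal-mean sea-level pressures (hPa) for six consecutive months.
psl_40s = [1016.0, 1015.0, 1017.0, 1018.0, 1016.0, 1015.0]
psl_65s = [986.0, 984.0, 985.0, 988.0, 987.0, 983.0]

raw = sam_index(psl_40s, psl_65s)
seasonal = boxcar3(raw)
print(raw)
print(seasonal)
```

Each smoothed value represents the seasonal mean centered on its month, which is why the two end months have no valid value.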

Method of Data Analysis
For the purposes of attributing any long-term trends in these data using a multiple linear regression analysis, I form two regressor functions (Morgenstern et al., 2018), both using forcing data provided by Meinshausen et al. (2017):
• Equivalent chlorine ($\mathrm{Cl^{eq}}$) is the sum of the surface abundances of all chlorinated or brominated ODS gases weighted with the numbers of chlorine and bromine atoms per molecule, and additionally multiplying any bromine source gas by a factor of 60 to account for the larger per-atom depletion of ozone caused by bromine than by chlorine (Newman et al., 2007). $\mathrm{Cl^{eq}}$ is shifted by four years to account for the time it takes for the ODSs to be delivered into the stratosphere.
• Equivalent CO$_2$ ($\mathrm{CO_2^{eq}}$) is the sum of the surface abundances of all long-lived GHGs weighted by their specific radiative efficiencies divided by that of CO$_2$, with specific radiative efficiency coefficients taken from Table 8.A.1 of AR5 (Myhre et al., 2013).
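As a rough illustration of the equivalent-chlorine weighting, the sketch below applies the atom counts, the bromine factor of 60, and the four-year shift to a toy inventory. The three gases and their abundances are hypothetical placeholders, and the direction of the shift (year y takes the surface value of year y - 4) is my reading of the description, not a confirmed detail.

```python
BROMINE_FACTOR = 60  # per-atom weight of bromine relative to chlorine
LAG_YEARS = 4        # assumed delay between surface abundance and stratospheric effect

# (chlorine atoms, bromine atoms) per molecule for three example gases.
ATOMS = {"CFC-11": (3, 0), "CFC-12": (2, 0), "CH3Br": (0, 1)}

def equivalent_chlorine(abundances):
    """Atom-weighted sum of surface abundances (ppt) for one year."""
    total = 0.0
    for gas, ppt in abundances.items():
        n_cl, n_br = ATOMS[gas]
        total += ppt * (n_cl + BROMINE_FACTOR * n_br)
    return total

def lagged(series_by_year, lag=LAG_YEARS):
    """Shift a {year: value} series so that year y takes the value of year y - lag."""
    return {year + lag: value for year, value in series_by_year.items()}

# Hypothetical surface abundances in ppt for a single year.
surface = {1990: {"CFC-11": 260.0, "CFC-12": 480.0, "CH3Br": 9.0}}
cl_surface = {year: equivalent_chlorine(gases) for year, gases in surface.items()}
cl_shifted = lagged(cl_surface)
print(cl_surface, cl_shifted)
```

The equivalent-CO2 regressor would follow the same pattern, with radiative-efficiency ratios in place of atom counts.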
Both regressor functions are normalized such that they equal 0 at the start of the record (1850) and increase to 1 in 2014 (Figure 1). $\mathrm{Cl^{eq}} = 0$ until about 1950, followed by a ramp-up in the 1970s-1990s to values exceeding 1 and a slow decay in the 21st century, reflecting the removal of ODSs from the atmosphere after emissions of ODSs have mostly ceased. By contrast, $\mathrm{CO_2^{eq}}$ is increasing throughout the "historical" period, with substantial increases already between 1850 and 1950 and a speed-up in the latter decades. The shapes of these two functions mean that any trend in the SAM index during 1850-1950 will be reflected in a projection onto $\mathrm{CO_2^{eq}}$. As an aside, Figure 1 implies that a regression analysis using these functions over a much more restricted period (e.g., post-1950, when observational uncertainties are smaller, see below) would be impractical because the two regressor functions $\mathrm{Cl^{eq}}$ and $\mathrm{CO_2^{eq}}$ become too similar. I thus determine, using least squares linear regression, coefficients $S_0$, $S_1$, and $S_2$ in the function

$$S(m, y) = S_0(m) + S_1(m)\,\mathrm{CO_2^{eq}}(y) + S_2(m)\,\mathrm{Cl^{eq}}(y) + \epsilon(m, y) \qquad (1)$$

such that the residual $\epsilon$ is minimized in the root-mean-square metric.
Here, $S$ is an observed or modeled SAM index, a function of month $m$ and calendar year $y$. I interpret the terms as follows: $S_0$ stands for the baseline mean seasonal cycle of the SAM index in preindustrial times; $S_1$ is the change in the SAM index driven by long-lived greenhouse gases, reflecting in particular any trends in the first 100 years when $\mathrm{Cl^{eq}} = 0$; $S_2$ is the change in the SAM index driven by ODSs, which have been elevated from the 1960s onwards; and $S_1 + S_2$ is the total strengthening due to both anthropogenic causes in 2014. $S_0$, $S_1$, and $S_2$ are seasonally resolved for the 12 overlapping seasons of the year. I furthermore model the residual $\epsilon$ using an analogous least squares regression approach:

$$\epsilon^2(m, y) \approx E_0(m) + E_1(m)\,\mathrm{CO_2^{eq}}(y) + E_2(m)\,\mathrm{Cl^{eq}}(y) \qquad (2)$$

In this approach, $\epsilon^2$ is assumed to also vary with the two regressor functions. Where the regression fit produces negative numbers for any model or season (which would indicate the regression fit is poor), I replace it with a constant (i.e., $E_1 = E_2 = 0$, so that the regression becomes a constant in time). For the ensemble-mean SAM indices derived from "historical" simulations detailed below, this is the case only for one month each in two models (FIO-ESM-2-0 in November, NorCPM1 in March; see below).
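The least-squares fit of the SAM index onto a constant and the two regressor functions can be sketched as follows for a single season. This is an illustrative sketch only: the regressor shapes below are simple stand-ins (a quadratic and a post-1950 ramp), not the Meinshausen et al. (2017) forcing data, and the helper names are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = (m[r][3] - tail) / m[r][r]
    return x

def fit_sam(sam, co2eq, cleq):
    """Least-squares S0, S1, S2 in sam ~ S0 + S1*co2eq + S2*cleq (normal equations)."""
    cols = [[1.0] * len(sam), list(co2eq), list(cleq)]
    ata = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    atb = [sum(p * s for p, s in zip(ci, sam)) for ci in cols]
    return solve3(ata, atb)

years = list(range(1850, 2015))
# Stand-in regressors, both 0 in 1850 and 1 in 2014 (shapes are assumptions).
co2eq = [((y - 1850) / 164) ** 2 for y in years]
cleq = [max(0.0, (y - 1950) / 64) for y in years]
# Noise-free synthetic SAM index with known coefficients (hPa).
sam = [25.0 + 0.5 * g + 3.0 * c for g, c in zip(co2eq, cleq)]

s0, s1, s2 = fit_sam(sam, co2eq, cleq)
print(round(s0, 6), round(s1, 6), round(s2, 6))
```

Applying the same `fit_sam` call to the squared residuals would give the variance coefficients, with a fall-back to a constant wherever the fitted variance turns negative.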
If the regression model (Equation 1) were perfect and $\epsilon$ were normally distributed, the variance terms would scale with the inverse of the ensemble size $n$ where single-model ensemble means are considered: $E_i \sim 1/n$.
The regression model developed above is complemented with a simpler linear trend analysis conducted on the period since 1957, when permanent meteorological observations started in Antarctica. If $T_{ij}$ is the linear trend in ensemble member $j$ of model $i$, and $\sigma_{ij}$ its uncertainty at 68% confidence, then

$$P_{ij}(T) = G\!\left(\frac{T - T_{ij}}{\sigma_{ij}}\right)$$

is the probability that a given trend $T$ is larger than the best-estimate trend, $T_{ij}$. Here, $G$ is the Gaussian integral,

$$G(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2}\, dt.$$

I thus form the weighted mean of this distribution over all CMIP6 historical simulations:

$$P(T) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} P_{ij}(T).$$

Here, $P(T)$ is the cumulative probability distribution function that any given trend $T$ exceeds the trend derived from CMIP6 model simulations, jointly accounting for model as well as statistical uncertainties in these trends. $m$ is the number of models in the ensemble and $n_i$ is the number of historical simulations provided by model $i$.
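The weighted cumulative distribution described above can be sketched as a mixture of Gaussian cumulative distribution functions in which every model carries equal total weight. The sketch assumes that the Gaussian integral is the standard normal cumulative distribution function and that the 68% uncertainty is one standard deviation. The model names and trend values are hypothetical.

```python
from math import erf, sqrt

def gauss_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_exceed(t, trends):
    """Weighted mean over models and members of P(t exceeds the member trend).

    trends maps a model name to a list of (trend, one_sigma_uncertainty) pairs.
    Each model carries weight 1/m; each of its n_i members carries 1/(m * n_i).
    """
    m = len(trends)
    total = 0.0
    for members in trends.values():
        n_i = len(members)
        total += sum(gauss_cdf((t - t_ij) / s_ij) for t_ij, s_ij in members) / n_i
    return total / m

# Hypothetical 1957-2014 trends and uncertainties (hPa per year).
trends = {
    "model_a": [(0.05, 0.02), (0.07, 0.02)],
    "model_b": [(0.03, 0.03)],
}
for t in (0.0, 0.04, 0.08):
    print(t, round(p_exceed(t, trends), 3))
```

An observed trend can then be compared against chosen quantiles of this distribution, for instance the central 95% interval.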

Data and Models
Models used in the analysis below are listed in Table 1. I use almost all simulations available at the time of download for the "historical", hist-GHG, hist-stratO3, and hist-1950HC experiments (for definitions of the experiments see below). To limit model redundancy, some model variants that are nearly identical to one of the models used here are not included, for example high-resolution versions of some models. Models for which only one historical simulation is available are also generally not used, with the exception of GFDL-CM4, which is retained because it is the basis of the chemistry model GFDL-ESM4. In total, I consider 282 historical, 61 hist-GHG, 28 hist-stratO3, and 11 hist-1950HC simulations.
Briefly, the experiments are characterized as follows:
• "historical": This all-forcings experiment covering 1850-2014 is conducted by all CMIP6 models. Models and simulations used here are listed in Table 1. No-chemistry models are driven by transient historical mixing ratios of long-lived GHGs and usually the four-dimensionally varying CMIP6 ozone data set (Checa-Garcia et al., 2018), whereas in chemistry models CMIP6 ozone forcing is replaced with global-mean mixing ratios of ODSs imposed at the surface and transported into the stratosphere, where increases of ODSs cause stratospheric ozone to decrease in the late 20th century.
• "hist-GHG": In this experiment all forcings are kept at their 1850 values except for greenhouse gases. Ozone is also kept invariant in this experiment. I consider nine no-chemistry models that have participated in this experiment.
• "hist-1950HC": This experiment is identical to "historical" except that it covers only 1950-2014 and ODSs are kept at their 1950 abundances. The six models participating in this experiment all have interactive stratospheric ozone chemistry, thus consistently representing the impact of ODSs. Note that a key difference with respect to hist-GHG is that increasing GHG forcing drives trends in ozone, causing further dynamical feedbacks in this experiment that are absent in hist-GHG.
• "hist-stratO3": In this experiment, designed for no-chemistry models, all forcings are kept at their 1850 values except for stratospheric ozone, which follows the CMIP6 ozone climatology. This experiment covers 1850-2020; four no-chemistry models are used here.
In addition to CMIP6 simulations, I also consider various observational references for the SAM (Table 2). Figure 2 shows that even in the decades since 1957 some disagreements occur between the observational references (Schneider & Fogt, 2018). Apart from some offsets between the datasets, in particular HadSLP2 displays stronger increases in all seasons in the 21st century than the other datasets. In DJF, NOAA-20CR appears to agree well with CERA-20C, but less so in winter (JJA). In the 19th century, some substantial discrepancies appear between the two datasets covering this period (HadSLP2 and NOAA-20CR), reflecting the paucity of data coverage during this period (Fogt et al., 2009). A substantial strengthening of the SAM index is evident from the 1950s onwards.

Synoptic Analysis of the SAM in CMIP6 Simulations
In a first step, I consider the seasonally resolved linear trend in the CMIP6 simulations and the observations for the period of 1957-2014. I note that before 1957, there were no continuous meteorological observations made in Antarctica and therefore SAM reconstructions become more uncertain. The [...] around 0.03 ± 0.04 hPa a⁻¹ in winter (the best-estimate thick black contour in Figure 3; uncertainties are for the 68% confidence level). In summer, these trends are in general agreement with the observational references, but in winter, only the trends derived from the Marshall index and from NOAA-20CR are within the 95% probability interval of the CMIP6 trends for all seasons, whereas HadSLP2 and particularly NCEP/NCAR have trends that are outside this range in winter, making them very likely inconsistent with the CMIP6 distribution of trends. This result is similar to an earlier finding based on CMIP5 data (Swart et al., 2015) in that for DJF trends derived from CMIP5 simulations are also consistent with observations, and simulated JJA trends are not. However, Swart et al. (2015) study a slightly different period and use an older version of NOAA-20CR, making this a skewed comparison. The agreement found here is better than found by Swart et al. (2015), partly because of the inclusion in the analysis of the Marshall index and because here I use a newer version of NOAA-20CR. General issues with observational datasets, especially the exaggerated wintertime trends, are well known (e.g., Fogt et al., 2009, 2018; Marshall, 2003; Schneider & Fogt, 2018); Gillett and Fyfe (2013) find similar discrepancies for JJA between the trends in CMIP5 models and those in HadSLP2 and an earlier version of NOAA-20CR.

Table 2
Global Gridded SLP Reconstructions Used Here
Note. The first four are reanalysis products. ERA5 and NCEP/NCAR are reanalyses using a wide variety of observations, whereas CERA-20C and NOAA-CIRES-DOE 20CRv3 (abbreviated here to "NOAA-20CR") are using more restricted observations. HadSLP2 is a gridded interpolation of SLP observations using station and ship data. The Marshall index is a representation of the SAM index using observations from 12 stations only. SAM, southern annular mode.

MORGENSTERN 10.1029/2020JD034161 6 of 20
Given the simplicity of construction of the Marshall (2003) index, and the fact that HadSLP2 shows anomalously high values also for the most recent decades characterized by comparatively excellent data coverage, in the following I will only consider the NOAA-20CR reanalysis for the regression analysis outlined above, which uses the full 165-year historical simulations. I note however persistent reservations also about the quality of this data set for the first century (1851-1956), for which few actual southern high-latitude observations have entered this reanalysis (Schneider & Fogt, 2018).
Moving now to analyzing the SAM index for the whole "historical" period in the CMIP6 ensemble for the austral summer and winter seasons (Figure 4), it is clear that the regression model (Equation 1) well approximates the multi-model mean "historical" evolution of the SAM index for all subsets of models displayed.
In summer, all four model subsets in the multi-model mean (MMM) exhibit about the same strengthening in 2014 (∼4 hPa), which is larger than in winter (Figure 4). In summer, the groups are in general agreement with NOAA-20CR, which also shows strengthening, albeit of slightly larger magnitude than the multi-model mean. The regression analysis as well as the hist-1950HC experiment (for the chemistry models) indicate that the strengthening in summer is predominantly caused by the growth in ODSs from about 1960 onwards. However, the hist-GHG experiment indicates substantial strengthening (exceeding ∼2 hPa in the MMM) also when all forcings other than the GHGs but including ozone are held invariant, and the hist-stratO3 experiment yields a strengthening in summer which is smaller than in the equivalent historical ensemble. I will discuss all of these findings in more detail in Section 6.
For the chemistry group, the amount of strengthening found here (Figure A3) aligns broadly with the amount of Southern-Hemisphere ozone depletion, with models simulating strong ozone loss (UKESM1, CNRM-ESM2-1) showing a larger strengthening than those with comparatively weak ozone loss (MRI-ESM2, CESM2-WACCM). In winter, all model groups indicate much weaker strengthening than in summer, which however is in disagreement with a ∼7 hPa strengthening in NOAA-20CR (Figure 4). Given the good agreement between the trends in NOAA-20CR and the CMIP6 models for the period post-1957 during which much of the anthropogenic forcing was established (Figure 3), it is plausible that the disagreement found here is an artifact of the NOAA-20CR data. The regression analysis suggests only a small to no role for ozone depletion. Both the strong influence of ozone depletion in summer and the weak one in winter are in agreement with previous studies (e.g., Morgenstern et al., 2014, 2018; Son et al., 2010).
In the following section I will address three questions, motivated by this analysis:
1. Are the CMIP6 model simulations, as an ensemble, consistent with NOAA-20CR under the regression analysis laid out above?
2. Are there any statistically robust differences between the chemistry and no-chemistry groups of models?
3. How can I reconcile the apparently contradictory findings regarding what is driving the summertime SAM derived from the hist-1950HC, hist-GHG, and hist-stratO3 experiments?

Validation of the CMIP6 Historical Simulations versus NOAA-20CR
A fundamental problem here is that there are hundreds of CMIP6 "historical" simulations, but only one realization produced by nature, approximated by NOAA-20CR. I thus need to adequately consider natural variability when comparing observations to CMIP6 simulations. Figure 5 shows density plots for the three regression coefficients and their sum $S_1 + S_2$ as derived from individual historical CMIP6 simulations, with every simulation weighted with $1/n_i$, with $n_i$ being the number of simulations in the historical ensemble of model $i$. An inspection of the results for individual models (not shown) indicates that the spread in $S_0$ evident in Figure 5 is not the result of natural variability (i.e., spread within single-model ensembles) but rather reflects the differences in the base states of individual models that are seen in all ensemble members with little random variability. However, the models generally well reflect the two maxima in spring and autumn (known in Aotearoa New Zealand as the "windy seasons"), in agreement with NOAA-20CR.

Figure caption (continued). Note that in (c)-(h) the models are grouped such that the historical MMMs are constructed from the same models as the sensitivity MMMs (although the 21st century strengthening in the historical MMM is largely insensitive to this grouping). A version of this figure with individual models identifiable by color is appended (Figure A3). GHGs, greenhouse gases; SAM, southern annular mode.
For the component $S_1$ of the SAM that goes with $\mathrm{CO_2^{eq}}$, in DJF the CMIP6 simulations are roughly evenly divided between positive and negative values, but in austral winter, they favor a positive contribution to the SAM (i.e., increasing GHGs drive a strengthening of the SAM). An inspection of individual model results shows that the spread in $S_1$ has contributions due to both natural variability and inter-model disagreements. NOAA-20CR is generally within the CMIP6 range for all months but in spring (SON) tracks close to the upper end of the range spanned by the CMIP6 models.
For the component $S_2$ reflecting the influence of ODSs, the CMIP6 historical ensemble shows predominantly positive influences during summer (i.e., ODSs driving a summertime strengthening of the SAM), in agreement with NOAA-20CR. During winter, CMIP6 favors a negative influence (weakening) of ODSs on the SAM, but a substantial fraction of simulations also shows a strengthening. NOAA-20CR is generally in agreement with this behavior. $S_2$ exhibits a generally better agreement between NOAA-20CR and the CMIP6 models than $S_1$, likely because ODSs only started to increase around the time of the onset of measurements in Antarctica, and thus trends driven by ODS increases are better captured by observations than earlier, possibly spurious trends projecting onto GHG increases in NOAA-20CR which occurred before these observations started.
The sum of the regression coefficients, $S_1 + S_2$, shows less variability than the two coefficients individually. This is the result of nonzero correlation between the two regressors. $S_1 + S_2$ is therefore more robustly diagnosed from both the references and the models than either coefficient individually. The much larger wintertime strengthening in NOAA-20CR, which is outside the range spanned by the CMIP6 ensemble, is evident here as well.
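The reduced spread of the sum can be read off a standard variance identity, added here only as an illustration and not as a calculation taken from the analysis itself:

```latex
\mathrm{Var}(S_1 + S_2) = \mathrm{Var}(S_1) + \mathrm{Var}(S_2) + 2\,\mathrm{Cov}(S_1, S_2)
```

In a least-squares fit with an intercept and two positively correlated regressors, the estimated coefficients are negatively correlated, so the covariance term is negative and partly cancels the two variance terms. How much of the spread cancels depends on how strongly the two regressor functions resemble each other over the fitting period.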
I conclude from this analysis that (a) during summer the CMIP6 simulations are broadly in agreement with NOAA-20CR regarding the influences of ODSs and GHGs and their combination, whereas (b) during winter and spring the total best-estimate strengthening of the SAM evident in NOAA-20CR is irreconcilable with the CMIP6 ensemble. Given the better agreement between CMIP6 and NOAA-20CR, in all seasons, for the SAM strengthening in 1957-2014 (Figure 3), I conjecture that this may well reflect spurious variations in the NOAA-20CR reanalysis for the period before 1957.
Next I perform a formal analysis of differences between the behaviors of chemistry and no-chemistry models, with a particular emphasis on an uncertainty calculation.

Chemistry Versus No-Chemistry Models
In comparing the two model groups, I consider two sources of statistical uncertainty: The first is the random-noise uncertainty associated with the regression approach itself. To account for this, for every model [...] I then form the six-model mean and statistical uncertainty range of these data from the six chemistry models.
Here the uncertainty only refers to the statistical component reflecting natural variability, which due to the six-model averaging is smaller than for individual models.
For the no-chemistry group, I note that there are 100,947 distinct 6-model subsets that can be formed from the 23 no-chemistry models. I thus form those 100,947 distinct six-model averages of the $s_0$, $s_1$, and $s_2$ coefficients. Multiplied by the 1000 Monte-Carlo realizations each, this yields a total of 100,947,000 regression realizations. The distribution functions of $s_0$, $s_1$, and $s_2$ define the uncertainties in 6-model-mean regression coefficients derived from a randomly chosen 6-model subset of the no-chemistry models. This analysis accounts for statistical/random noise as well as model-selection uncertainties. The presence of both these types of uncertainty in the no-chemistry group means that the no-chemistry 6-model-mean coefficients are subject to larger total uncertainties than those of the chemistry group (see below). I then, for both ensembles, reduce the 1000 and 100 million realizations, respectively, to their means and 2.5, 16, 84, and 97.5 percentiles, representing the one- and two-standard-deviation uncertainty bounds of the 6-model-mean regression parameters (Figures 6 and 7). For the baseline SAM index $S_0$, there are relatively small differences between the two model sets, although the large inter-model differences noted before produce a substantial spread in the uncertainty range for the no-chemistry ensemble which is absent in the chemistry group (as there is only one such ensemble possible, characterized by a very small random uncertainty). Except during austral spring, the SAM in the chemistry group tends to be slightly stronger than in the no-chemistry group with a probability exceeding 60% in February and September and a tendency for chemistry models to have a weaker $S_0$ in May-June than the no-chemistry models (Figure 7).
The GHG influence $S_1$ is generally stronger in the no-chemistry group (with probabilities of 90% or larger during summer and autumn [DJFMAM]). In DJF there is a disagreement in sign, with no-chemistry models favoring a strengthening influence of GHGs but chemistry models favoring a weakening. The ODS influence $S_2$ is stronger throughout most of the year in the chemistry versus no-chemistry group, with particularly large differences occurring in summer and autumn when the probability exceeds 90%. In winter, chemistry models show a zero influence but no-chemistry models display a weakening influence of ODSs. Both are much weaker than the GHG influence and insignificant at the 68% confidence level.
As for the combination of both influences $S_1 + S_2$, no-chemistry models (with a probability of 60%-90%, depending on season) favor a larger strengthening than the chemistry models, although the difference is always less than 1 hPa.
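The subset count quoted above, and the reduction of subset means to percentiles, can be checked with a short sketch. The 23 per-model coefficient values are hypothetical placeholders, the nearest-rank percentile convention is an assumption, and the 1000 Monte-Carlo realizations per subset are left out to keep the example small.

```python
from itertools import combinations
from math import comb
from statistics import fmean

N_NOCHEM, SUBSET_SIZE = 23, 6
n_subsets = comb(N_NOCHEM, SUBSET_SIZE)  # number of distinct six-model subsets

def subset_means(values, k):
    """Mean of every distinct k-member subset of the per-model values."""
    return [fmean(c) for c in combinations(values, k)]

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list (convention assumed here)."""
    idx = round(p / 100 * (len(sorted_vals) - 1))
    return sorted_vals[idx]

# Hypothetical per-model coefficients (hPa) for the 23 no-chemistry models.
coeffs = [0.1 * k for k in range(N_NOCHEM)]
means = sorted(subset_means(coeffs, SUBSET_SIZE))

bounds = {p: percentile(means, p) for p in (2.5, 16, 84, 97.5)}
print(n_subsets, len(means))
print(round(fmean(means), 3), {p: round(v, 3) for p, v in bounds.items()})
```

Because every model appears in the same number of subsets, the mean over all subset means equals the plain 23-model mean. The subsetting changes the spread, not the center.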

Deep Coupling of the SAM
In the preceding section I have only studied the surface expression of the SAM. Here, I discuss briefly what the trends and the regression analysis yield when applied to the SAM index derived from geopotential height (zg) fields. In particular I will contrast the chemistry models with their near-equivalent no-chemistry variants (CESM2, CNRM-CM6-1, GFDL-CM4, GISS-E2-1-G, HadGEM3-GC31-LL; Table 1).
The 1957-2014 linear trends in the SAM (Figure 8) show in all models (both chemistry and near-equivalent no-chemistry models) a maximum in the strengthening of the SAM in late spring between 100 and 10 hPa. In the chemistry models, the amplitude of this maximum depends on the amount of ozone depletion simulated, with UKESM1-0-LL and MRI-ESM2-0 simulating the largest and smallest trends in zg for relatively large and small amounts of Southern-Hemisphere ozone loss, respectively. The strengthening disappears when halocarbons are suppressed in the hist-1950HC experiment; it is replaced with mostly insignificant trends of inconsistent sign. In three out of five cases where this comparison is possible, spring/summer trends in zg are larger in the chemistry models than in their no-chemistry equivalents. The CESM2/CESM2-WACCM and GFDL-CM4/GFDL-ESM4 pairs exhibit about the same strengthening. In the hist-GHG experiment (where ozone is prescribed and GHGs provide the only forcing) in some cases (particularly CESM2/CESM2-WACCM, GFDL-CM4/GFDL-ESM4, and GISS-E2-1-G/GISS-E2-1-Gchem) the trends are larger than in hist-1950HC (where ozone is interactive), but in other cases this difference is relatively small (CNRM-CM6-1/CNRM-ESM2-1, HadGEM3-GC31-LL/UKESM1-0-LL).
Extending now the regression analysis (covering 1850-2014) to the SAM index derived from zg fields, for five of the six chemistry models (except for MRI-ESM2-0) the regression analysis (Figure 9) shows that ozone depletion is driving a strengthening of the SAM during spring and summer, but there is a sizeable offset due to increasing GHGs. In MRI-ESM2-0 a weak strengthening of the SAM associated with the small ozone depletion characterizing this model  is not partially offset by a weakening influence of GHGs (panels q, r). There is substantial anticorrelation between the influences of both forcings also in other seasons. The no-chemistry equivalents largely show similar behavior although in two of the models (GISS-E2-1-G, HadGEM3-GC31-LL) the offsetting effect maximizes earlier in the year and is not perfectly aligned with the ozone depletion season, unlike in most chemistry models. However, in all cases, deep coupling is evident whereby relatively large SAM trends in the stratosphere, with a delay of a few months, drive corresponding zg-based SAM trends in the troposphere in summer which then manifest as the trends in the psl-based SAM discussed above.

Discussion
Validations of multi-model climate experiments versus observations, and comparisons of different groups of models, require a careful consideration of statistical uncertainty bounds. This is particularly the case for the SAM which for the period before the onset of routine meteorological measurements in Antarctica is subject to substantial uncertainties. For this reason, evaluations of trends in the SAM are often restricted to recent decades only, and often only involve calculating linear trends. In the face of large random variability and considerable differences in model behavior, such approaches can result in substantial uncertainties in the resultant trends and consequently only weak conclusions about any drivers of such trends.
To advance in the face of these issues, I here pursue an approach that (a) maximizes the usage of available simulations (i.e., I use almost all available CMIP6 historical simulations); (b) uses a regression model to consider the whole "historical" period (1850-2014), accounting for the leading anthropogenic influences that modulate the SAM on decadal-to-century timescales; and (c) in comparing the mean behaviors of two different groups of models (with and without interactive ozone), forms all possible subsets of the larger group that are of equal size to the smaller group, which ensures strict comparability.
For the period 1957-2014, during which continuous Antarctic meteorological observations exist, I find a mean linear strengthening trend in the CMIP6 ensemble with confidence bounds which in summer include the trends derived from four observational datasets. In winter, however, two of these datasets (HadSLP2, NCEP/NCAR) exhibit spuriously large trends and are thus removed from further analysis. The remaining two datasets (Marshall (2003) and NOAA-20CR) are consistent with each other and with the distribution of trends in the CMIP6 ensemble. This makes NOAA-20CR my primary observational reference data set as it extends back to 1851.

Figure 6. Thick black: Mean parameters $S_0$, $S_1$, $S_2$, and $S_1 + S_2$ from the no-chemistry model ensemble. Dark and light blue: Their 68% and 95% confidence intervals. Solid red: Mean parameters $S_0$, $S_1$, $S_2$, and $S_1 + S_2$ from the chemistry model ensemble. Dotted red: Their 68% and 95% confidence intervals.

Figure 7. Probabilities $P$ that the multi-model mean of (black) $S_0$, (blue) $S_1$, (olive) $S_2$, and (light green) $S_1 + S_2$ is larger for the chemistry group of models than for a randomly chosen 6-model no-chemistry group.

Figure 8. 1957-2014 trends in the ensemble-mean SAM index derived from geopotential height fields on pressure levels (m a⁻¹) in the chemistry models and their no-chemistry equivalents (where available). "+" symbols denote insignificant trends (small "+": at the 68% confidence level; large "+": at the 95% confidence level). (a, e, i, l, r) No-chemistry, historical. (b, f, m, s) No-chemistry, hist-GHG. (c, g, j, n, p, t) Chemistry, historical. (d, h, k, o, q, u) Chemistry, hist-1950HC. SAM, southern annular mode.

Figure 9. Regression coefficients $S_1$ and $S_2$ derived from the ensemble-mean SAM indices (in units of meters) in the chemistry models and their no-chemistry equivalents (where available). (a, e, i, m, s) $S_1$, no-chemistry models. (b, f, j, n, t) $S_2$, no-chemistry. (c, g, k, o, q, u) $S_1$, chemistry. (d, h, l, p, r, v) $S_2$, chemistry.
Forced multidecadal SAM variations in the four experiments analyzed here cannot be reconciled using the traditional approach of only accounting for ozone depletion and GHG influences. The hist-1950HC experiment clearly shows that in the absence of increases in ODSs causing ozone depletion, the models in the mean exhibit almost no strengthening in DJF. In this two-factor framework this would be in contradiction with the hist-GHG experiment, which does produce a substantial MMM growth of the SAM index. Following Morgenstern et al. (2014), I therefore posit that a three-factor approach is needed to explain this, comprising the factors ozone depletion (ODS), greenhouse-gas induced warming (the "direct GHG" effect, dGHG), and the impact on dynamics of greenhouse-gas induced ozone changes (the "indirect GHG" effect, iGHG). These factors are variously taking effect in the four experiments considered here (Table 3). Table 3 represents an overdetermined system of four linear equations for the three factors ODS, dGHG, and iGHG. There are four three-equation subsets, one of which in DJF has no solution. The other three yield the ranges as indicated in the lower three lines of Table 3. In JJA, the four-equation system has the one solution indicated.
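The structure of this overdetermined system can be sketched as follows. Both the factor assignment per experiment and the response values are assumptions made for illustration: the assignment is inferred from the experiment descriptions (for example, hist-stratO3 is taken to contain both ODS-driven and GHG-driven ozone changes), and the responses in hPa are hypothetical numbers, not the entries of Table 3.

```python
from itertools import combinations

# Assumed factor content of each experiment; columns are (ODS, dGHG, iGHG).
DESIGN = {
    "historical":   (1, 1, 1),
    "hist-GHG":     (0, 1, 0),
    "hist-1950HC":  (0, 1, 1),
    "hist-stratO3": (1, 0, 1),
}

def det3(a):
    """Determinant of a 3x3 matrix."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def solve_subset(names, response):
    """Solve three equations by Cramer's rule; None if the subset is singular."""
    a = [list(DESIGN[n]) for n in names]
    b = [response[n] for n in names]
    d = det3(a)
    if d == 0:
        return None
    solution = []
    for col in range(3):
        replaced = [row[:] for row in a]
        for r in range(3):
            replaced[r][col] = b[r]
        solution.append(det3(replaced) / d)
    return tuple(solution)

# Hypothetical summertime SAM responses in hPa.
response = {"historical": 4.0, "hist-GHG": 2.0, "hist-1950HC": 0.5, "hist-stratO3": 3.0}

solutions = {names: solve_subset(names, response) for names in combinations(DESIGN, 3)}
for names, sol in solutions.items():
    print(names, sol)
```

Under this assumed assignment, the subset without hist-1950HC is singular: it can only be satisfied if the hist-GHG and hist-stratO3 responses add up to the historical one, and it says nothing about how dGHG and iGHG offset each other. The other three subsets each give one (ODS, dGHG, iGHG) triple, and the spread among them gives a range.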
The three-factor qualitative model better explains the findings than the two-factor approach. In summer ozone depletion dominates the other influences in the CMIP6 ensemble. Here, the GHG influence is negligible because of a cancellation of two nonnegligible terms, dGHG and iGHG, which each are about a third to half the size of the ODS influence. In winter, the ODS influence is small, and there is some offset of dGHG by iGHG. These findings are corroborated by my upper-level analysis (Figure 8) which indicates for the hist-1950HC experiment essentially zero trends in the SAM index during the ozone-depletion season but persistent positive trends in the index for the hist-GHG experiment. The results corroborate Morgenstern et al. (2014), however now using a multi-model analysis.
The analysis presented here fundamentally relies on the presence of a subset of models with interactive ozone chemistry having completed the hist-1950HC experiment. In the hypothetical absence of this experiment, the "contradiction" mentioned above would be reduced to some nonlinearity between the historical, hist-GHG and hist-stratO3 experiments (i.e., the summertime strengthening in hist-GHG and hist-stratO3 not adding up to that in the historical experiment; Table 3), which however would not be a clear indication of a missing factor in the attribution analysis.
I also find some statistically robust differences in behavior between the 6-model chemistry and the 23-model no-chemistry ensembles. Corroborating previous literature on the differences between chemistry and no-chemistry models (Haase et al., 2020; Haase & Matthes, 2019), I find stronger influences of ozone depletion offset by weaker influences of increasing GHGs in the chemistry group, in most seasons. A comparison of the chemistry models with their no-chemistry equivalents suggests that indeed the SAM strengthening in the stratosphere is mostly weaker in the no-chemistry counterparts. This behavior has previously been attributed to a misalignment of the ozone hole with the dynamical polar vortex in no-chemistry models, causing a systematically weaker polar vortex and a weaker influence of ozone depletion also on the tropospheric SAM (Haase & Matthes, 2019).
In summary, my study illustrates that using only no-chemistry models for attribution of trends in the SAM carries the risk of erroneous conclusions. CMIP6 marks the first time that a sizeable set of fully coupled climate models, with and without ozone chemistry, has been applied to a range of sensitivity scenarios required to identify and quantify the factors that drive the SAM. The results of my analysis align well with but also advance on previous analyses of the SAM.

Appendix A: The Monte-Carlo Based Comparison of Chemistry and No-Chemistry Models
The method used to compute the data in Figure 6 is laid out in more detail here.
• For all models i, I form the single-model ensemble-means of the SAM index.
• The ensemble-mean SAM indices are decomposed into their regression components $S_0^i$ to $S_2^i$, and the variance $\epsilon^2$ is decomposed into its components $E_0^i$ to $E_2^i$ following Equations 1 and 2.
• Based on these regression functions, for every model 1000 synthetic random realizations of the SAM index $z_{ij}(m, y)$ are produced. ($j$ stands for one of the 1000 random realizations, and $m$ and $y$ are the month and year as before.) This process is illustrated in Figure A1 for the HadGEM3-GC31-LL model. Note that the spread in Figure A1 reduces with increasing ensemble size.
• [...] realizations for those 6-model means. From these, I derive their mean across the 1000 realizations and their 2.5, 16, 84, and 97.5 percentiles, marking the 68% and 95% confidence intervals as displayed in Figure 6.
• For the no-chemistry subset of models, I form all 23!/(6! ⋅ 17!) = 100,947 distinct six-model subsets. For every one of the subsets I follow the same process as above for the chemistry models, resulting in 100,947,000 six-model mean realizations of the regression coefficients for the no-chemistry group. I again reduce these to their means and 2.5, 16, 84, and 97.5 percentiles as above.
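The generation of synthetic realizations can be sketched as below for one season of one model. The noise model is an assumption on my part: each realization is taken to be the fitted mean plus independent Gaussian noise whose variance follows the fitted variance function. The coefficient values and regressor shapes are hypothetical.

```python
import random
from math import sqrt

def synthetic(s, e, co2eq, cleq, rng):
    """One synthetic SAM series: fitted mean plus Gaussian noise with fitted variance."""
    series = []
    for g, c in zip(co2eq, cleq):
        mean = s[0] + s[1] * g + s[2] * c
        variance = max(0.0, e[0] + e[1] * g + e[2] * c)  # guard against negative fits
        series.append(rng.gauss(mean, sqrt(variance)))
    return series

years = list(range(1850, 2015))
# Stand-in regressors, both 0 in 1850 and 1 in 2014 (shapes are assumptions).
co2eq = [((y - 1850) / 164) ** 2 for y in years]
cleq = [max(0.0, (y - 1950) / 64) for y in years]

s_coeffs = (25.0, 0.5, 3.0)  # hypothetical S0, S1, S2 in hPa
e_coeffs = (4.0, 0.0, 0.0)   # hypothetical E0, E1, E2 in hPa squared

rng = random.Random(0)       # fixed seed so the sketch is reproducible
realizations = [synthetic(s_coeffs, e_coeffs, co2eq, cleq, rng) for _ in range(1000)]

# Spread of the synthetic index in the final year (2014).
final = sorted(r[-1] for r in realizations)
p2_5, p16, p84, p97_5 = final[25], final[160], final[840], final[975]
mean_final = sum(final) / len(final)
print(round(mean_final, 2), round(p16, 2), round(p84, 2))
```

In the full procedure each synthetic series would be passed through the regression again, so that the percentiles describe the regression coefficients and not the index itself.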
For the Monte-Carlo analysis of SAM indices to be valid, I require an autocorrelation length of less than 1 year for the residual ϵ. This would ensure that the uncorrelated random noise produced by the random-number generator is a good reflection of what is produced by the models. I have checked that for all individual models this is indeed the case. I furthermore require ϵ to be normally distributed. Following Morgenstern et al. (2014), Figure A2 indicates that the assumption of a normal distribution of the residual ϵ is good but not perfect. Relative to a perfect Gaussian distribution, large deviations from the mean in both directions (where G < 0.15 or G > 0.85) are slightly overrepresented and small deviations (where 0.2 < G < 0.8) are underrepresented, across practically all models. Morgenstern et al. (2014) found similar deviations from a normal distribution in their analysis.
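A simple way to check the first requirement is to compute the lag-one-year autocorrelation of the residual for a given season and confirm that it is close to zero. The sketch below uses synthetic stand-in series, not model residuals, and the helper name is hypothetical.

```python
import random

def autocorr(x, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[k] - mean) * (x[k + lag] - mean) for k in range(n - lag))
    return cov / var

rng = random.Random(1)
white = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # uncorrelated stand-in residual
ramp = [float(k) for k in range(1000)]              # strongly autocorrelated counterexample

r_white = autocorr(white, 1)
r_ramp = autocorr(ramp, 1)
print(round(r_white, 3), round(r_ramp, 3))
```

A residual that behaves like the first series is compatible with drawing independent random numbers year by year. One that behaves like the second would call for a noise model with memory.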