The vertical profile of recent tropical temperature trends: Persistent model biases in the context of internal variability

Tropospheric and stratospheric tropical temperature trends in recent decades have been notoriously hard to simulate using climate models, particularly in the upper troposphere. Aside from the warming trend itself, this has broader implications, e.g. atmospheric circulation trends depend on latitudinal temperature gradients. In this study, tropical temperature trends in the CMIP6 models are examined, from 1979 to 2014, and contrasted with trends from the RICH/RAOBCORE radiosondes, and the ERA5/5.1 reanalysis. As in earlier studies, we find considerable warming biases in the CMIP6 modeled trends, and we show that these biases are linked to biases in surface temperature. We also uncover previously undocumented biases in the lower-middle stratosphere: the CMIP6 models appear unable to capture the time evolution of stratospheric cooling, which is non-monotonic owing to the Montreal Protocol. Finally, using models with large ensembles, we show that their standard deviation in tropospheric temperature trends, which is due to internal variability alone, explains ∼ 50% (± 20%) of that from the CMIP6 models.


Introduction
Since the pioneering work of Manabe and Wetherald (1975) climate models have consistently shown greater warming in the tropical upper troposphere than near the surface in response to increased CO 2 concentrations. This robust differential warming is understood to result from convection which, at low latitudes, tends to adjust the temperature profile to a moist adiabat Stouffer 1980, Santer et al 2005). In this context, the first paper to analyze atmospheric temperature trends inferred from satellitebased microwave sounders (Spencer and Christy 1990) came as a great surprise, as it reported a lack of warming in the free troposphere over the decade 1979-1988, questioning the reliability of climate models and radiosonde observations. That study generated a great deal of controversy, giving rise to dozens of papers, and two expert panel reports. The reader is referred to Thorne et al (2011) for the latest exhaustive, if not completely updated, review.
In brief, soon after that controversial paper it became clear that both satellite and radiosonde derived temperature trends suffered from considerable biases see, e.g. Karl et al (2006). A large effort, therefore, has gone into producing 'homogenized' data sets, from which instrumental artifacts are carefully and methodically removed. Nonetheless, much uncertainty remains as to the vertical structure of the observed temperature trends in the free-atmosphere since 1979, notably in the tropics. A more complete discussion can be found in the relevant section of the Fourth and Fifth Assessment Reports of the Intergovernmental Panel on Climate Change (IPCC, see Hegerl et al 2007, Hartmann et al 2013.
In tandem with the effort to put the observed trends on more solid grounds, climate models have greatly evolved since the early IPCC assessment reports. In the last two decades, most state-of-theart climate models discretize the atmosphere with dozens of vertical levels, have an accurate representation of the stratosphere, and are coupled to dynamic ocean, sea ice, and other components. In spite of these improvements, however, substantial discrepancies remain-between models and observationsin the vertical structure of atmospheric temperature trends in the tropics. For models participating in Phases 3 and 5 of the Coupled Model Intercomparison Project (CMIP3 and CMIP5), these discrepancies have been reported in numerous papers (see, e.g. Fu et al 2011, Po-Chedley and Fu 2012, Santer et al 2013, Santer et al 2017. In particular, it is worth recalling the findings of Mitchell et al (2013), hereafter referred to as M13. While reporting a considerable discrepancy between radiosonde and modeled trends over the period , that study highlighted the fact that an important source of the discrepancy rested in the modeled surface warming, which was larger than the observed one. M13 showed that the discrepancy between models and observations is greatly reduced in the atmosphere-only CMIP5 model simulations, in which surface temperatures are prescribed from observations.
Building on M13, the goal of this paper is to analyze the recently completed simulations performed under Phase 6 of the Coupled Model Intercomparison Project (CMIP6, Eyring et al 2016), and to explore whether the tropical temperature trends in these models are closer to observations than those of the CMIP5 models. We also address two novel aspects of the problem. First, mindful that the trends in atmospheric concentrations of many ozone depleting substances has peaked shortly before the turn of the century (as a consequence of the Montreal Protocol), we separately compute trends before and after the year 1998, seeking to document the role of ozone depletion on atmospheric temperature trends. Second, in the spirit of Hawkins and Sutton (2009), we take advantage of large ensembles of individual CMIP6 model simulations (as opposed to a single run from each model), and seek to document what fraction of the large spread across the CMIP6 models can be attributed to internal atmospheric variability, as opposed to inter-model differences.

Methods
To report the observed atmospheric temperature trends, we make use of three different data sets: two radiosonde data sets, the Radiosonde Innovation Composite Homogenization (RICH, v1.7) and the RAdiosone OBservation COrrection using REanalyses (RAOBCORE, v1.7) products (Haimberger 2007, Haimberger et al 2012, and one reanalysis data set, ERA5 (Hersbach et al 2020). Note that ERA5 assimilates the radiosonde data used here, as well as many other data sources. For simplicity, throughout this manuscript we will refer to these three data sets, collectively, as 'the observations' , even though we are well aware that ERA5 is a reanalysis, with observations assimilated into an underlying model.
The difference between the two radiosonde data sets resides in the procedures used for the homogenization; these are fully detailed in Haimberger et al (2012). Both radiosonde data sets have been updated to cover the period 1979-2019, at a resolution of 10 • × 10 • in horizontal directions, and 13 levels extending from 850 hPa to 10 hPa in the vertical direction. While temperature data are available at monthly resolution, we here construct annual averages, with the proviso that if more than 3 months of data are missing at a grid point in a given year, we count the entire year as missing. We note that both radiosonde data sets have the same resolution, and the same missing data. Figure S1 (stacks.iop.org/ERL/15/1040b4/mmedia) shows the coverage of available data at three of the pressure levels that we focus on in this study.
The ERA5 data set is a high resolution reanalysis produced by the European Centre for Medium-Range Weather Forecasts (Hersbach et al 2020). Its horizontal resolution is 0.28 • (in both latitude and longitude), with data available on 137 pressure levels from the surface to 0.01 hPa. Since ERA5 is at higher spatial resolution than the radiosonde data, we regrid it to the same resolution as the radiosondes using bilinear interpolation, and apply the same missing data mask used for the radiosondes. ERA5 data is available over the period 1979-2018; however, in this study, we substitute the years 2000-2006 with an updated product, ERA5.1. This is necessary as an error was identified in the original ERA5 lower stratospheric temperatures, due to an incorrect specification of the error covariance matrix in the assimilation scheme (Simmons et al 2020).
The primary model data used in this study consists of the historical simulations performed under CMIP6, which extend from 1979-2014. As this period is common amongst all the data sources, we use it for the bulk of our analysis. CMIP6 represents the current state-of-the-art in climate modeling, so most of the participating modeling groups have provided output from fully comprehensive earth-system models. It is important to stress that the historical simulations analyzed here were performed under identical greenhouse gas (GHG), aerosol, and natural forcings (Eyring et al 2016). As for ozone, some models use prescribed concentrations (Checa-Garcia et al 2018), while others include interactive chemistry schemes. As in the case of the ERA5 data, we have regridded the model output to the lower resolution grid of the radiosonde data sets, and applied the same missing data mask. A few CMIP6 models have missing data in the lower atmosphere as they do not interpolate below the ground level which, in some mountainous regions, is higher than the lower atmospheric pressure levels. Table 1. The CMIP6 models analyzed in this study. C indicates models with a fully-coupled dynamic ocean; P indicates atmosphere only models with prescribed SST; C/P models for which both simulations are available. For the single forcing simulations, GHG refers to greenhouse gas only forcings; AER refers to aerosol only forcings, and NAT refers to natural only forcings (see Gillett et al 2016, for details).

Large Ensemble Model
Ocean Type Single Forcings Size At the time of this writing, output from 46 models is available for the CMIP6 historical simulations, as listed in table 1 (with ocean type 'C'). Unless otherwise specified, we take only the first ensemble member of each model as we use individual members as opposed to ensemble means for a like-with-like comparison with observations, and want to ensure equal weighting across the set of models. In addition to the atmosphere-ocean coupled simulations, we also make use of the atmosphere-only version of the historical CMIP6 simulations (see table 1), which are forced with observed sea surface temperatures (SSTs). To put the CMIP6 models in the context of earlier intercomparisons, we also show results for the CMIP5 models.
To quantify the relative importance of the major forcings, we also make use of the model output produced by the Detection and Attribution Model Inter-comparison Project (DAMIP, Gillett et al 2016).
At this time a total of 7 models (listed in table 1) have made available the single-forcing simulations that we analyze here. Specifically, these are the historical 'GHG-only' simulations, forced only with wellmixed greenhouse gases, the 'aerosol-only' simulations, forced only with aerosols (BC, OC, SO 2 , SO 4 , NO x , NH 3 , CO, NMVOC), and the 'natural-only' simulations, forced only with solar irradiance changes and volcanic aerosols.
Finally, in order to quantify the contribution of internal variability to the spread across the CMIP6 models, we also analyse several 'large ensembles' that were performed as part of the CMIP6 historical experiments. We define a large ensemble as having 20 or more members: this allows us to analyze six different large ensembles (see table 1), ranging in size from 20 (GISS-E2-1-H) to 50 members (CanESM5). Large ensembles are also available for models other than those analyzed here (Deser et al 2020), but we have chosen to focus on the models that participated in CMIP6 to ensure all models forcings are the same in this study.

Analysis
In light of the most recent advances in Earth-system modeling and of the improved observational data sets available, we begin by updating the result of M13, and present the vertical profile of zonal mean, annual mean temperature trend from 1979 to 2014. As shown in figure 1(a), the overall trends consist of a cooling of the stratosphere and a warming of the troposphere, in both models and observations. This pattern is the well-known vertical 'fingerprint' of anthropogenic forcings, originally reported by Tett et al (1996) and Santer et al (1996). In the stratosphere, the coupled CMIP6 models (red bars) show cooling trends comparable to the observed ones (black lines). In the troposphere, however, the models show considerably larger warming than in the observations.
The warm trends bias in the models is seen throughout the entire troposphere, but is greatest in the upper troposphere (peaking around 200 hPa), where the modeled trends are-on average -4 to 5 times greater than the observations. We draw attention to the CanESM5 model: it simulates the greatest warming in the troposphere, roughly 7 times larger than the observed trends. We note this model is known to have a high climate sensitivity compared to others (Swart et al 2019, Forster et al 2019. Throughout the depth of the troposphere, not a single model realization overlaps all the observational estimates. However, there is some overlap between the RICH observations and the lowermost modelled trend, which corresponds to the NorCPM1 model. In M13, this considerable bias was attributed to the inability of the models to capture the observed sea surface temperature trends. The same applies to the CMIP6 models, as demonstrated by fact that when the models are forced with prescribed SSTs (blue bars in figure 1(a)) their trends are much closer to the observed values. Nonetheless, one can still see a systematic bias at most tropospheric levels. Between 200 and 100 hPa the differences between the CMIP and AMIP simulations are even more visible. It is important to note that ERA5/5.1 is warmer than the radiosondes in that region; this is likely due to the assimilation of radio occultation data, which shows more warming in the upper troposphere than the radiosondes. As such, the discrepancy in this region may be smaller than reported in previous studies, and very possibly due to observational uncertainty, rather than model biases. A comparison with trends that extend to 2019 is given in figure S2, with no change in these conclusions. Now, turning our attention to lower stratospheric trends (100-20 hPa), one may be tempted to conclude-from figure 1(a)-that modeled and observed trends are in good agreement. The story, however, is more complex, and requires a more nuanced analysis.
Recall that, unlike carbon dioxide which has been monotonically increasing since the pre-industrial era, ozone depleting substances, an important and often neglected anthropogenic forcing, exhibit a highlynon-linear evolution from 1979 to 2014: the usefulness of a single linear trend covering the entire period, therefore, is questionable. The non-linearity is due to the signing of the Montreal Protocol in 1989: as a consequence of that treaty, the atmospheric concentrations of many ozone depleting substances are no longer increasing. In fact, the trend in 'effective equivalent stratospheric chlorine' (EESC, a commonly used metric for the combined concentration of ozone depleting substances) has changed from positive to negative around the very end of the 20th century.
In view of this, following the latest Scientific Assessment of Ozone Depletion (WMO 2018), we split the 1979-2014 period into two parts: the ozone depletion period 1979-1997 (during which EESC was increasing) and the ozone recovery period 1998-2014 (during which EESC was in decline). Separate temperature trends for these two periods are shown in figures 1(b) and (c), respectively. It must be emphasized that these two periods are relatively short (less than two decades): hence much caution is called for in any analysis and interpretation.
Let us start by considering the observations. It is clear that the stratospheric cooling trends in the ozone-depletion period are greatly reduced in the ozone-recovery period. The RICH radiosonde data, in fact, indicates that stratospheric cooling trends have disappeared after 1998, although RAOBCORE and ERA5/5.1 still show a modest cooling. Results over this period are in agreement with the satellite and RAOBCORE1.7 radiosondes, and ERA5/5.1 reanalysis. The red box-and-whisker bars show trends for ocean-atmosphere coupled CMIP6 models (48 in total); the blue bars shows trends for CMIP6 models with prescribed sea surface temperature (28 in total); the red bars are plotted at the correct altitude, but the blue bars are slightly offset downwards to aid comparison; each box shows the lower-to-upper quartile of the modeled trends, and the whiskers show the full range of data up to 1.5 times the inter-quartile range away from the mean, in which case the points beyond are represented by coloured crosses. The model data and ERA5/5.1 data are masked with the same observational mask from RICH, including the variation in time and pressure of the mask. Monthly data are averaged to annual data; if more than 3 months of data are missing in any grid box in a given year, all months for that year are set to missing. Panel a.), b.) and c.) show trends from [1979][1980][1981][1982][1983][1984][1985][1986][1987][1988][1989][1990][1991][1992][1993][1994][1995][1996][1997]   observations, which show a completely flat temperature time series after 1996 in the lower stratospheric channel (the so-called MSU Channel 4), as reported in Mitchell (2016), Seidel et al (2016). It has been proposed that the near disappearance of cooling trends in the lower stratosphere is a simple consequence of the fact that ozone depletion is no longer occurring (see, e.g. figure 3.21 of WMO 2018). Some studies have also pointed to a role for SSTs in recent tropical lower stratospheric temperature trends (Shangguan et al 2019), e.g, although modelling results indicate that this effect is small (Polvani and Solomon 2012).
Turning now to the modeled trends, figures 1(b) and (c) reveal a considerable discrepancy with the observations. In the stratosphere, the majority of CMIP6 models cool too little in the ozone depletion period when compared with RICH and ERA5, although there is good agreement with RAOBCORE. For the ozone recovery period the models all cool too much, with the inter-quartile range not encompassing any observational product, and the total range not encompassing RICH at all. We suspect that these temperature biases might be due to a poor representation of stratospheric ozone forcing in the CMIP6 models. To this date, the methodology used to construct the ozone forcing for CMIP6 remains undocumented in the peer-reviewed literature, although Checa-Garcia et al (2018) have shown considerable uncertainty in the radiative forcing associated with ozone in the CMIP6 models. We have also not explored whether biases in stratospheric temperature trends are smaller for CMIP6 models with interactive ozone chemistry. It is also possible that the CMIP6 biases in stratospheric temperature trends stem from other sources, e.g. circulation changes that are inaccurately simulated in models, e.g. Garfinkel et al (2013). We note, however, that models with a realistic simulation of stratospheric ozone, and a good vertical resolution in the stratosphere, are perfectly capable of reproducing the observed stratospheric trends between 100 and 20 hPa over both periods separately (see, e.g. figure A3 of Randel et al 2017).
It is also instructive to contrast the tropospheric temperature trends in the ozone-depletion and ozone recovery period. Forster et al (2007) -on the basis of a purely radiative calculation with a fixed dynamical heating assumption-suggested that ozone depletion in the tropical stratosphere may lead to cooling in the tropical upper troposphere, due to a reduction in downwelling longwave radiation from the ozone above. However, using an atmospheric general circulation model with prescribed ozone concentrations, Polvani and Solomon (2012) showed that effects of stratospheric ozone depletion on tropical temperature trends do not extend much below the 100 hPa level. Given the observational uncertainties, it is difficult to discern a significant difference between figures 1(b) and (c) in the observed tropospheric trends.
As for the modeled tropospheric trends, however, the discrepancy with observations is much larger after 1998. We suspect that one cause of this discrepancy is related to the fact that the 1998-2014 period corresponds to the occurrence of the so called 'global warming hiatus' (see Fyfe et al 2016, for a recent update of this debate). If the hiatus is indeed related to an increased heat uptake by the oceans, as suggested by some studies (Meehl et al 2011, Meehl et al 2014, it cannot be considered an externally forced process: therefore, one would not expect it to be captured in the models over the same time period. Another contributing factor is that 1997/1998 had one of the largest El Niño events on record, which, given the short period the trend is calculated over, becomes important. Indeed, if the analysis is repeated for the 1999-2014 period (i.e. missing the large El Niño year), the tropospheric observational trends are higher, and in better agreement with the coupled model estimates (figure S3). Be that as it may, we note here for the record that from 1998 to 2014, the CMIP5 models warm, on average 4 to 5 times faster than the observations, and in one model the warming is 10 times larger than the observations.
To better quantify the relationship between the near surface and the upper tropospheric biases, which was already noted in M13, we illustrate their correlation in figure 2. For the CMIP6 models (red dots) the upper tropospheric (200 hPa) bias is very highly correlated with the near surface (850 hPa) bias, over 1979-2014: the Spearman correlation coefficient is r = 0.95. A similar number, r = 0.91, is calculated for the older CMIP5 models (blue), and the multimodel means are very close too. This indicates that there has not been any substantial improvement, in terms of tropical tropospheric temperature trends, between CMIP5 and CMIP6.
Next, we examine the source of the large spread in tropical temperature trends across the CMIP6 models. In particular, we examine separately the forced response and the internal variability. Starting with the former, the impacts of the different forcings on tropical atmospheric temperature trends is studied by analyzing the single-forced experiments that have been carried out under the Detection and Attribution Model Inter-comparison Project (DAMIP, Gillett et al 2016). Specifically, we make use of three separate experiments: the GHG-only simulations, the aerosolonly simulations, and the natural-only simulations (for more details, see section 2 of this paper, or Gillett et al 2016).
In figure 3 we illustrate how single forcings contribute to the total modeled trends, in both the stratosphere (30 hPa, top panel) and upper troposphere (200 hPa, bottom panel). By definition, the total trend (gray bars) is equal to 100%. Recall that, owing to data availability, only a subset of the CMIP6 models in figure 1 are used for this analysis (see table 1). The stratospheric cooling trend is dominated by the GHG  table 1). The bars represent the multi-model mean contribution to the trend, normalized by the total trend in the historical (All forcing) simulations; the grey bars are equal to 100%, by definition. The horizontal black lines show the individual model spread of the ensemble means, again, normalized by the gray bar. Positive/negative values represent warming/cooling. forcings, but also with a sizable component coming from natural forcings, most likely a cooling trend from volcanic emissions. Stratospheric ozone is prescribed to a pre-industrial climatology in these single forcing simulations, so cooling from ozone depletion is only present in the all-forcing (i.e. historical) simulations, and cannot be separately estimated using these specific DAMIP simulations.
In the upper troposphere (figure 3, bottom panel) GHGs are the overwhelming driver of temperature trends, with negligible contributions from aerosols and natural forcings. For the aerosol forcing, we note that only one model (MRI-ESM2-0) shows warming, and this warming trend skews the results considerably, providing a large positive error bar. Without that one model, the aerosol cooling is more substantial at this height. Needless to say, the ensemble size (7) is relatively small, and we hope more models will soon become available. Also, we note that these singleforcing simulations are not expected to sum to 100%, i.e. the sum of the green, yellow and blue bars to equal the gray bar, because 1) other forcings may be important, e.g. tropospheric and stratospheric ozone, 2) there may exist some non-linear interactions between different forcings, and 3) one cannot precisely estimate the forced signal with these DAMIP runs since a large ensemble of single-forcings simulations for each model is not available (the ensemble means would be estimates of the forced signal in each model). So, the results in figure 3 are contaminated by internal variability.
However, internal variability can be estimated by exploiting the fact that six CMIP6 models have made available large ensembles of historical integrations Two theoretical lines are also plotted in each panel in figure 4: the dry and moist adiabatic lapse rates (DALR and MALR, respectively), plotted as black lines. The MALR is computed using the following approximation (taken from Bakhshaii and Stull 2013) where c pd is the specific heat capacity for dry air at constant pressure, R d is the gas constant for dry air, L v is the latent heat of vaporisation, r s is the saturation mixing ratio, and ϵ = R d /R v is the ratio of ideal gas constants for dry air and water vapor. The MALR profiles are calculated by integrating this equation vertically, starting at 850 hPa, and using T(850 hPa) = 291 K, which we take from the ERA5 reanalysis. The DALR is obtained from the same formula, setting r s = 0, which then reduces to more common dT/dz = g/c pd .
Several interesting points should be noted in figure 4. First, there is a very strong correlation between the near surface and upper tropospheric trends, in all seven of the sets of models/ensembles: this confirms that the spread in upper tropospheric warming trends, in all cases, can be traced back to the spread in surface temperature. Second, the regression curves are close but not coincident with the theoretical moist adiabatic line: this indicates that the popular idea that tropical temperature profiles follow moist adiabats may not be quantitatively correct at these levels, at least not for the temporal-spatial-averaged sea surface temperature response considered here. For instance, Flannaghan et al (2014) show that tropical temperature trends only follow a moist adiabat once you appropriately weight the near surface temperature trends toward regions of deep convection, since it is the deep convecting regions that ultimately influence the upper troposphere. Third, contrasting the red dots and blue circles one gets the distinct impression that the spread across each large ensembles is comparable to the spread across the entire CMIP6 set. This is especially clear for CanESM5, which provided 50 distinct runs of the same model (panel a), and suggests that a large fraction of the CMIP6 spread may actually come from internal variability. Finally, we note that the means of the large ensembles (blue plus Figure 5. A comparison of the spread in tropical temperature trends for the CMIP6 models (red) and individual large ensembles (blue) at two different pressure levels, 850 hPa and 200 hPa. Each shade of blue represents a different large ensemble. From light-blue to dark-blue they are: CanESM5, CNRM-CM6-1, GISS-E2-1-G, GISS-E2-1-H, IPSL-CM6A-LR and NorCPM1 (see table 1). The large ensembles have been scaled so as to be centered on the CMIP6 ensemble profile, see text for details. The observations at these two levels are marked by a black cross (RICH), plus (RAOBCORE) and circle (ERA5/5.1). Note that at 850 hPa RICH and RAOBCORE overlap. The box-and-whiskers display the same statistics as in figure 1. symbols), which represent the forced trends in each model, can be found at both ends of the CMIP6 range (red dots): contrast, for instance, panels a and f. This indicates that the spread in forced trends across the models can be almost as large as the range spanned by the CMIP6 models.
In order to more clearly illustrate this point, i.e. to quantitatively compare the spread of the entire vertical profile of temperature trends across both the CMIP6 ensemble and the large ensembles, some thought is required. As seen in figure 4, the large ensembles have an average surface warming which is often different from the CMIP6 set. This implies that the lapse rates are higher for the large ensembles with higher surface warming than CMIP6, notably CanESM5, and lower for the large ensembles with lower surface warming (e.g. NorCPM1). Thus, some rescaling is needed for a meaningful comparison. Exploiting the tropospheric lapse rates across the large ensembles (the blue dashed lines in figure 4 panels a-f), we construct the relationship where T t (n, p) is the temperature trend at level p for large ensemble member n, and the values of α(p) and c(p) are derived from linear regression across ensemble members at each level p. Above 200 hPa the regression is not strong, so we do not apply this trend scaling beyond that level. Now, to quantitatively compare the spread in trends across the large ensembles and across the CMIP6 ensemble, we scale the individual large ensemble members so that their ensemble mean, at 850 hPa, is the same as the CMIP6 ensemble.
The scaled trends T ′ t (n, p) are defined by the expression: where O = ⟨T t (n, 850)⟩ Large−ensemble − ⟨T t (n, 850)⟩ CMIP6 is the difference between the ensemble means at 850 hPa. Figure 5 shows the scaled spread, as per equation (3), in the CMIP6 models (red) and each of the large ensembles (blue) for two different pressure surfaces. To be clear: the red boxes here are identical to those in figure 1(a) at 850 and 200 hPa. The mean trend for each large ensemble at 850 hPa is, by construction, identical to the mean of the CMIP6 ensemble. The standard deviation of the scaled CanESM5 ensemble (lightest blue) encom-passes~70% of the CMIP6 range, whereas for CNRM-CM6-1 (second lightest blue) it only encom-passes~30%. All the other large ensembles are found somewhere between these extremes and, on average, the large ensembles explain~50% of the total CMIP6 variability. Note that the number of models (or ensemble members) in each spread is different. To test if this matters we repeat our analysis with only 20 samples for each of the datasets (the lowest common denominator), but our results remain similar, and so we conclude there is little sensitivity to sample sizes greater than 20. Given this result, the clear indication here is that internal variability may be responsible for around 50% of the CMIP6 standard deviation, at least for trends over intervals spanning 3-4 decades (in our case, the trends are 35 years long). Finally, we note that while the large ensemble spreads are approximately Gaussian, the spread in CMIP6 models has a heavy upper tail, in line with the skewed range of climate sensitivities within this ensemble (Forster et al 2019).

Conclusions
We have compared the modeled and observed tropical temperature trends, over the period 1979-2014, from 850 hPa to the mid-stratosphere. Focusing on the CMIP6 models, we have confirmed the original findings of Mitchell et al (2013): first, the modeled tropospheric trends are biased warm throughout the troposphere (and notably in the upper troposphere, around 200 hPa) and, second, that these biases can be linked to biases in surface warming. As such, we see no improvement between the CMIP5 and the CMIP6 models.
Finally, analyzing six CMIP6 models which provided relatively large ensembles (from 20 to 50 members), we have been able to quantify the fraction of the CMIP6 model spread due to internal variability, as opposed to model differences. We find that the standard deviation of the large ensembles, which is due to internal variability alone, is 30-70% of that of CMIP6 models for the period 1979-2014, with a central estimate of 50%. This result highlights the importance of using large ensembles when evaluating trend differences across the CMIP6 models.