A Limited Role for Unforced Internal Variability in Twentieth-Century Warming

The early twentieth-century warming (EW; 1910–45) and the mid-twentieth-century cooling (MC; 1950– 80) have been linked to both internal variability of the climate system and changes in external radiative forcing. The degree to which either of the two factors contributed to EW and MC, or both, is still debated. Using a two-box impulse response model, we demonstrate that multidecadal ocean variability was unlikely to be the driver of observed changes in global mean surface temperature (GMST) after AD 1850. Instead, virtually all (97%–98%) of the global low-frequency variability ( . 30 years) can be explained by external forcing. We ﬁnd similarly high percentages of explained variance for interhemispheric and land–ocean temperature evolution. Three key aspects are identiﬁed that underpin the conclusion of this new study: inhomogeneous anthropogenic aerosol forcing (AER), biases in the instrumental sea surface temperature (SST) datasets, and inadequate representation of the response to varying forcing factors. Once the spatially heterogeneous nature of AER is accounted for, the MC period is reconcilable with external drivers. SST biases and imprecise forcing responses explain the putative disagreement between models and observations during the EW period. As a consequence, Atlantic multidecadal variability (AMV) is found to be primarily controlled by external forcing too. Future attribution studies should account for these important factors when discriminating between externally forced and internally generated inﬂuences on climate. We argue that AMV must not be used as a regressor and suggest a revised AMV index instead [the North Atlantic Variability Index (NAVI)]. Our associated best estimate for the transient climate response (TCR) is 1.57 K ( 6 0.70 at the 5%–95% conﬁdence level).


Introduction
The global temperature evolution over the instrumental period is conventionally attributed to the combination of external forcing and internal variability Denotes content that is immediately available upon publication as open access.
Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-18-0555.s1. (Stott et al. 2000;Bindoff et al. 2013;Flato et al. 2013). Virtually all of the warming since 1950 is attributed to human influences (Stocker et al. 2013;Jones et al. 2013Jones et al. , 2016Ribes et al. 2017). Yet due to the loosely constrained nature of magnitude and evolution of AER, there continues to be a fierce debate about the cause of multidecadal global mean surface temperature (GMST) fluctuations present in the instrumental record (Shiogama et al. 2006;Booth et al. 2012;; Thompson et al. 2015). Most prominently, the origin of the early twentieth-century warming (EW;1910-45) and the midtwentieth-century cooling (MC) periods, thought to be linked with North Atlantic (NA) ocean variability and commonly expressed in terms of Atlantic multidecadal variability (AMV) (Delworth and Mann 2000;Knight et al. 2005Knight et al. , 2006, is still hotly contested because of the difficulties of disentangling the contributions from internal and external drivers at different time scales (Brönnimann 2009;Mann et al. 2014;Zhang et al. 2016;Clement et al. 2016;Vecchi et al. 2017;Sutton et al. 2018;Hegerl et al. 2018).
Conventionally, the AMV has predominantly been attributed to internal ocean variability, which in turn has been linked to changes in the Atlantic meridional overturning circulation (AMOC) as a deep ocean driving mechanism on multidecadal time scales (Zhang and Wang 2013;Yeager and Robson 2017). While stochastic atmospheric flux forcing is thought to influence SSTs on shorter time scales (Roberts et al. 2013;Duchez et al. 2016;Josey et al. 2019), associated with changes in the North Atlantic Oscillation (NAO) index (Hurrell and Deser 2009), the prevailing view regarding NA SST changes on longer time scales is that large internal variations are superimposed on the anthropogenic warming trend. However, in recent years, external forcing has been shown to contribute to multidecadal swings in the AMV region (Otterå et al. 2010;Murphy et al. 2017;Bellucci et al. 2017), suggesting a reduced role for internal ocean dynamics. Changes in aerosol forcing (AER) (Booth et al. 2012;Bellomo et al. 2018) as well as periods of strong volcanic activity (Iwi et al. 2012;Knudsen et al. 2014;Pausata et al. 2015;Swingedouw et al. 2017) have been linked to these changes. Also, it has been demonstrated that an AMV-like SST pattern can be reproduced in slab-ocean experiments (Clement et al. , 2016Bellomo et al. 2018). Hence internally generated low-frequency GMST variations are increasingly thought to play only a smaller role, with Pacific Ocean variability to be more recognized as a pacemaker for global temperature (Schurer et al. 2015;Dong and McPhaden 2017).
While there is no debate about the existence of aerosol-related dimming and brightening (Wild et al. 2007;Wild 2009) due to a huge array of supporting data from observations (Boers et al. 2017;Dumitrescu et al. 2017;Manara et al. 2017) and modeling (Shindell et al. 2013;Wilcox et al. 2013;Rotstayn et al. 2015;Dallafior et al. 2016;Chung and Soden 2017), its impact on the AMV is less certain. Many studies do not (Huss et al. 2010;Chylek et al. 2014) or insufficiently (Ting et al. 2009;) incorporate or acknowledge AER, which potentially leads to misattribution of cause (Zhang et al. 2016;O'Reilly et al. 2016) and effect (Chylek et al. 2009;Wyatt et al. 2012;Tung and Zhou 2013;Pasini et al. 2017;Levine et al. 2018). Arguments for the presence of an internally generated AMV based on ostensible pseudo-oscillatory behavior in instrumental, proxy, or model data are unconvincing (Singh et al. 2018), and it is noted that such behavior can arise from statistical artifacts alone (Vincze and Jánosi 2011;Cane et al. 2017). Regression-based methods are thereby particularly susceptible to conflating internal variability with forced responses because of strong covariance between the predictors (Mann et al. 2014;Stolpe et al. 2017), yet studies that use the AMV as a regressor or explanatory factor continue to be published despite the lack of an unequivocal physical underpinning (Lewis and Curry 2018;Rypdal 2018;Shen et al. 2017;Zhang et al. 2018;Folland et al. 2018).
We argue that any attribution exercise that does not sufficiently account for the spatiotemporal AER changes will invariably produce unreliable and erroneous results. Incorporating now better quantifiable biases in the instrumental SST record, we demonstrate that a carefully designed analysis (that avoids overfitting) yields a surprisingly high level of agreement between our model and observations without the need to infer additional unexplained internal variability. We endeavor to highlight the pitfalls associated with attributing and identify the shortcomings in representing the externally forced temperature responses.
Since attempts to estimate the magnitude of internal variability by means simple climate models are plagued from dissatisfying low correlations with observations (Aldrin et al. 2012;Skeie et al. 2014), here we use a refined two-box impulse response model framework that accounts for fast and slow responses to forcing perturbations in the climate system. To constrain the complexity of the model, we introduce a novel transient climate response (TCR) adjustment factor for different forcing agents that is governed by robust physical factors. Apart from Northern Hemisphere (NHem) and GMST (Global) data, we also analyze Southern Hemisphere (SHem), land surface air temperature (Land) and SSTs (Ocean), expanding on previous GMST-only analyses (Mann et al. 2014;Dong and McPhaden 2017) to better understand the impact of radiative forcing changes on surface temperatures. We recommend all impulse response or energy balance model studies use land, ocean, and hemispheric temperature records with our dedicated set of model parameters as separate benchmark tests to robustly evaluate model performance.

Radiative forcing and observational data
We use the latest well-mixed greenhouse gas (WMGHG) radiative forcing (Etminan et al. 2016;Meinshausen et al. 2017) and the gridded aerosol community emission dataset (CEDS) (Hoesly et al. 2018), including sulfur dioxide (SO 2 ), ammonia (NH 3 ), black carbon (BC), and organic carbon (OC). For solar forcing, we use sunspot numbers from the Greenwich Royal Observatory (Wilson and Hathaway 2006), scaled to solar forcing according to Dewitte and Nevens (2016). Stratospheric aerosol optical depth (AOD) data from explosive volcanic eruptions (Crowley et al. 2008;Crowley and Unterman 2013) are scaled to match NASA-GISS volcanic forcing data (Sato et al. 1993), and updated to include recent smaller eruptions Solomon et al. 2011;Arfeuille et al. 2014;Schmidt et al. 2018). Figure 1a shows our revised forcing estimates.
The global direct radiative forcing for each aerosol component (SO 2 , NH 3 , BC, and OC) is derived by scaling the current emissions to the Fifth Assessment Report (AR5)-forcing estimate for 2011 (Myhre et al. 2013;Stocker et al. 2013). Using BC emissions over North America, we account for enhanced Arctic warming during the first half of the twentieth century (Johannessen et al. 2004;McConnell et al. 2007;McConnell and Edwards 2008;Suo et al. 2013) (orange shading in Fig. 1b). The indirect forcing of 20.45 W m 22 is mostly a function of SO 2 (90%; 10% for OC). While considerable uncertainty exists regarding aerosol-cloud effects (Carslaw et al. 2013;Regayre et al. 2014;Nazarenko et al. 2017;Lohmann 2017), the best estimate for indirect AER in AR5 has not been fundamentally challenged since. Together with the direct effects, we obtain a total AER of ;20.55 W m 22 in accordance with AR5 (Fig. 1b), which is set to 20.75 W m 22 pseudoeffective global aerosol radiative forcing (ERF) in our response model framework (Figs. 1b,c). The total ERF estimate is guided by a recent review by Forest (2018), which is slightly lower than the best estimate for ERF of 20.9 W m 22 published in AR5. We note that other recent research has also suggested that AER ERF might be lower (Stevens 2015;Myhre et al. 2017), essentially reflecting arguments for stronger BC warming effects (Bond et al. 2013;Myhre and Samset 2015) and less cooling due to noticeable SO 2 reductions in China since 2006 Klimont et al. 2013).
We use Berkeley Earth Land/Ocean (BE) (Rohde et al. 2012), HadCRUT4-Cowtan/Way (Cru4CW) (Cowtan and Way 2014;Cowtan et al. 2015), HadISST2 (Titchner and Rayner 2014;J. Kennedy et al. 2017, unpublished manuscript), and Operational Sea Surface Temperature and Sea Ice Analysis (OSTIA) (Donlon et al. 2012) data as observational data. We note that there are indications that the land datasets may still underestimate warming in northern areas (Wang et al. 2017;Way et al. 2017). Since Cru4CW uses HadSST3 data (Kennedy et al. 2011) over oceans, we developed an additional composite product with Cru4CW over land andHadISST2 (1850-1985; preliminary release available only until 2010) and OSTIA data (from 1986 to the present; calibration period 1986-2005) over ocean to reflect the full range of available SST products (hereinafter referred to as HadOST). To obtain land and ocean proxies, ocean points that are covered with sea ice are treated as land points. The sea ice extent to generate the ice mask is taken from HadISST2 and OSTIA. The same mask is applied to Cru4CW and BE.
Because of continuous problems in currently available SST datasets, mainly manifest as warm bias as a result of changing SST sampling methods (from bucket to engine-room intake measurements) during World War II, associated with changing fleet composition (Karl et al. 2015;Hansen et al. 2016;Kent et al. 2017), Cowtan et al. (2018) have recently proposed a novel method to address the World War II (WWII) bias using island and coastal weather stations only. Inspired by the idea, we replicate their analysis with a slightly simplified methodology. We use a mask where grid boxes over land (adjacent to ocean) and over ocean (along coastlines) are selected, including islands. The global average of all such subsampled ocean grid boxes establishes our new SST proxy. The two coastal time series are scaled to match the 1980-2016 global SST trend [see Cowtan et al. (2018) for details on the scaling method] as it is deemed the most reliable period in the marine instrumental record (Rahmstorf et al. 2017).
The results are shown in Fig. 1d for HadOST and BE and in Fig. 1e for ERSSTv4 and GHCNv3 as used in GISTEMP (Hansen et al. 2010). The scaling factors are provided in the figure legend. In both cases, the two coastal records (derived from HadISST2 and ERSSTv4) show excellent agreement during the calibration period. As expected, the land scaling factor is lower in agreement with amplified warming trends over land. The land and ocean proxies agree after 1920 and show only minor deviations before 1920 (Fig. 1d). The HadOST proxies (land and ocean) suggest that HadISST2 is reliable with marginal biases between 1880 and 1940. We find much less agreement between GHCNv3 and ERSSTv4 before 1980. Our analysis further suggests that ERSSTv4 has a substantial cold bias between 1900 and 1980 as well as a spurious warm bias during WWII (Fig. 1e). While by no means perfect, this straightforward analysis is at least indicative that SSTs in general and ERSST in particular (versions 4 and 5 are almost identical throughout the period of coverage) are still impacted by substantial unresolved inhomogeneities. In our main analysis we discard GISTEMP and apply the following correction factors to HadOST, Cru4CW, and BE during four of the WWII years (1942-45): NHem 5 20.048C, Global 5 20.088C, SHem 5 20.128C, and Ocean 5 20.188C. The remaining years in the time series remain unchanged. We discuss the implications in section 4.
Finally, we use historical climate simulations from phase 5 of the Coupled Model Intercomparison Project (CMIP5) (Taylor et al. 2012) and an ensemble of the Met Office HadCM3 model (Euro500) (Schurer et al. 2014) to estimate warming ratios and multidecadal internal variability.

Impulse response model and uncertainty
Following the method introduced in earlier work (Otto et al. 2015;Haustein et al. 2017) [vaguely similar to the analysis presented in Lean and Rind (2008) and Lean (2018)], we employ a two-box impulse response model framework that accounts for fast and slow temperature T changes in response to external forcing factors [components (comp): WMGHGs, anthropogenic aerosols (AER), and volcanic eruptions (VOL)]. The fast component can be associated with the ocean mixed layer response whereas the slow component approximates the response of the deep ocean (Li and Jarvis 2009): (1 2 e 270/d 2 ) , and (2) ECS 5 F 23CO 2 (q 1 1 q 2 ) .
More details can be found in Millar et al. (2017). The forcing due to doubling of CO 2 (F 23CO 2 ) is 3.71 W m 22 . The factor q j (integrated contribution for response j) can be determined using Eqs.
(2) and (3) with a defined set of values for TCR and equilibrium climate sensitivity (ECS). Our chosen TCR range encompasses values from 1.1 to 2.1 K, with an associated ECS range of 2.0-4.0 K, in line with IPCC AR5 estimates (Stocker et al. 2013). As TCR/ECS ratios derived from observational data are plagued by a variety of shortcomings (Armour 2017;Proistosescu and Huybers 2017;Marvel et al. 2018), we apply the CMIP5 mean of ;0.53 as our central TCR/ECS ratio estimate, supported, for example, by a reasonably good match of measured and simulated ocean heat uptake ). The associated adjustment factors for NHem, SHem, Land, and Ocean as well as for AER and VOL forcing are introduced below. The slow response time (d 2 ) is taken from Geoffroy et al. (2013b) (320 yr), which included deep ocean feedbacks in contrast to accompanying work (Geoffroy et al. 2013a). Given that the fast response time (d 1 ) of 4 years suggested in the same study (Geoffroy et al. 2013b) relies on estimates from GCM simulations, we follow the approach presented in Rypdal (2012) and double d 1 to 8 years, which is in line with coefficients presented in Boucher and Reddy (2008) based on idealized simulations undertaken with the HadCM3 model. It is argued that observed temperatures show a prolonged/delayed response due to mediating effects intrinsic to our climate system (Emile-Geay et al. 2008;Santer et al. 2014;McGregor et al. 2015), which may be less well represented in many GCMs (Le 2017). These estimates of d 1 and d 2 yield the highest correlation with observations.
As far as the response to AER is concerned, Shindell (2014) and Marvel et al. (2016) have highlighted the importance of different hemispheric treatment of the heterogeneous aerosol load. The conceptual idea is to have an enhanced TCR for AER due to its preponderance over land as a result of the skewed spatial distribution of aerosols. Differential heat capacities over land and ocean (and therefore implicitly the hemispheres) lead to considerably different response times over land and ocean, associated with inhomogeneous hemispheric warming rates that are mediated by crossequatorial energy transports (Loeb et al. 2016;Stephens et al. 2016) for all forcing agents. Having said that, aerosols are transported over vast distances (Uno et al. 2009;Schulz et al. 2012), affecting oceans directly (due to albedo effect) and indirectly as well (due to cloud effects, particularly over formerly pristine areas), despite very low direct emissions over oceans mainly from ship exhaust (Kunkel et al. 2013;Shindell et al. 2013). Therefore, ocean aerosol emissions are not a suitable proxy for the associated ocean temperature response. To remedy the problem, the interhemispheric exchange of aerosol-induced temperature responses has to be accounted for appropriately using coupling factors (introduced below).
The differential warming requires dedicated TCR calibration factors for the WMGHG, VOL, and AER induced temperature responses. To obtain a plausible and robust set of such calibration factors, we use observed transient warming ratios (TWR) between NHem and SHem as well as Land and Ocean. In Decadally averaged warming ratios are provided above or under each graph. All data are low-pass filtered with a smoothing radius of 5 years. The TWR is obtained during the 30-yr period of strongest transient warming.
Given that TWRs for WMGHG, AER, and VOL can only be inferred from GCMs, we apply a scaling factor that represents the difference between observed and allforcing TWRs. Assuming that the observed TWR (red shaded area in Figs. 2a and 2b) is our target ratio, the responses in HadCM3 are scaled accordingly. HadCM3 is used because it provides a small ensemble of simulations (mainly drawn from the Euro500 experiment) that  (g) and (h). All data are low-pass filtered to remove interannual variability. The boxes at the bottom show the inferred (diagnosed) warming ratios (TWRD) for WMGHG and AER using the product of the ratios of observed (red) and modeled (orange) allforcing TWRs in (a) and (b), multiplied by the modeled TWRs (orange) for WMGHG in (c) and (d) and AER in (e) and (f). The estimated warming ratios (TWRE) refer to the simulated response model TWR using TWRD. Both values are given in light purple. Only the 30-yr period of strongest differential warming is used for the central TWR estimates. VOL TWR is only a function of the fast response.

4898
is consistent across the experiments. The TWR in the historical HadCM3 ensemble is 1.7, compared to 2.8 in HadOST (Fig. 2a). Hence a scaling factor of ;1.6 is applied to the TWR deduced from the WMGHG and AER ensemble of the same model in order to correct for the underestimated TWR in the historical HadCM3 simulations. The resulting inferred TWR (hereinafter referred to as TWRD; D 5 diagnosed), which is then used in the response model, is provided in the boxes at the bottom of Fig. 2.
Since the bulk of the VOL response takes place on the fast time scale (1-10 yr) and thus differs from WMGHG-related responses (Ding et al. 2014), we refrain from scaling and use the TWR from HadCM3 directly [consistent with above-mentioned findings in Boucher and Reddy (2008) regarding HadCM3's fast response time]. Note that the VOL responses in Figs. 2g and 2h are shown for the full 1500-1999 period in contrast to the shorter 1850-1999 (1850-2017) period for all other scenarios.
In addition, since we do not know the resulting warming ratios in the response model a priori when we impose the inferred TWRD, we compare them with the posteriori TWRs (hereinafter referred to as TWRE; E 5 estimated) in order to validate our approach. We find that, for example, the TWRD for WMGHGs (TCR of 2.65 K over Land and 1.11 K over Ocean) of ;2.4 results in a TWRE of ;2.2 (see Fig. 2d). We therefore argue that our method is reasonably well constrained to provide a robust answer.
All TCR calibration factors based on the deduced TWRDs (and shown at the bottom of Fig. 2) are summarized in the upper box in Fig. 3. We would like to point out that these calibration factors modulate the TCR/ECS ratio and are used for the full range of TCR and ECS values, respectively, not only the best estimate. The latter is provided at the top of Fig. 3 as well, together with the TCR for AER effective forcing, which is ;40% higher (best estimate 5 2.2K) than that of WMGHGs (best estimate 5 1.6 K), consistent with findings in Rotstayn et al. (2015), to reflect the higher aerosols load over land.
To estimate the TCR calibration factor for AER, hemispheric and land-ocean coupling factors need to be determined. They reflect the above-mentioned fact that interhemispheric energy exchanges in response to the heterogeneous distribution of AER need to be balanced. Conveniently, the coupling factors are an emergent property and as such a function of the hemispheric area weighting factors, which are strictly interlinked and hence constrained as follows (example for WMGHGs): Note that the Land fraction is marginally .30% because areas covered with sea ice are treated as land throughout the analysis. Apart from the area-weighted constraint, the coupling factors are also dependent on the emission ratio, that is, the ratio between the hemispheric (and land/ocean) and the total global aerosol emission strength, which in turn determines the appropriate fractional contribution to match the inferred AER-TWRD (see appendix A for more details). The resulting coupling factors are 3.9 (ratio of 1.47 and 0.38, which corresponds to 85% NHem and 15% SHem AER contribution for NHem AER and vice versa for SHem AER) and 2.1 (ratio of 1.46 and 0.7, which corresponds to 70% Land and 30% Ocean AER contribution for Land AER and vice versa for Ocean AER). These factors are also provided in the bottom box of Fig. 3, together with all other parameters used in the response model. Global temperature trends for the 1978-2017 period in HadOST (Fig. 3a), CMIP5 (Fig. 3b), and HadCM3 ( Fig. 3c) are also shown. The spatial distribution of the trend highlights why observed and modeled TWR do not agree, which is primarily caused by delayed Southern Ocean warming (Armour et al. 2016), and partly by an accelerated Arctic amplification (Serreze and Barry 2011). Both physical processes are not satisfactorily reproduced in most GCMs. Last, as apparent from the discrepancy between the AER factor provided at the bottom of Fig. 2 (3.5 and 2.4 for NHem/SHem and Land/Ocean, respectively) and that shown in the top box of Fig. 3 (5.1 and 2.9 for NHem/SHem and Land/Ocean, respectively), we increased the inferred AER-TWRD slightly. While the adjustment of the AER-TWRD does not change our conclusions (see Figs. S1 and S2 in the online supplemental material for the same result without AER-TWRD tuning), it does lead to better agreement between HadOST and the response model during the period of strongest AER cooling between 1960 and 1980. Given that the HadCM3 AER ensemble is not a strict AER-only simulation rather than the difference between the allforcing and a nonaerosol ensemble of HadCM3, the results likely do not reflect the full extent of the aerosol-induced TWR. Therefore we think it is a defensible decision and well within the realm of the uncertainty of our AER-TWR estimate. The resulting AER time series is shown in Fig. 1c.
For the uncertainty analysis, response model, radiative forcing and internal variability uncertainty is considered. Apart from the TCR (1.1, . . . , 2.1 K) and ECS (2.0, . . . , 4.0 K) range, we also include a range of fast response times (3, . . . , 13 years) in our response model uncertainty estimate. For the forcing, 200 total radiative forcing realizations are used (Forster et al. 2013) and converted into response model temperature equivalents to estimate the associated error range. The resulting s (32nd-68th percentiles) of the fractional uncertainties is shown in Fig. 4a (response model error in green and radiative forcing error in blue). If we assume that potential internally generated, low-frequency variability adds linearly to the externally forced response, we need an estimate of FIG. 4. (a) Fractional variance (square of the model error) for impulse response model uncertainty (green), total radiative forcing uncertainty (blue), and internal variability uncertainty (gray). The 1s (32nd-68th percentiles) range is shown. We note that internal variability is no response model uncertainty in a strict sense as it is added post hoc (i.e., onto the calculated temperature). The peaks in the response model uncertainty coincide with volcanic eruptions (e.g., Tambora in 1816). (b) The internal variability from selected CMIP5 piControl runs is contrasted with the unforced residuals from the GMST datasets used in this study. Observed and modeled time series are lowpass filtered with a 30-yr smoothing radius. The standard error is provided in parentheses.
(modeled) unforced multidecadal variability. As introduced in Haustein et al. (2017), we use equidistant intervals of selected CMIP5 preindustrial control simulations that do not drift (Knutson et al. 2013) and possess a similar range of unforced variability as our response model-based estimate of the residual observational variability. In Fig. 4b, the low-pass filtered residuals for HadOST, Cru4CW, and BE between 1850 and 2017 and low-pass filtered sample intervals of 168 years from selected CMIP5 models are shown together with their standard deviation s. The obtained 5th-95th percentiles of their internal variability span 60.178C (s 5 0:18C and s 2 5 0:018C 2 ) as shown in Fig. 4a in gray.
We note that there is additional parameter uncertainty, which is not fully included here as it is difficult to objectively constrain the upper and lower bounds of the respective parameters. To rectify this problem, in Fig. S3 we have plotted the response model results for a set of reasonable model parameters, including aerosol sensitivity, TWRs, coupling strength, TCR efficiency, varying SOL and VOL forcing, and high and low AER ERF (dashed line for 20.5 and 21.0 W m 22 ). The resulting uncertainty is small compared to the total uncertainty, which is dominated by the forcing uncertainty. Hence we conclude that our results are insensitive to the parameter choices, even if our observationally constrained estimates were biased.

Model performance and evolution
In Fig. 5, the response model results for Land ( Fig. 5a; brown), NHem ( Fig. 5b; red), Global ( Fig. 5c; green), Ocean ( Fig. 5d; purple) and SHem ( Fig. 5e; blue) are shown. The central allforcing temperature response estimate is shown as the bold line in the lower graph in each panel, while thin lines indicate slightly higher/ lower alternative TCR estimates as indicated on the right-hand side (1.2, . . . , 2.0 K). The 5th-95th percentiles and the interquantile (25th-75th) uncertainties are added as shaded gray contours. The low-pass filtered (30-yr smoothing radius) instrumental data from HadOST (dark green), Cru4CW (yellow), and BE (black) are shown for comparison, including the WWII correction introduced in section 2. Before filtering, the influence of ENSO (Deser et al. 2012) is removed from the observational time series in order to minimize short-term noise (Stuecker et al. 2015), following the multiple regression approach of Foster and Rahmstorf (2011).
Conversely, in the upper graphs in Figs. 5a-e we have added ENSO variability to the response model results by scaling the multivariate ENSO index (MEI) (Wolter and Timlin 1998) for each domain and applying the lag coefficient obtained from the multiple regression. Other than the WWII bias correction, the observational data in the upper graphs show the annual mean temperatures. On the top left in each panel, the explained variance R 2 for non-ENSO-corrected, model-adjusted (MEI), and observation-adjusted correlations between model and the observational datasets are shown. The correlations are based on the low-pass filtered time series (30-yr smoothing radius). To avoid problems due to autocorrelation, the associated nonfiltered R 2 between Global HadOST and model-adjusted (MEI) time series is 0.935 (not shown; 0.92 for Cru4CW and 0.912 for BE). We would like to highlight that our R 2 for HadOST exceeds the explained variance found in Rypdal (2018) and Folland et al. (2018), without the need to invoke any contribution of the contentious AMV.
We find excellent agreement between our response model and observations in all three time series. NHem and Land are well reproduced over the entire duration of the instrumental period, including the EW and the MC periods (Figs. 5a,b). SHem and Ocean are similarly well reproduced, with notable deviations before and after WWII when compared with Cru4CW or BE (Figs. 5d,e). Using HadOST, the SHem and Ocean model results can be almost entirely reconciled with observations (Fig. 5d). HadOST and Ocean only start to diverge before 1900. But overall, the Global results (Fig. 5c) leave little room (on the order of ;0.18C) for unforced low-frequency temperature variations.
Before we investigate other notable excursions in light of the role of unforced Pacific and Atlantic Ocean variability, in Fig. 6 the evolution of the response model for all five domains is shown. The top graph in each panel shows the response model result using WMGHG and aerosol forcing based on IPCC AR5 (Meinshausen et al. 2011), extrapolated to 2017, and volcanic forcing from NASA-GISS (Sato et al. 1993) updated to 2017 and a fast response time of 4 years. As such, it corresponds to the results published in Haustein et al. (2017). The middle graph in each panel is using our slightly modified VOL and solar forcing and a fast response time of 8 years. All that is otherwise different compared to our final response model result as shown in the lower graph in each panel is AER. The results based on the new CEDS AER show significant improvements in each domain, resolving most of the discrepancies associated with the EW and MC period. As far as EW is concerned, the improved response model performance is partly linked with the SST bias correction during WWII, which is only applied in the lower graph in Fig. 6. Accordingly, the warming spike particularly over Ocean (Fig. 6d) and SHem (Fig. 6e) disappears, leading to a visibly better agreement between model and observations. With the current AER lowered by .10% (Fig. 1a), here we briefly explore the implications for TCR, including a cautionary remark regarding the lack of robustness when estimating ECS. In Fig. 5, the TCR range from 1.2 to 2.0 K is indicated with our best estimate using a TCR of 1.6 K (bold lines). Based on linear regression between HadOST and the Global response model result, our most precise TCR estimate is 1.57 K with an associated interdecile uncertainty range of 0.87-2.27 K (10th-90th percentiles). This is in good agreement with other recent work (Richardson et al. 2016), despite the lower AER estimate. While others have suggested that TCR might be time-dependent (Gregory et al. 2015), our results do not provide evidence for a change over the instrumental period.
With the TCR/ECS-ratio held constant, ECS is tied to TCR by construction in our analysis (3.0 K with an associated interdecile uncertainty range of 1.7-4.3 K). Nonetheless, it is instructive to investigate the impact of different ECS values upon the model results when the TCR/ECS ratio is permitted to vary. As shown in the lower graphs of Fig. 6 where we have added the response model result for the low-end (2.0 K) and high-end (4.0 K) ECS range, neither of the two estimates provides sufficient guidance as to which ECS value is more likely to be correct. The small range of possible outcomes severely hampers a robust ECS estimation. We therefore agree with others (Armour 2017;Proistosescu and Huybers 2017;Marvel et al. 2018) who found that ECS cannot reliably be inferred from historical observations alone, and recommend caution as ECS is easily conflated with the effective climate sensitivity, which is likely to be lower (Knutti et

Role of unforced Pacific Ocean variability
Returning to Fig. 5, here we assess a few noteworthy remaining excursions that are arguably related to unforced internal variability. To facilitate quantifying those excursions, in Fig. 7 the residuals between the HadOST and response model temperature time series are plotted for the five domains. In Fig. 7b, the low-pass filtered MEI evolution is provided (black line). Cru4CW and BE are shown in Fig. 7e for completeness.
We note that MEI shows signs of multidecadal variability, which is linked to the Pacific decadal variability ) is a matter of intense debate and beyond the scope of this paper. However, the residuals as well as the explained variabilities provided in Fig. 5 suggest that low-frequency ENSO variability has little bearing on the outcome of our response model results. Merely the timing of the modern warming is slightly better aligned with observations when MEI rather than Niño-3.4 (Trenberth 1997) is used (not shown), which is indicative of a minor role for additional decadal PDV impacts indeed.
Modeled Land (Fig. 7d) shows only a few peaks that are not explained by ENSO (e.g., 1884ENSO (e.g., , 1913ENSO (e.g., , 1939ENSO (e.g., , 1949ENSO (e.g., , 1980ENSO (e.g., , 1991ENSO (e.g., , 2010. Such excursions should be expected given the large standard deviation over land due to the stochastic nature of continental interannual variability (Mahlstein et al. 2012). There are a few years between 1950 and 1960 that appear to be cooler than the response model suggests, but since no such deviation shows up over Ocean (see Fig. 7f), it might be related to European aerosol emissions (Persad and Caldeira 2018).
The positive residual after 2000 (also visible in the NHem residual in Fig. 7a) is perhaps more interesting as it relates to the infamously dubbed ''hiatus'' period in the wake of the strong El Niño in 1997/98. While primarily caused by a clustering of La Niña events around 2010 (Kosaka and Xie 2013;England et al. 2014;Schurer et al. 2015;Dong and McPhaden 2017), upon closer inspection another feature stands out. There has been a succession of anomalously cold years between 2010 and 2013, exclusively linked with boreal winter. More precisely, this period is linked with extremely cold Eurasian winters (Cohen et al. 2012), which may or may not have been assisted by forced atmospheric circulation changes in response to declining sea ice (Tang et al. 2013;Cohen et al. 2014;Overland 2016;Francis 2017;Hay et al. 2018). But other than that, SHem (Fig. 7c) and Ocean (Fig. 7f) residuals are inconspicuously smooth and only diverge before 1900 as outlined above already. Overall, our results support previous work that has shown that using updated external radiative forcing (Huber and Knutti 2014;Schmidt et al. 2014) and accounting for ENSO-related variability explains the so-called hiatus. We refer to Medhaug et al. (2017) for a comprehensive review of the unprecedented flurry of publications on the subject. That said, despite being less sensitive to small changes near the endpoints compared to higher degree polynomial fits, we caution that the lowess smoother is still susceptible to overestimating trend changes at the beginning and end of the time series.
With explained variabilities ;98% for HadOST for the Land (Fig. 5a), NHem (Fig. 5b), and Global (Fig. 5c) response model results, we conclude that almost all lowfrequency variability is explained by external forcing factors independent of ENSO. The Ocean (Fig. 5d) and SHem (Fig. 5e) results reveal similar explanatory skill with explained variabilities between 93% and 95%. Interestingly, BE shows lower correlation factors than Cru4CW over Ocean (even more so over SHem), despite their common use of HadSST3. Thus, differences in data processing alone can explain much of the discrepancies. The fact that HadOST not only fares considerably better in terms of correlation but also performs best regarding the coastal proxy analysis (Fig. 1d) justifies its inclusion in our analysis. However, more work needs to be done to reconcile the differences between the available SST products and to reduce associated biases (Davis et al. 2019). In appendix B, we briefly analyze the spatiotemporal characteristics of those products with regard to decadal means.

Role of unforced Atlantic Ocean variability
While the accurate reproduction of the EW and MC period in our response model framework does not require multidecadal temperature variability to attribute to the ostensible AMV, we do not dispute the existence of internal variability associated with AMOC variations. Therefore here we aim at quantifying the AMOC's role in setting NHem temperatures and its relation to the AMV. To facilitate the assessment, we would like to propose a more adequate, straightforward, and intuitive definition of the AMV index itself. Rather than using the standard definition (Delworth and Mann 2000) or an improved definition thereof (van Oldenborgh et al. 2009), we define the AMV as average SST at 408-608N and 158-508W (red box in Fig. 8d) minus NHem ocean temperature. The resulting revised time series is shown in Fig. 7a (bold black line).
The revised AMV index [which we more appropriately propose to be named the North Atlantic Variability Index (NAVI)] is essentially reflecting and reliably mirroring the long-term AMOC decline in response to anthropogenic warming. The unprecedented dip around 2015 is associated with the continued advection of very cold air of Arctic origin over the Canadian archipelago region during the winters of 2014/15 and 2015/16. Atmospheric forcing has been recognized to drive short-term AMOC variability (Roberts et al. 2013;Duchez et al. 2016) as opposed to gradual changes in sea ice cover (Sévellec et al. 2017), temperature, salinity, and pressure gradients that eventually cause the slower long-term AMOC changes that are indeed already detectable (Rahmstorf et al. 2015) and concomitant with the well-known Atlantic warming hole (AWH) (Menary and Wood 2018). Since asymmetric land-ocean warming cools the NA region relative to NHem (physically consistent with a transient warming scenario), we use NHem SSTs. The resulting slow pace of the NAVI decline suggests a contributing role for AMOC.
To qualitatively explore the role of longer-term effects associated with low-frequency modes of variability, we have conducted a simple correlation analysis. In Fig. 8, we have plotted the spatial map of correlation coefficients between Global and NHem time series obtained from the response model versus global observations (HadOST). The correlation between the older, slightly more advanced AMV index (van Oldenborgh et al. 2009) and HadOST is provided as well (Figs. 8d,g,k). We notice that the AWH in the subpolar NA region appears uncorrelated with the forcing time series (Figs. 8a,b,e), regardless of whether we use the Global or NHem time series. Another noteworthy feature is the accompanying anticorrelation between the AMV index and most world regions. The AMV/NAVI region is highlighted with a red box. (e) As in (c), but with a 5-yr running means applied to both NHem and HadOST. (f) Combination of (c) and (e) where both regressors are detrended and low-pass filtered with a 5-yr running mean. (g) As in (d), but with both AMV and HadOST being detrended. (h) As in (e), but with a 10-yr running mean. (j) As in (f), but with a 10-yr running mean. (k) As in (d), but with both AMV and HadOST being detrended and low-pass filtered with a 20-yr running mean. (m) As in (e), but with a 20-yr running mean. (n) As in (f), but with a 20-yr running mean. The SHem area is shown in semitransparent colors to highlight the NHem region of interest.
Since we are not aware of a robust mechanism that would cause multidecadal AWH variability as opposed to a steady decline, in the following we test three potential reasons for why the AWH region may or may not follow externally forced changes: 1) A longterm warming trend difference, 2) a different spectrum of high-frequency SST variability, or 3) true internal low-frequency variability. To investigate whether the third possibility is a viable explanation, we applied running means from 5 to 20 years (Figs. 8e,h,m; middle panel), we linearly detrended model and observations (Fig. 8c), or we did both (Figs. 8f,j,n; right panel).
What we find is that the AWH is robust against temporal averaging as far as nondetrended data are concerned. In contrast, if detrended data are used, the temporal averaging aligns the NAVI/AWH region with the NHem forcing response in terms of correlation, maintaining its (forced) multidecadal low-frequency variability. In fact, detrending alone considerably reduces the unique behavior of the AWH region already (Fig. 8c). What we infer from this is that the secular warming trend in point 1 above is responsible for the specific characteristic of the AMV region. The root cause of this cooling trend is well known and one of the key features in GCM projections (Rahmstorf et al. 2015;Menary and Wood 2018). The high-frequency variability over the wider NA region is higher than on global average, but comparable in magnitude to the western North Pacific (equally high supply of baroclinicity) or Eurasia (Fig. 8b).
After a decade, not much multiannual stochastic variability is left (Fig. 8e). Together with the Indian Ocean, the wider NA region shows high correlations with the NHem model after trend removal in both (Fig. 8f), suggesting substantial dependencies on externally forced low-frequency variability. It is a different story over land (and much of the Pacific), where the signal-to-noise ratio is lower on decadal scales due to limited radiative constraints on winter temperatures (Cohen et al. 2012;Knutson et al. 2013;Deser et al. 2017). The positive correlation between the 20-yr lowpass filtered, detrended AMV and the Arctic (Fig. 8k) is physically very plausible as amplified Arctic warming relies on heat transport via the NA region, governed by the NAO index and the associated strength of the AMOC. However, it is the forced long-term warming trend that is the driver as evident from Figs. 8m and 8n.
Since no noticeable low-frequency signal can be detected over the key AWH region (Fig. 8n), we conclude that it is unlikely that internal variability on time scales . 5 years plays an important role in the North Atlantic. There is room for 1-5-yr unforced feedbacks, but apart from the cooling due to the long-term decline in AMOC strength (Fig. 8m), the high-and low-frequency AMV patterns appear to be externally forced according to our response model results. This is in line with an empirical model study that uses multiple regression to attribute forcing contributions globally (Suckling et al. 2017), and also supported by other studies that show that subpolar NA variability is largely driven by AMOC changes, with little evidence for a strong AMV-AMOC link (Marini and Frankignoul 2014;Frankignoul et al. 2017).
In conclusion, combined with the recent downward trend in the new NAVI index, our analysis strongly suggests that the impact of internally generated NA ocean dynamics on Global, NHem, and Land temperatures is rather limited. The remaining AMOC related to low-frequency variability  may have regional implications, but a strong influence beyond that is unlikely. The results are supported by another simple exercise in which NA SSTs are weighted by the surface area of the AMV/NAVI region, divided by the NHem surface area. This way, the fractional fingerprint of the AMV on NHem temperatures can be inferred. The peak contribution would be ,0.038C, assuming all NA SST variability is of internal origin, which we have shown not to be a very plausible conjecture. Helped by a more advanced (yet still debatable) regression analysis, we note that Folland et al. (2018) also found almost no AMV contribution to global temperature.

Conclusions
With explained variabilities of observed global temperatures of up to 98% (30-yr smooth) or ;93% (with ENSO variability), respectively, our impulse response model performs exceptionally well. We are able to match the historical temperature evolution since at least 1850 in general, and succeed in reproducing both the EW and the MC period with high precision in particular, without the need to invoke unexplained internal multidecadal temperature variability as an additional driver.
Three key aspects are crucial for an appropriate attribution of the temperature response to external radiative forcing perturbations: 1) careful treatment of the spatially heterogeneous AER forcing as its temporal evolution has major repercussions for both the EW and the MC period; 2) removal of the WWII warm bias in the current generation of SST datasets as there is now solid evidence that 1942-45 period is biased warm to differing degrees, causing a spurious warming trend at the end of the EW period; and 3) calibration of the fast response time in order to account for the mediating effects of ENSO as far as the response to volcanic eruptions is concerned.
While others (Mann et al. 2014;Folland et al. 2018) have found similarly good agreement as far as the GMST evolution is concerned, our analysis demonstrates that it is possible to reproduce the temperature evolution separately for NHem, SHem, Land, and Ocean with equal precision. We achieve this by introducing a set of suitable TCR calibration factors that are informed by observed (HadOST) and modeled (HadCM3) TWRs and traceable throughout the analysis. Apart from minor finetuning related to the deduced TWR for AER, every response model parameter used in our study is backed up by independent analysis and/or based on well-established research. The use of updated aerosol emission and volcanic forcing data as well as the application of a longer fast response time (complemented by a hemispherically more uniform fast VOL response) are otherwise the only changes that we made compared to previous iterations within the response model framework. Owing to the introduced analytical constraints, which are designed to avoid model tuning, our results warrant robustness against overfitting.
With the introduction of HadOST, which includes a coastal temperature analysis inspired by Cowtan et al. (2018) that appears least biased with regard to the incorporated HadISST2 and OSTIA SSTs, we add another option to the existing batch of GMST datasets. We recommend using it more widely as it resolves some of the discrepancies present in HadSST3 before 1940. Despite a smaller warm bias during WWII in HadISST2 compared to HadSST3, we still have to impose a correction factor (20.088C for GMST) to reconcile it with the coastal hybrid temperature time series. As a result, almost all of the EW warming could ultimately explained by external forcing changes, which-if confirmed by future research-may call the current partition of attributable EW causes, as recently reviewed in Hegerl et al. (2018), in considerable doubt.
In our assessment of potential contributions from Atlantic and Pacific multidecadal variability, we demonstrate that with the exception of prolonged periods of El Niño or La Niña preponderance, there is little room for internal unforced ocean variability beyond subdecadal time scales, which is particularly true for the NA region. This finding is buttressed by our demonstration that despite high covariability, cause (VOL and AER) and effect (AMV) are clearly distinguishable. That does not mean AMV cannot have internal mechanisms , rather only that the signal cannot be detected in Global or NHem (nor is necessary to explain their temporal evolution). Hence the traditional AMV index must not be used as predictor or explanatory variable, as it may lead to demonstrably incorrect or flawed attribution results (Hetzinger et al. 2008;Chylek et al. 2009;Huss et al. 2010;Wyatt et al. 2012;Tung and Zhou 2013;Chylek et al. 2014;Pasini et al. 2017;Hodgkins and Wilson 2017;Yan et al. 2017;Shen et al. 2017;Zhang et al. 2018). We suggest a revised AMV index formulation (NAVI) which avoids such pitfalls as it better mirrors the longterm AMOC decline as suggested, for example, in Rahmstorf et al. (2015).
On that note, we also caution against confusing atmospherically driven short-term variability (noise) with changes due to anthropogenic or natural external forcing factors (signal). As demonstrated in the supplementary analysis, anomalous atmospheric NHem winter circulation features explain most of the short-term AMOC variability, acting as the control knob on multimonthly time scales. Longer time scales are conceivable, such as 1) via changing wind stress related to anomalous NAO phasing, which in turn affects the subpolar horizontal gyre circulation (Piecuch et al. 2017); 2) via atmospheric teleconnections associated with ENSO such as the PNA-NAO relationship (Pinto et al. 2011), which in turn links to the emerging paradigm of the Pacific basin as pacemaker for global temperature (Guan and Nigam 2009;Kosaka and Xie 2013;England et al. 2014;Schurer et al. 2015;Dong and McPhaden 2017;Frankignoul et al. 2017); and 3) via ocean memory effects, which may favor the reoccurrence of certain large-scale weather patterns in the Euro-Atlantic region during successive boreal winter seasons via air-sea coupling (Scaife et al. 2014). But generally, progress in understanding Atlantic decadal climate variability has been slow (Yeager and Robson 2017). Taken together, our analysis underscores that despite the complexities of the climate system, changes to the mean state are dominated by radiative forcings on longer time scales and ENSO-related variability on shorter time scales.
By virtue of these findings, we are confident that our associated best TCR estimate of 1.57 (60.70) K is robust, despite a substantial error range due to the large forcing uncertainty. We strongly advise against the use of ECS estimates based on the instrumental record alone without considering further evidence (from paleoarchives or GCMs), as they cannot be reliably constrained with data of such a short time interval.
In a future analysis, we aim to quantify another important response model feature that also contributes to an improved representation of the EW period. In a nutshell, it can be demonstrated that failure to initialize the response model (or GCMs for that matter) before a series of strong volcanic eruptions will very likely bias the beginning of the simulated EW period warm, leading to an artificially low warming trend in models.
Acknowledgments. The authors thank the U.K. Met Office Hadley Centre (HadCRUT4, HadSST3, HadISST2), the National Centre for Ocean Forecasting (OSTIA), NASA/GISS (GISTEMP), and Berkeley Earth (BE) for providing the observed temperature and SST data. We also thank the National Oceanic and Atmospheric Administration (NOAA) for providing the surface wind vector, 850-hPa temperature, and Optimum Interpolation (OI) SST data, as well as the NOAA Earth System Research Laboratory (ESRL) for providing the Twentieth Century reanalysis data. Furthermore, we thank Piers Forster for providing the updated radiative forcing data, including the forcing uncertainty ensemble, and the Greenwich Royal Observatory for providing the sunspot number data. Finally, we would like to acknowledge the input from M. R. Allen regarding the original conceptual idea in context of the Global Warming Index, as well as the input from B. B. B. Booth, who provided valuable comments during the early stages of paper writing. This work was supported and partly funded by the UK National Environmental Research Council's DOCILE project under the grant assignment NE/P002099/1 (2019-1552).

Calculation of Coupling Factors
As outlined in section 3, interhemispheric energy exchanges in response to the heterogeneous distribution of AER need to be balanced by virtue of so-called coupling factors. As shown in Fig. 3, for AER there is a notable discrepancy between the TCR scaling factors (2.5/1.9 for NHem/SHem and 2.7/1.95 for Land/Ocean) and the diagnosed AER-TWR (5.1 for NHem/SHem and 2.9 for Land/Ocean). A discrepancy that does not appear for WMGHG and VOL (for instance: TWRD WMGHG 5 2.23/0.97 5 2.3). This is where the coupling factor comes into play. Since the TCR scaling only works under the assumption that the NHem aerosol response is governed by NHem aerosol emissions (and vice versa for SHem; the same problem applies for Land and Ocean), we have to find a way to accommodate the additional temperature response from SHem emissions in case of the NHem response.
We have applied two-stage methodology: 1) We derive the emission ratio, that is, the ratio between the hemispheric (and land/ocean) and the total global aerosol emission strength. It is a function of the fractional contribution of each hemisphere (or land/ocean) and could vary between 95% and 5%, up until 50%-50%. 2) We determine the optimal fractional contribution or coupling strength. For this to work, we balance the ratio of the TCR scaling factors (e.g., 2.5 for NHem and 1.9 for SHem) and the TWRD (e.g., 5.1 for NHem/ SHem). Since the effective forcing of the SHem emissions will be lower (due to the lower TCR scaling factor), a secondary scaling factor has to be applied to NHem and SHem emission strength. This factor is only be equal for one particular set of coupling strengths. For NHem/SHem, the fractional contributions are 85% and 15%, associated with a NHem emission ratio (fractional NHem emission divided by Global emissions) of 1.58 and a SHem emission ratio (fractional SHem emission divided by Global emissions) of 0.42. The additional secondary scaling factor is 0.93. Hence the NHem/SHem emission ratio is reduced to 1.47 and 0.38. Their ratio defines the coupling factor, which is 3.9 accordingly (51.47/0.38).
Applying the same method for Land/Ocean, the fractional contributions are 70% and 30%, associated with a Land emission ratio (fractional Land emission divided by Global emissions) of 1.35 and an Ocean emission ratio (fractional Ocean emission divided by Global emissions) of 0.65. The secondary scaling factor is 1.08, which is explained by the fact that the Ocean sensitivity is lower, but instead covers a much larger area fraction compared to Land (area fractions are equal in case of NHem/SHem). Hence the Land/Ocean emission is increased to 1.46 and 0.70. The associated ratio to determine the coupling factor is 2.1 (see lower box in Fig. 3 and fine print below it).

Decadal Temperature Evolution
As highlighted in the main text, SST observations are still afflicted with considerable uncertainties. Having investigated time series of field means, here we provide the spatiotemporal context and discuss potential causes for some of the discrepancies noted above. In Fig. B1, the GMST data used in this study are plotted as decadal averages from 1850-59 to 2010-17 (BE, Cru4CW, HadOST), accompanied by GISTEMP from 1880-89 to 2010-17. In addition, in the Twentieth Century Reanalysis (20C Rean) (Hirahara et al. 2014), the ensemble mean of a subset of CMIP5 simulations is plotted, together with NorESM1-M , which is found to represent the temperature evolution since 1850 very well compared to observations. We also added the recently proposed hybrid SSTs (Cowtan et al. 2018). The decade 1940-49 is highlighted by the red box. Note that HadOST and Cru4CW use the same infilled HadCRUT4 (Morice et al. 2012) data over land. We also note that both Cru4CW and BE have shown to carry a small negative trend bias in recent years (Hausfather FIG. B1. Decadal GMST anomalies for the Twentieth Century Reanalysis, all observational data used in this study including the new Hybrid SST dataset (Cowtan et al. 2018  If that is not enough, it has also been suggested that the Arctic region might still be biased cold (Wang et al. 2017;Way et al. 2017). As noted above, the 1880-1935 period is too cold in HadSST3 (Cru4CW) compared to HadISST2 (HadOST), most pronounced during 1890-1920. Looking at those three decades, at least the first two show a noteworthy feature near Cape Cauldron off the southern tip of South Africa, presumably associated with the Agulhas and Brazil Currents. The otherwise distinct cold SST anomalies in the turbulent exit region where the Agulhas Current leaks into the South Atlantic Ocean (cf. HadOST) turns into a vast area of cold SST anomalies that essentially covers most of the South Atlantic. Given the poor observational coverage and the intrinsic shortcomings of any infilling technique (kriging in case of Cru4CW), it is likely that the cold South Atlantic SST anomalies in Cru4CW, BE, and GISTEMP are exaggerated to varying degrees.
Since the only bias in HadOST with regard to our response model results was found during the 1850-79 period, mainly caused by warmer NH SST conditions, it is interesting to ask whether the warm SST anomaly in the North Pacific in HadOST is real given that it does not show up in other observational datasets. While such a pattern is consistent with a prolonged PDV negative phase, the amplitude of the anomaly appears very strong, especially during the 1860s. The pattern reoccurs during the 1950-80 period, but background SSTs are less cold than during the 1850-79 period. This is arguably a feature that deserves to be investigated in more detail, particularly in light of recent work by Huang et al. (2018).
Regarding the WWII period, even though we have plotted decadal averages, what stands out is the sudden warming of all ocean basins in ERSST during the 1940s (and to a much lesser extent HadSST3 and HadOST). As evident from Fig. 5a, Land did not notably warm during the same period, which strongly suggests an artifact related to biases due to the previously explained change in fleet composition during WWII. The final feature we would like to mention concerns the cold bias during 1950-80 in ERSST. While the general NHem ocean cooling due to increased anthropogenic SO 2 emissions is visible in all observations (and CMIP5 simulations, irrespective of some temporal misalignments), ERSST seems to exaggerate the cooling slightly given that the spatial pattern of the SST anomalies are indistinguishable from other datasets. We speculate that this might be a general theme in ERSST given that it draws heavily from maritime nighttime measurements in contrast to other products. We note that ERSSTv5 (Huang et al. 2017) has not changed notably compared to ERSSTv4.
As we cannot provide robust conclusions with regard to causes for the mismatch between different observational dataset at this point, we would like to close by encouraging the research community to address the oftentimes underappreciated problems in more depth. Insights from energy balance models such as presented in this study can guide such efforts. With clever new strategies to combine the various data available, we are confident that the remaining gaps in our understanding will be eliminated and instrumental data brought into better agreement.