A fluctuation in surface temperature in historical context: reassessment and retrospective on the evidence

This work reviews the literature on an alleged global warming ‘pause’ in global mean surface temperature (GMST) to determine how it has been defined, what time intervals are used to characterise it, what data are used to measure it, and what methods are used to assess it. We test for ‘pauses’ both in the normally understood sense of no warming trend and in the sense of a substantially slower trend in GMST. The tests are carried out with the historical versions of GMST that existed for each pause-interval tested, and with current versions of each of the GMST datasets. The tests are conducted following the common (but questionable) practice of breaking the linear fit at the start of the trend interval (‘broken’ trends), and also with trends that are continuous with the data bordering the trend interval. We also compare results when appropriate allowance is made for the selection bias problem. The results show that there is little or no statistical evidence for a lack of trend or slower trend in GMST using either the historical data or the current data. The perception that there was a ‘pause’ in GMST was bolstered by earlier biases in the data in combination with incomplete statistical testing.


Introduction
The Earth's climate varies on a vast range of temporal scales (National Research Council 1982, 1995). The persistent increase in greenhouse gases since the industrial revolution is imposing climate changes on timescales from decadal to centennial, and ultimately much longer too as the oceans and cryosphere respond to the changes in Earth's energy balance (Hansen et al 1985, Houghton et al 2001). The detection and attribution of greenhouse climate change (Mitchell et al 2001) deals with the identification of the 'signal' of the forced response to greenhouse gases from the 'noise' of variability of climate that occurs on the same decadal and multidecadal timescales. The greenhouse climate signal is always accompanied to some degree by 'noise' (variation) from other forcings of the climate system (such as due to changes in aerosol loading or solar variations) (Marotzke and Forster 2015) and by internal variations intrinsic to the coupled climate system (O'Kane et al 2013).
In recent years there have been more than two hundred articles in the climate literature discussing the notion of a 'pause' or 'hiatus' in greenhouse warming that is variously alleged to have taken place some time in the past couple of decades (Lewandowsky et al 2016). The form of alleged climate 'pause' varies across the literature, but essentially involves calculation of a short-term trend in global mean surface temperature (GMST) over a decade or two, which is then compared with either other periods in observed GMST (Stocker et al 2013), or with trends estimated from coupled climate model projections (Fyfe et al 2013, Risbey et al 2014). This review addresses the former issue (comparison of observed trends), while a companion review (Lewandowsky et al 2018) addresses the comparison with climate model expectations of trends.
When it first emerged, the concept of a global warming 'pause' was mostly cast in terms of the observational record as a period of slower than average warming (e.g. Stocker et al 2013). With time, usage broadened to include a comparison of observed warming rates with those inferred from model projections. The observations-based view of the 'pause' is perhaps more intuitively accessible, whereas the model-comparison view of the 'pause' allows for more complexity in matching variations in the forcing and the (model simulated) response to that forcing with observed trends. Neither definition turns out to be straightforward in practice. We concern ourselves exclusively here with the first (observations-based) view of the 'pause'. As such, we do not consider the role of climate forcing, and we do not conduct any analysis directed at a causal understanding of fluctuations in GMST (which would require the use of climate models). We do not discount the worth of the model-comparison view of the 'pause', but the issues are complex enough that they require separate examination (Lewandowsky et al 2018). While the observations-based view of the 'pause' is intuitively appealing in that one can ostensibly 'see' a slowing of warming rate in (parts of) the GMST record, mere description is not the same as statistical evidence. The complexity here lies in the choices of data, periods, and tests employed to quantify whether any part of the record is indeed unusual. This review attempts to foreground some of those choices and their consequences.
The notion that global warming 'paused' is now entrenched in the journal literature (Stocker et al 2013). The 'pause' in warming is generally posited in this literature as an anomaly about climate that is inconsistent with rising greenhouse gases. Many pause-papers commence with the statement that despite rising levels of greenhouse gases, GMST has not increased since about 1998 (although the supposed start year varies) (Guemas et al 2013, Kosaka and Xie 2013, England et al 2014, Santer et al 2014). This alleged prima facie inconsistency is employed as one of the prime motivations in papers on the pause (Lewandowsky et al 2016). This review assesses the evidence for the 'pause' in the observed GMST record, as it is now, and as it was at the time the research was undertaken.
The review provides initial context by describing temperature fluctuations in the climatological literature and some issues in constructing observed series of GMST. The consequences of the uncertainties in GMST are described for assessment of short-term GMST trends. The review then proceeds by providing a series of retrospective constructions of short-term GMST trends on the basis of what was known about uncertainty in each of the major GMST series at different points in time compared to what is known now about uncertainty in each of these series. This retrospective analysis provides a framework in which to assess what was known (or could have been known at the time) when assessing the evidence for a 'pause' in global warming. The retrospective (historical) assessment of trends uses the versions of the GMST data that existed at the times when researchers carried out their assessments of trend-intervals.
Because the literature on the 'pause' is now so vast, the review treats the literature primarily as a database for statistical assessments. The sets of definitions implied for the 'pause' can be inferred from the pause-literature, which provides the range of intervals against which to assess potential pauses. We have attempted to summarise some of the key messages and the approach to statistical methods in this literature, but do not provide a chronological assessment of individual contributions in the sense more common in reviews. Our concern is with the definitions, data, and methods used and their implications for the conclusions drawn.

Climate fluctuations past and present
The field of climatology has long recognized that climate varies on decadal and longer time scales. The concept of a 'climate normal' was introduced in the early 20th century as a 30 year record or average of climate (Arguez and Vose 2011). The 30 year period was considered necessary to smooth out at least some of the known large decadal-scale variations in climate. The various GMST datasets have used a 30 year 'climate normal' period as a baseline against which to calculate anomalies for similar reasons. The literature on climate variability and change has recognized episodes or periods of multidecadal GMST variation throughout the 20th century (Handel and Risbey 1992, National Research Council 1995). Thus, the notion of fluctuations in GMST is not new and has been recognized as a confounding factor in attributing causes of decadal-scale GMST changes in all the IPCC reports since their inception in 1990. For example, the 1990 report (Houghton et al 1990) noted that 'Because of long period couplings between different components of the climate system, for example between ocean and atmosphere, the Earth's climate would still vary without being perturbed by any external influences. This natural variability could act to add to, or subtract from, any human-made warming.' (Here, the reference to unforced 'natural variability' is equivalent to 'internal variability' in modern usage.) And the 1995 report (Houghton et al 1996) noted that for projections of climate change, '...decadal changes would include considerable natural variability', and that '...natural climate variability on long time-scales will continue to be problematic for CO2 climate change analysis and detection.'

Present view of the present fluctuation
Given that climatologists were well aware that GMST fluctuates on decadal (and longer) time scales, the emergence of a claim in the climate literature from about 2009 that climate change as represented by GMST had entered a 'pause' or 'hiatus' was a strong claim. In effect, the claim was that the most recent decadal-scale fluctuation in GMST was somehow extraordinary or substantially different from past GMST fluctuations. This interpretation is consistent with the fact that the fluctuation was given a name ('pause' or 'hiatus') and with the claim frequently made in pause-papers that this fluctuation (but not others) was not consistent with the GMST response to increases in greenhouse gases (Lewandowsky et al 2016).
In order to assess the claims made about this particular fluctuation in the literature, we identified a set of 224 peer-reviewed articles in the climate literature (through 2016) that referred to a 'pause' or 'hiatus' in GMST in the title or abstract. From this larger set, we constructed a subset of papers that defined a start and end date for any alleged pause, and which specified the GMST data used for analysis. This is the minimum amount of information needed to reproduce and test the claims of a 'pause' in these papers. The application of these criteria reduced the subset to 90 papers, which is the analysis subset used here and denoted 'pause-papers'. The number of papers published each year on the 'pause' is shown in figure 1(a) and rises substantially from 2013. The 'pause-research period' (as reflected by published papers) extends from about 2010 through the present.
Note that the 90 'pause-papers' are the subset that refer to a climate 'pause' and that provide sufficient information to reconstruct the nominal notion of the pause for that paper (the period used and the GMST dataset(s)). Many of these papers presuppose the existence of a 'pause' and address issues that are conditional upon its existence, without necessarily providing their own analysis or evidence for the identified 'pause'. The purpose of this literature set is to allow us to develop a picture of what the GMST pause is presumed to be in the literature, capturing areas of diversity and commonality. Further, the set of 'pause-papers' allows us to be inclusive in capturing all the different definitions used for the pause in GMST in our analysis here. The set of papers is listed in the appendix.
There is no single or dominant definition of the 'pause' in the literature (Lewandowsky et al 2015b). Many papers are not explicit about the period used to assess the pause or the criteria used to reach the conclusion that there is a pause. The distribution of start dates from the pause-papers (set of 90) for the 'pause' is shown in figure 1(b). These span a range from 1995 to 2004 illustrating the lack of consensus on this issue. Further, there is usually little or no statistical justification offered for choice of the start-year. This is a critical issue which we return to in section 3.4. Similarly, the durations presumed for pauses in the pause-papers span a range from about 10 to 20 years with a median of 15 years (figure 1(c)). The number of times a given year falls into the period defined as a pause across all the pause-papers is shown by the histogram in figure 1(d). The frequency profile of the histogram reveals a 'pause-period' in the literature spanning roughly 1998-2015.
The pause-period was selected by the authors of pause-studies to correspond to a period where the rate of warming is slower than the average longer-term warming rate. This period can be highlighted and placed in context by showing a sliding sequence of short-term trends in GMST through the modern period (figure 2). By colour-coding the trends as red/blue according to whether they are warming faster/slower than the longer-term rate of warming, it is apparent that there are persistent periods of faster and slower than average warming. The pause-period in the pause-literature shows up as the second slower than average warming period on the plot. The identification of a period of slower than average warming does not suffice to demonstrate that such a period is statistically unusual. For that, more formal criteria would need to be applied.
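The sliding sequence of short-term trends described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the series is a synthetic ramp plus a sinusoid rather than any GMST dataset, and the 11-year window is simply one of the interval lengths discussed in the text.

```python
import math


def slope(ts, ys):
    """Least-squares slope of ys against ts."""
    n = len(ts)
    tbar = sum(ts) / n
    ybar = sum(ys) / n
    sxx = sum((t - tbar) ** 2 for t in ts)
    return sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys)) / sxx


def classify_sliding_trends(years, values, window=11):
    """Label each sliding window as warming 'faster' or 'slower' than the
    trend fitted to the whole series (the red/blue coding in the text)."""
    long_term = slope(years, values)
    labels = []
    for i in range(len(years) - window + 1):
        s = slope(years[i:i + window], values[i:i + window])
        labels.append((years[i], s, "faster" if s > long_term else "slower"))
    return labels


# Synthetic series (assumed values): linear warming plus a 30-year oscillation.
years = list(range(1970, 2017))
values = [0.017 * (y - 1970) + 0.1 * math.sin(2 * math.pi * (y - 1970) / 30.0)
          for y in years]
labels = classify_sliding_trends(years, values)
```

Even in this noise-free synthetic series the windows alternate between 'faster' and 'slower', which is why a slower-than-average window is not by itself evidence of anything unusual.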
Different criteria have been used to constitute a 'pause' in the pause-papers. Most early papers employ it in a manner consistent with the common sense usage to signify an absence of a warming trend (no trend). Later papers, however, often use it to signify a reduction in the warming trend, i.e. a slower than normal trend. This shift in definitions, by itself, might indicate a problem, as it shows that even at the time, the scientific community was unclear and inconsistent as to what the object of study was. In this paper we test both claims. To illustrate these definitions we have redrawn figure 2 in idealised form in figure 3. Here, we represent the GMST series (without interannual variation) in its idealised form as undergoing regular fluctuations about a long-term mean warming rate (the dashed black line). The fluctuating line is again coloured red when the trend is greater (warming faster) than the longer-term mean trend and blue when it is smaller (warming slower) than the mean trend. One expects short-term trends to fluctuate faster and slower through time than the longer-term trend as illustrated here. There has been little research attention on the faster fluctuation that preceded the slower fluctuation that is the target of the pause-papers (Rahmstorf et al 2007, Lewandowsky et al 2015a). The 'slow' trend view of the pause (figure 3) is seldom defined formally in the pause-literature. It could refer (as in Stocker et al 2013) to a meaningful change in the trend (slope) of GMST in the pause-period relative to the longer-term trend that prevailed prior to the pause-period (change in trend). Alternatively, it could refer to a claim that the trend during the pause-period is unusual relative to trends of a similar length during the modern warming period (unusual trend). For example, the pause-period fluctuation in figures 2 and 3 could be assessed against slower than average fluctuation periods such as the prior one in the 1980s.
We will restrict this comparison to fluctuations that occur through the period that GMST has been fairly steadily increasing (with fluctuations) to avoid including a large sample of the early record when the longer-term warming trend was much weaker. An objective way to determine how far back to include past fluctuations is to assess the GMST record for meaningful 'change-points' in trend (Cahill et al 2015). We have performed change-point analysis on each of the GMST records used here and find changes in each dataset near 1970. This is consistent with other analyses and with the choice often made in the literature to define the modern warming period. In all the analyses to follow we use the change-points (near 1970) particular to each dataset in assessing how unusual the recent slower fluctuation is.
Figure 2. [...] The thin red and blue lines are linear 11-year trend lines sliding over the period. These lines are red/blue when the slope of the 11-year fit is greater/less than the slope of the longer-term dashed line. The choice of interval length here (11 years) is arbitrary, but all interval lengths used in the pause-papers will exhibit periods of faster and slower than average trend.

Methods and data
The data used to construct records of GMST consist of a diverse set of observations of temperature collected over land (typically surface air temperature) and oceans (typically sea surface temperature (SST)) through time. The construction of GMST series requires the blending of these observations and removal of any known biases (debiasing) in the data (Karl et al 1989, Jones et al 1999). These efforts have been carried out principally by groups in the US and UK, and provide estimates of GMST back into the 19th century.
The time series of GMST from five of the principal groups constructing records of GMST are shown in figure 4. The five series exhibit clear variability at interannual and decadal to multidecadal scales, with a long-term warming trend. While there are some differences between the series as represented by the five different datasets here, they display very similar variability and long-term trends. As such, the differences between the datasets have historically been more of interest to specialists in the field, as they yield very similar views of the climate response to greenhouse gases.
More recently, the literature on the alleged 'pause' in GMST has brought about a shift in focus to consider short-term trends in these data (of typically 10-15 years duration). Short-term trends can be quite sensitive to small differences in end points in trend intervals, and thus the small differences between the GMST datasets can matter in determining trend magnitudes (Risbey and Lewandowsky 2017). All of the GMST data sets are evolving over time as they better account for measurement errors, extend coverage, add or change interpolation methods, and implement improved bias reduction on past data. We do not provide a review of these issues here, but we do single out a couple of issues that have played a role in assessing short-term trends over the pause-period. These are data coverage (Cowtan and Way 2014), and the bias reduction of SST data (Karl et al 2015, Hausfather et al 2017, Kent et al 2017). Improvements related to coverage and SST debiasing of the data over the past decade have resulted in changes to estimation of recent trends in some of the datasets. We provide an assessment of these issues here as they relate to claims that the recent temperature fluctuation represents a 'pause' in greenhouse warming. Another critical issue in characterising short-term climate trends is their statistical treatment. Here again we single out two particular issues that have effectively confounded claims in the pause-literature about the prominence or otherwise of short-term GMST trends. The issues relate to the selection of a short interval to analyse the GMST trend that seems to depart from the long-term trend. To be fair in this comparison, one must properly account for the selection process (the 'selection bias' issue) and whether the trend in the interval is continuous or broken relative to neighbouring intervals (broken trends) (Rahmstorf et al 2017). These issues are described below and we assess their role in the interpretation of short-term GMST trends and the 'pause'.
Figure 3. Idealised schematic of a smoothed global mean surface temperature series (blue/red line). The series is a linear trend plus sinusoidal variation to mimic multidecadal fluctuations. The dashed line is the linear fit component. The series is coloured red/blue when the local gradient is steeper/shallower than the linear fit. The inset boxes show a red segment where the slope is compared to the long-term linear slope, and a blue segment where the slope is compared to either the long-term linear slope or to zero slope.

Datasets
The main data for this review of short-term GMST trends are the five GMST series (as they existed at different points in time) that formed the basis of the trend assessments for papers on an alleged 'pause'. The datasets and some of their properties are described in table 1.
All of these datasets have undergone some forms of bias reduction effort over the past decade during which the climate community has focused on short-term GMST trends. This means that different versions of the data were in play at different times. The Berkeley [...] the SST records have a cool bias during the recent period, which does affect the magnitudes of short-term trends (section 4.1).

Table 1. Data sets used to represent GMST and some characteristics related to data coverage. The global coverage is calculated for each dataset during 1981-2010 and is the average percentage of the global surface covered by grid cells with data. The release dates given are for when the data was available to the public. If no release date is given, the data set had been in use well before the period of research on the 'pause'. The number of versions of each dataset available for this research is Berkeley (7), Cowtan and Way (2), GISTEMP (113), HadCRUT (7), and NOAA (31).
The differences in global coverage among the datasets (see table 1) also matter for determination of short-term trends. That is because the differences relate substantially to whether the Arctic region is well represented (Berkeley, Cowtan and Way, GISTEMP) or not (HadCRUT, NOAA), given that the Arctic has been warming fast enough relative to the global mean rate over the recent period to make a difference (Benestad 2008, Rahmstorf et al 2017). Note that Cowtan and Way is based on HadCRUT, except that Cowtan and Way include data coverage in the Arctic by applying kriging techniques to interpolate into the Arctic (Cowtan and Way 2014). The differences between Cowtan and Way and HadCRUT4 thus provide a direct measure of the role of data coverage, at least over recent decades when observational coverage is sufficient to properly support near-global temperature reconstruction.
The analyses conducted here have been repeated with all six of the datasets shown in table 1. For the sake of presentation we sometimes show results for just GISTEMP and HadCRUT. One reason for this is that we seek to provide a retrospective assessment of what the trends looked like at different points in time, and these two datasets were in use throughout the period of research on the 'pause', whereas some of the other datasets (Berkeley, Cowtan and Way) were only available after the start of this research. GISTEMP and HadCRUT are also good choices for contrasting lower data coverage (HadCRUT) and near-complete coverage (GISTEMP). HadCRUT effectively provides the lowest estimate of short-term warming trends throughout the 'pause' research period and so provides a lower bound on what a pause-researcher with no insight into the differences between data sets might infer concerning the GMST trend.
All of the GMST datasets have been truncated here to the period 1880-2016. All data were converted to anomalies using a common reference period of 1981-2010. This reference period is suitable because the different SST records are most consistent over this period, and it avoids the recent changes in ship bias (Kent et al 2017, Hausfather et al 2017). All trends calculated here are linear trends using least squares regressions. The choice of linear trends matches usage in the literature, and can be justified over the period since about 1970 in which no new change points are detected in the GMST series (Cahill et al 2015, 2018). Versions of the GMST datasets were archived for analysis as they existed at different points in time over the 'pause' research period. This allows us to provide a set of 'historical' views of what the GMST trends looked like at different points in time as described in the next section. These data are available at: https://git.io/fAuos. In some of the analysis here we break the GMST time series into a baseline-period and a pause-period to perform statistical tests. The baseline-period extends from the start of the modern warming period up to the beginning of each pause-period tested. The modern warming period is assumed to start at the last significant change point detected in each GMST series (Cahill et al 2015). These are 1970 for GISTEMP, NOAA, and Berkeley, and 1974 for HadCRUT and Cowtan and Way. A range of different intervals are used to test many different choices of pause-periods, with the range of intervals encompassing the set of periods inferred from the literature in figure 1(d).
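The two basic operations described above (conversion to anomalies relative to 1981-2010, and a least-squares linear trend over a chosen interval) can be sketched as follows. This is an illustration under stated assumptions, not the archived analysis code: the function names are hypothetical, annual means are assumed to be held in plain lists, and the example series is a synthetic ramp.

```python
def to_anomalies(years, temps, ref_start=1981, ref_end=2010):
    """Subtract the mean over the reference period (inclusive)."""
    ref = [t for y, t in zip(years, temps) if ref_start <= y <= ref_end]
    if not ref:
        raise ValueError("no data in the reference period")
    base = sum(ref) / len(ref)
    return [t - base for t in temps]


def trend_per_decade(years, values, start, end):
    """Least-squares linear trend over [start, end] inclusive, per decade."""
    pts = [(y, v) for y, v in zip(years, values) if start <= y <= end]
    if len(pts) < 2:
        raise ValueError("need at least two points to fit a trend")
    n = len(pts)
    ybar = sum(y for y, _ in pts) / n
    vbar = sum(v for _, v in pts) / n
    sxx = sum((y - ybar) ** 2 for y, _ in pts)
    sxy = sum((y - ybar) * (v - vbar) for y, v in pts)
    return 10.0 * sxy / sxx


# Synthetic example (assumed values): an exact 0.017 K/yr ramp, 1880-2016.
years = list(range(1880, 2017))
temps = [14.0 + 0.017 * (y - 1970) for y in years]
anoms = to_anomalies(years, temps)
trend = trend_per_decade(years, anoms, 1998, 2012)
```

Re-baselining shifts the whole series by a constant, so it changes the anomaly values but not the fitted trend.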

Historical and hindsight trends
We could, and have, examined GMST trends over the pause-period using the latest available GMST data (through 2016 here). We term this view of the trends the 'hindsight' view since the current (hindsight) GMST data has the benefit of any and all bias reductions that have taken place in the preceding periods. While 'hindsight' provides the best current view of the pause-period trends, the calculation of trends during the pause-period necessarily relied on the versions of the data that were available at the points in time when the research was conducted. To be fair to researchers at any given point in time, we have also calculated a set of 'historically-conditioned' trends for each of the GMST datasets. The historically-conditioned trends use the versions of each of the datasets that were current at the time the trend was calculated.
The concept of 'historically-conditioned' trends is illustrated for the HadCRUT dataset in figure 5. This figure shows the trend value for trends starting in 1998 and ending at the points in time marked on the x-axis (vantage year). The solid line is the series of historically-conditioned trend values and uses only data that was available up to the time of the vantage year. Different versions of the HadCRUT data are indicated by the dots on the trend line. The trend line shows one particularly large jump up (of about 0.05 K/decade) just after 2012, corresponding to the switch from HadCRUT3 to HadCRUT4 (where the ship-buoy bias went from uncorrected to partially corrected (table 1)). The thin lines in figure 5 show the trends calculated back to the earlier vantage years as if the newer versions of the data existed in the earlier periods. The difference between the thick historically-conditioned trend line and the thin (hindsight) lines is thus an indication of how the trends change between different versions of the datasets. In this case, the differences between HadCRUT3 and HadCRUT4 are quite marked as indicated by the large differences between the solid and thin lines.
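The distinction between historically-conditioned and hindsight trends can be sketched as follows. The data layout (a list of `(release_year, {year: anomaly})` snapshots) and the function names are assumptions made for illustration, and the two synthetic versions below are invented values rather than HadCRUT3 or HadCRUT4 data.

```python
def slope_per_decade(points):
    """Least-squares slope of (year, value) pairs, per decade."""
    n = len(points)
    ybar = sum(y for y, _ in points) / n
    vbar = sum(v for _, v in points) / n
    sxx = sum((y - ybar) ** 2 for y, _ in points)
    sxy = sum((y - ybar) * (v - vbar) for y, v in points)
    return 10.0 * sxy / sxx


def version_at(versions, vantage):
    """Latest archived version released on or before the vantage year."""
    eligible = [v for v in versions if v[0] <= vantage]
    if not eligible:
        raise ValueError("no version available at this vantage year")
    return max(eligible, key=lambda v: v[0])[1]


def conditioned_trend(versions, start, vantage):
    """Trend from start to vantage using the version current at vantage."""
    data = version_at(versions, vantage)
    return slope_per_decade([(y, data[y]) for y in range(start, vantage + 1)
                             if y in data])


def hindsight_trend(versions, start, vantage):
    """Same interval, but always using the most recent version."""
    data = max(versions, key=lambda v: v[0])[1]
    return slope_per_decade([(y, data[y]) for y in range(start, vantage + 1)
                             if y in data])


# Two hypothetical versions: the later one has a steeper post-1998 trend.
old = {y: 0.005 * (y - 1998) for y in range(1998, 2012)}
new = {y: 0.010 * (y - 1998) for y in range(1998, 2017)}
versions = [(2005, old), (2012, new)]
hist_2010 = conditioned_trend(versions, 1998, 2010)   # uses the old version
hist_2013 = conditioned_trend(versions, 1998, 2013)   # uses the new version
hind_2010 = hindsight_trend(versions, 1998, 2010)     # new version, old vantage
```

In this toy case the historically-conditioned trend jumps when the version changes, while the hindsight trend shows what the earlier vantage year would have looked like had the later version existed.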

Continuous and broken trends
In the calculation in figure 5 we performed the trend calculations as is traditional in the pause-literature, using a simple trend between a start and end date. However, when the trend is fitted to the data in this way (without regard for the years preceding or following the chosen start and end years respectively), the isolated trend is 'broken', meaning that it is not continuous with trends in the remainder of the data. This has implications for the testing of the data (Rahmstorf et al 2017).
An example of 'broken' and 'continuous' trends in the HadCRUT3 GMST data is shown in figure 6. Here the trend in the pause-period from 1998 to 2012 is shown as a broken-trend (not continuous with prior trends) by the dashed red best-fit trend line over the period. The red broken-trend line for the baseline-period 1970-1998 preceding the broken-pause trend exhibits a jump discontinuity at the common year in 1998. This introduces an extra degree of freedom into the trend analysis which affects the assessment of statistical significance. Such a jump should be explicitly mentioned (e.g. 'temperature jumped upward and then remained flat', rather than just stating it remained flat), and it would normally require some physical justification as to why such a jump in the series should be modelled here (Rahmstorf et al 2017). Such allowance and justification are largely absent from the pause-literature that purports to find a pause. A more parsimonious and physical assumption that does not introduce the extra degree of freedom is to model the trends as continuous trends as shown by the dashed blue trend lines in figure 6. The change in slope of the continuous trend line is much less severe during the pause-period. In testing short-term GMST trends against the hypothesis of 'no trend' we will show results for both broken and continuous trends.
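The difference between the two fits can be made concrete with a small sketch. Here the 'broken' fit is taken to be two independent least-squares lines sharing the year 1998, and the 'continuous' fit is taken to be a single least-squares fit of two line segments hinged at 1998; that hinge formulation, the function names, and the synthetic step series are assumptions for illustration rather than the exact procedure of Rahmstorf et al (2017).

```python
def ols(ts, ys):
    """Return (intercept, slope) of a least-squares line."""
    n = len(ts)
    tbar = sum(ts) / n
    ybar = sum(ys) / n
    sxx = sum((t - tbar) ** 2 for t in ts)
    b = sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys)) / sxx
    return ybar - b * tbar, b


def solve(a, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def broken_fit(ts, ys, t0):
    """Independent lines before and after t0 (t0 belongs to both).
    Returns (baseline slope, pause slope, jump at t0)."""
    base = [(t, y) for t, y in zip(ts, ys) if t <= t0]
    pause = [(t, y) for t, y in zip(ts, ys) if t >= t0]
    a1, b1 = ols(*zip(*base))
    a2, b2 = ols(*zip(*pause))
    return b1, b2, (a2 + b2 * t0) - (a1 + b1 * t0)


def continuous_fit(ts, ys, t0):
    """Joint fit of y = a + b*(t - t0) + c*max(t - t0, 0), which forces the
    two segments to meet at t0. Returns (baseline slope, pause slope)."""
    rows = [[1.0, t - t0, max(t - t0, 0)] for t in ts]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    _, b, c = solve(xtx, xty)
    return b, b + c


# Synthetic 'jumped upward and then remained flat' series (assumed values).
ts = list(range(1970, 2013))
step = [0.017 * (t - 1970) if t < 1998 else 0.576 for t in ts]
base_slope, broken_pause_slope, jump = broken_fit(ts, step, 1998)
cont_base_slope, cont_pause_slope = continuous_fit(ts, step, 1998)
```

For this synthetic series the broken fit reports a flat pause-period only by absorbing the change into a jump at 1998, whereas the continuous fit, which has no jump available, returns a positive pause-period slope.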

Selection bias
In the set of pause-literature supporting the notion of a pause in GMST, there is often little or no explanation for why the pause-period used was chosen. While there are differences in the periods chosen to examine GMST for a pause (figure 1), the periods have in common the property that they roughly cover the interval from the late 1990s to early 2010s when GMST was fluctuating with a slower short-term trend than the long-term trend (Risbey et al 2015) (as represented by the 'blue' period in figure 3). It is clear from this commonality of pause-periods that the period was not randomly chosen or drawn. Rather, the pause-period was selected (from many possible time intervals) because of its lower trend (Rahmstorf et al 2017), as evident in figure 2. Any analysis of the significance of such a period must take into account the fact that it was selected on the basis of its value rather than randomly drawn. This is the 'selection bias' problem. This problem is not accounted for in any of the pause-papers that claim to have found a significant slowdown of warming. Since frequentist hypothesis testing requires a sampling plan that is 'blind' to the nature of the data, selecting a subset of data on the basis of its value and then testing it will have the effect of artificially raising the presumed significance of the pause-periods chosen.
Appropriate corrections for selection bias are described in Rahmstorf et al (2017) and performed as one variant of the analysis here. The selection bias problem is referred to as 'multiple testing' (Ventura et al 2004, Wilks 2006) in Rahmstorf et al (2017), since overcoming the bias in the selection period requires one to perform multiple tests for different start and end times of the tested period. The procedure used to account for selection bias must address the issue that we have only a half century or so of relatively enhanced greenhouse warming (the modern warming period), and thus few samples from which to test the unusual nature of the pause-period. The remedy applied here is to generate Monte Carlo samples from the modern period as follows: for each of the five datasets the longer-term (baseline) trend is fitted to the period from the change-point (determined circa 1970 for each dataset) up to the start of each pause-period selected (usually close to 1998). The standard deviation of the residuals about the baseline fit is calculated. We then generate synthetic realisations of GMST over the period encompassing both the baseline and the pause-period using the same linear trend as the baseline-period plus white noise with the same standard deviation as the residuals. This procedure is repeated to give 1000 synthetic realisations. We then compare the magnitude of the pause-period trend with all trends of the same length that occur (at any time) through the 1000 realisations. We report the result as the percentage of realisations that contain a trend with a magnitude smaller than the pause-period trend. We take the view here that a minimal requirement for a trend to be unusually weak (paused) with this procedure is that fewer than 5% of all realisations sampled contain a less-positive trend interval than the selected pause-period trend. If this is not satisfied, then the trend is not very unusual in relation to what one would expect to find in the case of a constant warming trend superimposed by random interannual variability.
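The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis: the function names, the use of n - 2 degrees of freedom for the residual standard deviation, the fixed random seed, and the two synthetic test series are all assumptions.

```python
import math
import random


def slope(ts, ys):
    """Least-squares slope of ys against ts."""
    n = len(ts)
    tbar = sum(ts) / n
    ybar = sum(ys) / n
    sxx = sum((t - tbar) ** 2 for t in ts)
    return sum((t - tbar) * (y - ybar) for t, y in zip(ts, ys)) / sxx


def selection_bias_percent(years, values, change_point, pause_start,
                           pause_end, n_real=1000, seed=0):
    """Percentage of synthetic realisations (baseline trend plus white noise)
    that contain at least one window, of the same length as the pause-period,
    whose trend is smaller than the observed pause-period trend."""
    rng = random.Random(seed)
    bt = [t for t in years if change_point <= t <= pause_start]
    bv = [v for t, v in zip(years, values) if change_point <= t <= pause_start]
    b = slope(bt, bv)
    a = sum(bv) / len(bv) - b * sum(bt) / len(bt)
    resid = [v - (a + b * t) for t, v in zip(bt, bv)]
    # Residual standard deviation; n - 2 degrees of freedom is an assumption.
    sd = math.sqrt(sum(r * r for r in resid) / (len(resid) - 2))

    pt = [t for t in years if pause_start <= t <= pause_end]
    pv = [v for t, v in zip(years, values) if pause_start <= t <= pause_end]
    pause_slope = slope(pt, pv)
    length = len(pt)

    span = list(range(change_point, pause_end + 1))
    hits = 0
    for _ in range(n_real):
        synth = [a + b * t + rng.gauss(0.0, sd) for t in span]
        weakest = min(slope(span[i:i + length], synth[i:i + length])
                      for i in range(len(span) - length + 1))
        if weakest < pause_slope:
            hits += 1
    return 100.0 * hits / n_real


# Two synthetic cases (assumed values): a 'pause' that simply continues the
# baseline trend, and one with strong cooling after 1998.
years = list(range(1970, 2013))
noisy = [0.017 * (t - 1970) + 0.1 * (-1) ** t for t in years]
steady = noisy[:]
cooling = [v if t <= 1998 else noisy[1998 - 1970] - 0.05 * (t - 1998)
           for t, v in zip(years, noisy)]
p_steady = selection_bias_percent(years, steady, 1970, 1998, 2012)
p_cooling = selection_bias_percent(years, cooling, 1970, 1998, 2012)
```

Under the 5% criterion stated above, the first synthetic case would not count as unusual because most realisations contain an even weaker window, whereas the strongly cooling case would, because essentially none do.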

Historically-conditioned and broken trends
The various GMST datasets were updated during the pause-research period, resulting in different views of short-term trends for different versions of these datasets (section 3.2). An illustration of the effect of these changes on short-term trends is shown in figure 7, which plots the magnitude of the trend since 1998 in each of the five datasets. Noticeable jumps in the trend value (solid lines) occur for HadCRUT, NOAA, and GISTEMP as they undergo the bias reductions to SST described in section 3.1.
The period from 2012 to 2014 is of particular interest, since it spans the completion of the 5th IPCC assessment report. The HadCRUT, NOAA, and GISTEMP trends appear to be in good agreement over this period. However, this agreement is illusory, because the NOAA and GISTEMP records did not include corrections for the SST bias until late 2014. The apparent agreement in trends arises from HadCRUT4 underestimating the rate of warming due to incomplete area coverage and the remaining datasets underestimating warming due to the uncorrected SST data.
Some reduction in the spread of trend values across the datasets would be expected for later vantage years as the actual trend interval considered is longer. However, this is not the main factor in this case.
Figure 7. Historically conditioned trends as in figure 5, but for all 5 datasets. The major jumps in the trends occur when HadCRUT shifts from HadCRUT3 to HadCRUT4 (green curve) and when GISTEMP and NOAA shift from ERSSTv3 to ERSSTv4 (light blue and navy blue curves). The trends in this figure are all 'broken trends' (section 3.3) in that they start in 1998 without a requirement that the trend start value is continuous with trends before that time.

Assessment of no-trend
In this section we provide a systematic assessment from historical and hindsight perspectives on whether one could make a statistical determination that there was no-trend in GMST during the pause-period. For this assessment we show results for both the HadCRUT and GISTEMP data, since both datasets have been heavily used through the pause-research period and HadCRUT provides a lower bound on the magnitudes of pause-period trends.
We assessed a matrix of trends from 3 to 25 years duration from vantage points (i.e. the last year of the trend interval) between 1989 and 2016 as shown in figure 9 for GISTEMP. Note that this set of intervals includes all those used in the pause-literature for the 'pause' along with earlier intervals to provide further context for the pause-period. The earliest vantage year considered here is 1989 so as not to include intervals that are substantially outside the period of modern warming. The colour scale shows positive trends in red and negative trends in blue. It is clear right away that trends of less than about 10 years duration are 'noisy' in the sense that they could be of either sign. It is generally considered that intervals of about 17 years are needed to obtain sufficient power to detect a signal in GMST (Lewandowsky et al 2015b) or tropospheric temperature (Santer et al 2011). The trends significant at the level p<0.05 in figure 9 are represented by black dots in the matrix. If there is no black dot in a square here, one has failed to reject the hypothesis that there is no trend in the data. In the landscape of trends provided by these diagrams one is interested in whether it takes longer to reject the hypothesis of zero trend during the pause-period than at other times.
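The trend matrix described above can be sketched as follows. This is an illustrative outline only: the function names `trend_and_t` and `trend_matrix` are hypothetical, the trends are simple 'broken' least-squares fits, and the t-statistic assumes uncorrelated residuals, which may differ from the significance test used for figure 9.

```python
import math

def trend_and_t(y):
    """OLS slope per time step and its t-statistic (slope / standard error)."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y)) / sxx
    intercept = y_mean - slope * t_mean
    sse = sum((v - intercept - slope * t) ** 2 for t, v in enumerate(y))
    se = math.sqrt(sse / (n - 2) / sxx)
    if se == 0.0:
        # Perfect fit: no residual scatter to test against.
        t_stat = 0.0 if slope == 0.0 else math.copysign(math.inf, slope)
    else:
        t_stat = slope / se
    return slope, t_stat

def trend_matrix(years, gmst, vantage_years, durations):
    """Map (vantage_year, duration) -> (slope, t_stat) for the trend over the
    `duration` years ending in `vantage_year`. Intervals that would start
    before the first year of the record are skipped."""
    index = {yr: i for i, yr in enumerate(years)}
    out = {}
    for v in vantage_years:
        for d in durations:
            end = index[v] + 1
            start = end - d
            if start >= 0:
                out[(v, d)] = trend_and_t(gmst[start:end])
    return out
```

A cell would be marked significant where the absolute t-statistic exceeds the two-sided Student-t critical value for `duration - 2` degrees of freedom. Because this sketch ignores autocorrelation in the residuals, it is likely to overstate significance for real GMST series.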
For GISTEMP using broken trends (figures 9(a), (b)) two major periods of non-significant trend show up (represented by intervals extending to 17 years to reject the no-trend hypothesis). The second of these periods corresponds to the pause-period. Neither of these periods seems statistically unusual. For GISTEMP there is also little difference in this picture whether one uses historical (figure 9(a)) or hindsight (figure 9(b)) versions of the data to calculate the trends. When switching from broken to continuous trends (figures 9(c), (d)) any weakly significant trend in GMST is even less pronounced, and it takes only 12 years to reject the no-trend hypothesis during even the slower fluctuation periods.
For HadCRUT (figure 10) the matrix of trend magnitudes and intervals to reject the no-trend hypothesis is broadly similar to that for GISTEMP. That is, with broken trends (figures 10(a), (b)) there are two slower than average fluctuations evident in the matrix of trends, and the time needed to reject zero trend is similar to GISTEMP and similar using historical or hindsight versions of the data. Further, the use of continuous trends for the analysis (figures 10(c), (d)) reduces the interval needed to reject zero trend to only about 12 years. In short, the trend matrices for GISTEMP and HadCRUT both show that there are no unusual or remarkable periods where it takes longer than expected to eliminate the no-trend hypothesis. This conclusion is not sensitive to whether one used historical data or not, or whether one used broken or continuous trends.

Assessment of an unusual trend
The view of the 'pause' as a significant slowing of trend can be assessed either as a change in trend or as an unusual trend. In this section we address the 'unusual trend' definition, and in the following section we address the 'change in trend' definition.
A pause-period trend would be unusual if it were very unlikely to find similar length trends with such a weak magnitude during the modern warming period. The Monte Carlo testing procedure described in section 3.4 has been applied to each of the GMST datasets over a combination of pause-segments of varying lengths and start times sufficient to span the range of pause-definitions found in section 2.1 (figure 1(d)). For each pause-segment there is a baseline-segment spanning the interval from the change point detected in the dataset up to the pause-segment. The Monte Carlo series is generated over the combination of these two periods and provides the basis to assess whether the pause-segment is unusual.
The results for GISTEMP and HadCRUT are shown in figure 11. A matrix of pause-segments is represented. The vantage year (x-axis) is the last year of each pause-segment tested, and the number of years included (y-axis) defines how far back the pause-interval extends. For every pause-segment represented by an element in the matrix we have tested intervals of the same length throughout the Monte Carlo realisations of the series. The left column of figure 11 uses the historical data as they existed for each interval of the matrix, and the right column is for the hindsight version of each dataset.
Figure 11. Multiple tests of a matrix of pause-interval trends. Each element of the matrix corresponds to a trend-interval spanning the last year of the interval (vantage year) back through the number of years listed on the y-axis. For example, the top right element of the matrix is for the 19 year trend-interval ending in 2016. Each of these trends is compared with a population of trends. The population is generated by taking the baseline-interval from the change point in the dataset signifying the modern warming period to the start of each trend-interval. The residuals to a best fit from this baseline-interval are then used to generate 1000 Monte Carlo realisations of a series from the beginning of the baseline-interval to the end of the trend-interval. The magnitude of the trend-interval is then assessed against a population of all intervals of the same length that occur any time in the Monte Carlo realisations. The shading denotes the percentage of realisations that contain an interval of a lower trend magnitude than the pause-interval trend tested. The numbers in the squares are the actual percentages. Where a yellow circle is present it denotes that fewer than 5% of the realisations in the Monte Carlo sample contain a lower magnitude trend interval than the tested pause-interval trend. For broken trends for GISTEMP using (a) historical data and (b) hindsight data, and for HadCRUT using (c) historical data, and (d) hindsight data.
For GISTEMP the pause-period trends are not at all unusual as shown by the high proportions of realisations in the Monte Carlo sample that contain a trend with magnitude lower than that in the pause-period. This conclusion holds whether considering historical or hindsight versions of GISTEMP. For HadCRUT the results show some differences from GISTEMP, but the conclusions are substantially the same. For the hindsight version of HadCRUT4 the pause-periods are not unusual and never drop below 8% of the Monte Carlo realisations. The same result also generally holds for the historical HadCRUT, but there are a few isolated choices of pause-interval (given by the yellow circles in figure 11) where there is a smaller than 5% chance of finding a lower magnitude trend in the Monte Carlo sample. For two choices of pause-intervals in the HadCRUT historical Monte Carlo data there is a 1 in 25 chance of obtaining a lower trend magnitude than the pause-interval. Such odds are not that unusual given that the analysis involves multiple trials by testing different possible durations of slowdown intervals, which increases the likelihood of finding one by chance. To be unusual, one would expect to see a more sustained set of intervals about these intervals that are also indicative of low-odds weak trends. That is not the case for even the HadCRUT historical data, where those few occasions where the odds drop below 5% are among intervals where the trends are more typical. As such, the evidence that the pause-intervals are unusual is weak, even in the most favourable configuration (HadCRUT historical) for such evidence.
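The point about multiple trials can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that the trials are independent; the overlapping intervals in the matrix are correlated, so the numbers indicate the direction of the effect rather than its size in this analysis. The function name `chance_of_at_least_one` is hypothetical.

```python
def chance_of_at_least_one(alpha, n_tests):
    """Probability that at least one of n_tests independent trials falls
    below a per-trial threshold alpha purely by chance."""
    return 1.0 - (1.0 - alpha) ** n_tests

# With a 5% per-trial threshold, ten independent trials give roughly a
# 40% chance of at least one apparently 'unusual' result.
print(round(chance_of_at_least_one(0.05, 10), 3))
```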

Assessment of a change in trend
The assessment of unusual trends above allowed each pause-segment tested to be 'broken'. A more reasonable test is to ensure that each segment tested is continuous with the data that precedes it (section 3.3). When the Monte Carlo assessment is carried out with continuous trends it then becomes a test of a change in trend between the baseline-segment and the pause-segment. When the baseline-segment and pause-segment share the overlapping year in common without a jump (continuous trend), then the proportion of Monte Carlo series containing a lower magnitude trend than the pause-segment provides a statistical measure of the change in trend between the baseline and pause segments.
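The distinction between broken and continuous trends can be sketched as follows. This is one possible formulation, assumed here for illustration: the continuous trend is a joint least-squares fit of a line whose slope changes at a chosen index `k` without a jump, and the broken trend is an independent fit to the data from `k` onwards. The function names are hypothetical and the fitting details of section 3.3 may differ.

```python
def solve_linear(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def continuous_trends(y, k):
    """Fit y[t] = a + b*t + c*max(0, t - k): the slope changes at index k
    but the fitted line has no jump. Returns (baseline_slope, pause_slope)."""
    rows = [[1.0, float(t), float(max(0, t - k))] for t in range(len(y))]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(3)]
    _, b, c = solve_linear(xtx, xty)
    return b, b + c

def broken_trend(y, k):
    """Slope of an independent OLS line through y[k:], free to jump at k."""
    seg = y[k:]
    n = len(seg)
    t_mean = (n - 1) / 2.0
    y_mean = sum(seg) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    return sum((t - t_mean) * (v - y_mean) for t, v in enumerate(seg)) / sxx
```

For a series that steps up at index `k` and is flat afterwards, the broken fit returns a slope near zero, whereas the continuous fit returns a positive slope because its line must climb from the baseline to the new level. The extra freedom to jump is what the broken fit leaves unaccounted for.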
The results for Monte Carlo tests with continuous trends are shown in figure 12. As in figure 11 the tests are shown for GISTEMP and HadCRUT with both historical (left column) and hindsight (right column) data. In practice it makes no real difference to the results whether hindsight or historical data are used. For both cases and both datasets the change in trend is not unusual. Even for HadCRUT historical data, at least 10% of the Monte Carlo realisations always contain a trend of lower magnitude than the pause-segment trend. Thus, the evidence to support a change in trend in GMST during the pause-period is similarly lacking.

Types of evidence
In this section we review the evidence for a 'pause', as it was asserted in the GMST record. We review the evidence through time as it depended on different versions of the GMST record, on the number of years available in the record, and on the methods applied to assess the record. Use of the term 'evidence' implies that the information has some substantive meaning to the nature of the phenomenon asserted: in this case the claim of an unusual and noteworthy period of global temperature trend that has such different characteristics from prior temperature fluctuations that it warrants its own name and can be posited as a form of counter-evidence to global warming (e.g. Guemas et al 2013, Kosaka and Xie 2013, England et al 2014, Santer et al 2014). The evidence for this could be strong, partial, or absent. Evidence can also be current in the sense that it continues to be sustained by data and reason. Evidence can also be 'apparent' in the sense that it appears (or appeared) to support the existence of the phenomenon, but upon closer inspection turns out not to be substantive.

Analysis choices and evidence
In section 2 we noted that the 'pause' is typically neither clearly defined nor consistently defined in the literature. It is possible to characterise the range of pause-period definitions by surveying what is used to assess 'pauses' in the pause-literature (figure 1). All our assessments of pause-periods sampled the entire range of pause-periods used. The views of the 'pause' for observations in the literature divide into assessments of 'no-trend' or a 'slow-trend' as illustrated in figure 13. Much of the pause-literature models the trends as 'broken' trends, but does not take into account the additional degree of freedom introduced by that choice (section 3.3), nor the need to account for selection bias (section 3.4). The branches in figure 13 represent allowance for those choices in examination of pause-trends. For the 'slow-trend' definition of the 'pause', use of broken trends amounts to a search for unusual trends (section 4.3), whereas use of continuous trends tests for a change in trend (section 4.4).
In any assessment of pause-trends one must select sources of GMST data. Many studies use a single data source, though it is prudent, given the sensitivity of trends to uncertainties in the data, to sample multiple sources. The HadCRUT, GISTEMP, and NOAA datasets were available to researchers throughout the entire pause-research period. Versions of the Cowtan and Way and Berkeley data came online during the pause-research period, and were thus only partly available (see table 1). We represent the data-choice available to researchers by the penultimate branches, HadCRUT, GISTEMP, and Other (NOAA, Cowtan and Way, Berkeley) in figure 13. Thus, descending the tree in the figure, a typical researcher makes choices (explicitly or implicitly) about how to define the 'pause' (no-trend or slow-trend), how to model the pause-interval (as broken or continuous trends), which (and how many) datasets to use (HadCRUT, GISTEMP, Other), and what versions to use for the data with what foresight about corrections to the data (historical, hindsight). For example, a researcher who chose to define the 'pause' as no-trend and selected isolated intervals to test trends (broken trends) using HadCRUT3 data would be following the left-most branches of the tree. These assessments could be made at various points in time during the pause-research period. The bottom rows in figure 13 represent assessments made for each year from 2010 through 2016.
Since the GMST datasets changed through time during the pause-research period, we kept track of the 'historical' data that were available at the time any pause-research was conducted, and made sure that one line of our analysis used only historical data. We also redid all assessments using the most recent versions of each dataset, termed 'hindsight' here. In practice, some datasets incorporated improvements before others. Further, some of the deficiencies in the historical datasets were known at the time. For example, the effect of a lack of Arctic coverage on assessment of GMST trends was known before the pause-research period (Benestad 2008, Simmons et al 2010), and was addressed in some datasets, but not others (table 1). The presence of a bias in the SSTs arising from the increase in buoy observations was also known prior to most of the pause literature (Smith et al 2008). As such, even when using purely historical data, researchers often have some knowledge of limitations in the data used, of improvements available, and of the effects of those changes. That is, the 'historical' perspective is not entirely blind to the 'hindsight' data, and thus in practice the historical researcher sits some way between these perspectives.
From the 'hindsight' (current data) perspective, the results of this study are unanimous in showing no evidence for a statistical 'pause' in GMST. This unanimity is represented by the bottom rows (years 2010-2016) in figure 13 for all the 'hindsight' branches. The open green symbol on these branches is used to indicate little or no statistical evidence. Using hindsight GMST it does not matter how one defines the 'pause' (as a lack of trend or as a slow trend), whether one models the trends as broken or continuous, or even which version of GMST one uses (HadCRUT, GISTEMP, NOAA, Cowtan and Way, or Berkeley). For any given set of choices on the above, the result is the same in showing a lack of statistical evidence for a 'pause'. That is, looking back using current data we cannot find any evidence for a 'pause', even using the most generous (and statistically dubious: broken trends, no selection bias correction) assumptions of how to model and define the 'pause' (no trend, slow trend).
Moving to the purely 'historical' data perspective, the result of the combination of tests from section 4 is substantially the same. For the no-trend definition of the 'pause' the interval length in years required to obtain significant trends is longer if using broken trends than continuous trends. However, even for broken trends, the interval is about 17 years, consistent with the result in the literature that it takes about this long to establish a signal (Santer et al 2011, Lewandowsky et al 2015b). This conclusion does not depend on the dataset used. Since there is nothing statistically unusual in this result, we have classified it as 'little or no evidence' in figure 13. Redefining the 'pause' from no-trend to a slower than average trend introduces ambiguity into the definition of the 'pause' (Lewandowsky et al 2015b). It also creates confusion by using a common-language term in an uncommon manner. However, even if we accept the 'slow trend' definition of the 'pause' in historical data there is still little indication of a statistically unusual pause. The closest any of the tests comes to showing evidence is for the use of broken trends to test for an unusual trend. In the sole case of a choice of historical HadCRUT data there are a few isolated trend intervals that occur in the Monte Carlo realisations at the 1 in 20 to 1 in 25 level (4%-5%) for low trend values. We have judged this to be weak evidence in figure 13 as such levels of occurrence are not very extreme and are not sustained outside a few intervals. Further, the length of the intervals reaching this level (13, 14, and 16 years) is less than that typically required to demonstrate signal in GMST trends. And, in any of the other datasets available at the time, the pause-segment trend values for these particular intervals are even more common.
Figure 13. Tree representation of choices to represent and test pause-periods. The 'pause' is defined as either no-trend or a slow-trend. The trends can be measured as 'broken' or 'continuous' trends. The data used to assess the trends can come from HadCRUT, GISTEMP, or other datasets. The bottom branch represents the use of 'historical' versions of the datasets as they existed, or contemporary versions providing full dataset 'hindsight'. The colour coded circles at the bottom of the tree indicate our assessment of the level of evidence (fair, weak, little or no) for the tests undertaken for each set of choices in the tree. The 'year' rows are for assessments undertaken at each year in time.
The vast preponderance of outcomes summarised in figure 13 shows that there is little or no evidence (now or then) for a lack of trend or slowing of trend in GMST during the alleged pause-period. In order to infer even minimal statistical evidence for a 'pause' a researcher must have accepted all of the following: that the term 'pause' refers to a change in the rate of, rather than a cessation of warming; that a broken trend implying an upward jump in temperature at the start of the pause-period is the best way to detect a change in rate of warming; that HadCRUT is a better representation of GMST (than other data sources) despite known limitations in coverage; and that isolated intervals suffice to make the case. One may ask whether this isolated case of 'weak evidence' in figure 13 among all possible choices and outcomes is consonant with declaration of a 'pause' in GMST? The case against this is strong and includes the following points:
• The requirement to coin a new climate phenomenon, 'a pause' in observed GMST, which allegedly ran counter to greenhouse warming expectations is that the period in question should be quite exceptional or statistically unusual. The period alleged does not meet that requirement.
• Researchers knew that the climate fluctuates naturally on the time scales considered and knew to expect faster and slower than average warming periods spanning a decade or two.
• Even in the most favourable case for a pause, using HadCRUT3, the pause-period was not very exceptional.
• Researchers knew that short-term trends were sensitive to uncertainties in the GMST data, and that other GMST datasets were even less remarkable in their pause-intervals.
• There were reasons to view the other GMST datasets as equally good or better alternatives to HadCRUT for trend examination. These included updated corrections and better spatial coverage. The wisdom of this has been confirmed as HadCRUT trends move closer to the other datasets when HadCRUT is updated (figure 7).
• The pause-research literature did not reach a consensus on what the 'pause' actually was (figure 1), and the pause-definitions shifted through time.
• The pause-research literature did not generate robust statistical evidence for a 'pause'.

Alternative reviews of the evidence
With GMST now returning to a period where decadal trends are fluctuating steeper (faster) than the longer term warming rate (red periods in figures 2 and 3), various researchers have reviewed the evidence for a 'pause' in the prior slower than average warming rate fluctuation. The comprehensive review of Medhaug et al (2017) is agnostic about a pause in GMST observations and concludes that it depends 'on the time period considered, the dataset, and the hypothesis tested'. We agree with that only to the extent that there is a sensitivity to these factors. The choice of dataset and the statistical tests used contributed to a perception of a 'pause' for one dataset and for some questionable statistical tests using that dataset only. That apparent evidence was weak as discussed above and does not withstand more rigorous scrutiny with more complete or updated datasets, nor with appropriate statistical tests. Medhaug et al argue that 'the diverging conclusions' (about the reality of the pause) 'do not need to be inconsistent'. We argue that they are inconsistent because we do not accept that equally valid conclusions about the 'pause' in GMST can be reached using incomplete statistical methods and subsets of the data with known additional biases. The review of Fyfe et al (2016) is mostly addressed at the view of a pause as a discrepancy between observations and models (see Lewandowsky et al 2018 for analysis of this view of the 'pause'), but concludes on the issue of a 'slowdown' in the GMST record that the 'pause' has a sound scientific basis and is supported by observations. They argue that any baseline period used to assess a pause-period in GMST must commence from 1972, not earlier. That is consistent with our choice here to use the last significant break point in the GMST record, circa 1970 (depending on dataset), to mark the beginning of the baseline-period. 
Fyfe et al argue that using this baseline period, the trend over 2001-2014 is significantly smaller than the baseline warming rate. It is not clear whether the testing underlying this conclusion took into account testing for continuous (versus broken) trends, or for correction for the selection bias problem. Our analysis, which does take these issues into account, does not support their claim to find a soundly-based slowdown in the observational record (see also Rahmstorf et al 2017). Some of the pause-literature has not made clear the statistical tests applied, and in some cases the evidence offered has been simple visual inspection of curves without any statistical support. The 'visual' evidence for a 'pause' may seem compelling if the series is truncated in particular ways, but it does not withstand substantive statistical scrutiny.

Discussion and conclusions
In learning lessons from the pause-episode in the GMST record we can describe some elements of the pause-timeline and its consequences. The origin of the 'pause' lay in contrarian narratives about the climate (Mooney 2013, Lewandowsky et al 2015a). With the 'pause' (or 'hiatus'), a false narrative about an alleged inconsistency between natural fluctuations of global temperature and ongoing global warming was inserted into climate discussion. Once the notion of a 'pause' was established, some of the major journals gave prominent feature to articles about it (Nature 2017). The IPCC formalised the 'pause/hiatus' for the climate community in its 5th assessment report by defining and accepting it as an observed fact about the climate system (Stocker et al 2013) [Box TS.3]. Many climatologists also adopted the 'pause' or 'hiatus' into their own language about climate change. The adoption of these terms by the mainstream research community gave the 'pause' further legitimacy, even though they often explained that it was not unusual in the context of natural variability. Whether intended or not, this fed the public narrative that there was a 'pause' in global warming (Mooney 2013). To complete the cycle, researchers and climate institutions have now declared the pause to be 'over', thereby reinforcing the notion that it once existed (Kosaka 2017, Met Office 2017).
In hindsight, with current GMST datasets, there is no statistical evidence for a 'pause'. That is the case regardless of which dataset is used and even using statistical tests that inflate the significance of the results. Global warming did not pause in observations (according to any common usage of the term or in statistical terms), but clearly we need to understand how and why scientists came to the conclusion that it had in order to avoid future episodes of this kind. To this end, we pose a series of counterfactual questions about the evidence on the 'pause' in GMST.
Looking back, did the evidence depend on earlier versions of the GMST data? This question hinges upon the use of HadCRUT3 rather than any of the other GMST datasets, for only in HadCRUT3 was there even weak, isolated evidence. If HadCRUT4 had existed when HadCRUT3 did, it is unlikely that the initial claims of a 'pause' would have been made. As such, one can conclude that the use of one of the earlier GMST datasets (HadCRUT3) contributed to the perception of a 'pause'. Given the existence of known shortcomings in this data at the time (related to global coverage and SST biases), that raises the issue of communicating data uncertainties and their implications more broadly between GMST data providers and users.
Alternatively, one can ask whether the evidence depended on the statistical methods and assumptions used to test for a 'pause'. Suppose for example that the use of continuous trends and selection bias testing had been standard at the time the pause-research was first carried out. In that case there would have been no statistical evidence for a pause, even using HadCRUT3 GMST data, and the issue would presumably not have gained any currency in the research community. Thus, the use of inappropriate statistical tests also contributed to the perception of a 'pause'. That also raises issues for the research community about the need to formulate definitions of new phenomena in terms of clear, quantifiable metrics, and in avoiding the common pitfalls in trend analysis (Miller 2013). Some recommendations along these lines for addressing future (inevitable) fluctuations in GMST trend might include:
• Any description of a new form of climate fluctuation should include a clear and generalisable definition of the phenomenon. This would include criteria for identifying onset and decay of the phenomenon.
• The definition should make clear the features that make the fluctuation unusual and whether it has a statistical or physical basis or both.
• The statistical assessment of the phenomenon should include some assessment of the sensitivity to the statistical methods employed and to the sources and major biases in the underlying data.
Researchers have noted that whether the 'pause' was real or not, it helped generate research on the mechanisms of climate variability on decadal time scales, and thus increased understanding about the climate system (Lewandowsky et al 2015a, 2015b, Fyfe et al 2016, Medhaug et al 2017, Nature 2017). While this is true, it is also important to ask what has been lost by the invention of a 'pause' in global warming. We will never fully know the answer to this question, but it is clear that the climate-research community's self-declaration of a 'pause' in global warming has created additional confusion for the public and policy-system about the pace and urgency of climate change. This in turn may have contributed to reduced momentum for action to prevent greenhouse climate change, even if only a bit and if only by some years. That lost momentum is likely to be counted in higher total emissions of greenhouse gases before climate stabilisation. The full costs of that are unknowable, but the risks are substantial (World Bank 2012, Hansen et al 2016). That is, there are costs, and there are perspectives upon which it matters whether the 'pause' was real or not. The effort to deconstruct the basis for the 'pause' is not strictly academic and provides some salient lessons for the science.