A large ensemble illustration of how record-shattering heat records can endure

The record-shattering hot day in the Pacific Northwest in June 2021 is used to motivate a study of record-shattering temperature extremes in a very large hindcast ensemble. The hottest days in the Pacific Northwest in the large ensemble have similar large-scale and synoptic patterns to those associated with the observed event. From the perspective of a fixed location, the hottest ensemble days are acutely sensitive to the chance sequencing of a dry period with a precisely positioned weather pattern. These days are thus rare and require very large samples (tens of thousands of years) to capture. The enduring nature of record-shattering heat records can be understood through this lens of weather 'noise' and sampling. When a record-shattering event occurs due to a chance alignment of weather systems in the optimal configuration, any small sample of years subsequent to the (very unlikely) record event has an extremely low chance of containing yet another chance extreme. While warming of the baseline climate can narrow the gap between more regular extremes and record-shattering extremes, this can take many decades depending on the pace of climate change. Climate models are unlikely to capture record-shattering extremes at fixed locations identified from observations unless the model samples are large enough to provide enough weather outcomes to include the optimal weather alignments. This underscores the need to account for sampling in assessing models and changes in weather-sensitive extremes. In particular, climate models are not necessarily deficient in representing extremes if that assessment is based on their absence in undersized samples.


Introduction
The impacts of climate variability and change are often associated with the occurrence of climate extremes (Mitchell et al 2006). Typical examples include extended or severe drought (Verdon-Kidd and Kiem 2009, Risbey et al 2013, Schubert et al 2014), damaging storms (Tozer et al 2020), megafires, and severe heatwaves (Barriopedro et al 2011, Dole et al 2011, Lau and Kim 2012, Schneidereit et al 2012). The impacts from extremes can be particularly acute when the extreme event is so rare that it far exceeds past records. Such record-shattering extremes can induce failure modes in natural and social systems, such as when a severe heatwave causes transport and power system failures (Collins et al 2019), along with spikes of fatalities in vulnerable populations (Trigo et al 2005, Henderson et al 2022).
Record-shattering extremes are those whose magnitude well exceeds any prior events in the observational record. Such events are then by definition at the far tail of the observed distribution, and will usually be identified as 'outliers' in the population. The occurrence of record-shattering heat extremes can present a set of challenges for the science and its communication (Swain et al 2020, Seneviratne et al 2021). The new record high will often replace a record that has been extant for many decades (Abatzoglou and Barbero 2014). The fact that many current maximum temperature records were set so long ago raises the question of why contemporary global warming has not wiped away these records (Redner and Petersen 2006). When new records are set so far above past records, it is prudent to ask what this portends for the future, particularly in light of climate change (Diffenbaugh et al 2017, Diffenbaugh and Davenport 2021, Thompson et al 2022). When such outlier events occur, it is also natural to look to climate projections to see if models have captured these extremes. When the model projections have not encompassed the extreme in question, what does that imply about the pace of climate change or the fidelity of the models (Hook et al 2021, Teirstein 2021)? Climate models are known to have difficulty simulating some of the processes responsible for local extremes (Tibaldi and Molteni 2002, Watson and Colucci 2002, Scaife et al 2010, van Oldenborgh et al 2022). The relationship between extreme events and climate change depends very much on the type of extreme considered. For extremes that surpass a given threshold (e.g. 90th percentile) in a well-defined distribution, it is already clear that climate change is creating more extreme events (Redner and Petersen 2006, Seneviratne et al 2021). When it comes to record-breaking or record-shattering events, the contribution of climate change is less clear. Rahmstorf and Coumou (2011) find that when the extreme relates to data that is well aggregated in time and space (e.g. monthly global surface temperature), the underlying trend is often greater than the variability, and new records will generally be driven by, and attributable to, climate change. Conversely, for very disaggregated extremes like daily temperature at a station location (local extremes), the underlying variability can be large relative to the trend. In that case, the rate at which temperature records are set may not show much response to climate change (Redner and Petersen 2006, Rahmstorf and Coumou 2011).
In this work we are concerned with local extremes. We use daily temperature at a fixed point location, and focus on cases where the extreme is record-shattering. These local, record-shattering extremes are always viewed through imperfect lenses: on the one hand through extreme value theory applied to small-sample observations, and on the other hand through large-sample ensemble modelling with imperfect models. We focus on the large ensemble approach here, since our interest is in how these extremes behave in very large samples, and in ways in which sample size influences views of record-shattering extremes.
The sensitivity of extreme events to sampling has been documented in both statistical and modelling studies. The sensitivity of generalised statistical maximum likelihood estimators to sample size is well known (Ailliot et al 2011, Hu et al 2020), and a range of studies have addressed how many samples are needed to draw reliable estimates of the likelihood of an extreme (Cai and Hames 2011, Sippel et al 2015, Paciorek et al 2018). The sensitivity of extreme events to sample size in large model ensembles has also been demonstrated (Annan and Hargreaves 2011, Coats and Mankin 2016, Mankin et al 2020). More recently, it has become clear that very large ensembles can and do generate local, record-shattering heat extremes that are as extreme in the model context (Gessner et al 2021, McKinnon and Simpson 2022) as equivalent outliers in observations (Barriopedro et al 2011). We employ a large ensemble here that also captures extraordinary outlier extremes. These outlier extremes are used to illustrate how record-shattering extreme records can be very resistant to being beaten, even in light of ongoing climate change. The illustration is built around a case study of a recent record-shattering extreme: the heatwave that afflicted areas of the Pacific Northwest of the US and southwest Canada in late June 2021. This case has been studied using a range of different large ensemble samples (McKinnon and Simpson 2022, Thompson et al 2022). These studies show that the event was a large outlier in observations, and suggest that such events are likely to become more common in future. This work provides additional analysis on why these very extreme events are so rare at fixed locations, and why their frequencies at fixed locations might be slow to increase. Note that our framing of the characteristics of extreme extremes at fixed locations is different from approaches where extremes can be considered at any location or across broad regions, where multiple tests (Ventura et al 2004) or aggregation (Rahmstorf and Coumou 2011) provide different statistical responses to those at fixed points.
In the following sections we describe the heatwave event example in more detail, the assessment of the event in observations and in a large model ensemble, the sensitivity to weather sampling, and discuss some implications of sampling sensitivity for capturing record-shattering extremes in climate projections.

Data and methods
For the purposes of this work we define a record-shattering extreme as one in which the event sets a record by a wide margin and is a demonstrable outlier in the population that it joins. We provide an example of the use of this definition below, after introducing the data for study of the extreme heat event. We draw on two different datasets here for the analysis of the record-shattering maximum temperatures in the Pacific Northwest, one from observations and one from a model. To represent the event as closely as possible to how it was experienced, and for consistency with other approaches (Philip et al 2020), we use station data of maximum temperature at a point location. The station selected is Seattle Tacoma International Airport (SeaTac), which has a daily record of maximum temperature spanning the period from 1948 to present in the Global Historical Climatology Network (Menne et al 2012). The record-breaking extreme temperature of 42.2 °C was set on 28 June 2021.
A number of studies of extreme hot days define these events in terms of the annual maximum of daily maximum temperature for a given location (or grid cell for a model), denoted by TXx (Kharin and Zwiers 2005, Kharin et al 2013, Naveau et al 2018, Diffenbaugh 2020, Seneviratne and Hauser 2020, Ajjur and Al-Ghamdi 2021, Almazroui et al 2021, Seneviratne et al 2021, Pitman et al 2022). We adopt the same definition here, acknowledging that TXx is only one of various definitions of heatwaves (Alexander 2013, Emerton et al 2022). TXx is a useful index in that it avoids problems of seasonality and reduces temporal dependence (Stein 2020). TXx is well suited for use with record-shattering extremes since such records are seldom likely to occur more than once a year. The time series of TXx for SeaTac is shown in figure 1(a). The sample size in the observed station record is 74 years of TXx, with 2021 constituting a candidate record-shattering event.
In practice, we would like to know how unusual (how likely, or how frequent) the record-breaking event was. We address this question here, though there is no well-defined answer to it. For a well-sampled, well-behaved record of TXx of length n years, we might assess the likelihood, p, of observing an event-year equal to or greater than some critical threshold value, TXx_c, as

p = (#event-years ⩾ TXx_c)/n.    (1)

Conversely, the 'return period' for an event-year equalling or exceeding the critical value is r = n/(#event-years ⩾ TXx_c). Unfortunately, we cannot straightforwardly apply these formulations directly here, because by definition there is only one record-breaking event, TXx_r, in our sample of n = 74 years of TXx. If we had a much longer record, we do not know how many additional years we would go before seeing more events of this size, if at all (Gessner et al 2021). Thus, the value of n in equation (1) remains open-ended for TXx_r. The problem is further compounded in that the climate record is nonstationary (Lorenz 1968, Hasselmann 1976) and we should further account for the factors underlying the nonstationary behaviour.
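As a concrete illustration, a minimal sketch of the empirical likelihood and return-period calculations in equation (1) might look as follows, assuming a hypothetical array txx holding one TXx value per year (the function and variable names are ours, not from the study):

```python
# Minimal sketch of equation (1) and the empirical return period.
# Assumes `txx` is a 1-D array of annual TXx values (one per year).
import numpy as np

def exceedance_probability(txx, threshold):
    """Fraction of event-years with TXx >= threshold (equation (1))."""
    txx = np.asarray(txx)
    return np.sum(txx >= threshold) / txx.size

def return_period(txx, threshold):
    """Average interval (in years) between event-years with TXx >= threshold."""
    p = exceedance_probability(txx, threshold)
    return np.inf if p == 0 else 1.0 / p

# Hypothetical usage for the 74-year SeaTac record and the 42.2 degC record:
# r = return_period(txx_seatac, 42.2)
```

As the text notes, such estimates break down for the record itself: with a single exceedance in a 74-year record, the empirical return period is bounded by the sample length.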
A common approach to address the small sample size in observations is to fit a parametric distribution to the available data. Given an interest in events in the very tail of the distribution, the distribution chosen in attribution studies is often the generalised extreme value (GEV) distribution:

G(TXx) = exp{−[1 + k(TXx − µ)/σ]^(−1/k)},

for 1 + k(TXx − µ)/σ > 0, with location parameter µ, scale parameter σ, and shape parameter k (k ≠ 0). For k = 0 the distribution reduces to the Gumbel form, G(TXx) = exp{−exp[−(TXx − µ)/σ]}. In fitting the observed data in figure 1(a), choices must be made about the population sample to be fitted. To avoid selection bias (Ventura et al 2004, Wilks 2006, Risbey et al 2018a), one might wish to exclude the record-shattering event from the population to be fitted (van Oldenborgh et al 2021). On the other hand, one might wish to use all the data available, including the record-shattering extreme. The choice made about how to treat the new record results in very different outcomes for the fitted distribution (Bercos-Hickey et al 2022). One way to elucidate this sensitivity is, similar to Cook's distance in regression (Cook 1977), to test the effect of successively removing every observation in the sample, with a view to identifying outliers. The histogram of SeaTac TXx observations is shown in figure 1(b). The grey lines provide 72 GEV fits to the SeaTac data spanning the years 1948-2020 (leaving aside the 2021 record for now), where one year from this period has been successively left out of the population to be fitted each time. The resulting GEV fit is remarkably similar for all the grey line fits except one: the fit where the previous record TXx (from 2009) is left out of the sample. Without that one prior event, the warm tail of the GEV fit is truncated about 2 °C cooler. All of the GEV fits without 2021 fail to extend to include the 2021 event in the warm tail. When 2021 is added to a fit over all years (blue line), the warm tail extends by an additional 3 °C to include the record-shattering event. Since the 2021 event has such a profound effect upon the fitted distribution, we consider it to be a 'demonstrable outlier' in satisfying our definition of a record-shattering extreme here.
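The leave-one-out sensitivity test behind the grey curves in figure 1(b) can be sketched as below. This is an illustrative reconstruction, assuming txx holds the 1948-2020 SeaTac TXx values; note that scipy parameterises the GEV with a shape c that corresponds to −k in the notation above:

```python
# Minimal sketch of the leave-one-out GEV sensitivity test
# (in the spirit of Cook's distance for regression).
import numpy as np
from scipy.stats import genextreme

def leave_one_out_gev_fits(txx):
    """Fit a GEV to the sample with each year removed in turn.

    Returns a list of (shape, loc, scale) tuples, one per left-out year.
    """
    txx = np.asarray(txx)
    return [genextreme.fit(np.delete(txx, i)) for i in range(txx.size)]

# Comparing a far-tail quantile of each fit shows which single observation
# (here, the prior 2009 record) controls the warm tail:
# tails = [genextreme.ppf(0.999, *p) for p in leave_one_out_gev_fits(txx)]
```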
Since there are considerable uncertainties associated with the characterisation of the record-breaking event from small-sample observations (Diffenbaugh et al 2017), it is common to use model simulations to provide an additional way to characterise the event that can offer large sample sizes (Paciorek et al 2018, Naveau et al 2020, Philip et al 2020). In this work we use the very large model ensemble provided by hindcasts from a decadal climate forecast system (ACCESS-D) (Squire et al 2022) to complement the observations-based analysis. A growing number of studies use hindcast ensembles to assess extreme events (Thompson et al 2017, Deser et al 2020, Kay et al 2020, Kelder et al 2020, Kent et al 2022). The forecast system here is based on the GFDL coupled model CM2.1 (Delworth et al 2006), with the ocean component upgraded to MOM5.1. The model hindcast consists of a suite of daily 10-year lead-time forecasts initialised on each May 1st and November 1st from 1995 to 2020. Each forecast contains 96 ensemble members. Since we require complete calendar years to calculate TXx, the total number of sample years available from this ensemble is 26 (years) × 9 (lead years) × 2 (start times) × 96 (members) = 44 928. This ensemble provides over 16 million simulation days, which enables us to test very large samples of daily weather generated from a single, self-consistent source (one model). The effective sample size can be reduced if there is dependence among ensemble members (Kelder et al 2020). We tested this by assessing the correlation of TXx between ensemble members at given lead times. The correlations are very nearly zero (at all lead times), implying that the estimates of TXx in the model ensemble are effectively independent.
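A sketch of how the ensemble TXx sample and the member-independence check might be assembled is given below. The data layout (an xarray DataArray of daily maximum temperature with dims init_date, member, time) is our assumption for illustration, not the actual hindcast file structure:

```python
# Minimal sketch of building the ensemble TXx sample and checking
# independence between members.
import xarray as xr

def ensemble_txx(tmax):
    """Annual maximum of daily Tmax (TXx) per member and init date.

    In practice the partial calendar years at the start and end of each
    10-year forecast would be dropped, since only complete years count.
    """
    return tmax.groupby("time.year").max("time")

def member_correlation(txx, m1, m2):
    """Correlation of TXx between two members across init dates and years.

    Values near zero imply the members contribute effectively
    independent samples of TXx.
    """
    return xr.corr(txx.sel(member=m1), txx.sel(member=m2))
```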
The model world differs from the observed world in a number of important ways that mean it can at best only be a proxy for SeaTac temperatures. Models always contain some biases relative to observations (Kharin et al 2012, DelSole and Tippett 2014, Kelder et al 2022). Models typically have a cold bias in TXx for the Pacific Northwest region (Seneviratne et al 2021). For this exercise we use the model gridbox encompassing SeaTac, which has a warm bias relative to the point observations. There is a spatial mismatch between point-based observational data and the gridbox resolution of climate models. The larger area of the gridbox and the inclusion of areas of high topography in the model gridbox would tend to produce lower maximum temperatures relative to SeaTac point observations. On the other hand, the gridbox contains areas further from the coast than SeaTac, which would tend to allow higher maximum temperatures. The model climate sample over 1995-2030 covers only the warm end of the observations sample over 1948-2021, which would also tend to favour more years with higher values of TXx in the model.
The model warm bias in maximum temperatures is adjusted by removing the mean of observed maximum temperature, o, at SeaTac for each day, d, of the year, ⟨o⟩_d, from the mean of model grid maximum temperature, f, for each day of year, ⟨f⟩_d, where ⟨·⟩ indicates temporal averaging over the period 2004-2021. The bias-correction is performed as a function of lead time, τ, in the model hindcasts, so that the bias-corrected model maximum temperatures, f′, are given by

f′_{d,τ} = f_{d,τ} − (⟨f⟩_{d,τ} − ⟨o⟩_d).

The bias-correction is deliberately simple to alter the model data as little as possible in generating distributions that are broadly statistically consistent (assessed using a Kolmogorov-Smirnov test) with the limited historical record.
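A minimal sketch of this correction, assuming xarray DataArrays model (with a lead dimension) and obs (station data) on daily time axes over the 2004-2021 overlap, is:

```python
# Minimal sketch of the day-of-year, lead-dependent mean bias correction.
import xarray as xr

def bias_correct(model, obs):
    """Subtract the (day-of-year, lead) mean model bias relative to obs."""
    clim_obs = obs.groupby("time.dayofyear").mean("time")      # <o>_d
    clim_model = model.groupby("time.dayofyear").mean("time")  # <f>_{d,tau}: lead dim kept
    bias = clim_model - clim_obs                               # mean bias per day and lead
    return model.groupby("time.dayofyear") - bias              # f' = f - bias
```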
The histogram of TXx from the model sample is compared with the observed histogram in figure 1(c). The bias-corrected model histogram is closer to the observed histogram than the uncorrected model TXx, but is still shifted warmer, indicating that the actual bias varies with the magnitude of maximum temperature. The bias-correction ensures only that maximum temperatures in the model have the same mean as observations (over all days), which does not guarantee that the hottest days in a year have the same mean in observations and model. The model histogram is smoother because of the vast increase in the number of samples (about 600 times larger). In the results that follow we show sampling tests using the bias-corrected model data, but all results have been repeated on the uncorrected data and are qualitatively similar. The results here will depend to a degree on the model used. In particular, they will depend on the model's representation of weather systems, which is explored in section 4. However, the finding that variations in weather patterns can necessitate very large samples to represent extremes like very hot days is likely to be common to any model with reasonable synoptic processes and variability.

Return periods
In order to provide some initial context on the extreme nature of the SeaTac heat event and on the role of sampling, we assess the event 'return periods' (Naveau et al 2018), a relevant measure for planning and adapting to extremes (Seneviratne et al 2021). The return period is defined as the typical (average) interval between occurrences of the 'event'. The return period for years with an event TXx ⩾ TXx_r = 42.2 °C sampled directly from the model ensemble is shown in figure 2 (blue bars). The return periods are calculated for a range of different sample sizes. For each sample size (10-44 928), the sampling is repeated 1000 times to provide an estimate of the uncertainty in return period for a given sample size. For each sample of a given size, the sample members were selected at random without replacement of members. These experiments were repeated with replacement of members in generating the samples. The results are broadly similar for all but sampling from the full population of 44 928 years. All the other sample sizes tested are small relative to the population, and so it does not make much difference whether each member is replaced or not in selecting subsequent members from the population.

Figure 2. Return periods estimated two ways. The blue box and whiskers represent estimates of return periods for TXx ⩾ 42.2 °C from random samples of the given sample sizes, sampled directly from the model large ensemble. Each random sample is repeated 1000 times from the full population to generate the distributions of return period. The orange box and whiskers represent estimates of return periods for TXx ⩾ 42.2 °C deduced from a GEV distribution fitted to samples of given sample size from the model large ensemble. The samples fitted with the GEV are the same samples used to sample directly. There are too few instances of TXx ⩾ 42.2 °C at low sample sizes (<500) for the direct method to define sampling distributions. The boxes represent the interquartile range [25th (q1) to 75th (q3) percentiles], the whiskers extend to the most extreme points not considered outliers, and the outliers (black circles) are defined as > q3 + 1.5(q3 − q1) or < q1 − 1.5(q3 − q1).

The return periods estimated from direct sampling of the large ensemble in figure 2 (blue bars) cannot be deduced for small samples. For sample sizes that are small relative to the return period there will often be no event-years with TXx ⩾ TXx_r in the direct sample. In that case the return period (and distribution of return periods) is not well-defined, which applies to sample sizes less than 500 years for the direct method. For large sample sizes (>1000 years) the return period for TXx_r in the model is about 270 years. Even for a sample size of 1000 years one can obtain sample estimates of return periods between 100 and 500 years. It takes sample sizes of 5000 years (and greater) to provide tighter bounds nearer 270 years.
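A minimal sketch of the direct resampling procedure, assuming txx_pool is the full array of 44 928 ensemble TXx years (names are ours for illustration):

```python
# Minimal sketch of direct resampling of return periods by sample size.
import numpy as np

rng = np.random.default_rng(0)

def sampled_return_periods(txx_pool, sample_size, threshold=42.2, n_repeats=1000):
    """Return-period estimates from repeated random draws of a given size.

    Draws are without replacement; samples containing no exceedance give
    no estimate (the return period is undefined there), as in the text.
    """
    estimates = []
    for _ in range(n_repeats):
        sample = rng.choice(txx_pool, size=sample_size, replace=False)
        n_events = np.sum(sample >= threshold)
        if n_events > 0:
            estimates.append(sample_size / n_events)
    return np.array(estimates)
```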
The results for direct sampling in figure 2 indicate that very large sample sizes are needed to constrain sample estimates of return periods for TXx_r (Kelder et al 2020). Such large sample sizes are not always available, in which case it may be desirable to fit a GEV distribution to the model sample (Kharin and Zwiers 2005, Kharin et al 2013). We test the efficacy of this approach here by fitting a GEV distribution to each of the same samples used to estimate the return periods in the direct sampling. We then estimate return periods from the fitted GEV distributions to see if we obtain more stable estimates for smaller samples. The results are shown in figure 2 (orange bars). The GEV fits to the samples produce broader distributions over repeated samples than the directly sampled estimates in blue. This is perhaps expected since the directly sampled estimates are bounded by sample size (when only one year ⩾ TXx_r is present in the sample), and simply count years over TXx_r without regard to how extreme the value is, whereas the GEV estimates fit distributions to those very extreme values.
For low sample sizes of 50 or 100 years, the GEV-fit return periods in figure 2 (orange bars) are not well constrained across the resampling, and range over orders of magnitude, from several tens to several thousands of years. The interquartile range across the resamples reduces considerably for sample sizes of 500 or 1000 years, but still spans a wide range. It is clear that the use of a GEV fit to the model samples does not obviate the need for large samples in order to estimate the return period well.
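For comparison, the GEV-based estimate for one sample can be sketched as below (again illustrative; the fitted GEV converts the sample into an exceedance probability rather than counting exceedances directly):

```python
# Minimal sketch of the GEV-based return-period estimate for one sample.
import numpy as np
from scipy.stats import genextreme

def gev_return_period(sample, threshold=42.2):
    """Return period implied by a GEV fitted to one sample of TXx years."""
    params = genextreme.fit(sample)
    p_exceed = genextreme.sf(threshold, *params)  # P(TXx >= threshold)
    return np.inf if p_exceed == 0 else 1.0 / p_exceed
```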

The role of weather
When a record-shattering extreme occurs, the climate system has visited a place that has never been observed before, and which may or may not have occurred (unobserved) before (Woo 2021). The record-shattering excursion could be the result of external forcing of the climate system such as via greenhouse forcing, or it could be the result of natural variations of the climate system, or both (Paciorek et al 2018, Deser et al 2020). Natural variability is a consequence of circulation processes in the ocean and atmosphere and their interactions (Monselesan et al 2015). Our focus is the reflection of variability in the atmospheric component, meaning the 'weather'.
Extreme hot days occur at a location when the weather pattern for the day generates strong warming of the lower troposphere, often over dry land surfaces that have been desiccated by prior dry spells (Abatzoglou and Barbero 2014, Gessner et al 2021, Bartusek et al 2022). The weather patterns associated with the Pacific Northwest record-shattering extreme have been well documented and feature a larger-scale wave train (Neal et al 2022) containing a strong blocking high over southwest Canada (Oertel et al 2023), and a cutoff low and trough along the California/Oregon coasts (Emerton et al 2022, Neal et al 2022). This synoptic configuration is shown in the 500 hPa geopotential height field (Z500) for the record-breaking day in figure 3(a). The combination of block and coastal trough directed strong continental air flows across and down the Cascade range, and prevented intrusion of cooler maritime air (Emerton et al 2022).
Inspection of the weather patterns in the model ensemble shows that it generates flow configurations for hot days in the SeaTac region similar to observations. The hottest day for the SeaTac gridbox in the model ensemble of 44 928 years of simulation is 46.5 °C. The Z500 field for this day in the model is shown in figure 3(b), along with the patterns for the 2nd-5th hottest days in the model ensemble (figures 3(c)-(f)). The model hottest days have a similar large-scale and synoptic configuration to the observed hottest day. They all feature a wavetrain with an embedded blocking high, with the high directing warm downslope air into the SeaTac region. For the two hottest model days there is also a cutoff low off the coast, just as in the record-breaking case for observations depicted in figure 3(a).

A key issue for the 'sampling' problem is how repeatable the weather pattern that gave rise to the record-shattering extreme is. If the weather pattern is not very repeatable, we will require long sequences of weather to generate statistics on an event of this size (Deser et al 2020). Weather patterns would vary in time, even if forcing were constant, because weather systems are unstable with respect to small-amplitude disturbances (Lorenz 1976). As noted by Lorenz (1976), 'any finite time segment of the history of the atmosphere is no more than a statistical sample, and, consequently, that climatological statistics obtained from one segment will not be identical with those obtained from another'. At issue in our example for SeaTac is whether the 74-year sequence of observed weather patterns contains many repeatable instances of the conditions that gave rise to the record-breaking extreme. The answer to this question depends on factors such as the domain used to assess the weather patterns and the metrics used to assess their uniqueness (van den Dool 1994). For large domains and the requirement to match flow patterns within observational error, the weather patterns are effectively unrepeatable in the sense that each pattern is unique (van den Dool 1994). For the SeaTac example, Philip et al (2022) assess a restricted domain around the Pacific Northwest and assess similarity via pattern correlations above 0.8 to conclude that the circulation pattern on 28 June (figure 3(a)) is 'likely not exceptional'. On the other hand, one can point to the mix of large-scale, synoptic, and mesoscale conditions in the week of 28 June, and the requirement for the blocking high and coastal trough to sit in their precise positions to generate extreme heat, to suggest that this was possibly a singular weather event. It is not uncommon for record-breaking extremes to be critically sensitive to the precise location and duration of the proximate synoptic system (Tozer et al 2020), which contributes to long-tail events.
To provide a large sample perspective on the uniqueness of the record hot day pattern we turn to the model large ensemble, comparing the weather pattern on the model ensemble's hottest day (shown in figure 3(b)) with the model pattern on all other summer (JJA) days. We let the flow pattern on the ensemble hottest day be defined by Z500 in a domain centred on the grid point, and compare that pattern with the patterns on all other summer days in the ensemble using both a pattern correlation coefficient and the RMSE over the domain. For each of the p points in the gridded domains, the pattern correlation, r, between the reference pattern, x, and a candidate pattern, y, is

r = Σ_i (x_i − x̄)(y_i − ȳ) / [Σ_i (x_i − x̄)² Σ_i (y_i − ȳ)²]^(1/2),

and the RMSE is [(1/p) Σ_i (x_i − y_i)²]^(1/2), where overbars denote means over the p points. The domain is set to ±30° latitude and longitude around SeaTac, large enough to contain the salient blocking high, low, and surrounding troughs indicated in figure 3. The scale implied by this domain is also broadly consistent with the formation distance of the Pacific Northwest heat event of about 4000 km (Röthlisberger and Papritz 2023). For both pattern matching measures (correlation and RMSE), the closer the match (smaller RMSE, higher pattern correlation) in figure 4, the hotter the temperature at the SeaTac grid location. Almost all of the 4.5 million summer days are densely clustered together in the blue 'cloud' in this figure. It is remarkable that only a handful of days sit outside this cloud in the sparse tail of extreme high temperature. This sparse tail behaviour is also found in the other spatial domains tested.

Figure 5. As in figure 4, the reference pattern is the pattern of Z′500 on the hottest day in the model ensemble for the domain set to ±30° latitude and longitude around SeaTac. The RMSE is calculated for the Z′500 patterns on all days where TXx occurs. The colors in the dots refer to the summer mean soil moisture value associated with each TXx. The black contour lines depict the density of points.
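The two pattern-match measures can be sketched as follows for a pair of gridded fields; for simplicity this omits any latitude (area) weighting, which the study may or may not have applied:

```python
# Minimal sketch of the pattern correlation and RMSE over a domain.
# Assumes `ref` and `day` are 2-D numpy arrays of Z'500 on the same grid,
# with `ref` being the pattern on the ensemble's hottest day.
import numpy as np

def pattern_correlation(ref, day):
    """Centred Pearson correlation over the p grid points of the domain."""
    r = ref.ravel() - ref.mean()
    d = day.ravel() - day.mean()
    return np.sum(r * d) / np.sqrt(np.sum(r**2) * np.sum(d**2))

def rmse(ref, day):
    """Root-mean-square difference over the p grid points of the domain."""
    return np.sqrt(np.mean((ref - day) ** 2))
```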
To be sure, there are a range of factors besides the weather pattern that influence very extreme hot days at a location (Barriopedro et al 2011, Abatzoglou and Barbero 2014). That is evident in figure 4 in that some extreme hot days have weaker pattern matches, and some days with strong pattern matches are not extremely hot. Part of this variation can be related to variations in soil moisture, as shown in figure 5, where we have replotted just the days where TXx occurred (not all summer days) against the same pattern match indicator as in figure 4. In this case the dots are colored according to the summer-mean soil moisture at the SeaTac grid for the summer in which each TXx occurs. This is a relatively coarse metric, but serves to make two points: (1) the hottest days in the ensemble tend to occur in drier summers, and (2) days with pattern matches similar to the ensemble hottest day, but less extreme temperature, tend to occur in wetter summers. The general influence of soil moisture on TXx is evident here in the preponderance of wet summers among the lowest values of TXx in the ensemble.
Taken together, the large ensemble results imply that: (1) the weather pattern is a strong determinant of maximum temperature, but soil moisture conditions can moderate this influence, (2) the very hottest extremes are sensitive to the precise positioning of the proximate weather systems, and (3) very large samples are therefore required to represent the contributions of weather variability to the very hottest extremes.

Extreme temperature and sample size
In this section we explore in more detail the relationship between the hottest temperature in a sample population, TXx_r, and the size of the population. We develop parallel assessments of the value of TXx_r as a function of sample size in the GEV distribution representing the station observations for SeaTac, and in the very large ensemble sample for the gridbox encompassing SeaTac in the model. For the SeaTac station observations, the variability of weather patterns shapes the fitted GEV distribution, and is represented through the location, scale, and shape parameters of the fitted distribution. Sampling TXx from this distribution provides an implicit sampling of weather patterns. For the model ensemble we sample TXx directly (without fitting) from calendar years in the model runs, and so have explicit representations of many years of weather patterns for each sample of TXx_r. The model thus provides its own representation of the variability and repeatability of weather patterns. While this is only a proxy for the real world, we would never have access to samples of the sizes generated by the large model ensemble in the real world.
We explore a range of sample sizes from 10 years to ∼45 000 years in order to bound the set of sample sizes encountered in use. At the low end, 'small' sample sizes of the order of 50-100 years are typical of the length of record available for instrumental observations (Seneviratne et al 2021). Sample sizes from 100 to 1000 years have been typical of the number of years selected when using CMIP-type (Taylor et al 2012) multi-model ensembles in time-slice experiments (Seneviratne and Hauser 2020). 'Large' sample sizes exceeding 5000 years are common in dedicated attribution experiments (Philip et al 2020) or in initial-condition ensembles (Deser et al 2020).
For the observations-based GEV (figure 1(b), blue curve), we take random samples from the GEV distribution for the set of given sample sizes (from 10 to 45 000 years) to find the record-breaking maximum value, TXx_r, in each sample. We repeat these random samples 1000 times for each sample size to develop a measure of the uncertainty of TXx_r associated with the sample. The results for the observations-based GEV are shown in figure 6(a) and indicate a strong sensitivity to sample size. For typical 'small' sample sizes of the order of 100 years, the record-breaking TXx_r obtained in the sample of years ranges between about 38 °C and 43 °C, with a median value between about 40 °C and 41 °C. For 'large' sample sizes, the median value of TXx_r increases towards about 44 °C, and the sampling uncertainty (range) decreases.
The sensitivity of the record-breaking TXx_r to sample size for the model grid box is shown in figure 6(b). Because we have a large sample for the model ensemble, we use direct sampling from the ensemble population without curve-fitting. The results display the same qualitative sensitivity as the fitted observations, with the median value of TXx_r increasing with sample size, and the range decreasing. As the model sample grows, more weather patterns are sampled, and the hottest day in the model population is hotter still than in smaller population samples. The hottest day in the 44 928 year sample is 46.5 °C, which might be exceeded in still larger samples, as it is not clear that this value has stabilised here. This illustrates the way in which the hottest day attainable in a region can keep increasing as we add more samples of daily weather to our population (Gessner et al 2021). The hottest day attainable would be bounded by physical constraints, and boundedness is consistent with the GEV fit to observations yielding parameters indicative of a bounded tail (Gessner et al 2021). In this case, however, the model shows less evidence than the GEV fit to observations that such a bound has been reached.
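Both panels of figure 6 follow the same resampling recipe, which can be sketched as below; draw is a sampling function, either random variates from the fitted GEV (panel (a)) or direct draws from the ensemble pool (panel (b)):

```python
# Minimal sketch of the record-maximum-versus-sample-size experiment.
# Assumes `txx_pool`, `rng`, and GEV `params` from earlier sketches.
import numpy as np
from scipy.stats import genextreme

def record_txx_distribution(draw, sample_sizes, n_repeats=1000):
    """Distribution of the sample maximum, TXx_r, for each sample size."""
    return {n: np.array([draw(n).max() for _ in range(n_repeats)])
            for n in sample_sizes}

# Observations-based draws (figure 6(a)) and direct ensemble draws (6(b)):
# draw_gev = lambda n: genextreme.rvs(*params, size=n)
# draw_ens = lambda n: rng.choice(txx_pool, size=n, replace=False)
```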
A critical feature of figure 6(b) is what happens for sample sizes of about 100 years, which is the typical length of observational temperature records. The hottest temperature attained in a sample of 100 years from the large ensemble usually lies between the 'whisker' lines in the plot at about 39 °C and 45 °C. The temperatures that lie above this range are ranked as 'outliers', and we saw in figure 4 that they are very rare. If we repeat our sample of 100 years enough times (we performed 1000 repeats), some of those samples do end up capturing the very hottest day in the entire model ensemble, as indicated by the open circle point at the extreme value. In the real world we get only one sample of 100 years, not 1000. There is therefore a very low chance of getting an extreme outlier in our one 100-year observed sample. But, since we have many locations on Earth to search for extremes (many 100-year samples), some of them (including SeaTac) inevitably will include one of these rare 'outliers' in their 100-year sample. If we do happen to get such an outlier in an observed series, that outlier will likely be well above past records. Further, since we are only adding years incrementally to our 100-year observed sample (through lived experience), we draw only a tiny number of additional samples, and are exceedingly unlikely to draw another outlier. In other words, the record-shattering outlier is very likely to persist, at least in a stationary climate. The role of nonstationarity is explored next.

Nonstationarity
Thus far we have considered only weather variability in generating record-shattering extreme events. We have not considered any causal issues related to nonstationarity and warming. In both the real atmosphere and the model hindcast runs, the greenhouse forcing increases through time. The observed time series for SeaTac TXx (figure 1) displays some evidence of warming (Mass et al 2022). Further, sets of model experiments designed to assess the role of greenhouse forcing indicate that it played a role in increasing the likelihood of the record-shattering event, TXx_r (Bercos-Hickey et al 2022).
As a first test of nonstationarity, we plot the distribution of TXx as a function of calendar year for the model ensemble in figure 7(a). We assess the distributions for all calendar years in which the number of sample years in the ensemble is the same. For the years 2004-2021 the sample size is 1728 years each. The distributions for these years are shown with progressively changing colors from yellow to black. Almost the entire distribution (for low through high values of TXx) has a slight progressive shift with calendar year to warmer values of TXx. This shows how the warming climate (over the model time-slice) has a progressive influence on TXx. The hottest days of the year are getting hotter. The median of the TXx distributions has shifted up to 1 °C warmer, reflecting nonstationarity over this period. While the baseline climate of hot days is clearly shifting in the ensemble with calendar year, that might or might not have an impact on record-shattering hot days over the period. To assess the response of record extremes here we now focus on just the maximum (hottest) value of TXx in each calendar year: the extreme value in each of the distributions in figure 7(a). The maximum value of TXx obtained in the ensemble samples in each calendar year is shown in figure 7(b). If the climate in the model were stationary over 1995-2030 we would expect more or less constant values of the maximum TXx in each calendar year, provided that each calendar year was well-sampled. This assessment is more complicated here because the number of samples in each year ramps up/down for the first/last decade shown, because we are using a hindcast dataset. The number of samples for each year is shown by the dashed grey curve, displaying the ramp up/down. The number of samples per calendar year is constant for the years marked by the grey shaded region.

Figure 6. (a) Maximum value of TXx as a function of sample size for random samples from the best fit GEV distribution for SeaTac observations. Each random sample is repeated 1000 times to generate the distributions shown by box and whiskers. The horizontal dashed line marks the value of the record-breaking extreme, TXx_r, for SeaTac. (b) Maximum value of TXx as a function of sample size for random samples drawn directly from the model ensemble sample of 44 928 years. Each random sample is repeated 1000 times to generate the distributions shown by box and whiskers for each sample size. Note that the sample size on the x-axis is not linear. The boxes represent the interquartile range [25th (q1) to 75th (q3) percentiles], the whiskers extend to the most extreme points not considered outliers, and the outliers (black circles) are defined as > q3 + 1.5(q3 − q1) or < q1 − 1.5(q3 − q1).
We know from figure 6(b) that the maximum value of TXx obtained in a sample of years from the model ensemble increases as the sample size increases. That means that we might expect the profile of the blue curve (maximum TXx each year) in figure 7(b) to follow the profile of the grey curve (number of samples) in ramping up, being roughly constant, then ramping down over the period. While there is some suggestion that the maximum TXx each year does ramp up and is then constant, the variation in this value from year to year is large, implying that there are not enough samples of each calendar year (1728 years of each calendar year) to provide stable estimates. The lack of any obvious upward trend when the sample size is the same (in the grey shaded region) implies that nonstationarity is not very evident for extreme values of TXx. Furthermore, the extreme hottest day in the ensemble occurs in 2003 and is not exceeded in the large ensemble by the time the experiments finish near 2030, despite increases in climate forcing and TXx through this period.
This result presents the conundrum that TXx values in the ensemble are nonstationary (increasing), but the most extreme values of TXx are seemingly not. The resolution of this comes back to the issue of sampling. The sample of 1728 simulation years of each calendar year is enough to see the general increase in temperatures reflected in TXx. However, the maximum of these 1728 samples of TXx each year picks out the record-shattering extremes, which are much more sensitive to sample size, and are still undersampled here.
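The per-calendar-year assessment in figure 7 amounts to a grouped summary, sketched below for a hypothetical pandas DataFrame df with one row per simulated year and columns 'year' and 'txx':

```python
# Minimal sketch of the per-calendar-year TXx summary behind figure 7.
import pandas as pd

def txx_by_calendar_year(df):
    """Sample count, median, and maximum of TXx for each calendar year.

    With equal sample sizes per year, a drift in the median reflects
    nonstationarity; the maximum is far noisier at 1728 samples per year.
    """
    grouped = df.groupby("year")["txx"]
    return pd.DataFrame({"n": grouped.size(),
                         "median": grouped.median(),
                         "max": grouped.max()})
```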

Selection bias in observations and projections
A consideration of sampling together with selection-bias can explain the tendency for record-shattering extreme temperature records to endure for many decades without being beaten. The logic in this process proceeds as follows.

(1) Search for record-shattering extremes wherever they occur. Since our interest is in record-shattering extremes, we are drawn to them when they occur and identify them for study on the basis that they have already happened. Since we monitor thousands of locations across the planet, we will often find such events to study (McKinnon and Simpson 2022), even though each event may be extremely rare/unlikely in the place it happened.

(2) Select a record-shattering event at a fixed location. The event selected may be a big outlier, and, for heat events like this one, could depend on the random and rare draw of a weather pattern in a very precise location and alignment. Once we have selected the event, the very unlikely is now a fait accompli.

(3) Draw more years to add to the sample at the fixed location. The years following the record-shattering event add more observations to our population sample at this fixed location, but since we are typically monitoring and scanning over decades, we are adding only very few extra samples in observations. That means that the chance due to weather variability of obtaining another precisely-located weather pattern like the one that generated the original event is vanishingly small. Since we are not adding record-shattering outliers to our sample, the chance of breaking the new record is small, unless nonstationarity changes the distribution sufficiently.
We represent this process schematically in figure 8 using the large ensemble world for illustration. Analogous to our real world SeaTac event, we start here by selecting the hottest day in the large ensemble, which is the 46.5 °C day that occurred in the year 2003 in the hindcasts. The chance of this event happening is now 1, since we selected it because it happened. This is the event in green shading. For the purpose of this illustration we have extrapolated the model world forward in time to 2050 (we do not have actual hindcasts to this date) using randomly generated GEV distributions with the same scale and shape parameters as a GEV fitted to the model hindcast ensemble, but where we have imposed an annually increasing trend on the fitted location parameter of 0.06 °C/decade, which is the value of the model trend in TXx over the period 2004-2021 in the hindcast ensemble. Having selected an event at a fixed location, the chance of drawing another (weather-generated) extreme outlier at this point over the next several decades is exceedingly small. Thus, values of TXx near 46.5 °C in subsequent years are proscribed by red shading here to indicate that they have very low probability of occurring by chance. In order to beat the record selected in green, we now rely on warming to shift the distribution such that less-extreme outliers (those not conditioned on an optimal alignment of weather systems) can beat the extreme outlier record. This occurs around 2050 in our extrapolated example. The precise date is not important here, as our extrapolation imposes a trend and form of distribution which may not be exactly what occurs. The point is that it may take many decades to beat the selected record TXx value, consistent with findings from climate projection experiments (Deng et al 2022, Dong et al 2023), and the timing is set by the tradeoff between how extreme our selected record was and how quickly the climate is warming at the location.

When record-shattering extreme heat events occur in the real world, we often turn to climate model projections for context on what future changes to expect. This can present pitfalls around the same considerations of selection-bias and sampling. For example, suppose again that a record-shattering extreme, such as the Pacific Northwest heatwave, occurs. An event like this is an extreme outlier, but it has happened in the real world, and we selected it after it happened. We now turn to ensembles of climate projections to see if an event like this has happened in this fixed location in the projections for either the current climate or the climate some decades ahead. Our sample size in projections may include dozens of models with perhaps a number of ensemble members each, and is thus large compared to our single observed sequence. However, while selection-bias in observations gave us a certainty of observing the event, we do not gain from selection-bias in the models in this case, since we are not searching for this event in the models over all times and places, just in one place (where it occurred in the real world). We showed here that when searching for a record-shattering extreme at a fixed location in our model ensemble, very large samples may be required to meet or beat that extreme because of the very high sensitivity of such outliers to the precise weather configuration. Unless we use very large projection ensembles to search for the selected extreme, we are not likely to find one in the projection sample, at least for the current climate. We may not even find one in the projections for the climate decades ahead, unless the shift in temperature distribution is large enough, as described above.
Further, the fact that we do not find an instance of an extreme in climate projections as great as our observed record-shatterer can be explained by applying selection-bias to observations (allowing multiple tests over many locations) but not to models (restricting tests to a single location), and by not having large enough samples in the models to overcome selection-bias. To be fair to the models, this explanation should always be tested and ruled out before concluding that the models are somehow deficient in their simulation of the given extreme because we did not find it in the projections.
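The figure 8 extrapolation can be sketched as below. This is illustrative only: it assumes GEV parameters (shape_c, loc0, scale) fitted to the hindcast ensemble and applies the 0.06 °C/decade trend to the location parameter, as described above (scipy's shape parameter is opposite in sign to the k used earlier):

```python
# Minimal sketch of the forward extrapolation with a drifting GEV location.
import numpy as np
from scipy.stats import genextreme

def extrapolate_txx(shape_c, loc0, scale, start=2021, end=2050,
                    trend_per_decade=0.06, n_per_year=1000, seed=0):
    """Synthetic TXx draws per year from a GEV whose location warms with time."""
    rng = np.random.default_rng(seed)
    out = {}
    for year in range(start, end + 1):
        loc = loc0 + trend_per_decade * (year - start) / 10.0
        out[year] = genextreme.rvs(shape_c, loc=loc, scale=scale,
                                   size=n_per_year, random_state=rng)
    return out
```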

Conclusions
A record-shattering hot day occurred in the Pacific Northwest in late June 2021, which was represented for illustration here by conditions at a single location, SeaTac airport, and by the gridbox containing SeaTac in a model large ensemble. The extreme fits our definition of a record-shattering event in that it well surpasses prior records and has a demonstrable influence on the GEV distribution fitted to TXx at SeaTac. The model large ensemble simulates extreme hot days at the SeaTac grid that are as extreme in the model distribution as the record-shattering event is in observations. The model simulates large-scale and synoptic-scale weather patterns for the hottest days in the model ensemble (at the SeaTac grid) that are similar to the patterns that gave rise to the observed SeaTac event. In the model ensemble, the weather pattern is a strong determinant of the hottest days at SeaTac. There are only a handful of summer days among the millions simulated that have a strong pattern match with the hottest model day, and these share extreme high temperature, unless moderated by soil moisture conditions. From the perspective of a fixed location, the hottest days can be very sensitive to the precise positioning of weather systems and to a prior spell of dry conditions, and thus can require very large samples to represent.
The sensitivity of record-shattering temperature extremes to sampling is evident in the estimates of return periods for an event of the size of the SeaTac event. The return period estimated from the model ensemble is perhaps several hundred years, but there are large uncertainties on this estimate, and they are a strong function of sample size. Samples of order 5000 years are needed to provide some bounds on an event like this one. This requirement is not reduced by fitting a GEV to smaller model samples. In a very large ensemble, the hottest day simulated at SeaTac increases with sample size, and is still seemingly increasing using the full sample of nearly 45 000 simulation years. Small samples are unlikely to provide a random draw of the weather with just the right antecedent dry conditions and optimally-located weather systems to generate the most extreme hot days.
In the hindcast large ensemble, the climate forcing increases with each calendar year, and so does the ensemble distribution of TXx values. Though warming is evident in the ensemble over the hindcast period, the extreme hottest days occurring across the 1728 simulation years for each calendar year display no obvious increasing trend. The lack of a trend in the model's extreme hottest temperatures is related to the fact that very hot days require very large samples to simulate; our sample size each year is still too small, and thus reflects the variation evident in small samples of very extreme events. Put another way, the contribution of warming to the very hottest days in the model hindcast period is relatively small compared to the large variations that chance weather outcomes can provide.
The enduring nature of record-shattering heat records can be understood in terms of sampling and selection-bias. Record-shattering events may be very rare at a given location, but if we search many thousands of possible locations we will often find such events. Having selected such an event, we then watch what happens to the record as years pass after the record. The tiny sample of additional years is highly unlikely to throw up another such rare event if the event itself was conditioned by the chance alignment of weather systems in the optimal configuration for record heat. The record-shattering record will thus endure. If we turn to climate models to assess whether they can 'capture' extreme hot days at the selected location, the models are at a profound disadvantage if we search at only the fixed location selected from observations. The model will not have the advantage of selection-bias at a fixed location, and will have very similar low likelihoods of drawing a record-shattering extreme as the observations do in the years after a record-shattering extreme has occurred. Of course the models can generate large samples, but unless the model sample is very large (many thousands of years), they are unlikely to capture the most extreme events at that location. The failure of models to represent the record-shattering extreme in this situation could reflect process shortcomings (Pitman et al 2022), but it is also to be expected from sampling considerations alone.
An ironic feature of these examples is that a record-shattering heat extreme is likely to rely on a freak alignment of weather systems (and precursor dry weather), and is thus substantially set by chance rather than climate change, though climate change might still have contributed a little by providing a warmer baseline. On the other hand, when records break in short succession, one after another, that indicates that the temperature distribution has warmed so much (due to climate change) that even events that are not so freakish in terms of weather configuration are now warmer than past events that required just the right conditions. The signature of climate change on extreme temperature may thus be written more in the rapidity of successive records than in the wide margins by which they are set. Climate change is not so much a generator of record-shattering heat extremes; rather, climate change transforms record-shattering extremes (with very long return periods) into more common extremes (with short return periods) over the course of decades.
The results here pertain to extremes like record-shattering temperature that display high sensitivity at a fixed location to the positioning of weather patterns. Not all extremes display such sensitivity (Naveau et al 2020), and even heat extremes occur through a variety of different processes in different locations (Parker et al 2014, Risbey et al 2018b). Future work will address the representation of a range of climate extremes in large ensembles (Findell et al 2022) and the extent to which large sample requirements apply to other types of extreme event. These results on the enduring nature of record-shattering heat records would not apply under different conditions to those examined here, such as when the extreme is not highly sensitive to weather patterns, or when there is a regime change in the processes governing the extreme. Regime changes can occur naturally (Charney 1975, Hare and Mantua 2000) or in response to climate forcing (Hansen et al 2016). For example, climate forcing could induce a step response in a variable, which could lead to record-shattering extremes that are directly attributable to the forcing and would not rely on the kind of chance outcomes described here.

Data availability statement
The data that support the findings of this study are openly available. The large ensemble ACCESS-D hindcasts are stored at https://nci.org.au/. The GHCN station data for SeaTac are available from https://climatedataguide.ucar.edu/climate-data/ghcn-d-global-historical-climatology-network-daily-temperatures.