Evaluating heat extremes in the UK Climate Projections (UKCP18)

In recent years, UK summer heatwaves have resulted in thousands of excess deaths, with both extreme temperatures and high humidity increasing health risks. Here, the UK Climate Projections 2018 (UKCP18) are compared to observational (HadUK-Grid) and reanalysis (ERA5) data to quantify model performance at capturing the mean, extremes (95th to 99.5th percentiles) and variability in the climate state and in heat stress metrics (simplified wet bulb globe temperature, sWBGT; Humidex; apparent temperature). Simulations carried out for UKCP18 generally perform as well as or better than CMIP5 models in reproducing observed spatial patterns of UK climate relating to extreme heat, with RMSE values on average ∼30% less than for the CMIP5 models. Increasing spatial resolution in UKCP18 simulations is shown to yield a minor improvement in model performance (RMSE values on average ∼5% less) compared to observations; however, there is considerable variability between ensemble members within resolution classes. For both UKCP18 and CMIP5 models, model error in capturing characteristics of extreme heat generally reduces when using heat stress metrics with a larger vapour pressure component, such as sWBGT. Finally, the 95th percentile of observed UK summer temperature is shown to have ∼60% greater interannual variability than the summer mean over the recent past (1981–2000). This effect is underestimated in UKCP18 models (∼33%) compared to HadUK-Grid and ERA5. In projections of future change, UK summer mean and 95th percentile temperatures are both shown to increase at a faster rate than the global mean temperature.


Introduction
Climate change is projected to increase extreme heat exposure risk around the world (e.g. Zhao et al 2015, Coffel et al 2017, Andrews et al 2018). UK heatwaves in recent decades have resulted in thousands of excess heat-related deaths (Johnson et al 2005, Green et al 2016, Public Health England 2019b), morbidity (Arbuthnott and Hajat 2017) and economic disruption across multiple sectors (Costa et al 2016). These risks could all increase in the future with climate change (Mitchell et al 2016, Vicedo-Cabrera et al 2018). As a result, since 2004, the UK has implemented a Heatwave Plan at the national level in England, with a warning system for at-risk areas and advice on how to stay safe in hot weather (Public Health England 2019a).
Heat stress occurs under environmental conditions where humans (or other organisms) are unable to maintain stable internal body temperatures. Where capacity to regulate core body temperature is reduced, symptoms such as heat exhaustion and heat stroke can occur, along with other potential medical complications (Kovats and Hajat 2008). For humans, the principal physiological coping method for heat stress is sweating, which reduces body temperature by evaporative cooling. Heat stress metrics therefore often include both a temperature and a humidity component because, at higher humidity, sweating becomes a less efficient method of cooling the body (Kovats and Hajat 2008, Sherwood and Huber 2010).
There is no universal definition for what constitutes a heatwave that is appropriate for all situations and locations (Perkins and Alexander 2013), resulting in a variety of definitions used in scientific literature, technical reports and policy plans. A generalised definition would be a 'prolonged period of n or more days with temperatures exceeding a given threshold, x'. Temperatures could be defined in terms of daily maximum, mean or minimum (Perkins and Alexander 2013), or using a metric of heat stress (Chen et al 2019). The temperature threshold could be defined in absolute terms that are regionally specific (McCarthy et al 2019a) or, more generally, in relative terms, such as a percentile of the long-term climate of a region (Russo et al 2017, Arnell et al 2019).
The latest UK Climate Projections (UKCP18), produced by the UK Meteorological Office (UKMO), are available at spatial resolutions ranging from ∼60 km with global coverage down to convection-permitting 2.2 km for the British Isles region (Murphy et al 2019). UKCP18 simulations provide an opportunity to understand climate risks associated with high-impact heatwave events over the UK for the recent past and future (Murphy et al 2019) and to explore the interaction between high temperature, high humidity and heat stress based on approaches developed in global studies (Fischer and Knutti 2013, Russo et al 2017, Di Napoli et al 2020). To date, few studies besides the official UKMO reports (Murphy et al 2019) have focussed on temperature changes using these data. There are a number of reports focussed on specific regions, for example for the Cairngorms and Bristol, both using the probabilistic forecasts (Rivington et al 2019, Arup 2020), and some impacts studies focussed on urban heat islands (Lo et al 2020) and heat extremes at selected coastal locations (Edey et al 2020; in review).
Here, we evaluate the UKCP18 performance over the period 1981-2018 against observational and reanalysis datasets. Through the model evaluation, this paper addresses the following questions:
Although 15 simulations were available in the GCM subset, only 12 RCM and CPM simulations driven by these experiments were included in UKCP18, selected based upon their performance (Murphy et al 2019). Only these 12 subset members were assessed in this paper. Although we have shortened the ensemble subset names for brevity, the 12 subset members are named consistently with the convention of Murphy et al (2019). Additionally, eight CMIP5 simulations were analysed. These were included as part of the UKCP18 reports and were regridded by the UKMO to the common global grid of HadGEM3-GC3.05 (Murphy et al 2019).
HadUK-Grid gridded observations (Hollis et al 2019) and ERA5 reanalysis data (Hersbach et al 2020) were obtained for model validation. Where necessary for comparisons, HadUK-Grid and ERA5 were regridded to the three UKCP18 spatial grids using nearest neighbour interpolation. There was no special treatment for coastal points, which could introduce some cool coastal biases, particularly for ERA5; however, the impact on UK-wide averages is low. For comparisons with HadUK-Grid, monthly means of daily mean temperature (T_mean) and vapour pressure were calculated for UKCP18 and ERA5 (to match the temporal resolution of HadUK-Grid data for these variables). Details on these datasets, as well as the processing that was carried out for the comparison, are summarised in table 1. Some additional datasets were used for analysis and discussion; further details on these and data processing are included in the supplementary information (available online at stacks.iop.org/ERL/16/014039/mmedia).
Three heat stress metrics are calculated here: simplified wet bulb globe temperature (sWBGT) (Australian Bureau of Meteorology 2010, Buzan et al 2015), apparent temperature (AT) (Steadman 1979, Zhao et al 2015) and Humidex (Masterton and Richardson 1979, Buzan et al 2015). sWBGT is an empirical algorithm for estimating the wet bulb globe temperature: a physically based heat stress metric derived for setting occupational safety thresholds (NIOSH 1986, ISO 2017). AT and Humidex were developed by meteorological agencies as more generic algorithms for thermal comfort (Buzan et al 2015, Sherwood 2018). It is acknowledged that these metrics will have limitations, not least their actual relevance for defining physiologically based levels of heat stress risk (Sherwood 2018). However, they broadly capture the range of variability from more complicated metrics (Sherwood 2018) and offer comparable results to many other heat stress studies (Zhao et al 2015, Buzan et al 2015, Mitchell et al 2016, Matthews et al 2017). Heat stress is estimated for the warmest part of the day using vapour pressure and daily maximum temperature (T_max; see supplementary information).
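As an illustration of how such metrics combine temperature and humidity, the sketch below uses the forms of sWBGT and Humidex commonly quoted in the literature (e.g. as summarised by Buzan et al 2015). The coefficients, function names and argument names are assumptions for illustration only; the exact formulations used in this study are those given in the supplementary information. Temperature is assumed to be in °C and vapour pressure in hPa.

```python
def swbgt(t_max_c, vapour_pressure_hpa):
    """Simplified wet bulb globe temperature (commonly quoted form; assumed here)."""
    return 0.567 * t_max_c + 0.393 * vapour_pressure_hpa + 3.94


def humidex(t_max_c, vapour_pressure_hpa):
    """Humidex (commonly quoted form; assumed here)."""
    return t_max_c + (5.0 / 9.0) * (vapour_pressure_hpa - 10.0)


# Example: a 30 degC afternoon with 20 hPa vapour pressure
print(swbgt(30.0, 20.0), humidex(30.0, 20.0))
```

In these assumed forms, the weight on vapour pressure relative to the weight on temperature is larger for sWBGT (0.393/0.567 ≈ 0.69) than for Humidex (5/9 ≈ 0.56).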
As discussed in section 1, heatwaves can be generalised as n or more days exceeding a given temperature threshold, x. For each dataset, heatwave events were assessed in terms of their frequency and spatial extent. Four intensity thresholds (x) were used: the 95th, 98th, 99th and 99.5th percentiles, along with four duration thresholds (n): 2, 4, 6 and 8 or more consecutive days, for T_max and each of the three heat stress metrics. Percentiles are calculated for each grid cell over the reference period 1981-2018. Heatwave Exposure, HE(x,n), is defined for each metric as the average fractional area of the UK that experiences a heatwave exceeding the x intensity and n duration thresholds per summer. This metric allows the modelled heatwave variability to be compared to the observed. For reference, the observed frequencies of events of different example HE magnitudes and intensities are shown in supplementary figure 1.
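The HE(x,n) definition above can be sketched as follows. This is one possible reading rather than the authors' implementation: it assumes equal-area grid cells, counts a cell once per summer if it contains at least one run of n or more consecutive days above its own percentile threshold, and averages the resulting area fraction over summers. The function names and the array layout (summers, days, cells) are hypothetical.

```python
import numpy as np


def longest_run(exceed):
    """Longest run of consecutive True values along axis 0 (days), per cell."""
    run = np.zeros(exceed.shape[1:], dtype=int)
    best = np.zeros_like(run)
    for day in exceed:
        run = np.where(day, run + 1, 0)  # extend the run or reset it to zero
        best = np.maximum(best, run)
    return best


def heatwave_exposure(tmax, pct=95.0, n_days=2):
    """Mean fraction of cells per summer with a run of >= n_days above threshold.

    tmax: array of shape (summers, days, cells); cells assumed equal in area.
    """
    # Percentile threshold per grid cell over the whole reference period
    threshold = np.percentile(tmax, pct, axis=(0, 1))
    fractions = []
    for summer in tmax:
        hit = longest_run(summer > threshold) >= n_days
        fractions.append(hit.mean())
    return float(np.mean(fractions))
```

On a real latitude-longitude or projected grid, the simple mean over cells would need to be replaced by an area-weighted mean restricted to UK land points.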

Dataset overview
The frequency distribution of average T_max over the Greater London region for 1981-2000 summers is shown for all datasets and UKCP18 subsets in figure 1(a). Other UK administrative regions are shown in supplementary figure 2. There is a notable offset of 1 °C-2 °C between ERA5 and HadUK-Grid data due to ERA5 having more muted diurnal variability, and hence lower T_max values compared to HadUK-Grid. HadUK-Grid also records lower daily minimum temperatures than ERA5 (shown in supplementary figure 3). This muted variability in ERA5 is likely due to the daily maximum and minimum values being calculated here from hourly data. For the period 1981-2000, the overall distribution of T_max values from the GCM and RCM subsets falls between HadUK-Grid and ERA5. CPM simulations generally fall closer to the HadUK-Grid observations, with a slight warm bias at the highest temperatures, while CMIP5 models are more similar to ERA5 values but with a greater spread and a notable overestimation of the fraction of days at low temperatures.
Resolution-dependent differences in HadUK-Grid T_max distributions are most notable between the 60 km (blue line) and 12 km (grey line) datasets. This is related to errors introduced during aggregation of the data (Hollis et al 2019). Supplementary figure 2 shows that in Wales, colder temperatures are captured in the CPM subset compared to the GCM and RCM subsets, likely due to the finer representation of upland areas. Overestimation at the higher end of the T_max range shown for Greater London by the CPM subset (figure 1(a)) is also most pronounced across southern and eastern regions of England compared to GCM and RCM subsets (supplementary figure 2). This regional warm bias is possibly due to reduced cloud cover and soil moisture in the CPM.
Figure 1(b) shows a percentile-percentile (P-P) plot for each of the UKCP18 subsets and ERA5 in comparison to HadUK-Grid observations for the Greater London region. ERA5 does not capture as much variability in daily summer T_max as HadUK-Grid observations, particularly at higher percentiles. The CPM subset generally shows good agreement with HadUK-Grid, with only a slight cold bias at low temperatures and a slight warm bias at high temperatures. The GCM and RCM subsets show a reasonably uniform cold bias across all percentiles. CMIP5 models show greater variability than HadUK-Grid, with a significant cold bias at low temperatures, but only a minor cold bias at high temperatures. Although figure 1(b) shows there are general biases in the model and reanalysis data compared to the observations, there are no major inflections at the upper percentiles.
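A comparison of this kind can be sketched by evaluating matching percentiles of two samples of daily summer T_max and differencing them. The function below is a minimal illustration under that assumption; the function name, the default percentile range and the input arrays are hypothetical and not taken from the study.

```python
import numpy as np


def percentile_bias(model, obs, percentiles=np.arange(1, 100)):
    """Return observed and modelled values at matching percentiles, and their difference."""
    obs_q = np.percentile(obs, percentiles)
    model_q = np.percentile(model, percentiles)
    # Negative values indicate a cold bias at that percentile, positive a warm bias
    return obs_q, model_q, model_q - obs_q
```

Plotting model_q against obs_q gives the percentile comparison, and a bias that is roughly constant across percentiles corresponds to a uniform offset of the kind described for the GCM and RCM subsets.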

Evaluation of model performance
The relative root mean square error (RMSE) for selected UKCP18 climate variables and derived heat stress metrics compared to HadUK-Grid and ERA5 is shown in figure 2. RMSE is shown for the mean (RMSE_mean), 95th percentile (RMSE_95) and standard deviation (RMSE_SD) of all summer days for each variable or metric. The multi-model mean (MMM) is also shown for each UKCP18 subset. Summary statistics derived from this evaluation are presented in supplementary tables 1-4.
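The three absolute scores can be sketched as a spatial RMSE between per-cell statistics of a model and a reference dataset on a common grid. This is a minimal illustration assuming unweighted grid cells and arrays of shape (days, cells); the function name is hypothetical, and the normalisation used to obtain the relative RMSE in figure 2 is not reproduced here.

```python
import numpy as np


def rmse_scores(model, ref):
    """Spatial RMSE of the per-cell mean, 95th percentile and standard deviation.

    model, ref: arrays of shape (days, cells) on the same grid.
    """
    stats = {
        "mean": lambda a: a.mean(axis=0),
        "p95": lambda a: np.percentile(a, 95, axis=0),
        "sd": lambda a: a.std(axis=0),
    }
    scores = {}
    for name, stat in stats.items():
        diff = stat(model) - stat(ref)  # per-cell error in this statistic
        scores[name] = float(np.sqrt(np.mean(diff ** 2)))
    return scores
```

A model with a uniform 1 °C offset and otherwise identical variability would score 1 for the mean and 95th percentile and 0 for the standard deviation, which is why the three scores are reported separately.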
Averaged across all members for each resolution, RMSE scores are reasonably consistent between the GCM, RCM and CPM subsets (e.g. absolute T_max RMSE_mean range 1.03 °C-1.35 °C; supplementary tables 1 and 2). Taken together across all climate variables and RMSE metrics, model performance improves slightly with increasing resolution compared to HadUK-Grid and ERA5, with the CPM having absolute RMSE values ∼2%-5% less on average than the GCM and RCM subsets. The subset with the lowest RMSE_mean and RMSE_95 depends on whether ERA5 or HadUK-Grid is used as a benchmark. For example, the absolute T_max RMSE_mean compared to ERA5 is least for the RCM (1.031 °C), whereas compared to HadUK-Grid it is least for the CPM (1.089 °C). RMSE_SD shows the most consistent improvements with increasing resolution compared to both reference datasets.
The range in relative and absolute RMSE values is generally much greater between individual ensemble members averaged across resolution classes (e.g. T_max RMSE_mean ranges 0.68 °C-2.36 °C for different ensemble members; supplementary tables 3 and 4). Again, which members perform well depends to some extent on which dataset is used as a benchmark. Certain members perform relatively poorly at all resolutions (e.g. member 12 at all resolutions is poor for most variables in terms of the RMSE_mean compared to both ERA5 and HadUK-Grid), while others perform well compared to only one dataset (e.g. member 15 performs well for all RMSE metrics and variables compared to HadUK-Grid, but poorly compared to ERA5). It is likely that the variability between subset members is due to errors introduced in the GCM carrying through into the higher resolution RCM and CPM that it forces. The similarity in errors between CPM simulations and the equivalent GCM and RCM simulations suggests this is the case. As noted in Kendon et al (2019), the CPM simulations do not have perturbed physics, meaning the major differences between these simulations come from the driving model.
There is more variability between CMIP5 model simulations, with this subset typically having absolute RMSE values ∼50% larger than the other subsets (see supplementary table 1). However, there are exceptions; for example CNRM-CM5 performs comparably to many of the GCM simulations.
Zonal mean T_max biases in the 95th percentile for GCM, RCM and CPM simulations generally show an exaggerated latitudinal temperature gradient, with northern regions too cold and southern regions too hot compared to ERA5 and HadUK-Grid (supplementary figures 4(a)-(c)). HadUK-Grid zonal mean T_max is ∼2 °C warmer in general than ERA5, so simulations are more consistent with HadUK-Grid further south, but produce a greater cold bias in the north of the UK. At all resolutions there are some simulations which are anomalously warm or cold across all latitudes. Spatial distributions of summer T_max biases highlight the same meridional structure but also reveal larger differences relative to the observations in coastal and upland areas (supplementary figures 4(d)-(f)).

Heatwave variability
In addition to capturing general climatic characteristics of the UK over the recent past, it is desirable that the UKCP18 simulations also capture a realistic magnitude and frequency of heatwave events. The mean summer UK Heatwave Exposure, HE, was calculated for the period 1981-2018 for HadUK-Grid, ERA5, GCM, RCM and CMIP5 simulations in terms of T_max and sWBGT. As shown in figures 3(a) and (b) for HadUK-Grid, shorter, less intense UK heatwaves are more common than longer, more intense heatwaves (e.g. in terms of T_max, HE(95,2) is 0.40, while HE(98,4) is 0.05).
The error in HE between the GCM, RCM and CMIP5 simulations relative to HadUK-Grid is shown in figures 3(c)-(h). Model subsets, particularly CMIP5, generally overestimate heatwave events of all durations and intensities compared to HadUK-Grid (and ERA5; not shown) when heatwaves are defined in terms of T_max. Given the infrequent nature of longer, more intense events, in relative terms the model HE error is generally greater for more extreme events, as shown in supplementary figure 5. Short-lived events (2-3 d up to the 99th percentile) and less intense events (95th percentile up to 6-7 d) are best simulated by UKCP18.
When heatwaves are defined in terms of sWBGT (or other heat stress metrics; not shown), the general observed characteristics of HE remain the same (figure 3(b)); however, the errors between the model subsets relative to the observations change (figures 3(d), (f), (h)). Using sWBGT generally reduces the HE overestimation seen for T_max, particularly in CMIP5 simulations. For the GCM and RCM subsets, there is a small improvement, particularly for more intense events; however, longer, less intense events (95th percentile for 2-7 d) tend to be underestimated. In general for both T_max and sWBGT, the RCM subset has fractionally the lowest error in HE and the CMIP5 models have the highest.

Annual variability and long-term trends
Figure 4 shows the relationship between annual mean and 95th percentile of UK summer T_max and sWBGT for each dataset for the period 1981-2000. Details on the linear best fit for each of the datasets shown in figure 4 are listed in supplementary table 5. In terms of T_max, shown in figure 4(a), all datasets show a qualitatively similar positive relationship, with the 95th percentile showing amplified interannual variability compared to the summer mean. Both HadUK-Grid and ERA5 show amplification >60% (e.g. the gradient of the linear fit between the T_max mean and 95th percentile is 1.61 for HadUK-Grid). For the GCM, RCM and CPM subsets, the amplification is ∼33% and for the CMIP5 subset the amplification is ∼25%.
In terms of sWBGT, shown in figure 4(b), a similar positive relationship is shown. The amplification of summer 95th percentile sWBGT compared to summer mean sWBGT is again greatest for HadUK-Grid and ERA5 (57%-66%), and again the model subsets substantially underestimate the amplification in summer 95th percentile as a function of summer mean values (ranging from 14% to 18%).
UKCP18 simulations also provide projections under RCP8.5 up to 2080 for all models. The relationship between the projected annual global mean surface temperature (GMST) and both UK summer T_max mean and 95th percentile using 20-year climate averages is shown in figure 5. The differences in the rate of UK warming from each subset relative to the GMST are shown in figure 5(c). The magnitude of the long-term change in climate is much smaller than the interannual variability shown in figure 4.
The trendlines in figure 5(a) show that modelled UK summer mean T_max increases 13%-24% faster than the GMST, while modelled UK summer 95th percentile T_max increases much faster still, with model subsets ranging from 54% to 62% faster than the GMST (trendlines in figure 5(b)).
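Percentages of this kind can be read as the gradient of a linear fit expressed relative to a one-to-one relationship: a gradient of 1.61, as quoted for the HadUK-Grid fit between summer mean and 95th percentile T_max, corresponds to 61% amplification. The sketch below assumes an ordinary least-squares fit; the function name and the synthetic inputs are hypothetical.

```python
import numpy as np


def percent_faster(x, y):
    """Gradient of a least-squares linear fit of y on x, and its excess over 1 in percent."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope, (slope - 1.0) * 100.0


# Synthetic example: y varies 1.61 times as much as x
x = np.linspace(14.0, 16.0, 10)
slope, pct = percent_faster(x, 1.61 * x + 3.0)
print(slope, pct)
```

Here x would be, for example, the summer mean T_max or the GMST, and y the quantity whose relative rate of change is being assessed.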
To further assess the robustness of these modelled changes, HadUK-Grid observations of UK summer T_max were compared with Berkeley Earth estimates for annual GMST for the period 1960-2018 (Rohde et al 2013), as shown in figure 6. Using a 20-year moving climate window, the observations suggest a reasonably strong linear fit between UK summer mean T_max and annual GMST (R-squared of 0.94; figure 6(a)), with the UK summer mean warming 57% faster than GMST, considerably more than is found in the model simulations (figure 5(c)).
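The moving-window comparison can be sketched as below: both annual series are smoothed with a 20-year running mean and a linear fit is then made between the smoothed values. This is a minimal illustration with hypothetical function names, assuming complete annual series of equal length; it is not the authors' processing code.

```python
import numpy as np


def moving_average(series, window=20):
    """Running mean over complete windows only (output is shorter than the input)."""
    kernel = np.ones(window) / window
    return np.convolve(series, kernel, mode="valid")


def fit_r_squared(x, y):
    """Gradient of a least-squares linear fit of y on x, and the R-squared of that fit."""
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r ** 2
```

Because consecutive 20-year windows overlap heavily, the smoothed points are not independent, so an R-squared obtained this way should be interpreted with that in mind.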
Observed UK summer 95th percentile T_max was found to warm 43% faster than the GMST (i.e. less than the UK summer mean; figure 6(b)). This is likely related to the observed UK summer T_max 95th percentile having strong decadal variability associated with the NAO (figure 6(c)) (Folland et al 2009, Sanderson et al 2017, National Oceanic and Atmospheric Administration 2020) overprinted on the long-term warming trend. Linear trends presented for the observed rate of UK summer 95th percentile warming in figure 5(c) should therefore be treated as an approximation with high uncertainty.

Discussion and conclusions
This study provides a detailed evaluation of heatwaves and heat stress metrics in UKCP18 over the recent past, showing the impact of horizontal resolution on model performance for a range of heat-related variables. GCM (HadGEM3-GC3.05) simulations are found to perform as well as or better than CMIP5 models for UK heat-related variables, generally lying within the range of offset between ERA5 and HadUK-Grid observation-based estimates. It may be preferable for model simulations to perform better compared to HadUK-Grid or ERA5 depending on what variable (including spatial and temporal extent) is being assessed. For example, T_max is derived from hourly data for ERA5 and therefore could underestimate the variability observed in HadUK-Grid, whereas vapour pressure is only available at monthly temporal resolution for HadUK-Grid compared to daily resolution in ERA5.
The increased spatial resolution of the RCM and CPM offers significant potential for improvement of climate projections over the UK for certain applications, for example in capturing variability in coastal and upland regions, as has previously been shown for other high resolution modelling experiments (Qiu et al 2020). Additionally, although not assessed here, urban heat island effects are clearly important for UK summer heat extremes and could be more appropriately represented at higher resolution.
Generally, on a national scale, increasing model spatial resolution yields incremental improvements in model performance. For example, our results indicate that modelled variability improves with increasing resolution, with both a decrease in RMSE_SD for UK summer climate variables and a fractional improvement in HE in the RCM compared to the GCM. However, increased resolution does not result in major systematic improvements across all aspects of the simulations. There are likely a number of reasons for this.
The majority of large-scale variability in the RCM and CPM simulations likely originates from the GCM simulation which is used to drive them, resulting in a similar performance across resolutions when evaluated on a large scale. Additionally, it is important to note that the lower resolution simulations (GCM and CMIP5) were evaluated against HadUK-Grid and ERA5 data that had been regridded to the same lower spatial scale. Therefore, models were not penalised for the smoothing out of localised variability that occurs when spatially aggregating data. Future evaluation work should focus on whether higher resolution produces better reproduction of dynamical features that contribute to heatwaves, such as blocking events. Kendon et al (2019) suggest the RCM and CPM are expected to behave similarly in this regard; however, a thorough evaluation of this is beyond the scope of the current study. Soil moisture availability is noted to behave differently between the CPM and RCM and this can have implications for the amplification of heatwaves (Miralles et al 2012, Petch et al 2020). Finally, as shown here, it is important to consider whether dominant modes of climate variability such as the NAO are well captured by models when considering UK summer temperature extremes.
Focussing specifically on heatwave events, defined in terms of their intensity and duration, the models show overestimations of past heatwave variability when defined in terms of T_max. However, this overestimation is generally reduced when defined in terms of heat stress metrics such as sWBGT, particularly in CMIP5 models. This is consistent with other global studies which show that uncertainty in high percentiles of heat stress is reduced by compensation between the errors of extreme temperature and humidity (Fischer and Knutti 2013). Similar uncertainties in temperature and Humidex extremes over Europe have been previously reported between RCMs and reanalysis (Scoccimarro et al 2017). In our study, it is shown that the most extreme events (in terms of intensity and duration) have relatively larger uncertainties, although this could be the result of the relatively small sample size of such events in the historical record. Given the large relative errors for more extreme events, it may be inappropriate to use UKCP18 subsets to analyse future extreme events above a certain magnitude (found to be heatwaves exceeding the 98th percentile for four or more days using the HE method presented here). Large ensembles and extreme value statistics could be better suited for assessing more extreme events (Sippel et al 2015, Suarez-Gutierrez et al 2020).
Previous research using CMIP5 models has shown that extremes in temperature are projected to warm faster than the annual mean temperature (Seneviratne et al 2016). Assessment of T_max in UKCP18 showed that for the recent past (1981-2000), the UK summer 95th percentile has amplified interannual variability compared to the UK summer mean T_max, in qualitative agreement with observational and reanalysis data, although the magnitude of the amplification is underestimated. The magnitude and impacts of past UK heatwaves have already been shown to have been influenced by greenhouse gas emissions (Mitchell et al 2016, McCarthy et al 2019b). When assessing projected long-term UK summer T_max in relation to the GMST, in UKCP18 both the summer mean and 95th percentile temperatures are projected to warm faster than the GMST. This suggests that UK heatwaves could be further amplified with future warming. A process-based understanding of the differences between these models and observations should be a priority in future research, and systematic biases should be carefully considered when using the UKCP18 simulations in impacts studies.

Acknowledgments
AKA, OA and RW acknowledge support from UK Natural Environment Research Council (NERC) grant NE/S017267/1. We acknowledge the UK Meteorological Office for their work in preparing and making available the UKCP18 and HadUK-Grid data. ERA5 data was generated using Copernicus Climate Change Service information (2020).

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: http://data.ceda.ac.uk/badc/ukcp18/data.