Estimating heat stress from climate-based indicators: present-day biases and future spreads in the CMIP5 global climate model ensemble

The increased exposure of human populations to heat stress is one of the likely consequences of global warming, and it has detrimental effects on health and labor capacity. Here, we consider the evolution of heat stress under climate change using 21 general circulation models (GCMs). Three heat stress indicators, based on both temperature and humidity conditions, are used to investigate present-day model biases and spreads in future climate projections. Present day estimates of heat stress indicators from observational data shows that humid tropical areas tend to experience more frequent heat stress than other regions do, with a total frequency of heat stress 250–300 d yr−1. The most severe heat stress is found in the Sahel and south India. Present-day GCM simulations tend to underestimate heat stress over the tropics due to dry and cold model biases. The model based estimates are in better agreement with observation in mid to high latitudes, but this is due to compensating errors in humidity and temperature. The severity of heat stress is projected to increase by the end of the century under climate change scenario RCP8.5, reaching unprecedented levels in some regions compared with observations. An analysis of the different factors contributing to the total spread of projected heat stress shows that spread is primarily driven by the choice of GCMs rather than the choice of indicators, even when the simulated indicators are bias-corrected. This supports the utility of the multi-model ensemble approach to assess the impacts of climate change on heat stress.


Introduction
Human health is highly sensitive to the thermal environment (WHO 2008). Extreme heat events lead to heat stress and can increase morbidity and mortality (Garcia-Herrera et al 2010, NOAA Watch 2014) as well as losses of work productivity (Kjellstrom et al 2009, Singh et al 2015. Global temperatures are very likely to continue rising in the foreseeable future (IPCC 2013). Following the Clausius-Clapeyron relationship, absolute atmospheric humidity is projected increase with elevated temperatures over a large portion of the globe, and should worsen the physiological effects of high temperature (Epstein and Moran 2006). In a warmer and wetter world, heat stress and its related consequences are expected to increase globally.
Heat stress depends on environmental and personal factors. With different physiological assumptions, many heat stress indicators that combine air temperature and other important factors, such as humidity, radiation and wind speed, have been developed to estimate heat stress (Fanger 1970, Parson 2006, Jendritzky and Tinz 2009. These indicators vary widely in their level of complexity, input parameters, and intended end-user (Buzan et al 2015). It is important to use 'relatively straightforward and acceptable heat stress indicators to generate economical, practical and universal approach to heat stress assessment' (Spector and Sheffield 2014).
Although there is a growing body of literature related to the response of heat stress to climate change at the regional and global scales (Kjellstrom et al 2009, Fischer and Schär 2010, Willett and Sherwood 2012, Dunne et al 2013, Fischer and Knutti 2013, Oleson et al 2013, information is still lacking on the associated uncertainties. Many studies consider only one climate model (Jendritzky and Tinz 2009, Willett and Sherwood 2012, Dunne et al 2013, and only a few consider multi-model ensembles (Kjellstrom et al 2009, Fischer andKnutti 2013). Although many heat stress indicators exist, most studies use only a single indicator, one study uses two indicators and two climate scenarios (Fischer and Knutti 2013), and to our knowledge, only one has compared more than two indicators at selected cities (Oleson et al 2013). However, this study uses a single climate model, and the uncertainties related to general circulation models (GCMs) have largely been overlooked, as synthesized by Buzan et al (2015), who compared nine different heat stress indicators calculated from the present-day simulation by one GCM. There is thus a need to perform more consistent assessments of climate change impacts on heat stress by accounting, as far as possible, for uncertainties in both climate models and heat stress indicators.
To this end, we use 21GCMs involved in the framework of the fifth phase of the coupled model intercomparison project (CMIP5; Taylor et al 2012) to estimate the modeled heat stress at present-day  and its projection for the end of the 21st century. Heat stress is derived from three temperature-humidity related indicators (Masterton andRichardson 1979, Steadman 1984, Australian Bureau of Meteorology (ABOM) 2010), which are still widely used today by several national meteorological services around the world. We first investigate the consistency among these three indicators in regards to extreme heat events and spatial distribution. We then analyze the accuracy of CMIP5 models to reproduce such heat stress patterns and attribute their biases to humidity and temperature patterns in the climate models. Finally, we compute future projections of heat stress using several GCMs and the three indicators to provide an assessment of the evolution of heat stress under climate change and to systematically quantify the associated uncertainty.
The dataset and methods are briefly described in section 2, and details are provided in the supplementary materials (referred to as SM). Section 3 compares the simulated heat stress for the present day with reference gridded observations, and projected future changes are analyzed in section 4. A summary and discussion conclude the study in section 5.
2. Data and method 2.1. Meteorological data Two variables are required to calculate the chosen heat stress indicators: surface air temperature (T) and atmospheric vapor pressure (VP). In the datasets that we used (see below), VP is not provided and is derived from T, specific humidity (q) and surface pressure (P surf ), using the equation by Buck (1996). P surf itself is not provided for each GCM; therefore, it is derived from sea-level pressure (SLP), temperature and altitude assuming adiabatic conditions (Hempel et al 2013). We tested the sensitivity of heat stress indicators to different formulas to convert SLP to P surf and found a negligible effect, except in high-altitude areas where heat stress rarely occurs.
We used the WATCH-Forcing-Data-ERA-Interim (WFDEI) climate dataset (Weedon et al 2014) to provide an observation-based estimate of the heat stress indicators. This dataset has the advantage of a high spatiotemporal resolution (at 0.5°× 0.5°resolution for the global land surface and at sub-daily/daily time steps for the period 1979-2005) and a wide use for land surface modeling forcing. This dataset was generated by applying bias correction to the ERA-Interim reanalysis product (Dee et al 2011), following the same methodology implemented for the widely used WATCH Forcing Data (WFD; Weedon et al 2011). WFDEI includes a correction of T using the Climate Research Unit TS3.1/ TS3.101 data set (CRU; Mitchell and Jones 2005). Assuming that the relative humidity is unchanged, the specific humidity (q) is modified to follow the correction of T based on the Clausius-Clapeyron relationship. CRU observations of monthly atmospheric humidity, however, were not used for direct bias correction to maintain consistency between variables on a daily time scale (Weedon et al 2011). Thus we need to keep in mind that the consistency of T and q in longterm may not be preserved. To be consistent with the GCMs' output availability (see below), we retained the daily means of T, q and P surf over the period 1979-2005. Future climate projections can only be obtained using model simulations, we used the daily outputs of 21 GCMs participating in CMIP5 ( Taylor et al 2012,  see SM table S1). T and VP were bi-linearly interpolated onto the WFDEI 0.5°× 0.5°grid to compare the observations and model outputs. Heat stress indicators were calculated using each model's original grids and are then bi-linearly interpolated to the 0.5°× 0.5°spatial resolution to maintain the model's internal consistency between humidity and temperature. Evaluation of modeled heat stress was based on the period 1979-2005 in historical simulations (HIST), considered a reference. We assessed the projected heat stress indicators under the Representative Concentration Pathway with an anthropogenic radiative forcing of approximately 8.5 W m −2 in 2100 (RCP8.5) which represents the most severe climate change scenario.
In this paper, we refer to the 'grand ensemble mean' and 'grand standard deviation' as the ensemble mean and standard deviation over three (indicators) times 21 (GCMs) simulations. For each heat stress indicator, we also consider the ensemble mean and standard deviation of heat stress indicators and related variables based on 21 GCMs.

Choice of heat stress indicators and metrics
We chose three widely used heat stress indicators: Humidex (HD), Simplified Wet-bulb Global Temperature (W) and Apparent Temperature (AT). These indicators have different assumptions and thus have limitations in terms of quantifying universally applicable levels of heat stress. They are defined in table 1, with details presented in SM DM1. Buzan et al (2015) compared these indicators with six others, but our AT slightly differs because we do not include the effect of wind to limit the uncertainty of heat indicators calculated from GCMs.
In the following, we focus on two metrics:

Extreme mean
Following the methodology of Fischer and Knutti (2013), the extreme mean is defined as the mean of all values exceeding the 95% percentile which correspond to the values of the 5% hottest days. Very similar results were obtained by using the 99% percentile. Extreme mean is computed on temperature (hereafter T 5% ) and the three heat stress indicators (hereafter AT 5% , HD 5% and W 5% ), For HD 5% for example, the corresponding simultaneous temperature and VP means are referred as T HD5% and VP HD5% , respectively. Similar definitions are adopted for T AT5% and VP AT5% , as well as T W5% and VP W5% .

Heat stress classes
Each heat stress indicator has its own thresholds to quantify four levels of heat stress: 'slight', 'moderate', 'strong' and 'extreme' (table 1 and SM DM1). For each land point, we calculated the frequency of each class over the years of a given period (present-day or future) and then analyze the pluriannual mean frequency (in d yr −1 ). We also estimated the spatial coverage of the four heat stress classes, based on the areal fraction of total land with at least 1 d yr −1 in the specified heat stress class. WFDEI contains 67 209 land points outside Antarctica, accounting for approximately 90.6% of Earth's land surface. Throughout this study, our land coverage refers to % over these WFDEI's land points.

Error and spread analyses
In the following, a particular emphasis is put on quantifying the different sources of uncertainty affecting the heat stress indicators calculated from GCMs, in both present-day and future climates, using two statistical methods relying on variance decomposition.

Error attribution analyses
Through factorial experiments, the total error for each modeled heat stress indicator was decomposed to T-, VP-attributed error, and the covariance between them, following Zhao et al (2012). A negative covariance implies a compensation effect (the errors in T and VP cancel each other), whereas a positive covariance implies an additive effect (the errors of T and VP amplify each other). Details are given in SM DM2.

Analysis of variance (ANOVA)
We used the ANOVA method (von Storch and Zwiers 1999) to decompose the pooled spread of heat stress projections into contributions from GCMs, heat stress indicators, and their interactions. To reduce the risk of underestimating variance in small sample sizes, a subsampling technique was adopted following the algorithms of Bosshard et al (2013). Details are given in SM DM3.
3. Heat stress at present day 3.1. Extreme mean Based on the WFDEI reference, the hottest regions in terms of T 5% (figure 1(a)) are over subtropical desert regions, which have hot and dry climates. Compared with T 5% , observational based heat stress indicators  Steadman (1984) show lower values in subtropical dry areas and higher values over humid tropics and subtropics, such as southeastern China and the southeastern US (figures 1(b)-(d)). The spatial differences of heat stress indicators with temperatures are consistent with the distribution of VP (figure 1(e)). Indeed, heat stress indicators show large positive differences with temperature at locations where the humidity values are the largest around the Equator (figure 1(e), SM figure S1 (a)). We note that HD has much higher absolute values than temperature and the other two indicators because of its construction (table 1). When calculated from CMIP5 HIST and compared to the above reference, HD 5% tends to be underestimated globally, especially over low latitudes (figure 1(j)), due to cold biases in T 5% (figure 1(g)) and dry biases in humidity (figure 1(k)) in these regions. Over dry regions, such as the Sahara desert and the Western US, VP 5% is overestimated, which leads to the contrast in HD 5% between dry and wet regions to be smaller in the simulations than in the WFDEI reference. In central Eurasia, there is no significant bias for HD 5% because the effect of overestimated temperature is cancelled by a dry bias in humidity. Similar patterns of biases are also found for AT 5% and W 5% , despite some minor differences.
To better understand the contributions of biases in T and VP on heat stress indicators, we performed an error attribution analysis (SM DM2). A term measuring the covariance relationship between T and VP is referred as 'offset' (see equation (5) in SM DM2 for details). A negative value of it indicates that the effects of model biases in T and VP tend to counteract one another for heat indicators (referred as compensation effect), whereas a positive covariance indicates that biases will further degrade the modeled heat indicators (referred to as the additive effect). It shows (figure 2) that at mid-high latitudes, the error contribution of HD 5% is larger by T (ca 110%) more than VP (90%) and that these error sources largely cancel difference between extreme mean apparent temperature AT 5% and T 5% ; (c) difference between extreme mean simplified WBGT W 5% and T 5% ; (d) difference between extreme mean Humidex HD 5% and T 5% ; (e) extreme mean vapor pressure VP HD5% corresponding to HD 5% . In the middle and right columns from top to bottom are the ensemble-mean biases/change of the extreme means variables: (g), (m) T 5% ; (h), (n) AT 5% ; (i), (o) W 5% ; (j), (p) HD 5% (k), (q) VP HD5% . Areas with dots indicate regions with robust bias/change (at least 18 models agree on the sign of bias). Units are in°C for T 5% , AT 5% , W 5% and HD 5% and hPa for VP HD5% . each other (−90% in the offset term). In contrast, in the tropics and subtropics, the errors attributed to VP (70%) are larger than or comparable to T (ca 65%), with a weak additive effect (ca 20-40%), except in regions of the northern Amazon, Western Sahara and northern India, where some compensation effect is found (consistent with the reported biases of T and VP, see figure 1). We used zonal means to compare the T and VP contributions and offset the effect between HD and the other two heat stress indicators, AT and W (figure 2(d)). In general, the error contributions are very similar for HD and W (black and red lines, respectively), but AT (green lines) shows a different sensitivity to the biases of T and VP, which is consistent with the relatively smaller role of VP in the definition of AT compared with W and HD (table 1). Fischer and Knutti (2013) noted that the effects of model biases in temperature and humidity largely cancel each other for combined quantities in mid-continental land regions of the subtropics and midlatitudes but that such compensation effects are weak at low latitudes. Our study confirms these findings but also reveals that the effects of model biases may even reinforce each other at low latitudes. The compensation effect in mid-high latitudes is likely to be robust since it has been found using different datasets (WFDEI in our study, ERA-Interim and NCEP in Fischer and Knutti 2013). We might be more careful with the additive effects in hot-humid tropics which could depend on the reference data we chose.

Heat stress classes
Based on the estimation of HD, hot and fully humid equatorial areas, including the northwestern Amazon, Central Africa and Southeast Asia, experience almost year-round heat stress, with F total 250-300 d yr −1 ( figure 3(e)). Over the hot and dry deserts of the Sahara and Arabia, F total is approximately 100-150 d yr −1 . The warm and humid subtropical areas, such as southeastern China and the southeastern US, experience 60 d yr −1 of total heat stress. Heat stress seldom occurs in mid-high latitudes, such as northern North America and Europe, with F total < 5 d yr −1 .
Globally, 'Slight' is the dominant heat stress class ( figure 3(a)). Moderate heat stress occurs primarily in hot and humid areas, with F moderate ca 100-150 d yr −1 in parts of equatorial Amazon, North India, Indochina, South Asia, and parts of eastern Sahel. Strong heat stress (F strong ) is rare and can be found in the eastern part of the Sahel, North India, the Indochina peninsula and northern Australia ( figure 3(c)). Figure 2. Error attribution analysis for extreme mean heat indicators. Example case of HD 5% : (a) ratio of temperature-attributed error to total error E T /E tot ; (b) ratio of humidity-attributed error to total error E VP /E tot ; (c) ratio of offset to total error Offset/E tot . (d) Zonal means of the ratios of each error contribution to total error (E VP /E tot in dashed lines) for AT 5% , HD 5% and W 5% , in green, black and red lines, respectively. The spread of the error attribution among heat indicators is shaded in pink, blue and gray for the ratios of E T / E tot , E VP /E tot and offset/E tot , respectively.
Extreme stress is not observed, according to the standard of HD ( figure 3(d)).
Heat stress classes display similar spatial patterns with indicators AT and W as with HD (SM figures S2 and S3), but there are also significant discrepancies among them (table 2). In general, the estimates of heat stress based on HD and W are more consistent than those based on AT, for two main reasons: the thresholds used here for AT have not been validated as widely as those used for HD and W (SM DM1), and AT is an index for measuring indoor comfort, whereas HD and W were designed for outdoor working protection; thus, they have different severity thresholds. However, HD tends to underestimate the severity of extreme heat stress such that an event in an 'extreme' class according to W may be only 'Strong' according to HD. Consistent with the model biases on the extreme means, the model based total heat-stress frequency (F total ) is underestimated over humid tropics-subtropics while it is overestimated over regions with dry climates (figures 3(g)-(l)). Globally, the spatial coverage of heat stress is slightly overestimated (table 2). However, the biases can be different for different heatstress classes. For example, over the humid tropics, such as the Amazon and India, GCMs tend to underestimate F slight but overestimate F strong .

Projected change 4.1. Extreme mean
Changes in the projected heat stress variables between RCP8.5 (2070RCP8.5 ( -2099RCP8.5 ( ) and present-day (1979RCP8.5 ( -2005 are shown in figures 1(m)-(q) with global statistics in table 3. Although the absolute values of global mean change are different among indicators, the patterns are similar. The spatial correlation coefficient is 0.91 between AT 5% and HD 5% , 0.84 between AT 5% and W 5% , and 0.99 between HD 5% and W 5% . We thus use the grand ensemble mean change to discuss the largescale features. The spatial pattern of change in the grand ensemble mean of heat stress indicators resembles the ensemble mean change in T 5% , with a spatial correlation as high as 0.81 (figures 4(a), (b)). The global grand ensemble mean of heat stress indicator change is +5.8°C (table 3), with large increases (exceeding +6°C) at mid-latitudes over North America, the Mediterranean area, the southern Sahara and the northern Sahel, and the western Amazon (figure 4(a)). VP increases globally (figure 4(c)) with an increase in temperature ( figure 4(b)), but changes are smaller in the hotspots such as in the Mediterranean than in the humid tropics. Thus, changes in heat stress indicators are more geographically homogeneous than temperature, especially for W and HD, which give a higher weight to VP in their construction. As a result, the gradient between tropics and mid-high latitudes is smaller, as shown in the zonal means of the ensemble mean changes in AT 5% , W 5% , HD 5% and T 5% in SM figure S1(b).
The grand standard deviation of change in indicators (SD) is shown in figure 4(d). Its global mean is +1.49°C with large changes in the mid-high latitudes (40-60°N) in the Mediterranean and in the Amazon. Using the ANOVA decomposition of variance (SM DM3), we estimate the variance contribution in the grand ensemble from the spread of GCMs (SD GCM ), heat stress indicators (SD Indicator ) and the interaction between them (SD Interaction ). GCM-attribute is the major source of spread in the tropics, where it represents more than 50% of the total variance (figure 4(e)), but the indicator-attributed spread is more important in the mid-high latitudes of the Northern hemisphere (figure 4(f)). The interaction-attributed variance is negligible (ca 5%).

Heat stress classes
The total frequency of heat stress based on HD increases globally (figure 3(q)). Of course, the frequency of no-heat-stress decreases globally ( figure 3(r)). There are decreases in F slight and increases in F moderate in regions in the tropics and subtropics, which indicates that the dominant heat stress level shifts from the 'slight' to the 'moderate' class, or even more (figures 3(m), (n)). Strong heat stress (F strong ), which primarily occurs over South India and the Sahel at present day, will widely spread over tropics and subtropics. Extreme heat stress (F extreme ), which is absent in the HIST, appears in the future simulations, such as in northern India. Similar spatial patterns of changes in heat stress levels can also be found according to W and AT, but they differ in the frequency and severity of heat stress at regional scales.
The spreads of changes in heat stress attributed to GCMs and indicators are further quantified using ANOVA at six representative regions ( figure 5(a)). They experience heat stress at present-day and belong to various climatic zones, according to the Koppen-Geiger climate classification (Kottek et al 2006).
Given the systematic biases in heat stress estimation derived from the simulations of GCMs (section 3.2) and the dispersion of estimated heat stress among GCMs at present day and for the future (table 2, figure 4), we tested a statistical bias-correction (BC) approach to reduce the systematic biases, and constrain the projection of heat stress classes. Given the widespread compensation effect in the biases of T  1, figure 2), we proposed a new BC method. Instead of correcting T and VP separately as the conventional practice does, we modified the thresholds defining the heat stress classes, in each land point and for each GCM such that over the present-day period, the modeled heat stress events according to the new thresholds exactly match the observed (WFDEI) heat stress frequencies. These sets of modified thresholds were then applied to the future simulations, and resulted in 'bias-corrected' future heat  stress classes. More detailed information on this method is in SM DM4. Although the comprehensive evaluation of this BC method is beyond the scope of the present study, we observed that it reduces the dispersion among GCMs of the changes in heat stress frequency (figures 5(b)-(d)). We also found that the magnitude of the frequency change is not very different, regardless of whether BC is used, and that the sign of the change is preserved, with very few exceptions, mostly for the 'slight' class in the two subtropical regions ( figure 5(b)). This leads to a more systemic decrease of the 'slight' heat stress class frequency in the tropical band when using BC.
ANOVA has then been applied to projected heat stress derived both from the raw outputs of GCMs (noBC) and after BC. Without BC (figure 5(c)), the regional mean GCM-attributed standard deviation (SD GCM ) is 1-3 times higher than indicator-attributed standard deviation (SD Indicator ). After BC ( figure 5(d)), the amplitude of SD GCM is reduced depending on the climatic zones and the heat stress classes. For example, SD GCM is reduced about 20-40 day in Southeast Asia (SE.Asia) in Slight to Strong classes, and 2-5 day in Western Europe (W.Europe). In general it is reduced by 10 to 70% for all of the classes, whereas SD Indicator remains almost unchanged. Although the projected Figure 5. Regional variance analysis of the changes in annual mean frequency of heat-stress classes for the three heat stress indicators. (a) Location of the six selected regions; (b) change in regional grand ensemble mean frequencies of the four heat stress classes without bias-correction (bars in pink) and with bias-correction (bars in orange). One grand standard deviation around the grand ensemble mean is plotted as a black line. (c) The grand standard deviation of change in heat-stress frequency attributed to GCM, Indictor and the interactions between them without bias-correction of GCMs, (d) same as (c) but for bias-corrected heat-stress frequencies.
spreads still mainly come from GCMs, for example 40 days in SE.Asia and 5 days in W.Europe, accounting ca 50% of total SD in the 'slight' class, SD Indicator becomes the dominant term (ca 40-50% of total SD) in the 'strong' and 'extreme' classes, for example in SE.Asia, SD Indicator is 36 days accounting 45% of total SD. In the 'moderate' class, the spread attributed to GCM versus indicator varies by region but is more or less comparable.
In general, regardless of whether we use BC, the regional mean GCM-attributed spread decreases, and the indicator-attributed spread increases as the severity of heat stress increases. In Southeast China, Western Europe and central-eastern North America, SD GCM (ca 60% of total SD) is larger than SD Indicator (30%), except at the 'Extreme' class after BC, which suggests that the chosen heat stress indicators give consistent results for mild or warm and wet climates. In contrast, the larger SD Indicator in the Sahel, especially for the most severe classes, suggests these heat indicators may not all be suitable in hot semi-arid climates where the construction of heat stress indicators is very sensitive to the spread of humidity modeling. In Southeast Asia and the southern Amazon, spread from indicators and GCMs are comparable after BC, which indicates that caution is required when a heat stress indicator that has been validated in a mild climate is used in the hot and humid tropics in that the effect of humidity relative to temperature on heat stress is different between the two climatic zones.

Acclimatization and relative thresholds
Humans have remarkable ability to adapt heat stress through physiological and behavioral process (Epstein and Moran 2006). Heat acclimatization results in biological adaptations that reduce the negative effects of heat stress, thus people in hotter climate seem to be less sensitive to effects of high temperature (Fouillet et al 2008, Ballester et al 2011. The method of applying the same thresholds globally has thus its shortcoming. As a complement, we examine heat stress using relative thresholds based on the 75th, 95th and 99th percentiles of heat stress indicators over the whole period at present day and the future (SM figures (4)- (6)). The regional mean values of HD averaged over six selected regions are shown in figure 6. It shows that the range of absolute HD values from 75th to 99th percentile is narrow (ca 35-40°C) in low latitudes as Southeast Asia, southern Amazon and Sahel, but it is wide in subtropical and mid-high latitudes. In low latitude the 75th percentile of HD in the future is even higher than the 99th percentile at present day. In the subtropics and mid-high latitudes, the 95th percentile in the future is equivalent to the 99th percentile at present day. Similar results are found using the two other heat stress indicators AT and W (SM figures 7(a), (b)).

Conclusions and discussion
We used three temperature-humidity related heat stress indicators (apparent temperature, AT; Humidex, HD; simplified wet-bulb global temperature, W) and 21 climate simulations from CMIP5 (historical and RCP8.5) to examine how well climate models simulate present-day heat-stress distribution on a global scale and how the latter may evolve in the future. In particular, we investigated how GCMs and indicators contribute to heat stress projection spread. The primary findings of this study are as follows: • Using observational meteorological data, humid tropical areas tend to experience more frequent heat stress than do other regions, with a total frequency of heat stress 250-300 d yr −1 , and the most severe heat stress is found in the Sahel and southern India. The estimate of heat stress is more consistent between HD and W that AT because the designed goal of AT is not for working protection under severe environments, unlike HD and W (SM DM1). Direct comparisons of our estimations with those from Buzan et al (2015) are difficult to make because of the difference in the examined metrics, but the general patterns are very similar and include the hotspots in the Sahel and India.
• GCMs tend to underestimate heat stress over the tropics due to dry and cold biases. Over mid to high latitudes, heat stress is well estimated due to a compensation effect between biases in humidity and temperature, consistent with previous findings (Willett andSherwood 2012, Fischer andKnutti 2013). Overall, temperature contributes more to the biases of heat stress indicators than humidity, except over hot and humid regions. Our results confirm the conclusions that temperaturehumidity related indicators tend to experience extreme high values attributed to moisture at low latitude and attributed to temperature at high latitudes (Buzan et al 2015).
• Globally, the severity of heat stress increases by one class under RCP8.5 by the end of the century. Heat stress is projected to significantly increase over tropical and subtropical humid areas, as noted in previous studies (Kjellstrom et al 2009, Fischer and Knutti 2013, Oleson et al 2013, although temperature is not projected to increase as much as in midlatitudes. • The spread of modeled local heat stress attributed to the GCMs is approximately 1-3 times larger than the spread caused by the choice of heat stress indicators, but the indicator-attributed spread is larger at the most dangerous level ('extreme' class).
In mid-latitudes and subtropics with mild or warm and wet climates, the spread of the estimated heat stress comes largely from GCMs, whereas in the hot and humid tropics or regions with arid climates, the spread attributed to GCMs and indicators is comparable.
• The proposed BC method slightly modifies the magnitude of the change in heat stress frequencies, but leaves the patterns little changed. Though the methods has been proven useful in reducing the overall spread in heat stress changes, GCMs remain the largest spread factor, with the exception of the most severe heat stress classes, in which the choice of indicator becomes dominant. This supports the utility of the multi-model ensemble approach to assess the impacts of climate change on heat stress.
• The dispersion between heat stress indicators, with and without BC, varies geographically, which raises the question of their generic use at the global scale, as also questioned by previous studies focused on the present-day climate (Blazejczyk et al 2012, Bröde et al 2013, Buzan et al 2015. It is therefore important to keep in mind that heat stress indicators have their own specifications, in terms of suitable climate, as well as targeted heat stress (e.g., indoor versus outdoor; comfort versus work protection).
Relating climate change to human health is no a simple matter. There are many uncertainties in projection of heat stress related human mortality and morbidity. The uncertainties come from the imperfection of climate models, the choice of heat stress indicators, climate change scenarios and human physiological responses, etc. The present study focuses on the first two sources. In the chosen heat stress indicators, we neglected other atmospheric stressors, such as wind and sunshine; the latter is a non-negligible component of heat stress, especially in regions with arid climates. Due to the coarse spatial resolution in GCMs, we also neglected the 'urban heat island effect', because the land use tile in a GCM presents a mean state averaged over grid area (ca 250 km × 250 km). The 'urban heat island effect' can increase the likelihood of complications from human heat stress in cities (Oleson 2012, Tan et al 2009. We only examined the changes in heat stress under RCP8.5, but the changes can be significantly different under other climate change scenarios. In particular, the study of Fischer and Knutti (2013) compared two climate change scenarios, RCP4.5 and RCP8.5, and found that scenario uncertainty becomes the dominant source of uncertainty by the end of 21st century exceeding the GCM uncertainty. The selected indicators and associated thresholds for heat stress categories are built on different assumptions and as such have limitations in terms of quantifying universally applicable levels of heat stress. Thus, the estimate of heat stress at present day and the future should be considered with caution. It is better to combine the heat stress estimation based on absolute thresholds with the information on local heat stress percentile. Nevertheless, our study clearly demonstrates that it is crucial to use multi-climate models to study the response of heat stress to climate change, which has been overlooked by previous heat stress studies.

Acknowledgments
The first author was supported by the LABEX L-IPSL, funded by the French Agence Nationale de la Recherche, under the program 'Investissements d'Avenir' (grant ANR-10-LABX-18-01). To access the CMIP5 data, the work benefited from the IPSL Prodiguer-Ciclad facility, which is also supported by the LABEX L-IPSL. We acknowledge the World Climate Research Program's Working Group on Coupled Modeling, which is responsible for CMIP, and we thank the climate modeling groups (listed in table 2) for producing and making available their model output. For CMIP, the US Department of Energy's Program for Climate Model Diagnosis and Intercomparison provides coordinating support and led the development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. Two anonymous reviewers are thanked for their constructive comments on an earlier version of the manuscript.