Multi-annual predictions of the frequency and intensity of daily temperature and precipitation extremes

The occurrence of extreme climate events in the coming years is modulated by both global warming and internal climate variability. Anticipating changes in frequency and intensity of such events in advance may help minimize the impact on climate-vulnerable sectors and society. Decadal climate predictions have been developed as a source of climate information relevant for decision-making at multi-annual timescales. We evaluate the multi-model forecast quality of the CMIP6 decadal hindcasts in predicting a set of indices measuring different characteristics of temperature and precipitation extremes for the forecast years 1–5. The multi-model ensemble skillfully predicts the temperature extremes over most land regions, while the skill is more limited for precipitation extremes. We further compare the prediction skill for these extreme indices to the skill for mean temperature and precipitation, finding that the extreme indices are predicted with lower skill, particularly those representing the most extreme days. We find only small and region-dependent improvements from model initialization in comparison to historical forcing simulations. This systematic evaluation of decadal hindcasts is essential when providing a climate service based on decadal predictions so that the user is informed on the trustworthiness of the forecasts for each specific region and extreme event.


Introduction
Characteristics of climate extremes are changing in a warming climate, with hot temperature and heavy precipitation extremes in particular becoming more intense and frequent, thus increasing their potential impact on nature, the economy and society (Seneviratne et al 2021). Besides, internal climate variability also modulates the occurrence of extreme events (Alexander et al 2009). Trustworthy predictions are essential to develop strategic planning to adapt, build resilience to the risk associated with extreme events and anticipate the impacts ahead of time (Curtis et al 2017, Sillmann et al 2017, Kushnir et al 2019). Predictions of changes in the frequency and intensity of extreme events may be more relevant to users than predictions of average variables, as the occurrence of extremes typically breaks the resilience of a system and causes the heaviest impacts on society and the environment (Mahlstein et al 2015, Bhend et al 2017). Predictions of variations in the frequency and intensity of extremes in the forthcoming years can potentially be provided by decadal climate predictions. In addition to long-term changes due to external forcings (natural and anthropogenic), decadal predictions aim to also capture the internal variability of the climate system (slow, natural oscillations). For this reason, climate models are initialized with observation-based products (Meehl et al 2009, 2021). Decadal predictions have been shown to skillfully predict essential climate variables such as near-surface air temperature and, to a lesser extent, precipitation in many regions of the world (Smith et al 2019, Delgado-Torres et al 2022). However, the predictability of mean quantities and extreme events might differ (Liu et al 2019).
Only a few studies have previously evaluated temperature and precipitation extremes in decadal predictions. Eade et al (2012) found generally high skill for temperature extremes, and limited skill over parts of North America for precipitation extremes, with the Met Office Decadal Prediction System (DePreSys) (Smith et al 2007, 2010). Also, they found a slight skill improvement due to initialization beyond the first year, pointing to external forcings as the primary source of skill. Besides, they found slightly lower skill for extreme predictions than for mean quantities, except over those regions with a greater trend in extremes than in the mean variables. Significant skill for multi-year predictions of summer temperature extremes with DePreSys over Europe has also been reported, while the skill for winter extremes was lower, and the skill in predicting temperature extremes was again lower than for mean quantities. Decadal predictions from four models contributing to the Coupled Model Intercomparison Project Phase 5 (CMIP5) (Taylor et al 2012) have been evaluated for European summer extremes, showing higher skill than predictions based on the observed climatology; however, the skill did not improve with initialization, except for one of the models. DePreSys has also been shown to outperform climatology and persistence forecasts for a set of daily temperature extreme indices over parts of Europe during summer for 10-year predictions (Hanlon et al 2015). Nevertheless, the model initialization did not improve the forecast quality. High skill for European temperature extremes and lower skill for precipitation extremes was also found with the Mittelfristige Klimaprognosen (German for 'midterm climate forecast'; MiKlip) system (Moemken et al 2021).
However, to the best of our knowledge, all previous studies evaluating the decadal forecast quality for climate extremes were based on single forecast systems or limited regions. We perform a forecast quality assessment of multi-model decadal predictions of annual and seasonal extreme temperature and precipitation indices with all available hindcasts contributed to the Decadal Climate Prediction Project Component A (DCPP-A) (Boer et al 2016) of the Coupled Model Intercomparison Project Phase 6 (CMIP6) (Eyring et al 2016). The evaluation is performed globally for predictions of the next five years. The skill for extreme predictions is compared to that for mean temperature and precipitation variations, and the impact of model initialization is assessed by comparing the skill of the decadal predictions and historical forcing simulations.

Data
Daily minimum and maximum near-surface air temperature and precipitation have been used to compute the extreme indices. Besides, monthly means of near-surface air temperature (TAS) and precipitation (PR) have been used to compare the prediction skill for extreme indices and mean variables.
All the available CMIP6 decadal hindcasts (DCPP) have been used. Besides, the CMIP6 historical simulations (HIST) performed with the same forecast systems as DCPP have been used to estimate the impact of the model initialization on the prediction skill. The number of DCPP and HIST members provided by each forecast system and their basic information can be seen in table S1. The DCPP and HIST multi-model ensembles consist of 133 and 134 members, respectively.
Two different gridded observation-based datasets per variable have been used as the reference for the evaluation (table S2) to account for the observational uncertainty (Sillmann et al 2013). The Berkeley Earth Surface Temperatures (BEST) and Rainfall Estimates on a Gridded Network (REGEN) datasets provide gridded fields of daily minimum and maximum temperatures and daily PR, respectively. These have been used to calculate the extreme indices and are used as observational references. The HadEX3 (Dunn et al 2020) dataset (which provides gridded extreme indices calculated for each station and then interpolated onto global grids) has been used as an additional reference dataset. The results obtained with BEST and REGEN are shown in the main text, while those obtained with HadEX3 are shown in the Supplementary Material. The Global Historical Climatology Network version 4 (GHCNv4) (Menne et al 2018) and Global Precipitation Climatology Centre (GPCC) datasets have been used as references for TAS and PR, respectively.

Methods
The Expert Team on Climate Change Detection and Indices (ETCCDI) defined a set of extreme climate indices to detect, characterize and monitor changes in the frequency and severity of extreme events, such as heat waves, cold spells, floods and droughts (Zhang et al 2011). We have selected six extreme indices:
• TN10p: seasonal or annual percentage of days when minimum temperature is below the 10th daily percentile.
• TNn: seasonal or annual minimum of daily minimum temperature.
• TX90p: seasonal or annual percentage of days when maximum temperature is above the 90th daily percentile.
• TXx: seasonal or annual maximum of daily maximum temperature.
• R95p: annual sum of precipitation on days when daily PR exceeds the 95th percentile of daily precipitation.
• Rx5day: seasonal or annual maximum 5-day consecutive precipitation.
Thus, for each variable, we evaluate a measure of relatively moderate extremes, which occur on average several times per year or season, and a measure representing the most intense event of the year or season. The R-based software package we use to compute the ETCCDI indices (climdex.pcic) (Bronaugh 2020) provides the TN10p, TNn, TX90p, TXx, and Rx5day indices at seasonal and annual frequencies, while it only provides the R95p index at annual aggregation. Therefore, all six indices have been evaluated at annual frequency. Besides, the boreal winter (DJF; December-January-February) and boreal summer (JJA; June-July-August) indices that can also be computed at seasonal frequency have been included in the analysis. Similarly, annual and seasonal averages of TAS and PR have been analyzed.
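As an illustration, the two families of indices (percentile-based frequency measures and absolute intensity measures) can be sketched in a few lines of Python on synthetic data. This is only a sketch: the study itself computes the indices with the R package climdex.pcic, and this version simplifies the ETCCDI definition by using a single pooled baseline percentile rather than day-of-year calendar percentiles with the in-base bootstrap.

```python
import numpy as np

def tx90p(tmax, base):
    """Percentage of days with daily maximum temperature above the 90th
    percentile of a baseline sample (simplified: one pooled percentile
    instead of the ETCCDI day-of-year calendar percentiles)."""
    thresh = np.percentile(base, 90)
    return 100.0 * np.mean(tmax > thresh)

def rx5day(pr):
    """Maximum 5-day consecutive precipitation total."""
    window = np.convolve(pr, np.ones(5), mode="valid")  # running 5-day sums
    return window.max()

# synthetic data for illustration only
rng = np.random.default_rng(0)
base = rng.normal(25, 5, size=365 * 30)   # toy 30-year tmax baseline
year = rng.normal(26, 5, size=365)        # a slightly warmer target year
pr = rng.gamma(0.5, 4.0, size=365)        # toy daily precipitation

print(round(tx90p(year, base), 1))  # % of days above the baseline 90th pct
print(round(rx5day(pr), 1))         # wettest 5-day total, same units as pr
```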
The extreme indices have been evaluated globally over the 1961-2014 period, using 1981-2010 as the baseline period for the percentile-based indices. The same reference period has been used to compute the climatology and the thresholds between the tercile probabilistic categories. For DCPP, the average of forecast years 1 to 5 has been evaluated; thus, start dates 1960-2009 have been used. For the forecast systems not initialized in January (see table S1), the first forecast months have been discarded to restrict the analysis to calendar years (i.e. from January to December). A 5-year running mean has been applied to HIST and the reference datasets to make a consistent comparison with the forecast period of DCPP. The multi-model ensembles have been built by pooling all members together. The extreme indices have been computed on the native grid of each dataset (see tables S1 and S2) to avoid smoothing the extreme values when interpolating daily fields. For the evaluation, the mean variables and extreme indices have been interpolated to a common 2.8° × 2.8° grid using the conservative interpolation method. Both deterministic and probabilistic predictions have been evaluated. The deterministic forecasts are based on the multi-model ensemble mean, while the probabilistic forecasts are based on the percentage of ensemble members that fall into each tercile category (below lower tercile, near average, and above upper tercile conditions). The tercile categories have been estimated based on the 33.33% and 66.67% thresholds of the corresponding probability density functions.
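The tercile-based probabilistic forecast can be sketched as follows, assuming toy inputs: thresholds are taken from a reference-period climatology, and the forecast probability of each category is the fraction of pooled ensemble members falling into it. The function name and data are illustrative, not from the study.

```python
import numpy as np

def tercile_probs(members, clim):
    """Probabilistic forecast for one start date: the fraction of pooled
    ensemble members below the lower tercile, between the terciles, and
    above the upper tercile of a reference climatology."""
    lo, hi = np.percentile(clim, [100.0 / 3.0, 200.0 / 3.0])
    members = np.asarray(members, float)
    p_below = np.mean(members < lo)
    p_above = np.mean(members > hi)
    return np.array([p_below, 1.0 - p_below - p_above, p_above])

clim = np.arange(30.0)                                # toy 30-year climatology
ens = np.array([25.0, 27.0, 28.0, 29.0, 12.0, 22.0])  # toy 6-member ensemble
print(tercile_probs(ens, clim))  # below / near-average / above fractions
```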
The Spearman's anomaly correlation coefficient (ACC) (Wilks 2011), which estimates the monotonic relationship between the observed and predicted time series, has been used to evaluate the deterministic forecasts. Spearman's correlation has been chosen to avoid assuming that the data are normally distributed. We have also tested the sensitivity to using Pearson's correlation coefficient instead of Spearman's, finding generally very similar correlation values. The ACC ranges between −1 (worst forecast) and 1 (perfect forecast). The residual correlation (Smith et al 2019) has been used to assess whether DCPP predicts any of the observed variability that is not already captured by the HIST forced signal, and also ranges from −1 to 1. The residual correlation is computed as follows: the residuals of both the DCPP ensemble mean and the reference dataset are calculated by linearly regressing out the HIST ensemble mean from each of them; the residual correlation is then the correlation between both residual series. Positive (negative) values indicate that DCPP predicts the observed variability better (worse) than HIST.
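The residual-correlation recipe described above can be sketched in Python. The toy forced-trend/oscillation series, the function names, and the hand-rolled rank-based Spearman correlation (which omits tie handling, fine for continuous data) are illustrative constructions, not the study's code.

```python
import numpy as np

def _ranks(x):
    # rank transform; no tie handling (adequate for continuous data)
    return np.argsort(np.argsort(x))

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return np.corrcoef(_ranks(a), _ranks(b))[0, 1]

def residual_correlation(dcpp, hist, obs):
    """Linearly regress the HIST ensemble mean out of both the DCPP
    ensemble mean and the observations, then correlate the residuals."""
    def residuals(y, x):
        slope, intercept = np.polyfit(x, y, 1)  # least-squares fit y ~ x
        return y - (slope * x + intercept)
    return spearman(residuals(dcpp, hist), residuals(obs, hist))

# toy example: a forced trend (captured by HIST) plus an internal
# oscillation that only the initialized prediction tracks
t = np.arange(50.0)
forced = 0.02 * t
osc = np.sin(2 * np.pi * t / 20.0)
hist = forced                   # uninitialized: forced signal only
obs = forced + osc
dcpp = forced + 0.8 * osc       # initialized: partly captures the oscillation
print(round(residual_correlation(dcpp, hist, obs), 2))  # → 1.0
```

Because the residuals of `dcpp` and `obs` are proportional once the forced signal is regressed out, the residual correlation is 1 here; an uninitialized-like forecast with the oscillation inverted would give −1.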
The quality of the probabilistic predictions has been evaluated with the ranked probability skill score (RPSS) (Wilks 2011), which measures the quality of a forecast in comparison with a reference forecast, and ranges between minus infinity and 1. Negative values indicate that the reference forecast is more skillful than the forecast, while positive values mean the opposite. The DCPP forecasts have been compared to the climatological forecast (defined as the equiprobable forecast, with a probability of 33.33% for each tercile category) and to the HIST.
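The RPSS computation for tercile forecasts can be sketched as below, assuming the standard ranked probability score on cumulative category probabilities; the function names and toy forecasts are illustrative.

```python
import numpy as np

def rps(probs, obs_cat):
    """Mean ranked probability score for tercile forecasts.
    probs: (n_forecasts, 3) forecast probabilities per category;
    obs_cat: observed category index (0, 1 or 2) per forecast."""
    probs = np.asarray(probs, float)
    obs_cat = np.asarray(obs_cat)
    obs = np.zeros_like(probs)
    obs[np.arange(len(obs_cat)), obs_cat] = 1.0   # one-hot observed category
    cum_f, cum_o = np.cumsum(probs, axis=1), np.cumsum(obs, axis=1)
    return float(np.mean(np.sum((cum_f - cum_o) ** 2, axis=1)))

def rpss(probs, obs_cat, ref_probs=None):
    """RPSS of a forecast against a reference forecast; by default the
    climatological (equiprobable, 1/3 per tercile) forecast."""
    if ref_probs is None:
        ref_probs = np.full((len(obs_cat), 3), 1.0 / 3.0)
    return 1.0 - rps(probs, obs_cat) / rps(ref_probs, obs_cat)

# a sharp, always-correct forecast scores the maximum RPSS of 1
perfect = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
print(rpss(perfect, [0, 2]))  # → 1.0
```

An RPSS of 0 means the forecast is no better than the reference, matching the interpretation in the text.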
The statistical significance of the ACC has been estimated with a one-sided t-test (Wilks 2011) accounting for the time series auto-correlation following Zwiers and von Storch (1995), with the null hypothesis that the ACC is not positive. We use a one-sided test as only positive correlation values carry useful predictive information. The same test but two-sided (in order to identify both potential improvements and deteriorations from initialization) has been applied for the statistical significance of the residual correlation, with the null hypothesis that the residual correlation equals zero. The statistical significance of the RPSS using the climatological forecast as the reference forecast has been estimated with a one-sided Random Walk test (Delsole and Tippett 2016), with the null hypothesis that DCPP has less than or equal to 50% probability of being more skillful than the climatological forecast. The same test but two-sided has been applied for the statistical significance of the RPSS using HIST as the reference forecast, with the null hypothesis that DCPP has a probability different from 50% of being more or less skillful than HIST. We account for multiple testing by controlling the false discovery rate (FDR) with αFDR = 0.1.
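The autocorrelation adjustment can be illustrated with a simplified effective-sample-size correction. This is a sketch of the general idea rather than the exact Zwiers and von Storch (1995) procedure: it deflates the sample size by the product of the two series' lag-1 autocorrelations and uses a normal approximation for the one-sided p-value.

```python
import numpy as np
from math import erfc, sqrt

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a time series."""
    a = np.asarray(x, float) - np.mean(x)
    return float(np.sum(a[:-1] * a[1:]) / np.sum(a * a))

def effective_n(x, y):
    """Sample size deflated for serial correlation in both series, via
    the product of their lag-1 autocorrelations (a common simplification
    of the full adjustment)."""
    r1 = max(lag1_autocorr(x) * lag1_autocorr(y), 0.0)
    return len(x) * (1.0 - r1) / (1.0 + r1)

def one_sided_pvalue(r, neff):
    """Approximate one-sided p-value for H0: correlation <= 0, using a
    normal approximation to the t statistic (reasonable for large neff)."""
    t = r * sqrt((neff - 2.0) / (1.0 - r * r))
    return 0.5 * erfc(t / sqrt(2.0))

# 5-year running means are strongly autocorrelated, so a 54-year record
# carries far fewer than 54 independent samples
x = np.arange(54.0)                 # a trending (highly persistent) toy series
print(round(effective_n(x, x), 1))  # well below 54
print(one_sided_pvalue(0.0, 30.0))  # → 0.5
```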

Multi-model skill for annual extreme indices
The DCPP shows high and significant skill in predicting TAS over most of the globe (figure 1(a)), with 99.5% of the global land area being statistically significant. The high and significant skill of the DCPP multi-model is consistent with the results for mean temperature and PR reported by Smith et al (2019) and Delgado-Torres et al (2022). For extreme temperature, the skill is generally lower, and significant skill is found over smaller areas of the globe than for TAS, especially for the indices representing the most extreme day of the year (TNn and TXx; figures 1(f) and (g), respectively). The lower skill found for predictions of extremes than for TAS is consistent with Eade et al (2012). However, they found higher skill for extremes than for TAS in some cases. Specifically, they showed that DePreSys presents higher skill for extremes in regions where the trends in extremes are larger than those for TAS (e.g. rainfall over Europe, hot extremes over northern Eurasia, and cold extremes over the USA). With the CMIP6 multi-model, we find the opposite for hot extremes. For example, predictions of hot extremes over northern Eurasia (e.g. TX90p; figure 1(d)) show fewer grid points with significant skill than for TAS (figure 1(a)). For cold extremes (TN10p; figure 1(c)), the results are consistent with those reported by Eade et al (2012) over the USA, with some areas showing higher skill for TN10p than for TAS. Liu et al (2019) found a potentially higher predictability for moderate temperature extremes than for TAS in a perfect-model experiment. However, the CMIP6 multi-model ensemble, initialized with observation-based initial conditions, shows generally higher skill for TAS than for moderate extremes.
Still, the skill for hot and cold extremes is high and significant over many regions, particularly over some areas of North America, North Africa, and parts of Eurasia and Australia. The comparison between hot and cold extremes shows generally higher skill, and larger areas with significant skill, for the minimum temperature extremes (TN10p and TNn; figures 1(c) and (f), respectively) than for those based on maximum temperature (TX90p and TXx; figures 1(d) and (g), respectively).
The skill for PR is more limited than for TAS, with 7.4% of the global region being significant (figure 1(b)). The regions in which DCPP is skillful in predicting PR are parts of Northern Africa, Northern Europe, Australia and Eurasia. As for temperature, the predictions of PR extremes are generally less skillful than those for PR (figures 1(e) and (h)). Besides, the skill maps for PR extremes are noisier, and positive significance is found over very limited regions.
The higher skill for extreme temperature predictions compared to extreme PR is consistent with previous studies evaluating predictions of the mean variables (Smith et al 2019, Delgado-Torres et al 2022), and can have several causes. The primary source of prediction skill for TAS is the response to forcings, which causes a trend that climate models capture relatively well (Smith et al 2019). On the contrary, PR variability depends more on atmospheric circulation, which is often less well simulated by climate models (van Ulden and van Oldenborgh 2006). In addition, PR is more strongly affected by the signal-to-noise issue in climate models (Scaife and Smith 2018, Smith et al 2019), which makes PR a more challenging variable to predict. Furthermore, the low resolution of climate models does not allow small-scale processes, such as convection, to be resolved (Merryfield et al 2020). Therefore, a parametrization is needed, which also contributes as an error source. However, PR predictions are still skillful over some regions such as Europe and Sahelian Africa, and the Atlantic multidecadal variability may be the source of predictability for PR over these regions (Doblas-Reyes et al 2013).
The evaluation of probabilistic forecasts (figure S1) shows a benefit from using decadal predictions instead of climatology since positive RPSS values are found over most regions for TAS and temperature extremes. Low RPSS values are found over most of the globe for PR and PR extremes, indicating small or no benefit from using DCPP instead of the climatological forecast.
The results obtained with HadEX3 (figures S2 and S3) are generally consistent with the previously reported results, with similar skill values and significance for most indices and regions. The most noticeable exception is TN10p over South America and TX90p over Asia, for which the skill is higher and more positive significance is found when using HadEX3 instead of BEST as reference dataset. It should be noted that not all the regions can be compared due to the limited global coverage of HadEX3.

Impact of initialization for annual extreme indices
The residual correlations obtained for TAS show significant added skill from initialization over regions of Central America, North Africa, Eurasia and Southern Australia (figure 2(a)). Instead, DCPP shows significantly reduced skill for TAS predictions over parts of Eurasia and Canada. Still, the fraction of the global area with residual correlations showing positive significance is higher than that with negative significance (23.6% and 4.3%, respectively). For PR, the significant added skill is lower and mainly restricted to Central Africa (figure 2(b)). Nevertheless, there are more regions showing positive than negative significance (4.7% and 1.4%, respectively). These results are in line with those reported by Smith et al (2019) and Delgado-Torres et al (2022) for TAS and PR. For temperature and PR extremes, the patterns of the impact of initialization that we find differ from those found with DePreSys (Eade et al 2012). The differences may be caused by the use of a multi-model ensemble instead of a single model, the different generations of prediction systems, the metric used to assess the impact of initialization, and the different forcings (which have been shown to provide more skill in CMIP6 than in CMIP5 (Borchert et al 2021)).
The initialization aims at phasing the simulations with the observed climate state. However, model initialization is a non-trivial procedure with various issues (Merryfield et al 2020) that can reduce the skill compared to HIST. For example, initialization shocks due to inconsistencies between the initial conditions and the model climatology can degrade the skill (Kröger et al 2018, Bilbao et al 2021). Also, the initial conditions are based on observation-based products, which may not be of sufficient quality, especially over regions with poor observational coverage. Besides, the drivers differ between variables and extreme indices, thus contributing to the region-dependent impact of initialization. The maps of residual correlation for temperature extremes show a different pattern compared to that for TAS, and there is generally lower added skill for the extreme predictions. Still, some regions show a positive impact of initialization, and some of them are shared between TAS and extreme temperature (e.g. the southern part of Australia). For example, minimum temperature extremes (TN10p and TNn; figures 2(c) and (f), respectively) show significantly positive values over parts of South America, Africa, and Australia. However, the TN10p index shows a larger fraction of the global region with negative than positive significance, meaning that DCPP captures less variability than HIST. The different patterns of residual correlations for mean and extreme temperature may be due to annual averages being aggregated over the whole year, thus sampling multiple different weather conditions, whereas extremes represent a small number of days when specific weather situations and drivers may occur and play a role, therefore showing a different pattern for the impact of initialization compared to that for the mean.
Also, the temporal aggregation to create annual means smooths the time series, removing the high-frequency temporal variations and thus some noise, making them more predictable than extreme indices based on daily data.

[Figure 2 caption: The reference datasets used for the mean near-surface air temperature and PR are the GHCNv4 and GPCC datasets, respectively. The reference datasets used for the temperature and PR indices are the BEST and REGEN datasets, respectively. Grey colors over land regions correspond to grid points with missing values in the reference dataset. Crosses indicate values that are not statistically significant using a two-sided t-test accounting for autocorrelation and controlling the FDR with αFDR = 0.1.]
As for PR, the maps of residual correlation for PR extremes are noisy, and the regions with positive significance are mainly restricted over parts of western Africa and South America (figures 2(e) and (h)).
The maps of RPSS show no added skill from initialization for the probabilistic forecasts, with RPSS values being negative or close to zero (figure S4). Besides, the fraction of the global region with significantly negative values is higher than for positive significance. The results obtained with HadEX3 (figures S5 and S6) show that, for minimum temperature extremes, most regions with significant improvement due to initialization coincide among both reference datasets. The results differ more for maximum temperature extremes, particularly for the TX90p index over Asia (higher added skill when using HadEX3 as reference) and TXx index over America (higher added skill when using BEST as reference).

Multi-model skill for seasonal extreme indices
In addition to the annual extremes, skillful predictions of seasonal extremes are highly important as their impacts on climate-dependent sectors, nature, and society may be more relevant in particular seasons of the year.
As for annual TAS, the DCPP multi-model ensemble shows high skill in predicting TAS in both DJF and JJA, with 92.6% and 97.1% of the global land area being statistically significant, respectively (figures 3(a) and (b)). This fraction is slightly lower than for annual means (99.5%; figure 1(a)). For temperature extremes, the skill patterns differ between seasons. Overall, there is higher skill, and larger areas show statistically significant skill, for temperature extremes during JJA than during DJF (figures 3(c)-(j)). For instance, no significance is found around the Mediterranean region for DJF extreme predictions, while most grid points show statistical significance during JJA. These results are consistent with previous findings of higher-quality predictions in the Mediterranean region for JJA than for DJF. However, there are also regions where some indices are predicted more skillfully during DJF than during JJA (for example, TN10p over Central Canada and TNn over some parts of South America). Besides, the skill is generally higher for the (more moderate) percentile-based extremes than for the (more intense) absolute extremes, consistent with Hanlon et al (2015).
The skill patterns also differ between seasons for PR. In DJF, positive skill is restricted to Northern Europe and some parts of Central and Eastern Asia, although without significance (0.5% of the global fraction; figure 3(k)). In JJA, the regions showing statistical significance are mainly located over South America and Northern Africa, representing 6.7% of the global region (figure 3(m)). The skill for extreme PR (figures 3(l) and (n)) shows overall similar patterns to those for PR (figures 3(k) and (m)).

[Figure 3 caption: The reference datasets used for the mean near-surface air temperature and PR are the GHCNv4 and GPCC datasets, respectively. The reference datasets used for the temperature and PR indices are the BEST and REGEN datasets, respectively. Grey colors over land regions correspond to grid points with missing values in the reference dataset. Crosses indicate values that are not statistically significant using a one-sided t-test accounting for autocorrelation and controlling the FDR with αFDR = 0.1.]
The seasonality of the skill may be caused by several factors. For instance, summer PR over Central Europe is more convective (Llasat et al 2021), which may decrease the prediction skill because the limited spatial resolution of climate models does not allow for resolving the small-scale processes (Merryfield et al 2020). Also, the large-scale drivers and their local response differ across seasons (Müller et al 2012), thus affecting the prediction skill. For example, Palmer et al (2008) showed that the skill of seasonal forecasts for winter European heat waves is reduced due to model deficiencies in representing blocking systems. Besides, the spatial distribution of trends also differs (Lee et al 2021), as well as the signal-to-noise ratio (Schubert et al 2009), which may also contribute to the seasonality of the prediction skill.
The probabilistic evaluation shows a general benefit from using DCPP with respect to the climatological forecast for mean and extreme temperature, as the RPSS is positive and significant over large regions of the globe (figure S7). For PR, although lower RPSS values are found, there are regions where DCPP outperform the climatological forecast. For instance, significantly positive RPSS values are found over several regions of Eurasia for mean and extreme PR during DJF, and over Central Africa during JJA. The results are generally consistent with those obtained with HadEX3 (figures S8 and S9). The highest discrepancies are found for some extreme temperature indices, especially over the Americas.

Impact of initialization for seasonal extreme indices
The impact of initialization for seasonal TAS shows different patterns during DJF and JJA (figures 4(a) and (b), respectively). For DJF temperature predictions, the greatest added skill is found in Africa, particularly over some Central and Northern regions. Other regions like Southern Australia, Southern Asia and some parts of the Americas also significantly benefit from initialization. Still, negative residual correlation is found over regions like northern Eurasia. During JJA, significant added skill is mainly found over Central Asia, Northeastern Africa, and some parts of the Americas, Australia and Europe. In contrast, a significant skill decrease is found over large regions of North America, Africa and Asia.
The regions where a significant impact of initialization is found differ between the extreme temperature indices (figures 4(c)-(j)) and also differ from those obtained for TAS. The fraction of the globe showing significantly added skill is higher than the fraction showing a worsening for all the indices considered. Besides, the fraction of the global region showing added skill from initialization is higher for the seasonal indices than for the annual indices. Still, there is smaller added skill for all indices than for TAS. A previous evaluation of the CMIP5 multi-model ensemble found that initialization did not improve the skill for European summer extremes; however, added skill was found when assessing the MPI-ESM1.2-LR model individually. This points to the importance of also assessing each forecast system individually, as the multi-model ensemble does not necessarily outperform all single forecast systems (Mishra et al 2018, Delgado-Torres et al 2022). For PR, the patterns also differ between seasons (figures 4(k) and (m)), with the JJA pattern being more similar to that for annual means (figure 2(b)). Besides, the patterns for the seasonal extreme indices (figures 4(l) and (n)) are similar to those for the seasonal means.

[Figure 4 caption: The reference datasets used for the mean near-surface air temperature and PR are the GHCNv4 and GPCC datasets, respectively. The reference datasets used for the temperature and PR indices are the BEST and REGEN datasets, respectively. Grey colors over land regions correspond to grid points with missing values in the reference dataset. Crosses indicate values that are not statistically significant using a two-sided t-test accounting for autocorrelation and controlling the FDR with αFDR = 0.1.]
There is no or limited added skill from initialization for the probabilistic forecasts, as the RPSS values are close to zero for the seasonal means and extreme indices of both temperature and PR (figure S10). The results obtained using HadEX3 are similar (figures S11 and S12).

Summary and conclusion
We have evaluated some aspects of the forecast quality of the CMIP6 decadal forecast systems in predicting TAS, PR, and a set of extreme indices based on daily minimum and maximum temperatures and PR for predictions of the next five years, considering annual, DJF and JJA extremes. The prediction skill of deterministic and probabilistic forecasts has been estimated and compared to that of the HIST to assess the impact of model initialization.
The DCPP multi-model shows high skill in predicting the mean and extreme TAS indices computed at annual frequency over most of the globe. The skill is lower and limited to some regions for mean and extreme PR. There is generally higher skill in predicting the mean variables than the extreme indices. The skill for both extreme TAS and PR is higher for the moderate extremes (TN10p, TX90p and R95p; related to frequency) than for the most intense extremes (TNn, TXx and Rx5day; related to intensity). The comparison between DCPP and HIST shows a region-dependent impact of initialization on the skill. The added skill due to initialization is higher for the mean variables than for the extreme indices. Besides, such skill differences vary between indices, especially those representing extreme temperature.
For seasonal means and seasonal extreme indices, the DCPP multi-model also shows a generally high skill for mean and extreme TAS, especially during JJA. Similar to annual indices, the skill is higher for the moderate extreme indices than for the most intense extremes for both DJF and JJA. The skill is also more limited for mean and extreme PR than for temperature on seasonal time scales. Still, there is high skill during DJF over Northern Europe and some regions of Asia, and during JJA over some regions of South America, Central Africa and Northeastern Asia. The residual correlations show different patterns for seasonal mean variables and extreme indices compared to those for annual frequency. For temperature, a lower added skill is found for extreme than for mean temperature. In addition, the added skill for extreme temperature depends on the region and season. For PR, the impact of initialization is similar for extreme and mean PR. Comparing seasons, the added skill is more similar between JJA and annual, while it differs more for DJF.
In conclusion, we find that the CMIP6 decadal forecast systems can skillfully predict characteristics of climate extremes, in particular extremely hot and cold temperatures. While the prediction skill for the extreme indices is mostly lower than for the annual or seasonal means, these forecast systems still provide useful predictions for the more impact-relevant aspects of climate. However, to exploit the full potential of decadal predictions, user-oriented indicators could be explored to facilitate their applicability in climate-sensitive sectors, which might be based on variables other than temperature and precipitation, such as wind speed and solar radiation. Besides, an analogous forecast quality assessment should be performed for the particular region, index and forecast period of each user-specific need (Hanlon et al 2015). This systematic evaluation of decadal hindcasts is essential when providing a climate service based on decadal predictions so that the user is informed on the trustworthiness of the forecasts. Also, comparing decadal hindcasts and historical simulations might help climate service providers to select the highest-quality climate information for each particular case.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://esgf-node.llnl.gov/search/cmip6/.