Uncertainties in the observational reference: Implications in skill assessment and model ranking of seasonal predictions

The probabilistic skill of seasonal prediction systems is often inferred using reanalysis data, assuming these benchmark observations to be error free. However, an increasing number of studies report non-negligible levels of uncertainty affecting reanalysis observations, especially when it comes to variables like precipitation or wind speed. We consider different possibilities to account for such error in forecast quality assessment, either exploiting the newly produced ensemble reanalyses (e.g., European Centre for Medium-Range Weather Forecasts Reanalysis version 5-Ensemble of Data Assimilations, ERA5-EDA) or applying methodologies that use scores that take observational uncertainty into account. We illustrate the benefits of employing ensemble reanalyses over traditional reanalyses and show how the true skill can be approximated, whatever the observational reference. We ultimately emphasise the perils and quantify the error committed when the observational reference, either reanalysis or point dataset, is selected arbitrarily for verifying a seasonal prediction system.


INTRODUCTION
Climate predictions on seasonal time-scales are becoming increasingly skilful, mostly owing to recent advances in the understanding of climate processes and in their modelling (Merryfield et al., 2020). Still, the quality of such predictions is far from that of weather forecasts, whose higher quality allows timely early warnings to be issued for extreme events and thus reduces human and material losses. Though some high-impact climate events can also be correctly anticipated by seasonal predictions, there is a general reluctance to make decisions based upon them.
Generally, seasonal predictions for essential climate variables (e.g., temperature, precipitation, or wind speed) are provided together with a measure of their quality, or skill, which is determined by comparison against climate observations. Past forecasts of a prediction system are statistically assessed to judge how good that system is, or which is the "best" system among a set of candidates. Besides quality, the goodness of a forecast can also be measured by its consistency and value (Murphy, 1993), but these two aspects are not the focus of this work.
In a typical skill assessment, reference observations are treated as the "ground truth", unquestionably exact. However, these observations may carry some degree of uncertainty, especially when they come from processed datasets such as reanalyses. Reanalysis products provide gridded observational data resulting from the combination of a numerical weather prediction model and the assimilation (ingestion) of past observations from several sources. In this regard, random uncertainties in the observations (e.g., measuring errors) mix with the uncertainty coming from the numerical model (e.g., parametrisations and systematic model errors) and with the uncertainty associated with the data assimilation method used, and all of them propagate into the reanalysis output values. Some researchers, however, argue that the uncertainty in the observational reference (whatever the type of data, not necessarily reanalysis) is small when compared with the uncertainty in the seasonal prediction (Saetra et al., 2004; Santos & Ghelli, 2012). This is true, in principle, for the furthest forecast horizons (say beyond 5 or 6 months), where skill levels typically decay significantly.
At shorter forecast times, current research shows that these effects are far from negligible. What is more, different quality outcomes can be obtained depending on which observational reference is utilised. Juricke et al. (2018) evaluated the observational uncertainty by assessing monthly and seasonal forecasts against two reanalyses with different horizontal resolutions (1° and 0.25°). The results revealed a strong dependence of the Brier skill score (BSS) and reliability on the reference data. Similarly, Sunyer et al. (2013) and Gómez-Navarro et al. (2012) ranked different regional climate models from best to worst using varied metrics, such as the mean bias or spatial correlation. They both conclude that, for most of the metrics analysed, the performance of a model is sensitive to the choice of the observational reference, and so are the resulting rankings. Reichler and Kim (2008) went beyond that and showed that the error in the reanalyses can exceed that of a climate model. They point out that the driving models and assimilation schemes in some of the currently used reanalyses were designed many years ago, so they can be considered outdated nowadays. For example, the National Centers for Environmental Prediction-National Center for Atmospheric Research Reanalysis 1 (R1) (Kalnay et al., 1996) is still widely used but was generated more than 20 years ago and has remained unaltered.
Apart from considering several data sources as a reference, the observational uncertainty can be inferred using other approaches. For example, Massonnet et al. (2016) tackle the problem from the opposite perspective by assessing different observational datasets against a set of climate models used as a reference. Others (Ben-Bouallègue, 2020; Candille & Talagrand, 2008) studied and recommended the so-called "perturbed-ensemble" approach, which randomly perturbs the predicted ensemble of a probabilistic prediction with noise meant to simulate the uncertainty in the observations. The magnitude and distribution of such noise can be based upon knowledge of the error statistics or arbitrarily assumed. Finally, other approaches consider the observation as a probability distribution (Candille & Talagrand, 2008) or even propose a method to rebuild the scoring rules so that they account for observation error (Ferro, 2017). Most of these approaches have already been tested either with synthetic experiments or by adding arbitrary errors to the models or observations.
Very recently, global reanalysis products have started to provide ensembles that allow estimates of the observational uncertainty to be inferred. But, to the best of our knowledge, there are no published verification analyses using these datasets as observational references. In this article, we first investigate whether these ensemble reanalyses provide any added value over traditional reanalysis products. Then, we qualitatively look at the impacts of observational uncertainty on the verification of seasonal predictions. Lastly, we quantify the error committed when observational uncertainty is not accounted for. Section 2 describes two approaches to consider the observational uncertainty in forecast verification. Sections 3 and 4 respectively list the datasets employed and present the methodology. Results are described in Section 5 and further discussed in Section 6. Finally, conclusions are drawn in Section 7.

SCORING RULES THAT TAKE INTO ACCOUNT OBSERVATIONAL UNCERTAINTY
In this section, we summarise two existing methodologies to account for observation error in skill assessment. These were proposed in Candille and Talagrand (2008) and Ferro (2017). The reader is referred to these two publications for complete and detailed information on the approaches. The present work includes a new extension of Ferro (2017).
The "observational-probability" method introduced in Candille and Talagrand (2008) considers the verifying observation y as a probability distribution g that is then compared against the probability distribution defined by the predicted ensemble f. Therefore, the observed probabilities are no longer restricted to 0 or 1, but take values within [0, 1]. Scoring rules such as the mean Brier score (BS) computed over a verification sample of size n remain defined in this method:

$$\mathrm{BS}_{\mathrm{CT08}} = \frac{1}{n}\sum_{t=1}^{n}\left(f_t - g_t\right)^2, \qquad (1)$$

where the CT08 subscript refers to Candille and Talagrand (2008), and which is algebraically identical to its original definition:

$$\mathrm{BS} = \frac{1}{n}\sum_{t=1}^{n}\left(f_t - y_t\right)^2. \qquad (2)$$

Nevertheless, some researchers, such as Ferro (2017), consider the Candille and Talagrand (2008) approach inappropriate, since the new scoring rules are no longer proper (see the definition of "proper scoring rule" in Jolliffe & Stephenson, 2012, Sect. 2.7), which implies that better forecasts of the truth are not always rewarded. Furthermore, Ferro (2017) argues that the scoring rules of Candille and Talagrand (2008) are biased under the observation model, meaning that the expected value of the score (the metric that determines the overall quality of a forecast) obtained with the Candille and Talagrand (2008) method is not, in general, the same as would be obtained if the true value were known. It is therefore desirable that the expected values of the skill scores remain unaffected by observation error. Ferro (2017) proposes a general method to construct scoring rules that are both proper and unbiased. These error-corrected scoring rules account for observation error in such a way that forecasters who issue better forecasts of the truth are favoured. In particular, for categorical predictands, the new BS is

$$\mathrm{BS}_{\mathrm{F17}} = \frac{1}{n}\sum_{t=1}^{n}\left[\left(f_t - y_t\right)^2 - \frac{r_{y_t}\left(1 - 2f_t\right)\left(1 - 2y_t\right)}{1 - r_0 - r_1}\right], \qquad (3)$$

where the subscript F17 refers to Ferro (2017) and y is the observed outcome that, albeit uncertain, takes only the values 0 and 1. The observation error is included through the so-called misclassification probabilities r_0 = Pr(y = 1 | x = 0) and r_1 = Pr(y = 0 | x = 1). The parameter r_{y_t} is set to r_0 or r_1 depending on the observed outcome y_t, and x refers to the true value of the observation, which is generally unknown. After performing this adjustment, Ferro (2017) notes that BS_F17 may turn out negative, which is mathematically meaningless. In such cases, Ferro (2017) recommends truncating the reported BS_F17 to zero, but warns that both propriety and unbiasedness are then lost. Equation (3) is obtained assuming that y and f are conditionally independent given x. However, there are cases for which this assumption does not hold. Such cases can arise when categorical predictands are created by dichotomising numerical predictands, a common way of delivering predictions at the seasonal time-scale.
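To make the three formulations concrete, the following minimal Python sketch computes the standard BS, BS_CT08, and BS_F17 for a small binary verification sample. The data and the values of r_0 and r_1 are illustrative and not taken from the study; the corrected score follows the reconstruction of Equation (3) given above.

```python
import numpy as np

def brier_score(f, y):
    """Standard mean Brier score, Equation (2): observations y are 0 or 1."""
    return np.mean((f - y) ** 2)

def brier_score_ct08(f, g):
    """Candille and Talagrand (2008), Equation (1): the observation is a
    probability g within [0, 1] instead of a binary outcome."""
    return np.mean((f - g) ** 2)

def brier_score_f17(f, y, r0, r1):
    """Ferro (2017) error-corrected Brier score, Equation (3), with
    misclassification probabilities r0 = Pr(y=1|x=0) and r1 = Pr(y=0|x=1)."""
    r_y = np.where(y == 1, r1, r0)                        # r_{y_t}
    correction = r_y * (1 - 2 * f) * (1 - 2 * y) / (1 - r0 - r1)
    return np.mean((f - y) ** 2 - correction)

# Illustrative verification sample of size n = 5
f = np.array([0.1, 0.4, 0.8, 0.2, 0.6])   # predicted probabilities
y = np.array([0, 1, 1, 0, 0])             # uncertain binary observations
g = np.array([0.0, 0.9, 1.0, 0.1, 0.3])   # observed probabilities (CT08)

print(brier_score(f, y))                          # standard BS
print(brier_score_ct08(f, g))                     # BS_CT08
print(brier_score_f17(f, y, r0=0.05, r1=0.20))    # BS_F17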
Seasonal predictions are most commonly probabilistic forecasts derived by thresholding, so that Equation (3) is no longer valid as it stands. Instead, a slight modification has to be introduced: the extended error-corrected BS needs to consider the dependencies r_0(f) and r_1(f):

$$\mathrm{BS}_{\mathrm{F17\text{-}Ext}} = \frac{1}{n}\sum_{t=1}^{n}\left[\left(f_t - y_t\right)^2 - \frac{r_{y_t}(f_t)\left(1 - 2f_t\right)\left(1 - 2y_t\right)}{1 - r_0(f_t) - r_1(f_t)}\right]. \qquad (4)$$

It should be noted that BS_F17-Ext is no longer proper, because the forecaster can determine which values of r_0 and r_1 are used in the scoring rule by changing their forecast f. The extended error-corrected BS, Equation (4), merely yields an unbiased estimate of the score that would be obtained if there were no observation error, provided that r_0(f) and r_1(f) accurately reflect how the misclassification probabilities vary with the forecasts issued.
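A sketch of the extension, under the same illustrative assumptions as above, simply looks up r_0 and r_1 in forecast-probability bins of width 0.1 before applying the correction. The bin-wise arrays r0_bins and r1_bins are hypothetical inputs here; in practice they would be estimated as described later in the methodology.

```python
import numpy as np

def brier_score_f17_ext(f, y, r0_bins, r1_bins, bin_width=0.1):
    """Extended error-corrected Brier score, Equation (4): r0 and r1 depend on
    the issued forecast probability through bins of width 0.1.
    r0_bins, r1_bins: arrays of length 11 giving r0(f) and r1(f) per bin."""
    idx = np.round(f / bin_width).astype(int)   # bin index of each forecast
    r0 = r0_bins[idx]
    r1 = r1_bins[idx]
    r_y = np.where(y == 1, r1, r0)
    correction = r_y * (1 - 2 * f) * (1 - 2 * y) / (1 - r0 - r1)
    return np.mean((f - y) ** 2 - correction)
```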

DATASETS
The hindcasts for surface wind speeds from five different operationally produced seasonal prediction systems have been used in this study (see details in Table 1). All five hindcasts have been retrieved from the Climate Data Store (https://cds.climate.copernicus.eu) data portal on a regular 1° grid. All five hindcasts will be referred to as predictions or forecasts throughout the text. Surface wind speeds from the five prediction systems have been derived from the zonal and meridional components at six-hourly resolution.
Then, monthly and seasonal averages have been prepared. Lead-zero predictions (i.e., those initialised at the beginning of the forecasted month or season) have been considered in this study. The fact that the nominal start date is set to the first of the month offers an advantage to the systems that use burst initialisation. This choice is determined by common practice (for instance, it is the one adopted by the Copernicus Climate Change Service), but we note that results would generally differ for other nominal start dates. Nine observational references providing surface wind speeds have been considered to evaluate the set of predictions at monthly and seasonal time-scales. Specific details can be found in Table 2. The set of datasets includes reanalyses, namely ECMWF Reanalysis-Interim (ERAI), Japanese 55-year Reanalysis (JRA55), ECMWF Reanalysis version 5-High Resolution (HRES), Modern-Era Retrospective analysis for Research and Applications, version 2 (MERRA-2), and National Centers for Environmental Prediction-National Center for Atmospheric Research R1 (see Ramon et al., 2019, for a review); a station-based dataset (the Hadley Centre Integrated Surface Database [HadISD]); a reanalysis with ensemble members (ERA5-Ensemble of Data Assimilations [EDA]); and a multi-reanalysis (MR) plus the multi-reanalysis mean (MR-MEAN). All datasets cover the 1981-2017 period.
HadISD version 2.0.2.2017f (Dunn et al., 2014) is a station-based dataset consisting of 8,103 stations spanning from 1931 to the end of 2017. A wide variety of quality-controlled climate variables is freely accessible at https://www.metoffice.gov.uk/hadobs/hadisd/. Zeng et al. (2019) filtered all these 8,103 stations and selected only those with uninterrupted, continuous monthly records during the 1978-2017 period, thus ensuring high quality for the monthly mean wind speeds. The process left a sample of 1,542 stations, located mainly inland but covering the whole globe (Figure 1).
The newly released reanalysis ERA5-EDA (or EDA for simplicity) contains 10 exchangeable members aiming to represent estimates of analysis and short-range forecast uncertainty at each grid point (Hersbach et al., 2020). However, the authors in Hersbach et al. (2020) warn that this uncertainty does not represent the total observation error: even though the generation of EDA takes into account random uncertainties in the observations, sea-surface temperature, and the physical parametrisations of the model, other sources of uncertainty, such as systematic model errors, are not considered. Aiming to provide a more accurate representation of the total observation error, an extra observational reference has been generated in this study, namely the MR. Differences between multiple global reanalyses have already been studied for surface and near-surface winds and have been shown to be non-negligible (Ramon et al., 2019). The MR uses ERAI, JRA55, HRES, MERRA2, and R1 as source data, and its generation is presented in the following section. The discrepancies between reanalyses, quantified as the spread of the set, can provide a lower-bound estimate of the total observational uncertainty (Bellprat et al., 2017; Parker, 2016). Besides, the ensemble mean of the MR (MR-MEAN) has also been added to the set of observational references, since it has been reported to provide better estimates of the skill than the individual products (Mortimer et al., 2020).

Computation of the probabilities
Predictions are evaluated at the 1,542 HadISD point locations (Figure 1), and the results correspond to the average over all locations. Gridded data (both predictions and reanalysis observations) are interpolated bilinearly to each of the stations. After interpolation, the predicted and observed probabilities are obtained at each location. The present work focuses on events exceeding the 90th percentile, which is computed separately for each dataset from its climatological distribution. A probability of 1 is assigned when winds exceed the 90th percentile and 0 otherwise. When ensemble members are available, the resulting probability of occurrence is obtained by pooling (i.e., gathering) all members together, converting them into binary values, averaging the exceedances, and rounding the resulting fraction to one decimal. Both the predicted probabilities f and the observed probabilities g are computed in this way.
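As an illustration of this step, the sketch below (with made-up wind values) dichotomises a pooled ensemble against the dataset's own climatological 90th percentile and rounds the exceedance fraction to one decimal.

```python
import numpy as np

def exceedance_probability(members, threshold):
    """Probability of exceeding the 90th-percentile threshold: pool all
    ensemble members, convert each to a binary exceedance, average, and
    round the resulting fraction to one decimal."""
    binary = (members > threshold).astype(float)
    return np.round(binary.mean(), 1)

# Made-up example at one station and one month
climatology = np.random.default_rng(0).normal(5.0, 1.5, size=37 * 10)  # past winds (m/s)
p90 = np.percentile(climatology, 90)          # the dataset's own 90th percentile
members = np.array([4.1, 6.3, 5.0, 7.2, 3.8, 6.9, 5.5, 4.4, 6.1, 5.9])
print(exceedance_probability(members, p90))
```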

Generation of the MR
The MR has been produced by pooling ERAI, JRA55, HRES, MERRA2, and R1 data into one dataset. In this way, for each location and month/season, a set of five wind values representing an ensemble distribution is available. Before that pooling, however, the observed probabilities have to be computed for each dataset separately. This is because, unlike the five prediction systems or the EDA, the ensemble of the MR has been generated from entirely independent datasets (including different numerical weather prediction models and data assimilation schemes). In other words, the members are not exchangeable and may have differing climatologies.
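A minimal sketch of this construction, assuming the five per-reanalysis 90th percentiles have already been computed, dichotomises each reanalysis against its own climatology before pooling; the values below are hypothetical.

```python
import numpy as np

def multi_reanalysis_probability(values, thresholds):
    """Observed MR probability at one location and month: each reanalysis is
    dichotomised against its own 90th percentile (members are not
    exchangeable), and the binary outcomes are then pooled."""
    binary = [float(v > t) for v, t in zip(values, thresholds)]
    return np.round(np.mean(binary), 1)

# Hypothetical values from ERAI, JRA55, HRES, MERRA2 and R1 (m/s)
values     = [6.1, 5.8, 6.4, 5.2, 6.8]
thresholds = [6.0, 6.2, 6.3, 5.9, 6.5]   # each dataset's own climatological p90
print(multi_reanalysis_probability(values, thresholds))  # -> 0.6
```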

Forecast verification with ensemble reanalyses
The first part of the present work looks into the possibilities that ensemble reanalyses offer for obtaining estimates of the skill of a seasonal prediction system. The quality of SEAS5 in predicting extreme winds (wind speeds exceeding the 90th percentile of the climatological distribution) is assessed with the BS in January, April, July, and October for the 1981-2017 period. The metric, BS90 henceforth, is computed with each of the nine observational references (Table 2). For the single-value references (i.e., all references but EDA and MR) a traditional verification procedure is followed (Jolliffe & Stephenson, 2012). A new verification method is introduced here for the case when ensemble reanalyses are employed: the BS90 is computed separately for each realisation of the ensemble and later averaged to obtain the resulting BS90.
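The following sketch, with hypothetical input arrays, illustrates this per-realisation procedure for an ensemble reference such as EDA.

```python
import numpy as np

def bs90_against_ensemble_reference(f, obs_members, threshold):
    """REF verification against an ensemble reanalysis: the BS90 is computed
    separately against each realisation and the scores are then averaged.
    f: predicted probabilities, shape (n,); obs_members: observed winds, (n, m)."""
    scores = []
    for k in range(obs_members.shape[1]):
        y_k = (obs_members[:, k] > threshold).astype(float)  # binary obs, member k
        scores.append(np.mean((f - y_k) ** 2))
    return float(np.mean(scores))
```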
The variations in the BS90 results have also been studied with the decomposition of the BS into reliability, resolution, and uncertainty terms (Murphy, 1973):

$$\mathrm{BS} = \frac{1}{n}\sum_{i} N_i\left(f_i - \bar{y}_i\right)^2 - \frac{1}{n}\sum_{i} N_i\left(\bar{y}_i - \bar{y}\right)^2 + \bar{y}\left(1 - \bar{y}\right), \qquad (5)$$

where $\bar{y} = \frac{1}{n}\sum_{t=1}^{n} y_t$ is the observed climatological frequency, $\bar{y}_i$ is the observed frequency when forecast $f_i$ is issued, and $N_i$ is the number of times each forecast $f_i$ appears in the collection of forecasts being verified. The three terms on the right-hand side are the reliability, resolution, and uncertainty terms respectively. These three terms are computed separately for each realisation, and the averages of reliability, resolution, and uncertainty are presented in this study.
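A small Python sketch of this decomposition, grouping identical forecast probabilities (as the 0.1-rounded probabilities allow), could look as follows; it mirrors the reconstruction of Equation (5) above.

```python
import numpy as np

def brier_decomposition(f, y):
    """Murphy (1973) decomposition of the Brier score, Equation (5):
    BS = reliability - resolution + uncertainty."""
    n = len(y)
    ybar = y.mean()                          # observed climatological frequency
    reliability = 0.0
    resolution = 0.0
    for fi in np.unique(f):
        mask = f == fi
        Ni = mask.sum()                      # times forecast fi was issued
        ybar_i = y[mask].mean()              # observed frequency given fi
        reliability += Ni * (fi - ybar_i) ** 2
        resolution += Ni * (ybar_i - ybar) ** 2
    uncertainty = ybar * (1 - ybar)
    return reliability / n, resolution / n, uncertainty
```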
In order to test the sensitivity of the skill estimates to hindcast length, the whole process described above has been repeated 100 times. In each repetition, a random number of years between 20 and 37 within the 1981-2017 period is sampled without replacement.
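A sketch of this resampling is given below; compute_bs90 is a user-supplied placeholder standing in for the verification described above, not part of the study's code.

```python
import numpy as np

def sample_bs90_distribution(compute_bs90, n_repetitions=100, seed=42):
    """Sensitivity of the BS90 to hindcast length: repeat the verification
    over randomly drawn subsets of 20-37 years, sampled without replacement.
    compute_bs90 maps a set of years to a BS90 value."""
    rng = np.random.default_rng(seed)
    years = np.arange(1981, 2018)                 # the 37 available years
    samples = []
    for _ in range(n_repetitions):
        n_years = rng.integers(20, 38)            # random length in [20, 37]
        subset = rng.choice(years, size=n_years, replace=False)
        samples.append(compute_bs90(subset))
    return np.array(samples)
```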
It is worth noting that this verification procedure averages over observation error without considering its potential to bias the score. This "reference" methodology will be referred to as REF in this article, in contrast to the methods described in Section 2.

Observation error in a monthly prediction
The second part of this work quantifies the errors arising when verification scores are computed by averaging over observation error (i.e., using the REF method) or by inserting observational probabilities (i.e., using CT08 with ensemble reanalyses) in the BSS90 of monthly predictions from SEAS5 in the 1981-2017 period. In particular, we analyse 12 monthly predictions, one initialised at the beginning of each month of the year, all at lead time zero. For better visualisation, stations where SEAS5 does not offer skill (BSS90 < 0) have been removed for each month separately, leaving a sample of 520-697 point locations, depending on the analysed month.
The BSS90 is computed using the REF, CT08, and F17-Ext methods. The REF method follows the methodology explained in the previous subsection to compute the skill with all nine observational references. It is worth recalling that, when ensemble observations are available, the resulting BS90 is the average of the BS90 obtained for each observation separately. The CT08 method, Equation (1), is then used only with EDA and MR, whose ensemble observations are employed to build the probability distribution of the verifying observation g.
Finally, F17-Ext, Equation (4), is employed with all nine observational references, assuming that the "true" observed value x at each location is that of HadISD. These "true" observations are also utilised to compute r_0(f) and r_1(f), which in this case can be seen as a measure of the quality of the set of reanalyses. To this end, the forecast probabilities f are discretised in bins of 0.1. For each month and observational reference, r_x(f) is estimated as the proportion of cases, over all 520-697 stations, in which the reference reports the outcome 1 - x while HadISD reports x, among those cases where the forecast is f.
To account for the imperfections that may arise in the computation of r_0 and r_1, a leave-one-out cross-validation has been applied. In this approach, the verifying observation is excluded from the computation of r_0 and r_1, so that the misclassification probabilities are recomputed for each monthly forecast from each year. The resulting BSS90 value for each month is the average skill over all years in 1981-2017.
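A possible sketch of this estimation, assuming arrays of forecast probabilities f, reanalysis outcomes y, and HadISD outcomes x pooled over stations and years, with an optional leave-one-out index, is given below.

```python
import numpy as np

def misclassification_rates(f, y, x, exclude=None, bin_width=0.1):
    """Estimate r0(f) = Pr(y=1 | x=0) and r1(f) = Pr(y=0 | x=1) in forecast
    bins of width 0.1, taking the HadISD outcomes x as the truth. If `exclude`
    is given, that case is left out (leave-one-out cross-validation)."""
    keep = np.ones(len(f), dtype=bool)
    if exclude is not None:
        keep[exclude] = False
    idx = np.round(f / bin_width).astype(int)
    n_bins = int(round(1.0 / bin_width)) + 1
    r0 = np.full(n_bins, np.nan)                  # NaN where no cases exist
    r1 = np.full(n_bins, np.nan)
    for b in range(n_bins):
        sel0 = keep & (idx == b) & (x == 0)       # truth: non-occurrence
        sel1 = keep & (idx == b) & (x == 1)       # truth: occurrence
        if sel0.any():
            r0[b] = np.mean(y[sel0] == 1)         # false alarms in the reference
        if sel1.any():
            r1[b] = np.mean(y[sel1] == 0)         # misses in the reference
    return r0, r1
```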
In addition, the BS90 for the climatological forecast is also computed using the REF, CT08, and F17-Ext methods. The climatological forecast issues a constant probability of 0.1 for the occurrence of wind speeds above the 90th percentile, which is obtained from the SEAS5 past climatology. The climatological forecast is used here as a baseline, allowing the computation of the BSS90.
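In terms of a sketch (with illustrative outcomes, not data from the study), the skill score then follows directly from the two BS90 values, using the constant climatological probability of 0.1 stated above.

```python
import numpy as np

def bss90(bs90_system, bs90_climatology):
    """Brier skill score of the prediction system against the climatological
    baseline, which issues a constant probability of 0.1."""
    return 1.0 - bs90_system / bs90_climatology

# Under REF, the climatological BS90 is the score of a constant forecast f = 0.1:
y = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 1], dtype=float)   # illustrative outcomes
bs90_clim = np.mean((0.1 - y) ** 2)
```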

Preparation of the model rankings
The third part of our work investigates how observational uncertainty alters a set of rankings in which the five prediction systems (Table 1) are ordered from best to worst based upon their skill, calculated with and without taking observation error into account. Since the comparison involves systems with different numbers of ensemble members, the fair version of the BS90, the fBS90, has been chosen. The fBS gives an estimate of the BS that would be obtained as the ensemble size increases to infinity (Ferro, 2014). To better visualise the effects of accounting for observation error, (a) the skill score version of the fBS90 (i.e., the fBSS90) is presented with climatology as the reference forecast, and (b) stations where seasonal predictions do not show skill (fBSS90 < 0), using HadISD as reference data, are filtered out. This reduces the verification sample from 1,542 to 143 point locations.
Since the fair version of the BS preserves propriety, the F17-Ext method is still applicable. First, the BS90 is obtained for each fixed f and y, Equation (2), and later transformed into its fair version, the fBS90. Then the extended error-corrected BS, Equation (4), is applied to obtain the score values that account for observation error. It has to be noted that r_0(f) and r_1(f) are computed separately for each prediction system. The fBS90 of the climatological forecast is computed analogously, finally yielding the fair version of the BSS, the fBSS90.
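A minimal sketch of the fair Brier score for one forecast-observation pair, following the standard formulation in Ferro (2014) where i out of m ensemble members predict the event, could look as follows (the numbers in the usage line are illustrative).

```python
def fair_brier_score(i, m, y):
    """Fair Brier score (Ferro, 2014) for one forecast-observation pair:
    i of the m ensemble members predict the event, y is the binary observation."""
    return (i / m - y) ** 2 - i * (m - i) / (m ** 2 * (m - 1))

# Example: 13 of 51 members predict an exceedance that is observed (y = 1)
print(fair_brier_score(13, 51, 1))
```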

How does the choice of the observational reference affect the skill estimates?
The BS90 of SEAS5 predictions of winds exceeding the 90th percentile of the climatological distribution has been obtained using nine different observational references and is shown in Figure 2. Boxplots display how the score varies with the length of the verification period, which is known to cause large deviations for short hindcasts (Kumar, 2009; Lledó et al., 2020). At first glance, we observe that EDA is by far the reference against which SEAS5 scores best, and with the lowest sensitivity to changes in the length of the verification period. The different range of BS90 values obtained with EDA compared with the other references is due to variations in the uncertainty term (shown in the right panel of Figure 3).
Whereas the reliability and resolution estimates depend on which years have been sampled and which observational reference is used for the verification, the variation in the uncertainty term is only due to the variation in the number of sampled years (ranging from 20 to 37). Ideally, the uncertainty term should be unaffected by changes in the size of the verification sample and should tend towards the value 0.09 for an event with a climatological probability of 0.1, as given by the rightmost term of Equation (5). However, when a finite and reduced sample is used for verification, the estimate of the uncertainty term may differ from its theoretical value. When there are 20 or 30 years, exactly 10% of the observations lie above the 90th percentile (resulting in an uncertainty value of 0.09), but for other numbers of years more than 10% do (resulting in an uncertainty value greater than 0.09). In contrast, EDA has 10 times as many observations, so the number of observations ranges from 200 to 370 and exactly 10% of the observations always lie above the 90th percentile, resulting in an uncertainty value of 0.09. In addition, the exchangeability of the EDA members makes it more advantageous than the MR, which delivers the mean score obtained by the five single-value reanalyses used to generate it. In the absence of EDA, the bias in the BS90 can be partially adjusted by computing the skill score BSS90, because the bias is partially compensated by that of the reference forecast (e.g., climatology).
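As a simple arithmetic illustration of this sampling effect (the counts are illustrative), the uncertainty term $\bar{y}(1-\bar{y})$ equals 0.09 only when the observed relative frequency is exactly 0.1:

$$\bar{y}=\frac{2}{20}=0.10 \;\Rightarrow\; \bar{y}\,(1-\bar{y})=0.09, \qquad \bar{y}=\frac{3}{25}=0.12 \;\Rightarrow\; \bar{y}\,(1-\bar{y})=0.1056 .$$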
We also observe differences in the BS90 between all observational references (shown in Figure 2), and these differences are of the same magnitude as the differences between period subsets. However, the pairwise differences between references are generally similar for all period subsets (not shown), and these differences would persist if longer hindcasts were available. These two types of differences are explained by differences in the reliability and resolution components (shown in the left and central panels of Figure 3). The verification of SEAS5 against HadISD, the station dataset, shows the worst reliability and resolution values, which may be due to representativeness errors in the gridded data of the predictions. Among the reanalyses, JRA55 displays the worst reliability and resolution outcomes, but this dataset has been shown to perform poorly in estimating surface winds (Ramon et al., 2019). The sensitivity of skill estimates to the choice of observational reference is the focus of the following subsections.

Obtaining unbiased estimates of the prediction skill
For the whole hindcast, we evaluate the potential of the CT08 and F17-Ext methodologies to obtain unbiased skill estimates, whatever the observational reference. Figure 4 compares the BSS90 of monthly predictions computed using the REF, CT08, and F17-Ext methodologies. Compared with the BSS90 computed with HadISD, REF tends to underestimate the skill (Figure 4, left). The largest bias is noted for JRA55, which systematically shows the worst skill for all months. By contrast, HRES is the single-value reanalysis that shows skill score values closest to those obtained with HadISD, but it still shows a negative bias.
The skill assessment against the ensemble reanalyses shows slightly different results. The verification of the monthly predictions with CT08 and MR overestimates the true BSS90. This is also observed for EDA, but only for certain forecast months (e.g., February and April). These results may be due to either (a) observation error not being fully or correctly accounted for in the ensemble reanalyses or (b) the CT08 methodology failing to properly adjust for observation error.
The right panel of Figure 4 shows that these biases in the BSS90 values are greatly reduced with F17-Ext.
The variations in the BSS90 values observed in Figure 4 (left) are due to differences in the quality of the observational references. Whereas JRA55 is not a good observational reference for surface wind speeds, HRES can approximate the true skill better owing to its higher quality (Ramon et al., 2019). To investigate this further, we examine the misclassification rates r_0 and r_1 and their dependence on f (shown in Figure 5). First, we note that while r_0 increases with f, r_1 decreases. This is because the event is rarely observed for low f, which leads to a high number of occurrences of y = 0 and x = 0; this is achieved by even a minimally skilful prediction. r_0 is almost zero for low f because it measures the misclassification rate of occurrences in the uncertain observations (y = 1), which is very small for the lowest values of f in proportion to the "true" non-occurrences (x = 0). The opposite happens with r_1: it measures the misclassification rate of non-occurrences (y = 0) with respect to occurrences (x = 1).
The misclassification rates also vary depending on which reference is used. Noticeable differences in the r_0 and r_1 estimates for each reference are seen, especially as f increases. Sharp forecasts (large f for the case studied here) are rare for these monthly predictions, which sometimes makes it impossible to infer r_0 and r_1 values; this occurs especially for r_0 (Figure 5, left column). There are two reasons for this situation: (i) there are no occurrences for some f values (e.g., f = 1 in October), and (ii) for the highest values of f the event is almost always observed, thus impeding the computation of some values of r_0. Though this means that the results for F17-Ext use a slightly different set of cases than the results for REF and CT08, it does not make any material difference to the conclusions because these cases are rare. Having said that, we can expect the estimates of the misclassification rates for the highest f to be less robust. Concerning the individual reanalyses, none of them stands out systematically from the others, but in general EDA and HRES show the lowest values for r_0 and r_1, respectively, mirroring the good quality of ERA5.

Choice of observational reference in model ranking
This subsection infers the fBSS90 of the five prediction systems, ranking them from best to worst according to their fBSS90 estimates (Figure 6). Model rankings differ depending on (a) whether the verification takes the error in the observational reference into account and (b) the method used to account for observation error.
When observation error is averaged over using the REF method (Figure 6, left), the five prediction systems rank differently depending on which observational reference is used. No consensus is reached about which prediction system is best. For example, SPS3 is best when verified against JRA55, MERRA2, R1, and MR-MEAN, is second best when verified against HRES, EDA, MR, and HadISD, and occupies the central position when ERAI is employed. It is also noted that some reanalyses tend to favour forecasts from prediction systems that share some configuration components (e.g., the atmospheric model). Such is the case for SEAS5 when assessed against ERAI, HRES, and EDA, all of them produced at ECMWF. In these evaluations, SEAS5 scores best (Figure 6, left). For other references, like JRA55 and MERRA2, SEAS5's skill appears worse, namely second and third best respectively. Whether this leads to an overestimation of the fBSS90 should be further investigated in future studies. With all this in mind, it appears that having fBSS90 estimates that are independent of an arbitrary choice of the dataset would ensure correct verification outcomes and comparisons.
However, neither CT08 nor F17-Ext produces identical model rankings for all observational references (Figure 6, central and right panels) with the seasonal prediction data used in this study. The uncertainty in the fBSS90 estimates is still of the same order of magnitude as the differences in skill between models. In F17-Ext, the error in the fBSS90 value is propagated from the imperfectly known values of r_0 and r_1.
If we knew r_0 and r_1 perfectly (i.e., without error), or had a very good estimate of them, F17-Ext would be able to remove the bias in the fBSS90, as in Figure 7 (right). In that case SEAS5 stands out as the best prediction system, whereas MF6 shows the lowest skill estimates whatever the observational reference. Such conditions would be achievable in seasonal prediction with a large verification sample. Unfortunately, long hindcasts are hardly ever available, although recent studies have started providing centennial seasonal predictions for testing (Weisheimer et al., 2021). Other time-scales, such as sub-seasonal prediction or weather forecasting, deal with large amounts of data with which the error distribution of the observations can be inferred more accurately.

DISCUSSION
The verification of seasonal predictions is a fundamental procedure, because a forecast product is useless without an estimate of its quality. Traditional verification methods are well established and used routinely in many applications, including operational seasonal forecasting. However, different issues affect the estimates of seasonal forecast quality, producing misleading results. This work explores how the inherent errors in the verifying observations affect the verification outcomes and provides a methodology to adjust for those errors in the skill estimates. Indeed, observation error has been shown to play a crucial role in the skill assessment of seasonal predictions. Different possibilities have been proposed in the literature to account for such errors. Here, we have focused on those that attach the error to the reference observation, leaving the predicted ensemble elements as they are. Three different approaches have been tested and compared: the traditional approach (REF), which considers a single verifying observation with no error; the observational-probability method of Candille and Talagrand (2008) with ensemble reanalyses; and F17-Ext with the whole set of observational references together with their uncertainty.
None of the methodologies yielded exactly the same scores that would be obtained if the observations were error free. In the case of REF, the real performance of the prediction systems is underestimated, in accordance with previous studies (e.g., Candille & Talagrand, 2008; Ferro, 2017; Santos & Ghelli, 2012). The different fBSS90 and BSS90 values obtained with the REF and CT08 methodologies mirror the important disagreements between the reanalyses themselves and between reanalyses and point observations (Ramon et al., 2019). To capture the discrepancies between reanalyses, an MR dataset has been created in this work. Though the MR combines the data from five global reanalyses, providing a more accurate estimate of the observational uncertainty, it still yields a biased skill score, because none of the observational references reaches the values that a perfect reference would achieve. Nor is it sufficient to use the best single reanalysis (i.e., HRES) to obtain unbiased estimates of the score values. This suggests that the differences between reanalysis values and point observations (i.e., representativeness errors) are large and are not traditionally accounted for in the verification.

FIGURE 7 Same as Figure 6, but with r_0 and r_1 perfectly known.
To better account for observation error, Equation (4) is a reasonable proposal. Although the extended scoring rules do not preserve propriety and cannot recover the true skill perfectly with a small verification sample, they can provide good estimates of the true expected score if good estimates of r_0 and r_1 are provided. This is key in many respects. For example, an apparently skilful prediction verified against an uncertain reference may be profitable if the observation error is properly considered. The downside of the method, however, is that either the misclassification probabilities r_0 and r_1 and their dependence on f have to be known beforehand, or true observations have to be available for generating the error distribution of the observations. In this study, we assumed the in-situ dataset to be perfect, neglecting the instrumental errors contained in the station data. Other evaluations might, for example, assume a reanalysis (preferably a high-resolution or regional reanalysis) to be the ground truth. That being said, more investigation is needed to determine whether the F17-Ext method is always effective and whether it can be improved upon.
The F17-Ext methodology cannot make the verification outcomes insensitive to changes in the hindcast length: Equation (4) only corrects towards the true skill, which itself depends on the size of the verification sample. However, we have showcased another important problem faced in the computation of specific skill scores when using a relatively short hindcast, as is often the case in seasonal forecasting. This work has shown that the BS90 can appear biased depending on the number of sampled years used in the verification. We have shown how to correct for such error using an ensemble reanalysis with exchangeable members. Selecting EDA as the reference provides another added value: since the product is generated with the same model configuration as HRES, the high quality of the reference observations can, in principle, be guaranteed.

CONCLUSIONS
The main conclusions of this study are as follows:
• The use of ensemble reanalyses like EDA is advantageous over single-value reanalyses to obtain robust quality estimates of seasonal predictions for extreme events and to better approximate the true skill scores.
• Still, ensemble reanalyses do not solve the issue of observational uncertainty impacting the skill assessment.
• Skill scores are highly sensitive to the choice of the observational reference, especially when the references are inaccurate. This poses a risk in many situations; for example, when selecting the best prediction system among a set of potential candidates.
• F17-Ext is a promising tool, but it can only be used if observational products are accompanied by uncertainty estimates of their quality (with reference to true values at grid or point scale).
• F17-Ext can be used to understand the skill of gridded forecasts interpolated at a point scale, if representativeness errors are inferred with misclassification rates between the two scales.
• The need for information about observational uncertainty should be taken into account in the production of future reanalysis datasets.Estimates of this uncertainty are beneficial for many applications, not only for forecast verification.
Many users of seasonal predictions employ gridded forecasts to infer future anomalies of climate variables at a point scale and then verify these predictions using point data. Such is the case of the wind power industry, where wind observations taken at wind farms (point scale) are utilised in skill assessments of forecasts that represent a larger-scale quantity (grid scale; Ramon et al., 2021). Our work provides insights into the outcomes of verifications of gridded forecasts using point data (HadISD) and compares these against an analogous verification using gridded observations (reanalyses). We go beyond this, showing how point observations can be used to adjust for observation error in reanalyses, allowing reanalysis data to extend the verification to locations where in-situ measurements are unavailable. Moreover, the results presented here can be extended to other temporal scales, such as weather forecasting or decadal predictions.
Future studies could be devoted to generalising the calculation of the estimates of the error distribution of the observations, avoiding the need to assume a ground truth. Even though this article emphasises the strong dependency of the misclassification rates on the observational reference dataset, a relationship between these rates and the predicted probabilities could be established in the form of, for example, a parametrisation equation. However, the extent to which this can be generalised has to be explored in detail.

FIGURE 1 Spatial distribution of the 1,542 Hadley Centre Integrated Surface Database stations.

FIGURE 2 Boxplots representing the distribution of the Brier score of the 90th percentile (BS90) values for SEAS5 lead-zero monthly predictions for January, April, July, and October in the 1981-2017 period. The skill has been computed after sampling a random number of years between 20 and 37; this process has been repeated 100 times and the distribution of BS90 values is represented with the boxplots. ERAI: European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis-Interim; JRA55: Japanese 55-year Reanalysis; HRES: ECMWF Reanalysis version 5 (ERA5)-High Resolution; MERRA-2: Modern-Era Retrospective analysis for Research and Applications, version 2; R1: National Centers for Environmental Prediction-National Center for Atmospheric Research Reanalysis 1; HadISD: Hadley Centre Integrated Surface Database; MR-MEAN: multi-reanalysis mean; ERA5-EDA: Ensemble of Data Assimilations; MR: multi-reanalysis.

FIGURE 3 Boxplots of the decomposition of the Brier score of the 90th percentile (BS90) into the reliability, resolution, and uncertainty terms. It is worth recalling that, unlike reliability and uncertainty, the resolution term enters the Brier score with a negative sign: higher resolution values improve the skill, and vice versa. Seasonal predictions correspond to SEAS5 lead-zero monthly predictions for January in the 1981-2017 period. Results for April, July, and October can be found in the Supporting Information. ERAI: European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis-Interim; JRA55: Japanese 55-year Reanalysis; HRES: ECMWF Reanalysis version 5 (ERA5)-High Resolution; MERRA-2: Modern-Era Retrospective analysis for Research and Applications, version 2; R1: National Centers for Environmental Prediction-National Center for Atmospheric Research Reanalysis 1; HadISD: Hadley Centre Integrated Surface Database; MR-MEAN: multi-reanalysis mean; ERA5-EDA: Ensemble of Data Assimilations; MR: multi-reanalysis.

FIGURE 6 Model rankings for the fBSS90 of five seasonal prediction systems and nine observational references. All predictions correspond to lead time zero and to December-January-February in the 1993-2015 period. Two asterisks (**) indicate that the differences with the fBS90 of HadISD are statistically significant according to a Diebold-Mariano test at the 95% confidence level with a Bonferroni correction for multiple testing. On the left, the observational reference values are considered perfect. The central panel computes the skill score using the CT08 methodology. On the right, the observational error in the BSS90 is accounted for as in F17-Ext, assuming HadISD to be the ground truth. Within a panel, the lower the square, the worse the skill. ERAI: European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis-Interim; JRA55: Japanese 55-year Reanalysis; HRES: ECMWF Reanalysis version 5 (ERA5)-High Resolution; MERRA-2: Modern-Era Retrospective analysis for Research and Applications, version 2; R1: National Centers for Environmental Prediction-National Center for Atmospheric Research Reanalysis 1; HadISD: Hadley Centre Integrated Surface Database; MR-MEAN: multi-reanalysis mean; ERA5-EDA: Ensemble of Data Assimilations; MR: multi-reanalysis.
TABLE 1 Specific details of the seasonal prediction systems employed in this study. a European Centre for Medium-Range Weather Forecasts. b Centro Euro-Mediterraneo sui Cambiamenti Climatici.

TABLE 2 Specific details of the nine observational datasets employed in this study. a European Centre for Medium-Range Weather Forecasts. b Japan Meteorological Agency. c National Aeronautics and Space Administration-Global Modeling and Assimilation Office. d National Centers for Environmental Prediction-National Center for Atmospheric Research.