Locally tail-scale invariant scoring rules for evaluation of extreme value forecasts

Statistical analysis of extremes can be used to predict the probability of future extreme events, such as large rainfalls or devastating windstorms. The quality of these forecasts can be measured through scoring rules. Locally scale invariant scoring rules put equal importance on the forecasts at diﬀerent locations regardless of diﬀerences in the prediction uncertainty. This can be an unnecessarily strict requirement when mostly concerned with extremes. We propose the concept of local tail-scale invariance, describing scoring rules fulﬁlling local scale invariance for large events. Moreover, a new version of the weighted Continuous Ranked Probability score (wCRPS) called the scaled wCRPS (swCRPS) that possesses this property is developed and studied. We show that the score is a suitable alternative to the wCRPS for scoring extreme value models over areas with varying scale of extreme events, and we derive explicit formulas of the score for the Generalised Extreme Value distribution. The scoring rules are compared through simulation, and their usage is illustrated in modelling of extreme water levels in the Great Lakes and annual maximum rainfalls in the Northeastern United States.


Introduction
The main aim for statistical analysis of extremes in fields such as hydrology, meteorology, climate science, and finance is forecasting, or prediction, of the risk of occurrence of future dangerous extreme events, such as huge rainfall events or very large fluctuations of prices of a financial portfolio.Typically, the forecast is in terms of an estimated distribution, for example the estimated distribution of the size of the largest event which will occur during some specified period of time.In this case, the forecast is said to be probabilistic.The forecast distribution is then often presented to decision makers in a reduced form, perhaps as a predicted probability of exceeding one, or several, high thresholds.In many cases, there are several competing forecast distributions, and a standard way of choosing between these competing forecasts is to compute goodness of fit measures of the distributions based on the existing data, and then use the distribution that fits best.This approach is, for example, broadly used in hydrology (Cugerone and De Michele, 2015).However, an alternative gaining increasing interest is to base the selection on proper scoring rules (Zamo and Naveau, 2018).
A scoring rule is a functional S(P, y) that takes in an outcome y and a probabilistic forecast P, the predictive distribution for the outcome.If the outcome Y ∼ Q is a random variable with distribution Q, the score S(P, Y ) is also a random variable, with expected value denoted by S(P, Q).To be able to make an earnest prediction, it is preferred that the expected score is maximised under the correct model, i.e., that S(Q, Q) ≥ S(P, Q), for all predictive distributions P. In this case the score is said to be proper (Gneiting and Raftery, 2007).If strict inequality holds for all P ̸ = Q, the scoring rule is strictly proper.
An example of a commonly used proper scoring rule is the Continuous Ranked Probability Score (CRPS), which is defined as where F is the cumulative distribution function of the forecast distribution P, and 1(•) is the indicator function.Another popular example is the logarithmic score LS(P, y) = log f (y), where f is the density function of P. Both of these scores are widely used in meteorology, climate science, and finance (Opschoor et al., 2017;Ingebrigtsen et al., 2015;Haiden et al., 2019).The choice of scoring rule allows an evaluator to emphasise the desired features of the prediction.For extreme events, one typically is interested in the behaviour of the prediction in the tail of the distribution.One way of doing this is to simply multiply the proper score S 0 with a weight function w and form a new score S(P, y) = w(y)S 0 (P, y).However, this score is not proper unless w(y) is constant (Gneiting and Ranjan, 2011) and alternative ways of weighting should be used.Several such proper weighted scores exist (see e.g.Diks et al., 2011;Tödter and Ahrens, 2012;Gneiting and Ranjan, 2011), with a popular one being the threshold weighted CRPS (wCRPS) by Gneiting and Ranjan (2011), where a weight function is included in the kernel of the CRPS, wCRP S(P, y) = − R w(x)(F (x) − 1(x ≥ y)) 2 dx.
Here w(x) is a non-negative function that can be chosen to put more emphasis on large values.However, what weight function to use is not obvious.The forecaster's dilemma, as presented by Lerch et al. (2017), describes the problem that forecasters can be encouraged to exaggerate their predictions, since the effect of missing a calamity is worse than wrongfully predicting one.To avoid this, the choice of the weight function should ideally be made by the user of the forecasts rather than by the forecaster.
In the case of multiple observations, y = (y 1 , , , , y n ), and a forecast P = (P 1 , ..., P n ) the joint score of the forecast is defined as the average score, and this is used to inform the selection of prediction method.
As recently shown by Bolin and Wallin (2023), many common scoring rules, such as the CRPS, are "scale dependent" in the sense that they put higher importance on forecasts with higher prediction uncertainty in the case when the observations y i in the average score have varying predictability.This is, sometimes, an undesirable property.For example, if one wants to predict extreme rainfalls in some spatial area where the scale of the events varies, infrastructure in the parts where the scale of events is small and in parts where the scale is large will have adapted to these differences, and accuracy is equally important throughout the area.Bolin and Wallin (2023) introduced the concept of "local scale invariance" to describe scoring rules which do not have this feature, and showed that the logarithmic score is an example of a score that is both strictly proper and locally scale invariant.However, use of the logarithmic score requires computation of the probability density function of the predictive distribution, which is not always easy to compute, or might not even exist.
Even though local scale invariance is an important property, it might be an unnecessarily strict requirement if one is mostly concerned with extremes.In this work, we therefore introduce a new concept of "local weight-scale invariance" to describe scoring rules which have the property of local scale invariance for a certain region of interest.As a special case, when the region of interest is large events, we obtain the concept of "local tail-scale invariance".We develop and study a new version of the wCRPS called the scaled wCRPS, or swCRPS, which is locally weight-scale invariant.The Generalised Extreme Value (GEV) distribution is of special interest in extreme value statistics, and we derive explicit formulas of swCRPS for the GEV distribution.These scores are analysed in terms of local scale invariance.We compare the swCRPS and the censored likelihood score to the more common logarithmic score, the CRPS, and the wCRPS in a simulation study and in case studies where different models for water levels in the Great Lakes, extreme rainfall events in the Northeastern U.S., and air pollution in the Piemonte region in Italy are evaluated using the different scoring rules.
The article is structured as follows: Section 2 gives a background on scoring rules, weighting of scores, and local scale independence of scores.In Section 3, scale independence of weighted scores is discussed.Section 4 contains simulation studies, and Section 5 presents case studies.Finally, discussion and conclusions can be found in Section 6. Derivation of explicit formulas for swCRPS for the GEV distribution, a background on modelling and scoring of extremes, and proofs of propositions are included in the appendices of the article.

Background
In this section we provide more background information about the CRPS and its weighted variants, provide details about scale dependence of scoring rules, and introduce some commonly used models for annual flow and precipitation maxima.In the following, E P [g(X)] denotes the expectation of g(X) when X ∼ P is a random variable with distribution P, and E P,Q [g(X, Y )] denotes the expectation of g(X, Y ) for independent random variables X ∼ P and Y ∼ Q.

The CRPS and weighted scoring rules for extremes
A different representation of CRPS, obtained by Baringhaus and Franz (2004), is where X, X ′ are independent random variables with the same distribution P and finite first moment.This expression can be derived by noting that 1), using Fubini's theorem to compute E|X − Y | in terms of the distribution functions F and G of X and Y respectively, and finally setting G(x) = 1{x ≥ y}.This representation is useful in cases where a closed form expression for F does not exist, which might be the case when using ensemble forecasts in weather prediction, or when evaluating the predictive performance of complicated hierarchical models.In these cases, the expected values can be approximated through Monte Carlo simulation from the predictive distribution.Zamo and Naveau (2018) compared different estimators of CRPS and made a recommendation on how to choose the most accurate one based on the ensemble type available.
Similarly for the wCRPS, one can show that wCRP S(P, y) = 1 2 where g w (x, x ′ ) := for some real u.Instead of choosing the weight function as the indicator function w u , one might want to consider some other variation to incorporate the whole dataset instead of just the extremes.For example, Thorarinsdottir and Schuhen (2018) suggested w 2 (y) = 1 + 1w u (y) and w 3 (y) = 1 + w u (y)u as alternatives, where u was chosen as the 97.5% observed quantile.
In an example in Subsection 6.3. of Thorarinsdottir and Schuhen (2018), the best choice was to use w 2 as a weight function in the wCRPS.All these weight functions can be written as w(y) = a + bw u (y), where a, b, u are constants, and result in a weighted sum of the wCRPS with threshold weight function and the CRPS.Taillardat et al. (2022) and Brehmer and Strokorb (2019) show that expected scores are not suitable for scoring of "max-functionals", functionals which are determined by the behaviour of the distribution at +∞, with an example being the so-called Extreme Value Index.They then argue that this means that use of expected scores may be unsuitable for extremes.However, standard practical use of extreme value statistics is not only concerned with the prediction of tail functionals.If one is interested in predicting maxima, e.g.yearly maxima of daily rainfall, one obtains a sample of earlier yearly maxima, fits some distribution, perhaps a GEV distribution, to this sample and uses the fitted distribution for prediction.This is completely parallel to fitting a normal distribution to a sample and using the fitted distribution for prediction, and understanding and use of expected scoring rules are the same for the two cases.If one instead is interested in prediction of occurrences and sizes of excesses of a high threshold, one obtains a sample of such excesses and uses the sample for prediction.Again this is completely parallel to using a fitted normal distribution for prediction.Thus, we believe that there is still a value in using expected scoring rules for extremes as long as one is not solely interested in max-functionals such as the extreme value index.Taillardat et al. (2022) also suggested to treat observed scores as random variables, and to use qq-and pp-plots and Cramer-von Mises statistics to compare scores.This seems like a quite useful idea, both for scoring extremes and for scoring non-extreme values, although it is not entirely clear how to use this to obtain generally applicable rules for ranking models.Taillardat et al. (2022) and Brehmer and Strokorb (2019) discuss scoring of extreme value tail functionals, i.e., functionals that only depend on the behaviour of the distribution at +∞, such as the so-called extreme value index.For a number of examples of scoring rules, they show that to any "true" distribution, there are alternative distributions for which the scoring rules have expectations which are arbitrarily close to the expected score of the true distribution, but dramatically different tail functionals.The intuition, and sometimes the proofs of these results, use the following model for the alternatives, where ε is a positive number which can be chosen arbitrarily small, F 1 is the true distribution function, and F 2 has a heavier tail.Then the tail functional of F , e.g., the extreme value index, is the same as the that of F 2 , since F 2 dominates at +∞ for any ε > 0. But still it is possible to make the expected score for F and F 1 arbitrarily close by making ε small enough.
This fact is used as an argument against using average scoring rules for extremes.However, it should be noted that the same argument can be made against using scoring rules in non-extreme contexts.For example, if one is interested in the variance of the true distribution F 1 , then for arbitrarily small ε one can choose F 2 so that F has any variance, but the scoring rules give arbitrarily close expected values for F and F 1 .The only way to avoid this is by mathematical assumption, which for extremes would be to rely on the standard asymptotic theory for how the tails are connected with the observed data.More generally, any scoring rule can only address the parts of the sample space which is explored by the data, and in particular if one has n data points, either any values, or just large values, and nε is small, then it is not possible to use the data to distinguish between the models F and F 1 .This applies to both extremes and ordinary values.Taillardat et al. (2022) also suggested to treat observed scores as random variables, and to use qq-and pp-plots and Cramer-von Mises statistics to compare scores.This seems like a useful idea, both for scoring extremes and non-extreme values, although it is not entirely clear how to use it to obtain a generally applicable rule for ranking models.

Local scale invariance and kernel scores
As mentioned above, the CPRS is a scale dependent scoring rule.Bolin and Wallin (2023) argued that this can sometimes lead to wrong conclusions, e.g., if one uses predictions from different spatial locations such as in the extreme rainfall analysis discussed in the introduction.In this section, we make this notion of scale dependence more specific and discuss kernel scores.
For a given probability measure Q, let Q θ for θ = (µ, σ) ∈ R × R + denote a location-scale transform of this measure.That is, if Z ∼ Q, then µ + σZ ∼ Q θ .Note that Q (0,1) = Q.Now consider a small location-scale misspecification of θ that is proportional to σ, i.e., Q θ+kσr , where r = (r 1 , r 2 ) is a two dimensional unit vector representing the direction of perturbation and k ∈ R is a constant.For example, a model with the correct location, µ, but a perturbation of the scale parameter σ + kσ, can be written as Q θ+kσr with r = (0, 1).The idea of local scale invariance is that, for small perturbations, the difference between the score for the perturbed model and the true model should not depend on the scale.We make this precise in the following definition.
In the definition and later, for a scoring rule S we use the notation P S for the set of probability measures on (Ω, F) such that if P ∈ P S and Q ∈ P S then |S(P, Q)| < ∞.Further, for a set of probability measures Q 0 , we write for the set of probability measures in P S which can be obtained as location-scale transforms of measures in Q 0 .If the scoring rule S is proper and twice differentiable with respect to θ, a second order Taylor expansion yields that as k ↘ 0. The score S(Q θ+kσr , Q θ ) has a maximum at k = 0 due to the properness of S.
Hence, ∇ θ S(Q θ , Q)| Q=Q θ = 0 and the first order term of the Taylor expansion vanishes.Here, the function Definition 2.1.(Bolin and Wallin, 2023) Let S be a proper scoring rule with respect to some class of probability measures P on (R, B(R)), and assume that Q 0 is a set of probability measures we say that S is locally scale invariant on Q.One can show that the CRPS has scale function s(Q θ ) = σs(Q (0,1) ) for θ = (µ, σ), which means that the difference in Eq. ( 4) scales linearly with σ.The log-score on the other hand has a scale function s(Q θ ) = 1 σ 2 H Q for a matrix H Q that is independent of θ.The factor σ 2 therefore cancels in Eq. ( 4), so that the expression is independent of the scale (Bolin and Wallin, 2023).
A scoring rule that can be written as where g is a non-negative continuous negative definite kernel is called kernel score (Dawid, 2007).
In order to construct locally scale invariant scoring rules, Bolin and Wallin (2023) introduced a generalisation of this class of scoring rules, the generalised kernel scores, which are proper scoring rules defined as where h is any monotonically decreasing, convex, and differentiable function on R + .For the particular choice h(x) = − 1 2 log(x) and g(x, y) = |x − y| in Eq. ( 6), one obtains the scoring rule SCRP S(P, y) + 1, where SCRPS is the scaled CRPS scoring rule SCRP S(P, y) which was shown to be locally scale invariant in Bolin and Wallin (2023).Another way of deriving this scoring rule is to use the fact that for any negative proper scoring rule S(P, x) on a set of probability measures P, the transformed score is also a proper scoring rule on P (Bolin and Wallin, 2023).

Scaled weighted scoring rules
In this section, we introduce the concept of local weight-and tail-scale invariance as less restrictive alternatives to local scale invariance.We then survey and discuss the use of the scaled wCRPS to score extremes and derive its scaling properties.In the final subsection, we propose alternative combined scoring rules for extremes, concentrating on a scoring rule based on the logarithmic score.

Local weight-scale invariance
When scoring extremes, we are mostly interested in the tail properties of the distributions, and in other cases the interest might be on some specific range of values which not necessarily is extreme.Therefore, it might not be all that important to have full local scale invariance as long as this property holds in the region of interest.To make this precise, recall that Q is a family of location-scale transformations of probability measures in Q 0 .Let w be some weight function representing the region of interest as {x : w(x) > 0}, such that P In other words, the conditional distributions of location-scale transformations are also locationscale transformations.Let Q w denote the set of all these conditional distributions corresponding to Q ∈ Q, and we assume for simplicity that the conditional distributions of the measures in Q 0 are in Q 0 , so that Q w ⊂ Q.
Because we are interested in the region {x : w(x) > 0}, we define the concept of weight-scale invariance by considering the scaling properties of S when restricted to Q w .Definition 3.1.Let S be a proper scoring rule with respect to some class of probability measures P on (R, B(R)), and assume that Q 0 is a set of probability measures such that Q ⊆ P ∩ P S .Suppose that w is a weight function and let Q w ⊂ Q be the set of conditional probability measures as defined above.We say that S is a locally weight-scale invariant scoring rule with respect to a w on Q if S is locally scale invariant on Q w .Remark 3.2.Clearly, if S is locally scale invariant, it is also locally weight-scale invariant.However, as we will see later, there are scoring rules that are locally weight-scale invariant but not locally scale invariant.
Recall that the scoring rule S is locally scale invariant if its scale function satisfies s( Remark 3.3.If the interest is focused on the region w(•) > 0, we could simply use S restricted to Q w as a scoring rule, which means that we only consider the score based on the conditional distributions.However, this disregards the probability of w(•) > 0 which may be important in other situations.This probability is typically dependent on the scale, and local weight-scale invariance can thus be thought of as relaxation of local scale invariance where the scale dependence of the probability of w(•) > 0 is ignored.
In the special case where the weight function is the indicator weight function w u we refer to local weight-scale invariance as local tail-scale invariance.Specifically, we have the following definition.
Definition 3.4.Let w u be the indicator weight function in Eq. (3).Under the same assumptions as in Definition 3.1, we say that S is a locally tail-scale invariant scoring rule on Q if for each θ there exists u θ ∈ supp(Q (0,1) ) ∩ supp(Q θ ) such that S is locally scale invariant on Q wu for every u ≥ u θ .Similarly, we say that S is locally lower tail-scale invariant on Q if S is locally scale invariant on Q 1−wu for every u ≤ u θ .

Scaled weighted CRPS
For any kernel score we can use the construction in Eq. ( 6) for generalised kernel scores to construct a corresponding scaled version by using h(x) = − 1 2 log(x) as done for the SCRPS.By the following proposition, this can be done for the wCRPS.
This proposition was shown previously by Allen et al. (2022), and for completeness, a proof is given in Appendix C. Thus, by this proposition combined with the construction in Eq. ( 6), the scoring rule defined as swCRP S(P, y) is proper, and can be a suitable alternative for evaluating extremes.As for the SCRPS, an alternative definition of the swCRPS would be to use Eq. ( 8) on the wCRPS.This definition was proposed in (Vandeskog et al., 2022), and since the wCRPS is a kernel score the two definitions in fact coincide, up to an additive constant.
No matter which of the arguments we use to define the swCRPS, it is not guaranteed that it is locally scale invariant.We now derive to what extent the swCRPS is scale invariant.The following assumption will be used on the distributions we consider.
Assumption 3.6.The Borel probability measure Q on R has density exp(Ψ) with respect to Lebesgue measure for some twice differentiable function Ψ such that the expectations Under this assumption, we have the following result for the wCRPS and the swCRPS, which is proven in Appendix C. Proposition 3.7.Let Q be a set of location-scale transformed probability measures Q θ satisfying Assumption 3.6.Then the following holds: (i) The wCRPS with indicator weight function w u is neither locally scale invariant nor locally tail-scale invariant on Q.
(ii) The swCRPS as defined in Eq. ( 9) with indicator weight function w u is locally tail-scale invariant but not locally scale invariant on Q.
Other versions of the CRPS, the rCRPS and the rSCPRS as introduced in Bolin and Wallin (2023), are deinfed by inserting the kernel function into the definition kernel scores in Eq. ( 5) and into Eq.( 9).They showed that neither of these score were locally scale invariant.However, one can show their local weight-scale invariance.
Proposition 3.8.(i) The rCRPS is not locally weight-scale invariant with respect to the indicator weight function w u (y) = 1{|y| < u} for any u > 0.
(ii) The rSCRPS is locally weight-scale invariant with respect to w u for u ≤ c 2 .
The steps taken above by using the construction in Eq. ( 6) to create the swCRPS can be done on the more general threshold weighted kernel score, proposed by Allen et al. (2022), and defined as where ρ is a continuous negative-definite kernel and v(x) − v(x ′ ) = x ′ x w(t)dt is a measurable chaining function.This results in a scaled score However, we focus on the more specific swCRPS to keep the exposition reasonably short and leave further investigations of these scaled scores for future research.
As mentioned earlier, one main benefit of kernel scores is that the expected values can be computed with Monte Carlo methods without knowing the true distribution.However, under an assumed model distribution, the kernel scores can often be computed analytically.This is in particular the case for the Generalised Extreme Value (GEV) distribution, as shown in Appendix A.

Local weight-scale invariance of the censored likelihood score
It follows from the definition that a sum of two proper scores is also proper, more generally that a weighted sum of several proper scores is proper, and additionally that if at least one of the summands is strictly proper, then the sum is also strictly proper.This can be used to propose alternative scores which may be less likely to suffer the problems of overweighting heavy tails as discussed above for the quantile scores (see also, e.g., Lerch et al., 2017).
If one is interested in good prediction of values exceeding some threshold u, there are two complementary questions: a) how well does a method predict the sizes of the excesses which occur, and b) does the method give good prediction of the probability of exceedance.Here a) can be approached by only scoring the values of the excesses which actually occur, using the conditional distribution of these, f (x)/(1 − F (u)).Question b) instead is a binary prediction problem and one can use any of the many scoring rules for such problems.Further, it seems reasonable to use related scoring rules for a) and b).Diks et al. (2011) uses this idea to combine the logarithmic score for excesses where f and F are the density and cumulative distribution functions corresponding to P, with the binary logaritmic scoring rule for exceedance of the threshold u, to obtain the so-called censored likelihood score This is a proper scoring rule and is studied in simulations and application in the next two sections.We also have the following result, proven in Appendix C.
Proposition 3.9.Let Q be a set of location-scale transformed probability measures Q θ satisfying Assumption 3.6.Then the censored likelihood score in Eq. ( 13) is locally tail-scale invariant on Q, but in general not locally scale invariant.
The above type of weighted scoring rule was generalised in Holzmann and Klar (2017) as follows.First, let p w represent the re-normalised density of p with respect to weight function w, i.e.
If S 0 is a proper scoring rule, then owS(P, y; w) = w(y)S 0 (P w , y), where P w is the distribution with density p w , is a proper scoring rule.This is called an outcome weighted scoring rule in Allen et al. (2022).
Next, for a strictly proper scoring rule s(α, z) for the success probability α ∈ (0, 1) of a binary outcome variable z ∈ {0, 1}, the score is said to be a localising proper weighted scoring rule for the density forecast p. Finally, adding Eq. ( 15) and Eq. ( 16) yields a proper scoring rule Ŝ(P, y, w) = S s (P, y; w) + owS(P, y; w).
With S 0 and s as log scores, the score Ŝ as defined in Eq. ( 17) becomes a censored likelihood with a general weight function w instead of the weight function w u defined in Eq. ( 3).
Proposition 3.11.Suppose that S 0 and S 1 are locally weight-scale invariant with respect to w.Then (i) If w is an indicator weight function, then owS as defined in Eq. ( 15) is locally weight-scale invariant with respect to w.
(ii) S 0 + S 1 is locally weight-scale invariant with respect to w.

Simulation studies
This section contains two examples with simulated data, which illustrate the effect of scale dependence for extremes using the wCRPS and the swCRPS.

Benchmark example
We start by comparing the predictive ability of CRPS against SCRPS in terms of tail regimes.
A comparison using CRPS was previously done using a benchmark example in Taillardat et al. (2022).They considered the hierarchical model where 1 > ξ > 0, Gamma(α, β) denotes a gamma distribution with shape parameter α and rate parameter β, and Exp(δ) denotes an exponential distribution with scale parameter δ.Using this as the true model, four forecasting models are compared.The ideal model is an exponential model using observed values of Z.The extremist model instead underestimates the scale parameter of the exponential distribution, assuming it to be Z/ν, where ν > 1.The climatological model has no information about Z but instead uses the unconditional and distributionally equivalent Generalized Pareto distribution, GP(1, ξ), with scale parameter one and shape parameter ξ.
And finally, the τ -informed model is a mixture distribution between the ideal and climatological model distributions.The models are listed in Table 1.
The CRPS for the extreme forecast and the τ -informed forecast are and respectively, as shown by Taillardat et al. (2022).Here, Γ u (a, τ ) = ∞ τ t a−1 e −t dt denotes the upper incomplete gamma function.From the CRPS scores of Eq. ( 18) and ( 19), one can compute the SCRPS by first noting that and , and then using Eq. ( 7) to compute the SCRPS.We simulate 10 6 observations of the model, for two choices of ξ, ξ = 0.25 and ξ = 0.5.The higher ξ is, the lower is the variability in Z, and thus the scale of Y |Z.Comparing the ratio of the mean scores of the different models to the corresponding means for the ideal model for CRPS and SCRPS creates an ordering of the models, as seen in Table 2.Both scores order as expected within the τ -informed models and within the extremist models, with decreasing τ and increasing ν respectively.However, the ordering within the two models depends on the choice of scoring rule.Hence, by choosing the scoring rule, one makes an implicit choice regarding which aspects of the model that are important.For example, for ξ = 0.25, the climatological forecast is worse than the extremist forecast with ν = 1.8 according to the CRPS but better according to the SCRPS, probably due to how the CRPS penalises the forecasts with larger uncertainty more than the SCPRS does, while the SCRPS prefers the models with the same relative error over the climatological model.
The sensitivity and power of the respective scores are also of interest, for different values of ν.For a score S, the extreme model is compared against the ideal model on a simulated dataset with 1000 observations using a two-sided pairwise t-test.This is repeated 1000 times with different simulated datasets.The SCRPS better identifies the ideal model from the extremist model, as seen in Figure 1, with higher power when comparing the ideal and the extremist model.Moreover, the power of the SCRPS does not change for increased values of ξ while the power of the CRPS decreases.

Score dependence on scale and threshold
For an observation from X ∼ Q θ = GEV (µ = 0, σ, γ = 0.12), let us consider the effect of choosing a wrong model by computing the expected score difference S(Q θ , Q θ ) − S(P, Q θ ) where P = GEV (µ = 0, 2σ, γ = 0.12).Here the value γ = 0.12 of the shape parameter is chosen to mimic the behaviour of the extreme rainfall events which are studied in Section 5.The mean and standard deviation of the score difference is computed from 50.000 simulated observations of X.For the weighted scores, the thresholds q(p) are obtained as the p-th quantile of Q θ , q(p) = µ + σ γ ((− log p) −γ − 1) for p > 0, and q(0) represent the unweighted scores.Figure 2 shows that for the wCRPS, differences increase with increased σ while they remain stable for the swCRPS.The behaviour is similar for each choice of threshold.Variability increases with scale for the wCRPS but it remains constant for the swCRPS.Thus the local tail-scale invariance does not only affect the mean score but also the score variance.The fact that the mean and variance of scores remain the same for different scales simplifies understanding and use of empirical distributions of scores from predictions with different scales.It can also help scoring scaled locations simultaneously, as we will see in the following section.

Scaling effect on expected scores
As mentioned above, one often evaluates forecasts at multiple locations though average scores.If the individual scores are proper, their average is also a proper score.However, if the distributions at these locations have different scales, the use of a scale dependent scoring rule leads to an implicit ranking of the importance of the different locations, in the sense that some locations might be more important to predict well to get a high average score.
As an example, consider two observations following GEV distributions that differ only in scale, X i ∼ Q θ i = GEV (0, σ i , 0.12), i = 1, 2, with σ 1 = 1.5 and σ 2 = 3.Using the predictions P i = GEV (0, σ i , 0.12), where σ i is an estimated scale parameter, the paired observations are scored using the scoring rule S(P, y) = 1 2 (S 1 (P 1 , y 1 ) + S 2 (P 2 , y 2 )), where S 1 and S 2 are either wCRPS scores, or else swCRPS scores with indicator weight functions w q 1 and w q 2 for the two observations.Here q 1 and q 2 are chosen as the 90th quantiles of Q θ 1 and Q θ 2 respectively.The expected score was computed as the mean of 100.000 simulations.
Figure 3 shows that the expected score S(P, Q) with σ i = k i σ i for wCRPS is more sensitive to changes in k 2 than in k 1 , meaning that it is more important to have a good prediction for X 2 .For the swCRPS, the scores are on the other hand quite symmetrical in k 1 , k 2 , so the predictions P 1 , P 2 are scored more equally.Figure 3 further shows that the swCRPS is close to being locally scale invariant while the wCRPS is not.Finally, note that the symmetry in k 1 , k 2 for all considered values indicates that the score difference does not depend on the scale even for large model misspecifications, which is not guaranteed to hold by the definition of local scale invariance.This is not uncommon, and for example also holds for Gaussian distributions as can be seen in Bolin and Wallin (2023, Figure 1).Score threshold q(0) q(0.5) q(0.6) q(0.7) q(0.8) q(0.9) q(0.95) Figure 2: Simulated mean (top) and standard deviation (bottom) of the score difference S(Q θ , Q θ ) − S(P, Q θ ) using wCRPS (left) and swCRPS (right) for two predictions of a random variable X ∼ Q θ = GEV (µ = 0, σ, γ = 0.12), with P = GEV (µ = 0, 2σ, γ = 0.12), as functions of the scale parameter σ for different thresholds q(p), chosen as the p-th quantile from Q. Threshold q(−∞) results in the unweighted scores, CRPS and SCRPS.

Case studies
This section uses data on water levels, precipitation, and air pollution to compare scoring rules in practice.The models used were fitted by maximising the log likelihood in the first two case studies and through the INLA approach (Rue et al., 2009) in the final study.

Extreme water levels
In this section, we consider data containing annual maximal water levels at five representative stations in the Great Lakes; Lake St. Clair, Lake Michigan-Huron, Lake Ontario, Lake Superior, and Lake Erie.The dataset is provided by NOAA and dates from 1918 to 2020.With different sizes and depths of the lakes, the data differs in scale, making it interesting to see the effect of misspecifications when using wCRPS and swCRPS.First, a stationary GEV model was fitted to the five representative stations, assuming that the behaviour of the lake-wide average water levels has not changed during the observed time series.The estimated parameters are listed in Table 3.The shape parameter at all locations is Figure 3: Simulated expected score S(P, Q) using wCRPS and swCRPS for a pair (X 1 , X 2 ) of random variables with X i ∼ GEV (µ i , σ i , γ), as functions of k i , i = 1, 2 using a prediction that has the correct location parameters, µ i , and scale parameters σ i = k i σ i , i = 1, 2. For the true model, µ 1 = µ 2 = 0, γ = 0.12, σ 1 = 1.5 and σ 2 = 3.The weight function was chosen as the 0.90 quantile for each score.
. negative, suggesting that the annual maxima follow a Weibull distribution.The density functions of the estimated GEV distributions are shown in Figure 10 in the appendix.Next, the P GEV λ model, with a trend in λ and temperature as covariate, was fitted to each of the stations.For further information on the PGEV model, see Olafsdottir et al. (2021).The temperature used was a lowess smoothing of the yearly average Northern Hemispheric temperature, obtained from NOAA (NOAA National Centers for Environmental Information, 2019).The fit suggested that Lake Michigan-Huron and Lake Superior did not have a trend in the expected number of extreme water levels, λ, but trends existed in the other three lakes, Lake St. Clair, Lake Ontario and Lake Erie, using significance level α = 0.05.From the Lake System profile, the stationarity might be explained by flow from both Lake Michigan-Huron and Lake Superior to the other lakes.The number of dams in the great lakes exceeds 7.000 so the effect of them on the data is hard to visualise.
In (NOAA, Great Lakes Environmental Research Laboratory, 2015) it is noted that "since September 2014, all of the Great Lakes have been above their monthly average levels for the first time since the late 1990s".However, they come to the same conclusion, that probably the Lake Superior and Lake Michigan-Huron will remain stationary around the mean levels while there might be non-stationarity in the remaining three lakes.
The above parameter estimates are used through simulation to compare the effect of param-eter misspecification on the mean and standard deviation of the score differences.Further, since the stationary stations, Lake Superior and Lake Michigan-Huron are also the ones that differ most in scale, these are used in the simulation study to compare how often one misspecified model is preferred over another equally misspecified model using different scores.

Simulation
Using the estimated parameters from Table 3, we simulate observations from the five different stations.For each station, 1000 independent stationary time series of length 100 are simulated.These time series have been scored for two models.The first model uses the true parameters µ, σ, γ from the simulation and the second model uses the true location and shape parameters µ and γ, but perturbs the scale parameter by a factor k = 1.5.The scale parameter of the second model therefore becomes 1.5σ.We see that for the parameters from station 4, Lake Superior, both the mean score difference and its variation is smaller than for the other stations when using wCRPS (Figure 4).This difference between stations disappears when instead using swCRPS on the same data.For a given perturbation factor k, the score difference at station i is defined as ∆ i := S σ i −S kσ i where S σ is the score using estimated scale parameter σ, and σ i is the true scale parameter at station i (i.e., the parameter that was used when simulating the data).The factor k describes the estimated scale parameter's error relative to the true scale parameter.For scale invariant scores, the expected difference will only depend on the perturbation factor and not on the scale itself.
We want to compare score A against score B, i.e., S A := S σ2 + S σ4 against S B := S σ2 + S σ4 .However, S A −S B = ∆ 2 −∆ 4 so it suffices to compare the differences ∆ i at the stations.For this, 1000 time series of length 100 were simulated using the estimated parameters of Lake Superior and Lake Michigan-Huron.The perturbation factor was fixed at k = 1.5.Figure 5 shows how often model A was preferred over model B using different scores.Since the models both have one correct scale parameter, and the same proportional error in the other parameter, the models A and B should be chosen with equal probability if not influenced by scale.This corresponds to choosing A over B 50% of the time.Using the wCRPS gives a high bias toward model A while the swCRPS yields a more fair comparison of models A and B, as expected.The deviation from the 0.5 line for large thresholds for the swCRPS is due to the difference in shape parameters at the simulated stations.If we assume that the shape parameters are the same for the two stations, this deviation disappears.

Extreme rainfall events and climate change
In this section we use the scoring rules introduced above to compare five models for extreme rainfall in the Northeastern United States.The dataset is a part of NOAA Atlas 14 Volume 10, part 3 (NOAA's National Weather Service, 2019) and contains the annual maximum rain from 685 stations in the Northeast with time series ranging from approximately 1900 to 2014.Score threshold q(0) q(0.5) q(0.6) q(0.7) q(0.8) q(0.9) q(0.95) Figure 4: Mean and standard deviations of score differences, ∆ i , at stations i ∈ {1, 2, 3, 4, 5} with k = 1.5, using different types of CRPS scores on simulated data using the estimated parameters from the stations listed in Table 3 Only stations with at least 60 years of data are used.We call this data set the NE-data.The estimated scale parameters differ between stations (Figure 6), meaning that the stations with higher scale parameters will be given greater weight if a scale dependent scoring rule such as CRPS and wCRPS is used.

Model and inference details
The four models for annual maxima rainfall compared are (i) Gumbel: a GEV distribution with shape parameter zero, (ii) GEV: a GEV distribution with shape parameter different from zero, (iii) GEV µ : a GEV distribution with trend in the location parameter, i.e. µ(t) = µ 0 +µ 1 t, and (iv) PGEV λ : a PGEV distribution with trend in the frequency parameter, i.e. ln λ(t) = λ 0 +λ 1 t.For both GEV µ and PGEV λ , the covariate t is a lowess smoothing of the yearly average Northern Hemispheric temperature, obtained from NOAA (NOAA National Centers for Environmental Information, 2019).The parameters of the GEV and the PGEV models are estimated by maximising the loglikelihood.Both models assume a single regional shape parameter and station-wise location and scale parameters.The estimated scale parameter varies over space, motivating the use of locally (tail-) scale invariant scores.The thresholds for the weighted scores are determined empirically from the time series as the p-th quantile.
Besides overall mean scores, we also consider the distribution of scores.For station i, with N i observations y = (y 1 , ..., y N i ), the mean score is where P ij is a predictive distribution for observation j at station i.The station-wise score difference between two predictions P and Q, at station i, is denoted ∆ i (P, Q) = S i (P, y) − S i (Q, y).A positive difference means that prediction P scored better, and a negative difference means that prediction Q scored better.The standard way to evaluate if P is a better model than Q for time series is to perform a Diebold-Mariano test (Diebold and Mariano, 1995) on difference of this kind, that takes into account correlation in the data.However, for the station-wise differences, the spatial correlation is low, and we therefore perform pairwise t-test of equal predictive performance to compare two models.
The variances of the station-wise average scores are affected by the varying scale parameters in the data, but also by the lengths of the time series, which are slightly different for the different stations.However, since only time series of length greater than 60 years are used, the latter difference is small.

Results
As shown in Table 4, the PGEV λ model gives the highest overall mean scores, while the Gumbel model has the lowest score, for all statistically significant scores.Figure 7 shows p-values from two-sided t-tests, where one can note that all scoring rules reject the null hypothesis except for very high thresholds, where only the LS q score rejects.
Instead of only considering average scores, one can, as suggested by Taillardat et al. (2022), also check for the existence of trends by permuting covariates (in this paper temperature) and computing prediction scores for the model which uses the permuted covariates instead of the ordered ones.In the absence of trends, these predictions should be similar to the ones which Table 4: Mean scores for four different models for NE-data, with means obtained as the mean of the mean scores at the individual stations.The scores where t-test showed significance between the PGEV λ and GEV µ are shown in bold.

LS
CRPS SCRPS LS q wCRPS swCRPS 90% 99% 90% 99% 90% 99% Gumbel -1.148 -0.483 -1.048 -0.427 -0.0903 -0.0865 -0.010178 -0.1556 -1.531 GEV -1.129 -0.481 -0.966 -0.419 -0.0856 -0.0860 -0.010201 -0.0795 0.893 GEV µ -1.108 -0.473 -0.958 -0.415 -0.0850 -0.0857 -0.010199 -0.0644 0.906 PGEV λ -1.107 -0.472 -0.956 -0.414 -0.0847 -0.0854 -0.010195 -0.0603 0.901 use the ordered covariates.By plotting the ordered average station scores from the original station data D 1 = (t i , y i ) n i=1 against the ordered average scores obtained by using the permuted covariates D 2 = (t π(i) , y i ) n i=1 , one can see if the scores from D 1 are larger, which would point at the existence of a trend, or if they distribute evenly around the 45 degree line, in which case trend is unlikely.For the PGEV model with trend in frequency, Figure 8 suggests the existence of a trend regardless of which scoring rule is used.Here, the same conclusion is reached regardless of score chosen.In particular, the rankings of the models do not change if we focus only on high values.Whether this is the case is application dependent, and Section 5.3 shows an example where the conclusions differ when only focusing on high values.

Particle matter concentration
So far, the examples have been focused on extremes, due to the natural interest in predictions above a certain threshold.However, there are many cases where there is a reason to focus on specific regions which are not necessarily extreme.For example, high concentration of particle matter of size less than 10 nm in diameter (PM10) have negative health impact.According to WHO (Geneva: World Health Association, 2021), the daily PM10 should not exceed 45 µg/m 3 .Moreover, based on studies on the connection between long-term PM10 and non-accidental mortality, they found that the long-term level of PM10 should not exceed 15 µg/m 3 .This is yet to become a recommendation, but it might be of interest to evaluate how well models can predict PM10, focusing on values above 15 µg/m 3 .The assumption is that PM10 concentrations below 15 µg/m 3 have less health implications, and thus the prediction of lower values not as important for the model evaluation.As an illustration, we focus on concentrations above three distinct thresholds; 15 µg/m 3 , 30 µg/m 3 , and 45 µg/m 3 when comparing predictions of PM10 concentrations in the Piemonte region in Italy.We use the data from Cameletti et al. ( 2013 Here, W m are covariates, β m regression coefficients, and e i are independent centred Gaussian Table 5: Mean scores from one-step ahead cross-validation used to compare PM10 predictions in the Piemonte data for separable and non-separable space-time models as in Krainski (2023).The higher score is represented in bold.The thresholds used for the weighted scores are 15 µg/m 3 , 30 µg/m 3 , and 45 µg/m 3 .variables representing measurement noise.Further, u(s, t) is a spatio-temporal Gaussian random field with a separable covariance function.

CRPS
Since separability is rarely a realistic assumption for spatio-temporal statistical models, we compare the predictions based on this model with those of a model where u is replaced by a Gaussian random field with a non-separable covariance function.Specifically, we use the "critical diffusion" model model of Lindgren et al. (2023), which is a Gaussian random field defined as a solution to a stochastic partial-differential equation (SPDE).Both models are fitted to the data using the code provided in (Krainski, 2023).
To compare the models in terms of predictive performance, a one step ahead cross-validation was preformed by extending the cross-validation function of the R package rSPDE (Bolin and Simas, 2023) to include swCRPS and wCRPS, with the test set at fold i defined as all observations at time t i+1 , and the test set as all observations up to that time, for i = 1, ..., 181.The predictions are compared in terms of CRPS, SCRPS, wCRPS and swCRPS.The results of the best scores are shown in Table 5.In line with the results found in (Lindgren et al., 2023), the CRPS and SCRPS indicate that the non-separable model has better predictive performance than the separable model.However, for the weighted scores, the wCRPS indicates a better performance of the non-separable model, while the swCRPS indicates a better performance of the separable model.Thus, this application is an example of where the choice of scoring rule affects the rankings, and we in particular see that the swCRPS indicates that the separable model might be preferable when focusing on harmful levels of PM10.and S opt n are the scores for a reference method and the hypothetical optimal score, respectively.However, these scores are often improper, even when they are based on a proper scoring rule (Gneiting and Raftery, 2007).Bolin and Wallin (2023) addressed this issue by introducing the SCRPS, a proper scoring rule which retains most of the desirable properties of the CRPS but in addition also is locally scale invariant.We extend this construction to the weighted CRPS to obtain the swCRPS and investigate the properties of this score.We show that neither the wCRPS nor the swCRPS are scale invariant, but that the swCRPS is locally tail-scale invariant, meaning that conditioned on exceeding the threshold used in the weighting, the score is locally scale invariant.

Conclusion and discussion
Through simulation studies we showed that for a number of extreme value models swCRPS works in a similar way as the SCRPS does for full data sets.In particular, the mean and variability of score differences remain similar when scale parameters are changed, and average scores over random variables with different scales keep the relative error in scale parameter estimation approximately equal.These properties can be important, for example when evaluating weather models that operate on different scales at different spatial locations, such as temperature, water levels, or rainfall amounts at different measuring stations.
As an example, we used different scores to evaluate five models on annual maximum rainfall data in the Northeastern USA.The comparison led to similar conclusion for all scores, with the best tested model being a PGEV model with a temperature-wise trend in the frequency parameter, meaning that for this case study we come to the same conclusion when only considering the tail of the distribution.This was not the case for the particle matter application, where focusing on high values changed the model ranking.
It should be noted that scale dependence can sometimes be preferred, as mentioned in (Bolin and Wallin, 2023).In such situations, the wCRPS would likely be a better choice than the swCRPS.However, we believe that such situations are rare for typical applications in extremes.Moreover, the LS q seems to have the best power, as expected, but this score might not be possible to compute for more complicated models.In such cases the swCRPS can be a good alternative.
Issues regarding tail weighted scores have been raised as the forecaster's dilemma (Lerch et al., 2017).The lack of information from observations below a certain threshold can unintentionally draw the score towards the wrong model based on the weight of the models' tail behaviour.This has been addressed by also including the unweighted CRPS by using a weight function of the form a + bw u (x) (Thorarinsdottir and Schuhen, 2018).Another solution might be to simply not choose a "too high" threshold u.
Weighted scores are not solely relevant for extremes but can be applied whenever predictions within a specific region are of higher importance, as seen for the PM10 concentration model, where a Gaussian spatio-temporal model was used to predict the concentration of PM10 in the Piemonte region of Italy (Cameletti et al., 2013;Lindgren et al., 2023).Moreover, the regions of interest can be defined through other weight functions than the indicator weight function w u (y) = 1{y ≥ u}.The notation of local weight-scale invariance extends the concept of local tail-scale invariance to accommodate for local scale invariance within these types of regions.

A Extreme value models and scoring rules
When considering block maxima data, such as annual maxima of daily precipitation data, classical extreme value theory (see e.g.Coles, 2001) shows that the Generalised Extreme Value (GEV) distribution is a suitable model for the data.Different variations of this approach exist, as shown below.

A.1 Extreme value theory models
The GEV distribution has location, scale and shape parameters µ, σ > 0, and γ respectively, and distribution function In this model, one can introduce trends in the parameters, for example to model climate change.Further, one can use a reparameterisation, PGEV, of the GEV distribution in terms of parameters λ, σ u , γ, u, with Here the parameter λ describes the frequency of exceedances of a high level u, and σ P GEV is a scale parameter of the distribution of sizes of excesses of u.A reason for this reparameterisation is that it gives a clearer physical understanding of the behaviour of extreme events.The distribution function of the PGEV can be written as .
For details and more information, see Olafsdottir et al. (2021).For models without trends in parameters, the PGEV distribution is the same as the GEV distribution.However, the trends in λ for the PGEV model and in µ for the GEV model behave differently.When P is the GEV or PGEV distribution, and γ < 1, the CRPS has a closed form expression (see e.g.Friederichs and Thorarinsdottir, 2012), tdt is the exponential integral, and C is the Euler-Masceroni constant.For γ ≥ 1, the CRPS does not exist.In Appendix B we derive the corresponding closed-form expressions for the wCRPS.
Through the following reformulation of Eq. ( 9), swCRP S(P, y) = wCRP S(P, y) we can also evaluate swCRP S(P, y) for the GEV distribution through the closed form expression for the wCRPS in combination with the expression See Appendix B for the derivation of these expressions.

B Closed form expressions for GEV scores
Let P be a distribution with CDF F and PDF f , and let w q be the indicator weight function as described in Eq. ( 3).Then Using the formulation in Eq. (2), we have The expressions in Eq. ( 21) and ( 22) can be rewritten in terms of quantiles of the distribution F , since and Let us now use these results to evaluate the scoring rules for the GEV distribution.We first consider the case γ = 0, corresponding to the Gumbel distribution, and then the case γ ̸ = 0.

C Proofs
Proof of Proposition 3.5.Let g w (x, y) := | x y w(t)dt|, where w(x) is a non-negative function.Then g w : Ω × Ω → R is a symmetric real-valued function.Moreover, n i=1 n j=1 a i a j g w (x i , x j ) ≤ 0 for all n, all a 1 , ..., a n ∈ R s.t.n i=1 a i = 0 and all x 1 , ..., x n ∈ Ω, since g(x, y) = |x − y| is negative definite.Hence, the kernel g w is negative definite and the weighted CRPS, wCRP S(P, y) = 1 2 is indeed a kernel score.
Proof of Proposition 3.7.Note that for any Q for which Assumption 3.6 holds, the assumption also holds for Q wu , where u ∈ supp(Q), and w u is the indicator weight function.
We start proving (i).Assume that Q θ is a probability measure satisfying Assumption 3.6.
)] denote the wCRPS where q θ is the density of Q θ .By Tailor expansion around θ, we have that From Assumption 3.6, s(θ) = ∇ 2 θ S(Q θ , Q θ )| Q=Q θ exists and is continuous and we have We can now follow the steps in the proof of Lemma 1 in Bolin and Wallin (2023), where we replace their kernel g c (x, y) with our weighted kernel g w (x, y), since the only thing required in the proof is that the kernel is a positive negative-definite kernel and that g w (x, y) ≤ g(x, y), both of which hold whenever |w(t)| ≤ 1 for all t ∈ R.This holds certainly for our indicator weight function w q but even for other choices of w.This results in and where v(X) is a vector and H(X, Y ) is a 2 × 2 matrix, both independent of θ, and For w = w q , we have Since we cannot let w depend on θ = (µ, σ), we cannot choose any w such that g w (σx + µ, σy + µ) = σg w (x, y).
(28) However, letting k = q−µ σ , we have that g w (σx + µ, σy + µ) = σ|x − y| for all x, y ≥ k.Then the wCRPS has scale function where H Q (X, Y ) is a 2 × 2 matrix independent of θ and h(X, Y ) is a function dependent on θ.Furthermore, there exists a constant k < ∞, such that h(X, Y ) is independent of θ for all X, Y ≥ k.From this scale function we directly see that the wCRPS is neither locally scale invariant nor locally tail-scale invariant.
We now prove (ii).Let S be the swCRPS defined in Eq. ( 9).As in the proof of statement (i), we perform the Taylor expansion shown in Eq. ( 24).Since S is proper, ∇ θ S(Q θ , Q θ ) = 0 and we only need to consider the term ∇ 2 θ S(Q θ , Q)| Q=Q θ that exists and is continuous.For simplified notation, let Inserting ( 25) and (26) instead of E Ṗ ,P and E Ṗ , Ṗ into the equation above results in and for u ≥ k, i.e. s(Q wu θ ) = s(Q wu (0,1) ) for u ≥ k, showing that the swCRPS is locally tail-scale invariant.
Proof of Proposition 3.8.(i) According to Bolin and Wallin (2023), the scale function of rCRPS with kernel function where . Therefore, rCRPS is not locally weight-scale invariant with respect to w u (x) = 1{|x| < u} (ii) Following the proof of Proposition 3.7, using the fact of Eq. ( 31) instead of Eq. ( 27), yields that the rSCRPS is locally weight-scale invariant for weight function w(x) = 1{|x| < c/2}.
Proof of Proposition 3.9.For u ≥ q, the censored likelihood score, LS q , and the logarithmic score, LS, have the same conditional expectation, since The logarithmic score is locally scale invariant (Bolin and Wallin, 2023), and thus also locally tail-scale invariant according to Remark 3.2.Therefore, the conditional likelihood is also locally tail-scale invariant.
Proof of Proposition 3.10.First note that CLog w (P, y) = w(y) log   3, after shifting the x-axis by the estimated mode of each station.
(ii) Consider the score S = S 1 + S 2 .The weight-scale function of S is the sum of the weightscale functions of S 1 and S 2 .If S 1 and S 2 locally weight-scale invariant with respect to w, their weight-scale functions fulfil the requirements for local scale invariance, and hence, the weight-scale function of S does too.Therefore, S is locally weight-scale invariant.

D Estimated flood densities
The annual maximum water levels were modelled with a stationary GEV distribution and the estimated parameters shown in Table 3. Stations Michigan Huron (station 2) and Lake Superior (station 4) were used for simulations since those stations suggested stationarity and since the scale parameter of Lake Michigan Huron was almost twice the scale parameter of Lake Superior.All stations had negative shape parameter.In Figure 10 the estimated density function is compared at all five stations, where the x-axis has been shifted by each respective mode for readability.

xx
′ w(t)dt , provided the expectations above are finite.If one is interested only in the upper-tail behaviour, one might choose the indicator weight function

Figure 5 :
Figure5: Proportion of times model A was preferred over model B when simulating time series of length 100 when using wCRPS and swCRPS, with individual shape parameter (left) and joint shape parameter (right).The score threshold is chosen as the p% quantile.

Figure 6 :
Figure 6: Estimated scale parameter from a P GEV model with trend in the frequency parameter λ for stations in the Northeastern U.S.

Figure 7 :
Figure7: P-values from t-test station-wise score differences ∆ i (P, Q) for predictions P and Q, where A) P = GEV , Q = GEV γ=0 , B) P = P GEV λ , Q = GEV γ=0 , C) P = P GEV λ , Q = GEV and D) P = P GEV λ , Q = GEV µ .The 0.05 level is marked with a red dashed line.Note that the y-scales are different in the different plots.
), containing observations for 182 days from October 2005 to March 2006 at 24 locations.The data is illustrated in Figure 9. Denoting the n = 4368 observations of log-PM10 by y i , i = 1, . . ., n, Cameletti et al. (2013) proposed using a model of the form

Figure 8 :
Figure 8: Sorted average station scores for a PGEV model with trend in λ plotted for original dataset D 1 against permuted dataset D 2 .The colour represents if the points fall above (green) or below (orange) the diagonal.

Figure 9 :
Figure 9: The position of the 24 stations in the Piemonte region along with observations of PM10 concentration on March 13th 2006 (right), and a histogram showing the observations from the 24 stations at the 182 different dates (left).

Figure 10 :
Figure10: Density of estimated GEV distributions for flood stations as described in Table3, after shifting the x-axis by the estimated mode of each station.

Table 1 :
Forecasting models for benchmarking, taken fromTaillardat et al. (2022).The true model is Y |Z

Table 2 :
Ratio of the mean CPRS and mean SCRPS with respect to the corresponding means for the ideal forecast for two different choices of ξ.The mean was taken over 10 6 independently simulated observations.

Table 3 :
Parameter estimates (standard deviations) using stationary GEV distribution.
SCRPS wCRPS swCRPS15 µg/m 3 30 µg/m 3 45 µg/m 3 15 µg/m 3 30 µg/m 3 45 µg/m 3 Desirable properties such as straightforward computation of scores by Monte Carlo approximation have made the CRPS a popular alternative to the logarithmic score.However, using the CRPS requires sacrificing the local scale invariance of the logarithmic score.This can cause different observations, say at different spatial location, not to be equally important for average scores.To account for this, a common choice in the literature is to use so called skill scores (S n − S ref and EP ,Q = ∇ 2 θ E P θ ,Q [g(X, Y )].One can show that Ṗ .