Evaluating probabilistic forecasts of extremes using continuous ranked probability score distributions

Verifying probabilistic forecasts for extreme events is a highly active research area because popular media and public opinions are naturally focused on extreme events, and biased conclusions are readily made. In this context, classical verification methods tailored for extreme events, such as thresholded and weighted scoring rules, have undesirable properties that cannot be mitigated, and the well-known continuous ranked probability score (CRPS) is no exception. In this paper, we define a formal framework for assessing the behavior of forecast evaluation procedures with respect to extreme events, which we use to demonstrate that assessment based on the expectation of a proper score is not suitable for extremes. Alternatively, we propose studying the properties of the CRPS as a random variable by using extreme value theory to address extreme event verification. An index is introduced to compare calibrated forecasts, which summarizes the ability of probabilistic forecasts for predicting extremes. The strengths and limitations of this method are discussed using both theoretical arguments and simulations.


1. Introduction
By definition, the rarity of extreme events makes it difficult to issue relevant forecasts, and assessing their performance is an even greater challenge. In particular, the scarcity of extremes imposes that verification schemes be built and understood in a probabilistic sense. The general framework for probabilistic forecast evaluation compares an observation y with a probabilistic forecast F, represented by its cumulative distribution function (cdf). The framework also assumes that y is drawn from a random variable Y with cdf G. For a better utilization of the forecasts, it is generally convenient, and even recommended (Ferro and Stephenson, 2011), to further assume that the forecast F is calibrated (Dawid, 1984; Diebold et al., 1997), i.e., that the predictive distribution resembles the distribution of the observations given the information contained in the forecast. For a formal definition of autocalibration (simply called calibration in the following), we refer to the works of Tsyplakov (2011) and Strähl and Ziegel (2017) summarized in Appendix A.
Calibrated forecasts are commonly evaluated based on their sharpness, also called refinement by Winkler et al. (1996), which usually refers to their spread. This leads to the paradigm of 'maximizing sharpness subject to calibration', formally justified by Tsyplakov (2011).
Probabilistic forecasting has become increasingly popular in recent years in various fields such as economics and finance (Galbraith and Norden, 2012), demography and social science (Raftery and Ševčíková, 2021), health sciences, energy (Hong et al., 2016), and hydrology and hydraulics (Tiberi-Wadier et al., 2021). In this work, we focus on probabilistic weather forecasts (Leutbecher and Palmer, 2008). Indeed, probabilistic forecasts are nowadays issued by most National Weather Services (NWS), and F is known through a finite-size sample called an "ensemble" (see, e.g., Zamo and Naveau, 2017). In this context, forecast verification is performed by computing scoring rules such as the Continuous Ranked Probability Score (CRPS) (Epstein, 1969; Hersbach, 2000; Bröcker, 2012),

CRPS(F, y) = E|X − y| − (1/2) E|X − X′|,    (1)

where y ∈ R, and X and X′ are independent random variables with common cdf F. The CRPS is attractive as it does not require predictive densities, is inferred non-parametrically, and has a simple interpretation. The right hand side of Equation (1) decomposes the CRPS into, in this order, a calibration and a sharpness term. Alternative decompositions are also available; see Taillardat et al. (2016); Bessac and Naveau (2021) and Appendix B.
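In practice, since F is only known through an ensemble, the CRPS is estimated from the members. A minimal sketch of such an estimator, using the energy form of the CRPS with the so-called fair (unbiased) spread term; the function name and ensemble are illustrative, not from the paper:

```python
import numpy as np

def crps_ensemble(ensemble, y):
    """Empirical CRPS of an ensemble forecast at observation y, using the
    energy form CRPS(F, y) = E|X - y| - 0.5 E|X - X'|.
    The spread term divides by m(m-1), the 'fair' unbiased estimator."""
    x = np.asarray(ensemble, dtype=float)
    m = x.size
    accuracy = np.mean(np.abs(x - y))                      # E|X - y|
    spread = np.abs(x[:, None] - x[None, :]).sum() / (m * (m - 1))  # E|X - X'|
    return accuracy - 0.5 * spread

rng = np.random.default_rng(0)
ens = rng.normal(0.0, 1.0, size=500)   # a 500-member Gaussian ensemble
print(crps_ensemble(ens, 0.3))
```

A degenerate ensemble concentrated on the observation yields a CRPS of zero, while an ensemble concentrated away from it yields the absolute error, as expected from Equation (1).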
For the forecast evaluation of extreme events, proper weighted scoring rules were introduced by Gneiting and Ranjan (2011) and Diks et al. (2011). For a non-negative weight function w(x), the weighted CRPS,

CRPS_w(F, y) = ∫ w(x) (F(x) − 1{y ≤ x})² dx,    (2)

with W(x) = ∫_{−∞}^{x} w(t) dt, aims to emphasize a region of interest, for instance distributional tails. When w is continuous, an alternative expression of the weighted CRPS is available and can be found in Appendix B. The choice of the weight function w(x) is complex and depends on the different stakeholders, such as forecast users and forecasters; see, e.g., Ehm et al. (2016); Gneiting and Ranjan (2011); Patton (2014); Smith et al. (2015); Taillardat (2021b). Even in the hypothetical case where w(x) could be objectively defined, the verification process must be carried out on the whole set of observations (Lerch et al., 2017), and one can wonder whether the corresponding weighted CRPS correctly discriminates between two competing forecasts with respect to extreme events.
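For the common threshold weight w(x) = 1{x > t}, the antiderivative is W(x) = max(x − t, 0), and the weighted CRPS of an ensemble can be estimated through the kernel-type representation detailed in Appendix B. A sketch under that assumption (the function name is illustrative):

```python
import numpy as np

def tw_crps_ensemble(ensemble, y, threshold):
    """Threshold-weighted CRPS with w(x) = 1{x > threshold}, estimated from
    an ensemble by applying the chaining function W(x) = max(x - threshold, 0)
    to members and observation, then computing the usual energy-form CRPS."""
    w = lambda z: np.maximum(np.asarray(z, dtype=float) - threshold, 0.0)
    xw, yw = w(ensemble), w(y)
    m = xw.size
    accuracy = np.mean(np.abs(xw - yw))
    spread = np.abs(xw[:, None] - xw[None, :]).sum() / (m * (m - 1))
    return accuracy - 0.5 * spread
```

With a threshold below all members and observations, this reduces to the unweighted CRPS; with a threshold above all of them, the score is identically zero, which already hints at why tail-focused weighting discards most of the sample.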
In this work, we show that the expected weighted CRPS cannot discriminate forecasts with different extremal tail behaviors, a potentially disqualifying defect for extremal evaluation. To address this issue, we view the CRPS as a random variable. Its tail behavior is derived and compared to the tail regime of the observations using Extreme Value Theory (EVT) (see, e.g., De Haan and Ferreira, 2007).
This work is organized as follows: Section 2 provides an analysis of the weighted CRPS with respect to the notion of tail equivalence, the main backbone of EVT. In particular, we propose a benchmark to compare the tail properties of forecast verification tools, allowing us to pinpoint the shortcomings of the CRPS and its weighted counterpart for scoring extreme events. In Section 3, we study the CRPS as a random variable and make theoretical links between its tail behavior and the observational tail distribution. These mathematical connections help us propose and study a new index to assess the skill of calibrated probabilistic forecasts with respect to extreme events. The paths and pitfalls of this index and potential future works are discussed in Section 4.

2. Limitations of the (w)CRPS as a proper scoring rule for extremes

2.1. Tail modelling using EVT

Thanks to the pioneering work of Gumbel (1935) and De Haan (1970), EVT provides a theoretically justified framework to model the tail of random variables, more precisely excesses above a large threshold; see, e.g., Embrechts et al. (1997); Beirlant et al. (2004). For any random variable X with cdf F, EVT models assume the existence of a domain of attraction, i.e., that there exists a positive auxiliary function b such that

lim_{u → x_F} F̄(u + b(u) x) / F̄(u) = H̄(x),    (3)

where F̄ = 1 − F corresponds to the survival, also called tail, function, and x_F = sup{x : F(x) < 1} is the upper endpoint of F. Under condition (3), noted F ∈ D(H), the Pickands-Balkema-de Haan theorem (De Haan, 1970; Pickands, 1975) establishes that H̄ has to belong to the family of generalized Pareto (GP) survival functions, i.e.,

H̄_γ(x) = (1 + γx)^{−1/γ},

where x ∈ {x : 1 + γx > 0}. As a consequence, the GP tail appears to be the ideal candidate to approximate the survival function of exceedances over a large threshold u > 0, i.e.,

F̄(u + x) / F̄(u) ≈ H̄_γ(x/σ) = (1 + γx/σ)^{−1/γ},

where x ∈ {x : 1 + γx/σ > 0} and σ > 0. The GP family covers the three possible regimes of tail decay, determined by the value of the tail index γ: when γ > 0 the decay is polynomial, and the support has a finite upper bound when γ < 0. For γ = 0, the GP survival function becomes exponential, i.e., H̄_0(z) = e^{−z/σ}.
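The three decay regimes of the GP family can be sketched numerically; a minimal implementation of the GP survival function (the function name is illustrative):

```python
import numpy as np

def gp_survival(x, gamma, sigma=1.0):
    """Generalized Pareto survival function:
    (1 + gamma * x / sigma)^(-1/gamma) for gamma != 0,
    exp(-x / sigma) for gamma = 0,
    set to 0 outside {x : 1 + gamma * x / sigma > 0} (finite endpoint, gamma < 0)."""
    x = np.asarray(x, dtype=float)
    if gamma == 0.0:
        return np.exp(-x / sigma)                     # exponential decay
    base = np.maximum(1.0 + gamma * x / sigma, 0.0)   # clip at the upper endpoint
    return np.power(base, -1.0 / gamma)               # polynomial decay if gamma > 0

# gamma > 0: heavy (polynomial) tail; gamma = 0: exponential;
# gamma < 0: bounded support with endpoint sigma/|gamma|
for g in (0.5, 0.0, -0.5):
    print(g, gp_survival(1.5, g))
```

Plotting these three curves on a log scale makes the contrast explicit: the heavy tail decays polynomially, the exponential tail decays linearly in log scale, and the bounded tail hits zero at a finite endpoint.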

2.2. Tail equivalence and proper scoring rules
The comparison of the tail behavior of two random variables, or equivalently of their respective cdfs F and G, can be framed using the notion of tail equivalence.

Definition 1. (Embrechts et al., 1997, Section 3.3) Two random variables X and Y with respective cdfs F and G are tail equivalent if they have equal upper endpoints x_F = x_G = x* and if their survival functions F̄ and Ḡ satisfy F̄(x)/Ḡ(x) → c as x → x*, for some constant c ∈ (0, ∞).

Tail equivalence can also be simply expressed as the equality of tail indexes. In terms of extremal forecasting, we expect that, between two forecasters, one should favor the one that is tail equivalent to the observations. In practice, this may be difficult. For instance, consider two GP distributed random variables X_1 and X_2 with survival functions H̄_1(x) and H̄_{1+ε}(x/σ), where σ = (1 + ε)/(2^{1+ε} − 1). By construction, the medians of X_1 and X_2 are both equal to one. Still, their tail behaviors differ widely even for small ε: the 100-year return level for X_1 is 99, while it is equal to 138 for X_2 with ε = 0.1. In other words, if these random variables were to represent water levels, a small difference of 0.1 in tail index would imply a difference of 39 meters, which would most likely cause massive and destructive flooding.
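The numbers in this example can be checked directly from the GP quantile function; a short verification sketch (variable names are illustrative):

```python
import numpy as np

def gp_quantile(p, gamma, sigma=1.0):
    """Quantile of the GP distribution with survival (1 + gamma*x/sigma)^(-1/gamma):
    solves P(X <= x) = p, i.e., x = sigma * ((1 - p)^(-gamma) - 1) / gamma."""
    return sigma * ((1.0 - p) ** (-gamma) - 1.0) / gamma

# X1: gamma = 1, sigma = 1.  X2: gamma = 1 + eps, sigma chosen so both medians are 1.
eps = 0.1
sigma2 = (1.0 + eps) / (2.0 ** (1.0 + eps) - 1.0)

print(gp_quantile(0.5, 1.0), gp_quantile(0.5, 1.0 + eps, sigma2))    # both medians = 1
print(gp_quantile(0.99, 1.0), gp_quantile(0.99, 1.0 + eps, sigma2))  # 100-year levels: ~99 vs ~138
```

Despite identical medians, the 99% quantiles (the 100-year return levels for annual maxima in this stylized reading) differ by roughly 39 units, as stated in the text.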
This short example illustrates how issuing forecasts with the right tail regime, i.e., as close as possible to the observational one, is a priority for extreme events, and that a verification methodology should reward forecasts with a close, if not equal, tail regime. Ideally, the measure of forecast performance should give not only the distance but also the 'direction', i.e., whether the forecast is more likely to over- or under-estimate the high quantiles. Indeed, let γ_G ∈ R be the tail index of the observations. If the forecast satisfies γ_F > γ_G, the forecast over-estimates the risk, producing a pessimistic or risk-averse scenario. On the contrary, γ_F < γ_G falls on the optimistic side by under-estimating the likelihood of extreme events.
Classical methods for forecast evaluation, even when designed to focus on extreme events, do not preserve tail equivalence. For instance, for any positive η and observation distribution G, it is always possible to construct a non-tail-equivalent cdf F such that

E_G[CRPS_w(F, Y)] ≤ E_G[CRPS_w(G, Y)] + η;    (4)

the proof can be found in Appendix C. More precisely, if G ∈ D(H_{γ_G}), then it is possible, for any arbitrary γ_F ∈ R, to find F ∈ D(H_{γ_F}) satisfying Equation (4). Thus the CRPS is unable to properly discriminate forecasts with different tail regimes, as non-tail-equivalent forecasts can perform almost as well as the ideal forecast G. A detailed illustration of this result for GP forecasts is given in Appendix D. We also refer to Brehmer and Strokorb (2019), who obtained a more general result, proving that proper scoring rule expectations are not suitable to distinguish tail properties; see their Theorem 5.4.

2.3. A benchmark for assessing forecasts of extremes
Following Strähl and Ziegel (2017), we propose a benchmark to assess the behavior of forecast evaluation procedures with respect to tail regimes. The design relies on a hierarchical model based on a Gamma-exponential mixture with γ > 0: the inverse scale 1/∆ is drawn from a Gamma distribution with shape and rate both equal to 1/γ, and, conditionally on ∆ = δ, Y ~ Exp(δ), where Exp(δ) refers to an exponential random variable with scale δ > 0. The fact that Y then follows a heavy-tailed GP distribution with tail index γ and unit scale, i.e.,

P(Y > y) = (1 + γy)^{−1/γ},    (5)

can be proved using Laplace transforms. For analogy with weather forecasting, we present the benchmark in a temporal setting. At each time t = 1, . . . , T > 1, an observation y_t is drawn independently from an exponential distribution whose scale δ_t is a realization of ∆. In this setting, Y has an exponential tail conditioned on the information brought by its scale δ, which represents the a priori knowledge of the system, for instance the weather at the previous time. Thus the ideal forecast for each time step is Exp(δ), and requires the knowledge of δ. Using relation (5), we see that the climatological forecaster F_clim is a GP distribution with tail index γ and unit scale. Climatology is a commonly used reference forecast in meteorology. In other fields, it can be viewed as the unconditional distribution of the truth, and a climatological forecast can be estimated from a sample of past and analog observations. This setting is attractive as the ideal and climatological forecasters belong to two different regimes of tail decay. We introduce alternative competitors modelling partial knowledge of the conditional state: the λ-informed forecaster F_λ, λ ∈ [0, 1], is a mixture between the climatological and ideal forecasts, where the weight λ indicates the contribution of each one; see Table 1 for the definition.
Finally, the extremist forecaster F_extr simply applies a multiplicative bias ν to the scale of the ideal forecaster: while it is not calibrated, such a forecast has the same tail behavior as the ideal forecaster; see Appendix A for a detailed discussion on calibration. The benchmark is summarized in Table 1 and later referred to as "Model GE".

Table 1: Benchmark to assess the behavior of forecast evaluation procedures with respect to different tail regimes. All forecasts but F_extr are calibrated.
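Model GE can be simulated in a few lines; a minimal sketch, assuming the Gamma-exponential parameterization described above (the exponential rate 1/∆ follows a Gamma distribution with shape and rate 1/γ, so that marginally Y is GP with tail index γ and unit scale) — the function name and seed are illustrative:

```python
import numpy as np

def simulate_ge(T, gamma, seed=0):
    """One draw of T observation/scale pairs from the Model GE benchmark
    (assumed parameterization: 1/Delta ~ Gamma(shape=1/gamma, rate=1/gamma),
    then y_t | delta_t ~ Exp(scale=delta_t))."""
    rng = np.random.default_rng(seed)
    rate = rng.gamma(shape=1.0 / gamma, scale=gamma, size=T)  # rate 1/gamma <=> numpy scale gamma
    delta = 1.0 / rate                                        # conditional exponential scale
    y = rng.exponential(scale=delta)                          # y_t | delta_t ~ Exp(delta_t)
    return delta, y

delta, y = simulate_ge(10**5, gamma=0.25)
# empirical check of the GP(gamma, 1) marginal: P(Y > x) ~ (1 + gamma*x)^(-1/gamma)
x = 3.0
print(np.mean(y > x), (1 + 0.25 * x) ** (-4.0))
```

The empirical survival of the simulated y matches the GP(γ, 1) marginal closely, confirming that the conditionally exponential observations are unconditionally heavy tailed.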
Closed forms of the CRPS are available for each forecast of the proposed benchmark. For instance, the extremist forecast F_extr satisfies

CRPS(F_extr, y) = y + 2νδ e^{−y/(νδ)} − (3/2) νδ.    (6)

Besides, combining (B.1) and (6) yields a closed-form formula for the λ-informed forecast, λ ∈ [0, 1]. Table 2 gives the relative ratio of the empirical means of the CRPS for the benchmark with γ = 1/4. The CRPS being a proper score, the ideal forecast cannot be beaten on average in Table 2. Moreover, there is a clear ranking among calibrated forecasts, based on the nested information sets (Holzmann and Eulert, 2014). Following the principle of tail equivalence presented in Section 2.2, the extremist forecast should be the closest to the ideal, as they both belong to the same regime of tail decay; however, we observe that the CRPS average places it between the least informed forecaster and the climatology.

Table 2: Relative ratio of the mean CRPS, in percent, with respect to the ideal forecast for the model GE with γ = 1/4, based on T = 10^6 observation/forecast pairs.

An alternative measure for forecast evaluation, satisfying the tail equivalence principle, is thus required. A good candidate commonly used in forecast science is the ROC curve (Gneiting and Vogel, 2018). However, in the case of Model GE, all the ROC curves, except the climatological one, coincide whatever the event, which illustrates the ROC curve's invariance under calibration (Kharin and Zwiers, 2003). Further alternatives should thus be investigated.

3. The CRPS as a random variable
3.1. The random CRPS and its properties

Section 2 pointed out the difficulty of summarizing forecast performance for meaningful comparisons on extreme observations. In particular, we illustrated that a single number such as the mean of the CRPS, or of its weighted counterpart, fails to deliver relevant comparisons. As an alternative, we propose to study the distribution of the CRPS when treated as a random variable; see also Ferro (2017); Bessac and Naveau (2021).
For simplicity, we use the setting and corresponding notations of the benchmark presented in Section 2.3. From equations (B.1) and (6), the climatological and ideal scores can be treated as random variables whenever y_t is replaced by Y_t. At this stage, it is important to recall that a forecast is issued with only partial knowledge of the system: the exact value of δ_t and the distribution of Y_t are unknown, and only the observation y_t is available. Table 3 summarizes the quantities that are available to forecasters. Thus, to evaluate forecast performance, it is only possible to compute CRPS(F_t, y_t) for each t. The climatological distribution, which we now denote G and whose existence needs to be hypothesised in practice, is characterized by the observed sample (y_1, . . . , y_t), considered as a sample of independent realizations of the random variable Y.
For any set of forecasts {F_t}_{t=1,...,T} and sample y_1, . . . , y_T, two types of sets of random variables can be defined:

S(F_T) = {CRPS(F_t, Y_t)}_{t=1,...,T}   and   S*(F_T) = {CRPS(F_{π(t)}, Y_t)}_{t=1,...,T},

where π is a random permutation of {1, . . . , T}. Applying π breaks the conditional dependence between y_t and F_t, quantified by δ_t in the benchmark, creating alternative, less informative forecasts. Thus, for a given forecaster, represented by the set F_T = {F_t}_{t=1,...,T}, and a permutation π, we introduce two random variables S(F_T) and S*(F_T), characterized by their respective empirical cdfs.
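The effect of the permutation can be sketched for the ideal forecaster under Model GE with γ = 1/4, using the standard closed-form CRPS of an exponential forecast, CRPS(Exp(δ), y) = y + 2δe^{−y/δ} − 3δ/2, and the Gamma-exponential parameterization assumed earlier (all names and the seed are illustrative):

```python
import numpy as np

def crps_exp(delta, y):
    """Closed-form CRPS of an Exp(scale=delta) forecast at observation y."""
    return y + 2.0 * delta * np.exp(-y / delta) - 1.5 * delta

rng = np.random.default_rng(1)
T = 10**5
rate = rng.gamma(shape=4.0, scale=0.25, size=T)  # assumed: 1/Delta ~ Gamma(1/gamma, rate 1/gamma), gamma = 1/4
delta = 1.0 / rate
y = rng.exponential(scale=delta)                 # y_t | delta_t ~ Exp(delta_t)

pi = rng.permutation(T)                  # random permutation breaking the y_t / F_t link
s = crps_exp(delta, y)                   # S(F_T): scores of the ideal forecaster
s_star = crps_exp(delta[pi], y)          # S*(F_T): scores after permuting the forecasts

print(s.mean(), s_star.mean())           # the permuted scores are larger on average
```

Because the permuted forecasts no longer carry the information contained in δ_t, their average score increases, which is exactly the deviation from the diagonal seen in the QQ-plots of S*(F_T) against S(F_T).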
The climatological forecaster is the only forecaster satisfying

S(F_T) =_d S*(F_T) =_d S(G),    (8)

as by definition it discards any information about the conditioning of the system. The first equality in (8) is a direct consequence of auto-calibration, see Appendix A; the second equality follows from the permutation invariance of the data from the point of view of the climatological forecaster. The distributional properties of S(F_T), S*(F_T), and S(G) give relevant insights into the behavior of the forecaster. For illustration, Figure 1 gives QQ-plots of the distributions of S*(F_T) against S(F_T) for each forecast of the benchmark with γ = 1/4. We observe that the ideal, λ-informed and extremist forecasts deviate from the diagonal, illustrating the influence of the loss of information caused by the permutation: such a visual diagnostic summarizes how S(F_T) and S*(F_T) capture relevant information from the conditioning, modelled here by the random variable ∆. The right panel of Figure 1 displays these distributions on the probability scale and highlights how the discrepancy of the λ-informed forecaster evolves with the parameter λ. Extremist forecasts, with several values of the scale parameter ν, are displayed for the sole purpose of illustrating how such visual diagnostics behave when calibration is not satisfied. Forecast dominance can also be observed in Figure 1.

Table 3: Availability status of the quantities of interest (the availability may be a posteriori).

Object | Definition | Availability in practice
F_t | Distribution of the forecast for time t | yes
y_t | Observed realisation at time t | yes

3.2. Tail properties of the random CRPS
We now study the upper tail behavior of the random CRPS, using EVT to develop a meaningful forecast evaluation for extreme events. To lighten the technicality of this section, all proofs are relegated to Appendix E. In terms of notation, for any conditional model that depends on ∆ = δ, we want to emphasize the difference between a conditional forecast, say F_δ, and an unconditional forecast F. Note that δ depends on the time index t, but for notational simplicity we drop this index; ∆ might also change over time but is assumed here to be time-invariant.
Let X and Y be two random variables with absolutely continuous cdfs F and G and common upper endpoint x_F = x_G. Suppose that there exists γ < 1 such that G ∈ D(H_γ) and that c_F = 2E_F(X F̄(X)) is finite. Then, conditionally on ∆ = δ, the suitably normalized survival function of CRPS(F_δ, Y_δ) converges to the GP survival function H̄_{γ_δ}(x) as u_δ tends to x_{G_δ}, with 1 + γ_δ x > 0; this is Equation (9). So at any fixed state δ (the state of the atmosphere for a weather forecast, say), the CRPS upper tail behavior (conditionally on ∆ = δ) is equivalent to the observation tail behavior, which formalizes what could be intuited from (B.1). Unconditionally, one can also get a result for the climatological forecast, thanks to its invariance under permutation (see Section 3.1): if there exists γ < 1 such that G ∈ D(H_γ), then an analogous GP limit, Equation (10), holds for any x such that 1 + γx > 0. In the case where γ > 0, the convergence in Equation (10) also holds for c_G = 0, as the latter vanishes due to the linear behavior of the auxiliary function b in Equation (3); see, e.g., Embrechts et al. (1997).
The benchmark presented in Table 1 illustrates these results. The choice of working with a time indexed couple (F t , Y t ) or with an invariant (G, Y ) impacts significantly the tail behavior of the CRPS random variables: according to Table 1, the former case implies that the limit in (9) exhibits an exponential tail, whereas the climatological tail given by (10) is heavy, i.e., γ > 0.

3.3. Assessing the forecaster tail behavior
In this section, we propose a tail-equivalent forecast performance index inspired by Equations (9) and (10) and Figure 1. We aim only to provide the intuition behind the index and leave a formal theoretical analysis for future work. We assume that the forecasts lie in the domain of attraction of some distribution H_{γ,σ}. For sufficiently large u, the null hypothesis H_0 : S(F_T) | {Y > u} ~ H_{γ,σ_u} should be rejected for any calibrated forecast with a tail behaviour closer to the ideal forecast than to the climatological reference.
To go further, assume that the variables in S(F_T) are iid. This assumption may not always be satisfied, as for instance temperature measurements on two consecutive days are likely to be dependent, but it is reasonable for measurements sufficiently far apart in time. For each forecast, we can compute a Cramér-von Mises criterion comparing the fitted distribution H_{γ̂,σ̂_u} with K̂^(m)_{S,u}, the empirical distribution of the values in S(F_T) associated with observations exceeding the threshold u. The empirical nature of K̂^(m)_{S,u} allows the simplification

Ω^F_u = 1/(12m) + Σ_{i=1}^{m} ( H_{γ̂,σ̂_u}(s_i) − (2i − 1)/(2m) )²,

where m denotes the number of observations exceeding u and s_1, . . . , s_m are the corresponding ordered values of S(F_T). A detailed algorithm for the computation of Ω^F_u is provided in Table F.4 of Appendix F. As suggested by Figure 1, we assume that Ω^F_u > Ω^G_u for any calibrated forecast F and climatology G. Also, for two calibrated forecasts F_1 and F_2, we conjecture that Ω^{F_2}_u ≥ Ω^{F_1}_u if F_2 has a tail behaviour closer to the ideal forecast than F_1. Under these assumptions, we can summarize the comparison between Ω^F_u and Ω^G_u through a normalized index T_u.

The behaviour of the index T_u is illustrated with the help of model GE; Figure 2 displays the evolution of T_u as a function of the threshold u for T = 10^6 and γ = 1/4. The behaviour of the index is consistent with our conjecture: first, the ideal forecast performs best, while the climatology has the lowest index. The performance ranking among calibrated forecasters is stable as the threshold increases, with the ideal forecast always obtaining the largest index. The extremist forecasters, displayed here to illustrate the behaviour of the index for non-calibrated forecasts, obtain a high index, even larger than that of the ideal forecast, stressing the importance of calibration, which must be carefully assessed before any interpretation of T_u.
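The Cramér-von Mises computation can be sketched with numpy only; this is an illustrative stand-in for the algorithm of Appendix F, and it uses a simple moment estimator of (γ, σ) on the exceedances rather than whatever fitting scheme the extremeIndex package employs:

```python
import numpy as np

def cvm_gp(scores, u):
    """Cramér-von Mises distance between the exceedances of a score sample
    above u and a GP distribution fitted to those exceedances.
    The GP parameters are obtained by a moment estimator (valid for gamma < 1/2):
    gamma = (1 - mean^2/var)/2, sigma = mean*(1 - gamma)."""
    s = np.sort(np.asarray(scores, dtype=float))
    exc = s[s > u] - u                         # ordered exceedances above u
    m = exc.size
    mean, var = exc.mean(), exc.var()
    gamma = 0.5 * (1.0 - mean**2 / var)
    sigma = mean * (1.0 - gamma)
    # fitted GP cdf evaluated at the ordered exceedances
    h = 1.0 - np.power(np.maximum(1.0 + gamma * exc / sigma, 0.0), -1.0 / gamma)
    i = np.arange(1, m + 1)
    return 1.0 / (12 * m) + np.sum((h - (2 * i - 1) / (2 * m)) ** 2)

# sanity check on synthetic GP data (gamma = 0.2, sigma = 1)
rng = np.random.default_rng(2)
gp = (np.power(rng.uniform(size=5000), -0.2) - 1.0) / 0.2
print(cvm_gp(gp, np.quantile(gp, 0.9)))
```

On genuinely GP-distributed scores, the statistic stays small; larger values of Ω^F_u indicate a departure of the score exceedances from the GP family, which is precisely what the conjectured ordering of the index exploits.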

4. Discussion
In this work, we have argued, with the help of a carefully designed benchmark, that the mean of the CRPS, or of its weighted counterparts, is unable to discriminate a forecast's upper tail regime, as demonstrated by Brehmer and Strokorb (2019). Ehm et al. (2016) introduced the so-called "Murphy diagrams" for assessing dominance in point forecasts. This original approach makes it possible to assess dominance among different forecasts and anticipate their skill areas; a similar visual diagnostic is presented in Figure 1 for calibrated forecasts.
Inspired by Friederichs and Thorarinsdottir (2012), we apply EVT directly to common verification measures. By considering the CRPS as a random variable (see also Bessac and Naveau, 2021, for non-extreme cases), this contribution can be viewed as a first step towards considering functionals of score distributions other than their means. The new index introduced in Section 3.3 can be considered a probabilistic alternative to the scores introduced by Ferro (2007) and Ferro and Stephenson (2011). We make a link between the paradigm of maximizing sharpness subject to calibration and its counterpart for extreme events subject to calibration.

Figure 2: Higher index values are assumed to reflect a tail behaviour closer to the ideal forecaster. The validity of the index is limited to calibrated forecasts; non-calibrated extremist forecasts are shown to recall that calibration must first be carefully checked before interpreting such graphics.

In the same vein, Murphy (1993) presented the differences between forecast quality (accordance between forecasts and observations) and forecast value (ability to bring information to realize a benefit by choosing a forecast); forecast value seems the most important for extreme events, where decision making is crucial. For deterministic weather forecasts, such tools are well known, see e.g. Richardson (2000); Zhu et al. (2002). Other widely used scores, based on the dependence between forecasts and observed events, have been considered in Stephenson et al. (2008); Ferro and Stephenson (2011). It would be worthwhile to further study the theoretical properties of this CRPS-based tool. Another potentially interesting investigation would be to extend this procedure to other scores such as the mean absolute difference, the Dawid-Sebastiani score (Dawid and Sebastiani, 1999) or the ignorance score (Smith et al., 2015; Diks et al., 2011). Classical verification tools rely on a verification period; as a consequence, evaluation is always done a posteriori. Thus, an interesting way to pursue this work would be to consider sequential evaluation of rare events, in the spirit of the e-values (Vovk and Wang, 2021) introduced to assess and monitor calibration continuously (Arnold et al., 2021). Finally, we invite scientists to work on new scoring-rule theory departing from score averages.

Acknowledgments
Part of this work was supported by the French National Research Agency (ANR) project T-REX (ANR-20-CE40-0025) and by Energy oriented Centre of Excellence-II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and the ANR project Melody (ANR-19-CE46-0011). This work was partially supported by the ANR LABEX MILYON (ANR-10-LABX-0070) of Université de Lyon, within the program "Investissements d'Avenir" (ANR-11-IDEX-0007).

Implementation details
The implementation of the index relies on the extremeIndex package (Taillardat, 2021a). The R code generating simulation data and Figures is available upon request.

Appendix A. Prediction framework and calibration
The theoretical framework considered in this paper is the now classical prediction space already introduced by Murphy and Winkler (1987); Gneiting and Ranjan (2013); Ehm et al. (2016), and generalized to a serial context by Strähl and Ziegel (2017). It starts formally with a probability space (Ω, A, Q) and a collection of sub-σ-algebras A_1, . . . , A_k ⊂ A, where A_i represents the information available to forecaster i. In a meteorological context, it can be seen as the representation of the atmosphere made by each forecaster. In the benchmark considered in Section 2.3, we consider for simplicity that the information set is generated by a random variable ∆.
A real-valued outcome Y is observed and seen as a (real-valued) random variable. A probabilistic forecast i for Y is identified with its so-called "predictive distribution" with cdf F_i. Rigorously speaking, F_i : Ω × B(R) → [0, 1] is a kernel from (Ω, A_i) to (R, B(R)), but as done by previous authors, we identify the kernels with random cdfs; see, e.g., Strähl and Ziegel (2017) for more details. For each x ∈ R, we may in particular use the notation F_i(x) to mean the random element ω → F_i(ω, (−∞, x]).
In such a framework, a forecast F i is termed ideal with respect to A i if F i = L(Y |A i ) almost surely. Tsyplakov (2011) also refers to this property saying that F i is calibrated with respect to A i . He additionally defines the auto-calibration as the property for F i to satisfy F i = L(Y |σ(F i )) almost surely. Here, σ(F i ) denotes the σ-algebra generated by F i , that is to say the smallest σ-algebra such that ω → F i (ω, x) is measurable for all x ∈ R. Note that if a forecast is calibrated with respect to A i , then it is auto-calibrated, but the converse does not hold in general. As a particular case considered in Section 2.3, the climatological forecaster is ideal with respect to the trivial σ-algebra.
In practice, one is not only concerned with predictions of an outcome Y at a single time point. The framework introduced above also allows dealing with independent replicates at times t = 1, 2, . . ., as is done in Section 2.3. Even if such an independence assumption sounds unrealistic in several situations, as argued by Strähl and Ziegel (2017), it nevertheless provides a first step and takes advantage of a lighter setting. We therefore chose to keep it in this paper for simplicity.

Appendix B. An alternative expression of the weighted CRPS
The weighted CRPS defined by (2) can be reformulated in the following way, as soon as the weight function w(·) is continuous:

CRPS_w(F, y) = E|W(X) − W(y)| − (1/2) E|W(X) − W(X′)|.    (B.1)

Indeed, assume that the weight function w(·) is continuous. Integrating (2) by parts, the weighted CRPS can be rewritten in terms of the transformed variable W(X); the key step uses the fact that F_{W(X)}(W(X)) and F(X) have the same distribution, which is uniform on (0, 1). As W(x) is non-decreasing, one has {W(X) > W(y)} = {X > y}, and (B.1) follows.

Appendix C. Proof of the inequality (4)
Let u be a positive real number. Denote by Z a non-negative random variable with finite mean and cdf H. Assume that Z and Y are independent and have the same right endpoint. We introduce a new random variable X_u whose cdf F_u coincides with G below the threshold u and with the tail of H above it. Note that the decreasingness of F̄_u yields in particular a pointwise bound valid for all x (Equation (C.2)). Besides, Equation (C.2) and the monotonicity of W allow a comparison of the corresponding expectations, and the stochastic ordering that holds between X_u and Y implies the same ordering of the associated quantities. For x > u, bounding the two resulting terms separately and combining the outcome with (C.5) finally leads to the announced inequality. Note that this inequality is true for any u and H, and its right hand side does not depend on H(x). Thus, the tail behavior of the random variables Y and Z can be completely different, although the CRPS values of G and F_u can be as close as one wishes; the right hand side goes to 0 due to the finite mean of W(Y).
Appendix D. A detailed example related to Section 2.2

In this appendix, we illustrate the fact that the CRPS fails at discriminating forecasts with different tails. We consider GP distributed forecasts and observations. In this case, closed forms of the CRPS are available, as detailed in the following.
Besides, as G^{−1}(v) = (σ/γ)((1 − v)^{−γ} − 1), one can rewrite the expected CRPS, denoting by U a random variable uniformly distributed on (0, 1). If c = ξσ/(βγ) = 1, then this expression simplifies. In particular, m_0 = (1/γ) B(1, 1/ξ + 1/γ) = (1 + γ/ξ)^{−1}, and it follows that, if γ/σ = ξ/β, the expected CRPS reaches its minimum for ξ = γ and σ = β, concluding the proof of Lemma 1. Lemma 1 allows to study the effect of changing the forecast's tail behavior, captured by ξ, and the forecast spread, encapsulated in β, when F and G have proportional parameters, i.e., β = aσ and ξ = aγ for some a > 0. In this case, the CRPS expression simplifies, leading when a > 1 to a forecaster with a heavier tail, overestimating the true upper tail behavior, and to the opposite when a < 1.
Counterexamples such as the previous one can thus be found, illustrating how weighted scoring rules fail to compare tail behaviors. They should therefore be handled with particular care, especially by forecast makers, as already advocated by Gilleland et al. (2018); Lerch et al. (2017).
Appendix E. Proofs of the results of Section 3.2

So far, we have shown that, for some large z_0, there exist non-negative α and β such that the desired bound holds for z > z_0. We still need to prove that this statement also holds for z ≤ z_0. Define β_0 as above. As γ < 1, β_0 is finite and, as F̄(z) ≥ F̄(z_0) for all z ≤ z_0, we have two cases: either β < β_0/F̄(z_0) or β ≥ β_0/F̄(z_0). In the latter case, we have 2E_F((Z − z)1{Z > z}) ≤ β_0 ≤ F̄(z)(αz + β), and the required result is obtained. In the former case, it is always possible to increase the β chosen for z > z_0 and bring it above β_0/F̄(z_0). We are now ready to prove (9) as announced.
Proof of (9): Given the conditional forecast F_δ, the CRPS can be computed with respect to the conditional observation y_δ. To simplify notations, we drop the subscript δ in the rest of the proof and reinstate it at the end. The previous lemma allows to write Y ≤ CRPS(F, Y) + c ≤ (1 + αF̄(Y))Y + βF̄(Y) a.s.
Let us now work conditionally on Y > u, for u large and close to x_F = x_Y. We then get Y ≤ CRPS(F, Y) + c ≤ (1 + αF̄(u))Y + βF̄(u) a.s. This holds when the right endpoint of Y is non-negative. If this were not the case, note that one can simply write Y ≤ CRPS(F, Y) + c ≤ Y + βF̄(u) a.s.
The main idea of the proof is to notice that F̄(u) goes to zero as u gets large; consequently, the above inequalities indicate that, conditionally on Y > u, the random variable CRPS(F, Y) + c is squeezed between Y and a quantity arbitrarily close to Y. We then recognize the probability (conditionally on Y > u) for Y to lie in the interval

J_u = [ (tb(u) − F̄(u)(α + β)) / (1 + αF̄(u)), tb(u) ].

The remaining part of the proof consists in showing that this conditional probability tends to 0 as u → x_F. For u large enough, it can be approximated using the GPD, so that it is controlled by the GPD probability density function g_GP integrated over the shrinking interval J_u, which implies its convergence to 0. Since this is true conditionally on ∆ = δ, the result can be rewritten, after reintroducing the subscript δ, as Equation (9), as u_δ tends to x_{G_δ}, with 1 + γ_δ x > 0.
Appendix F. Algorithm for the computation of the Cramér-von Mises criterion

Note that for large u, under the null hypothesis, the statistic Ω^F_u follows a Cramér-von Mises distribution. The associated p-values p^F_u ∈ [0, 1] could have been computed, but they are subject to numerical instabilities (Prokhorov, 1968; Csörgő and Faraway, 1996). Furthermore, Ω^F_u is sufficient to compare the effect size of the deviation.