Stable reliability diagrams for probabilistic classifiers

Significance
Probabilistic classifiers assign predictive probabilities to binary events, such as rainfall tomorrow, a recession, or a personal health outcome. Such a system is reliable or calibrated if the predictive probabilities are matched by the observed frequencies. In practice, calibration is assessed graphically in reliability diagrams and quantified via the reliability component of mean scores. Extant approaches rely on binning and counting and have been hampered by ad hoc implementation decisions, a lack of reproducibility, and inefficiency. Here, we introduce the CORP approach, which uses the pool-adjacent-violators algorithm to generate optimally binned, reproducible, and provably statistically consistent reliability diagrams, along with a numerical measure of miscalibration based on a revisited score decomposition.


Introduction
Calibration or reliability is a key requirement on any probability forecast or probabilistic classifier. In a nutshell, a probabilistic classifier assigns a predictive probability to a binary event. The classifier is calibrated or reliable if, when looking back at a series of extant forecasts, the conditional event frequencies match the predictive probabilities. For example, if we consider all cases with a predictive probability of about .80, the observed event frequency ought to be about .80 as well. While for many decades researchers and practitioners have been checking calibration in myriads of applications (1, 2), the topic is subject to a surge of interest in machine learning (3), spurred by the recent recognition that "modern neural networks are uncalibrated, unlike those from a decade ago" (4).

Reliability diagrams: Binning and counting
The key diagnostic tool for checking calibration is the reliability diagram, which plots the observed event frequency against the predictive probability. In discrete settings where there are only a few predictive probabilities, such as 0, 1/10, ..., 9/10, 1, this is straightforward. However, statistical and machine learning approaches to binary classification generate continuous predictive probabilities that can take any value between 0 and 1, and typically the forecast values are pairwise distinct. In this ubiquitous setting, researchers have been using the "binning and counting" approach, which starts by selecting a certain, typically arbitrary number of bins for the forecast values. Then, for each bin, one plots the respective conditional event frequency versus the midpoint or average forecast value in the bin. For calibrated or reliable forecasts the two quantities ought to match, and so the points plotted ought to lie on, or close to, the diagonal (2, 5).
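The binning and counting recipe described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figures: the function name `binned_reliability`, the use of bin-averaged forecast values rather than midpoints, and the toy data are all assumptions made here.

```python
import random

def binned_reliability(x, y, m=10):
    """Binning and counting with m equally spaced bins on [0, 1].

    Returns one tuple (mean forecast, event frequency, count) per non-empty bin.
    """
    sums = [[0.0, 0.0, 0] for _ in range(m)]  # per bin: sum of x, sum of y, count
    for xi, yi in zip(x, y):
        b = min(int(xi * m), m - 1)  # a forecast of exactly 1 goes to the last bin
        sums[b][0] += xi
        sums[b][1] += yi
        sums[b][2] += 1
    return [(sx / c, sy / c, c) for sx, sy, c in sums if c > 0]

# Hypothetical toy data (calibrated by construction), not the Niamey forecasts.
rng = random.Random(0)
x = [rng.random() for _ in range(86)]
y = [1 if rng.random() < xi else 0 for xi in x]
for m in (9, 10, 11):
    print(m, [round(freq, 2) for _, freq, _ in binned_reliability(x, y, m)])
```

Running the loop with neighbouring values of m on the same data gives a quick impression of how much the plotted points depend on the bin specification.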
In Fig. 1(a,c,e) we show reliability diagrams based on the binning and counting approach with a choice of m = 10 equally spaced bins for 24-hour ahead daily probability of precipitation forecasts at Niamey, Niger in July-September 2016. They concern three competing forecasting methods, including the world-leading, 52-member ensemble system run by the European Centre for Medium-Range Weather Forecasts (ENS, 6), a reference forecast called extended probabilistic climatology (EPC), and a purely data-driven statistical forecast (Logistic), as described by Vogel et al. (7, Fig. 2).
Not surprisingly, the classical approach to plotting reliability diagrams is highly sensitive to the specification of the bins, and the visual appearance may change drastically under the slightest change. We show an example in Fig. 2(a-c) for a fourth type of forecast at Niamey, namely, a statistically postprocessed version of the ENS forecast called ensemble model output statistics (EMOS), for which choices of m = 9, 10, or 11 equidistant bins yield drastically distinct reliability diagrams. This is a disconcerting state of affairs for a widely used data analytic tool, and contrary to well-argued recent pleas for reproducibility (8) and stability (9). Similar instabilities under the binning and counting approach have been reported for numerical measures of calibration, even when the size n of the dataset considered is large (10, p. 6; 11, Sect. 3.1).
While methods for the choice of the binning and related implementation decisions for reliability diagrams have been proposed in the literature (5, 12, 13), extant approaches lack theoretical justification, are elaborate, and have not been adopted by practitioners. Instead, researchers across disciplines continue to craft reliability diagrams and report associated measures of (mis)calibration, such as the Brier score reliability component (14-16), based on ad hoc choices. In this light, Stephenson et al. (17, p. 757) call for the development of "nonparametric approaches for estimating the reliability curves (and hence the Brier score components), which also include[d] point-wise confidence intervals." Here we introduce a new approach to reliability diagrams and score decompositions, which resolves these issues in a theoretically optimal and readily implementable way, as illustrated on the forecasts at Niamey in Figs. 1(b,d,f) and 2(d). In a nutshell, we use nonparametric isotonic regression and the pool-adjacent-violators (PAV) algorithm to estimate conditional event probabilities (CEPs), which yields a fully automated choice of bins that adapts to both discrete and continuous settings, without any need for tuning parameters or implementation decisions. We call this stable, new approach CORP, as its novelty and power include the following four properties.
Consistency. The CORP reliability diagram and associated numerical measures of (mis)calibration are consistent in the classical statistical sense of convergence to population characteristics. We leverage existing asymptotic theory (18-20) to demonstrate that the rate of convergence is best possible, and to generate large sample consistency and confidence bands for uncertainty quantification.
Optimality. The CORP reliability diagram is optimally binned, in that no other choice of bins generates more skillful (re)calibrated forecasts, subject to regularization via isotonicity (21, Thm. 1.10; 22; 23).
Reproducibility. The CORP approach requires neither tuning parameters nor implementation decisions, thus yielding well-defined and readily reproducible reliability diagrams and score decompositions.
Pool-adjacent-violators (PAV) algorithm based. CORP is based on nonparametric isotonic regression and implemented via the PAV algorithm, a classical iterative procedure with linear complexity only (24, 25). Essentially, the CORP reliability diagram shows the graph of the PAV-(re)calibrated forecast probabilities.
In the remainder of the article we provide the details of CORP reliability diagrams and score decompositions, and we substantiate the above claims via mathematical analysis and simulation experiments.
The CORP approach: Optimal binning via the pool-adjacent-violators (PAV) algorithm
The basic idea of CORP is to use nonparametric isotonic regression to estimate a forecast's CEPs as a monotonic, non-decreasing function of the original forecast values. Fortunately, in this simple setting there is one, and only one, kind of nonparametric isotonic regression, for which the PAV algorithm provides a simple algorithmic solution (24, 25). To each original forecast value, the PAV algorithm assigns a (re)calibrated probability under the regularizing constraint of isotonicity, as illustrated in textbooks (26, Figs. 2.13 and 10.7), and this solution is optimal under a very broad class of loss functions (21, Thm. 1.10). In particular, the PAV solution constitutes both the nonparametric isotonic least squares and the nonparametric isotonic maximum likelihood estimate of the CEPs.
The CORP reliability diagram plots the PAV-calibrated probability versus the original forecast value, as illustrated on the Niamey data in Figs. 1(b,d,f) and 2(d). The PAV algorithm assigns calibrated probabilities to the individual unique forecast values, and we interpolate linearly in between, to facilitate comparison with the diagonal that corresponds to perfect calibration. If a group of (one or more) forecast values are assigned identical PAV-calibrated probabilities, the CORP reliability diagram displays a horizontal segment. The horizontal sections can be interpreted as bins, and the respective PAV-calibrated probabilities are simply the bin-specific empirical event frequencies. For example, we see from Fig. 1(b) that the PAV algorithm assigns a calibrated probability of .125 to ENS forecast values between 9/52 and 20/52, and a calibrated probability of .481 to ENS values between 21/52 and 42/52. The PAV algorithm guarantees that both the number and the positions of the horizontal segments (and hence the bins) in the CORP reliability diagram are determined in a fully automated, optimal way.
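To make the pooling step concrete, the following is a minimal sketch of the PAV algorithm for binary outcomes in plain Python. It is an illustration under assumptions, not the authors' R implementation: the function name `pav_calibrate` is hypothetical, ties in the forecast values are aggregated first, and adjacent blocks are pooled whenever their event frequencies violate monotonicity.

```python
def pav_calibrate(x, y):
    """Return the PAV-(re)calibrated probability for each forecast, in input order."""
    # Step 1: aggregate cases that share the same forecast value.
    groups = {}
    for xi, yi in zip(x, y):
        g = groups.setdefault(xi, [0.0, 0])  # [sum of outcomes, number of cases]
        g[0] += yi
        g[1] += 1
    values = sorted(groups)

    # Step 2: scan unique forecast values in increasing order and pool violators.
    blocks = []  # each block: [sum of outcomes, number of cases, number of unique values]
    for v in values:
        s, c = groups[v]
        blocks.append([s, c, 1])
        # Pool while the previous block's event frequency exceeds the last one's
        # (compared by cross-multiplication to avoid division).
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c, k = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] += k

    # Step 3: each block (bin) gets its empirical event frequency.
    fitted, pos = {}, 0
    for s, c, k in blocks:
        for v in values[pos:pos + k]:
            fitted[v] = s / c
        pos += k
    return [fitted[xi] for xi in x]

# Toy example: the second and third forecasts are pooled into one bin.
print(pav_calibrate([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1]))
```

Each final block corresponds to one horizontal segment of the CORP reliability diagram, and its value is the event frequency among the cases pooled into it.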
The assumption of nondecreasing CEPs is natural, as decreasing estimates are counterintuitive, routinely being dismissed as artifacts by practitioners. Furthermore, the constraint provides an implicit regularization, serving to stabilize the estimate and counteract overfitting, despite the method being entirely nonparametric. Under the binning and counting approach, small or sparsely populated bins are subject to overfitting and large estimation uncertainty, as exemplified by the sharp upward spike at about .25 in Fig. 2(b). The assumption of isotonicity in CORP stabilizes the estimate and avoids artifacts (Fig. 2d). In contrast to the binning and counting approach, which has not been subject to asymptotic analysis, CORP reliability diagrams are provably statistically consistent: If the predictive probabilities and event realizations are samples from a fixed, joint distribution, then the graph of the diagram converges to the respective population equivalent, as a direct consequence of existing large sample theory for nonparametric isotonic regression estimates (18-20). Furthermore, CORP is asymptotically efficient, in the sense that its automated choice of binning results in an estimate that is as accurate as possible in the large sample limit. In Appendix B we formalize these arguments and report on a simulation study, for which we give details in Appendix A, and which demonstrates that the efficiency of the CORP approach also holds in small samples.
Traditionally, reliability diagrams have been accompanied by histograms or bar plots of the marginal distribution of the predictive probabilities, on either standard or logarithmic scales (e.g., 27). Under the binning and counting approach, the histogram bins are typically the same as the reliability bins. In plotting CORP reliability diagrams, we distinguish between discretely and continuously distributed classifiers or forecasts. Intuitively, the discrete case refers to forecast values that only take on a finite and sufficiently small number of distinct values. Then we show the PAV-calibrated probabilities as dots, interpolate linearly in between, and visualize the marginal distribution of the forecast values in a bar diagram, as illustrated in Fig. 3(a,b). For continuously distributed forecasts, essentially every forecast takes on a different value, whence the choice of binning becomes crucial. The CORP reliability diagram displays the bin-wise constant PAV-calibrated probabilities in horizontal segments, which are linearly interpolated in between, and we use the Freedman-Diaconis rule (28) to generate a histogram estimate of the marginal density of the forecast values, as exemplified in Fig. 3(c,d). In our software implementation (29) a simple default is used: If the smallest distance between any two distinct forecast values is 0.01 or larger, we operate in the discrete setting, and else in the continuous one. The CORP reliability diagrams in Figs. 1-3 also display a new measure of miscalibration (MCB), discussed in detail later on as we introduce the CORP score decomposition.
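The default rule for choosing between the discrete and the continuous display can be written down directly. The sketch below only mirrors the rule as stated in the text; the function name `is_discrete` is hypothetical, the treatment of a single unique forecast value as discrete is an assumption, and floating-point rounding right at the threshold is not handled.

```python
def is_discrete(x, threshold=0.01):
    """Discrete setting if the smallest gap between distinct forecast values is >= threshold."""
    unique = sorted(set(x))
    if len(unique) < 2:
        return True  # assumption: a single unique value is treated as discrete
    smallest_gap = min(b - a for a, b in zip(unique, unique[1:]))
    return smallest_gap >= threshold

print(is_discrete([0.0, 0.1, 0.2, 0.1]))   # forecasts on a coarse grid
print(is_discrete([0.5, 0.505, 0.9]))      # two forecasts closer than 0.01
```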

CORP uncertainty quantification
Bröcker and Smith (30) convincingly advocate the need for uncertainty quantification, so that structural deviations of the estimated CEP from the diagonal can be distinguished from deviations that merely reflect noise. They employ a resampling technique for the binning and counting method in order to find consistency bands under the assumption of calibration. For CORP, we extend this approach in two crucial ways, by generating either consistency or confidence bands, and by using either a resampling technique or asymptotic distribution theory, where we leverage existing theory for nonparametric isotonic regression estimates (18-20).
Consistency bands are generated under the assumption that the probability forecasts are calibrated, and so they are positioned around the diagonal. There is a close relation to the classical interpretation of statistical tests and p-values: Under the hypothesized perfect calibration, how much do reliability diagrams vary, and how (un)likely is the outcome at hand? In contrast, confidence bands cluster around the CORP estimate and follow the classical interpretation of frequentist confidence intervals: If one repeats the experiment numerous times, the fraction of confidence intervals that contain the true CEP approaches the nominal level. The two methods are illustrated in Fig. 3, where the right column (b,d) shows confidence bands, and the left column (a,c) shows consistency bands, as do the CORP reliability diagrams in Figs. 1(b,d,f) and 2(d).
In our adaptation of the resampling approach, for each iteration the resampled CORP reliability diagram is computed, and confidence or consistency bands are then specified by using resampling percentiles, in customary ways. For consistency bands, the resampling is based on the assumption of calibrated original forecast values, whereas PAV-calibrated probabilities are used to generate confidence bands. While resampling works well in small to medium samples, the use of asymptotic theory suits cases where the sample size n of the dataset is large, exactly when the computational cost of resampling based procedures becomes prohibitive. Existing asymptotic theory is readily applicable and operates under weak conditions on the marginal distribution of the forecast values, and (strict) monotonicity and smoothness of (true) CEPs (18-20).
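As a rough illustration of the resampling idea for consistency bands, the sketch below simulates outcomes as if the original forecasts were calibrated, recomputes the PAV fit for each simulated sample, and takes pointwise percentiles. Several simplifications are assumptions made here and may differ from the software implementation (29): the forecast values are held fixed rather than resampled, the percentile indices are chosen in a simple way, and the names `pav_calibrate` and `consistency_band` are hypothetical.

```python
import random

def pav_calibrate(x, y):
    """PAV-(re)calibrated probabilities for binary outcomes, in input order."""
    groups = {}
    for xi, yi in zip(x, y):
        g = groups.setdefault(xi, [0.0, 0])
        g[0] += yi
        g[1] += 1
    values = sorted(groups)
    blocks = []  # [sum of outcomes, number of cases, number of unique values]
    for v in values:
        blocks.append([groups[v][0], groups[v][1], 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c, k = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] += k
    fitted, pos = {}, 0
    for s, c, k in blocks:
        for v in values[pos:pos + k]:
            fitted[v] = s / c
        pos += k
    return [fitted[xi] for xi in x]

def consistency_band(x, n_boot=200, level=0.90, seed=1):
    """Pointwise band for the PAV fit under the hypothesis of calibrated forecasts."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_boot):
        # Simulate outcomes as if each forecast value were the true event probability.
        y_star = [1 if rng.random() < xi else 0 for xi in x]
        sims.append(pav_calibrate(x, y_star))
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = int((1 + level) / 2 * (n_boot - 1))
    lower, upper = [], []
    for i in range(len(x)):
        column = sorted(sim[i] for sim in sims)
        lower.append(column[lo_idx])
        upper.append(column[hi_idx])
    return lower, upper

# Hypothetical toy forecasts on a regular grid.
x = [i / 50 for i in range(1, 50)]
lower, upper = consistency_band(x)
print(round(lower[24], 2), round(upper[24], 2))
```

Under the same simplifications, a confidence band variant would simulate the outcomes from the PAV-calibrated probabilities instead of the original forecast values, so that the band is centred on the CORP estimate rather than on the diagonal.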
The distinction between discretely and continuously distributed forecasts becomes critical here as the asymptotic theory differs between these cases. For discrete forecasts, results of El Barmi and Mukerjee (18) imply that the difference between the estimated and the true CEP, scaled by $n^{1/2}$, converges to a (mixture of) normal distribution(s). For continuous forecasts, following Wright (19), the difference between the estimated and the true CEP, magnified by $n^{1/3}$, converges to Chernoff's distribution (31). The distinct scaling laws imply that the convergence is faster in the discrete than in the continuous case, since in the former the CORP binning stabilizes as it captures the discrete forecast values, and thereafter the number of samples per bin increases linearly, in accordance with the standard $n^{1/2}$ rate. In either setting, asymptotic consistency and confidence bands can be obtained from quantiles of the asymptotic distributions in customary ways. As a caveat, both resampling and asymptotic techniques operate under the assumption of independent, or at least exchangeable, forecast cases, which may or may not be warranted in practice. We encourage follow-up work in dependent data settings, as recently tackled for related types of data science tools (32).
In our software implementation (29), we use the following default choices. Suppose that the sample size is n and there are k unique forecast values. For consistency bands, if n ≤ 1000, or if n ≤ 5000 and n ≤ 50k, we use resampling, else we rely on asymptotic theory. In the latter case we employ the discrete asymptotic distribution if $n \geq 8k^2$, while otherwise we use the continuous one. For confidence bands, the current default uses resampling throughout, as the asymptotic theory depends on the assumption of a true CEP with strictly positive derivative. In the simulation examples in Fig. 3, which are based on n = 1024 observations, this implies the use of resampling in panels (b,c,d) and of discrete asymptotic theory in panel (a). Fig. 4 shows coverage rates of 90% consistency and confidence bands in the simulation settings described in Appendix A, based on the default choices. The coverage rates are generally accurate, or slightly conservative, especially in large samples.
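The default choices just described amount to a small decision rule, sketched below. The function name `default_band_method` and the returned labels are hypothetical and are not taken from the software package; only the thresholds come from the text. The value k = 10 in the usage lines is an assumed number of unique forecast values, chosen for illustration.

```python
def default_band_method(n, k, band="consistency"):
    """Pick resampling or asymptotic theory from sample size n and k unique forecast values."""
    if band == "confidence":
        return "resampling"  # current default for confidence bands
    if n <= 1000 or (n <= 5000 and n <= 50 * k):
        return "resampling"
    # Asymptotic theory: discrete if n >= 8 k^2, continuous otherwise.
    return "discrete asymptotic" if n >= 8 * k ** 2 else "continuous asymptotic"

print(default_band_method(1024, 10))      # few unique values, n = 1024
print(default_band_method(1024, 1024))    # all forecast values distinct, n = 1024
print(default_band_method(1024, 10, band="confidence"))
```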

CORP score decomposition: Miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components
Scoring rules provide a numerical measure of the quality of a classifier or forecast by assigning a score or penalty S(x, y), based on forecast value x ∈ [0, 1] for a dichotomous event y ∈ {0, 1}. A scoring rule is proper (33) if it assigns the minimal penalty in expectation when x equals the true underlying event probability. If the minimum is unique, the scoring rule is strictly proper.
In practice, for a given sample $(x_1, y_1), \ldots, (x_n, y_n)$ of forecast-realization pairs, the empirical score
$$\bar{S}_X = \frac{1}{n} \sum_{i=1}^{n} S(x_i, y_i) \qquad [1]$$
is used for forecast ranking. Table 1 presents examples of proper and strictly proper scoring rules. The Brier score and logarithmic score are strictly proper. In contrast, the misclassification error is proper, but not strictly proper: all that matters is whether or not a classifier probability is on the correct side of 1/2. Under any proper scoring rule, the mean score $\bar{S}_X$ constitutes a measure of overall predictive performance. For several decades, researchers have been seeking to decompose $\bar{S}_X$ into intuitively appealing components, typically thought of as reliability (REL), resolution (RES), and uncertainty (UNC) terms. The REL component measures how much the conditional event frequencies deviate from the forecast probabilities, while RES quantifies the ability of the forecasts to discriminate between events and non-events. Finally, UNC measures the inherent difficulty of the prediction problem, but does not depend on the issued forecast under consideration. While there is a consensus on the character and intuitive interpretation of the decomposition terms, their exact form remains subject to debate, despite a half-century quest in the wake of Murphy's (15) Brier score decomposition. In particular, Murphy's decomposition is exact in the discrete case, but fails to be exact under continuous forecasts, which has prompted the development of increasingly complex types of decompositions (16, 17).
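For readers who prefer code to formulas, the three scoring rules named above and the mean score can be sketched as follows. The definitions are the standard ones for binary events; the function names are hypothetical, scoring a forecast of exactly 1/2 with a misclassification penalty of 1/2 is an assumption about tie handling, and the logarithmic score is only evaluated here for forecasts strictly between 0 and 1.

```python
import math

def brier_score(x, y):
    """Brier score: squared difference between forecast probability and outcome."""
    return (x - y) ** 2

def log_score(x, y):
    """Logarithmic score; assumes 0 < x < 1 so that the logarithm is finite."""
    return -math.log(x) if y == 1 else -math.log(1.0 - x)

def misclassification_error(x, y):
    """0 if x is on the correct side of 1/2, 1 if not, 1/2 for a tie (assumed convention)."""
    if x == 0.5:
        return 0.5
    return 0.0 if (x > 0.5) == (y == 1) else 1.0

def mean_score(score, xs, ys):
    """Empirical mean score over forecast-realization pairs."""
    return sum(score(x, y) for x, y in zip(xs, ys)) / len(xs)

xs, ys = [0.8, 0.3, 0.6], [1, 0, 0]
for rule in (brier_score, log_score, misclassification_error):
    print(rule.__name__, round(mean_score(rule, xs, ys), 4))
```

The misclassification error only reacts when a forecast crosses 1/2, which is why it is proper but not strictly proper, whereas the Brier and logarithmic scores change with every change in the forecast value.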
Here we adopt the general score decomposition advocated forcefully by Siegert (34), and discussed by various other authors (e.g., 16, 35). Specifically, let
$$\bar{S}_X, \qquad \bar{S}_C = \frac{1}{n} \sum_{i=1}^{n} S(\hat{x}_i, y_i), \qquad \text{and} \qquad \bar{S}_R = \frac{1}{n} \sum_{i=1}^{n} S(r, y_i) \qquad [2]$$
denote the mean score for the original forecast values of Eq. [1], the mean score for Calibrated probabilities $\hat{x}_1, \ldots, \hat{x}_n$, and the mean score for a constant Reference forecast $r$, respectively. Then $\bar{S}_X$ decomposes as
$$\bar{S}_X = (\bar{S}_X - \bar{S}_C) - (\bar{S}_R - \bar{S}_C) + \bar{S}_R = \mathrm{MCB} - \mathrm{DSC} + \mathrm{UNC}, \qquad [3]$$
where we adopt, in part, terminology proposed by Ehm and Ovcharov (36) and Pohle (37). As defined in Eq. [3], the miscalibration component MCB is the difference of the mean scores of the original and the calibrated forecasts. Similarly, the DSC component quantifies discrimination ability via the difference between the mean score for the reference and the calibrated forecast, while the classical measure of uncertainty (UNC) is simply the mean score for the reference forecast.
In the extant literature, it has been assumed implicitly or explicitly that the calibrated and reference forecasts can be chosen at researchers' discretion (34, 37). We argue otherwise and contend that the calibrated forecasts ought to be the PAV-(re)calibrated probabilities, as displayed in the CORP reliability diagram, whereas the reference forecast $r$ ought to be the marginal event frequency $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i$. We refer to the resulting decomposition as the CORP score decomposition, which enjoys the following properties:
• MCB ≥ 0 with equality if the original forecast is calibrated.
• DSC ≥ 0 with equality if the PAV-calibrated forecast is constant.
• The decomposition is exact.
In particular, the CORP score decomposition never yields counterintuitive negative values of the components, contrary to choices in the extant literature. The cases of vanishing components (MCB = 0 or DSC = 0) support the intuitive interpretation of CORP reliability diagrams, in that parts away from the diagonal indicate lack of calibration, whereas extended horizontal segments are indicative of diminished discrimination ability. For refined statements and proofs see Theorem 1 in Appendix C. If S is the Brier score, then in the special case of discrete forecasts with non-decreasing CEPs, the MCB, DSC, and UNC terms in Eq. [3] agree with the REL, RES, and UNC components, respectively, in the classical Murphy decomposition, as we demonstrate in Theorem 2 in Appendix C. If S is the misclassification error, MCB equals the fraction of cases in which the PAV-calibrated probability was on the correct side of 1/2, but the original forecast value was not, minus the fraction vice versa, with natural adaptations in the case of ties.
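The CORP decomposition under the Brier score can be sketched as follows, with the PAV-calibrated probabilities as the calibrated forecast and the marginal event frequency as the reference forecast. This is an illustrative sketch, not the authors' R code: the function names are hypothetical, the PAV routine is a plain-Python stand-in, and the four-case data set is made up.

```python
def pav_calibrate(x, y):
    """PAV-(re)calibrated probabilities for binary outcomes, in input order."""
    groups = {}
    for xi, yi in zip(x, y):
        g = groups.setdefault(xi, [0.0, 0])
        g[0] += yi
        g[1] += 1
    values = sorted(groups)
    blocks = []  # [sum of outcomes, number of cases, number of unique values]
    for v in values:
        blocks.append([groups[v][0], groups[v][1], 1])
        while len(blocks) > 1 and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]:
            s, c, k = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
            blocks[-1][2] += k
    fitted, pos = {}, 0
    for s, c, k in blocks:
        for v in values[pos:pos + k]:
            fitted[v] = s / c
        pos += k
    return [fitted[xi] for xi in x]

def corp_brier_decomposition(x, y):
    """Mean Brier score and its MCB, DSC, and UNC components, as in Eq. [3]."""
    n = len(x)
    x_hat = pav_calibrate(x, y)   # calibrated forecast
    r = sum(y) / n                # reference forecast: marginal event frequency
    s_x = sum((xi - yi) ** 2 for xi, yi in zip(x, y)) / n
    s_c = sum((ci - yi) ** 2 for ci, yi in zip(x_hat, y)) / n
    s_r = sum((r - yi) ** 2 for yi in y) / n
    return {"score": s_x, "MCB": s_x - s_c, "DSC": s_r - s_c, "UNC": s_r}

# Made-up example with four forecast cases.
parts = corp_brier_decomposition([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(parts)
print(parts["MCB"] - parts["DSC"] + parts["UNC"])  # reproduces the mean score
```

Because all three mean scores are computed on the same sample, the identity score = MCB − DSC + UNC holds by construction, which is the sense in which the decomposition is exact.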
In Table 2 we illustrate the CORP Brier score decomposition for the probability of precipitation forecasts at Niamey in Figs. 1-2. The purely data-driven Logistic forecast obtains the best (smallest) mean score, the best (smallest) MCB term, and the best (highest) DSC component, well in line with the insights offered by the CORP reliability diagrams, and attesting to the particular challenges for precipitation forecasts over northern tropical Africa (7).
Interestingly, every proper scoring rule admits a representation as a mixture of elementary scoring rules (e.g., 33, Sect. 3.2). Consequently, the MCB, DSC, and UNC components of the CORP decomposition admit analogous representations as mixtures of the respective components under the elementary scores, whence we may plot Murphy diagrams in the sense of Ehm et al. (38) for the MCB, DSC, and UNC components.

Discussion
Our paper addresses two long-standing challenges in the evaluation of probabilistic classifiers by developing the CORP reliability diagram that enjoys theoretical guarantees, avoids artifacts, allows for uncertainty quantification, and yields a fully automated choice of the underlying binning, without any need for tuning parameters or implementation choices. The associated CORP decomposition disaggregates the mean score under any proper scoring rule into components that are guaranteed to be non-negative.
Of particular relevance is the remarkable fact that CORP reliability diagrams feature optimality properties in both finite sample and large sample settings. Asymptotically, the PAV-(re)calibrated probabilities, which are plotted in a CORP reliability diagram, minimize estimation error, while in finite samples PAV-calibrated probabilities are optimal in terms of any proper scoring rule, subject to the regularizing constraint of isotonicity.
We believe that the proposals in this paper can serve as a blueprint for the development of novel diagnostic and inference tools for a very wide range of data science methods. For example, the popular Hosmer-Lemeshow goodness-of-fit test (39) for logistic regression is subject to the same types of ad hoc decisions on binning schemes, and hence the same types of instabilities as the binning and counting approach (10, p. 6). Tests based on CORP and the MCB miscalibration measure are promising candidates for powerful alternatives.
Perhaps surprisingly, the PAV algorithm and its appealing properties generalize from probabilistic classifiers to mean, quantile, and expectile assessments for real-valued outcomes (40). In this light, far-reaching generalizations of the CORP approach apply to binary regression in general, to standard (mean) regression, where they yield a new mean squared error (MSE) decomposition with desirable properties, and to quantile and expectile regression. In all these settings, score decompositions have been studied (37, 41), and we contend that the PAV algorithm ought to be used to generate the Calibrated forecast in the general decomposition in Eq. [3], whereas the Reference forecast ought to be the respective marginal, unconditional event frequency, mean, quantile, or expectile. We leave these extensions to future work and encourage further investigation from theoretical, methodological, and applied perspectives.
Open source code for the implementation of the CORP approach in the R language and environment for statistical computing (42) is available on GitHub (29).
the CORP reliability diagram are asymptotically the same, and so are the respective asymptotic distributions. However, under the CORP approach the unique forecast values are always correctly identified as the sample size increases, while under the binning and counting approach this may or may not be the case, depending on implementation decisions.
Large sample theory for the continuously distributed case is more involved, and generally assumes that the CEP is differentiable with strictly positive derivative. Asymptotic results of Wright (19) for the variance and of Dai et al. (43) for the bias imply that the MSE of the CORP estimates decays like $n^{-2/3}$. We now compare to the binning and counting approach, using either m fixed, equidistant bins, or using m = m(n) empirical quantile-dependent bins. For a general sequence of m(n) bins, the magnitudes of the asymptotic variance and squared bias are governed by the most sparsely populated bin, at a disadvantage relative to the quantile-dependent case.
The classical reliability diagram relies on a fixed number m of bins, finds the respective bin-averaged event frequencies, and plots them against the bin midpoints or bin-averaged forecast values. Any such approach fails asymptotically, with estimates that are in general biased and inconsistent. More adequately, a flexible number m(n) of bins can be used, with boundaries defined via empirical quantiles of $x_1, \ldots, x_n$. Specifically, m(n) bins can be bracketed by 0, the empirical quantiles at level j/m(n) for j = 1, ..., m(n) − 1, and 1. Then, for n sufficiently large, each bin covers about n/m(n) data points, and the bin-averaged CEPs converge to the true CEPs at the respective true quantiles with an estimation variance that decays like $m(n)/n$ and a squared bias that decays like $m(n)^{-2}$. When m(n) is of order $n^{\alpha}$ for $\alpha \in (0, 1)$, we obtain a consistent estimate with an estimation variance that decays like $n^{\alpha - 1}$ and a squared bias that decays like $n^{-2\alpha}$. Consequently, the MSE of the estimates is of order $n^{\beta}$ where $\beta = \max(\alpha - 1, -2\alpha)$. The optimal choice of the exponent, $\alpha = 1/3$, results in an MSE of order $n^{-2/3}$. While this asymptotic rate is the same as under the CORP approach, the CORP reliability diagram is preferable in finite samples, as we now demonstrate.
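The step from the variance and bias rates to the optimal exponent can be spelled out in a short worked derivation. It uses only the rates stated above; the values of β for α = 1/6 and α = 1/2 are computed here for comparison and are not quoted from the source.

```latex
\begin{align*}
\mathrm{MSE}(n) &\asymp \underbrace{n^{\alpha - 1}}_{\text{variance}}
                 + \underbrace{n^{-2\alpha}}_{\text{squared bias}}
                 \asymp n^{\beta},
  \qquad \beta = \max(\alpha - 1,\, -2\alpha). \\
\intertext{Since $\alpha - 1$ increases and $-2\alpha$ decreases in $\alpha$,
the maximum is smallest where the two exponents coincide:}
\alpha - 1 = -2\alpha
  \;&\Longrightarrow\; \alpha = \tfrac{1}{3},
  \qquad \beta = -\tfrac{2}{3}. \\
\intertext{For comparison, the slower rates under other exponents are}
\alpha = \tfrac{1}{6} &: \quad \beta = \max\!\left(-\tfrac{5}{6},\, -\tfrac{1}{3}\right) = -\tfrac{1}{3}, \\
\alpha = \tfrac{1}{2} &: \quad \beta = \max\!\left(-\tfrac{1}{2},\, -1\right) = -\tfrac{1}{2}.
\end{align*}
```

With too few bins (small α) the squared bias dominates, and with too many bins (large α) the variance dominates, so α = 1/3 balances the two.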
In Fig. 5 we detail a comparison of CORP reliability diagrams to the binning and counting approach with either a fixed number m of bins, or $m = m(n) = [n^{\alpha}]$ empirical-quantile dependent bins, where [x] denotes the largest integer less than or equal to x ∈ R. For this, we plot the empirical mean squared error (MSE) of the various CEP estimates against the sample size n, using settings described in Appendix A. Across columns, the distributions of the forecast values differ in shape; across rows, we are in the discrete setting with k = 10 and 50 unique forecast values, and in the continuous setting, respectively. Throughout, the CORP reliability diagrams exhibit the smallest MSE, uniformly over all sample sizes and against all alternative methods, with the superiority being the most pronounced under non-uniform forecast distributions with many unique forecast values, as frequently generated by statistical or machine learning techniques.
where the UNC component is the same as in the CORP decomposition in Eq. [3]. Furthermore, subject to mild conditions, the decompositions agree in full.

Figure 1: Reliability diagrams for probability of precipitation forecasts over Niamey, Niger (7) in July-September 2016 under (a,b) ENS, (c,d) EPC, and (e,f) Logistic methods. At left (a,c,e), we show reliability diagrams under the binning and counting approach with a choice of ten equally spaced bins. At right (b,d,f), we show CORP reliability diagrams with uncertainty quantification through 90% consistency bands. The histograms at bottom illustrate the distribution of the n = 86 forecast values.

Figure 3: CORP reliability diagrams in the setting of (a,b) discretely and (c,d) continuously, uniformly distributed, simulated predictive probabilities x with a true, miscalibrated CEP of $\sqrt{x}$, with uncertainty quantification via (a,c) consistency and (b,d) confidence bands at the 90% level.

Figure 4: Empirical coverage, averaged equally over the forecast values, of 90% uncertainty bands for CORP reliability diagrams under default choices for 1000 simulation replicates. The upper row concerns consistency bands, and the lower row confidence bands. The columns correspond to three types of marginal distributions for the forecast values, and colors distinguish discrete and continuous settings, as described in Appendix A. Different symbols denote reliance of the bands on resampling, discrete, or continuous asymptotic distribution theory.

Figure 5: Mean squared error (MSE) of the CEP estimates in CORP reliability diagrams for samples of size n, in comparison to the binning and counting approach with m = 5, 10, or 50 fixed bins, or $m(n) = [n^{\alpha}]$ quantile-spaced bins, where α = 1/6, 1/3, or 1/2. Note the log-log scale. The simulation settings are described in Appendix A, and MSE values are averaged over 1000 replicates.
calibrated forecast, and ȳ as the reference forecast, then becomes

Theorem 2
Under the Brier score, if the sequence $o_1/n_1, \ldots, o_k/n_k$ is nondecreasing, then MCB = REL and DSC = RES, respectively.
Proof. As the sequence $o_1/n_1, \ldots, o_k/n_k$ is nondecreasing, the PAV-calibrated probabilities satisfy $\hat{z}_j = o_j/n_j$ for j = 1, ..., k. Adopting the arguments in the Appendix of ref. (34), we see that MCB = $\bar{S}_X - \bar{S}_C$ = REL and DSC = RES.
Stephenson et al. (17, p. 757) call for the development of "nonparametric approaches for estimating the reliability curves (and hence the Brier score components), which

Table 1: Scoring rules for probabilistic forecasts of binary events

Table 2: CORP Brier score decomposition for the probability of precipitation forecasts in Figs. 1 and 2.