Proving prediction prudence

We study how to perform tests on samples of pairs of observations and predictions in order to assess whether or not the predictions are prudent. Prudence requires that the mean of the differences of the observation-prediction pairs can be shown to be significantly negative. For safe conclusions, we suggest testing both unweighted (or equally weighted) and weighted means and explicitly taking into account the randomness of individual pairs. The test methods presented are mainly specified as bootstrap and normal approximation algorithms. The tests are general but can be applied in particular in the area of credit risk, both for regulatory and accounting purposes.


Introduction
Testing if the means of two samples significantly differ or the mean of one sample significantly exceeds the mean of the other sample is a problem that is widely covered in the statistical literature [see for instance Casella and Berger, 2002, Davison and Hinkley, 1997, Venables and Ripley, 2002]. In this paper, we study how to perform such tests on samples of pairs of observations and predictions in order to assess whether or not the predictions are prudent. Prudence is here understood as the requirement that the mean of the differences of the observations and predictions can be shown to be significantly negative.
At the latest with the validation requirements for credit risk parameter estimates in the regulatory Basel II framework [BCBS, 2006, paragraph 501], such tests also became an important issue in the banking industry: "Banks must regularly compare realised default rates with estimated PDs for each grade and be able to demonstrate that the realised default rates are within the expected range for that grade", and "banks using the advanced IRB approach must complete such analysis for their estimates of LGDs and EADs".
More recently, as a consequence of the introduction of new rules for loss provisioning in financial reporting standards, the validation of risk parameter estimates also attracted interest in the accounting community [see, e.g., Bellini, 2019]. Over the course of the past fifteen years or so, a variety of statistical tests for the comparison of realised and predicted values have been proposed for use in the banks' validation exercises. For overviews on estimation and validation as well as references see Blümke [2019, PD], Loterman et al. [2014, LGD], and Gürtler et al. [2018, EAD]. Scandizzo [2016] presents validation methods for all these kinds of parameters in the general context of model risk management.
In order to make validation results by different banks to some extent comparable, in February 2019, the European Central Bank [ECB, 2019] asked the banks it supervises under the Single Supervisory Mechanism (SSM) to deliver standardised annual reports on their internal model validation exercises. In particular, the requested reports are supposed to include data and tests regarding the "predictive ability (or calibration)" of PD, LGD and CCF (credit conversion factor) parameters in the most recent observation period. Predictive ability for LGD estimation is explained through the statement "the analysis of predictive ability (or calibration) is aimed at ensuring that the LGD parameter adequately predicts the loss rate in the event of a default i.e. that LGD estimates constitute reliable forecasts of realised loss rates" [ECB, 2019, Section 2.6.2]. The meanings of predictive ability for PD and EAD / CCF respectively are illustrated in similar ways.
ECB [2019] proposed "one-sample t-test[s] for paired observations" to test the "null hypothesis that estimated LGD [or CCF or EAD] is greater than true LGD" (or CCF or EAD). ECB [2019] also suggested a Jeffreys binomial test for the "null hypothesis that the PD applied in the portfolio/rating grade at the beginning of the relevant observation period is greater than the true one (one sided hypothesis test)".
Recall that the possible outcomes of testing a null hypothesis against an alternative are 'the null hypothesis is not rejected' or 'the null is rejected and the alternative is accepted'. Not rejecting the null hypothesis does not mean accepting it because in hypothesis testing the type II error (not rejecting the null hypothesis although the alternative is true) cannot be controlled and, therefore, can be rather large. In contrast, the type I error (rejecting the null hypothesis although it is true) can be controlled and usually is kept small by choosing a significance level like 5% or 1%. Hence, if the null hypothesis is rejected the alternative can be accepted at properly controlled risk. In the following, we understand the acceptance of an alternative hypothesis by rejection of the null hypothesis as statistical 'proof' with an error probability tag (i.e. the significance level or p-value).
In this paper,
• we make a case for also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter in order to be able to 'prove' that the estimate is prudent (or conservative),
• we suggest additionally using exposure- (or limit-)weighted sample averages in order to better inform assessments of estimation (or prediction) prudence, and
• we propose more elaborate statements of the hypotheses for the tests (by including 'variance expansion') in order to account for portfolio inhomogeneity in terms of composition (exposure sizes) and riskiness.
The proposal to look for a 'proof' of prediction prudence is inspired by the regulatory requirement [BCBS, 2006, paragraph 451]: "In order to avoid over-optimism, a bank must add to its estimates a margin of conservatism that is related to the likely range of errors". As a matter of fact, the statistical tests discussed in this paper can be deployed both for proving prudence and for proving aggressiveness of estimates. However, an asymmetric approach is recommended for making use of the evidence from the tests:
• For proving prudence, request that both the equal-weights test and the exposure-weighted test reject the null hypothesis of the parameter being aggressive.
• For an alert of potential aggressiveness, request only that the equal-weights test or the exposure-weighted test reject the null hypothesis of the parameter being prudent.
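The asymmetric decision rule above can be summarised in a small helper function. This is an illustrative sketch only; the function name and the packaging of the four p-values into arguments are assumptions of the sketch, not part of any regulatory prescription.

```python
def prudence_verdict(p_eq_prudent, p_w_prudent,
                     p_eq_aggressive, p_w_aggressive, alpha=0.05):
    """Asymmetric use of the test evidence (hypothetical helper).

    p_*_prudent: p-values for rejecting the null 'parameter is aggressive'
    p_*_aggressive: p-values for rejecting the null 'parameter is prudent'
    """
    # proving prudence requires BOTH tests to reject
    proven_prudent = p_eq_prudent < alpha and p_w_prudent < alpha
    # an aggressiveness alert requires only ONE test to reject
    aggressiveness_alert = p_eq_aggressive < alpha or p_w_aggressive < alpha
    if proven_prudent:
        return "prudent"
    if aggressiveness_alert:
        return "alert: potentially aggressive"
    return "no conclusion"
```

Note the deliberate imbalance: the prudence verdict uses a conjunction of rejections while the alert uses a disjunction, mirroring the two bullets above.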
The paper is organised as follows:
• In Section 2, we introduce a general non-parametric paired difference test approach to testing for the sign of a weighted mean value (Section 2.1). We compare this approach to the t-test for LGD, CCF and EAD proposed in ECB [2019] and note possible improvements of both approaches (Section 2.2). We then present in Section 2.3 a test approach to put these improvements into practice in the case of variables with values in the unit interval like LGD and CCF. Appendices A and B supplement Section 2.3 with regard to weight-adjustments as an alternative to sampling with inhomogeneous weights and to testing non-negative but not necessarily bounded variables like EAD.
• In Section 3, we discuss paired difference tests in the special case of differences between observed event indicators and the predicted probabilities of the events. We start in Section 3.1 with the presentation of a test approach that takes account of potential weighting of the observation pairs and of variance expansion to deal with the individual randomness of the observations. In Section 3.2, we compare this test approach to the Jeffreys test proposed in ECB [2019] for assessing the 'predictive ability' of PD estimates.
• In Section 4, the test methods presented in the preceding sections are illustrated with two examples of test results.
• Section 5 concludes the paper with summarising remarks.

Paired difference tests
The statistical tests considered in this paper are 'paired difference tests'. This test design accounts for the strong dependence that is to be expected between the observation and the prediction in the matched observation-prediction pairs which the analysed samples consist of. See Mendenhall et al. [2008, Chapter 10] for a discussion of the advantages of such test designs. (ECB [2019] presumably only looks at "number-weighted", i.e. equally weighted, averages because the Basel framework [BCBS, 2006] requires such averages for the risk parameter estimates. In banking practice, however, exposure-weighted averages are also considered [see, e.g., Li et al., 2009].)

Basic approach
Starting point. We consider a sample ∆_1, …, ∆_n of differences of matched observations and predictions, together with weights w_1, …, w_n ≥ 0 such that ∑_{i=1}^n w_i = 1, and the weighted mean

∆_w = ∑_{i=1}^n w_i ∆_i. (2.1)

• ∆_1, …, ∆_n may be a sample of differences (residuals) between observed and predicted LGD (or CCF or EAD) for defaulted credit facilities (matched pairs of observations and predictions).
• The weight w_i reflects the relative importance of observation i. For instance, in the case of CCF or EAD estimates of credit facilities, one might choose

w_i = limit_i / ∑_{j=1}^n limit_j, (2.2a)

where limit_j is the limit of credit facility j at the time when the estimates were made.
• In the case of LGD estimates, the weights w_i could be chosen as [Li et al., 2009, Section 5]

w_i = EAD_i / ∑_{j=1}^n EAD_j, (2.2b)

where EAD_j is the exposure at default estimate for credit facility j at the time when the estimates were made.
Goal. We consider ∆_w as defined by (2.1) as the realisation of a test statistic to be defined below and want to answer the following two questions:
• If ∆_w < 0, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If ∆_w > 0, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In order to be able to examine the properties of the sample and ∆_w with statistical methods, we have to make the assumption that the sample was generated by some random mechanism. The key idea for the mechanism is to interpret the weights w_i as the probabilities of the corresponding observations ∆_i. Consequently, we look at an inhomogeneous version of the empirical distribution of the sample ∆_1, …, ∆_n, with the weight w_i replacing 1/n as the probability of observation ∆_i. The details of the mechanism are described in the following assumption.

Assumption 2.1. The sample ∆_1, …, ∆_n consists of independent realisations of a random variable X_ϑ with distribution given by

P[X_ϑ = ∆_i − ϑ] = w_i, i = 1, …, n, (2.3)

where the value of the parameter ϑ ∈ R is unknown.
Note that (2.3) includes the case of equally weighted observations, by choosing w_i = 1/n for all i.
Proposition 2.2. For X_ϑ as described in Assumption 2.1, the expected value and the variance are given by

E[X_ϑ] = ∆_w − ϑ, (2.4a)
var[X_ϑ] = ∑_{i=1}^n w_i ∆_i² − ∆_w². (2.4b)

Proof. Obvious.

By Assumption 2.1 and Proposition 2.2, the questions on the safety of conclusions from the sign of ∆_w can be translated into hypotheses on the value of the parameter ϑ:
• If ∆_w < 0, can we conclude that H_0: ϑ ≤ ∆_w is false and H_1: ϑ > ∆_w ⇔ E[X_ϑ] < 0 is true?
• If ∆_w > 0, can we conclude that H*_0: ϑ ≥ ∆_w is false and H*_1: ϑ < ∆_w ⇔ E[X_ϑ] > 0 is true?
If we assume that the sample ∆_1, …, ∆_n was generated by independent realisations of X_ϑ then the distribution of the sample mean is different from the distribution of X_ϑ, as shown in the following corollary to Proposition 2.2.
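As a plain illustration of Proposition 2.2 (for ϑ = 0, the case used by the bootstrap below), the moments of the weighted empirical distribution can be computed directly from the sample and the weights. The sketch assumes the weights sum to one; the function name is illustrative.

```python
def weighted_moments(deltas, weights):
    """Mean and variance of the weighted empirical distribution that
    puts probability w_i on observation delta_i (weights sum to 1)."""
    mean = sum(w * d for w, d in zip(weights, deltas))
    # variance = weighted second moment minus squared weighted mean
    var = sum(w * d * d for w, d in zip(weights, deltas)) - mean * mean
    return mean, var
```

For general ϑ, the mean is shifted by −ϑ while the variance is unchanged, which is why a single routine suffices.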
Corollary 2.3. Let X_{1,ϑ}, …, X_{n,ϑ} be independent and identically distributed copies of X_ϑ as in Assumption 2.1 and define X̄_ϑ = (1/n) ∑_{i=1}^n X_{i,ϑ}. Then for the mean and the variance of X̄_ϑ, it holds that

E[X̄_ϑ] = ∆_w − ϑ, (2.5a)
var[X̄_ϑ] = (1/n) (∑_{i=1}^n w_i ∆_i² − ∆_w²). (2.5b)

In the following, we use X̄_ϑ as the test statistic and interpret ∆_w as its observed value. Next we describe a bootstrap test to answer the above questions under Assumption 2.1 and then provide the rationale behind its design.
Bootstrap test. Generate a Monte Carlo sample x̄_1, …, x̄_R from ∆_1, …, ∆_n as follows:
• For j = 1, …, R: x̄_j is the equally weighted mean of n independent draws from the distribution of X_ϑ as given by (2.3), with ϑ = 0. Equivalently, x̄_j is the mean of n draws with replacement from the sample ∆_1, …, ∆_n, where ∆_i is drawn with probability w_i.
• x̄_1, …, x̄_R are realisations of independent, identically distributed random variables.
(See Appendix A for a more detailed discussion of special cases with equal weights.)
For arithmetic reasons, most of the time ∆_w actually cannot be a realisation of X̄_ϑ. As long as the sample size n is not too small, however, by (2.5a) and the law of large numbers, considering ∆_w as a realisation of X̄_ϑ is not unreasonable. According to Davison and Hinkley [1997, Section 5.2.3], a sample size of R = 999 should suffice for the purposes of this paper.
Then a bootstrap p-value for the test of H_0: ϑ ≤ ∆_w against H_1: ϑ > ∆_w can be calculated as

p-value = (1 + #{j ∈ {1, …, R}: x̄_j − ∆_w ≤ ∆_w}) / (R + 1). (2.6a)

A bootstrap p-value for the test of H*_0: ϑ ≥ ∆_w against H*_1: ϑ < ∆_w is given by

p-value = (1 + #{j ∈ {1, …, R}: x̄_j − ∆_w ≥ ∆_w}) / (R + 1). (2.6b)

Rationale. By (2.3), for each ϑ the distributions of X_0 − ϑ and X_ϑ are identical. As a consequence, if under H_0 the true parameter is ϑ ≤ ∆_w and (−∞, x] is the critical (rejection) range for the test of H_0 against H_1 based on the test statistic X̄_ϑ, then it holds that

P_ϑ[X̄_ϑ ≤ x] = P[X̄_0 ≤ x + ϑ] ≤ P[X̄_0 ≤ x + ∆_w]. (2.7)

Hence, by Theorem 8.3.27 of Casella and Berger [2002], in order to obtain a p-value for H_0: ϑ ≤ ∆_w against H_1: ϑ > ∆_w, according to (2.7) it suffices to specify:
• the upper limit x of the critical range for rejection of H_0: ϑ ≤ ∆_w as the 'observed' value ∆_w of X̄_ϑ, and
• an approximation of the distribution of X̄_0, as it is done by generating the bootstrap sample x̄_1, …, x̄_R.
This implies Equation (2.6a) for the bootstrap p-value of the test of H_0 against H_1. The rationale for (2.6b) is analogous.
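The bootstrap algorithm can be sketched in a few lines of standard-library code. The centring convention (each bootstrap mean x̄_j is shifted by the weighted sample mean before comparison with the observed value) reflects one reading of the rationale above and should be treated as an assumption of this sketch, as should the function name.

```python
import random

def bootstrap_p_values(deltas, weights, R=999, seed=1):
    """Bootstrap p-values for both one-sided paired difference tests.

    Returns (p_prudent, p_aggressive): p_prudent tests the null that the
    predictions are aggressive, p_aggressive the null that they are prudent.
    """
    rng = random.Random(seed)
    n = len(deltas)
    d_w = sum(w * d for w, d in zip(weights, deltas))  # observed statistic
    count_low = 0
    count_high = 0
    for _ in range(R):
        # n draws with replacement, observation i drawn with probability w_i
        draw = rng.choices(deltas, weights=weights, k=n)
        x_bar = sum(draw) / n
        if x_bar - d_w <= d_w:   # centred bootstrap mean at most observed value
            count_low += 1
        if x_bar - d_w >= d_w:
            count_high += 1
    p_prudent = (1 + count_low) / (R + 1)
    p_aggressive = (1 + count_high) / (R + 1)
    return p_prudent, p_aggressive
```

With a sample of clearly negative residuals and little spread, the prudence p-value comes out small while the aggressiveness p-value is close to one.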
Normal approximate test. By Corollary 2.3 for ϑ = ∆_w, we find that the distribution of X̄_{∆_w} can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.5b). With x = ∆_w, therefore, we obtain the following expression for the normal approximate p-value of H_0: ϑ ≤ ∆_w against H_1: ϑ > ∆_w:

p-value = Φ( ∆_w / √( (1/n) (∑_{i=1}^n w_i ∆_i² − ∆_w²) ) ). (2.8a)

Here Φ denotes the standard normal distribution function. The same reasoning gives for the normal approximate p-value of H*_0: ϑ ≥ ∆_w against H*_1: ϑ < ∆_w:

p-value = 1 − Φ( ∆_w / √( (1/n) (∑_{i=1}^n w_i ∆_i² − ∆_w²) ) ). (2.8b)

In (2.6a) and (2.6b), #S denotes the number of elements of the set S. For the bootstrap p-values we adopt the definition provided by Davison and Hinkley [1997, Eq. (4.11)].

The t-test approach
In Sections 2.6.2 (for LGD back-testing), 2.9.3.1 (for CCF back-testing) and 2.9.3.2 (for EAD back-testing) of ECB [2019], the ECB proposes a t-test for (in the terms of Section 2.1 of this paper) H*_0: ϑ ≥ ∆_w against H*_1: ϑ < ∆_w. Transcribed into the notation of Section 2.1, the test can be described as follows:
• n is the number of matched pairs of observations and predictions in the sample.
• ∆_i is the difference of the realised LGD and the estimated LGD for facility i in ECB Section 2.6.2, of the realised CCF and the estimated CCF for facility i in ECB Section 2.9.3.1, and of the drawings (balance sheet exposure) at the time of default and the estimated EAD for facility i in ECB Section 2.9.3.2.
• All w i equal 1/n.
• The right-hand side of (2.5b) is replaced by the sample variance

s²_∆ = (1/(n−1)) ∑_{i=1}^n (∆_i − ∆̄)², with ∆̄ = (1/n) ∑_{i=1}^n ∆_i.

• The p-value is computed as

p-value = 1 − Ψ_{n−1}( √n ∆̄ / s_∆ ), (2.9)

where Ψ_{n−1} denotes the distribution function of Student's t-distribution with n − 1 degrees of freedom.
By the Central Limit Theorem, the p-values according to (2.6b), (2.8b) and (2.9) will come out almost identical for large sample sizes n and equal weights w i = 1/n for all i = 1, . . . , n. For smaller n, the value of (2.9) would be exact if the variables X i,ϑ in Corollary 2.3 were normally distributed.
Criticisms of the basic approach. The basic approach as described in Sections 2.1 and 2.2 fails to take account of the following issues:
• The random mechanism reflected by (2.3) can be interpreted as an expression of uncertainty about the cohort / portfolio composition. The randomness of the loss rate / exposure of the individual facilities (the degree of which can potentially differ between facilities) is not captured by (2.3).
• The parametrisation of the distribution by a location parameter in (2.3) could result in distributions with features that are not realistic, for instance negative exposures or loss rates greater than one.
In the following section and in Appendix B, we are going to modify the basic approach for LGD / CCF on the one hand and EAD on the other hand in such a way as to take into account these two issues.

Tests for variables with values in the unit interval
By definition, both LGD and CCF take values only in the unit interval [0, 1]. This fact allows for more specific tests than the ones considered in the previous sections. In this section, we talk only about LGD most of the time. But the concepts discussed also apply, with little or no modification, to CCF or any other variable with values in the unit interval.

Starting point. We consider a sample of matched pairs (ℓ_1, λ_1), …, (ℓ_n, λ_n) of realised loss rates ℓ_i ∈ [0, 1] and predicted LGDs λ_i ∈ (0, 1), with differences ∆_i = ℓ_i − λ_i, weights w_i as in Section 2.1, and weighted means ℓ_w = ∑_{i=1}^n w_i ℓ_i and λ_w = ∑_{i=1}^n w_i λ_i.
Interpretation in the context of LGD back-testing.
• A sample of n defaulted credit facilities / loans is analysed.
• The LGD λ_i is an estimate of loan i's loss rate as a consequence of the default, measured as a percentage of the exposure at the time of default (EAD).
• The realised loss rate ℓ_i shows the percentage of loan i's exposure at the time of default that cannot be recovered.
• The weight w_i reflects the relative importance of observation i. In the case of LGD predictions, one might choose (2.2b) for the definition of the weights; for CCF one might choose (2.2a) instead.
Goal. We want to use the observed weighted average difference / residual ∆_w = ∑_{i=1}^n w_i ∆_i = ℓ_w − λ_w to assess the quality of the calibration of the model / approach for the λ_i to predict the realised loss rates ℓ_i. Again we want to answer the following two questions:
• If ∆_w < 0, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If ∆_w > 0, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In order to be able to examine the specific properties of the sample and ∆ w with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumption 2.1.
Assumption 2.4. The sample ∆_1, …, ∆_n consists of independent realisations of a random variable X_ϑ with distribution given by

X_ϑ = ℓ_I − Y_ϑ, (2.10a)

where I is a random variable with values in {1, …, n} and P[I = i] = w_i, and where, conditional on I = i, the variable Y_ϑ is beta(α_i, β_i)-distributed. (See Casella and Berger [2002, Section 3.3] for a definition of the beta-distribution.) The parameters α_i and β_i of the beta-distribution depend on the unknown parameter 0 < ϑ < 1 by

α_i = ϑ_i (1 − v)/v, β_i = (1 − ϑ_i) (1 − v)/v, i = 1, …, n. (2.10b)

In (2.10b), the constant 0 < v < 1 is the same for all i. The ϑ_i are determined by

ϑ_i = λ_i^{h(ϑ)}, i = 1, …, n, (2.10c)

where h(ϑ) > 0 is the unique solution of ∑_{i=1}^n w_i λ_i^{h(ϑ)} = ϑ.

Assumption 2.4 introduces randomness of the difference between loss rate and LGD prediction for individual facilities. Comparison between (2.13b) below and (2.4b) shows that this entails variance expansion of the sample ∆_1, …, ∆_n.
Note that Assumption 2.4 also describes a method for recalibration of the LGD estimates λ_1, …, λ_n to match a target ϑ with the weighted average of the ϑ_i. In contrast to (2.3), the transformation (2.10c) ensures that the transformed LGD parameters are still values in the unit interval. By definition of the ϑ_i, it holds that

∑_{i=1}^n w_i ϑ_i = ϑ. (2.11)

The constant v must be pre-defined or separately estimated. We suggest estimating it from the sample ℓ_1, …, ℓ_n as

v̂ = (∑_{i=1}^n w_i ℓ_i² − ℓ_w²) / (ℓ_w (1 − ℓ_w)).

This approach yields 0 ≤ v̂ ≤ 1 because the fact that 0 ≤ ℓ_i ≤ 1, i = 1, …, n, implies ∑_{i=1}^n w_i ℓ_i² ≤ ∑_{i=1}^n w_i ℓ_i = ℓ_w. A simpler alternative to the definition (2.10c) of the ϑ_i would be linear scaling: ϑ_i = λ_i ϑ / λ_w. However, with this definition values ϑ_i > 1 may occur. This is not desirable because then the beta-distribution for Y_ϑ | I = i would be ill-defined.
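The exponent h(ϑ) has no closed form, but because ∑_i w_i λ_i^h is strictly decreasing in h when all λ_i lie in the open unit interval, it can be found by bisection. The sketch below assumes the power-transform reading of (2.10c); the function name is illustrative.

```python
def calibrate_exponent(lambdas, weights, theta):
    """Solve sum_i w_i * lambda_i**h = theta for h > 0 by bisection.

    Assumes 0 < lambda_i < 1 and 0 < theta < 1, so that the left-hand
    side decreases strictly from 1 (at h = 0) to 0 (as h grows)."""
    def f(h):
        return sum(w * (lam ** h) for w, lam in zip(weights, lambdas)) - theta
    lo, hi = 0.0, 1.0
    while f(hi) > 0:      # enlarge the bracket until a sign change occurs
        hi *= 2.0
    for _ in range(200):  # bisection: f(lo) >= 0 >= f(hi) throughout
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The recalibrated parameters are then ϑ_i = λ_i ** h, and their weighted mean reproduces the target ϑ by construction.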
Proposition 2.5. For X_ϑ as described in Assumption 2.4, the expected value and the variance are given by

E[X_ϑ] = ℓ_w − ϑ, (2.13a)
var[X_ϑ] = v ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (ℓ_i − ϑ_i)² − (ℓ_w − ϑ)². (2.13b)

Proof. For deriving the formula for var[X_ϑ], make use of the well-known variance decomposition var[X_ϑ] = E[var[X_ϑ | I]] + var[E[X_ϑ | I]].

In contrast to (2.4b), the variance of X_ϑ as shown in (2.13b) depends on the parameter ϑ and has an additional component v ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) which reflects the potentially different variances of the loss rates in an inhomogeneous portfolio.
By Assumption 2.4 and Proposition 2.5, the questions on the safety of conclusions from the sign of ∆_w = ℓ_w − λ_w again can be translated into hypotheses on the value of the parameter ϑ:
• If ∆_w < 0, can we conclude that H_0: ϑ ≤ ℓ_w is false and H_1: ϑ > ℓ_w is true?
• If ∆_w > 0, can we conclude that H*_0: ϑ ≥ ℓ_w is false and H*_1: ϑ < ℓ_w is true?
If we assume that the sample ∆_1, …, ∆_n was generated by independent realisations of X_ϑ then the distribution of the sample mean is different from the distribution of X_ϑ, as shown in the following corollary to Proposition 2.5.
Corollary 2.6. Let X_{1,ϑ}, …, X_{n,ϑ} be independent and identically distributed copies of X_ϑ as in Assumption 2.4 and define X̄_ϑ = (1/n) ∑_{i=1}^n X_{i,ϑ}. Then for the mean and variance of X̄_ϑ, it holds that

E[X̄_ϑ] = ℓ_w − ϑ, (2.14a)
var[X̄_ϑ] = var[X_ϑ]/n, with var[X_ϑ] as in (2.13b). (2.14b)

In the following, we use X̄_ϑ as the test statistic and interpret ∆_w = ℓ_w − λ_w as its observed value.
Proposition 2.7. In the setting of Assumption 2.4 and Corollary 2.6, ϑ ≤ ϑ′ implies that P[X̄_ϑ ≤ x] ≤ P[X̄_{ϑ′} ≤ x] for all x ∈ R.

Proof. Observe that ϑ ≤ ϑ′ implies ϑ_i ≤ ϑ′_i for all i = 1, …, n. For fixed i, the family of beta(α_i, β_i)-distributions, parametrised by ϑ ∈ (0, 1), has a monotone likelihood ratio in the sense of Definition 8.3.16 of Casella and Berger [2002]. This implies that for ϑ ≤ ϑ′, conditional on I = i, the distribution of Y_{ϑ′} is stochastically not less than the distribution of Y_ϑ, i.e. it holds that P[Y_{ϑ′} ≤ x | I = i] ≤ P[Y_ϑ ≤ x | I = i] for all x ∈ R.

From this, it follows for all i = 1, …, n that P[X_{ϑ′} ≤ x | I = i] ≥ P[X_ϑ ≤ x | I = i] for all x ∈ R.

But this inequality implies for all x ∈ R that

P[X_{ϑ′} ≤ x] ≥ P[X_ϑ ≤ x]. (2.15)

Property (2.15) is passed on to convolutions of independent copies of X_ϑ and X_{ϑ′}. This proves the assertion.
Bootstrap test. Generate a Monte Carlo sample x̄_1, …, x̄_R from X_ϑ with ϑ = ℓ_w as follows:
• For j = 1, …, R: x̄_j is the equally weighted mean of n independent draws from the distribution of X_ϑ as given by Assumption 2.4, with ϑ = ℓ_w.
• x̄_1, …, x̄_R are realisations of independent, identically distributed random variables.
Then a bootstrap p-value for the test of H_0: ϑ ≤ ℓ_w against H_1: ϑ > ℓ_w can be calculated as

p-value = (1 + #{j ∈ {1, …, R}: x̄_j ≤ ∆_w}) / (R + 1). (2.16a)

A bootstrap p-value for the test of H*_0: ϑ ≥ ℓ_w against H*_1: ϑ < ℓ_w is given by

p-value = (1 + #{j ∈ {1, …, R}: x̄_j ≥ ∆_w}) / (R + 1). (2.16b)

Rationale. By Proposition 2.7, if under H_0 the true parameter is ϑ ≤ ℓ_w and (−∞, x] is the critical (rejection) range for the test of H_0 against H_1 based on the test statistic X̄_ϑ, then it holds that

P_ϑ[X̄_ϑ ≤ x] ≤ P[X̄_{ℓ_w} ≤ x]. (2.17)

Hence, by Theorem 8.3.27 of Casella and Berger [2002], in order to obtain a p-value for H_0: ϑ ≤ ℓ_w against H_1: ϑ > ℓ_w, according to (2.17) it suffices to specify:
• the upper limit x of the critical range for rejection of H_0: ϑ ≤ ℓ_w as our realisation ∆_w = ℓ_w − λ_w of X̄_ϑ, and
• an approximation of the distribution of X̄_{ℓ_w}, as it has been done by generating the bootstrap sample x̄_1, …, x̄_R.
This implies Equation (2.16a) for the bootstrap p-value. The rationale for (2.16b) is analogous.
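A runnable sketch of this bootstrap follows, with two simplifying assumptions flagged in the comments: the recalibrated means ϑ_i are set homogeneously to ℓ_w (instead of the power transform of Assumption 2.4), and the beta parameters are matched to mean ϑ_i and variance v·ϑ_i·(1 − ϑ_i). Function and variable names are illustrative.

```python
import random

def beta_bootstrap_p(losses, lambdas, weights, v, R=499, seed=7):
    """Bootstrap p-value for the prudence direction: small values reject
    the null hypothesis that the LGD predictions are aggressive.

    Simplification: all recalibrated means theta_i equal l_w, and the
    beta parameters are moment-matched (mean l_w, variance v*l_w*(1-l_w)).
    """
    rng = random.Random(seed)
    n = len(losses)
    l_w = sum(w * l for w, l in zip(weights, losses))
    lam_w = sum(w * lam for w, lam in zip(weights, lambdas))
    d_w = l_w - lam_w                 # observed value of the test statistic
    c = (1.0 - v) / v                 # alpha + beta of each beta draw
    alpha, beta = l_w * c, (1.0 - l_w) * c
    idx = list(range(n))
    count = 0
    for _ in range(R):
        total = 0.0
        for i in rng.choices(idx, weights=weights, k=n):
            # one draw of X: observed loss rate minus a random beta loss rate
            total += losses[i] - rng.betavariate(alpha, beta)
        if total / n <= d_w:
            count += 1
    return (1 + count) / (R + 1)
```

When the predictions clearly exceed the realised loss rates the p-value is small; when predictions equal the realisations it hovers around one half.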
Normal approximate test. By Corollary 2.6, we find that the distribution of X̄_{ℓ_w} can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (2.13b), divided by n, with ϑ = ℓ_w. With x = ℓ_w − λ_w, one obtains for the normal approximate p-value of H_0: ϑ ≤ ℓ_w against H_1: ϑ > ℓ_w:

p-value = Φ( (ℓ_w − λ_w) / √( (1/n) ( v ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (ℓ_i − ϑ_i)² ) ) ), (2.18a)

with ϑ_i = (λ_i)^{h(ℓ_w)} as in Assumption 2.4. The same reasoning gives for the normal approximate p-value of H*_0: ϑ ≥ ℓ_w against H*_1: ϑ < ℓ_w:

p-value = 1 − Φ( (ℓ_w − λ_w) / √( (1/n) ( v ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (ℓ_i − ϑ_i)² ) ) ). (2.18b)

Tests of probabilities
Starting point. We consider a sample ∆_1, …, ∆_n of differences ∆_i = b_i − p_i of observed default indicators b_i ∈ {0, 1} and predicted probabilities of default (PDs) p_i ∈ (0, 1), with weights w_i as in Section 2.1 and weighted means b_w = ∑_{i=1}^n w_i b_i and p_w = ∑_{i=1}^n w_i p_i.
Interpretation in the context of PD back-testing.
• A sample of n borrowers is observed for a certain period of time, most commonly one year.
• The PD p_i is an estimate of borrower i's probability to default during the observation period, estimated before the beginning of the period.
• The status indicator b_i shows borrower i's performance status at the end of the observation period: b_i = 1 means 'borrower has defaulted', b_i = 0 means 'borrower is performing'.
• The weight w_i reflects the relative importance of observation i. In the case of default predictions, one might choose weights as in (2.2b).
Goal. We want to use the observed weighted average difference / residual ∆_w = ∑_{i=1}^n w_i ∆_i = b_w − p_w to assess the quality of the calibration of the model / approach for the p_i to predict the realised status indicators b_i. Again we want to answer the following two questions:
• If ∆_w < 0, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. the predictions are prudent / conservative?
• If ∆_w > 0, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. the predictions are aggressive?
The safety of such conclusions is measured by p-values which provide error probabilities for the conclusions to be wrong. The lower the p-value, the more likely the conclusion is right. In determining the p-values, we take into account the criticisms of the basic approach as mentioned at the end of Section 2.2.

Testing probabilities on inhomogeneous samples
In order to be able to examine the PD-specific properties of the sample and ∆ w = b w − p w with statistical methods, we have to make the assumption that the sample was generated with some random mechanism. This mechanism is described in the following modification of Assumptions 2.1 and 2.4.
Assumption 3.1. The sample ∆_1, …, ∆_n consists of independent realisations of a random variable X_ϑ with distribution given by

X_ϑ = b_I − Y_ϑ, (3.1a)

where I is a random variable with values in {1, …, n} and P[I = i] = w_i, and where, conditional on I = i, the variable Y_ϑ is Bernoulli-distributed with

P[Y_ϑ = 1 | I = i] = ϑ_i. (3.1b)

The ϑ_i depend on the unknown parameter 0 < ϑ < 1 by

ϑ_i = h ϱ_i / (1 + h ϱ_i), with ϱ_i = p_i / (1 − p_i), (3.1c)

where h = h(ϑ) > 0 is the unique solution of

∑_{i=1}^n w_i h ϱ_i / (1 + h ϱ_i) = ϑ. (3.1d)

Assumption 3.1 introduces randomness of the difference between status indicator and PD prediction for individual facilities. Comparison between (3.2b) below and (2.4b) shows that this entails variance expansion of the sample ∆_1, …, ∆_n. Note that Assumption 3.1 also describes a method for recalibration of the PD estimates p_1, …, p_n to match a target ϑ with the weighted average of the ϑ_i. In contrast to (2.3), the transformation (3.1c) ensures that the transformed PD parameters are still values in the unit interval. In principle, instead of (3.1c) also the transformation (2.10c) could have been used. (3.1c) was preferred because it has a probabilistic foundation through Bayes' theorem. By definition of Y_ϑ, it holds that E[Y_ϑ | I = i] = ϑ_i and E[Y_ϑ] = ∑_{i=1}^n w_i ϑ_i = ϑ. Another simple alternative to the definition (3.1c) of the ϑ_i would be linear scaling: ϑ_i = p_i ϑ / p_w. However, with this definition values ϑ_i > 1 may occur. This is not desirable because then the Bernoulli distribution for Y_ϑ | I = i would be ill-defined.
Proposition 3.2. For X_ϑ as described in Assumption 3.1, the expected value and the variance are given by

E[X_ϑ] = b_w − ϑ, (3.2a)
var[X_ϑ] = ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (b_i − ϑ_i)² − (b_w − ϑ)². (3.2b)

Proof. Similar to the proof of Proposition 2.5.

Note that ∑_{i=1}^n w_i (b_i − ϑ_i)² is a weighted version of the Brier score [see, e.g., Hand, 1997] for the observation-prediction sample (b_1, ϑ_1), …, (b_n, ϑ_n). This observation suggests that the power of the calibration tests considered in this section will be the greater, the better the discriminatory power of the PD predictions is (reflected by lower Brier scores).
By Assumption 3.1 and Proposition 3.2, the questions on the safety of conclusions from the sign of ∆_w = b_w − p_w again can be translated into hypotheses on the value of the parameter ϑ:
• If ∆_w < 0, can we conclude that H_0: ϑ ≤ b_w is false and H_1: ϑ > b_w ⇔ E[X_ϑ] < 0 is true?
• If ∆_w > 0, can we conclude that H*_0: ϑ ≥ b_w is false and H*_1: ϑ < b_w ⇔ E[X_ϑ] > 0 is true?
If we assume as before in Section 2 that the sample ∆_1, …, ∆_n was generated by independent realisations of X_ϑ then the distribution of the sample mean is different from the distribution of X_ϑ, as shown in the following corollary to Proposition 3.2.
Corollary 3.3. Let X_{1,ϑ}, …, X_{n,ϑ} be independent and identically distributed copies of X_ϑ as in Assumption 3.1 and define X̄_ϑ = (1/n) ∑_{i=1}^n X_{i,ϑ}. Then for the mean and variance of X̄_ϑ, it holds that

E[X̄_ϑ] = b_w − ϑ, (3.3a)
var[X̄_ϑ] = var[X_ϑ]/n, with var[X_ϑ] as in (3.2b). (3.3b)

In the following, we use X̄_ϑ as the test statistic and interpret ∆_w = b_w − p_w as its observed value.
Lemma 3.4. In the setting of Assumption 3.1, ϑ < ϑ′ implies ϑ_i < ϑ′_i for all i = 1, …, n.

Proof. Assume ϑ < ϑ′ and let h = h(ϑ) and h′ = h(ϑ′). Along the same lines of algebra as in Section 3 of Tasche [2013b], it can be shown that (with w_i and ϱ_i as in Assumption 3.1) for 0 < t < 1 and η > 0 the equations (3.5a) and (3.5b) are equivalent. By definition, (3.1d) holds for ϑ and h. From (3.4) and (3.5a) it then follows that the corresponding version of (3.5b) holds for ϑ and h. However, by (3.4) we also have the corresponding statement for ϑ′ and h′. By (3.5b), this is only possible if it holds that h′ > h. Hence it follows that h ϱ_i / (1 + h ϱ_i) < h′ ϱ_i / (1 + h′ ϱ_i) for all i = 1, …, n. By (3.1c) (i.e. the definition of the ϑ_i and ϑ′_i), this inequality implies ϑ_i < ϑ′_i.
Theorem 3.5. In the setting of Assumption 3.1 and Corollary 3.3, ϑ ≤ ϑ′ implies that P[X̄_ϑ ≤ x] ≤ P[X̄_{ϑ′} ≤ x] for all x ∈ R.
Proof. By Lemma 3.4, ϑ ≤ ϑ′ implies for all i = 1, …, n that ϑ_i ≤ ϑ′_i and therefore also P[Y_{ϑ′} = 1 | I = i] ≥ P[Y_ϑ = 1 | I = i]. The remainder of the proof is identical to the last part of the proof of Proposition 2.7.

Exact p-values. Since, up to the constant 1/n, the test statistic X̄_ϑ as defined in Assumption 3.1 and Corollary 3.3 takes only integer values in the range {−n, …, −1, 0, 1, …, n}, its distribution can readily be determined exactly by means of an inverse Fourier transform [Rolski et al., 1999, Section 4.7]. By Theorem 3.5 and Theorem 8.3.27 of Casella and Berger [2002], a p-value for the test of H_0: ϑ ≤ b_w against H_1: ϑ > b_w can then be computed exactly as

p-value = P[X̄_{b_w} ≤ b_w − p_w], (3.6a)

and a p-value for the test of H*_0: ϑ ≥ b_w against H*_1: ϑ < b_w as

p-value = P[X̄_{b_w} ≥ b_w − p_w]. (3.6b)

Normal approximate test. By Corollary 3.3, we find that the distribution of X̄_{b_w} can be approximated by a normal distribution with mean 0 and variance as shown on the right-hand side of (3.3b). With x = b_w − p_w, one obtains for the normal approximate p-value of H_0: ϑ ≤ b_w against H_1: ϑ > b_w:

p-value = Φ( (b_w − p_w) / √( (1/n) ( ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (b_i − ϑ_i)² ) ) ), (3.7a)

with ϑ_i as in Assumption 3.1 for ϑ = b_w. The same reasoning gives for the normal approximate p-value of H*_0: ϑ ≥ b_w against H*_1: ϑ < b_w:

p-value = 1 − Φ( (b_w − p_w) / √( (1/n) ( ∑_{i=1}^n w_i ϑ_i (1 − ϑ_i) + ∑_{i=1}^n w_i (b_i − ϑ_i)² ) ) ). (3.7b)
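Instead of an inverse Fourier transform, the exact distribution of n·X̄ can also be obtained by direct convolution, since each summand b_I − Y takes values in {−1, 0, 1}. The sketch below additionally assumes homogeneous recalibrated PDs ϑ_i = b_w, i.e. it ignores the odds transform; it is meant to illustrate the mechanics of the exact computation, not the full test.

```python
def exact_p_value(b, p_hat, weights):
    """Exact p-value for the null 'predictions aggressive' by convolution.

    Simplifying assumption: homogeneous recalibrated PDs theta_i = b_w
    (the weighted default rate)."""
    n = len(b)
    b_w = sum(w * x for w, x in zip(weights, b))
    p_w = sum(w * x for w, x in zip(weights, p_hat))
    d_w = b_w - p_w                  # observed test statistic
    theta = b_w
    # single-summand distribution of X = b_I - Y on {-1, 0, +1}
    p_plus = sum(w for w, bi in zip(weights, b) if bi == 1) * (1 - theta)
    p_minus = sum(w for w, bi in zip(weights, b) if bi == 0) * theta
    p_zero = 1.0 - p_plus - p_minus
    # distribution of the sum of n iid summands (values -n .. n)
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, pr in dist.items():
            for step, q in ((1, p_plus), (0, p_zero), (-1, p_minus)):
                nxt[s + step] = nxt.get(s + step, 0.0) + pr * q
        dist = nxt
    # probability that the mean is at most the observed value
    return sum(pr for s, pr in dist.items() if s / n <= d_w)
```

The convolution costs O(n²) operations, which is entirely adequate for portfolio-sized samples; the Fourier approach becomes attractive only for very large n.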

The Jeffreys test approach
In Section 2.5.3.1 of ECB [2019], the ECB proposes "PD back testing using a Jeffreys test". Transcribed into the notation of Section 3.1 of this paper, the starting point for the test can be described as follows:
• n = N, where "N is the number of customers in the portfolio/rating grade".
• ∑_{i=1}^n b_i = D, where "D is the number of those customers that have defaulted within that observation period".
The Jeffreys test is a test for the success parameter of a binomial distribution:
• In a Bayesian setting, an "objective Bayesian" prior distribution beta(1/2, 1/2) for the PD is chosen such that, assuming a binomial distribution for the number of defaults, the posterior distribution of the PD (i.e. conditional on the observed number of defaults) is beta(D + 1/2, N − D + 1/2). See Kazianka [2016] for the rationale for choosing this kind of test. If estimated as the mean of the posterior distribution, the Bayesian PD estimate is (D + 1/2)/(N + 1).
• The null hypothesis is "the PD applied in the portfolio/rating grade . . . is greater than the true one (one sided hypothesis test)", i.e. H_0: θ ≤ PD with PD = "applied PD" and θ = "true PD". In the notation of Section 3.1, this can be phrased as testing H*_0: ϑ ≥ b_{1/n} against H*_1: ϑ < b_{1/n}.
• ECB [2019]: "The test statistic is the PD of the portfolio/rating grade." The construction principle for the Jeffreys test is to determine a credibility interval for the PD and then to check if the applied PD is inside or outside of the interval. The resulting p-value is

p-value = F_{D+1/2, N−D+1/2}(PD), (3.8)

where F_{α, β} denotes the distribution function of the beta(α, β)-distribution.

Comments.
• The standard (frequentist) one-sided binomial test would be: 'Reject H_0 if D ≥ c', where c is a 'critical' value such that the probability under H_0 to observe c or more defaults is small. For this test, the p-value is

p-value = ∑_{k=D}^{N} (N choose k) PD^k (1 − PD)^{N−k} = F_{D, N−D+1}(PD). (3.9)

Hence, unless the observed number of defaults D is very small or even zero, from (3.8) it follows that in practice most of the time the Jeffreys test and the standard binomial test give similar results.
• For a 'fair' comparison of the Jeffreys test and the test proposed in Section 3.1, we have to modify Assumption 3.1 such that there is no variance expansion and all weights are equal, i.e. the random variable X_ϑ is simply defined by

P[X_ϑ = b_i − ϑ_i] = 1/n, i = 1, …, n, (3.10)

where the ϑ_i depend on the unknown parameter 0 < ϑ < 1 in the way described by (3.1c) and (3.1d). The normal approximate p-value of H_0 against H_1 is then (using the ECB notation)

p-value = 1 − Φ( (D/N − PD) / √( (D/N)(1 − D/N) / N ) ). (3.11)

• The normal approximation of the frequentist (and, by (3.8) and (3.9), also the Jeffreys) binomial test p-value is

p-value = 1 − Φ( (D/N − PD) / √( PD (1 − PD) / N ) ). (3.12)

• The test for H_0 as required by the ECB would typically be performed when D/N > PD, i.e. when there are doubts with regard to the conservatism of the PD estimate. Rejection of H_0 would then be regarded as 'proof' of the estimate being aggressive while non-rejection would entail 'acquittal' for lack of evidence. In the case 1/2 ≥ D/N > PD, it holds that PD (1 − PD) < (D/N)(1 − D/N), such that the p-value according to the ECB test is lower than the p-value according to (3.10) and (3.11), i.e. the ECB test would reject H_0 earlier than the simplified version of the test according to Section 3.1.
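For reference, the frequentist binomial p-value discussed above (the companion to the Jeffreys p-value) is straightforward to compute with the standard library alone; the function name is illustrative.

```python
from math import comb

def binomial_p_value(D, N, PD):
    """One-sided binomial p-value: probability of observing D or more
    defaults among N obligors when the true default probability is PD."""
    return sum(comb(N, k) * PD ** k * (1 - PD) ** (N - k)
               for k in range(D, N + 1))
```

For D = 0 the sum runs over all possible outcomes and equals one, so a zero-default observation period can never reject the null hypothesis of a conservative PD.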

Numerical examples
The test methods of Section 2 and the appendices are illustrated in Section 4.1 below with numerical results from tests on a data set from Fischer and Pfeuffer [2014, Table 1]. The test methods of Section 3 are illustrated in Section 4.2 below with numerical results from tests on a simulated data set; the exposures, however, are again taken from Fischer and Pfeuffer [2014, Table 1]. A zip-archive with the R-scripts and csv-files that were used for computing the results can be downloaded from https://www.researchgate.net/profile/Dirk_Tasche. Explanations of the individual results are given below.

Example: Tests for variables with values in the unit interval
• Sample means: according to (2.1). Weights according to (2.2b), with EAD taken from the column 'raw.w' of the data set, and w_i = 1/100 in the equally weighted case.
• Sample standard deviations: the first two values are the square roots of the right-hand side of (2.4b). The third value is also computed according to (2.4b), but with ∆_i from (A.3a) and equal weights.
• t-test results: 'Eq-weighted' according to (2.9), reporting 1 − p-value* in the first row of the t-test results. 'Weighted' is adapted analogously for the weighted case (but without a strong theoretical foundation). 'W-adjusted' is like 'Eq-weighted' but for the sample ∆̃_1, . . . , ∆̃_100.
• 'Basic' results: bootstrapped according to (2.6a) and (2.6b) respectively, with weights and samples as for the t-test rows.
This example demonstrates that
• as mentioned in Section 3.2, the Jeffreys test has a tendency to reject 'H_0: mean(obs − pred) ≤ 0' earlier than the other tests discussed in Section 3,
• test results based on equally weighted means and on means with inhomogeneous weights can lead to different outcomes (no conclusion vs. rejection of the null hypothesis), and
• variance expansion to capture the individual randomness of single observation-prediction pairs can have some impact on the degree of certainty of the test results, by entailing greater p-values.
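As an illustration of the kind of computation behind the bootstrap rows, the following sketch produces a one-sided bootstrap p-value for the null hypothesis 'weighted mean(obs − pred) ≥ 0'. It is a generic recentring bootstrap under the sampling scheme P[I = i] = w_i, not a verbatim transcription of algorithms (2.6a) and (2.6b); all names are ours.

```python
import random

def bootstrap_p(delta, w, R=2000, seed=42):
    """One-sided bootstrap p-value for H0: weighted mean(obs - pred) >= 0
    against H1: weighted mean < 0 (prudence). Illustrative sketch only."""
    rng = random.Random(seed)
    n = len(delta)
    d_w = sum(wi * di for wi, di in zip(w, delta))   # observed statistic
    centred = [di - d_w for di in delta]             # enforce the H0 boundary
    hits = 0
    for _ in range(R):
        draws = rng.choices(centred, weights=w, k=n) # resample with P[I = i] = w_i
        if sum(draws) / n <= d_w:                    # at least as extreme as observed
            hits += 1
    return hits / R
```

A clearly negative weighted mean of the differences yields a p-value near zero, i.e. the prudence conclusion is safe; a sample centred at zero yields a p-value near 1/2, i.e. no conclusion.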

Conclusions
In this paper, we have made suggestions for improving on the t-test and the Jeffreys test presented in ECB [2019] for assessing the 'predictive ability (or calibration)' of credit risk parameters. The improvements refer to
• also testing the null hypothesis that the estimated parameter is less than or equal to the true parameter, in order to be able to 'prove' that the estimate is prudent (or conservative),
• additionally using exposure- or limit-weighted sample averages in order to better inform assessments of estimation (or prediction) prudence, and
• 'variance expansion' in order to account for sample inhomogeneity in terms of composition (exposure sizes) and riskiness.
The suggested test methods have been illustrated with exemplary test results. R-scripts with code for the tests are available.

A. Appendix: Special cases of the weighted paired difference approach
Equal weights in the basic approach. In this case, the variable of interest is the ordinary average of the sample ∆_1, . . . , ∆_n, as reflected by the fact that (2.4a) then simplifies to its equal-weights version, with all weights replaced by 1/n. In the same vein, the algorithms and formulae of Section 2.1 can be adapted to the equal-weights case by replacing all weights w_i and w_j with 1/n.
Weight-adjusted sample. In this case, the weights w_i are accounted for by replacing the sample ∆_1, . . . , ∆_n with the sample ∆*_1, . . . , ∆*_n, where ∆*_i is defined by ∆*_i = w_i ∆_i, i = 1, . . . , n. The adjusted sample ∆*_1, . . . , ∆*_n in turn is treated as in the equal-weights case. Then, in particular, (2.3) for the distribution of X_ϑ reads
P[X_ϑ = ∆*_i − ϑ] = 1/n, i = 1, . . . , n. (A.2)
As a consequence of (A.2), the adaptation of the algorithms and formulae from Section 2.1 to the weight-adjusted sample case would appear somewhat misleading if comparability in magnitude of the values of the test statistic X̄_ϑ with its values in the unequal-weights case as discussed in Section 2.1 were intended. A workaround for this problem is to adjust the sample not only for the weights but also for the sample size, i.e. to define the adjusted sample ∆̃_1, . . . , ∆̃_n by
∆̃_i = n w_i ∆_i, i = 1, . . . , n. (A.3a)
Assuming equal weights now means P[X_ϑ = ∆̃_i − ϑ] = 1/n, which implies E[X_ϑ] = Σ_{i=1}^n w_i ∆_i − ϑ. Comparison with (2.4b) shows that the variances of X_ϑ under the weighting scheme (A.3a) and under the weighting scheme deployed in Section 2.1 differ by a term which can be positive or negative. The algorithms and formulae from Section 2.1 can be applied to the weight-adjusted sample case as specified by (A.3a) and P[X_ϑ = ∆̃_i − ϑ] = 1/n if the following two modifications are taken into account in the given order: • Replace the value of ∆_i by the value of ∆̃_i = n w_i ∆_i for i = 1, . . . , n.
• Replace all remaining appearances of the weights w i by 1/n.
Note that the weight-adjustment (A.3a) can also be deployed for samples with more special structure like the ones considered in Section 2.3 and Appendix B below. There is no guarantee, however, that adjustment (A.3a) would preserve the 'values in the unit interval' constraint of Section 2.3. There is no such preservation issue with regard to Appendix B.
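The adjustment (A.3a) and the property that makes the adjusted test statistic comparable in magnitude to the weighted one can be spelled out in a few lines of code (a sketch; function names are ours):

```python
def weight_adjust(delta, w):
    # (A.3a): size-and-weight adjusted sample, delta_tilde_i = n * w_i * delta_i
    n = len(delta)
    return [n * wi * di for wi, di in zip(w, delta)]

def weighted_mean(delta, w):
    # weighted sample mean sum_i w_i * delta_i (weights summing to 1)
    return sum(wi * di for wi, di in zip(w, delta))
```

By construction, the equally weighted mean of the adjusted sample coincides with the weighted mean of the original sample, so the location of the test statistic is preserved while only its variance changes.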

B. Appendix: Tests for non-negative variables
In contrast to LGD and CCF which by definition are variables with values in the unit interval, EAD in principle may take any non-negative value. This requires some modifications in order to adapt the approach from Section 2.3 to the assessment of EAD estimates.
Starting point.
• A sample of n defaulted credit facilities / loans is analysed.
• The EAD η_i is an estimate of loan i's exposure at the moment of default, measured in currency units.
• The realised exposure h_i is loan i's exposure at the time of default.
• The weight w i reflects the relative importance of observation i. In the case of direct EAD predictions, one might choose w i according to (2.2a). • Define ∆ i = h i − η i , i = 1, . . . , n. If |∆ i | ≈ 0 then η i is a good EAD prediction. If |∆ i | is large then η i is a poor EAD prediction.
Goal. We want to use the observed weighted average difference / residual ∆̄_w = Σ_{i=1}^n w_i ∆_i = h̄_w − η̄_w to assess the quality of the calibration of the model / approach for the η_i to predict the realised exposures h_i. Again we want to answer the following two questions:
• If ∆̄_w < 0, how safe is the conclusion that the observed (realised) values are on weighted average less than the predictions, i.e. that the predictions are prudent / conservative?
• If ∆̄_w > 0, how safe is the conclusion that the observed (realised) values are on weighted average greater than the predictions, i.e. that the predictions are aggressive?
The safety of such conclusions is measured by p-values, which provide error probabilities for the conclusions being wrong. The lower the p-value, the safer the conclusion. In order to be able to examine the specific properties of the sample and ∆̄_w with statistical methods, we have to assume that the sample was generated by some random mechanism. This mechanism is described in the following modification of Assumption 2.4.
Assumption B.1. The sample ∆_1, . . . , ∆_n consists of independent realisations of a random variable X_ϑ with distribution given by (B.1a), where I is a random variable with values in {1, . . . , n} and P[I = i] = w_i, i = 1, . . . , n, and where, conditional on I = i, Y_ϑ is a gamma(α_i, β_i)-distributed random variable for i = 1, . . . , n. The parameters α_i and β_i of the gamma distribution depend on the unknown parameter 0 < ϑ < ∞ through (B.1b); there, the constant 0 < v < ∞ is the same for all i. The ϑ_i are determined by (B.1c). Note that Assumption B.1 describes a method for recalibrating the EAD estimates η_1, . . . , η_n such that the weighted average of the ϑ_i matches the target ϑ. By definition of Y_ϑ, it holds that E[Y_ϑ | I = i] = ϑ_i.
The constant v specifies the variance of Y_ϑ conditional on I = i as a multiple of its expected value ϑ_i, i.e. it holds that
var[Y_ϑ | I = i] = v ϑ_i, i = 1, . . . , n. (B.2)
The constant v must be pre-defined or separately estimated. We suggest estimating it from the sample h_1, . . . , h_n as the estimate v̂ given by (B.3). As in (2.13b), the variance of X_ϑ as shown in (B.4b) depends on the parameter ϑ and has an additional component v ϑ which reflects the potentially different variances of the exposures at default in an inhomogeneous portfolio.
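The moment conditions E[Y_ϑ | I = i] = ϑ_i and (B.2) pin down the gamma parameters: in shape-scale terms, shape ϑ_i/v and scale v give mean ϑ_i and variance v ϑ_i. A quick simulation check (our parametrisation for illustration; the paper's (B.1b) may state α_i and β_i in rate rather than scale form):

```python
import random
import statistics

def draw_y(theta_i, v, rng):
    # gamma draw with mean theta_i and variance v * theta_i, cf. (B.2):
    # shape = theta_i / v, scale = v
    return rng.gammavariate(theta_i / v, v)

rng = random.Random(7)
theta_i, v = 2.0, 0.5
# simulate Y_theta conditional on I = i and check the first two moments
sample = [draw_y(theta_i, v, rng) for _ in range(200_000)]
```

The empirical mean of `sample` is close to ϑ_i = 2.0 and its empirical variance close to v ϑ_i = 1.0, as required.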
By Assumption B.1 and Proposition B.2, the questions on the safety of conclusions from the sign of ∆̄_w can again be translated into hypotheses on the value of the parameter ϑ:
• If ∆̄_w < 0, can we conclude that H_0: ϑ ≤ h̄_w is false and H_1: ϑ > h̄_w ⇔ E[X_ϑ] < 0 is true?
• If ∆̄_w > 0, can we conclude that H*_0: ϑ ≥ h̄_w is false and H*_1: ϑ < h̄_w ⇔ E[X_ϑ] > 0 is true?
If we assume that the sample ∆_1, . . . , ∆_n was generated by independent realisations of X_ϑ, then the distribution of the sample mean differs from the distribution of X_ϑ, as shown in the following corollary to Proposition B.2.
Corollary B.3. Let X_{1,ϑ}, . . . , X_{n,ϑ} be independent and identically distributed copies of X_ϑ as in Assumption B.1, and define X̄_ϑ = (1/n) Σ_{i=1}^n X_{i,ϑ}. Then for the mean and variance of X̄_ϑ it holds that E[X̄_ϑ] = E[X_ϑ] (B.5a) and var[X̄_ϑ] = var[X_ϑ]/n (B.5b), with E[X_ϑ] and var[X_ϑ] as in (B.4a) and (B.4b). In the following, we use X̄_ϑ as the test statistic and interpret ∆̄_w = h̄_w − η̄_w as its observed value. Proof. Same as the proof of Proposition 2.7.
Bootstrap test. Generate a Monte Carlo sample x̄_1, . . . , x̄_R from X̄_ϑ with ϑ = h̄_w as follows:
• For j = 1, . . . , R: x̄_j is the equally weighted mean of n independent draws from the distribution of X_ϑ as given by Assumption B.1, with ϑ = h̄_w.
• x̄_1, . . . , x̄_R then are realisations of independent, identically distributed random variables with the same distribution as X̄_{h̄_w}.
Rationale. Same as the rationale for (2.16a) and (2.16b).
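A runnable sketch of this bootstrap is given below. Since (B.1a)-(B.1c) are not reproduced in this excerpt, two simplifying assumptions are made and flagged in the comments: the recalibrated targets ϑ_i are obtained by proportionally rescaling the η_i, and a draw of X_ϑ is taken as a gamma draw with mean ϑ_I and variance v ϑ_I, centred at ϑ_I (consistent with the mean-zero normal approximation below). All function names are ours.

```python
import random

def bootstrap_ead_test(h, eta, w, v, R=1000, seed=3):
    """Monte Carlo p-values for H0: theta <= h_w and H*0: theta >= h_w.
    Sketch under simplifying assumptions (see text), not the exact
    algorithm of Assumption B.1."""
    rng = random.Random(seed)
    n = len(h)
    h_w = sum(wi * hi for wi, hi in zip(w, h))       # weighted mean observation
    eta_w = sum(wi * ei for wi, ei in zip(w, eta))   # weighted mean prediction
    x_obs = h_w - eta_w                              # observed test statistic
    # assumed recalibration: rescale the eta_i so that the weighted mean of
    # the targets theta_i equals h_w (stand-in for (B.1c))
    theta = [ei * h_w / eta_w for ei in eta]
    idx = list(range(n))
    x_bars = []
    for _ in range(R):
        total = 0.0
        for i in rng.choices(idx, weights=w, k=n):   # P[I = i] = w_i
            y = rng.gammavariate(theta[i] / v, v)    # mean theta_i, var v*theta_i
            total += y - theta[i]                    # assumed centring of X_theta
        x_bars.append(total / n)                     # equally weighted mean
    p = sum(1 for xb in x_bars if xb <= x_obs) / R       # H1: theta > h_w (prudence)
    p_star = sum(1 for xb in x_bars if xb >= x_obs) / R  # H*1: theta < h_w
    return p, p_star
```

For a sample in which the realised exposures fall clearly short of the predictions, `p` comes out near zero, supporting the prudence conclusion.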
Normal approximate test. By Corollary B.3, the distribution of X̄_{h̄_w} can be approximated by a normal distribution with mean 0 and variance as given by the right-hand side of (B.5b) with ϑ = h̄_w. With x̄ = h̄_w − η̄_w and η̄_w as in Assumption B.1, one obtains for the approximate p-value of H_0: ϑ ≤ h̄_w against H_1: ϑ > h̄_w
p ≈ Φ(x̄ / √(var[X̄_{h̄_w}])).
The same reasoning gives for the normal approximate p-value of H*_0: ϑ ≥ h̄_w against H*_1: ϑ < h̄_w
p* ≈ 1 − Φ(x̄ / √(var[X̄_{h̄_w}])).
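In code, with the observed value x̄ and the standard deviation σ of X̄_{h̄_w} taken from (B.5b), the two approximate p-values amount to one evaluation of the standard normal distribution function (a sketch; Φ is computed via the error function):

```python
import math

def normal_approx_p(x_bar, sigma):
    # p-value of H0: theta <= h_w against H1: theta > h_w (prudence direction),
    # and of H*0: theta >= h_w against H*1: theta < h_w
    phi = 0.5 * (1.0 + math.erf((x_bar / sigma) / math.sqrt(2.0)))
    return phi, 1.0 - phi
```

A clearly negative x̄ (observations below predictions on weighted average) yields a small first p-value and thus supports the prudence conclusion; the two p-values always sum to one.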