Cross-calibration of probabilistic forecasts

When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts, that is, the predictive distribution should be compatible with the observed outcomes. Several notions of calibration are available in the case of a single forecaster, alongside diagnostic tools and statistical tests to assess calibration in practice. Often, there is more than one forecaster providing predictions, and these forecasters may use the information of the others and therefore influence one another. We extend common notions of calibration, where each forecaster is analysed individually, to notions of cross-calibration, where each forecaster is analysed with respect to the other forecasters in a natural way. It is shown theoretically and in simulation studies that cross-calibration is a stronger requirement on a forecaster than calibration. Analogously to calibration for individual forecasters, we provide diagnostic tools and statistical tests to assess forecasters in terms of cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England.


Introduction
In the past decades, probabilistic forecasts, which specify a complete predictive probability distribution for an uncertain future event, have replaced point forecasts in a number of applications including weather forecasting, climate prediction and economics; see Gneiting and Katzfuss (2014) for a recent overview. Following Murphy and Winkler (1987), the guiding principle for a probabilistic forecast is to "maximize sharpness subject to calibration". Calibration refers to the statistical compatibility of the forecasts and the observations. Sharpness, on the other hand, is a property of the forecast only: roughly speaking, a forecast is sharper the more concentrated its predictive distribution is, with point forecasts as a limiting case. The principle was formulated in order to pick the "better" of two calibrated forecasts. While it is generally acknowledged that forecasts should be calibrated (Dawid, 1984; Diebold et al., 1998), it is not universally accepted that sharpness is necessary as a further criterion for forecast evaluation (Mitchell and Wallis, 2011).
In this manuscript we are concerned with calibration only. However, we consider a framework where several forecasters issue competing forecasts. We propose concepts of cross-calibration in order to formalize the influence forecasters have on one another and with respect to the observations. Essentially, a cross-calibrated forecaster not only uses her own information optimally but also incorporates the information of the competing forecasters in an optimal way. The notions we propose refine the existing notions of calibration of Gneiting and Ranjan (2013). Furthermore, we extend their prediction space setting to allow for serial dependence, which is the usual situation in forecasting applications. We are able to extend the result of Diebold et al. (1998) on uniformity and independence of probability integral transform (PIT) values to our general framework.
Notions of cross-calibration have previously been considered in the literature for binary or categorical outcomes. Al-Najjar and Weinstein (2008) consider a test which an uninformed forecaster cannot pass with high probability when an informed forecaster is present. The notion of cross-calibration by Feinberg and Stewart (2008) takes into account that several forecasters may influence each other, and the one with the largest information set should be preferred. In this paper we generalize the cross-calibration notions of Feinberg and Stewart (2008) to forecasts of real valued outcomes including diagnostic tools and statistical tests to assess cross-calibration in applications.
The cross-calibration test suggested by Feinberg and Stewart (2008) uses the following framework, which we review here only in the case of two forecasters for simplicity. Let Ω = {(ω_t)_{t=0,1,…} | ω_t ∈ {0, 1}} denote the space of all possible realizations and let n > 4 be an integer. Divide the interval [0, 1] into n equal closed subintervals [0, 1/n], …, [(n − 1)/n, 1]. At time t, forecaster j, j = 1, 2, makes a prediction which is given as an interval I_t^j ∈ {[0, 1/n], …, [(n − 1)/n, 1]}. It gives bounds on the predictive probability that the next realization ω_t is equal to one. The cross-calibration test is defined over the sequence of forecast-observation triples (I_t^1, I_t^2, ω_t)_{t=0}^∞. For any pair ℓ = (ℓ_1, ℓ_2) ∈ {1, …, n}² and any time T, let

ν_T^ℓ = Σ_{t=0}^T 1{I_t^1 = [(ℓ_1 − 1)/n, ℓ_1/n], I_t^2 = [(ℓ_2 − 1)/n, ℓ_2/n]},

which is the number of times up to T that the forecasting profile ℓ is chosen. For ν_T^ℓ > 0, the frequency of realizations equal to one conditional on the forecasting profile is given by

f_T^ℓ = (1/ν_T^ℓ) Σ_{t=0}^T ω_t 1{I_t^1 = [(ℓ_1 − 1)/n, ℓ_1/n], I_t^2 = [(ℓ_2 − 1)/n, ℓ_2/n]}.

A forecaster j passes the cross-calibration test at the outcome (I_t^1, I_t^2, ω_t)_{t=0}^∞ if lim_{T→∞} f_T^ℓ lies in [(ℓ_j − 1)/n, ℓ_j/n] for every ℓ satisfying lim_{T→∞} ν_T^ℓ = ∞. It is shown in Feinberg and Stewart (2008) that a forecaster who is aware of the distribution of (ω_t)_{t=0}^∞ passes the cross-calibration test with probability one, no matter which strategy the other forecaster uses. From a theoretical point of view, this is an interesting result. However, testing empirically whether a forecast is cross-calibrated is rather difficult. The problem is that already for n = 5, there are 25 forecasting profiles to consider. For each of these profiles the empirical frequency conditioned on that profile should lie inside the predicted interval of the cross-calibrated forecaster. But some profiles are hardly ever predicted, and therefore the number of observations needs to be very large. We illustrate this problem with the following simple simulation example, which has been implemented in R (R Development Core Team, 2008), like all further simulations in this paper.

T          Monte-Carlo power
10^4       0.112
5 · 10^4   0.254
10^5       0.333
5 · 10^5   0.699
10^6       0.847
5 · 10^6   0.994

Table 1: Monte-Carlo power of detecting cross-calibration for different time periods T.
Example 1.1. In this example we consider the setting of the cross-calibration test described above. Let B_t, C_t, t = 0, …, T, be independent beta random variables with parameters (3, 5) for B_t and (2, 1.5) for C_t. We simulate a (finite) stochastic process (ω_t)_{t=0}^T, where ω_t is conditionally Bernoulli distributed with probability (B_t + C_t)/2. Let n = 5. The first forecaster predicts at each time t the interval I_t^1 which contains the value (B_t + C_t)/2, which is the probability that the realization ω_t is one. The second forecaster predicts the interval I_t^2 which contains the value C_t. Therefore, we expect that the first forecaster is cross-calibrated with respect to the second forecaster and should pass the test. The first forecaster passes the test if, for all forecasting profiles ℓ = (ℓ_1, ℓ_2) with ν_T^ℓ positive, f_T^ℓ lies in [(ℓ_1 − 1)/n, ℓ_1/n]. In Table 1 we see the result. For several T we performed the test 1000 times. The second column of the table gives the Monte-Carlo power of the test, that is, the number of replications in which the test detected the cross-calibration of the first forecaster divided by the number of replications. Table 1 shows that already for this rather simple example, the sample size T needs to be large in order to come close to the theoretically predicted power of one. Furthermore, the test is only applicable to probabilistic forecasts for binary outcomes. The goal of this paper is to extend the notion of cross-calibration to probabilistic forecasts of real valued quantities, and present methodology to empirically assess cross-calibration for serially dependent forecast-observation tuples. We have chosen to work in the framework of prediction spaces as introduced by Gneiting and Ranjan (2013), and extend it to allow for serial dependence.
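A minimal Monte-Carlo sketch of this experiment (in Python rather than the paper's R implementation; parameters as in Example 1.1, but the power at T = 10^4 is estimated from 200 replications instead of 1000 to keep the run short):

```python
import numpy as np

rng = np.random.default_rng(42)

def passes_cross_calibration_test(T, n=5):
    """One replication of Example 1.1: does forecaster 1 pass the test up to time T?"""
    B = rng.beta(3, 5, size=T)        # part of forecaster 1's information
    C = rng.beta(2, 1.5, size=T)      # forecaster 2's information
    p = (B + C) / 2                   # true success probability of omega_t
    omega = rng.random(T) < p         # conditionally Bernoulli outcomes
    # interval index l in {1, ..., n} such that the value lies in [(l-1)/n, l/n]
    l1 = np.minimum(np.ceil(p * n).astype(int), n)  # forecaster 1 covers p
    l2 = np.minimum(np.ceil(C * n).astype(int), n)  # forecaster 2 covers C
    l1[l1 == 0] = 1
    l2[l2 == 0] = 1
    for a in range(1, n + 1):
        for b in range(1, n + 1):
            idx = (l1 == a) & (l2 == b)
            nu = idx.sum()
            if nu == 0:
                continue
            f = omega[idx].mean()
            # forecaster 1 passes for this profile if f lies in [(a-1)/n, a/n]
            if not ((a - 1) / n <= f <= a / n):
                return False
    return True

power = np.mean([passes_cross_calibration_test(10_000) for _ in range(200)])
print(f"Monte-Carlo power at T = 10^4: {power:.3f}")
```

With these settings the estimated power is far below one, in line with the first row of Table 1.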
The paper is organized as follows. In Section 2 we review and extend the notion of a prediction space and generalize the notions of calibration for individual forecasters to multiple forecasters. We introduce diagnostic tools for checking cross-calibration and illustrate their usefulness in a simulation study in Section 3. In Section 4 we treat the special case of binary outcomes and relate our work to the existing results of Feinberg and Stewart (2008). Statistical tests for cross-calibration are derived in Section 5. We analyse the Bank of England density forecasts for inflation rates in Section 6. Finally, the paper concludes with a discussion in Section 7. Gneiting and Ranjan (2013) introduced the notion of a prediction space as follows.

Notions of cross-calibration
Definition 2.1 (one-period prediction space). Let k ≥ 1 be an integer. A prediction space is a probability space (Ω, A, Q) together with sub-σ-algebras A_1, …, A_k ⊆ A. The elements of Ω are tuples of the form (F_1, …, F_k, Y, V) such that, for i = 1, …, k, F_i is a CDF-valued random quantity that is measurable with respect to A_i, Y is a real-valued random variable, and V is a uniformly distributed random variable on [0, 1], independent of A_1, …, A_k, Y.
The integer k corresponds to the number of forecasters. The σ-algebra A_i can be seen as the information set available to forecaster i. The random variable Y is the observation; the random variable V is needed for technical reasons. It allows us to define the probability integral transform (PIT) in Definition 2.6 below.
We term the prediction space proposed by Gneiting and Ranjan (2013) a one-period prediction space as it is only concerned with predictions for an outcome Y at one time point. While this framework is sufficient to define various notions of calibration and cross-calibration of forecasters in principle, a statistical analysis of calibration is only possible if we can assume that we have a sequence (F_{1,n}, …, F_{k,n}, Y_n, V_n)_{1≤n≤N} of independent forecast-observation tuples. This assumption is unrealistic in most forecasting situations. Therefore, we propose to extend the prediction space setting, allowing for serial dependence as follows.
Definition 2.2 (prediction space for serial dependence). Let k ≥ 1 be an integer. A prediction space for serial dependence is a probability space (Ω, A, Q) together with filtrations (A_{1,t})_{t∈N}, …, (A_{k,t})_{t∈N} ⊆ A. The elements of Ω are sequences of tuples of the form (F_{1,t}, …, F_{k,t}, Y_{t+1}, V_t)_{t∈N}, where (Y_t)_{t∈N} is a sequence of real-valued random variables, and (V_t)_{t∈N} is an iid sequence of standard uniform random variables that is independent of everything else. Let T_t = σ(Y_s | s ≤ t) be the σ-algebra generated by the observations until time t. For all t ∈ N and i = 1, …, k, F_{i,t} is a CDF-valued random quantity that is σ(A_{i,t}, T_t)-measurable. We assume that, for all t ∈ N and m ≥ 1,

L(Y_{t+1} | A_{1,t+m}, …, A_{k,t+m}, T_t) = L(Y_{t+1} | A_{1,t}, …, A_{k,t}, T_t),   (1)

where L(X | G) denotes the conditional law of a random variable X with respect to the σ-algebra G.
The notation in Definition 2.2 is chosen such that A_{i,t} encodes the information of the i-th forecaster F_{i,t} at time t to predict the outcome Y_{t+1} at the next time point. Additionally, all forecasters F_{1,t}, …, F_{k,t} have access to the past realizations of Y_t in principle, that is, to the information contained in T_t. This means, we have separated the information of forecaster F_i into two parts: the information of past realizations of the outcome T_t, which is available to all forecasters, and a personal information set A_{i,t} that she acquires (partially) from other sources. Condition (1) formalizes that information from other sources about the outcome at time point t + 1 + m should not influence the outcome Y_{t+1} at time point t + 1. A sufficient condition for (1) to hold is that Y_{t+1} is conditionally independent of σ(A_{1,t+m}, …, A_{k,t+m}) given σ(A_{1,t}, …, A_{k,t}, T_t). Let us illustrate this point in the context of weather forecasting. Suppose a numerical weather prediction system is used to calculate the state of the atmosphere to help us predict temperature tomorrow. Condition (1) means that if we let the numerical system run longer to give us also information about the atmosphere the day after tomorrow, this will have no influence on what temperature is realized tomorrow.
All further statements are within the prediction space setting, and expressions such as "almost surely" are with respect to the probability measure Q. In the prediction space for serial dependence, we call the forecaster F_{i,t} ideal with respect to A_{i,t} if F_{i,t} = L(Y_{t+1} | A_{i,t}, T_t), almost surely. In the case of independent forecast-observation tuples, we recover the definition of an ideal forecaster of Gneiting and Ranjan (2013), that is, in the one-period prediction space setting; see also Tsyplakov (2011, 2013). We generalize this notion as follows.

Definition 2.3 (cross-ideal). In the prediction space setting for serial dependence, we call the forecaster F_{i,t} cross-ideal with respect to A_{1,t}, …, A_{k,t} if, almost surely,

F_{i,t} = L(Y_{t+1} | A_{1,t}, …, A_{k,t}, T_t).   (2)

A cross-ideal forecaster does not only use her own information optimally but also the information available to the other forecasters. In fact, at time t, her information A_{i,t} contains all relevant information of all the forecasters because F_{i,t} is σ(A_{i,t}, T_t)-measurable and hence, by (2), also L(Y_{t+1} | A_{1,t}, …, A_{k,t}, T_t) is σ(A_{i,t}, T_t)-measurable, implying that L(Y_{t+1} | A_{1,t}, …, A_{k,t}, T_t) = L(Y_{t+1} | A_{i,t}, T_t). Therefore, each cross-ideal forecaster is ideal, whereas the converse does not hold in general; see Examples 2.5 and 3.1. The above argument shows more generally the following proposition.
Proposition 2.4. For some t ∈ N, let F_{1,t}, …, F_{k,t} be forecasters with information sets A_{1,t}, …, A_{k,t} in a prediction space for serial dependence. If F_{1,t} is cross-ideal with respect to A_{1,t}, …, A_{k,t}, then it is also cross-ideal with respect to A_{1,t}, A_{i_1,t}, …, A_{i_m,t} for any {i_1, …, i_m} ⊆ {1, …, k}.

For clarity, we have chosen to illustrate the notions of cross-ideal forecasters (or cross-calibrated forecasters; see Definition 2.7) with independent forecast-observation tuples, or, in other words, in the one-period prediction space setting of Gneiting and Ranjan (2013), dropping the time index t. This is natural, as the notions of calibration are essentially one-period concepts and make no use of assumption (1). The purpose of assumption (1) will become clear in Theorem 2.11 below, where we generalize the result of Diebold et al. (1998) on uniformity and independence of PIT values.
Example 2.5. Let ν be uniformly distributed on (5, 20) and, conditionally on ν, let σ² have an inverse chi-squared distribution with ν degrees of freedom. Conditional on ν and σ, the outcome Y is normally distributed with mean zero and variance σ², and we consider two forecasters, a normally distributed forecaster F_1 = N(0, σ²) and a t-distributed forecaster F_2 = t_ν. This example is constructed such that F_1 has the full information about the distribution of the outcome Y, whereas F_2 only knows the prior distribution of σ². We have that F_1 and F_2 are both ideal with respect to their information sets A_1 = σ(σ²) and A_2 = σ(ν), respectively, but only F_1 is cross-ideal with respect to A_1, A_2.
More specifically, the predictive density function f_1(·|σ²) of F_1 is a normal density with variance σ², and the predictive density function f_2 of F_2 is

f_2(y|ν) = ∫_0^∞ f_1(y|s) g(s|ν) ds,   (3)

where g(s|ν) = (ν/2)^{ν/2} s^{−ν/2−1} exp{−ν/(2s)}/Γ(ν/2) is the density function of an inverse chi-squared distribution with ν degrees of freedom. The right-hand side of (3) is the density of a t-distribution with ν degrees of freedom. Equation (3) holds because, for a normal likelihood with known mean, the inverse chi-squared distribution is a conjugate prior for the variance, and the resulting marginal distribution of the outcome is a t-distribution. Therefore, we see that F_1 is cross-ideal with respect to A_1, A_2. It is clear that F_2 is not cross-ideal with respect to A_1, A_2. We will come back to this example throughout the paper.
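The conjugacy argument behind (3) can be checked by simulation. The following sketch (Python rather than the paper's R; ν fixed at 8 is our own choice for illustration) draws σ² as described and verifies that the resulting scale mixture has the variance ν/(ν − 2) of a t_ν distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, nu = 200_000, 8

# sigma^2 has an inverse chi-squared distribution with nu degrees of freedom,
# with the scaling used in g(s|nu): nu / sigma^2 ~ chi-squared(nu)
sigma2 = nu / rng.chisquare(nu, size=n)
y = rng.normal(0.0, np.sqrt(sigma2))  # Y | sigma^2 ~ N(0, sigma^2)

# By (3) the scale mixture is t_nu; its variance should be nu/(nu - 2) = 4/3
print(round(y.var(), 3))
```

The sample variance comes out close to 4/3, as predicted by the t_ν mixture representation.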
The most prominent diagnostic tool for checking calibration empirically is the probability integral transform (PIT) (Dawid, 1984;Diebold et al., 1998).
Definition 2.6 (PIT). Let F be a (possibly random) CDF, X a random variable, and V a standard uniform random variable independent of F and X. We define

Z_F^X := F(X−) + V {F(X) − F(X−)},

where F(y−) = lim_{x↑y} F(x). In the prediction space setting, the random variable Z_{i,t} := Z_{F_{i,t}}^{Y_{t+1}} is called the probability integral transform (PIT) of the i-th forecaster F_{i,t}.
The PIT Z_{i,t} is a random variable with values in [0, 1]. If F is deterministic and X ∼ F, then Z_F^X is uniformly distributed and F^{−1}(Z_F^X) = X almost surely, where F^{−1} is the quantile function of F; see, for example, Rüschendorf (2009). Based on the PIT we introduce the following notions of cross-calibration.
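The randomized PIT of Definition 2.6 is straightforward to implement. The following sketch (Python; the Bernoulli example is our own choice) illustrates that the randomization V makes the PIT exactly uniform even for a discrete CDF:

```python
import numpy as np

rng = np.random.default_rng(1)

def pit(cdf, cdf_left, x, v):
    """Randomized PIT: Z = F(x-) + V * (F(x) - F(x-))."""
    return cdf_left(x) + v * (cdf(x) - cdf_left(x))

# Binary example: X ~ Bernoulli(p), so the CDF jumps at 0 and 1
p = 0.3
x = (rng.random(100_000) < p).astype(float)
v = rng.random(100_000)

F = lambda y: np.where(y < 0, 0.0, np.where(y < 1, 1 - p, 1.0))       # F(y)
F_left = lambda y: np.where(y <= 0, 0.0, np.where(y <= 1, 1 - p, 1.0))  # F(y-)

z = pit(F, F_left, x, v)
# For X ~ F the randomized PIT is standard uniform: mean 1/2, variance 1/12
print(round(z.mean(), 3), round(z.var(), 3))
```

Without the randomization V, the PIT of a discrete outcome could only take finitely many values and could not be uniform.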
Definition 2.7 (cross-calibration). Let {i_1, …, i_m} ⊆ {1, …, k}. The forecast F_{1,t} is

1. cross-calibrated with respect to F_{i_1,t}, …, F_{i_m,t} if L(Z_{1,t} | F_{i_1,t}, …, F_{i_m,t}) is standard uniform, almost surely;

2. marginally cross-calibrated with respect to F_{j,t} if Q(F_{j,t}^{−1}(Z_{1,t}) ≤ y) = E{F_{j,t}(y)} for all y ∈ R.

For brevity, we sometimes speak of cross-calibration with respect to {i_1, …, i_m} instead of F_{i_1,t}, …, F_{i_m,t}. Our definitions are natural generalizations of the notions of calibration for individual forecasters in Gneiting and Ranjan (2013, Definition 2.6), which we recall here for ease of comparison.
Definition 2.8 (calibration). Let F be a forecaster in a one-period prediction space.

1. The forecast F is probabilistically calibrated if its PIT Z_F^Y is uniformly distributed on [0, 1].

2. The forecast F is marginally calibrated if E{F(y)} = Q(Y ≤ y) for all y ∈ R.
In part 2 of Definition 2.8 the left-hand side of the equation depends only on the distribution of the forecast, whereas the right-hand side depends only on the distribution of the observation. Marginal calibration therefore assesses whether the average forecast distribution is equal to the marginal distribution of Y . If F 1,t is marginally cross-calibrated with respect to F j,t , then, on average, the PIT Z 1,t of F 1,t behaves like a standard uniform random variable when considered in view of F j,t . Intuitively, this means that F 1,t has enough information about F j,t and the observation Y t+1 to disguise itself as uniform on average when viewed through the eyes of F j,t .
Probabilistic cross-calibration means that the PIT Z_{1,t} of F_{1,t} is uniformly distributed no matter what the other forecasters predict. In contrast, probabilistic calibration of F_{i,t} means that Z_{i,t} is uniformly distributed on average over all possible predictions of the other forecasters, which is a weaker notion.
Remark. Gneiting and Ranjan (2013) also formalize the concept of dispersion in their Definition 2.6 in terms of the variance of Z_F. It is possible to define a notion of cross-dispersion for multiple forecasters by considering the conditional variance of Z_{1,t} given F_{i_1,t}, …, F_{i_m,t}. However, we feel that formulating dispersion in terms of the variance of Z_F is not as natural as it seems at first sight. If F is probabilistically calibrated, then Z_F is uniformly distributed on [0, 1]; therefore its variance is 1/12 and F is called neutrally dispersed. In this case, Φ^{−1}(Z_F) would have a standard normal distribution, where Φ^{−1} denotes the quantile function of the standard normal distribution. It is equally intuitive to define dispersion in terms of the variance of Φ^{−1}(Z_F), with over- and underdispersion if this variance is smaller or larger than one, respectively. If a random variable X with values in [0, 1] has variance 1/12, it does not follow in general that Φ^{−1}(X) has unit variance. For example, let X be a beta distributed random variable with parameters α = 1 and β = (√33 − 5)/2. Then X has variance 1/12 ≈ 0.083, whereas the variance of Φ^{−1}(X) is approximately 1.9 ≠ 1. Therefore, it may well be that a forecast is neutrally dispersed with respect to Z_F but over- or underdispersed with respect to Φ^{−1}(Z_F). Due to this ambiguity, we do not consider the concept of (cross-)dispersion in this manuscript.
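The beta example in the remark can be checked by simulation. The following sketch (Python; the clipping guard against boundary draws is our own addition) reproduces the two variances:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(2)
beta = (np.sqrt(33) - 5) / 2            # Beta(1, beta) has variance 1/12
x = rng.beta(1.0, beta, size=100_000)
x = np.clip(x, 1e-12, 1 - 1e-12)        # guard against draws at the boundary

inv_cdf = np.vectorize(NormalDist().inv_cdf)   # standard normal quantile function
w = inv_cdf(x)

# x has variance 1/12, yet Phi^{-1}(x) has variance far from one
print(round(x.var(), 3), round(w.var(), 2))
```

The first printed value is close to 1/12 ≈ 0.083, while the second is clearly different from one, illustrating the ambiguity discussed in the remark.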
The following theorem formally connects Definitions 2.7 and 2.8 showing that the former is indeed a generalization of the latter.
Theorem 2.9. Consider forecasters F 1,t , . . . , F k,t in a prediction space for serial dependence.
1. The forecast F 1,t is marginally cross-calibrated with respect to itself, if and only if F 1,t is marginally calibrated.
2. If F 1,t is cross-calibrated with respect to F i 1 ,t , . . . , F im,t , then F 1,t is cross-calibrated with respect to any subset of {i 1 , . . . , i m }. In particular, F 1,t is cross-calibrated with respect to the empty set ∅, that is, probabilistically calibrated.
3. If F 1,t is cross-calibrated with respect to F 2,t , then it is also marginally cross-calibrated with respect to F 2,t .
Proof. To show the first claim, observe that, for all y ∈ R,

Q(F_{1,t}^{−1}(Z_{1,t}) ≤ y) = Q(Z_{1,t} ≤ F_{1,t}(y)) = Q(Y_{t+1} ≤ y).

The second equality holds because F_{1,t}^{−1}(Z_{1,t}) = Y_{t+1} almost surely: Z_{1,t} lies in [F_{1,t}(Y_{t+1}−), F_{1,t}(Y_{t+1})], and for almost all values z of Z_{1,t}, the interval {x | F_{1,t}(x−) ≤ z ≤ F_{1,t}(x)} consists of the point Y_{t+1} only. Hence, F_{1,t} is marginally cross-calibrated with respect to itself if and only if E{F_{1,t}(y)} = Q(Y_{t+1} ≤ y) for all y ∈ R, which is marginal calibration. The second claim follows because, for y ∈ (0, 1) and any {j_1, …, j_l} ⊆ {i_1, …, i_m},

Q(Z_{1,t} ≤ y | F_{j_1,t}, …, F_{j_l,t}) = E{Q(Z_{1,t} ≤ y | F_{i_1,t}, …, F_{i_m,t}) | F_{j_1,t}, …, F_{j_l,t}} = y,

by the definition of cross-calibration. The last claim holds because

Q(F_{2,t}^{−1}(Z_{1,t}) ≤ y) = E{Q(Z_{1,t} ≤ F_{2,t}(y) | F_{2,t})} = E{F_{2,t}(y)}.

It is possible that a forecaster is marginally calibrated but not probabilistically calibrated; see Gneiting and Ranjan (2013, Example 2.4), which we take up below in Example 3.1 to illustrate cross-calibration. Conversely, the last claim of Theorem 2.9 shows that marginal cross-calibration with respect to a different forecaster is a necessary condition for cross-calibration. Tsyplakov (2011, 2013) introduced a slightly more restrictive notion than an ideal forecaster, namely an auto-calibrated forecaster, which fulfils L(Y | F) = F, almost surely, in the one-period prediction space setting. Generally, an auto-calibrated forecaster is ideal with respect to σ(F), which is the σ-algebra generated by F. Gneiting and Ranjan (2013) contend that it is unlikely that empirical tests of auto-calibration are feasible, except in very special circumstances such as forecasts for binary random variables. In cases where forecasters are restricted to specific classes of distributions, Held et al. (2010) have taken on the challenge to derive statistical tests for ideal forecasters in the sense of auto-calibration based on a score regression approach; for earlier work in this direction see Hamill (2001); Mason et al. (2007). In Section 5.3 we show that it is possible to extend the score regression approach of Held et al. (2010) to test for cross-calibrated forecasters, that is, for cross-ideal forecasters with respect to σ(F_1), …, σ(F_k); compare Proposition 2.10.
In this paper, we challenge the statement of Gneiting and Ranjan (2013) by proposing two powerful tests for cross-calibration under very general assumptions that are justified even under serial dependence; see Sections 5.1 and 5.2. Note that the following Proposition 2.10 shows that auto-calibration is in fact a special case of cross-calibration.
Proposition 2.10. In the one-period prediction space setting, the following statements are equivalent:

1. The forecast F_1 is auto-calibrated, that is, L(Y | F_1) = F_1, almost surely.

2. The forecast F_1 is cross-calibrated with respect to itself, that is, L(Z_1 | F_1) is standard uniform.

Furthermore, if 1 ∈ {i_1, …, i_m} and F_1 is cross-calibrated with respect to F_{i_1}, …, F_{i_m}, then F_1 is auto-calibrated.

Proof. The equivalence of parts one and two is immediate from the definition of cross-calibration. Suppose now that 1 ∈ {i_1, …, i_m}. For all y ∈ R, we obtain

Q(Y ≤ y | F_1) = E{Q(Z_1 ≤ F_1(y) | F_{i_1}, …, F_{i_m}) | F_1} = E{F_1(y) | F_1} = F_1(y),

which shows the last claim.
We conclude this section with the announced generalization of the result of Diebold et al. (1998) on uniformity and independence of PIT values in a prediction space for serial dependence.
Theorem 2.11. Suppose we are in the prediction space setting for serial dependence. Let {i_1, …, i_m} ⊆ {1, …, k} and assume that, for all t ∈ N, L(Z_{1,t} | F_{i_1,t}, …, F_{i_m,t}, T_t) is standard uniform. Then, for all l ∈ N_0, the PIT values Z_{1,t}, Z_{1,t+1}, …, Z_{1,t+l} are independent and standard uniformly distributed. The proof conditions on the most recent information first, uses condition (1) to reduce the corresponding conditional law to a standard uniform one, and then proceeds iteratively backwards in time.
Remark. If we consider q-step ahead forecasts for some q ≥ 2, then the above result continues to hold for all vectors of the form (Z 1,t , Z 1,t+q , . . . , Z 1,t+mq ).
However, there may be dependence amongst (Z_{1,t}, Z_{1,t+1}, …, Z_{1,t+q−1}), which complicates matters when testing for cross-calibration. This problem also arises in tests for uniformity and independence of PIT values as suggested by Diebold et al. (1998). Several approaches to deal with this issue have been suggested in the literature; see Knüppel (2015) and references therein. In this paper, we restrict our attention to cross-calibration of one-period-ahead forecasts, but extensions to q-step-ahead forecasts would certainly be of great interest.
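To illustrate Theorem 2.11, the following sketch (Python; the AR(1) example is our own choice) shows that the PIT values of the ideal one-step-ahead forecaster are uniform and serially uncorrelated although the observations themselves are strongly serially dependent:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(4)
T = 20_000
phi = 0.7

# AR(1) outcomes; the ideal one-step-ahead forecast at time t is N(phi * y_t, 1)
y = np.zeros(T + 1)
for t in range(T):
    y[t + 1] = phi * y[t] + rng.normal()

cdf = np.vectorize(NormalDist().cdf)
z = cdf(y[1:] - phi * y[:-1])        # PIT values of the ideal forecaster

# PIT values should be iid standard uniform despite the serial dependence in Y
lag1 = np.corrcoef(z[:-1], z[1:])[0, 1]
print(round(z.mean(), 2), round(lag1, 2))
```

The sample mean is close to 1/2 and the lag-one autocorrelation close to zero, in line with the theorem.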
Diagnostic plots for assessing cross-calibration

Marginal calibration of an individual forecaster can be assessed based on a plot of the empirical analogue of the difference E{F_t(y)} − Q(Y_{t+1} ≤ y). Analogously, to assess marginal cross-calibration, the empirical version of

Q(F_{j,t}^{−1}(Z_{i,t}) ≤ y) − E{F_{j,t}(y)}   (4)

can be plotted. If the graph is not equal to zero everywhere, one can deduce that F_{i,t} is not marginally cross-calibrated with respect to F_{j,t} and, by Theorem 2.9, therefore also not cross-calibrated with respect to {j}. If the graph is zero everywhere, then we have marginal cross-calibration. However, this does not necessarily imply that we have a cross-calibrated forecaster. Probabilistic calibration is often checked empirically by plotting a histogram of Z_{i,t}, the so-called PIT-histogram. Generally, it is not obvious how to check cross-calibration empirically. However, in many situations of practical interest it can be done by borrowing the idea of considering forecasting profiles as in the cross-calibration test of Feinberg and Stewart (2008). Suppose that the forecasters F_1, …, F_k pick predictions from some parametric class of distributions F = {F_λ | λ ∈ Λ}, where Λ ⊆ R^d. Then we can identify each forecaster F_{i,t} with the parameter λ_{i,t} she predicts. We observe a sample (F_{1,t}, …, F_{k,t}, Y_{t+1}, V_t) for 1 ≤ t ≤ N. Let Λ_1, …, Λ_p be a partition of the parameter space. For a diagnostic plot showing whether F_{1,t}, say, is cross-calibrated with respect to {i_1, …, i_m}, we can sort the observations into p^m bins according to the predicted values (λ_{i_1,t}, …, λ_{i_m,t}). Then a PIT-histogram of Z_{1,t} can be plotted for each bin. Clearly, the number of bins needs to be small in relation to the number of observations.
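The binning procedure can be sketched as follows (Python; we use the climatological forecaster of Example 3.1 and bin on the parameter μ of the perfect forecaster, with quartile bins as our own choice):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
N = 50_000

# Example 3.1 setup: mu ~ N(0, 1) and Y | mu ~ N(mu, 1);
# the climatological forecaster predicts F2 = N(0, 2)
mu = rng.normal(size=N)
y = mu + rng.normal(size=N)

# PIT of the climatological forecaster (continuous CDF, no randomization needed)
z2 = np.vectorize(NormalDist(0.0, np.sqrt(2.0)).cdf)(y)

# Bin the observations according to the predicted parameter of the perfect
# forecaster, lambda = mu, using the empirical quartiles as bin edges
edges = np.quantile(mu, [0.25, 0.5, 0.75])
bins = np.digitize(mu, edges)        # four bins Lambda_1, ..., Lambda_4

# Conditional mean PIT per bin; for a cross-calibrated forecaster each
# conditional PIT-histogram would be flat with mean 1/2
for b in range(4):
    print(b, round(z2[bins == b].mean(), 2))
```

The conditional PIT means drift far away from 1/2 across the μ-bins, so the climatological forecaster is clearly not cross-calibrated with respect to the perfect forecaster, consistent with Example 3.1.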
We illustrate these diagnostic tools with two examples. The first one has been proposed by Gneiting and Ranjan (2013, Example 2.4).

Forecaster       Predictive distribution            Information set
Perfect          N(μ, 1)                            σ(μ)
Climatological   N(0, 2)                            trivial
Unfocused        (1/2){N(μ, 1) + N(μ + τ, 1)}       σ(μ, τ)
Sign-reversed    N(−μ, 1)                           σ(μ)

Table 2: The four forecasters of Example 3.1.

Example 3.1. Let μ be standard normally distributed, which we denote by μ ∼ N(0, 1). Conditional on μ, the outcome is Y ∼ N(μ, 1). Let τ take the values 1 or −1 with equal probability, independent of Y and μ. We consider four forecasters F_1, …, F_4 of different skill, whose properties are summarized in Table 2. It is clear that the perfect forecaster F_1 is cross-calibrated with respect to F_1, F_2, F_3, F_4. It is straightforward to check that the climatological forecaster F_2 is not cross-calibrated with respect to any of F_1, F_3, F_4, but it is with respect to itself. As F_2 is deterministic, this corresponds to the fact that F_2 is ideal with respect to the trivial σ-algebra. As the sign-reversed forecaster F_4 is not probabilistically calibrated, it cannot be cross-calibrated. The cross-calibration of F_3 with respect to F_1, F_2, F_4 is shown in Appendix A. The statements about marginal cross-calibration in Table 2 are consequences of Theorem 2.9.
In Figure 2 the differences given at (4) are plotted. More precisely, the random variables in (4) are simulated 10'000 times and the mean is given. Recall that for all simulation examples we are using independent forecast-observation tuples for reasons of simplicity. In this example, it is easy to see that F_1 is superior to F_2 using the notion of marginal cross-calibration, which was not the case using only the calibration notions of Gneiting and Ranjan (2013, Definition 2.6).
As an example of checking cross-calibration empirically, we note that all four forecasters lie in a common parametric class of distribution functions, so that each forecaster can be identified with the parameters she predicts. We plotted the PIT-histograms of Z_2 and Z_3 conditional on the four bins μ ∈ I_j for 1 ≤ j ≤ 4, where I_1, …, I_4 partition the real line. The histograms in Figure 1 confirm that F_3 is probabilistically cross-calibrated with respect to F_1, F_2, F_4. On the other hand, F_2 is clearly not probabilistically cross-calibrated with respect to any set of the other forecasters.
Example 3.2 (Example 2.5 continued). Coming back to the forecasters F_1 and F_2 of Example 2.5, Theorem 2.9 implies that F_1 is marginally cross-calibrated with respect to itself and with respect to F_2. Furthermore, F_2 is marginally cross-calibrated with respect to itself, but F_2 is not marginally cross-calibrated with respect to F_1. Marginal cross-calibration plots for this scenario using 10'000 and 100'000 simulations are given in Figure 3. In this example, the lack of marginal cross-calibration can only be detected for an unrealistically large number of observations. PIT-histograms for assessing cross-calibration of F_1 with respect to F_2 and of F_2 with respect to F_1 for 10'000 simulations are given in Figure 4. The partition of the parameter space is chosen such that each histogram contains approximately the same number of observations. The lack of cross-calibration of F_2 with respect to F_1 is clearly detected.

Binary outcomes
In this section we consider the case when the observation Y takes only two values, zero and one. We interpret Y = 1 as a success and Y = 0 as a failure. A forecaster F is then represented by her predictive success probability p, such that the predictive CDF is

F(y) = (1 − p) 1{y ≥ 0} + p 1{y ≥ 1}.

In the case of an individual forecaster F for a binary outcome, it has been shown in Gneiting and Ranjan (2013, Theorem 2.11) that the notions of a probabilistically calibrated forecaster F and an ideal forecaster relative to the σ-algebra generated by the predictive probability p are equivalent. Furthermore, both notions coincide with the notion of conditional calibration, that is, Q(Y = 1 | p) = p. This result carries over to the notions of cross-calibration of multiple forecasters introduced in this paper. As the notions of calibration are essentially only concerned with one prediction period, we have chosen to present the results of this section in the one-period prediction space setting of Definition 2.1 for simplicity.

Figure 2: Marginal cross-calibration plots of the forecasters in Example 3.1 with 10'000 simulations. In the i-th row and j-th column the empirical version of Equation (4) is plotted to assess whether F_i is marginally cross-calibrated with respect to F_j or not.
Theorem 4.1. Consider the one-period prediction space setting with binary outcome Y and forecasts F_1, …, F_k represented by their predictive success probabilities p_1, …, p_k, respectively. Then the following statements are equivalent:

1. The forecast p_1 is cross-calibrated with respect to p_2, …, p_k, that is, L(Z_{p_1} | p_2, …, p_k) is standard uniform.

2. The forecast p_1 is cross-calibrated with respect to p_1, p_2, …, p_k, that is, L(Z_{p_1} | p_1, …, p_k) is standard uniform.

3. The forecast p_1 is conditionally cross-calibrated, that is, Q(Y = 1 | p_1, …, p_k) = p_1, almost surely.
The proof of Theorem 4.1 parallels the proof of Gneiting and Ranjan (2013, Theorem 2.11). The following lemma gives a formula for the density function of Z_{p_1} conditional on p = (p_1, …, p_k).

Lemma 4.2. The density function of Z_{p_1} conditional on p = x = (x_1, …, x_k) is given by

f(z | x) = {Q(Y = 0 | p = x)/(1 − x_1)} 1{0 < z < 1 − x_1} + {Q(Y = 1 | p = x)/x_1} 1{1 − x_1 < z < 1}.

Proof of Lemma 4.2. The PIT of p_1 is

Z_{p_1} = V (1 − p_1) 1{Y = 0} + {(1 − p_1) + V p_1} 1{Y = 1},

so that, conditional on p = x, the PIT is uniform on (0, 1 − x_1) if Y = 0 and uniform on (1 − x_1, 1) if Y = 1. Differentiation of the conditional CDF yields the claim.
Proof of Theorem 4.1. It is easy to see that part two is equivalent to part three. By Theorem 2.9, part three implies part one. The remaining task is to prove that part one implies part two. Let H = p(Q) be the marginal law of the random vector p under Q.
We know that Q(Z_{p_1} = 1 | p_2 = x_2, …, p_k = x_k) = 0, because L(Z_{p_1} | p_2, …, p_k) is standard uniform. Using that L(Z_{p_1} | p_2, …, p_k) is standard uniform together with Lemma 4.2, one obtains, for almost all z ∈ (0, 1) and δ ∈ (0, 1 − z), a system of equations for the conditional success probabilities, where H_1 = p_1(Q) denotes the marginal law of p_1 under Q. Defining, for given (x_2, …, x_k) ∈ [0, 1]^{k−1}, a suitable signed measure and showing that it vanishes identically, one concludes that Q(Y = 1 | p_1, …, p_k) = p_1 almost surely, which is part two.

The cross-calibration notion of Feinberg and Stewart (2008) is analogous to our notion of cross-calibration with respect to {1, …, k}, which is equivalent to being cross-ideal for binary events. Theorem 4.1 shows that both notions coincide with cross-calibration of p_1 with respect to {2, …, k}, which is a priori a weaker requirement. As noted by Gneiting and Ranjan (2013), the fact that probabilistically calibrated forecasters for binary events are automatically ideal clarifies the relation between PIT-histograms and calibration curves, which are the diagnostic tool frequently used for assessing calibration of binary predictions (Dawid, 1986; Murphy and Winkler, 1992; Ranjan and Gneiting, 2010). As described in Section 3, cross-calibration can be assessed with conditional PIT-histograms. Analogously, in the case of binary forecasts, conditional calibration curves can be considered.
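For binary outcomes the PIT takes the simple piecewise-uniform form of Lemma 4.2. The following sketch (Python; the conditionally calibrated forecaster is our own toy example) confirms that the PIT is standard uniform when Q(Y = 1 | p) = p:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 200_000

# Conditionally calibrated binary forecaster: p ~ U(0.2, 0.8), Y | p ~ Bernoulli(p)
p = rng.uniform(0.2, 0.8, N)
yv = rng.random(N) < p
v = rng.random(N)

# Randomized PIT for a binary outcome (Lemma 4.2):
# uniform on (0, 1 - p) if Y = 0, and uniform on (1 - p, 1) if Y = 1
z = np.where(yv, (1 - p) + v * p, v * (1 - p))

# A standard uniform variable has mean 1/2 and variance 1/12
print(round(z.mean(), 3), round(z.var(), 3))
```

Mixing the two conditional uniforms with weights 1 − p and p reproduces the standard uniform distribution, which is the content of the equivalence in Theorem 4.1.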

Tests for assessing cross-calibration
In this section we consider statistical tests for cross-calibration. The tests in Section 5.1 are based on the idea of conditional exceedance probabilities (Mason et al., 2007), whereas the tests in Section 5.2 use a linear regression approach. Finally, the score regression approach by Held et al. (2010) to test for ideal forecasters is reviewed and extended to a test for cross-ideal forecasters in Section 5.3.

Conditional exceedance probabilities
Suppose we have observations F 1,t , . . . , F k,t and Y t+1 , 1 ≤ t ≤ N , in a prediction space for serial dependence. We would like to test the null hypothesis that F 1,t is cross-calibrated with respect to J ⊂ {1, . . . , k}. For z ∈ [0, 1), we define B z,t := 1{Z 1,t ≤ z}. Under the null hypothesis, using Proposition 2.10 and Theorem 2.11, conditional on F i,t for i ∈ J, the random variables B z,1 , . . . , B z,N are independent Bernoulli random variables with success probability z. We stipulate the logistic regression models at (5) for each z ∈ [0, 1), where logit(p) = log{p/(1 − p)} is the logistic function. Using (5), the null hypothesis is given at (6). For each z ∈ [0, 1), we suggest testing the pointwise hypothesis by a likelihood ratio test yielding a p-value π(z). More precisely, the covariate vector x z,t has one as its first entry, followed by F −1 i,t (z), i ∈ J, and the parameter vector β z has entries β 0,z and β i,z , i ∈ J. For values of z close to zero or one, we frequently encounter the phenomenon of separation, that is, the likelihood converges, but at least one parameter estimate is infinite. Therefore, we have chosen to use the method of Firth (1993), which always yields finite parameter estimates; see Heinze and Schemper (2002). That is, we fit the parameters β 0,z , β i,z , i ∈ J, by maximizing the penalized log-likelihood function, where |I(β z )| denotes the determinant of the Fisher information matrix. We denote the estimated parameter vector by β̂ z with entries β̂ 0,z , β̂ i,z , i ∈ J. For N large enough, the likelihood ratio test statistic T z of the pointwise null hypothesis β z = γ z has approximately a χ 2 -distribution with 1 + |J| degrees of freedom, where |J| denotes the cardinality of J and γ z = (logit(z), 0, . . . , 0) T . We define the p-value π(z) = 1 − χ 2 1+|J| (T z ), where χ 2 1+|J| denotes the cumulative distribution function of a χ 2 random variable with 1 + |J| degrees of freedom. For the simulation studies below and the data analysis in Section 6 we have used the R package of Heinze et al. (2013) to calculate T z .
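As a minimal illustration of the pointwise test, the sketch below implements a plain (unpenalized) likelihood ratio test for a single conditioning forecaster, so the null has 2 degrees of freedom and the χ² survival function has the closed form exp(−T/2). The paper's actual implementation uses Firth's penalized likelihood via the R package of Heinze et al. (2013); all function names and the simulated scenario here are ours.

```python
import math
import random

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cep_pointwise(B, q, z, iters=30):
    """Unpenalized likelihood ratio test of the pointwise null
    beta_z = (logit(z), 0) in logit P(B_t = 1) = b0 + b1 * q_t,
    where q_t = F_t^{-1}(z). Returns (T_z, p-value) with df = 2."""
    # log-likelihood under the null (no free parameters)
    ll0 = sum(b * math.log(z) + (1 - b) * math.log(1.0 - z) for b in B)
    # Newton-Raphson for the two-parameter maximum likelihood estimate
    b0, b1 = math.log(z / (1.0 - z)), 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for b, x in zip(B, q):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            g0 += b - mu
            g1 += (b - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    ll1 = sum(b * (b0 + b1 * x) - math.log(1.0 + math.exp(b0 + b1 * x))
              for b, x in zip(B, q))
    T = 2.0 * (ll1 - ll0)
    return T, math.exp(-T / 2.0)  # chi-square survival function, df = 2

# Ideal forecaster N(mu_t, 1) for Y_t = mu_t + noise: PIT values are uniform
random.seed(7)
z, N = 0.5, 200
mus = [random.gauss(0.0, 1.0) for _ in range(N)]
ys = [m + random.gauss(0.0, 1.0) for m in mus]
B = [1 if phi_cdf(y - m) <= z else 0 for y, m in zip(ys, mus)]  # B_{z,t}
q = mus                    # F_t^{-1}(0.5) = mu_t for the forecast N(mu_t, 1)
T, p = cep_pointwise(B, q, z)
```

Because the forecaster in this simulation is ideal, T behaves like a χ² variable with two degrees of freedom and p like a uniform p-value.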
In order to draw conclusions about the global null hypothesis H 0 at (6) from the pointwise p-values π(z), we adjust them for multiple testing. We follow the approach of Cox and Lee (2008) to use the method of Westfall and Young (1993, Chapter 2) for functional data to compute adjusted p-values r(z); see also Meinshausen et al. (2011).
Let 0 < z 1 < · · · < z M < 1. Under the null hypothesis of cross-calibration, it is possible to simulate a vector of p-values (π * (z 1 ), . . . , π * (z M )) with the same distribution as (π(z 1 ), . . . , π(z M )), conditional on F −1 i,t (z m ), i ∈ J, 1 ≤ t ≤ N , 1 ≤ m ≤ M , as follows. Let U 1 , . . . , U N be iid standard uniform random variables. For 1 ≤ m ≤ M , define B * zm,t = 1(U t ≤ z m ), and let π * (z m ) be the p-value from the pointwise likelihood ratio test for the simulated data vector (B * zm,t ) 1≤t≤N and covariates (x zm,t ) 1≤t≤N as before. The adjusted p-values can now be obtained as follows. Let σ be the permutation of {1, . . . , M } such that π{z σ(1) } ≤ · · · ≤ π{z σ(M ) }. This permutation σ remains unchanged in the following procedure. For a simulated vector of p-values (π * (z 1 ), . . . , π * (z M )), we define q * m = min{π * {z σ(s) } : s ≥ m}. Repeating this procedure L times, we obtain an array (q * m,l ) 1≤m≤M,1≤l≤L and define the adjusted p-values r 1 , . . . , r M corresponding to z 1 , . . . , z M by setting r σ(m) equal to the proportion of indices l with q * m,l ≤ π{z σ(m) }, enforcing monotonicity r σ(1) ≤ · · · ≤ r σ(M ) . The global null hypothesis H 0 at (6) can be rejected at level α ∈ (0, 1) if min{r m : 1 ≤ m ≤ M } ≤ α. Furthermore, the adjusted p-values allow us to identify the values z m ∈ (0, 1) at which miscalibration occurs. For example, a prediction method may perform satisfactorily in the left tail of the distribution, that is, for z close to zero the adjusted p-values are large, whereas it fails to capture the right tail, so for z close to one the adjusted p-values are small. We call this test the CEP test with respect to J.
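The adjustment step can be sketched as follows; this implements the standard Westfall–Young min-p algorithm described in the text, with a function name and data layout of our choosing.

```python
def adjusted_pvalues(p_obs, p_null):
    """Westfall-Young (min-p) adjusted p-values.
    p_obs:  list of M pointwise p-values pi(z_1), ..., pi(z_M).
    p_null: list of L vectors of pointwise p-values simulated under H0.
    Returns adjusted p-values r_1, ..., r_M in the original order."""
    M, L = len(p_obs), len(p_null)
    order = sorted(range(M), key=lambda m: p_obs[m])  # permutation sigma
    counts = [0] * M
    for sim in p_null:
        # successive minima q*_m = min{ pi*(z_sigma(s)) : s >= m }
        q, running = [0.0] * M, float("inf")
        for m in range(M - 1, -1, -1):
            running = min(running, sim[order[m]])
            q[m] = running
        for m in range(M):
            if q[m] <= p_obs[order[m]]:
                counts[m] += 1
    # raw resampling p-values, with monotonicity enforced along sigma
    r_sorted, running = [c / L for c in counts], 0.0
    for m in range(M):
        running = max(running, r_sorted[m])
        r_sorted[m] = running
    r = [0.0] * M
    for m in range(M):
        r[order[m]] = r_sorted[m]
    return r
```

For instance, with two grid points and four simulated null vectors, `adjusted_pvalues([0.01, 0.5], [[0.3, 0.4], [0.6, 0.7], [0.05, 0.9], [0.2, 0.1]])` returns `[0.0, 0.5]`: none of the four simulated minima fall below 0.01, while two of the four values at the second position fall below 0.5.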
Remark. It is important to note that the adjusted p-values r m remain the same if the pointwise p-values π(z) are transformed with a strictly monotone transformation. Therefore, even if the π(z) are only asymptotic p-values, the adjusted p-values r m will control the familywise error rate at the desired level α even for finite samples (for large numbers L of bootstrap replications); see Westfall and Young (1993, Chapter 2) and Cox and Lee (2008). The choice of test statistic for the pointwise tests nevertheless matters, as the power of the overall test crucially depends on the power of the pointwise tests.
For data examples, L should be much larger. However, for analysing the performance of the resampling-based p-values, it is more important to run a large number of simulations than to have a large bootstrap sample for each of them; see Westfall and Young (1993) for a more detailed discussion. The results are given in Table 3. Conditioning on F 2 corresponds to conditioning on the trivial σ-algebra; therefore, testing conditional on F 1 , F 2 , F 3 is the same as testing conditional on F 1 , F 3 , for example. Hence, Table 3 contains all interesting subsets of F 1 , . . . , F 4 , and the column entitled 'F 2 ' corresponds to a test for probabilistic calibration. The test performs well, even for the small sample size N = 50. Generally, the power of the test appears to increase the more information is used. For example, the test has difficulty detecting that F 3 is not ideal with respect to itself, but it performs well in rejecting the null hypothesis that F 3 is cross-ideal with respect to F 1 , F 2 , F 3 , F 4 or F 1 , F 3 , F 4 .
Example 5.2 (Example 2.5 continued). We applied the CEP tests to data simulated from the prediction space described in Example 2.5. We used the same grid and other parameters as in the previous example, except that we considered two different sample sizes N = 50 and N = 200. The results from 10 000 simulations can be seen in Table 4. Here, the power for sample size N = 50 is low. Fortunately, it appears to increase rapidly with the sample size and is satisfactory for N = 200.

Linear regression approach
To formulate the linear regression approach (LRA) tests for cross-calibration, we restrict ourselves to a parametric class of cumulative distribution functions F = {F λ | λ ∈ Λ}, where Λ ⊂ R d . Suppose we have N forecast-observation tuples (F 1,t , . . . , F k,t , Y t , V t ), 1 ≤ t ≤ N , in a prediction space for serial dependence such that F i,t ∈ F for all i and t. Each forecaster F i,t is then represented by its predictive parameter vector λ i,t = (λ i,t (1) , . . . , λ i,t (d) ) ∈ Λ. We want to test the hypothesis that F 1,t is cross-calibrated with respect to some J = {i 1 , . . . , i m } ⊂ {1, . . . , k}, for 1 ≤ t ≤ N . Proposition 2.10 and Theorem 2.11 lead to the null hypothesis that {Φ −1 (Z 1,t )} 1≤t≤N ∼ N N (0, I N ), where Φ −1 denotes the quantile function of a standard normal distribution and N N (0, I N ) denotes a multivariate standard normal distribution. In order to test this hypothesis we perform an F-test based on linear regression. We consider the linear model at (8), where the response vector contains the values Φ −1 (Z 1,t ), the design matrix D J at (9) is built from the predictive parameters λ i,t , i ∈ J, β = (β 0 , β 1 , . . . , β dm ) T ∈ R 1+dm is the parameter vector we would like to estimate, and ε ∈ R N is a random error vector, which is multivariate standard normal under the null hypothesis. The parameter vector β is estimated by least squares, yielding β̂. To test the assumption that ε is standard normal one can use a normality test such as the Anderson-Darling or Shapiro-Wilk test (Anderson and Darling, 1954; Shapiro and Wilk, 1965; Yap and Sim, 2011). This yields a p-value π normal . To test the assumption that β = 0 we consider the test statistic F 0 , whose denominator contains the unbiased variance estimator. The test statistic F 0 has a Fisher distribution with 1 + dm and N − 1 − dm degrees of freedom; see for example Montgomery et al. (2001).
The p-value is π F = 1 − F 1+dm,N −1−dm (F 0 ), where F p,q denotes the Fisher cumulative distribution function with p and q degrees of freedom. Combining the two tests by the method of Holm leads to the adjusted p-value π adjust = 2 min{π F , π normal }.
We need that rank(D J ) = 1 + dm, as otherwise the regression analysis is not possible. Therefore, every forecaster F i,t , i ∈ J, has to predict at least two distinct values for each parameter. Otherwise, we omit that parameter for this forecaster in the model and are still able to use the test, which we call the LRA test with respect to J.
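A minimal sketch of the LRA test statistic for one forecaster with a single predictive parameter (d = m = 1), under our reading that the null model fixes all regression coefficients at zero; the standard normal quantile is computed by bisection so the sketch is dependency-free, and the F-distribution p-value is omitted. All names are ours.

```python
import math
import random

def phi_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p):
    """Standard normal quantile function by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if phi_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def lra_f_statistic(Z, lam):
    """Overall F-type statistic for H0: all coefficients are zero in
    Phi^{-1}(Z_t) = beta_0 + beta_1 * lambda_t + eps_t (d = m = 1).
    The null model has no free parameters, so the numerator compares
    the raw sum of squares with the residual sum of squares."""
    y = [phi_inv(z) for z in Z]
    N = len(y)
    sx, sy = sum(lam), sum(y)
    sxx = sum(x * x for x in lam)
    sxy = sum(x * v for x, v in zip(lam, y))
    det = N * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (N * sxy - sx * sy) / det
    sse = sum((v - b0 - b1 * x) ** 2 for x, v in zip(lam, y))
    tss0 = sum(v * v for v in y)   # residual sum of squares at beta = 0
    num_df, den_df = 2, N - 2      # 1 + dm and N - 1 - dm with dm = 1
    return ((tss0 - sse) / num_df) / (sse / den_df)

# Ideal Gaussian forecaster N(mu_t, 1): Phi^{-1}(Z_t) is iid N(0, 1)
random.seed(5)
N = 200
mus = [random.gauss(0.0, 1.0) for _ in range(N)]
Z = [phi_cdf(random.gauss(m, 1.0) - m) for m in mus]
F0 = lra_f_statistic(Z, mus)
```

Under the null, F0 behaves approximately like a Fisher distributed variable with 2 and N − 2 degrees of freedom.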
Example 5.3 (Example 3.1 continued). Recall the forecasters F 1 , F 2 , F 3 and F 4 from Example 3.1. All four forecasters are in the class of distribution functions F = {F λ | λ = (µ, σ, τ ) ∈ R × (0, ∞) × {−1, 0, 1}} for F λ = (1/2){N (µ, σ) + N (µ + τ, σ)}. We apply the LRA test for all combinations of forecasters for sample sizes N = 20 and N = 50. The Monte-Carlo powers of π adjust for 10 000 simulations are given in Table 5. We only list the Monte-Carlo powers with respect to individual forecasters, since we omit some parameters to obtain a design matrix with full rank. Therefore, testing cross-calibration with respect to J ⊂ {1, 2, 3, 4} leads to the same test as testing with respect to F 3 if 3 ∈ J, and testing with respect to F 1 otherwise. For testing standard normality, we have used an Anderson-Darling test (with mean set to zero and variance set to one). In the cases of cross-calibration, the normality test never rejects the null hypothesis, which explains the conservative levels of around 0.025 in these cases. The test is powerful even for the small sample sizes and provides the results expected from the theoretical considerations; see Table 2. In particular, the LRA test detects well that F 3 is not ideal with respect to itself, contrary to the CEP test; compare Table 3.
Example 5.4 (Example 2.5 continued). Coming back to forecasters F 1 and F 2 from Example 2.5, we perform the F-tests for different sample sizes N . The Monte-Carlo powers of the tests for 10 000 simulations can be found in Table 4. We do not report the power of the LRA test in this example because the Anderson-Darling test for standard normality almost never rejects the null hypothesis.

Score regression approach
Held et al. (2010) suggest a significance test for ideal forecasters based on scoring rules. They use the continuous ranked probability score (CRPS) (Gneiting et al., 2005) and the Dawid-Sebastiani score (DSS) (Dawid and Sebastiani, 1999). Their approach relies on independent forecast-observation tuples, and this restriction remains when generalizing their approach to a test for cross-ideal forecasters. Therefore, throughout this section we work in a one-period prediction space.
First, we recall some preliminaries on the CRPS and the DSS. Let F and f denote the predictive CDF and the predictive density function of a forecaster, respectively. Let µ and σ 2 be the predictive mean and variance, respectively. The observed value of Y is denoted by y. The CRPS is given by CRPS(F, y) = ∫ {F (x) − 1(y ≤ x)} 2 dx, and the DSS by DSS(F, y) = (1/2) log(σ 2 ) + ỹ 2 , where ỹ = (y − µ)/σ. For a forecaster predicting a normal distribution F = N (µ, σ 2 ), the CRPS turns out to be CRPS(F, y) = σ[ỹ{2Φ(ỹ) − 1} + 2φ(ỹ) − 1/√π], where Φ and φ are the CDF and the density of a standard normal distribution, respectively. The CRPS is a strictly proper scoring rule relative to the class of probability measures with finite first moments. For a normal prediction the DSS is the same as the classical logarithmic score LS(f, y) = − log{f (y)} up to a constant. The DSS is a proper scoring rule relative to the class of probability measures with finite second moments. It is strictly proper relative to any class of probability measures that are characterized by their first two moments, such as Gaussian measures or other location-scale families of distributions.
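The closed-form CRPS for a normal predictive distribution can be cross-checked against direct numerical integration of the defining integral; a sketch with function names of our choosing:

```python
import math

def std_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for F = N(mu, sigma^2)."""
    t = (y - mu) / sigma
    return sigma * (t * (2.0 * std_cdf(t) - 1.0)
                    + 2.0 * std_pdf(t) - 1.0 / math.sqrt(math.pi))

def crps_numeric(mu, sigma, y, lo=-12.0, hi=12.0, n=20000):
    """CRPS(F, y) as the integral of {F(x) - 1(y <= x)}^2 (midpoint rule)."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        step = 1.0 if y <= x else 0.0
        total += (std_cdf((x - mu) / sigma) - step) ** 2 * h
    return total

closed = crps_normal(1.0, 2.0, 0.5)
numeric = crps_numeric(1.0, 2.0, 0.5)
```

For F = N(0, 1) and y = 0 both routes give 2φ(0) − 1/√π ≈ 0.2337, consistent with the CRPS identity E|Y − y| − (1/2)E|Y − Y'|.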
We assume now that the mean and variance of the predictive distribution F match the mean and variance of the outcome Y . The following properties of the CRPS and the DSS can be found in Held et al. (2010). For the DSS we obtain E{DSS(F, Y )} = (1/2) log(σ 2 ) + 1, so the expected score depends on σ only through log(σ). If the distribution of Y has a finite fourth moment, then var{DSS(F, Y )} is a constant that does not depend on µ or σ. If Y has a normal distribution, then var{DSS(F, Y )} = 1/2. Similar results for the CRPS are harder to obtain. Held et al. (2010) show the following lemma.
Lemma 5.5. Let X 0 be a random variable with finite second moment. For a ∈ R and b > 0, let Y = a + bX 0 , let F be the CDF of Y , and σ 2 its variance. Then E{CRPS(F, Y )} = dσ and var{CRPS(F, Y )} = Dσ 2 , where the constants d and D depend only on the distribution of X 0 ; in particular, d = E|X 0 − X' 0 |/{2 sd(X 0 )} with X' 0 an independent copy of X 0 .
The lemma shows that for location-scale families of distributions the expected CRPS of an ideal forecast is proportional to the predictive standard deviation σ, and the variance of the CRPS is proportional to the predictive variance σ 2 . For the family of normal distributions we have d = 1/ √ π and D = {1/3 − (4 − √ 12)/π} ≈ 0.16275. The constants for other families can be calculated at least numerically.
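The constants for the normal family can be checked by a seeded Monte Carlo simulation of an ideal normal forecast; a sketch reusing the closed-form CRPS from above:

```python
import math
import random

def std_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crps_normal(mu, sigma, y):
    """Closed-form CRPS for F = N(mu, sigma^2)."""
    t = (y - mu) / sigma
    return sigma * (t * (2.0 * std_cdf(t) - 1.0)
                    + 2.0 * std_pdf(t) - 1.0 / math.sqrt(math.pi))

# Ideal normal forecast: score Y ~ N(mu, sigma^2) with F = N(mu, sigma^2)
random.seed(42)
mu, sigma, n = 1.0, 2.0, 200000
scores = [crps_normal(mu, sigma, random.gauss(mu, sigma)) for _ in range(n)]
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / (n - 1)
d_hat = mean / sigma        # should be close to 1/sqrt(pi) ~ 0.5642
D_hat = var / sigma ** 2    # should be close to 0.16275
```

The estimates recover d = 1/√π and D = 1/3 − (4 − √12)/π up to Monte Carlo error.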
For the score regression approach, we consider N independent and identically distributed observations (F 1,n , . . . , F k,n , Y n , V n ), 1 ≤ n ≤ N , in the prediction space setting; see Definition 2.1. The expectation of the DSS depends on the logarithm of the predictive standard deviation; see equation (10). Therefore, we stipulate a regression model of the form DSS(F 1,n , Y n ) = a + b 1 log(σ 1,n ) + · · · + b k log(σ k,n ) + ε n , where σ i,n is the predictive standard deviation of F i,n and ε n is an independent error with mean zero. Since the variance of the DSS is constant, irrespective of the predictive variance, we can use a homoscedastic regression model to compute the least squares estimators â, b̂ 1 , . . . , b̂ k . In the case k = 1 this is the same model as proposed in Held et al. (2010, eq. (7)). We need to assume that the scores have finite variance, which is fulfilled if Y has a finite fourth moment (conditional on A 1 ).
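Since the regression equation itself is not displayed above, the following sketch illustrates the k = 1 case under our assumption that the realized DSS values are regressed on log σ. With the DSS as defined in this section, an ideal normal forecaster has E{DSS} = log(σ) + 1, so both fitted coefficients should be near one; all names are ours.

```python
import math
import random

def dss(mu, sigma, y):
    """Dawid-Sebastiani score as defined in this section."""
    t = (y - mu) / sigma
    return 0.5 * math.log(sigma ** 2) + t ** 2

# Ideal forecaster with varying predictive spread sigma_n
random.seed(3)
N = 5000
sigmas = [math.exp(random.uniform(-1.0, 1.0)) for _ in range(N)]
ys = [random.gauss(0.0, s) for s in sigmas]
scores = [dss(0.0, s, y) for s, y in zip(sigmas, ys)]
xs = [math.log(s) for s in sigmas]

# ordinary least squares of the realized score on log(sigma)
sx, sy = sum(xs), sum(scores)
sxx = sum(x * x for x in xs)
sxy = sum(x * v for x, v in zip(xs, scores))
det = N * sxx - sx * sx
a_hat = (sxx * sy - sx * sxy) / det   # intercept, ~ 1 for an ideal forecaster
b_hat = (N * sxy - sx * sy) / det     # slope on log(sigma), ~ 1
```

A departure of the fitted coefficients from these values is the kind of signal the SRA test formalizes.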
We have var{CRPS(F 1,n , Y n )} ∝ σ 1,n and use a weighted regression analysis with weights 1/σ 1,n to obtain estimators ĉ, d̂ 1 , . . . , d̂ k ; see for example Montgomery et al. (2001). Both of these models can be used for testing whether the forecaster F 1 is cross-ideal with respect to A 1 = σ(F 1 ), . . . , A k = σ(F k ) in the case of a normal forecaster F 1 . The DSS can also be used if the prediction is non-normal, as emphasized in Held et al. (2010). The CRPS model is useful for any location-scale family of distributions.
If k = 1, that is, in the case of just one forecaster, we obtain the test for an ideal forecaster suggested in Held et al. (2010). We call the tests presented in this section SRA tests, as they are based on a score regression approach (SRA). As already noted by Held et al. (2010), SRA tests can only be used if each forecaster predicts at least two different variances; therefore, we cannot apply them to Example 3.1 of Gneiting and Ranjan (2013). Instead, we consider the following setup for illustration.
The tests are performed at significance level α = 0.05. The results are in accordance with the theoretical considerations. However, the Monte-Carlo power of the test whether F 2 is cross-ideal with respect to F 1 , F 2 is higher than that of the test whether F 2 is ideal with respect to F 2 . It is interesting to see that taking F 1 into account helps to detect that F 2 is not ideal with respect to F 2 .
Example 5.7 (Example 2.5 continued). Considering again Example 2.5, we used the DSS test to assess whether the forecasters are cross-ideal. The results are shown in Table 8; the CRPS test cannot be used for F 2 since the forecast is not normal. As expected, the tests show that F 1 is ideal with respect to A 1 and also cross-ideal with respect to A 1 , A 2 . For a sample size of N = 200 the level of the test is kept reasonably well. The forecaster F 2 is ideal with respect to A 2 but fails to be cross-ideal with respect to A 1 , A 2 . The test shows good power already for a sample size of N = 100. However, for F 2 the test is slightly anticonservative even for a sample size of N = 500.

Table 7: Monte-Carlo powers for the CRPS tests with sample sizes N and 10 000 simulations, described in detail in Example 5.6.
Table 8: Monte-Carlo powers for the DSS tests with sample sizes N and 10 000 simulations, described in detail in Example 5.7.

Summary
We have presented three different approaches for testing cross-calibration: the CEP tests in Section 5.1, the LRA tests in Section 5.2 and the SRA tests in Section 5.3. The first two approaches allow testing for cross-calibration of F 1 with respect to any subset J ⊂ {1, . . . , k}, whereas the SRA tests only allow testing for F 1 being cross-ideal, which is equivalent to requiring that 1 ∈ J. The CEP test and the LRA test with respect to J = ∅ are tests for probabilistic calibration, that is, the classical hypothesis of uniformity and independence of PIT values. While the SRA tests require independent forecast-observation tuples, the CEP and the LRA tests are formulated in a prediction space for serial dependence, which is a scenario that is frequently encountered in practice; see also Section 6.
The CEP test has the advantage that it provides information concerning the parts of the distribution where miscalibration is detected (in terms of quantile levels); this is illustrated in Figures 5 and 6. It may be considered a disadvantage that the adjusted p-values are simulation based and depend on a grid 0 < z 1 < · · · < z M < 1 that has to be chosen. In simulations, the method has proved robust to the number M of grid points. In contrast, the p-values for the LRA test are given explicitly. However, the forecasters have to be described through a finite-dimensional parameter vector, and there are some restrictions concerning the predictive parameters, as it has to be ensured that the design matrix D J at (9) has full rank. For the forecasters of Example 3.1, the LRA test has overall better power than the CEP test; see Examples 5.1 and 5.3. The difference is minor, except for the hypothesis that the forecaster F 3 is ideal. Here, for sample size N = 50, the LRA test achieves a power of 0.734, whereas the CEP test only has a power of 0.168. For the forecasters in Example 2.5 the CEP test outperformed the LRA test; see Examples 5.2 and 5.4. In fact, for sample size N = 200, the power of the CEP test is more than three times higher than the power of the LRA test.
The following modifications of the CEP and the LRA tests are straightforward but unexplored. The logistic regression model in (5) can be replaced by any other regression model for a binary outcome variable for which it is possible to formulate a test of an analogous pointwise null hypothesis as given at (7). If forecasters choose their distributions from a parametric class of distributions, as assumed in the LRA approach, one could also regress the random variables B z,t = 1(Z 1,t ≤ z) on the predicted parameter values. In the LRA, the linear regression model stipulated at (8) can be replaced by some other regression model for a vector of real-valued outcomes.
We would like to remark that the CEP and the LRA tests are formulated in the prediction space setting for serial dependence and make use of condition (1). Deciding whether this assumption is justified in a given application context is sometimes a delicate matter. For example, if a forecaster i bases her predictions purely on intuition, then (1) is certainly justified. If a forecaster j uses a time series model for predictions, that is, predictions are exclusively derived from past data, then one may argue that assumption (1) fails and the CEP and LRA tests should only be applied with respect to sets J such that j ∉ J. It may be that some parameters of a predictive distribution are derived from past data, whereas others come from external sources such as expert opinion. Here, it could be argued that one should only regress on the latter type of parameters in the LRA tests and use a regression model in terms of these parameters for the CEP tests. A different point of view would be that the parameters based on past data are derived through a subjectively chosen model, so that after the fitting procedure they should rather be viewed as personal opinion of the forecaster than as information influencing the outcome. We will discuss condition (1) further in Section 6.
The SRA tests, based on the score regression approach, require independent forecast-observation tuples or, more precisely, independent sequences of realized score values CRPS(F 1,n , Y n ) or DSS(F 1,n , Y n ), 1 ≤ n ≤ N , which may be a weaker requirement. They are asymptotic tests that appear to work well for sample sizes of at least N = 100; see Tables 7 and 8. The SRA test with the CRPS works only for predictive distributions from one location-scale family, whereas the SRA test with the DSS requires only that the predictive distributions have finite fourth moments. In both cases, the predictive standard deviations have to differ for at least two observations. For the forecasters of Example 2.5, the SRA test with the DSS showed better power than the CEP test, so it is an interesting alternative despite the more restrictive assumptions; see Examples 5.2 and 5.7. In particular, for sample size N = 200 the SRA test had a power of 0.963 in detecting that F 2 is not cross-ideal with respect to F 1 , F 2 , whereas the CEP test had a power of 0.464.
In the case of independent forecast-observation tuples it is possible to derive a test for marginal cross-calibration by testing for mean zero in (4) for each y ∈ R. However, simulations show that the resulting asymptotic test has several problems in applications.
For completeness, we report these findings in Appendix B.

Data example
The Bank of England (BoE) predicts the inflation rate for every quarter by issuing a probabilistic forecast with a potentially asymmetric two-piece normal distribution with parameters µ ∈ R and σ 1 , σ 2 > 0 and density (12), f (y) = {2/(σ 1 + σ 2 )} φ{(y − µ)/σ 1 } for y ≤ µ and f (y) = {2/(σ 1 + σ 2 )} φ{(y − µ)/σ 2 } for y > µ, where φ denotes the standard normal density. The forecasts have been issued by the BoE's Monetary Policy Committee since February 1996 for the first quarter of 1996 and are publicly available online. The first quarter is from March to May, the second quarter from June to August, and so forth. Furthermore, there are forecasts available which were issued between February 1993 and May 1997; these were converted into density forecasts retrospectively. Until the first quarter of 2004, the forecasts were issued to predict RPIX inflation rates. Since the first quarter of 2004, inflation has been predicted and assessed in terms of percentage changes of the CPI over twelve months. The observed RPIX as well as the CPI inflation rates are available from the Office for National Statistics under codes CDKQ and D7G7, respectively. There is no simple transformation that converts an RPIX inflation rate into a CPI inflation rate and vice versa, so we have analysed the two data sets separately: RPIX inflation rate predictions from the first quarter of 1993 to the first quarter of 2004, and CPI inflation rate predictions from the first quarter of 2004 to the first quarter of 2015. In both cases we have 45 forecast-observation tuples. For further detail on the data set, see Gneiting and Ranjan (2011, Section 4.1). The BoE inflation forecasts have also been analysed previously, for example by Wallis (2003); Clements (2004); Mitchell and Hall (2005); Galbraith and van Norden (2012).
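The two-piece normal density can be checked numerically for internal consistency (unit mass, continuity at µ); a sketch with illustrative parameter values of our choosing:

```python
import math

def two_piece_normal_pdf(y, mu, s1, s2):
    """Two-piece normal: N(mu, s1^2) shape left of mu, N(mu, s2^2) right."""
    u = (y - mu) / (s1 if y <= mu else s2)
    return 2.0 / (s1 + s2) * math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

mu, s1, s2 = 2.0, 0.8, 1.5
# midpoint rule: the density should integrate to one
lo, hi, n = mu - 10.0 * s1, mu + 10.0 * s2, 40000
h = (hi - lo) / n
mass = sum(two_piece_normal_pdf(lo + (i + 0.5) * h, mu, s1, s2)
           for i in range(n)) * h
```

The common normalizing constant 2/(σ 1 + σ 2 ) makes the two half-normal pieces join continuously at µ while the total mass remains one.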
For both data sets, we compare the BoE predictions with a Gaussian autoregression (AR) of order one with rolling estimation window of length six quarters, which leads to Gaussian density forecasts. The prediction horizon we consider is one quarter. As discussed in Section 5.4, the CEP and LRA tests make use of condition (1). While we believe that the BoE forecasts can be assumed to satisfy (1), it is more debatable in the case of the AR forecasts. If one is not willing to believe that (1) holds in this case, one should only consider the CEP and the LRA tests with respect to the empty set, that is probabilistic calibration, and cross-calibration with respect to BoE. The conclusions we can draw about the quality of the forecasts remain essentially the same. Due to the serial dependence in the data, we do not apply the SRA tests.
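A hedged sketch of the benchmark just described: a one-step Gaussian density forecast from an AR(1) fitted by ordinary least squares on a rolling window of six quarters. The helper name and interface are ours.

```python
import math

def ar1_rolling_forecast(series, window=6):
    """One-step Gaussian density forecast N(mu, sigma^2) from an AR(1)
    y_t = alpha + beta * y_{t-1} + eps, fitted by OLS on the last
    `window` observations of the series."""
    w = series[-window:]
    x, y = w[:-1], w[1:]            # lagged pairs within the window
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(v * v for v in x)
    sxy = sum(a * b for a, b in zip(x, y))
    det = n * sxx - sx * sx
    alpha = (sxx * sy - sx * sxy) / det
    beta = (n * sxy - sx * sy) / det
    sse = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    sigma = math.sqrt(sse / max(n - 2, 1))
    mu = alpha + beta * w[-1]       # predictive mean for the next quarter
    return mu, sigma

mu_hat, sig_hat = ar1_rolling_forecast([0.0, 2.0, 3.0, 3.5, 3.75, 3.875])
```

For the exactly linear toy series above, y t = 2 + 0.5 y t−1 , the fit recovers the recursion and the predictive mean is 2 + 0.5 × 3.875 = 3.9375 with zero residual spread.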
First, we consider the CEP tests. The results for the BoE density forecasts can be seen in Figure 5 and the ones for the AR forecasts in Figure 6. In both plots the grid is z m = {1 + (148/149)m}/150 for 0 ≤ m ≤ 149, and 20 000 bootstrap replications are used to calculate the adjusted p-values under the null hypothesis. For the RPIX inflation rate forecasts in the top panel of Figure 5, the BoE forecast seems to be probabilistically calibrated and also cross-calibrated with respect to the AR forecast. It fails to be ideal, that is, cross-calibrated with respect to itself. As a theoretical consequence, by Theorem 2.9 it also fails to be cross-calibrated with respect to both the AR forecast and itself. The CEP test picks this up correctly and rejects the null hypothesis with respect to BoE, or with respect to BoE and AR, for some small exceedance probabilities between zero and 0.05. However, it should be remarked that the rejection region is small, allowing the tentative conclusion that the BoE forecast is not far from being ideal or cross-ideal. For the CPI inflation rate predictions in the bottom panel of Figure 5, the situation is different. Probabilistic calibration of the BoE forecast is rejected for exceedance probabilities between 0.13 and 0.26. Note that this result makes no use of assumption (1). Cross-calibration with respect to AR and with respect to BoE itself is also rejected in some parts of the region between 0.13 and 0.26. In this case, the CEP test is not able to pick up a failure of cross-calibration with respect to both AR and BoE, although this is a theoretical consequence of the lack of probabilistic calibration by Theorem 2.9. According to the CEP test, the AR forecast for the RPIX inflation rate is not probabilistically calibrated and therefore also not cross-calibrated, ideal or cross-ideal; see the top panel of Figure 6. For all tests, the forecaster fails in the region of exceedance probabilities below 0.4 and near one.
Cross-calibration with respect to BoE and AR is rejected for all exceedance probabilities with a very low p-value. While the overall conclusions remain the same for the CPI inflation rate forecasts, the situation is somewhat different; see the bottom panel of Figure 6. Cross-calibration with respect to BoE and with respect to AR and BoE is rejected for almost all exceedance probabilities. However, probabilistic calibration of the AR forecaster and cross-calibration with respect to itself is only rejected for some probabilities below 0.10 and above 0.80, indicating that the AR forecast might be superior to the BoE forecast for exceedance probabilities between 0.13 and 0.26.

Second, we consider the LRA tests. The parametric class F used for the tests is the class of two-piece normal distributions with parameters µ ∈ R, σ 1 > 0, σ 2 > 0 given at (12). We can perform the same tests as for the CEP. The corresponding p-values can be found in Tables 9 and 10. We also report whether the estimated regression parameters fail to be zero or the standard normality assumption for the residuals is violated. The results coincide with those from the CEP tests, but they do not show in which region of exceedance probabilities the forecasters fail. On the other hand, in this application the LRA tests are consistent with Theorem 2.9 in the sense that rejection of cross-calibration with respect to a smaller set implies rejection with respect to any superset.

Discussion
We have extended the prediction space setting of Gneiting and Ranjan (2013) to accommodate serially dependent forecasts, which are commonly encountered in practice. For prediction spaces with serial dependence, we have shown a refined version of the result of Diebold et al. (1998) on uniformity and independence of PIT values. It relies on condition (1), whose implications should be studied in greater detail. We have focussed on the case of one-step-ahead forecasts, as in the original result. As mentioned in Remark 2, an analogous result continues to hold for q-step-ahead forecasts. However, additional complications arise in testing for cross-calibration, which need further investigation in future research.
We have refined the notions of calibration to notions of cross-calibration and have provided powerful statistical tests for these properties, requiring minimal assumptions on the sequences of forecasts and observations. The characterization of cross-calibrated and cross-ideal forecasters in Proposition 2.10 sheds some light on the difference between ideal forecasters and probabilistically calibrated forecasters as discussed in Gneiting and Ranjan (2013). It is remarkable that with our approaches, testing for ideal forecasters is no more difficult than testing for probabilistic calibration, contrary to the doubts voiced in Gneiting and Ranjan (2013).
In order to optimize forecasting performance, it is natural to combine forecasts. Gneiting and Ranjan (2013) have proposed combination formulas and aggregation methods for several forecasters; see also Ranjan and Gneiting (2010). It would be interesting to consider under which conditions calibrated forecasters can be combined to yield cross-calibrated forecasts. Also, the more refined notions of cross-calibration in this paper may help to identify which forecasters to include in combination formulas and which ones do not add additional information about the future outcome. Finally, combining forecasts is only a good idea if the predictions are based on different information sets. If there is a forecaster that is cross-calibrated with respect to all forecasters, any combination of forecasts would compromise forecast quality.
Our approach may add another perspective on the concerns raised by Mitchell and Wallis (2011) about the principle to "Maximize sharpness subject to calibration" formulated by Murphy and Winkler (1987). In fact, the concept of cross-calibration allows one to assess the statistical compatibility of several forecasters with the observations. When considering calibration and sharpness, calibration concerns the interplay of one forecaster and the observation, whereas sharpness compares forecasters but makes no reference to observations. Possibly, the guiding principle should be modified to "Maximize sharpness subject to cross-calibration", which is a stronger requirement in terms of calibration and therefore gives somewhat less importance to sharpness.

B Testing for marginal cross-calibration
We consider two forecasters F 1 and F 2 within the prediction space setting. Our interest lies in S(y) = F 2 (y) − 1{F −1 2 (Z Y F 1 ) ≤ y}, y ∈ supp(Y ), where supp(Y ) denotes the support of the observation Y . We would like to test whether E Q S(y) = 0 for all y ∈ supp(Y ), because this is equivalent to marginal cross-calibration of F 1 with respect to F 2 ; cf. Definition 2.7.
We suppose that we have N independent and identically distributed observations (F 1,n , F 2,n , Y n , V n ) for 1 ≤ n ≤ N in a prediction space and define for each n, S n (y) = F 2,n (y) − 1{F −1 2,n (Z 1,n ) ≤ y}, y ∈ supp(Y ).
We pick a sequence y 1 < . . . < y m in the support of Y and define S n = (S n (y 1 ), . . . , S n (y m )) T and S̄ N = (1/N ) Σ N n=1 S n . Let Σ N = (1/N ) Σ N n=1 (S n − S̄ N )(S n − S̄ N ) T be the sample covariance matrix. If F 1 is marginally cross-calibrated with respect to F 2 , then E(S n ) = 0. Therefore, by standard arguments of probability theory, the test statistic T = N S̄ T N Σ −1 N S̄ N converges in distribution to χ 2 m , a chi-squared distribution with m degrees of freedom. We test the null hypothesis that E Q S(y) = 0 for all y ∈ supp(Y ) only through one particular finite-dimensional distribution of S(y). Therefore, the sequence y 1 , . . . , y m has to be chosen carefully. Simulations indicate that the level and power of the test are not much affected by the choice of y 1 , . . . , y m . However, for small sample sizes the number of grid points m should be rather small, as otherwise the sample covariance matrix may be singular and cannot be inverted to compute the test statistic. Another reason for singularity of the sample covariance matrix for small sample sizes may be the choice of a grid point y such that the probability that F −1 i {Z Y F j } ≤ y is small. Unfortunately, for an individual test case, different choices of y 1 , . . . , y m may lead to completely different p-values, which makes the test useless in practice. We illustrate these effects in the following example.
Example B.1. We consider the forecasters F 1 , . . . , F 4 and the observation Y from Example 3.1. Let N = 500 be the number of observations from (F i , F j , Y ) for each pair F i and F j with 1 ≤ i, j ≤ 4. The results in Table 11 show that the marginal cross-calibration test performs well overall, and the performance is relatively unaffected by the choice of different grid points y 1 , y 2 , . . . , y m . However, if we consider an increasing number of grid points for the same data set, the p-value changes substantially. This is illustrated in Figure 7 for five different simulated data sets with N = 500 and the null hypothesis that F 3 is marginally cross-calibrated with respect to F 4 .

Table 11: Monte-Carlo powers of the marginal cross-calibration tests for sample size N = 500 and grid points (y 1 , y 2 , . . . , y 9 ) = (−1.81, −1.19, −0.74, −0.36, 0.00, 0.36, 0.74, 1.19, 1.81), (y 1 , y 2 , y 3 , y 4 ) = (−1.19, −0.35, 0.35, 1.19) and (y 1 , y 2 , y 3 ) = (−0.95, 0, 0.95), respectively, for the first, second and third table. The value in the i-th row and j-th column is the percentage of rejections of the null hypothesis that F i is marginally cross-calibrated with respect to F j at level α = 0.05 for the forecasters F 1 , F 2 , F 3 and F 4 in Example B.1 in 10 000 simulations.
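For m = 2 grid points, the statistic T and its χ² p-value can be computed in closed form, since the 2 × 2 sample covariance matrix inverts explicitly and the χ² survival function with two degrees of freedom is exp(−T/2); a sketch with data simulated under the null (names ours):

```python
import math
import random

def marginal_test_2d(S):
    """T = N * Sbar' Sigma^{-1} Sbar for 2-dimensional vectors S_n,
    with the chi^2_2 p-value exp(-T/2)."""
    N = len(S)
    m0 = sum(s[0] for s in S) / N
    m1 = sum(s[1] for s in S) / N
    c00 = sum((s[0] - m0) ** 2 for s in S) / N
    c11 = sum((s[1] - m1) ** 2 for s in S) / N
    c01 = sum((s[0] - m0) * (s[1] - m1) for s in S) / N
    det = c00 * c11 - c01 * c01
    # quadratic form via the closed-form 2x2 inverse of Sigma
    T = N * (c11 * m0 * m0 - 2.0 * c01 * m0 * m1 + c00 * m1 * m1) / det
    return T, math.exp(-T / 2.0)

# mean-zero vectors S_n, as under marginal cross-calibration
random.seed(11)
S = [(random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)) for _ in range(500)]
T, p = marginal_test_2d(S)
```

With larger m the same quadratic form is computed with a general matrix inverse, which is exactly where a near-singular sample covariance matrix, as discussed above, makes the statistic unstable.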