Forecast dominance testing via sign randomization

We propose randomization tests of whether forecast 1 outperforms forecast 2 across a class of scoring functions. This hypothesis is of applied interest: while the prediction context often prescribes a certain class of scoring functions, it is typically hard to motivate a specific choice on statistical or substantive grounds. We investigate the asymptotic behavior of the test statistics under mild conditions, avoiding the need to assume particular dynamic properties of forecasts and realizations. The properties of the one-sided tests depend on a corresponding version of Anderson's inequality, which we state as a conjecture of independent interest. Numerical experiments and a data example indicate that the tests have good size and power properties in practically relevant situations.


Introduction
Forecasts of future events and quantities are essential across disciplines. At the same time, forecasts are notoriously imprecise and prone to bias, calling for methods to assess and compare the performance of imperfect predictions both theoretically and on the basis of empirical data. In the context of point forecasts, which we consider in this paper, the appropriate evaluation tool is that of a consistent scoring function [21]. A scoring function S ≡ S(x, y) assigns to each forecast x and realization y a real-valued score such that smaller scores correspond to better forecasts. Specifically, let φ be a real-valued functional defined on a class G of possible distributions G of y, such as the mean or a quantile of G. The scoring function is consistent (for φ relative to G) if S(φ(G), G) ≤ S(x, G) for every G ∈ G and forecast x; here S(x, G) = E_{y∼G} S(x, y). With a consistent scoring function the forecaster can do no better than predict the true functional value, which rewards honest reporting.
For a given functional φ, assumed fixed in the following, there generally exists a whole class of consistent scoring functions, or "scores." Characterizations of the respective score classes for various functionals may be found in, e.g., [5,19,21]. For example, all scores of the form S(x, y) = ϕ(y) − ϕ(x) − ϕ′(x)(y − x), with ϕ a convex function with subgradient ϕ′, are consistent for the mean functional [5,49]. In applied contexts, consistent alternatives to the special case ϕ(x) = x² of the squared error score were discussed in [22], [37], [50] and others for the binary case y ∈ {0, 1}, and in [43] for positive predictands y ∈ R₊.
The availability of an entire family of scoring functions that are all theoretically legitimate comes with the drawback that two scores may produce different forecast rankings even if both are consistent for the same target functional [16,17,44]. This lack of robustness is unsatisfactory as there are often no strong arguments for choosing a particular score. It is therefore natural to ask whether forecast rankings are stable across a family of scores. E.g., given two forecasts x_1, x_2, does x_1 dominate x_2 in the sense that it is superior with respect to all consistent scoring functions? In the case of a quantile or expectile functional the complexity of the problem can be much reduced by means of a Choquet representation: here every consistent score S can be represented as a mixture of "elementary" (in fact: extremal) scores S_θ, θ ∈ R. That is, for every S there is a nonnegative Borel measure M on R such that S(x, y) = ∫ S_θ(x, y) dM(θ) [16]. The form of the elementary scores depends on the specific functional being studied; for example, in case of the mean functional

S_θ(x, y) = |y − θ| if min(x, y) ≤ θ < max(x, y), and S_θ(x, y) = 0 else. (1.1)

The Choquet representation makes it possible to reduce dominance with respect to all consistent scoring functions to dominance with respect to the linearly indexed family of elementary scores {S_θ : θ ∈ R}, a substantial simplification. Nevertheless, testing related hypotheses remains a challenging task. Our focus will be on a very simple test of forecast dominance that goes without any of the common assumptions. In return, the notion of forecast dominance acquires a central role. Initially, it may be defined as follows [16]. Given a family S = {S_θ : θ ∈ Θ} of (consistent) scoring functions S_θ, we say that forecast x_1 dominates forecast x_2 (with respect to S) if S_θ(x_1, G) ≤ S_θ(x_2, G) for every θ ∈ Θ. In such a one-step scenario, the set of all G ∈ G satisfying this condition could constitute the hypothesis 'x_1 dominates x_2' (with respect to S).
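As a quick sanity check of the mean-functional case, the following sketch (pure Python; all names are our own) evaluates the elementary score S_θ(x, y) = |y − θ| for min(x, y) ≤ θ < max(x, y), and 0 otherwise, and verifies numerically that the Lebesgue mixture ∫ S_θ(x, y) dθ recovers the squared error (x − y)² up to the factor 1/2, a concrete instance of the Choquet representation:

```python
def elementary_score_mean(theta, x, y):
    """Elementary score for the mean functional, cf. (1.1):
    |y - theta| if min(x, y) <= theta < max(x, y), else 0."""
    return abs(y - theta) if min(x, y) <= theta < max(x, y) else 0.0

def mixture_score(x, y, lo=-10.0, hi=10.0, cells=200000):
    """Midpoint-rule approximation of the Lebesgue mixture
    integral of S_theta(x, y) over theta."""
    h = (hi - lo) / cells
    return h * sum(elementary_score_mean(lo + (j + 0.5) * h, x, y)
                   for j in range(cells))

x, y = 1.3, 2.7
print(mixture_score(x, y), 0.5 * (x - y) ** 2)  # both close to 0.98
```

Choosing a measure M other than the Lebesgue measure produces other consistent scores for the mean, which is precisely why forecast rankings can differ across scores.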
The question of how to formulate suitable dominance hypotheses becomes substantially more involved when, as is usually the case, forecasts x_{k1}, x_{k2} are produced step by step and the realizations y_k become known before the next forecast instance. Current work on forecast evaluation and comparison emphasizes the joint dynamic behavior of forecasts and realizations, by using martingale methods [34,51], the concept of prediction spaces [23,52], or comparisons of conditional predictive ability [20]. Bootstrap tests of dominance hypotheses are presented in, e.g., [28,38,39,56], based on different dominance concepts and asymptotic frameworks. An account of the related literature addressing the relations with, and differences to, the present approach is given in the discussion in Section 7.
Usually, mathematical analyses proceed from statistical models for the data and the formulation of hypotheses to related tests and their properties. Here we follow a reverse path. We make no assumptions about possible data generating mechanisms; instead we focus on a simple-to-implement test procedure and ask for hypotheses for which this procedure represents a valid test (asymptotically, at a given level). We take this route because quite often very little is known about the stochastic nature of the data. In fact, typical forecasting problems have to cope with complex statistical dependencies, structural change, and limited domain knowledge. Thus, presumably, most of the usual assumptions do not apply, with largely unknown consequences, and are hard or impossible to check. We therefore have recourse to the classical Fisherian technique of external randomization, which is completely under one's control, and treat everything conditionally on the data (x_{k1}, x_{k2}, y_k), k = 1, . . . , n.
The use of external randomization to compare forecast performance dates back at least to [14]. Here we compare forecast performance across families of scores, rather than with respect to a single scoring function. Concretely, our goal is to elaborate on the sign randomization procedure tentatively proposed in [16, end of Section 3] for testing forecast dominance. The idea is to reject the hypothesis 'forecast 1 dominates forecast 2' if, e.g., sup_{θ∈Θ} D_n(θ) exceeds some critical value c_n, where D_n(θ) = n^{−1/2} Σ_{k=1}^n d_k(θ) is the sum of the single score differences d_k(θ) = S_θ(x_{k1}, y_k) − S_θ(x_{k2}, y_k), properly scaled for the sake of asymptotics. Unfortunately, the determination of c_n generally is very difficult even asymptotically; it appears impossible without making assumptions about the stochastic structure of the data. Our suggestion in [16] was to determine c_n such that Pr*[sup_{θ∈Θ} D*_n(θ) > c_n] ≈ α, where D*_n(θ) = n^{−1/2} Σ_{k=1}^n σ_k d_k(θ) and Pr* exclusively refers to the i.i.d. ("Rademacher") random variables σ_k assuming the values ±1 with probability 1/2 each. This clearly raises questions. First, how can the randomization distribution be connected to the distribution of the test statistic, particularly when no model assumptions are being made? Secondly, what precisely is to be understood by the hypothesis 'forecast 1 dominates forecast 2'? As explained in Section 3.2, there is in fact a close connection between the two problems that helps to get around both, up to one missing link: the approximate validity of our one-sided tests depends on an unproven variant of the celebrated Anderson's inequality [2]. While for symmetric hypotheses postulating 'no difference in predictive performance' the classical Anderson's inequality provides the necessary link, the asymmetry of dominance hypotheses requires a one-sided version of the inequality, which we state as a conjecture that appears of independent interest.
Obviously, dispensing with model assumptions cannot mean doing without any assumptions. However, as detailed in Section 4.1, assumptions distantly related to stationarity and (in-)dependence properties of forecasts and observations will enter in a very indirect manner only, via basic asymptotic stability and "moderate local clustering" conditions, respectively, which are fulfilled under virtually any of the standard statistical models; cf. Section 9. Building on this novel asymptotic framework we present, in Section 4, weak convergence results governing the asymptotics of our test statistics in the important special cases of quantile and expectile forecasts. For the overall organization of the paper see the table of contents. R [46] code implementing the randomization test is available at https://github.com/FK83/fdtest.

Formal setup
Let (x_{k1}, x_{k2}, y_k), k = 1, . . . , n, be a sequence of n triplets where x_{k1}, x_{k2} are two point forecasts, each for the subsequent observation y_k. The triplets are considered as random variables on a common probability space (Ω, F, Q) endowed with a filtration {F_k, k = 0, . . . , n} such that (x_{k1}, x_{k2}, y_k) is F_k-measurable for every k, and F_0 is trivial. Given a family S = {S_θ, θ ∈ Θ} of scoring functions S_θ, we compare the two forecasts via the suitably normalized average difference of their S_θ scores, i.e., we are interested in the stochastic process

D_n(θ) = n^{−1/2} Σ_{k=1}^n d_k(θ), θ ∈ Θ, where d_k(θ) = S_θ(x_{k1}, y_k) − S_θ(x_{k2}, y_k).

Initially, the family S may be arbitrary; from Section 4 on it will be specialized to the elementary scores for quantile and expectile forecasts.

Notions of forecast dominance
Which notion of forecast dominance fits with the randomization test in our focus? One possibility to introduce forecast dominance in the present framework is to declare forecast 1 as weakly dominating forecast 2 at Q (with respect to S) if E_Q D_n(θ) ≤ 0 for every θ ∈ Θ. The same condition furnishes a natural one-sided hypothesis in a testing context:

H_w−: E_Q D_n(θ) = n^{−1/2} Σ_{k=1}^n E_Q d_k(θ) ≤ 0 for every θ ∈ Θ. (2.2)

In fact, H_w− stands for all probability measures Q under which sup_θ E_Q D_n(θ) ≤ 0. Note that the hypothesis does not depend on the scaling constant, which could also be taken as n^{−1}, or the constant 1. The present choice, n^{−1/2}, is simply for convenience in regard to the large sample asymptotics to be discussed later on.
The hypothesis H_w− involves unconditional expectations referring to both the observations y_k and the forecasts x_{kℓ}. This form of H_w− is at odds with the dominance concept of Section 1, which is sequential in nature and makes no assumption about the dynamics of the forecasts, hence is more flexible in this sense. Better accordance with this initial concept is achieved on replacing the unconditional expectations E_Q d_k(θ) in (2.2) by conditional expectations given the past. Writing M_{n,Q}(θ) = n^{−1/2} Σ_{k=1}^n E_Q[d_k(θ) | F_{k−1}], this leads to the following definition of forecast dominance: we say that forecast 1 dominates forecast 2 at Q if

H−: M_{n,Q}(θ) ≤ 0 Q-a.s. for every θ ∈ Θ. (2.4)

Hypothesis H− is defined by conditions on a random process rather than on a parameter, which is unusual. Nevertheless, it makes for a useful notion of forecast dominance, and it will prove to be the appropriate hypothesis for the randomization test. We also will consider sub-hypotheses of H− including, specifically, the hypothesis

H_s−: E_Q[d_k(θ) | F_{k−1}] ≤ 0 Q-a.s. for every k ≤ n and θ ∈ Θ.

The interpretation of H_s− is straightforward: it says that forecast 1 is at least as good as forecast 2 at each time step. In [20], this dominance concept is referred to as a comparison of conditional predictive ability. In this parlance, the hypotheses H− and H_s− express superiority of forecast 1 over forecast 2 in terms of, respectively, average and step-by-step conditional predictive ability.

Example 2.1. For illustration we consider the case where the forecasters know one part each of the verifying observation. Specifically, let y_k = η_{k1} + η_{k2} where η_{kℓ}, k ≥ 1, ℓ = 1, 2, are two independent autoregressive processes of the form η_{kℓ} = a η_{k−1,ℓ} + ε_{kℓ} with the same parameter a and independent innovations ε_{kℓ} ∼ N(0, τ_ℓ²). Suppose that at any instance k, forecaster 1 has access to η_{k1} and the preceding value η_{k−1,2} of the second process. If a is known, a natural choice for the prediction of y_k is x_{k1} = η_{k1} + a η_{k−1,2}.
By definition, x_{k1} = a y_{k−1} + ε_{k1}, and if forecaster 2's prediction similarly is x_{k2} = η_{k2} + a η_{k−1,1}, then x_{k2} = a y_{k−1} + ε_{k2}, and of course, y_k = a y_{k−1} + ε_{k1} + ε_{k2}. Taking at first squared error as the scoring function, the k-th score difference becomes d_k = (y_k − x_{k1})² − (y_k − x_{k2})² = ε_{k2}² − ε_{k1}². Thus if τ_1 ≥ τ_2, say, and the innovations ε_{kℓ} are independent of F_{k−1} (as in the common case where F_k is the σ-algebra generated by all triplets (x_{j1}, x_{j2}, y_j), j ≤ k), then E[d_k | F_{k−1}] = τ_2² − τ_1² ≤ 0, consistent with the intuition that the forecaster having access to the more variable component should be better off. The case τ_1 = τ_2 is an instance of a situation where, a priori, none of the two forecasters is believed to outperform the other. Here E[d_k | F_{k−1}] = 0, which holds in fact for every scoring function S within the present model: by symmetry the joint conditional distributions of (x_{kj}, y_k) = (a y_{k−1} + ε_{kj}, a y_{k−1} + ε_{k1} + ε_{k2}) given F_{k−1} are identical (j = 1, 2), whence the conditional expectation of the score difference vanishes. Cf. also the discussion in [32] of dominance conditions based on the concept of convex order.
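The computation in the example is easy to check by simulation. The sketch below (pure Python; the parameter values a = 0.4, τ_1 = 1, τ_2 = 0.7 are illustrative choices of ours) reproduces d_k = ε_{k2}² − ε_{k1}², whose average should be close to τ_2² − τ_1²:

```python
import random

def simulate_example(n, a=0.4, tau1=1.0, tau2=0.7, seed=1):
    """Example 2.1: y_k = eta_k1 + eta_k2, two independent AR(1)
    components; each forecaster observes the own component plus the
    lagged other one. Returns the average squared-error score difference."""
    rng = random.Random(seed)
    eta1 = eta2 = 0.0
    total = 0.0
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, tau1), rng.gauss(0.0, tau2)
        x1 = (a * eta1 + e1) + a * eta2   # eta_k1 + a * eta_{k-1,2}
        x2 = (a * eta2 + e2) + a * eta1   # eta_k2 + a * eta_{k-1,1}
        eta1, eta2 = a * eta1 + e1, a * eta2 + e2
        y = eta1 + eta2
        total += (y - x1) ** 2 - (y - x2) ** 2   # = e2^2 - e1^2
    return total / n

# Conditional mean of d_k is tau2^2 - tau1^2 = -0.51 for these values:
print(simulate_example(50000))
```

With τ_1 > τ_2 the average comes out negative, in line with H_s− holding in favor of forecaster 1.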

A fictitious test
For testing the one-sided hypothesis H− (or H_s−) it is natural to work with test functionals T = T(D_n) that are monotone in the sense that T(f) ≤ T(g) whenever f ≤ g pointwise on Θ, an interval in R from now on. Main examples are the supremum functional T_∞(f) = sup_{θ∈Θ} f(θ) and integrals of the positive part of D_n, e.g.

T_p(f) = ∫_Θ f(θ)_+^p dθ, p = 1, 2. (2.6)
These functionals are convex in f, and accordingly, the generic test functional T is supposed to be monotone and convex in the real functions f on Θ. Importantly for the following, in tests of H− or H_s− based on such a functional T it suffices to control the error of the first kind at the "boundary" of the hypothesis, where M_{n,Q} vanishes identically. Indeed, let

D̃_n(θ) = D_n(θ) − M_{n,Q}(θ), θ ∈ Θ,

denote the conditionally centered version of D_n. Then D_n = D̃_n + M_{n,Q} ≤ D̃_n under H−, so by monotonicity T(D_n) ≤ T(D̃_n), and consequently

Pr_Q[T(D_n) > c_n] ≤ Pr_Q[T(D̃_n) > c_n] (2.7)

for every Q ∈ H−, as claimed. Thus, critical values c_n such that Pr_Q[T(D̃_n) > c_n] ≈ α would give us an approximate level-α test of H−. For compactness of notation we henceforth drop the index Q, taking the dependence on the underlying probability measure as self-evident. An initial step toward the determination of critical values is the following proposition, for which we need the Lindeberg condition

(A0) n^{−1} Σ_{k≤n} E[ d_k(θ)² 1{|d_k(θ)| > ε n^{1/2}} ] → 0 for every θ ∈ Θ and ε > 0.

Proposition 2.2. Suppose there is some non-random function γ̃ such that γ̃(θ, θ) > 0, θ ∈ Θ, and for every pair (θ, θ′),

γ̃_n(θ, θ′) := n^{−1} Σ_{k≤n} d̃_k(θ) d̃_k(θ′) →_p γ̃(θ, θ′), where d̃_k(θ) = d_k(θ) − E[d_k(θ) | F_{k−1}]. (2.8)

Then under (A0), the finite-dimensional distributions of the process D̃_n converge to those of a mean zero Gaussian process Z̃ with covariance γ̃.
The proposition suggests that for large n the distribution of the test statistic T(D_n) at the boundary, where D_n = D̃_n, can be approximated by the distribution of the functional T(Z̃) on the paths of the Gaussian process Z̃. Of course, convergence of the finite-dimensional distributions is insufficient for such a conclusion; tightness of the processes D̃_n in a suitable function space is required, too. Furthermore, the distribution of T(Z̃) generally is unknown and may be difficult to determine. And there still is the problem that the process D̃_n involves the (sum of the) conditional expectations E[d_k(θ) | F_{k−1}], which depend on the unknown probability Q and would have to be estimated with sufficient accuracy. In view of these difficulties with the determination of proper critical values c_n we refer to the hypothetical test rejecting H− if T(D_n) > c_n as the "fictitious test." For the sake of exposition we introduce further sub-hypotheses of H− besides H_s−, namely the null-hypothesis H_0 of equal performance on average,

M_{n,Q}(θ) = 0 Q-a.s. for every θ ∈ Θ, (2.9)

and the null-hypothesis H_s0 of equal performance at every forecast instance,

E_Q[d_k(θ) | F_{k−1}] = 0 Q-a.s. for every k ≤ n and θ ∈ Θ.

General idea
Using a standard randomization device (cf. Section 7), we let σ_1, σ_2, . . . be i.i.d. such that σ_k = ±1 with probability 1/2 each, and define

D*_n(θ) = n^{−1/2} Σ_{k=1}^n σ_k d_k(θ), θ ∈ Θ.

We reject H− "at level α" if T(D_n) > c*_n, where T is a monotone and convex test functional, and c*_n is determined such that Pr*[T(D*_n) > c*_n] ≈ α. Here Pr* exclusively pertains to the random signs σ_k, the data x_{k1}, x_{k2}, y_k being considered as fixed, non-random quantities. Henceforth we refer to this test as the randomization test. Its rationale is as follows. First, if T is monotone, the argument at (2.7) together with the inclusion H_0 ⊂ H− implies that it suffices to control the error of the first kind across H_0, where there is no systematic difference between the two forecasts. Secondly, if there is no difference in predictive performance between forecasts 1 and 2, changing the labels (i.e., the signs of the d_k) should not affect the distribution of the test statistic. The quotes in "at level α" are meant to underline that the test has (approximate) level α only formally; the actual test level might differ. While the randomization test has intuitive appeal and is easy to implement, its properties are less clear. For instance, calculating the critical value c*_n from the randomization distribution tacitly supposes that in the indifference case the distribution of the R^n-valued process θ → (d_1(θ), . . . , d_n(θ)) is invariant under arbitrary sign changes in the n components (same change for all θ), which is an even stronger hypothesis than H_s0. This raises questions concerning the approximate range of validity of the test in asymptotic regimes, where fine distinctions between different hypotheses may become inessential.
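The procedure itself takes only a few lines. The sketch below (pure Python; function and variable names are ours, and the plain grid sum standing in for the integral T_p is a simplification) computes T_p(D_n) and calibrates the critical value from the Rademacher randomization distribution; the R implementation referenced in the introduction is the authoritative version:

```python
import random

def randomization_test(d, alpha=0.05, n_draws=2000, p=1, seed=0):
    """Sign-randomization test sketch. `d` is an n-by-G list of lists,
    d[k][j] = d_k(theta_j), the score differences on a theta grid.
    Rejects 'forecast 1 dominates forecast 2' when T_p(D_n) exceeds the
    (1 - alpha)-quantile of T_p(D_n*) under random Rademacher signs."""
    n = len(d)
    grid = range(len(d[0]))
    scale = n ** -0.5

    def T(curve):
        # positive-part integral T_p, approximated by a grid sum
        return sum(max(v, 0.0) ** p for v in curve)

    D_n = [scale * sum(d[k][j] for k in range(n)) for j in grid]
    stat = T(D_n)
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        D_star = [scale * sum(signs[k] * d[k][j] for k in range(n)) for j in grid]
        draws.append(T(D_star))
    draws.sort()
    crit = draws[int((1 - alpha) * n_draws) - 1]  # empirical (1-alpha)-quantile
    return stat, crit, stat > crit
```

Note that the data enter only through the score differences d_k(θ); no model for their joint distribution is needed at any point.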
Initial answers will be obtained through (partially) heuristic reasoning, utilizing relationships between the randomization and the fictitious test. In Section 4 these considerations are complemented by rigorous weak convergence results for the case of quantile and expectile forecasts, which validate the heuristics.

Test validity: Heuristics, and a conjecture
In part a) of the following proposition it is understood that chance enters in two ways: via the random signs σ_k, and via the statistical nature of the data triplets. In part b) we condition on the data, leaving the σ_k as the sole source of randomness.

Proposition 3.1. a) Suppose there is a non-random function γ such that γ(θ, θ) > 0, θ ∈ Θ, and for every pair (θ, θ′),

γ_n(θ, θ′) := n^{−1} Σ_{k≤n} d_k(θ) d_k(θ′) →_p γ(θ, θ′). (3.3)

Then under assumption (A0) the finite-dimensional distributions of the process D*_n converge to those of a mean zero Gaussian process Z with covariance γ. b) The latter conclusion also holds under Pr* (i.e., conditionally on the data) provided that the stochastic convergence (3.3) is replaced by the usual (deterministic) convergence, and the Lindeberg condition (A0) is satisfied without the expectation sign.
Remark 3.1. Regarding part b), note that under Pr* the d_k(θ) are known non-random quantities, rendering the expectation sign void. On the other hand, if in (A0) d_k(θ) is everywhere replaced by d*_k(θ) = σ_k d_k(θ), and E by the expectation E* pertaining to the σ_k only, then the resulting condition (A0*) is a Lindeberg condition in the usual sense. Anyway, since |d*_k(θ)| = |d_k(θ)|, there is no difference between the conditions with and without the expectation sign, and we need not distinguish (A0) and (A0*).
We now address the question for which among the above hypotheses the randomization test is approximately valid, i.e., has size α. The discussion builds on distributional approximations to be established later on and on an unproven conjecture. The argument still is instructive as it helps delineate the key problem. Let us re-emphasize that in regard to testing for forecast dominance our focus is on the one-sided hypothesis H − ; cf. Footnote 5.
Hypothesis H_s0.
Under this hypothesis the conditional expectations E[d_k(θ) | F_{k−1}] vanish, hence D̃_n = D_n and γ̃_n = γ_n. Consequently, the limit processes Z and Z̃ of D*_n and D̃_n are identical in distribution under H_s0, and so are the limit distributions of any sufficiently regular test statistic (which need not be monotone, for instance). In particular, the critical values c*_n and c_n of the randomization and the fictitious test coincide asymptotically. Therefore, since the latter test is, for large n, approximately valid for testing H_s0 at level α, then so is the former. The point is, of course, that the fictitious test is valid but infeasible, whereas the randomization test is straightforward to implement.

Hypothesis H_0.
Under this hypothesis the above reasoning does not apply because the covariance functions γ̃ and γ, hence the limit processes Z̃ and Z, generally are different. Nevertheless, the randomization test remains approximately valid for the hypothesis H_0 if the test functional T fits with H_0. To substantiate this claim, let us begin by noting that γ = γ̃ + ψ with a positive definite function ψ given by the stochastic limit

ψ(θ, θ′) = lim_n n^{−1} Σ_{k≤n} E[d_k(θ) | F_{k−1}] E[d_k(θ′) | F_{k−1}];

cf. Lemma 9.1. For the limit processes this means that Z = Z̃ + W in distribution, where W is an independent centered Gaussian process with covariance ψ. Now in the context of the hypotheses H_0, H_s0 it is natural to consider test functionals T that are symmetric, T(−f) = T(f), and convex in f. In other words, the acceptance region A = {f : T(f) ≤ c} is symmetric and convex. This allows us to control the error probability under H_0 by applying a celebrated inequality. A basic finite-dimensional version of the inequality is as follows.

Anderson's inequality
If X is a centered Gaussian random vector in R^d and A ⊂ R^d is convex and symmetric about the origin, then Pr[X + y ∈ A] ≤ Pr[X ∈ A] for every y ∈ R^d; consequently, Pr[X + Y ∈ A] ≤ Pr[X ∈ A] for every random vector Y independent of X [2]. In our case X and Y correspond to Z̃ and W sampled discretely at d points θ_1 < · · · < θ_d. As the sampling gets dense, one finds that under H_0 and for symmetric, convex test functionals T one has

Pr[T(D_n) > c*_n] = Pr[T(D̃_n) > c*_n] ≈ Pr[T(Z̃) > c*_n] ≤ Pr[T(Z̃ + W) > c*_n] = Pr[T(Z) > c*_n] ≈ Pr*[T(D*_n) > c*_n] ≈ α, (3.5)

where the first equality, labeled (1) in the following, holds because M_{n,Q} vanishes under H_0, and the inequality, labeled (2), follows from Anderson's inequality. For the one-sided hypotheses H−, H_s− the situation is different: the test functionals T = T(f) appropriate for these hypotheses are convex and monotone in f. The latter property is incompatible with symmetry, which is an essential ingredient of Anderson's inequality. We nevertheless could argue similarly as above if there was a one-sided version of Anderson's inequality. The following would be most helpful.

Conjecture (one-sided Anderson inequality). There exists a universal constant α_0 > 0 such that if X and Y are independent centered Gaussian random vectors in R^d, and A ⊂ R^d is a convex acceptance region that is monotone (i.e., x ∈ A and x′ ≤ x componentwise imply x′ ∈ A) with Pr[X + Y ∉ A] ≤ α_0, then Pr[X + Y ∈ A] ≤ Pr[X ∈ A], irrespective of the dimension d and the covariance matrices involved.

Remark 3.2.
An important consequence of this inequality is that for test levels α ≤ α_0 the randomization test is (approximately) valid for testing the one-sided hypotheses H−, H_s−. The argument is parallel to (3.5), with two modifications: first, since M_n ≤ 0 under H−, H_s−, instead of the equality (1) we have an inequality, '≤', by the monotonicity of T; secondly, relation (2) now follows from the one-sided Anderson inequality, again up to an approximation as in [2, Proof of Corollary 4]. Remark 3.3. The one-sided Anderson's inequality is not needed for the approximate validity of the one-sided randomization test if we only consider (sequences of) probability measures Q_n ∈ H− that are contiguous [54, p. 87] to some sequence P_n ∈ H_s0. This is because under P_n we have γ̃_n = γ_n, hence γ_n →_p γ̃ = γ, and by contiguity this convergence also takes place under Q_n; cf. Propositions 2.2, 3.1. Therefore the distributions of the limit processes Z̃ and Z coincide, and the inequality (2) in (3.5) becomes an equality; while the '=' sign (1) there has to be replaced by '≤', again by the monotonicity of T. Thus in this case too, Pr_{Q_n}[T(D_n) > c*_n] ≲ α, as claimed.

Summary. Asymptotically, the randomization test is an (approximately) valid level-α test of the hypotheses H_s0 and H_0 when the test functional T is symmetric and convex, and of the one-sided hypotheses H− and H_s− when T is monotone and convex, the latter granted the conjectured one-sided Anderson inequality and a test level α ≤ α_0.
It should be emphasized that the above discussion exclusively pertains to the control of the error of the first kind. Regarding power we only mention that the argument in Remark 3.3 may as well be applied to alternatives Q_n ∈ H_+ (satisfying M_{n,Q_n}(θ) ≥ 0 Q_n-a.s. for all θ ∈ Θ; cf. (2.4)) that are contiguous to a sequence P_n ∈ H_s0. By monotonicity of T one analogously gets Pr_{Q_n}[T(D_n) > c*_n] ≳ α, i.e., the randomization test is unbiased against such alternatives. For alternatives Q_n not in H_+, where θ → M_{n,Q_n}(θ) assumes positive as well as negative values, the power issue is subtle and will not be pursued systematically in this paper.

Some comments on the conjectured inequality
In dimension d = 1, the inequality is trivial: convex, monotone acceptance regions then are intervals of the form (−∞, c]. For d > 1, a small piece of evidence in favour of the conjecture can be given as follows. Let G and G_+ denote the centered Gaussian laws of Z̃ and Z = Z̃ + W sampled at points θ_1, . . . , θ_d, with covariance matrices V and V_+ and with the θ_j becoming dense. Let A ⊂ R^d be a convex, monotone acceptance region, let K = A ∩ (−A), which intersection is convex and symmetric, and let R denote the complement of the union A ∪ (−A). Then for any symmetric probability distribution F on R^d,

F(A) = (1 + F(K) − F(R)) / 2.

For the moment being, suppose that R is contained in the set S where the density of G_+ exceeds the density of G. Then G_+(R) ≥ G(R), and an application of Anderson's inequality to the set K, G_+(K) ≤ G(K), yields the desired conclusion, G_+(A) ≤ G(A). As for the possible inclusion R ⊂ S, note that in terms of the log densities, S = {z ∈ R^d : z⊤(V^{−1} − V_+^{−1}) z ≥ log det(V_+ V^{−1})}. Since the covariance function γ̃ of Z̃ generally is unknown, we have no control on the eigenvalues of V^{−1} − V_+^{−1}. Proposition 3.2 thus does not guarantee that α_0 stays bounded away from zero uniformly in the pair V, V_+ and all dimensions d, as is necessary for the one-sided Anderson inequality. This uniformity is the core of the problem.
A proof of the conjecture may require additional assumptions, e.g. invariance under coordinate permutations (generalizations involving other invariance conditions appear in [11,41]), or a restriction to the acceptance regions induced by the functionals T_∞, T_1, T_2, which correspond to test functionals of major interest; cf. (2.6). A proof for such a special case would already be most worthwhile.
Thus far, our numerical experiments in the bivariate case d = 2 and simulations with randomly generated covariance matrices for d > 2 have yielded no counterexample. Needless to say, this does not prove the conjecture.
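A Monte Carlo check of the conjectured direction is easy to set up. The sketch below (pure Python; the covariance values and the threshold are illustrative choices of ours, not taken from the experiments mentioned above) uses the acceptance region induced by T_∞ in d = 2, i.e. A = {z : max(z_1, z_2) ≤ c}, and estimates the two tail probabilities; the conjecture asserts that the noisy process Z = Z̃ + W has the larger upper-tail probability in the small-α regime:

```python
import math
import random

def mvn2(rng, cov):
    """One draw from a centered bivariate normal with covariance
    (v11, v12, v22), via a hand-rolled 2x2 Cholesky factor."""
    v11, v12, v22 = cov
    l11 = math.sqrt(v11)
    l21 = v12 / l11
    l22 = math.sqrt(v22 - l21 ** 2)
    g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
    return l11 * g1, l21 * g1 + l22 * g2

def tail_prob(cov, c, noise=None, n=50000, seed=7):
    """Estimate Pr[max(Z_1, Z_2) > c], i.e. the rejection probability of
    the monotone convex acceptance region {max <= c}, optionally after
    adding independent Gaussian noise with covariance `noise`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        z1, z2 = mvn2(rng, cov)
        if noise is not None:
            w1, w2 = mvn2(rng, noise)
            z1, z2 = z1 + w1, z2 + w2
        if max(z1, z2) > c:
            hits += 1
    return hits / n

V = (1.0, 0.3, 1.0)    # covariance of the 'clean' process (illustrative)
Psi = (0.5, 0.1, 0.5)  # covariance of the independent noise W (illustrative)
c = 2.0                # upper-tail threshold, small-alpha regime
print(tail_prob(V, c), tail_prob(V, c, noise=Psi, seed=8))
```

For T_∞ with independent components the conjectured direction can also be verified analytically for positive thresholds, which illustrates why the restriction α ≤ α_0 is needed: for thresholds deep in the negative range the inequality reverses.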

Asymptotics for quantile and expectile forecasts
In principle, the developments so far apply to largely arbitrary functionals φ on the class G of predictive distributions G and related families of consistent scoring functions S_θ. Hereafter we focus on functionals representing a quantile or an expectile. Given α ∈ (0, 1), the α-expectile of G is defined as the unique solution t to the equation

α E_G (Y − t)_+ = (1 − α) E_G (t − Y)_+

[42]; for α = 1/2 one obtains the mean value functional. As usual, q = inf{y : G(y) ≥ α} is the (lower) α-quantile of G, which here is identified with its right-continuous CDF. The median of G obtains when α = 1/2.
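For a finite sample the defining equation has a unique root that can be located by bisection, since the difference of its two sides is monotone in t. A small sketch (pure Python; names are ours), taking G to be the empirical distribution of a list of values:

```python
def expectile(ys, alpha, tol=1e-10):
    """Sample alpha-expectile: the unique root t of
    alpha * sum((y - t)_+) - (1 - alpha) * sum((t - y)_+) = 0,
    found by bisection (the left-hand side is decreasing in t)."""
    def g(t):
        up = sum(max(y - t, 0.0) for y in ys)
        down = sum(max(t - y, 0.0) for y in ys)
        return alpha * up - (1 - alpha) * down
    lo, hi = min(ys), max(ys)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(expectile([1.0, 2.0, 3.0, 4.0], 0.5))  # alpha = 1/2 gives the mean, 2.5
```

Like quantiles, expectiles are monotone in the level α, which is consistent with the linear indexing of the elementary scores below.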
As mentioned in Section 1, forecast dominance with respect to all consistent scoring functions is, for these functionals, equivalent to dominance with respect to a certain linearly indexed family of "elementary" scoring functions S_θ, such that any consistent scoring function can be written as

S(x, y) = ∫ S_θ(x, y) dM(θ) (4.1)

with a nonnegative measure M [16]. For α-quantiles, setting M in (4.1) equal to the Lebesgue measure results in the popular 'piecewise linear' score considered in quantile regression [30]. For α-expectiles, again, the most popular choice of M is the Lebesgue measure, in which case S(x, y) becomes the asymmetric squared error considered in [42].

Conditional weak convergence of D*_n
The purpose of this section is to establish the approximation Pr*[T(D*_n) > c*_n] ≈ Pr[T(Z) > c*_n] figuring in the display (3.5) that is central to our argument. The asymptotics involves conditioning on the data x_{k1}, x_{k2}, y_k, so that the sign variables σ_k form the only source of randomness. We thus avoid having to make assumptions about the stochastic structure of the data.
Of basic importance are the second (cross-)moments of the process D*_n, and the continuity moduli of the empirical distributions G_n, F_{n1}, F_{n2} of the observations y_k and the forecasts x_{k1}, x_{k2}, respectively. Put m_k = |y_k − x_{k1}| ∨ |y_k − x_{k2}|, and let

H_n = G_n + F_{n1} + F_{n2} for quantiles, H_n = F_{n1} + F_{n2} for expectiles.

Assumptions.
(C4) There exist numbers ν > 0, A > 0, and n_1 ≥ 1 such that

Discussion of the assumptions. In practice, the test functionals in (2.6) may be evaluated over a bounded interval Θ instead of the full θ-domain R, and the score difference processes restricted correspondingly. Of course, forecast dominance then holds only with respect to all mixtures S = ∫ S_θ dM(θ) of scores S_θ with a measure M supported by Θ, which would often be considered sufficient. Assumption (C1) is a basic asymptotic stability requirement that would hold 'in probability' under virtually any standard statistical model. However, as in part b) of Proposition 3.1 there is no probability governing the data, hence no convergence in probability (see also Remark 3.1).
The uniform Hölder condition assumed in (C2) requires that the data x_{k1}, x_{k2}, y_k are well dispersed and do not heavily cluster locally. The lower bound r ≥ β_n on the width of the increments is unavoidable because of the (asymptotically small) jumps of the empirical CDFs.
Assumption (C3) only matters for expectiles. To substantiate it, one may argue that reasonable forecasts should covary with the observations, which would limit the deflections of the quantities m_k. (C3) is stronger than boundedness on average of the m_k², which appears as the minimal condition to impose. In return, it implies a Lindeberg type condition holding uniformly in θ. Assumption (C4) restrains the large fluctuations of the forecasts x_{k1}, x_{k2} and allows us to control the tail behavior of the functions θ → E D*_n(θ)². Altogether, the assumptions appear weak as well as natural for the quantile and expectile functionals and for continuously distributed data. They only pertain to quantities computable from the data and do not presuppose any statistical model. On the other hand, if a probabilistic model is assumed, they hold with arbitrarily high probability in many of the customary settings. See the corresponding discussion in Section 9, where (C2), (C4) are verified under conventional stationarity assumptions.
Hereafter, ℓ∞_0 denotes the space of all bounded measurable functions on R vanishing at infinity, equipped with the sup-norm [54]. The theorem states weak convergence under Pr*, i.e., conditionally on the data. A fortiori, this convergence holds unconditionally as long as the limit covariance function γ, hence the limit process Z, is unique. See Proposition 3.1 for a related discussion.
As a consequence of the theorem, T(D*_n) converges weakly in distribution to T(Z) for any continuous functional T on the space ℓ∞_0. This covers the supremum statistic T_∞(f) = sup_{θ∈R} f(θ) as one special case of interest. Other examples such as the integral type functionals T_p(f) = ∫_R f(θ)_+^p dθ (p = 1, 2) require a sharpening of assumption (C4) for the control of the tail masses. The conditions on ν can be dropped if the integration in the functionals T_1, T_2 is restricted to a finite interval Θ ⊂ R (cf. (2.6)). The simplification occurs because, in this case, the functionals are continuous on the space ℓ∞(Θ) of all bounded measurable functions on Θ endowed with the sup-norm, and because weak convergence in ℓ∞_0 implies weak convergence in ℓ∞(Θ).

Weak convergence of D̃_n
Here the focus is on the approximation Pr[T(D̃_n) > c*_n] ≈ Pr[T(Z̃) > c*_n] in (3.5), whose verification completes our proof of the validity of the randomization test (apart from the conjecture). The setting is unconditional, so 'Pr' refers to some underlying probability measure governing the joint stochastic behavior of the data triplets. For simplicity we only deal with weak convergence on finite intervals Θ, that is, in ℓ∞(Θ).
The necessary distinction between the quantile and the expectile case is a bit tedious. We denote the sequentially conditioned versions of the empirical data distributions as

G^c_n(J) = n^{−1} Σ_{k≤n} Q[y_k ∈ J | F_{k−1}], F^c_{nℓ}(J) = n^{−1} Σ_{k≤n} Q[x_{kℓ} ∈ J | F_{k−1}], ℓ = 1, 2,

J an interval, and put as earlier H^c_n = G^c_n + F^c_{n1} + F^c_{n2} in the quantile, and H^c_n = F^c_{n1} + F^c_{n2} in the expectile case. Note that F^c_{nℓ} = F_{nℓ} in the common case of forecasts x_{kℓ} that are F_{k−1}-measurable. The following assumptions are similar to those in the previous section, except that convergence is 'in probability' and expectations are being taken at the appropriate places. A justification of assumption (A2) is given in Section 9.

Assumptions.
(A1) (2.8) holds: there exists a function γ̃ such that γ̃(θ, θ) > 0 for all θ, and γ̃_n(θ, θ′) →_p γ̃(θ, θ′) for every pair (θ, θ′).

Given any b > 1 there exists a number p ≥ 1 such that (A2) and (A3) hold:

(A2) There are numbers B > 0, n_2 ≥ 1 and a sequence β_n → 0 such that for both K_n = H_n and K_n = H^c_n

(A3) sup_n n^{−1} Σ_{k≤n} E m_k^{4p} < ∞ (only for expectiles).

As before we consider a continuous version of D̃_n obtained by linear interpolation on the grid {jβ_n : j ∈ Z}, β_n as in Assumption (A2), which we again denote as D̃_n. Accordingly, the approximation to be established becomes Pr[T(D̃_n) > c*_n] ≈ Pr[T(Z̃) > c*_n]. For bounded Θ the functionals T_∞, T_1, T_2, and in fact all T_p, 1 ≤ p ≤ ∞ (see (2.6)), are continuous on ℓ∞(Θ). Therefore the desired approximation is immediate from the following.

Monte Carlo simulations
Here we study the randomization test in finite-sample scenarios involving mean (i.e., expectile) and quantile forecasts. The test statistics under examination are the positive-part integrals T_p(D_n) = ∫ D_n(θ)_+^p dθ, p = 1, 2, considered as tests of the hypothesis H^s_− saying that method 1 dominates method 2 at each time step. All simulation results are based on 1 000 Monte Carlo iterations.
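To fix ideas, the mechanics of the sign-randomization test with a positive-part integral statistic can be sketched as follows. This is not the authors' code: the grid discretization, the function names, and the +1 p-value correction are our own choices; the input is the matrix of score differences d_k(θ) evaluated on a finite θ-grid.

```python
import numpy as np

def t_p(process, dtheta, p=1):
    """Positive-part integral statistic T_p(f) = ∫ f(θ)_+^p dθ on a grid."""
    return np.sum(np.maximum(process, 0.0) ** p) * dtheta

def randomization_test(d, dtheta, p=1, n_rand=1000, seed=0):
    """Sign-randomization test of H^s_-: method 1 dominates method 2.

    d : (n, m) array of score differences d_k(θ) = S(x_k1, y_k) - S(x_k2, y_k)
        over a grid of m thresholds θ; under H^s_-, E[d_k(θ)] <= 0 for all θ.
    Returns the observed statistic T_p(D_n) and a randomization p-value.
    """
    rng = np.random.default_rng(seed)
    n = d.shape[0]
    t_obs = t_p(d.sum(axis=0) / np.sqrt(n), dtheta, p)   # T_p(D_n)
    t_rand = np.empty(n_rand)
    for b in range(n_rand):
        sigma = rng.choice([-1.0, 1.0], size=n)          # random signs sigma_k
        t_rand[b] = t_p(sigma @ d / np.sqrt(n), dtheta, p)
    # randomization p-value (with the customary +1 correction)
    return t_obs, (1 + np.sum(t_rand >= t_obs)) / (1 + n_rand)
```

For score differences with strictly negative mean (the interior of the hypothesis) the positive part of D_n is small and the p-value is large; a positive mean drives the statistic up and the p-value toward zero.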

Mean forecasts
We first present simulation results for the illustrative example from Section 2.2. One of the variances is fixed, τ_1 = 1, while τ_2 is varied. By Proposition 2.1 the hypothesis H^s_− is satisfied if τ_2 ≤ 1, and violated otherwise. We consider samples of n = 200 observations each, which in an economic context is empirically relevant for quarterly time series covering the postwar period. The top panel of Figure 1 summarizes our results for the case where the regression parameter a = 0.4; similar results obtain for other values of a. The figure shows that the performance of the test is quite satisfactory: it comes close to its nominal level of 5% at the boundary of the hypothesis (τ_2 = 1), and it is conservative in the interior (τ_2 < 1), as predicted by the conjectured one-sided Anderson inequality. The part of the figure in which τ_2 > 1 yields evidence on the power of the test. Naturally, we find that the power increases monotonically in τ_2, i.e., clearer violations of the hypothesis imply higher rejection rates. Furthermore, the functional T_1 has slightly higher power than T_2.

Quantile forecasts
We take the observations y_k to follow an AR(1)-GARCH(1,1) process in the spirit of [6], with independent "shocks" ε_k ∼ N(0, 1). The parametrization follows [44, Section 3], and is thereby intended to replicate the empirical features of daily stock returns. Forecasters are asked to state the α = 0.05-quantile of the process, conditional on the information F_{k−1} available up to and including time k − 1.
To devise a simulation model for two imperfect forecasts, let us first conceive of an oracle. If the oracle knew the data generating mechanism and the initial values s_0, ε_0, y_0, she could successively compute s_k from the observations y_j, j < k. Let F_k denote the σ-algebra generated by the variables s_0, ε_0, y_j, j ≤ k. Then (assuming the regression parameters are known) y_k | F_{k−1} ∼ N(m_k, s_k^2), where we write m_k = 0.03 + 0.05 y_{k−1} for convenience. Thus for our oracle, the ideal quantile forecast would be the α-quantile of the conditional distribution of y_k, namely x_{k,ideal} = m_k + s_k z_α, where z_α = Φ^{−1}(α) is the ideal forecast in standard units. This leads us to mimic lack of knowledge and forecast errors by assuming that the issued forecasts are of the form x_{kℓ} = m_k + s_k z_{kℓ} (ℓ = 1, 2), where the z_{kℓ} are random perturbations of z_α that are independent among themselves and of all other variables. Specifically, we assume that the deflections from z_α are Gaussian on the log-odds scale. Intuitively, forecast 1 should be better than forecast 2 if τ_1 < τ_2, since the deflections from the ideal forecast are then smaller for forecast 1. It can indeed be shown that H^s_− holds if and only if τ_1 ≤ τ_2; cf. the end of Section 9. In our simulations (bottom panel of Figure 1), τ_1 = 0.3 is fixed, and τ_2 varies from 0.05 to 0.5. The quantile level is α = 0.05, and the sample size is 2 000. Both choices are in line with the empirical case study in Section 6.2, where we analyze daily financial return data. Again, as in the previous example, the course of the power as a function of τ_2 supports our claim that the randomization test is approximately valid for testing H^s_−.
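The simulation design above can be sketched as follows. Two elements are assumptions on our part: the GARCH(1,1) parameters (omega, a1, b1) are illustrative stand-ins for the values the paper takes from [44, Section 3], and the perturbation z = Φ^{-1}(expit(logit(α) + τ·η)), η ∼ N(0,1), is one plausible reading of "Gaussian on the log-odds scale".

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logit, expit

def simulate_quantile_forecasts(n=2000, alpha=0.05, tau1=0.3, tau2=0.5, seed=0):
    """AR(1)-GARCH(1,1) observations y_k with conditional mean
    m_k = 0.03 + 0.05 y_{k-1}, plus two imperfect quantile forecasts."""
    rng = np.random.default_rng(seed)
    omega, a1, b1 = 0.01, 0.10, 0.85           # assumed GARCH(1,1) parameters
    y_prev, eps_prev = 0.0, 0.0
    s2 = omega / (1.0 - a1 - b1)               # start at unconditional variance
    y = np.empty(n); x1 = np.empty(n); x2 = np.empty(n)
    for k in range(n):
        s2 = omega + a1 * eps_prev ** 2 + b1 * s2
        s_k = np.sqrt(s2)
        m_k = 0.03 + 0.05 * y_prev
        # imperfect forecasters: perturb z_alpha by Gaussian noise on the
        # log-odds scale (assumed form of the paper's perturbation)
        z1 = norm.ppf(expit(logit(alpha) + tau1 * rng.standard_normal()))
        z2 = norm.ppf(expit(logit(alpha) + tau2 * rng.standard_normal()))
        x1[k] = m_k + s_k * z1
        x2[k] = m_k + s_k * z2
        eps_prev = s_k * rng.standard_normal() # shock s_k * eps_k
        y_prev = y[k] = m_k + eps_prev
    return y, x1, x2
```

By construction, a well-calibrated 5%-quantile forecast is exceeded by y_k about 95% of the time, and the smaller τ keeps the issued forecast closer to the ideal x_{k,ideal}.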

Case studies

Mean forecasts
For a practical application of the randomization test we consider the recession probability forecasts studied in [48], using the updated data set analyzed by [16, Section 4]. The data set covers n = 186 quarterly observations from 1968 to 2014, and two competing forecasting methods: judgmental forecasts from a survey of professional forecasters (SPF), and forecasts from a simple statistical model (Probit). Both forecasts are one quarter ahead and out-of-sample. As shown in [16, Figure 6], the survey-based forecasts attain better elementary expectile scores for most thresholds θ ∈ [0, 1]. We specifically consider two test problems where either the survey forecast or the model-based forecast dominates the respective other one under the maintained hypothesis. The top panel of Table 6.1 summarizes the results, which are based on 1 000 simulated sign randomizations. The hypothesis that Probit dominates SPF is rejected at the one percent significance level. By contrast, there is no evidence against the hypothesis that SPF dominates Probit. These results conform with those of [16], who consider informal (pointwise) confidence intervals. Remarkably, the randomization test here proves powerful enough to yield interpretable conclusions from a relatively small data set.

Quantile forecasts
In a second case study we consider quantile forecasts of daily returns y_k of the Dow Jones Industrial Average (DJIA), using data that is freely available at http://realized.oxford-man.ox.ac.uk/. Quantiles at low levels α are commonly used as measures of financial risk, and are referred to as Value-at-Risk at level α (e.g. [40], Sections 1 and 2). We specifically consider prediction of the five percent quantile of y_k, given information until the previous business day k − 1. In a first specification, denoted QR_RV, the predicted quantile is given by (6.1), where RV_{k−1} is the so-called realized volatility computed from intra-daily data (e.g. [1]). We obtain parameter estimates β̂_0, β̂_1 by quantile regression [30], based on a rolling window of 2 000 observations. Recent evidence [60] suggests that the specification in (6.1) compares favorably to a number of more complicated alternatives. Our second specification (QR_ABS) is analogous to (6.1), except that it employs the lagged absolute return |y_{k−1}| in place of the realized volatility RV_{k−1}. The two specifications are motivated by the fact that realized volatility and absolute returns proxy for the variability of financial returns, which is well known to fluctuate over time (cf. Section 5.2). Both measures should thus be informative about the quantiles of y_k, given F_{k−1}.
The bottom panel of Table 6.1 presents the results of the comparison. We find no evidence against the hypothesis that QR_RV dominates QR_ABS; however, we clearly reject the converse hypothesis that QR_ABS dominates QR_RV. This suggests that the intra-daily information encoded in realized volatility has more predictive content than daily returns. Similar conclusions were reached in [58]. As in the first case study, the results are qualitatively robust across the two test statistics T_1, T_2. In summary, the Monte Carlo simulations and the case studies point to the potential usefulness of the proposed randomization test.
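A minimal sketch of the QR_RV recipe follows, assuming specification (6.1) takes the linear form x_k = β_0 + β_1 RV_{k−1}. The Nelder-Mead pinball-loss minimization stands in for a dedicated quantile-regression solver such as the one of [30], and the default window is shortened for illustration (the paper uses 2 000 observations).

```python
import numpy as np
from scipy.optimize import minimize

def pinball_loss(beta, X, y, alpha):
    """Average check (pinball) loss at level alpha; minimized in population
    by the conditional alpha-quantile."""
    u = y - X @ beta
    return np.mean(np.maximum(alpha * u, (alpha - 1.0) * u))

def rolling_var_forecasts(y, vol, alpha=0.05, window=250):
    """Rolling-window quantile regression of y_k on the lagged volatility
    proxy vol_{k-1} (realized volatility for QR_RV, |y_{k-1}| for QR_ABS)."""
    n = len(y)
    preds = np.full(n, np.nan)
    for k in range(window + 1, n):
        # training pairs (vol_{j-1}, y_j) for j = k-window, ..., k-1
        X = np.column_stack([np.ones(window), vol[k - window - 1:k - 1]])
        yy = y[k - window:k]
        res = minimize(pinball_loss,
                       x0=np.array([np.quantile(yy, alpha), 0.0]),
                       args=(X, yy, alpha), method="Nelder-Mead")
        b0, b1 = res.x
        preds[k] = b0 + b1 * vol[k - 1]        # one-step-ahead VaR forecast
    return preds
```

On data with time-varying scale, the out-of-sample violation rate (fraction of days with y_k below the predicted quantile) should sit near α, and at α = 0.05 the forecasts lie in the lower tail of the return distribution.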

Discussion
We have studied randomization-type tests of hypotheses implying that a quantile or expectile forecast is superior to a competitor, uniformly across all (or a large class of) consistent scoring functions. Variants of this topic have recently gained considerable interest, particularly in the econometrics literature. Tests of dominance relations in quantile and expectile forecasts are studied in [56] using the bootstrap, while in [58] the authors focus on the so-called expected shortfall functional, relying on a combination of pointwise tests and multiple testing corrections. These two papers are closest to the present work in that they base forecast comparisons on consistent scoring functions, arguably the proper concept for this purpose [21], and their mixture representation [16]. We next provide a more detailed discussion of related literature.
Dominance concept. Our dominance concept aims at the direct comparison of forecasts on a purely empirical basis. This is distinct from more theoretically oriented concepts like nested or overlapping information sets [26,32], or the notion that forecast x_1 'encompasses' forecast x_2 [18, Section 17.1]. In a nutshell, forecast encompassing asks whether a user who has access to both forecasts may safely ignore x_2, judging from some given criterion. By contrast, our notion of dominance asks whether a user who must choose between the forecasts would prefer x_1 over x_2, with regard to any admissible criterion. Tests of stochastic dominance are considered in [28,38,39]. Analogously to forecast dominance, stochastic dominance stands for superiority (of some procedure compared to another) that holds uniformly across a whole class of criteria. In stochastic dominance, this class comprises those functions of e, the residuals or forecast errors from some regression-type model, that increase as e moves away from zero or else are convex in e. In forecast dominance as considered here, the relevant criteria are the consistent scoring functions, or some subclass thereof.
Modeling assumptions. Our stance is to avoid assumptions about possible data generating mechanisms as far as possible, on the grounds given in the introduction. For a similar view see [20] and [28, p. 1308 and Sect. 5], where the common stationarity assumption is weakened to distributional heterogeneity. Consequently our framework puts no restrictions on the type of forecasts or their connection with the observations, and it allows great freedom regarding their dynamics. The lack of an explicit statistical model and the need to harmonize the underlying statistics with the surrogate external randomization lead us to state forecast dominance hypotheses in terms of conditional expectations expressing conditional predictive ability, as proposed in [20]. The step-by-step character of the forecast scheme renders this an attractive, natural alternative to the familiar formulations using unconditional expectations; see Example 2.1 or the simulation model in Section 5.2. We note, however, that our hypotheses differ from those of [20]: there the focus is on single-criterion two-sided (equality) testing, whereas in the present paper the focus is on simultaneous one-sided (inequality) testing. In the setup of [20], simple least squares regressions yield attractive tests for the null hypothesis of equal conditional predictive ability that are not available in our case.
Test procedures. The idea of external randomization has long played an important role in statistics (see e.g. [15, Chapter 4.4], as well as [9] on the related idea of permutation), and it is central to our approach. Randomization for the task of forecast comparison was used in, e.g., [13,14], and it has been applied to ensure that a well-defined limiting distribution exists in the first place [4,10,53]. Our test construction is similar in spirit to the 'conditional p-value' approach of [24] and the 'wild bootstrap' proposed in [35]. In the context of functional data, the use of maximum- and integral-type test statistics like ours is standard; for weighted versions cf. [39]. Since the limit distributions are unknown, the determination of critical values requires resorting to resampling methods, customarily various forms of (block [33]) bootstrap as in [28,38,39,56]. While this is often the method of choice, its application in the present one-sided, high-dimensional context is not without problems, partly due to the nonstandard asymptotics of the bootstrap-based tests resulting from degenerate limit processes; see e.g. [28,39,56]. Notably, the Gaussian limit processes in our setting are entirely regular.
Test size control. Intuition and classical test theory suggest that in order to control the error of the first kind, it might suffice to control it on the boundary of the hypothesis. Unfortunately, what constitutes the boundary is subtle, and a focus on least favorable cases is inadequate [25]. For an extensive discussion of these points in a different framework (stochastic dominance) see Linton et al. [39], who emphasize the importance, and difficulty, of uniform control of the test size (see also [28, p. 1320]) and develop a sophisticated bootstrap procedure for this purpose (in the i.i.d. case). Still, even there uniformity is achieved only if certain subsets of the hypothesis are excluded. We are not aware of any fully satisfactory result in this regard; neither is the issue clarified in the present paper. However, our approach suggests a potentially elegant solution at least: if the one-sided Anderson inequality were true, our tests would be valid uniformly on H_−. The discussion in Section 3.2 elaborates on the central role of our corresponding conjecture. A resolution of the issue, whether in the positive or in the negative, would certainly be of great interest. We may point out, however, that independently of the final status of the conjecture, the randomization test does behave properly for probabilities that are contiguous to the strict null hypothesis H^s_0; cf. Remark 3.3.

Proofs
Proof of Proposition 2.1. We may refer to [32, Example 2.3], whereby our earlier proof is much simplified. To see the relevance of Theorem 2.1 of that paper for the elementary scores, note that the mean value is an α = 1/2-expectile, whence eq. (4.3) assumes the form

Proof of Proposition 2.2. Given
. It suffices to show that the distribution of X_n converges to N(0, V). We have X_n = Σ_{k=1}^n X_{kn}, where In order to apply [36, Theorem 2.3] to the martingale difference array {X_{kn}}, we note at first that (2.8) implies by Jensen's inequality, and since m and the c_j are fixed, it suffices to show that n^{−1} times the expectation of the two maxima in curly brackets tends to zero for every j. Let ε > 0. For any θ we have 4.1, 4.2 holds uniformly. Clearly, such restrictions always apply when asymptotics is involved.

Proof of Proposition 3.1.
The proof follows the same lines as the proof of Proposition 2.2. It suffices to replace d_k(θ) by d_k(θ)σ_k, to define F_k as the σ-algebra generated by the random variables σ_1, . . . , σ_k, and to observe that |σ_k| = 1 and

Toward the proofs of Theorem 4.1 and Corollary 4.1, we assume throughout that (C1) to (C4) are fulfilled. We begin with some first consequences of the assumptions. Constants generally depend on whether they refer to quantiles or expectiles, which is indicated by subscripts q, e, respectively. Recall that expectations and probabilities are conditional, and refer to the random signs only. For simplicity we nevertheless use the un-starred symbols E, Pr. Here I(θ, y_k) is the respective identification function. Specifically, for α-quantiles the identification function is I(θ, y) = 1_{y≤θ} − α, whence |d_k(θ)| ≤ |δ_k(θ)|. For α-expectiles, I(θ, y) The second inequality is easily seen to follow from the fact that |δ_k(θ)| equals one if θ lies between x_{k1} and x_{k2}, and is zero otherwise. This observation also shows that ∫ δ_k(θ)^2 dθ = |x_{k1} − x_{k2}| ≤ 2 m_k, whence by Hölder's inequality and (C3) where s = 0 and s = 2 for quantiles and expectiles, respectively, which is (i). Similarly, if |θ| ≥ 1, n ≥ n_1, using (C4) we get for quantiles while for expectiles, (8.7), (C3), and Cauchy-Schwarz give which settles (ii). As for the increments, let θ_1 < θ_2 and put δ_k(θ_1, θ_2) With either i = 1, j = 2 or i = 2, j = 1, whichever is more convenient, we get for α-quantiles and for α-expectiles The last inequality may be verified similarly as at (8.7), on observing that |δ_k(θ_1, θ_2)| = 1 if exactly one of x_{k1}, x_{k2} lies in the interval [θ_1, θ_2), and is zero otherwise. This observation also shows that For quantiles we then get by (8.9) (8.12), while for expectiles the estimates (8.10), (8.11) and Cauchy-Schwarz give, similarly as at (8.8), Assertion (iii) thus follows from (C2).

Lemma 8.2.
Up to adjustments of the constants, the assertions of Lemma 8.1 also hold for the interpolated processes D̃_n, with the following improvement of (iii): sup_{0≤θ_2−θ_1≤r} E(D̃_n(θ_2) − D̃_n(θ_1))^2 ≤ C_{q,e} r^{λ_{q,e}} (r ∈ [0, 1], n ≥ n_2) (8.14), i.e., with r^{λ_{q,e}} rather than (r ∨ β_n)^{λ_{q,e}}. Furthermore, Proof. For convenience we intermediately write β ≡ β_n. Given θ, there is exactly one ℓ ∈ Z and w ∈ [0, 1) such that θ = (1 − w)ℓβ + w(ℓ + 1)β. By Jensen's inequality which proves (8.15). Turning to (8.16), let us write Δ_k(θ) for the k-th term in the above sum. We first consider the quantile case. Recalling that ℓ and w are uniquely determined by θ, we get by (8.9) The right-hand side is always ≤ 4, and it vanishes except if both θ and any of y_k, x_{k1}, or x_{k2} lie in the interval [ℓβ, (ℓ + 1)β). Thus for fixed k there are at most 3 intervals of length β on which the function θ → Δ_k(θ) is non-zero. Consequently, ∫ Δ_k(θ) dθ ≤ 12β, so taking the average over k settles the quantile case. A similar reasoning applies in the expectile case. By (8.10), so averaging over k and using (C3) gives (8.16). Straightforward estimates yield the uniform Hölder condition (8.14), at first for points θ_1, θ_2 belonging to the same interval [ℓβ, (ℓ + 1)β), then for points belonging to two adjacent intervals, and finally for points with one or more such intervals in between, where we may apply (8.5). The analogs of assertions 1 and 2 in Lemma 8.1 are obvious.

Proof of Theorem 4.1.
Convergence of the finite-dimensional distributions being clear from Proposition 3.1, (4.4), and (8.15), we only need to prove (stochastic) asymptotic equicontinuity [45,54] and the uniform vanishing at infinity of the sample paths of D̃_n. Without loss of generality we may assume n ≥ n_1 ∨ n_2 (cf. (C4), (C2)). Distinguishing between quantiles and expectiles is not necessary here, so we omit the subscripts q, e in the quantities appearing in Lemmas 8.1 and 8.2. Moreover, by Lemma 8.2, quantities initially referring to D*_n such as ρ_n or ω_n may also be used with D̃_n, with the same bounds. Let u > 0. For any set T_0 ⊂ R, let N_n(u, T_0) denote the minimal cardinality of a subset T ⊂ T_0 such that min_{t∈T} ρ_n(θ, t) ≤ u for every θ ∈ T_0. Given b > 1, pick t_j ∈ [−b, b] equidistant with spacing r = 2(u/B)^{1/λ}. By (8.14), the minimal ρ_n-distance of any θ ∈ [−b, b] to the resulting set T is ≤ ω_n(r/2) ≤ u. Here and subsequently we write K for any independent finite constant, whose value may thus change from instance to instance.

Additional material
Proof. By (2.8), (3.3), and it suffices to show that e.g. the last term, to be denoted R_n, tends to zero in quadratic mean.
Similarly, E R_n^2 → 0: the off-diagonal terms in the double sum vanish, and by Jensen's and Cauchy's inequalities and (C3) the sum of the diagonal terms is O(n).

Justification of (C2), (C4).
We will show that the conditions (C2), (C4) are satisfied with probability arbitrarily close to one under common probability models for the data. Possible dependencies within the triplets (x k1 , x k2 , y k ) do not matter because (C2), (C4) effectively pertain to the marginal CDFs F n1 , F n2 , G n only. However, it is natural in the prediction setting to allow for serial dependence. Specifically, suppose that the predictions x k1 , x k2 and the observations y k each form a strictly stationary sequence, defined for all k ∈ Z.
CONDITION (C2). To verify (C2) for the empirical CDF G_n of the observations, e.g., we may invoke an estimate by W. B. Wu applying to certain causal processes of the form y_k = J(. . . , ε_{k−1}, ε_k), where J is measurable and the ε_k, k ∈ Z, are i.i.d. random variables. As an immediate consequence of [55, Theorem 2] one has, under the conditions given there, that where 2 < q < 4 and β_n is a sequence tending to zero sufficiently slowly (not faster than n^{−1}(log n)^{2q/(q−2)}). Markov's inequality then gives Putting r = 2^{−ℓ} and summing over ℓ = 0, 1, . . . we find that for any positive κ < 1/2 − 1/q we have with probability 1 − O(n^{−1}) that Thus (C2) holds with probability tending to one for the processes in question.
For an alternative justification let us consider the more common case where the y_k form a strong (α-)mixing sequence. In a first step we apply covariance inequalities due to E. Rio [47], yielding an estimate of the variance of the increments of G_n. Specifically, suppose that the mixing coefficients α_n decay as n^{−a} for some a > 1. Let G denote the common CDF of the y_k. Then as a consequence of [47, Theorem 1.2] we get that for some finite constant K Proof. (Cf. [47, pp. 590-591].) We have G_n(t) − G_n(s) = n^{−1} Σ_{1≤k≤n} ξ_k, ξ_k = 1_{s<y_k≤t}. The common 'quantile function' Q_ξ of the ξ_k is readily seen to be given by Q_ξ(u) = 1 if u < Pr[s < y_k ≤ t] = G(t) − G(s) ≡ Δ, and = 0 otherwise. Since α^{−1}(u) = Σ_k 1_{α_k>u} = O(u^{−1/a}), it follows that the term Further, by [47, Theorem 1.2], the limits of the sequences n Var(G_n(t) − G_n(s)) and n Var(G_n(t)), hence also of n Cov(G_n(s), G_n(t)), exist for all s, t. The latter, for instance, is given by the absolutely convergent sum lim_{n→∞} n Cov(G_n(s), G_n(t)) = Σ_{k∈Z} Cov(1_{y_0≤s}, 1_{y_k≤t}) ≡ Λ(s, t). (9.3) Analogous expressions hold for the other limits. We next appeal to the weak convergence of the processes n^{1/2}(G_n(t) − G(t)), t ∈ R, to the mean-zero Gaussian process V(t), t ∈ R, with covariance function Λ from (9.3), which follows from a stronger (almost sure) approximation result cited below. By the convergence of moments, the increments of V satisfy the analog of (9.2), E(V(t) − V(s))^2 ≤ K(G(t) − G(s))^{1−1/a}.
Since V is Gaussian, an application of the well-known Garsia-Rodemich-Rumsey lemma (see e.g. [3]), along with an intermediate time change, implies that there exists a positive constant K such that for any δ < (1 − 1/a)/2 the process V satisfies with probability one a pathwise Hölder condition of the form |V(t) − V(s)| ≤ K(G(t) − G(s))^δ, s < t (almost surely, 'a.s.'). (9.4) Now suppose that the CDF G is uniformly Hölder continuous with index κ ∈ (0, 1]. Then by (9.4), V also fulfils, for any η < κ(1 − 1/a)/2, In order to transfer this pathwise Hölder condition to the processes G_n we may apply a "Hungarian type" strong approximation result for the empirical process of a stationary sequence. As a consequence of [57, Theorem] or [12, Theorem 2.1], there exists a sequence of Gaussian processes V_n(t), t ∈ R, all copies of V, all defined on a common probability space carrying also the y_k, such that This reduces to O(r^κ) if we set β_n = n^{−1/(2κ)}, since then n^{−1/2} = β_n^κ ≤ r^κ for r ≥ β_n. The empirical processes F_{n1}, F_{n2} can be treated analogously. Thus we have shown that assumption (C2) is fulfilled with probability one under the indicated conditions, namely (sufficiently) strong mixing of the y_k and x_{k1}, x_{k2}, and Hölder continuity of their respective marginal CDFs.

CONDITION (C4). Suppose that the predictions x_{k1} form a strongly mixing sequence with common marginal CDF F_1. Suppose, furthermore, that we have a strong approximation of the empirical processes F_{n1} as in the previously discussed case. Using the same notation V_n for the approximating Gaussian processes, we then have, as in (9.5), sup_{θ∈R} |F_{n1}(θ) − F_1(θ) − n^{−1/2} V_n(θ)| = o(n^{−1/2}) (a.s.).
We now assume that ∫ |x|^q dF_1(x) < ∞ for some q > 0. Then

Analysis of the quantile forecast example (Section 5.2). The difference of the elementary quantile scores is
Taking our assumptions into account and passing to standard units, writing t_k = (θ − m_k)/s_k (and z_{kℓ} = (x_{kℓ} − m_k)/s_k), we get where Pr refers to the z_{kℓ} (resp. u_k), everything else being considered as nonrandom (given F_{k−1}). We henceforth omit the index k and use the abbreviation h_k(θ) ≡ h.