Asymptotic minimum scoring rule prediction

Most of the methods nowadays employed in forecast problems are based on scoring rules. A divergence function is associated with each scoring rule and can be used as a measure of discrepancy between probability distributions. This approach is commonly used in the literature for comparing two competing predictive distributions on the basis of their relative expected divergence from the true distribution. In this paper we focus on the use of scoring rules as a tool for finding predictive distributions for an unknown of interest. The proposed predictive distributions are asymptotic modifications of the estimative solutions, obtained by minimizing the expected divergence related to a general scoring rule. The asymptotic properties of such predictive distributions are strictly related to the geometry induced by the considered divergence on a regular parametric model. In particular, the existence of a global optimal predictive distribution is guaranteed for invariant divergences, whose local behaviour is similar to that of the well known α-divergences. We show that a wide class of divergences obtained from weighted scoring rules shares invariance properties with α-divergences. For weighted scoring rules it is thus possible to obtain a global solution to the prediction problem. Unfortunately, the divergences associated with many widely used scoring rules are not invariant. Still, for these cases we provide a locally optimal predictive distribution, within a specified parametric model.

MSC 2010 subject classifications: Primary 62M20; secondary 60G25.


Introduction
Recent years have seen growing interest in the use of scoring rules for statistical estimation and prediction. These emerge in forecast problems as a means to assess the relative quality of a probabilistic forecast; see [7], [24] and [40]. In particular, scoring rules provide a measure of the loss suffered by a forecaster in view of a certain outcome. The notion of proper scoring rule, which motivates a forecaster to be honest in his predictions, is an attractive property in the contexts of both prediction and estimation. Indeed, from the estimation point of view, proper scoring rules lead to unbiased estimating equations; see for instance [15]. Therefore, estimators of an unknown parameter based on proper scoring rules can be obtained by resorting to results from M-estimation; see [16], [34] and [35]. In the prediction framework, the use of scoring rules is almost exclusively restricted to measuring the relative quality of an available probability distribution in comparison to other distributions. Indeed, proper scoring rules furnish summary measures that simultaneously evaluate calibration and sharpness of probability forecasts; see [22].
There exists a great variety of scoring rules, such as the logarithmic score, the Tsallis score and the Bregman score. Moreover, [30] introduce the alternative class of weighted scoring rules that generalise already known score functions. Weighted scoring rules evaluate probability forecasts on the basis of a nonuniform distribution that represents the available information at the time of prediction. One general family of weighted scoring rules, which encompasses the class of [30], is the family of quasi-Bregman weighted scoring rules; see [19]. It should be noted that there are several different notions of weighted scoring rules in the literature, which differ from the one considered in this paper and in [30] and [19]. For instance, the weighted scoring rules introduced in [27] evaluate forecasts only on a restricted domain of the outcome.
A divergence function can be naturally associated with each proper or weighted scoring rule, thus providing an instrument for comparing different predictive distributions. Divergence functions derived from proper scoring rules are called score divergences; analogously, we call weighted score divergences those derived from weighted scores. The most famous score divergence is probably the Kullback-Leibler divergence, which is obtained from the logarithmic score.
The use of measures of discrepancy between probability distributions is widespread in statistics; see [39] and references therein. Many authors have considered the problem of finding optimal predictive distributions with respect to some suitable divergence; see, among others, [1], [10], [29] and [37]. As shown in [11], the existence of asymptotically optimal predictive distributions depends on the geometric properties of the considered divergence. In particular, the class of monotone and regular divergences, introduced in [8] and studied in [9] as a wide class of invariant divergences, leads to asymptotically optimal predictive distributions.
In this work we propose a wider and more complete use of scoring rules for prediction, which goes beyond the simple comparison of two competing predictive distributions. In particular, borrowing from [11], we consider the existence of predictive distributions that asymptotically minimize score and weighted score divergences. To this aim, we first study the local behaviour of score and weighted score divergences up to third order. This clarifies the relationship among commonly used score divergences, weighted score divergences and well known classes of invariant divergences, such as α-divergences and φ-divergences. We show that, for the most part, proper scoring rules used in practice lead to score divergences which are not monotone and regular. As a consequence, the geometry these divergences induce on a statistical manifold is not invariant. Instead, quasi-Bregman weighted divergences share the invariance properties of monotone and regular divergences.
From the predictive viewpoint, we provide asymptotically optimal predictive distributions for both score and weighted score divergences. The lack of invariance of score divergences implies that the proposed predictive distribution can only be found within a specified parametric model. On the other hand, for quasi-Bregman weighted divergences, the optimal predictive distribution does not depend on the considered parametric model, thus constituting a complete and global solution to the prediction problem.
The paper is organised as follows. In Section 2 we introduce the basic notions of differential geometry for statistical models that will be used throughout the paper. Then, in Section 3, we study the geometric properties of score divergences and weighted score divergences and characterise their Taylor expansions up to third order. In Section 4 we apply these new results in the context of prediction. We discuss the existence of an optimal predictive distribution that asymptotically minimizes the expected divergence from the true distribution. Section 5 is dedicated to some touchstone examples covering the range of situations that may be encountered in practice.

Basic concepts in differential geometry of statistical models
This section is devoted to recalling some basic notions on statistical manifolds. The reader wishing to deepen his knowledge of the field can refer to [3], [28], [38] and references therein.
Let χ be a set and A be a σ-algebra of subsets of χ. Let P be a parametric model, P = {p(x; θ) : θ ∈ Θ ⊆ R d }, with p(x; θ) densities with respect to some dominating measure μ on (χ, A). Under suitable regularity conditions (see [3], p. 16), P can be regarded as a d-dimensional manifold embedded in the set of all probability distributions on (χ, A). In this context, the parametrisation θ = (θ 1 , . . . , θ d ) plays the role of a system of local coordinates, so that any density p(x; θ) in P corresponds to a point in the manifold, uniquely determined by the value of θ.
The tangent space T θ at a point p(x; θ) of P is a d-dimensional vector space including the tangent vectors to all smooth curves passing through p(x; θ). It can be represented as the linear space spanned by the basis vectors ℓ i (x; θ) = ∂ℓ(x; θ)/∂θ i , i = 1, . . . , d, where ℓ(x; θ) = log p(x; θ). Thus, any tangent vector A(x) ∈ T θ can be represented as a linear combination of the ℓ i , i.e. A(x) = A i ℓ i (x; θ), where A i are the components of A(x) with respect to the basis ℓ i , i = 1, . . . , d.
Here and in the sequel, we make use of the Einstein summation convention, so that when an index appears twice in an expression, summation over that index is intended. When the tangent space T θ is provided with an inner product ⟨·, ·⟩, the manifold is said to be a Riemannian space. An inner product between two vectors A(x) and B(x) belonging to T θ can always be expressed in terms of inner products between pairs of basis vectors, ⟨A, B⟩ = A i B j ⟨ℓ i (·; θ), ℓ j (·; θ)⟩, where B j are the components of B(x) with respect to the basis vectors. The d 2 quantities ⟨ℓ i (·; θ), ℓ j (·; θ)⟩, i, j = 1, . . . , d, define a Riemannian metric in P. An inner product can be naturally defined on the tangent spaces of a statistical manifold by letting g ij (θ) = ⟨ℓ i (·; θ), ℓ j (·; θ)⟩ = E θ [ℓ i (X; θ)ℓ j (X; θ)], where E θ denotes the expected value with respect to the distribution p(x; θ). Notice that g ij (θ) are the components of the Fisher information matrix and constitute the so-called Fisher metric.
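The Fisher metric can be checked numerically on a simple model. The following sketch (the function name is ours) computes g(θ) = E θ [ℓ'(X; θ)²] exactly for a Bernoulli model and compares it with the closed form 1/(θ(1 − θ)):

```python
def fisher_bernoulli(theta):
    """Fisher information g(theta) = E_theta[l'(X; theta)^2] for X ~ Bernoulli(theta).

    The log-likelihood is l(x; theta) = x log(theta) + (1 - x) log(1 - theta),
    so the score is l'(x; theta) = x/theta - (1 - x)/(1 - theta).
    """
    def score(x):
        return x / theta - (1 - x) / (1 - theta)
    # Expectation over the two-point sample space {0, 1}.
    return (1 - theta) * score(0) ** 2 + theta * score(1) ** 2
```

The exact expectation over the two-point sample space reproduces the familiar formula without any simulation.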
In order to compare vectors belonging to different tangent spaces, we need to establish a correspondence between two adjacent tangent spaces. This is done by means of an affine connection. An affine connection is a sort of derivative, called covariant derivative and denoted by ∇, expressing the intrinsic change in a tangent vector as the point θ changes to θ' = θ + dθ in some direction along the surface P. In particular, d 3 quantities Γ ijk = ⟨∇ ℓ i ℓ j , ℓ k ⟩ are needed to define the k-th component of the change in the basis vector ℓ j when we move along the direction of ℓ i , i, j, k = 1, . . . , d. The quantities Γ ijk are said to be the coefficients or the Christoffel symbols of the affine connection. Affine connections allow us to describe important properties of a statistical manifold embedded in the set of all probability distributions on (χ, A), such as curvature, torsion and flatness. A whole class of affine connections is naturally introduced in a statistical manifold by considering the so-called Amari's α-connections, whose Christoffel symbols are defined as α Γ ijk (θ) = E θ [ (ℓ ij (X; θ) + ((1 − α)/2) ℓ i (X; θ)ℓ j (X; θ)) ℓ k (X; θ) ], where α is a scalar parameter and ℓ ij (x; θ) = ∂²ℓ(x; θ)/∂θ i ∂θ j .

Consider now a parametric model M = {p(x; ω) : ω ∈ Ω ⊆ R r }, r > d, including P. We say that P is a submanifold embedded in M if there exists a smooth and full rank mapping ω = ω(θ) from P to M. As a consequence, the tangent space of P at each point can be identified with a subspace of the corresponding tangent space of M. Assume that both P and M are provided with a Riemannian metric and a covariant derivative, denoted by ⟨·, ·⟩ and ∇ with superscript P and M, respectively. Notice that P naturally inherits the geometric structure defined in M, by projection of ⟨·, ·⟩ M and ∇ M . However, the metric and covariant derivative induced by M on P do not necessarily coincide with those previously defined on P. When they coincide, we say that ⟨·, ·⟩ and ∇ are an embedding invariant metric and covariant derivative, respectively.
Fisher metric and Amari's α-connections play a fundamental role in the theory of statistical manifolds, due to their property of invariance with respect to one-to-one transformations in the sample space (see [3]). Moreover, as [8] proved, they are the unique invariant metric and connections with respect to Markov embeddings on finite sample spaces.

The geometry of divergences
In this section we recall the concepts of divergence and geometry that a divergence induces on a parametric model. We show that the Riemannian metric and affine connections associated to a divergence, characterise its local behaviour up to third order. Moreover, we discuss in detail the geometric properties of divergences obtained from proper and weighted scoring rules. We refer the reader to [3], [4], [8], [9], [11], [14], [17], and [18] for deeper discussions on the subject.
Let χ be a set and A be a σ-algebra of subsets of χ. A divergence D is a non-negative function whose arguments are two probability measures defined on (χ, A), such that D(P, Q) = 0 if and only if P = Q. Let P be a parametric model, P = {p(x; θ) : θ ∈ Θ ⊆ R d }, with p(x; θ) densities with respect to some dominating measure μ on (χ, A). As we have seen in the previous section, P can be regarded as a d-dimensional manifold embedded in the set of all probability distributions on (χ, A), and the parametrisation θ = (θ 1 , . . . , θ d ) plays the role of a system of local coordinates. Thus, the divergence between p(x; θ) and p(x; θ') can be written as a function of θ and θ', D(θ, θ').
We can derive the geometry induced by a divergence D on P by differentiating D(θ, θ') with respect to θ and θ'. Indices on D indicate derivatives of D with respect to the components of the two arguments θ and θ', separated by a semicolon, so that, for instance, D i;j (θ, θ') = ∂²D(θ, θ')/∂θ i ∂θ' j , with all such derivatives evaluated on the diagonal θ' = θ. Now, we can define a Riemannian metric as g ij (θ) = −D i;j (θ, θ) and a family of affine connections whose Christoffel symbols are given by α Γ ijk (θ) = −((1 + α)/2) D ij;k (θ, θ) − ((1 − α)/2) D k;ij (θ, θ). Notice that the 1 and −1-connections correspond to −D ij;k and −D k;ij , respectively; they play an important role in the geometrical approach to inference, as pointed out in [14] and [17]. The local behaviour of D in a neighbourhood of θ can be expressed in terms of the metric and affine connections induced by D on P. Indeed,

D(θ, θ') = (1/2) g ij (θ)(θ' − θ) ij + (1/6) D ;ijk (θ, θ)(θ' − θ) ijk + o(‖θ' − θ‖³),   (3.1)

where (θ' − θ) i is the i-th component of (θ' − θ), (θ' − θ) ij and (θ' − θ) ijk are the products of two and three components of (θ' − θ), respectively, and the third order coefficients D ;ijk (θ, θ) can be expressed in terms of the metric and of the Christoffel symbols defined above.

Monotone and regular divergences
Monotone and regular divergences constitute a wide class of discrepancy functions useful in statistical applications. Important invariance properties of the geometry induced by monotone and regular divergences on a parametric model P have been derived in [8] and [9]. In particular, the Riemannian metric and the affine connections associated to such divergences are equivalent to the Fisher metric and Amari's α-connections. Indeed, monotonicity implies invariance with respect to Markov embeddings and allows the characterisation of the geometry induced by monotone divergences on statistical manifolds defined in finite dimensional sample spaces. Regularity extends this characterisation to any regular parametric model of absolutely continuous probability distributions.
Using these geometric properties, the local behaviour of monotone and regular divergences has been characterised in [9]. A Taylor expansion up to third order for a monotone and regular divergence, hereafter expansion (3.2), holds in terms of two constants A and B characteristic of the divergence, where g ij and α Γ ijk denote respectively the components of the Fisher metric and the Christoffel symbols of Amari's α-connections in P. This expansion is a particular case of the general equation (3.1), and follows from the fact that monotone and regular divergences induce an invariant geometric structure on the parametric family on which they are defined.

Examples
Well known examples of monotone and regular divergences are Csiszár's φ-divergences, defined as

D φ (P, Q) = ∫ p(x) φ(q(x)/p(x)) μ(dx),

where φ is a strictly convex function with φ(1) = 0; see [12]. Notice that for a φ-divergence the constants characterising Taylor expansion (3.2) are A φ = φ''(1)/2 and B φ = 3 + 2φ'''(1)/φ''(1). Taking φ α (t) = 4(1 − t^{(1+α)/2})/(1 − α²), we obtain the family of α-divergences, which includes the Kullback-Leibler divergence (α = −1) and twice the Hellinger distance (α = 0) as special cases.
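These relations can be illustrated numerically on a finite sample space. The sketch below (function names are ours) uses the parametrisation φ α (t) = 4(1 − t^{(1+α)/2})/(1 − α²), consistent with the conventions above: for α close to −1 the α-divergence approaches the Kullback-Leibler divergence, and for α = 0 it equals twice the squared Hellinger distance:

```python
import math

def phi_divergence(p, q, phi):
    """Csiszar phi-divergence D_phi(P, Q) = sum_x p(x) phi(q(x)/p(x))."""
    return sum(pi * phi(qi / pi) for pi, qi in zip(p, q))

def alpha_divergence(p, q, alpha):
    """alpha-divergence with phi_alpha(t) = 4 (1 - t^((1+alpha)/2)) / (1 - alpha^2).

    Only defined for alpha != +/- 1; the Kullback-Leibler case alpha = -1
    is recovered as a limit.
    """
    s = (1 + alpha) / 2
    c = 4 / (1 - alpha ** 2)
    return phi_divergence(p, q, lambda t: c * (1 - t ** s))

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_x p(x) log(p(x)/q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Evaluating the α-divergence at α slightly larger than −1 recovers the Kullback-Leibler divergence up to the discretisation of the limit.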
For an α-divergence the constants in (3.2) reduce to A = A φα = 1/2 and B = B φα = α. It may also be useful to consider a general family of divergences obtained by transforming a φ-divergence with a differentiable and increasing function h, with h(0) = 0. This is the class of (h, φ)-divergences, which has already been studied; see for instance [36] and [39]. It is easy to see that (h, φ)-divergences are monotone and regular.
Monotonicity of a φ-divergence means that, for every Markov embedding K and all probability measures P and Q on (χ, A), D φ (K(P ), K(Q)) ≤ D φ (P, Q); see [9]. Then monotonicity of D hφ follows from D hφ (K(P ), K(Q)) = h(D φ (K(P ), K(Q))) ≤ h(D φ (P, Q)) = D hφ (P, Q), since h is an increasing function. Moreover, D hφ (P, Q) is regular too, being a continuous transformation of a regular functional; see [8].
As a consequence, on a regular parametric model P, (h, φ)-divergences can be expanded in a neighbourhood of θ as in formula (3.2). The constants A = A hφ and B = B hφ can be easily calculated by observing that the second and third order derivatives of D hφ evaluated on the diagonal are respectively h'(0) times the corresponding derivatives of D φ , since the first order derivatives of D φ vanish on the diagonal. Thus, the metric associated to a (h, φ)-divergence has components 2h'(0)A φ g ij , where g ij is the Fisher metric and A φ = φ''(1)/2 is the constant A in (3.2) associated to the φ-divergence, so that A hφ = h'(0)A φ and B hφ = B φ . Analogously, the Christoffel symbols associated to a (h, φ)-divergence are h'(0) times those of the corresponding φ-divergence.

Score divergences
Scoring rules are loss functions that measure the quality of a proposed probability distribution Q for a random variable X taking values on χ, in view of the outcome x of X. Specifically, if a forecaster quotes a predictive distribution Q for X and the event X = x realises, then we can measure the loss using a score function S(x, Q); see [22].
The expected value of S(X, Q) when X has distribution P is called the expected score: S(P, Q) = E P [S(X, Q)]. In a frequentist approach, S(P, Q) suggests a way for evaluating the performance of Q when P is the true distribution for X: the smaller S(P, Q), the better Q as an estimate of P . From this viewpoint, a natural feature for a good scoring rule is that of propriety; see for example [15] and references therein. A scoring rule S is proper relative to P if S(P, Q) ≥ S(P, P ), for all P, Q ∈ P. It is strictly proper if equality holds only when P = Q.
Each proper scoring rule is associated with two functions, the entropy function H(P ) = inf Q∈P S(P, Q) and the divergence function D(P, Q) = S(P, Q) − S(P, P ). Indeed, it is easy to see that S is proper if and only if D(P, Q) ≥ 0 for all P, Q ∈ P, and S is strictly proper if and only if, in addition, D(P, Q) = 0 implies P = Q.
A divergence D induced by a proper scoring rule is called a score divergence.
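As a concrete check of these definitions, the following sketch (function names are ours) computes the Brier score on a finite sample space and verifies numerically that its expected score is minimized at Q = P , with divergence S(P, Q) − S(P, P ) equal to the squared Euclidean distance between the two probability vectors:

```python
def brier_score(x, q):
    """Brier score S(x, Q) = sum_j (1{j = x} - q_j)^2 on a finite sample space."""
    return sum(((1.0 if j == x else 0.0) - qj) ** 2 for j, qj in enumerate(q))

def expected_score(p, q):
    """Expected score S(P, Q) = E_P[S(X, Q)]."""
    return sum(px * brier_score(x, q) for x, px in enumerate(p))

def brier_divergence(p, q):
    """Divergence D(P, Q) = S(P, Q) - S(P, P) associated with the Brier score."""
    return expected_score(p, q) - expected_score(p, p)
```

For the Brier score the divergence reduces to Σ j (p j − q j )², so propriety (D ≥ 0) is immediate.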

Examples
Assume that P and Q have densities p and q, respectively, with respect to some dominating measure μ on (χ, A). Notice that the majority of the scoring rules presented in this section deal with continuous random variables; however, for each of these scoring rules we can also define a categorical counterpart by considering μ as a discrete measure; see for example the Brier score, the discrete counterpart of the quadratic score.
The logarithmic score is a well-known scoring rule (see [24]), defined as S(x, Q) = − log q(x). It is easily seen that the corresponding divergence is the Kullback-Leibler divergence

D(P, Q) = ∫ p(x) log(p(x)/q(x)) μ(dx).

The Tsallis score (see [41]), also known as the generalised power score, is given by, for γ > 1,

S(x, Q) = (γ − 1) ∫ q(y)^γ μ(dy) − γ q(x)^{γ−1}.

The associated divergence is

D(P, Q) = ∫ [ p(x)^γ − γ p(x) q(x)^{γ−1} + (γ − 1) q(x)^γ ] μ(dx).

For γ = 2, the Tsallis score reduces to the quadratic score S(x, Q) = ∫ q(y)² μ(dy) − 2q(x). In the special case of a finite sample space χ, an equivalent rule is the Brier score (see [7]), S(x, Q) = Σ y∈χ (δ x (y) − q(y))², where δ x denotes the point mass at x. The associated divergence is the square of the Euclidean distance between two probability functions, D(P, Q) = Σ y∈χ (p(y) − q(y))². The Bregman score is among the most used families of scoring rules; see [14] and [15]. It is defined as

S(x, Q) = −ψ'(q(x)) + ∫ [ ψ'(q(y)) q(y) − ψ(q(y)) ] μ(dy),

where ψ is a convex and differentiable function. The associated divergence can be written as

D(P, Q) = ∫ [ ψ(p(x)) − ψ(q(x)) − ψ'(q(x)) (p(x) − q(x)) ] μ(dx).

Notice that the logarithmic, the Tsallis and the Brier scores are all special cases of the Bregman score with ψ(p) = p log p, ψ(p) = p^γ and ψ(p) = (2p² − 1)/4, respectively.
The pseudo-spherical score (see [24]) is an example of a scoring rule which does not belong to the class of Bregman scores. Let ‖q‖ γ = ( ∫ q(x)^γ μ(dx) )^{1/γ} , with γ > 1. The pseudo-spherical score is defined as

S(x, Q) = − q(x)^{γ−1} / ‖q‖ γ ^{γ−1} .

The divergence associated to the pseudo-spherical score is

D(P, Q) = ‖p‖ γ − ∫ p(x) q(x)^{γ−1} μ(dx) / ‖q‖ γ ^{γ−1} .

For γ = 2 we obtain, as a special case, the spherical score S(x, Q) = −q(x)/‖q‖ 2 .
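On a finite sample space, the relations above can be verified numerically. The sketch below (function names are ours) evaluates the Bregman divergence for a given ψ and checks that ψ(p) = p log p and ψ(p) = p^γ recover the Kullback-Leibler and Tsallis divergences, respectively:

```python
import math

def bregman_divergence(p, q, psi, dpsi):
    """D(P, Q) = sum_x [psi(p) - psi(q) - psi'(q)(p - q)] for the Bregman score."""
    return sum(psi(pi) - psi(qi) - dpsi(qi) * (pi - qi) for pi, qi in zip(p, q))

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_x p(x) log(p(x)/q(x))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def tsallis_divergence(p, q, gamma):
    """D(P, Q) = sum_x [p^g - g p q^(g-1) + (g - 1) q^g]."""
    return sum(pi ** gamma - gamma * pi * qi ** (gamma - 1) + (gamma - 1) * qi ** gamma
               for pi, qi in zip(p, q))
```

The agreement is exact (up to floating point) because the identities hold pointwise in the summands, using Σ p = Σ q = 1 in the logarithmic case.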

The geometry of score divergences
Now we present some new results on the geometry of score divergences. These results allow us to characterise the local behaviour of such divergences, using expansion (3.1). The geometry associated with score divergences has already been considered in [14] in the context of a decision problem, under the name of decision geometry. Here we extend some of those results to the multi-parameter case and characterise the Taylor expansion of score divergences up to a higher order.
In particular, we derive the geometry induced by the Bregman and the pseudo-spherical divergences. We will see that it does not coincide with the geometry of the Fisher metric and Amari's α-connections, except for the special case of Kullback-Leibler divergence. This proves that, in fact, these divergences are not monotone and regular.
In the rest of this section, we consider a regular parametric model P = {p(x; θ) : θ ∈ Θ ⊆ R d }. Moreover, we use the same notation previously introduced, so that ℓ(x; θ) = log p(x; θ) and indices denote derivatives with respect to the different components of θ = (θ 1 , . . . , θ d ).
Bregman divergences. By differentiating (3.3), we obtain the Riemannian metric associated to the Bregman divergence,

g ij (θ) = ∫ ψ''(p(x; θ)) p i (x; θ) p j (x; θ) μ(dx),

where p i = ∂p/∂θ i , and, by further differentiation, the corresponding affine connections. The metric coincides with the decision metric given in [14] for the case of a scalar parameter.

Proposition 3.2. The Kullback-Leibler divergence is the only Bregman divergence that is also monotone and regular.
Proof. The proof easily follows by noticing that the above metric and affine connections recover the Fisher metric and Amari's α-connections, respectively, only when ψ(p) = p log p, i.e. when the Bregman score reduces to the logarithmic score.
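A small numerical illustration of this fact (our own sketch): differentiating the Bregman divergence twice along the Bernoulli model gives the metric g(θ) = Σ x ψ''(p(x; θ)) (∂p(x; θ)/∂θ)², which can be compared with the Fisher information directly.

```python
def bregman_metric_bernoulli(theta, d2psi):
    """Metric induced by a Bregman divergence on the Bernoulli model:
    g(theta) = sum_x psi''(p(x; theta)) (dp(x; theta)/dtheta)^2,
    with p(1; theta) = theta, p(0; theta) = 1 - theta,
    and dp/dtheta equal to +1 at x = 1 and -1 at x = 0."""
    return d2psi(theta) + d2psi(1.0 - theta)

def fisher_bernoulli(theta):
    """Fisher information of the Bernoulli model."""
    return 1.0 / (theta * (1.0 - theta))
```

With ψ(p) = p log p (ψ'' = 1/p) the Bregman metric equals the Fisher metric, while with ψ(p) = p² (ψ'' = 2) it is constant in θ and hence differs from it.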
As a special case of the Bregman divergence, we consider the Tsallis divergence, which is obtained for ψ(p) = p^γ .

F. Giummolè and V. Mameli
Then, the Taylor expansion up to third order of the Tsallis divergence follows from (3.1) with the metric and connections given above. In a similar way we can find the expansion for the Brier divergence.

Pseudo-spherical divergences
We now present the geometry associated to the pseudo-spherical divergence, using the same notation as in the previous case. Moreover, let ‖p(θ)‖ γ = ( ∫ p(x; θ)^γ μ(dx) )^{1/γ} . By straightforward differentiation of (3.4), we obtain the metric, which coincides with that of [14] for a scalar parameter, and the affine connections. We can thus state the following result.

Proposition. No pseudo-spherical divergence is monotone and regular.

Proof. The proof easily follows by noticing that the above metric and affine connections always differ from the Fisher metric and Amari's α-connections, respectively. Now, the Taylor expansion up to third order of the pseudo-spherical divergence follows again from (3.1).

The relationship between monotone and regular divergences and score divergences
In the previous section, we have proved that pseudo-spherical divergences and Bregman divergences, except for the Kullback-Leibler divergence, are not monotone and regular. This has been done by showing that the geometry induced by these divergences on a regular parametric model is different from the geometry defined by the Fisher metric and Amari's α-connections. Now we prove that (h, φ)-divergences are not score divergences, i.e. there exists no proper scoring rule whose associated divergence lies in the class of (h, φ)-divergences. The only exception to this statement is the logarithmic score with the Kullback-Leibler divergence. This result was conjectured by Dawid in [14] for α-divergences with α ≠ −1, but not proved.

Proof. Consider P , Q and R probability distributions defined on (χ, A), with densities p, q and r with respect to some dominating measure μ.
It is shown in [13], Section 11, that if D is a score divergence then D(P, Q) − D(P, R) can be written as an affine function of P , for any fixed Q and R. This means that there exists a function f (x; Q, R) such that D(P, Q) − D(P, R) = E P [f (X; Q, R)]. Now, let D hφ be a (h, φ)-divergence. We can write D hφ (P, Q) − D hφ (P, R) = h(D φ (P, Q)) − h(D φ (P, R)). It is easy to see that this expression is an affine function of P if and only if h is an increasing linear function and φ is a linear transformation of the logarithm. Indeed, note that if h is linear, then the function φ(q/p) − φ(r/p) must be independent of p. Thus, ∂φ(q/p)/∂p = ∂φ(r/p)/∂p, so that ∂φ(q/p)/∂p does not depend on q. These considerations lead to the following differential equation: φ''(t)t + φ'(t) = 0, which admits the solution φ(t) = A log(t) + B, with A and B two suitable constants.
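The last step can be spelled out: the equation t φ''(t) + φ'(t) = 0 is an exact derivative, which yields the stated solution.

```latex
t\,\varphi''(t) + \varphi'(t)
  = \frac{d}{dt}\bigl(t\,\varphi'(t)\bigr) = 0
  \;\Longrightarrow\; t\,\varphi'(t) = A
  \;\Longrightarrow\; \varphi(t) = A\log t + B .
```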
Finally, as a consequence of the previous results, we can state that the Kullback-Leibler divergence is essentially the unique intersection between the classes of (h, φ)-divergences and Bregman divergences. Actually, we believe that this result is even more general, but this is only a conjecture.
Conjecture. The Kullback-Leibler divergence is the unique intersection between the class of monotone and regular divergences and the class of score divergences.

Weighted score divergences
A different way of deriving divergences from scoring rules relies on the use of weighted scoring rules, introduced in [30], which represent an interesting generalisation of the above defined (unweighted) scoring rules. Weighted scoring rules depend on a baseline probability distribution P , which represents the available information at the time of prediction. Thus, the expected weighted scoring rule allows the comparison of two forecasts R and Q, taking P as the reference distribution. It is a function of three arguments: S(Q, R||P ). The entropy function associated to a weighted scoring rule, H(Q||P ) = S(Q, Q||P ), can be used to define a weighted divergence between P and Q:

D w (P, Q) = H(Q||P ) − H(P ||P ).   (3.5)

As stated in [19], the entropy H(Q||P ) is minimized only when the forecaster's true belief Q is exactly the baseline distribution P . More formally, a baseline distribution P is defined as the unique distribution which minimizes the entropy, i.e. H(P ||P ) ≤ H(Q||P ), for all Q ∈ P. Thus, it follows that D w (P, Q) ≥ 0, with equality holding if and only if P = Q.
It is important to notice that weighted divergences are defined from weighted scoring rules by means of the corresponding entropy function. This construction is different from the one used for score divergences.
The dependence of the scoring rule on the non-uniform baseline distribution P can be introduced in different ways, depending on the problem under consideration. Here we focus on scoring rules that depend on the ratio between the considered probability distributions and the baseline P ; see [30]. Such a dependence implies that the associated divergence (3.5) measures the distance between two probability distributions P and Q in terms of the ratio Q/P . From our point of view, this condition results in divergences that are invariant with respect to one-to-one transformations of the sample space.

Examples
This section deals with some examples of weighted scoring rules and associated divergences. The weighted power score (see [31]), also known as the weighted Tsallis score, is a generalisation of the power score, based on the ratio between the forecast density and the baseline density. The associated weighted power divergence is

D w (P, Q) = (1/(γ(γ − 1))) [ ∫ q(x)^γ p(x)^{1−γ} μ(dx) − 1 ],

with γ ≠ 0, 1. When γ → 1 the weighted power score reduces to the weighted logarithmic score. For γ = 2, we obtain the weighted Brier scoring rule. For more examples and details we refer to [30] and [31].
As already pointed out, for α ≠ −1 the α-divergences are not score divergences, that is, they are not associated to an unweighted scoring rule. Anyway, for α ≠ ±1 they can be easily obtained from a class of weighted scoring rules which is equivalent, up to a linear transformation, to the weighted power score with γ = (1 + α)/2. The weighted pseudo-spherical score (see [31]) is defined in terms of the same ratio q/p. The associated divergence is

D w (P, Q) = (1/(γ − 1)) [ ( ∫ q(x)^γ p(x)^{1−γ} μ(dx) )^{1/γ} − 1 ].

It can be easily shown that it is monotonically related to the weighted power divergence through the function h(x) = [((γ − 1)x + 1)^γ − 1]/[γ(γ − 1)]. Note that for γ → 1, the weighted logarithmic score represents a limiting case of the weighted pseudo-spherical score too.
The family of quasi-Bregman weighted scoring rules has recently been defined by [19] for a random variable X taking values in a finite sample space χ = {x 1 , . . . , x n }. It is associated to a class of divergences expressed in terms of three functions f , g and l and of the probability vectors p = (p 1 , . . . , p n ) and q = (q 1 , . . . , q n ), where p i = P ({x i }) and q i = Q({x i }), i = 1, . . . , n. The function f is positive, g is twice differentiable and strictly convex, l is twice differentiable and strictly increasing, and g' and l' denote the derivatives of g and l, respectively. The family of quasi-Bregman weighted scoring rules has been defined only for finite sample spaces, but in some cases it can be extended to continuous random variables. In fact, notice that the weighted power and pseudo-spherical scores are special instances of the quasi-Bregman weighted scoring rules with f (x) = x.

The geometry of quasi-Bregman weighted score divergences
In this section we consider the divergences associated to the quasi-Bregman weighted scoring family characterised by f (x) = x; we denote the resulting class of divergences by (3.6). We show that, under suitable conditions on g and l, they are monotone and regular. As a consequence, their local behaviour is characterised by (3.2) with the Fisher metric, Amari's α-connections and suitable constants A and B.

Proposition. If l(x) = (x − g(1))/g'(1), the divergence in (3.6) is monotone and regular.

Proof. First, notice that l(x) = (x − g(1))/g'(1) is a strictly increasing linear function, since g'(x) > 0 for all x. Thus, with this choice of l, the divergence (3.6) reduces to a φ-divergence, which is monotone and regular. The previous result can be generalised as follows.
Theorem 3.4. Assume that l is such that l(g(1)) = 0. Then the divergence in (3.6) is monotone and regular.

Proof. Note that the function h(x) = l(g'(1)x + g(1)) is strictly increasing, twice differentiable and such that h(0) = l(g(1)) = 0. Then, the divergence in (3.6) can be rewritten as an (h, φ)-divergence. Notice that the weighted power and pseudo-spherical scores satisfy the conditions of Theorem 3.4 on l and g, thus being equivalent to (h, φ)-divergences.
It is easy to verify that the constants characterising expansion (3.2) are the same for the weighted pseudo-spherical and the weighted power divergence, namely A = 1/2 and B = 2γ − 1. This is due to the fact that, as already said, the weighted pseudo-spherical divergence is an increasing transformation of the weighted power divergence with h(x) = [((γ − 1)x + 1)^γ − 1]/[γ(γ − 1)] such that h'(0) = 1; see also [31].
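The monotone relation between the two weighted divergences can be checked numerically. The sketch below uses explicit forms D pw (P, Q) = [Σ q^γ p^{1−γ} − 1]/(γ(γ − 1)) and D ps (P, Q) = [(Σ q^γ p^{1−γ})^{1/γ} − 1]/(γ − 1); these expressions are our reconstruction of standard forms and may differ from the paper's normalisation, but with this parametrisation they satisfy the relation through the function h above (here h maps the pseudo-spherical to the power divergence):

```python
def weighted_power_divergence(p, q, gamma):
    """Assumed form: D_pw = [sum q^g p^(1-g) - 1] / (g (g - 1))."""
    m = sum(qi ** gamma * pi ** (1.0 - gamma) for pi, qi in zip(p, q))
    return (m - 1.0) / (gamma * (gamma - 1.0))

def weighted_pseudospherical_divergence(p, q, gamma):
    """Assumed form: D_ps = [(sum q^g p^(1-g))^(1/g) - 1] / (g - 1)."""
    m = sum(qi ** gamma * pi ** (1.0 - gamma) for pi, qi in zip(p, q))
    return (m ** (1.0 / gamma) - 1.0) / (gamma - 1.0)

def h(x, gamma):
    """h(x) = [((g - 1) x + 1)^g - 1] / (g (g - 1)); h(0) = 0 and h'(0) = 1."""
    return (((gamma - 1.0) * x + 1.0) ** gamma - 1.0) / (gamma * (gamma - 1.0))
```

Since h(0) = 0 and h'(0) = 1, the two divergences agree up to third order at the diagonal, which is why they share the constants A and B.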

The prediction problem
An important application of the new results presented in the previous sections is in the context of prediction.
Let x be an observed random sample of size n from the random vector X = (X 1 , . . . , X n ) with joint distribution p X (x; θ), where θ ∈ Θ ⊆ R d is an unknown parameter. Notice that the components of X may be dependent and with different distributions, as in the examples of Sections 5.2 and 5.3. The problem of prediction focuses on a future, as yet unobserved, random variable Y whose distribution is related to the distribution of X. In general, X and Y are dependent and we need to specify the joint distribution p(x, y; θ) or, equivalently, the conditional distribution of Y |X, p(y; θ|x). To simplify the analysis we confine ourselves to the case of independence, where the distribution of Y belongs to a regular parametric model P = {p(y; θ) : θ ∈ Θ ⊆ R d } indexed by the same unknown parameter as p X (x; θ). Nevertheless, the following results still apply to the case of dependence, after substituting the marginal density of Y with the conditional density of Y |X, as shown in the example of Section 5.2. Here we assume that the model P is correctly specified. An extension to the case of misspecification is possible by considering M-estimators, as suggested in [21].
A complete solution to the prediction problem can be expressed as a whole distribution for Y , depending on the observed sample x. This is called a predictive distribution and is denoted by p̂(y; x). Notice that, in general, p̂(y; x) may also lie outside the model P.
In a Bayesian framework, the conditional distribution of Y given X = x can be obtained by integrating p(y; θ) with respect to the posterior distribution of θ. Though natural and appealing, the Bayesian approach may be computationally demanding for high dimensional parameters.
Following the classical approach, a predictive distribution can be obtained by replacing the parameter θ with an efficient estimate θ̂ = θ̂ n (x) in the density of Y . The resulting density p(y; θ̂) is called estimative or plug-in, and belongs to the parametric model P. For instance, in the review paper [23] and references therein, regression models are used to obtain probabilistic forecasts in the form of full probability distributions; these papers consider estimative solutions to the prediction problem, obtained by substituting the unknown regression parameters with minimum scoring rule estimates. Unfortunately, it is well known that, because of the uncertainty introduced by assuming θ = θ̂, the estimative solution may give rise to inaccurate predictions, especially when the dimension of the parameter θ is large compared to the sample size.
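A minimal sketch of the estimative approach, for an i.i.d. normal model (the model choice and function names are ours): the maximum likelihood estimates are simply plugged into the density of Y.

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def estimative_predictive(x):
    """Plug-in predictive density p(y; theta_hat): the unknown mean and variance
    are replaced by their maximum likelihood estimates computed from the sample x."""
    n = len(x)
    mu_hat = sum(x) / n
    sigma_hat = math.sqrt(sum((xi - mu_hat) ** 2 for xi in x) / n)
    return lambda y: normal_pdf(y, mu_hat, sigma_hat)
```

Because θ̂ is treated as if it were the true parameter, this density ignores estimation error; for small n it is typically too concentrated, which is precisely the inaccuracy that the asymptotic modifications studied below are designed to correct.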
The goodness of a predictive distribution can be evaluated by studying its long run properties, in a frequentist perspective. In this respect, basically two different criteria have been considered in the literature. One considers the coverage probability of prediction intervals obtained from a predictive distribution; see among others [5], [20], [25], [32], [43]. The other measures the performance of a predictive distribution by means of a divergence from the true distribution; see for instance [10], [26], [29], [42]. Other solutions based on the concept of predictive likelihood have also been considered; see [6] for a comprehensive review.
Here we focus on the approach to prediction based on divergence functions. In particular, we consider divergences obtained from scoring rules.
To our knowledge, up to now scoring rules have been used only to measure the relative quality of a proposed probability distribution. Such an approach allows for the comparison of already known predictive distributions. In this paper we propose to use proper scoring rules as a tool for constructing new predictive distributions for the unknown of interest. The proposed predictive distributions turn out to be asymptotic modifications of the estimative distribution, obtained by minimizing divergences associated to proper and weighted scoring rules. Borrowing from [11] and using the results presented in the previous sections, we analyse the asymptotic properties of the resulting predictive distributions.
In the following sections we review some general concepts on predictive distributions that asymptotically minimize an expected divergence from the true density. First we consider the asymptotically optimal predictive distribution within a parametric model M containing P. Then we discuss the existence of a global solution, independent of the particular enlargement M that we may consider. We focus on the main underlying ideas rather than on technical details. For an extensive treatment, see [11] and references therein.

A locally optimal predictive distribution
A predictive distribution p̂(y; x) can be obtained by minimizing the expected divergence from the true distribution,

E_θ[D(p(y; θ), p̂(y; X))] = ∫ D(p(y; θ), p̂(y; x)) p_X(x; θ) μ(dx),    (4.1)

over the set of all probability distributions for Y. Since the true value of the parameter θ is not known, the minimization has to be carried out uniformly in θ. Unfortunately, this is not an easy task, except in some very special cases.
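The expected divergence (4.1) can be approximated by Monte Carlo. The following sketch (our own toy setting) does this for the estimative solution in the model N(θ, 1), where the Kullback-Leibler divergence between two unit-variance normals has the closed form KL(N(a,1), N(b,1)) = (a − b)²/2, so the expected divergence should be close to 1/(2n):

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of (4.1) for the estimative density in the toy model
# X_i, Y ~ N(theta, 1).  KL(N(a,1), N(b,1)) = (a - b)^2 / 2 in closed form.
theta, n, reps = 0.0, 20, 200_000

# MLE theta_hat = sample mean, over many replicated samples of size n.
theta_hat = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
expected_kl = np.mean((theta - theta_hat) ** 2 / 2)

# Theory: E[(theta_hat - theta)^2] = 1/n, so the expected divergence ~ 1/(2n).
print(expected_kl, 1 / (2 * n))
```

This is the quantity that the corrected predictive distributions of this section aim to reduce below its estimative value.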
A restricted solution to the problem can be found by minimizing the leading terms of the asymptotic expansion of (4.1), within a parametric model M including P as a sub-model. The resulting predictive distribution belongs to the enlarged model M.
To fix the notation, let M = {p(y; ω), ω ∈ Ω ⊆ R^r}, with r > d, be any regular parametric model containing P. We can consider on M the coordinate system ω = (θ, s), where θ^i, i = 1, . . . , d, are the old coordinates on P and s^I, I = d + 1, . . . , r, are new coordinates in M. We use indices A, B, C, . . . for the coordinate system ω = (θ, s) in M; i, j, k, . . . for the components of θ in P; and I, J, K, . . . for the components of s. We suppose that s = 0 for the points in P and that θ and s are orthogonal in P, i.e. the mixed components iI of the metric tensor defined by the divergence D on M, calculated at the points of P, are zero. Moreover, we use the same notation as in Section 2, so that ℓ(y; ω) = log p(y; ω) and indices denote derivatives of ℓ with respect to the components of ω.
Assuming that the true value of the parameter ω in M is ω_0 = (θ_0, 0), with θ̂ − θ_0 = O_p(n^{−1/2}), where n is the sample size, the optimal predictive density in M which asymptotically minimizes the expected divergence is given by (4.2), where i^{ij}(θ) = lim_{n→∞} n E_θ[(θ̂ − θ)^i (θ̂ − θ)^j] and h_I(y; θ) = ℓ_I(y; θ, s)|_{s=0}; see [11], formula (12). Note that g and Γ refer to the geometry induced by the divergence D on M and g^{KI} are the components of the inverse of the matrix with components g_{KI}. The predictive distribution (4.2) is obtained by shifting the estimative density p(y; θ̂) in a direction orthogonal to P. It does not constitute a global solution to the problem of prediction, since its expression depends on the parametric model M. Thus, a better solution could be obtained by further enlarging M.

A global solution to the problem of prediction
The correction in (4.2) to the estimative density can be interpreted as an optimal shift of the estimative density along a direction in M which is orthogonal to P.
This orthogonal correction depends on the geometry induced by the divergence D on M. Now, as mentioned at the end of Section 2, it should be pointed out that the Riemannian metric and the family of affine connections defined by a divergence on M are not necessarily compatible with those induced on P. In fact, as shown in [3], such an invariance property characterises the Fisher metric and Amari's α-connections.
It turns out that a sufficient condition for obtaining a global solution to the prediction problem is the invariance of the metric and the connections induced by the considered divergence D. Under this condition, expression (4.2) can be written independently of M, as in (4.3); see [11], Proposition 6.1. Here g_{ij} and Γ^α_{ijk} are the components of the Fisher metric and of Amari's α-connections, and α is a constant depending on the divergence D.
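For reference, Amari's α-divergence between densities p and q (for α ≠ ±1) is D_α(p, q) = 4/(1 − α²) · (1 − ∫ p^{(1−α)/2} q^{(1+α)/2}). A small numerical sketch, evaluated for two unit-variance normals of our own choosing:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Amari's alpha-divergence (alpha != +-1), computed by numerical integration:
# D_alpha(p, q) = 4/(1 - alpha^2) * (1 - Int p^((1-alpha)/2) q^((1+alpha)/2)).
def alpha_divergence(p, q, alpha):
    integrand = lambda z: p(z) ** ((1 - alpha) / 2) * q(z) ** ((1 + alpha) / 2)
    integral, _ = quad(integrand, -np.inf, np.inf)
    return 4 / (1 - alpha ** 2) * (1 - integral)

p = lambda z: stats.norm.pdf(z, 0.0, 1.0)
q = lambda z: stats.norm.pdf(z, 0.5, 1.0)

# alpha = 0 gives (four times) the squared Hellinger distance; for two normals
# with equal variance the integral equals exp(-(mu1 - mu2)^2 / (8 sigma^2)).
d = alpha_divergence(p, q, alpha=0.0)
print(d)
```

The invariance property discussed above is what singles out this family: divergences whose local geometry matches some α-divergence admit the global solution (4.3).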
Thus, when the geometry induced by the divergence D is invariant, (4.3) constitutes a global solution to the problem of prediction, in that it asymptotically minimizes the expected divergence from the true distribution in all parametric models including P.
In Section 3, the geometric properties of divergences derived from proper scoring rules and weighted scoring rules have been investigated. We have seen that the geometry induced by both the Bregman and the pseudo-spherical divergences is not invariant, except in the case of the Kullback-Leibler divergence. As a consequence, there is no global solution to the prediction problem based on these classes of loss functions. Nevertheless, using (4.2), an improvement on the estimative distribution can still be proposed whenever the original parametric model can be embedded in a larger one. When such an embedding is not available, the Bregman and the pseudo-spherical divergences can only be used for the comparison of competing predictive distributions, as shown in the numerical examples of the following section. Instead, under suitable conditions on the functions f, g and l, quasi-Bregman weighted divergences are in fact monotone and regular divergences. Their use in the context of prediction is thus equivalent to that of α-divergences, for some suitable value of α, and the existence of an asymptotically optimal predictive distribution is guaranteed by formula (4.3).

Examples
In this section we first consider three toy examples involving Gaussian distributions with unknown mean and known variance. Though not of great practical use, they allow us to calculate analytically a globally optimal predictive distribution for invariant divergences. Moreover, using non-invariant divergences, a locally optimal predictive distribution can still be obtained analytically, improving on the simple estimative distribution within the enlarged class of Gaussian distributions with unknown mean and variance.

The normal model with unknown mean
Let x be a random sample from X = (X_1, . . . , X_n) and assume that we want to predict the random variable Y = (1/m) Σ_{i=1}^m X_{n+i}, where X_1, . . . , X_{n+m} are independent and identically distributed as N(μ, σ_0^2), with σ_0^2 known. The maximum likelihood estimator for μ is μ̂ = X̄ = (1/n) Σ_{i=1}^n X_i, which is normal with mean μ and variance σ_0^2/n. Indices 1 and 2 refer to μ and σ, respectively. Since Y ∼ N(μ, σ_0^2/m), the required quantities easily follow. The asymptotically optimal predictive density with respect to an α-divergence has already been derived in [11]. The weighted power and weighted pseudo-spherical divergences are asymptotically equivalent up to the second order, so that the optimal predictive density associated with both divergences is p̂(y; x) = φ(y − μ̂, ·), which corresponds to the optimal predictive distribution obtained for an α-divergence with α = 2γ − 1; see [11]. In the preceding formula and in the following examples, φ(·, σ^2) denotes the density of a centred normal distribution with variance σ^2. Now we want to derive the optimal predictive density associated with the Tsallis divergence. Since the geometry induced by the Tsallis divergence is not invariant, we are only able to find the optimal predictive density within an enlarged model M. Let M = {N(μ, σ^2) : μ ∈ R, σ > 0} be the extended manifold containing P = {N(μ, σ_0^2) : μ ∈ R}. The components of the metric and the connections induced by the Tsallis divergence on M have already been obtained in Section 3.2.2. Here we consider only the components needed to derive the predictive distribution.

F. Giummolè and V. Mameli
The optimal predictive density in M with respect to the Tsallis divergence then follows from (4.2). Note that it does not depend on the parameter γ. Moreover, it coincides with the global optimal solution obtained by minimizing the Kullback-Leibler divergence.
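One consequence of the Kullback-Leibler correction in this example can be checked numerically. In the known-variance normal model, Y − μ̂ is exactly N(0, σ_0²/m + σ_0²/n), so the variance-inflated normal density centred at μ̂ is the exact (and KL-optimal) predictive, while the estimative N(μ̂, σ_0²/m) is too concentrated. A sketch comparing average logarithmic scores (negatively oriented, lower is better), with sample sizes of our own choosing:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Known-variance normal example: estimative N(mu_hat, sigma0^2/m) versus the
# variance-inflated predictive N(mu_hat, sigma0^2/m + sigma0^2/n), compared by
# the average negative log score over replicated samples.
mu, sigma0, n, m, reps = 1.0, 1.0, 10, 5, 100_000

xbar = rng.normal(mu, sigma0 / np.sqrt(n), size=reps)   # mu_hat = sample mean
y = rng.normal(mu, sigma0 / np.sqrt(m), size=reps)      # future mean of m obs.

score_est = -stats.norm.logpdf(y, xbar, sigma0 / np.sqrt(m)).mean()
score_opt = -stats.norm.logpdf(y, xbar, sigma0 * np.sqrt(1 / m + 1 / n)).mean()

print(score_est, score_opt)   # the inflated predictive attains the lower score
```

This mirrors, in simulation, the analytical finding that the corrected predictive improves on the estimative one.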
Next we turn our attention to the pseudo-spherical divergence. Let M be the extended manifold containing P, as in the case of the Tsallis divergence. The components of the metric and of the connections can be found in Section 3.2.2. As before, we consider only the components needed to derive the predictive distribution.
Hence the asymptotically optimal predictive density in M with respect to the pseudo-spherical divergence again follows from (4.2). It should be pointed out that it coincides with the global solution obtained from the Kullback-Leibler divergence when γ = 5/3.

The autoregressive model
Let {X_n}_{n≥0} be an autoregressive process of order 1, with X_0 = 0 and X_n | X_{n−1} ∼ N(ρ X_{n−1}, σ_0^2), for n ≥ 1, where σ_0^2 is known and |ρ| < 1. Suppose that X = (X_0, . . . , X_n) is observed and we want to predict the value of the future random variable Y = X_{n+1}. Since X and Y are dependent, we work with the conditional density of Y |X, p(y; ρ|x), and we denote the predictive density by p̂(y|x). As is well known, the maximum likelihood estimator of ρ is asymptotically normal with asymptotic variance 1 − ρ^2. Let us denote by 1 and 2 the indices corresponding to ρ and σ, respectively, and let ℓ_1(y; ρ|x) denote the derivative of the conditional log-likelihood ℓ(y; ρ|x) with respect to ρ. The optimal predictive density associated with the α-divergence is given in [11]. Our aim here is to derive the optimal predictive density associated with the Tsallis divergence within the enlarged model M with unknown ρ and σ^2. For this purpose, we need the corresponding components of the metric and the connections induced by the Tsallis divergence. We find that the optimal predictive distribution in M associated with the Tsallis divergence again follows from (4.2). It is worth pointing out that even in this case it does not depend on γ and it reduces to the global optimal predictive distribution with respect to the Kullback-Leibler divergence.
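The estimative side of this example can be sketched as follows: simulate an AR(1) path, compute the conditional maximum likelihood (least squares) estimate of ρ, and centre the estimative one-step predictive N(ρ̂ x_n, σ_0²) accordingly. Parameter values are our own choice:

```python
import numpy as np

rng = np.random.default_rng(3)

# AR(1) with X_0 = 0 and X_t | X_{t-1} ~ N(rho * X_{t-1}, sigma0^2).
rho, sigma0, n = 0.6, 1.0, 500

x = np.zeros(n + 1)
for t in range(1, n + 1):
    x[t] = rho * x[t - 1] + rng.normal(0.0, sigma0)

# Conditional MLE of rho: least squares regression of X_t on X_{t-1}.
rho_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)

# Centre of the estimative one-step predictive N(rho_hat * x_n, sigma0^2).
pred_mean = rho_hat * x[-1]
print(rho_hat, pred_mean)
```

The corrected predictive of the text modifies this conditional density to account for the sampling variability of ρ̂.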

The normal non-linear model
Let y be an observation from a random vector Y = (Y_1, . . . , Y_n) such that Y_i = μ(x_i, β) + ε_i, where x_i is a vector of m predictors and β = (β_1, . . . , β_m) is an unknown vector of parameters. We assume that the random variables ε_i are independent and normally distributed, with E(ε_i) = 0 and known Var(ε_i) = σ_0^2. Suppose further that (1/n) Σ_{i=1}^n (∂μ(x_i, β)/∂β)(∂μ(x_i, β)/∂β)^T converges to a matrix Σ, with Σ known. Our purpose here is to predict the future observation y_{n+1} from the random variable Y_{n+1}, independent of Y and defined as Y_{n+1} = μ(x_{n+1}, β) + ε_{n+1}, with ε_{n+1} ∼ N(0, σ_0^2). As is well known, the maximum likelihood estimator β̂ for the regression parameters β is asymptotically normal with mean β and covariance matrix of order O(n^{−1}).

The optimal predictive density for Y_{n+1} with respect to an α-divergence can be explicitly determined using formula (4.3). Notice that, if μ is a linear function, it reduces to the solution found in [11] for the classical linear model. This result represents a global solution to the problem of prediction with respect to α-divergences and other asymptotically equivalent divergences, such as those derived from weighted scoring rules in Section 3.3.
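As a concrete sketch of the estimative baseline in this setting, the following code fits a hypothetical nonlinear mean function μ(x, β) = β_1 exp(β_2 x) (our own choice, not from the paper) by least squares, which coincides with maximum likelihood under known normal errors, and centres the estimative predictive at μ(x_{n+1}, β̂):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(6)

# Hypothetical nonlinear regression: Y_i = mu(x_i, beta) + eps_i with known
# error variance sigma0^2 and mean function mu(x, b1, b2) = b1 * exp(b2 * x).
sigma0, n = 0.1, 100
beta_true = (1.0, 0.5)

x = np.linspace(0.0, 2.0, n)
mu = lambda x, b1, b2: b1 * np.exp(b2 * x)
y = mu(x, *beta_true) + rng.normal(0.0, sigma0, size=n)

# Least squares = MLE here, since the errors are normal with known variance.
beta_hat, _ = curve_fit(mu, x, y, p0=(1.0, 1.0))

# Estimative predictive for Y_{n+1}: N(mu(x_new, beta_hat), sigma0^2).
x_new = 2.5
print(mu(x_new, *beta_hat))
```

The globally optimal predictive of formula (4.3) then inflates this plug-in density to reflect the uncertainty in β̂.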
Instead, for non-invariant divergences we can only find a partial solution within some enlarged parametric model. For instance, consider the Tsallis divergence in the enlarged parametric model with both β and σ^2 unknown. The essential terms for the derivation are the components of the metric and the connections, where the index m + 1 refers to σ^2. The local optimal predictive density for Y_{n+1} with respect to the Tsallis divergence then follows from (4.2). As in the previous examples, it does not depend on the parameter γ and reduces to the global optimal predictive density with respect to the Kullback-Leibler divergence.
As we have seen in the previous examples, a predictive density that improves on the estimative density with respect to a non-invariant divergence can easily be found within a parametric model including the original one. Unfortunately, such an enlargement is usually difficult to specify and the analytical computation of the predictive density can involve cumbersome calculations. Nevertheless, we can still compare two competing predictive densities by numerically estimating the corresponding expected divergence.
In the next sections we consider two special examples for which it is possible to derive an exact predictive solution by means of predictive pivotal quantities.
The aim is to compare the performance of the estimative solution with that of the pivotal solution by using both the Tsallis and the pseudo-spherical scoring rules. Recall that the lower the score the better the prediction, since we have considered negatively oriented scoring rules. In this comparison we have also taken into account the Brier and the spherical scoring rules, both corresponding to γ = 2.

The normal model with unknown mean and variance
The following example considers a normal distribution with unknown mean and variance. Let x be a random sample from X = (X_1, . . . , X_n), where the X_i's are independent and identically distributed as N(μ, σ^2), i = 1, . . . , n, with μ and σ^2 unknown. We want to predict the value of a future random variable Y ∼ N(μ, σ^2), independent of X. The maximum likelihood estimators for μ and σ^2 are μ̂ = X̄ = (1/n) Σ_{i=1}^n X_i and σ̂^2 = (1/n) Σ_{i=1}^n (X_i − X̄)^2, respectively. A predictive solution giving exact prediction intervals can be obtained by means of the pivotal quantity (Y − μ̂)/σ̂, which has a Student t distribution with n − 1 degrees of freedom. It is asymptotically equivalent to the optimal predictive solution obtained from the Kullback-Leibler divergence; see for instance [37]. More generally, recall that the asymptotically optimal predictive solution with respect to an α-divergence is a Student t distribution with (2n − 2)/(1 − α) degrees of freedom; see [11]. A similar result also holds for the weighted power and the weighted pseudo-spherical divergences.
We now compare the estimative and the pivotal solutions by using the Tsallis and the pseudo-spherical divergences for different values of the parameter γ; the results are given in Tables 1 and 2, respectively. As can be seen, both the Tsallis and the pseudo-spherical scoring rules prefer the pivotal solution for every value of γ.
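A comparison of this kind can be sketched as follows, using the negatively oriented Tsallis score S(p, y) = −γ p(y)^{γ−1} + (γ−1) ∫ p(z)^γ dz with γ = 2 (the Brier case); the sample size, parameter values and replication count are our own choices, so the output is only indicative, not a reproduction of Tables 1 and 2:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

rng = np.random.default_rng(4)

# Negatively oriented Tsallis score: S(p, y) = -g p(y)^(g-1) + (g-1) Int p^g.
def tsallis_score(pdf, y, g):
    norm_term, _ = quad(lambda z: pdf(z) ** g, -np.inf, np.inf)
    return -g * pdf(y) ** (g - 1) + (g - 1) * norm_term

mu, sigma, n, g = 0.0, 1.0, 10, 2.0          # g = 2 gives the Brier score
scores_est, scores_piv = [], []
for _ in range(300):
    x = rng.normal(mu, sigma, size=n)
    y = rng.normal(mu, sigma)                 # future observation
    mu_hat, sig_hat = x.mean(), x.std()       # MLEs (ddof = 0)
    # Estimative density N(mu_hat, sig_hat^2).
    est = lambda z: stats.norm.pdf(z, mu_hat, sig_hat)
    # Pivotal density from (Y - mu_hat)/sig_hat ~ Student t, n - 1 df,
    # following the pivotal quantity used in the text.
    piv = lambda z: stats.t.pdf((z - mu_hat) / sig_hat, df=n - 1) / sig_hat
    scores_est.append(tsallis_score(est, y, g))
    scores_piv.append(tsallis_score(piv, y, g))

print(np.mean(scores_est), np.mean(scores_piv))
```

With enough replications the average score of the pivotal solution falls below that of the estimative one, matching the pattern reported in the tables.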

The exponential model
The following example considers an exponential distribution with unknown mean. Let x be a random sample from X = (X_1, . . . , X_n), where the X_i's are independent and identically distributed as Exp(1/θ), i = 1, . . . , n, with θ unknown. We want to predict the value of a future random variable Y ∼ Exp(1/θ), independent of X. The maximum likelihood estimator for θ is θ̂ = X̄. A predictive density giving exact prediction intervals can be obtained using the pivotal quantity Y/θ̂, which has a Fisher F distribution with (2, 2n) degrees of freedom; see for instance [32].
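The distribution of the pivotal quantity is easy to verify in simulation: with θ̂ = X̄ the ratio Y/θ̂ should match F(2, 2n), which we check here via a tail probability (the parameter values are our own choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Exponential model: Y/theta_hat, with theta_hat = X-bar, follows F(2, 2n),
# so the pivotal predictive density is f_F(y/theta_hat; 2, 2n) / theta_hat.
theta, n, reps = 2.0, 10, 200_000

theta_hat = rng.exponential(theta, size=(reps, n)).mean(axis=1)
y = rng.exponential(theta, size=reps)

# Empirical check of the pivotal law via a tail probability.
q = 3.0
empirical = np.mean(y / theta_hat > q)
theoretical = stats.f.sf(q, 2, 2 * n)
print(empirical, theoretical)
```

The pivotal predictive density used in the simulation study below is obtained by transforming this F density back to the scale of Y.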
We have performed a simulation study comparing the estimative and the pivotal solutions using the Tsallis and the pseudo-spherical divergences, for various values of the parameter γ; the results are given in Tables 3 and 4, respectively. Again, both the Tsallis and the pseudo-spherical divergences prefer the pivotal solution for every value of γ.