On Representations of Divergence Measures and Related Quantities in Exponential Families

Within exponential families, which may consist of multi-parameter and multivariate distributions, a variety of divergence measures, such as the Kullback–Leibler divergence, the Cressie–Read divergence, the Rényi divergence, and the Hellinger metric, can be explicitly expressed in terms of the respective cumulant function and mean value function. Moreover, the same applies to related entropy and affinity measures. We compile representations scattered in the literature and present a unified approach to the derivation in exponential families. As a statistical application, we highlight their use in the construction of confidence regions in a multi-sample setup.


Introduction
There is a broad literature on divergence and distance measures for probability distributions, e.g., on the Kullback-Leibler divergence, the Cressie-Read divergence, the Rényi divergence, and Phi divergences as a general family, as well as on associated measures of entropy and affinity. For definitions and details, we refer to [1]. These measures have been extensively used in statistical inference. Excellent monographs on this topic were provided by Liese and Vajda [2], Vajda [3], Pardo [1], and Liese and Miescke [4].
Within an exponential family as defined in Section 2, which may consist of multiparameter and multivariate distributions, several divergence measures and related quantities are seen to have nice explicit representations in terms of the respective cumulant function and mean value function. These representations are contained in different sources. Our focus is on a unifying presentation of main quantities, while not aiming at an exhaustive account. As an application, we derive confidence regions for the parameters of exponential distributions based on different divergences in a simple multi-sample setup.
For the use of the aforementioned measures of divergence, entropy, and affinity, we refer to the textbooks [1-4] and, exemplarily, to [5-10] for statistical applications, including the construction of test procedures as well as methods based on dual representations of divergences, and to [11] for a classification problem.
Exponential Families
Let P = {P_ϑ : ϑ ∈ Θ} be an exponential family (EF) of distributions on a measurable space (X, B) with µ-densities of the form
f_ϑ(x) = C(ϑ) exp( Σ_{j=1}^k Z_j(ϑ) T_j(x) ) h(x) , x ∈ X , (1)
for ϑ ∈ Θ, where C(ϑ) is a normalizing constant and µ is a σ-finite measure. Usually, µ is either the counting measure on the power set of X (for a family of discrete distributions) or the Lebesgue measure on the Borel sets of X (in the continuous case). Without loss of generality and for a simple notation, we assume that h > 0 (the set {x ∈ X : h(x) = 0} is a null set for all P ∈ P). Let ν denote the σ-finite measure with µ-density h. We assume that representation (1) is minimal in the sense that the number k of summands in the exponent cannot be reduced. This property is equivalent to Z_1, . . . , Z_k being affinely independent mappings and T_1, . . . , T_k being ν-affinely independent mappings; see, e.g., [12] (Cor. 8.1). Here, ν-affine independence means affine independence on the complement of every null set of ν.
To obtain simple formulas for divergence measures in the following section, it is convenient to use the natural parameterization with ν-densities
f*_ζ = C*(ζ) e^{ζᵗT} , ζ = (ζ_1, . . . , ζ_k)ᵗ ∈ Z(Θ) ⊂ Ξ* , (2)
and normalizing constant C*(ζ), where Z = (Z_1, . . . , Z_k)ᵗ denotes the (column) vector of the mappings Z_1, . . . , Z_k, T = (T_1, . . . , T_k)ᵗ denotes the (column) vector of the statistics T_1, . . . , T_k, and Ξ* is the natural parameter space. For simplicity, we assume that P is regular, i.e., we have that Z(Θ) = Ξ* (P is full) and that Ξ* is open; see [13]. In particular, this guarantees that T is minimal sufficient and complete for P; see, e.g., [14] (pp. 25-27). The cumulant function
κ(ζ) = − ln(C*(ζ)) , ζ ∈ Ξ* ,
associated with P is strictly convex and infinitely often differentiable on the convex set Ξ*; see [13] (Theorem 1.13 and Theorem 2.2). It is well known that the Hessian matrix of κ at ζ coincides with the covariance matrix of T under P*_ζ and that it is also equal to the Fisher information matrix I(ζ) at ζ. Moreover, by introducing the mean value function
π(ζ) = E_ζ[T] , ζ ∈ Ξ* , (3)
we have the useful relation
π(ζ) = ∇κ(ζ) , ζ ∈ Ξ* , (4)
where ∇κ denotes the gradient of κ; see [13] (Cor. 2.3). π is a bijective mapping from Ξ* to the interior of the convex support of ν^T, i.e., the closed convex hull of the support of ν^T; see [13] (p. 2 and Theorem 3.6). Finally, note that representation (2) can be rewritten as
f*_ζ = e^{ζᵗT − κ(ζ)} (5)
for ζ ∈ Ξ*.
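The relations π = ∇κ and Hess κ = Cov(T) = I(ζ) can be illustrated numerically for the one-parameter EF of exponential distributions, where ζ = −α, T(x) = x, and κ(ζ) = −ln(−ζ); the following is a minimal finite-difference sketch (parameter values are arbitrary):

```python
import math

# Exponential distribution Exp(alpha) as a one-parameter EF:
# f(x) = alpha * exp(-alpha * x) = exp(zeta * x - kappa(zeta))
# with natural parameter zeta = -alpha < 0, statistic T(x) = x,
# and cumulant function kappa(zeta) = -ln(-zeta).

def kappa(zeta):
    return -math.log(-zeta)

def num_grad(f, z, h=1e-6):
    # central finite difference for the first derivative
    return (f(z + h) - f(z - h)) / (2 * h)

def num_hess(f, z, h=1e-4):
    # central finite difference for the second derivative
    return (f(z + h) - 2 * f(z) + f(z - h)) / h**2

zeta = -2.0          # i.e. alpha = 2
mean_T = 0.5         # E[T] = 1/alpha
var_T = 0.25         # Var(T) = 1/alpha^2 (= Fisher information at zeta)

# pi(zeta) = grad kappa(zeta) reproduces the mean of T ...
assert abs(num_grad(kappa, zeta) - mean_T) < 1e-6
# ... and the Hessian of kappa reproduces the variance of T
assert abs(num_hess(kappa, zeta) - var_T) < 1e-4
```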

Divergence Measures
Divergence measures may be applied, for instance, to quantify the "disparity" of a distribution to some reference distribution or to measure the "distance" between two distributions within some family in a certain sense. If the distributions in the family are dominated by a σ-finite measure, various divergence measures have been introduced by means of the corresponding densities. In parametric statistical inference, they serve to construct statistical tests or confidence regions for underlying parameters; see, e.g., [1].

Definition 1. Let F be a set of distributions on (X, B). A mapping D : F × F → R is called a divergence (or divergence measure) if:
(i) D(P, Q) ≥ 0 for all P, Q ∈ F and D(P, Q) = 0 ⇔ P = Q (positive definiteness).
If, additionally,
(ii) D(P, Q) = D(Q, P) for all P, Q ∈ F (symmetry)
is valid, D is called a distance (or distance measure or semi-metric). If D then moreover meets
(iii) D(P_1, P_2) ≤ D(P_1, Q) + D(Q, P_2) for all P_1, P_2, Q ∈ F (triangle inequality),
D is said to be a metric.
Some important examples are the Kullback-Leibler divergence (KL-divergence)
D_KL(P_1, P_2) = ∫ f_1 ln(f_1/f_2) dµ ,
the Jeffrey distance
D_J(P_1, P_2) = D_KL(P_1, P_2) + D_KL(P_2, P_1)
as a symmetrized version, the Rényi divergence
D_R^q(P_1, P_2) = (1/[q(q − 1)]) ln( ∫ f_1^q f_2^{1−q} dµ ) , q ∈ R \ {0, 1} , (6)
along with the related Bhattacharyya distance D_B(P_1, P_2) = D_R^{1/2}(P_1, P_2)/4, the Cressie-Read divergence (CR-divergence)
D_CR^q(P_1, P_2) = (1/[q(1 − q)]) ( 1 − ∫ f_1^q f_2^{1−q} dµ ) , q ∈ R \ {0, 1} , (7)
which is the same as the Chernoff α-divergence up to a parameter transformation, the related Matusita distance D_M(P_1, P_2) = D_CR^{1/2}(P_1, P_2)/2, and the Hellinger metric
D_H(P_1, P_2) = ( ∫ (√f_1 − √f_2)² dµ )^{1/2} (8)
for distributions P_1, P_2 ∈ F with µ-densities f_1, f_2, provided that the integrals are well-defined and finite. D_KL, D_R^q, and D_CR^q for q ∈ R \ {0, 1} are divergences, and D_J, D_R^{1/2}, D_B, D_CR^{1/2}, and D_M (= D_H²), since they moreover satisfy symmetry, are distances on F × F. D_H is known to be a metric on F × F.
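For two exponential densities, all of the divergences just listed can be evaluated by straightforward numerical integration; the following sketch (grid and cutoff are ad hoc choices) also confirms the stated relations D_M = D_H² and D_B = −ln A_{1/2} for the affinity A_{1/2} = ∫ √(f_1 f_2) dµ:

```python
import math

# Numerical evaluation of the divergences above for two exponential
# densities f1, f2 via a midpoint rule on (0, 50); illustrative sketch only.

a1, a2 = 1.0, 2.0
f1 = lambda x: a1 * math.exp(-a1 * x)
f2 = lambda x: a2 * math.exp(-a2 * x)

dx = 1e-3
xs = [i * dx + dx / 2 for i in range(int(50 / dx))]

def integrate(g):
    return sum(g(x) for x in xs) * dx

D_KL = integrate(lambda x: f1(x) * math.log(f1(x) / f2(x)))
A = lambda q: integrate(lambda x: f1(x)**q * f2(x)**(1 - q))  # affinity A_q
D_R = lambda q: math.log(A(q)) / (q * (q - 1))                # Renyi
D_CR = lambda q: (1 - A(q)) / (q * (1 - q))                   # Cressie-Read
D_H = math.sqrt(integrate(lambda x: (math.sqrt(f1(x)) - math.sqrt(f2(x)))**2))

# closed form of the KL-divergence between Exp(a1) and Exp(a2)
assert abs(D_KL - (math.log(a1 / a2) + a2 / a1 - 1)) < 1e-3
# Matusita distance = squared Hellinger metric, Bhattacharyya = D_R^{1/2}/4
assert abs(D_CR(0.5) / 2 - D_H**2) < 1e-5
assert abs(D_R(0.5) / 4 - (-math.log(A(0.5)))) < 1e-9
```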
In parametric models, it is convenient to use the parameters as arguments and briefly write, e.g., D_KL(ϑ_1, ϑ_2) = D_KL(P_ϑ1, P_ϑ2) if the parameter ϑ ∈ Θ is identifiable, i.e., if the mapping ϑ → P_ϑ is one-to-one on Θ. This property is met for the EF P in Section 2 with minimal canonical representation (5); see, e.g., [13] (Theorem 1.13(iv)). It is known from different sources in the literature that the EF structure admits simple formulas for the above divergence measures in terms of the corresponding cumulant function and/or mean value function. For the KL-divergence, we refer to [15] (Cor. 3.2) and [13] (pp. 174-178), and for the Jeffrey distance also to [16].

Theorem 1. Let P be as in Section 2 with minimal canonical representation (5). Then, for ζ, η ∈ Ξ*, we have
D_KL(ζ, η) = (ζ − η)ᵗ π(ζ) − κ(ζ) + κ(η) (9)
and
D_J(ζ, η) = (ζ − η)ᵗ (π(ζ) − π(η)) .

Proof. By using Formulas (3) and (5), we obtain for ζ, η ∈ Ξ* that
D_KL(ζ, η) = ∫ ln(f*_ζ / f*_η) dP*_ζ = ∫ [ (ζ − η)ᵗ T − κ(ζ) + κ(η) ] dP*_ζ = (ζ − η)ᵗ π(ζ) − κ(ζ) + κ(η) .
From this, the representation of D_J is obvious.
As a consequence of Theorem 1, D_KL and D_J are infinitely often differentiable on Ξ* × Ξ*, and the derivatives are easily obtained by making use of the EF properties. For example, by using Formula (4), we find ∇D_KL(ζ, ·) = π(·) − π(ζ) and that the Hessian matrix of D_KL(ζ, ·) at η is the Fisher information matrix I(η), where ζ ∈ Ξ* is considered to be fixed.
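For the one-parameter EF of exponential distributions (ζ = −α, κ(ζ) = −ln(−ζ), π(ζ) = −1/ζ), the representations in Theorem 1 reduce to elementary expressions; a minimal sketch checking them against the textbook closed form of the KL-divergence between Exp(α_1) and Exp(α_2):

```python
import math

# Theorem 1 for the exponential-distribution EF: KL-divergence and Jeffrey
# distance expressed via the cumulant function kappa and mean value function pi.

kappa = lambda z: -math.log(-z)
pi = lambda z: -1.0 / z

def D_KL(z, e):
    # D_KL(zeta, eta) = (zeta - eta) * pi(zeta) - kappa(zeta) + kappa(eta)
    return (z - e) * pi(z) - kappa(z) + kappa(e)

def D_J(z, e):
    # D_J(zeta, eta) = (zeta - eta) * (pi(zeta) - pi(eta))
    return (z - e) * (pi(z) - pi(e))

a1, a2 = 3.0, 1.5
z, e = -a1, -a2
# known closed form: ln(a1/a2) + a2/a1 - 1
assert abs(D_KL(z, e) - (math.log(a1 / a2) + a2 / a1 - 1)) < 1e-12
# the Jeffrey distance is the symmetrized KL-divergence
assert abs(D_J(z, e) - (D_KL(z, e) + D_KL(e, z))) < 1e-12
assert D_KL(z, z) == 0.0
```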
Moreover, we obtain from Theorem 1 that the reverse KL-divergence D*_KL(ζ, η) = D_KL(η, ζ), ζ, η ∈ Ξ*, is nothing but the Bregman divergence associated with the cumulant function κ; see, e.g., [1,11,17]. As an obvious consequence of Theorem 1, other symmetrizations of the KL-divergence may be expressed in terms of κ and π as well, such as the so-called resistor-average distance
D_RA(ζ, η) = D_KL(ζ, η) D*_KL(ζ, η) / ( D_KL(ζ, η) + D*_KL(ζ, η) ) , ζ ≠ η ,
(cf. [18]) with D_RA(ζ, ζ) = 0, ζ ∈ Ξ*, which is based on the harmonic mean of D_KL and D*_KL, or the distance D_GA obtained by taking the geometric mean of D_KL and D*_KL; see [19].

Remark 1. Formula (9) can be used to derive the test statistic
λ(x) = sup_{ζ∈Ξ_0} f*_ζ(x) / sup_{ζ∈Ξ*} f*_ζ(x)
of the likelihood-ratio test for the test problem
H_0 : ζ ∈ Ξ_0 versus H_1 : ζ ∈ Ξ* \ Ξ_0 ,
where ∅ ≠ Ξ_0 ⊊ Ξ*. If the maximum likelihood estimators (MLEs) ζ̂ = ζ̂(x) and ζ̂_0 = ζ̂_0(x) of ζ in Ξ* and Ξ_0 (based on x) both exist, we have
− ln(λ(x)) = D_KL(ζ̂(x), ζ̂_0(x))
by using that the unrestricted MLE fulfils π(ζ̂) = T; see, e.g., [12] (p. 190) and [13] (Theorem 5.5). In particular, when testing a simple null hypothesis with Ξ_0 = {ζ_0}, the statistic − ln(λ(x)) = D_KL(ζ̂(x), ζ_0) results.

Convenient representations within EFs of the divergences in Formulas (6)-(8) can also be found in the literature; we refer to [2] (Prop. 2.22) for D_R^q, D_H, and D_M, to [20] for D_B, and to [9] for D_R^q. The formulas may all be obtained by computing the quantity
A_q(ζ, η) = ∫ (f*_ζ)^q (f*_η)^{1−q} dν , ζ, η ∈ Ξ* . (12)
For q ∈ (0, 1), we have the following identity (cf. [21]).

Lemma 1. Let P be as in Section 2. Then, for ζ, η ∈ Ξ* and q ∈ (0, 1),
A_q(ζ, η) = e^{κ(qζ + (1−q)η) − qκ(ζ) − (1−q)κ(η)} .
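The identity in Lemma 1 can be sanity-checked for the exponential-distribution EF, where the affinity integral ∫ f_1^q f_2^{1−q} dx has the elementary closed form α_1^q α_2^{1−q}/(qα_1 + (1−q)α_2); a minimal numerical sketch:

```python
import math

# Lemma 1 for the exponential-distribution EF: the affinity computed from the
# cumulant function versus the directly integrated closed form.

kappa = lambda z: -math.log(-z)

def A(q, z, e):
    # A_q(zeta, eta) = exp(kappa(q*zeta + (1-q)*eta) - q*kappa(zeta) - (1-q)*kappa(eta))
    return math.exp(kappa(q * z + (1 - q) * e) - q * kappa(z) - (1 - q) * kappa(e))

a1, a2 = 2.0, 5.0
for q in (0.1, 0.5, 0.9):
    direct = a1**q * a2**(1 - q) / (q * a1 + (1 - q) * a2)
    assert abs(A(q, -a1, -a2) - direct) < 1e-12
```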

Remark 2. For arbitrary divergence measures, several transformations and skewed versions as well as symmetrization methods, such as the Jensen-Shannon symmetrization, are studied in [19]. Applied to the KL-divergence, the skew Jensen-Shannon divergence is introduced as
D_JS^q(P_1, P_2) = q D_KL(P_1, qP_1 + (1 − q)P_2) + (1 − q) D_KL(P_2, qP_1 + (1 − q)P_2)
for P_1, P_2 ∈ P and q ∈ (0, 1), which includes the Jensen-Shannon distance for q = 1/2 (the square root of D_JS^{1/2} even forms a metric). Note that, for ζ, η ∈ Ξ*, the density q f*_ζ + (1 − q) f*_η of the mixture qP*_ζ + (1 − q)P*_η does not belong to P, in general, such that the identity in Theorem 1 for the KL-divergence is not applicable here.
However, from the proof of Lemma 1, it is obvious that
(f*_ζ)^q (f*_η)^{1−q} / ∫ (f*_ζ)^q (f*_η)^{1−q} dν = f*_{qζ + (1−q)η} , ζ, η ∈ Ξ* , q ∈ (0, 1) ,
i.e., the EF P is closed when forming normalized weighted geometric means of the densities. This finding is utilized in [19] to introduce another version of the skew Jensen-Shannon divergence based on the KL-divergence, where the weighted arithmetic mean of the densities is replaced by the normalized weighted geometric mean. The skew geometric Jensen-Shannon divergence thus obtained is given by
D_GJS^q(ζ, η) = q D_KL(ζ, qζ + (1 − q)η) + (1 − q) D_KL(η, qζ + (1 − q)η)
for q ∈ (0, 1). By using Theorem 1, we find
D_GJS^q(ζ, η) = q(1 − q) (ζ − η)ᵗ (π(ζ) − π(η)) + κ(qζ + (1 − q)η) − qκ(ζ) − (1 − q)κ(η)
for ζ, η ∈ Ξ* and q ∈ (0, 1).
In particular, setting q = 1/2 gives the geometric Jensen-Shannon distance
D_GJS(ζ, η) = (1/4) (ζ − η)ᵗ (π(ζ) − π(η)) + κ((ζ + η)/2) − (κ(ζ) + κ(η))/2 , ζ, η ∈ Ξ* .
For more details and properties as well as related divergence measures, we refer to [19,22].
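The closed form of D_GJS^q can be checked against its definition as a weighted sum of KL-divergences, again in the exponential-distribution EF; a minimal sketch:

```python
import math

# Skew geometric Jensen-Shannon divergence for the exponential EF: the closed
# form via kappa and pi must agree with q*D_KL(zeta, m) + (1-q)*D_KL(eta, m),
# where m = q*zeta + (1-q)*eta.

kappa = lambda z: -math.log(-z)
pi = lambda z: -1.0 / z
D_KL = lambda z, e: (z - e) * pi(z) - kappa(z) + kappa(e)

def D_GJS(q, z, e):
    m = q * z + (1 - q) * e
    return (q * (1 - q) * (z - e) * (pi(z) - pi(e))
            + kappa(m) - q * kappa(z) - (1 - q) * kappa(e))

z, e = -1.0, -4.0
for q in (0.25, 0.5, 0.75):
    m = q * z + (1 - q) * e
    assert abs(D_GJS(q, z, e) - (q * D_KL(z, m) + (1 - q) * D_KL(e, m))) < 1e-12
# q = 1/2 gives a symmetric distance
assert abs(D_GJS(0.5, z, e) - D_GJS(0.5, e, z)) < 1e-12
```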
Formulas for D_R^q, D_CR^q, and D_H are readily deduced from Lemma 1.

Theorem 2. Let P be as in Section 2 with minimal canonical representation (5). Then, for ζ, η ∈ Ξ* and q ∈ (0, 1), we have
D_R^q(ζ, η) = [ κ(qζ + (1 − q)η) − qκ(ζ) − (1 − q)κ(η) ] / (q(q − 1)) ,
D_CR^q(ζ, η) = ( 1 − e^{κ(qζ + (1−q)η) − qκ(ζ) − (1−q)κ(η)} ) / (q(1 − q)) ,
and
D_H(ζ, η) = ( 2 − 2 e^{κ((ζ+η)/2) − (κ(ζ)+κ(η))/2} )^{1/2} .

Proof. Since D_R^q(ζ, η) = ln(A_q(ζ, η))/(q(q − 1)), D_CR^q(ζ, η) = (1 − A_q(ζ, η))/(q(1 − q)), and D_H(ζ, η) = (2 − 2A_{1/2}(ζ, η))^{1/2}, the assertions are directly obtained from Lemma 1.
It is well known that
lim_{q→1} D_R^q(P_1, P_2) = D_KL(P_1, P_2) and lim_{q→0} D_R^q(P_1, P_2) = D_KL(P_2, P_1) ,
such that Formula (9) results from the representation of the Rényi divergence in Theorem 2 by sending q to 1. The Sharma-Mittal divergence (see [1]) is closely related to the Rényi divergence as well and, by Theorem 2, a representation in EFs is available.
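The limiting behavior as q → 1 is easy to observe numerically with the EF representation from Theorem 2; a minimal sketch for the exponential-distribution EF:

```python
import math

# D_R^q -> D_KL as q -> 1, illustrated for the exponential EF with
# D_R^q(zeta, eta) = [kappa(q*zeta + (1-q)*eta) - q*kappa(zeta) - (1-q)*kappa(eta)] / (q(q-1)).

kappa = lambda z: -math.log(-z)
pi = lambda z: -1.0 / z
D_KL = lambda z, e: (z - e) * pi(z) - kappa(z) + kappa(e)

def D_R(q, z, e):
    return (kappa(q * z + (1 - q) * e) - q * kappa(z) - (1 - q) * kappa(e)) / (q * (q - 1))

z, e = -2.0, -3.0
gaps = [abs(D_R(q, z, e) - D_KL(z, e)) for q in (0.9, 0.99, 0.999)]
assert gaps[0] > gaps[1] > gaps[2]   # the gap shrinks as q -> 1
assert gaps[2] < 1e-2
```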
Moreover, representations within EFs can be derived for so-called local divergences, e.g., the Cressie-Read local divergence, which results from the CR-divergence by multiplying the integrand with some kernel density function; see [23].

Remark 3. Inspecting the proof of Theorem 2, D_R^q and D_CR^q are seen to be strictly decreasing functions of A_q for q ∈ (0, 1); for q = 1/2, this is also true for D_H. From an inferential point of view, this finding yields that, for fixed q ∈ (0, 1), test statistics and pivot statistics based on these divergence measures will lead to the same test and confidence region, respectively. This is not the case within some divergence families such as D_R^q, q ∈ (0, 1), where different values of q correspond to different tests and confidence regions, in general.
A more general form of the Hellinger metric is given by
D_H,m(P_1, P_2) = ( ∫ |f_1^{1/m} − f_2^{1/m}|^m dµ )^{1/m}
for m ∈ N, where D_H,2 = D_H; see Formula (8). For m ∈ 2N, i.e., if m is even, the binomial theorem then yields
D_H,m^m(P_1, P_2) = Σ_{j=0}^m (m over j) (−1)^{m−j} A_{j/m}(P_1, P_2) ,
and inserting for A_{j/m}, j = 1, . . . , m − 1, according to Lemma 1 along with A_0 ≡ 1 ≡ A_1 gives a formula for D_H,m in terms of the cumulant function of the EF P in Section 2. This representation is stated in [16]. Note that the representation for A_q in Lemma 1 (and thus the formulas for D_R^q and D_CR^q in Theorem 2) is also valid for ζ, η ∈ Ξ* and q ∈ R \ [0, 1] as long as qζ + (1 − q)η ∈ Ξ* is true. This can be used, e.g., to find formulas for D_CR^2 and D_CR^{−1}, which coincide with the Pearson χ²-divergence for ζ, η ∈ Ξ* with 2ζ − η ∈ Ξ* and the reverse Pearson χ²-divergence for ζ, η ∈ Ξ* with 2η − ζ ∈ Ξ*, respectively. Here, the restrictions on the parameters are obsolete if Ξ* = R^k for some k ∈ N, which is the case for the EF of Poisson distributions and for any EF of discrete distributions with finite support such as binomial or multinomial distributions (with n ∈ N fixed). Moreover, quantities similar to A_q such as ∫ f*_ζ (f*_η)^γ dµ for γ > 0 arise in the so-called γ-divergence, for which some representations can also be obtained; see [24] (Section 4).
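The validity of Lemma 1 outside (0, 1) can be illustrated for the exponential-distribution EF, where Ξ* = (−∞, 0) and the restriction qζ + (1 − q)η ∈ Ξ* is an explicit inequality on the rates; a minimal sketch checking q = 2 and q = −1 against directly integrated closed forms:

```python
import math

# Lemma 1 for q outside (0, 1): A_q for the exponential EF is valid as long
# as q*zeta + (1-q)*eta stays in the natural parameter space (-inf, 0).

kappa = lambda z: -math.log(-z)

def A(q, z, e):
    zq = q * z + (1 - q) * e
    assert zq < 0, "q*zeta + (1-q)*eta must lie in the natural parameter space"
    return math.exp(kappa(zq) - q * kappa(z) - (1 - q) * kappa(e))

a1, a2 = 2.0, 3.0
# q = 2: int f1^2 / f2 dx = a1^2 / (a2 * (2*a1 - a2)), needs 2*a1 > a2
assert abs(A(2.0, -a1, -a2) - a1**2 / (a2 * (2 * a1 - a2))) < 1e-12
# q = -1: int f2^2 / f1 dx = a2^2 / (a1 * (2*a2 - a1)), needs 2*a2 > a1
assert abs(A(-1.0, -a1, -a2) - a2**2 / (a1 * (2 * a2 - a1))) < 1e-12
```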

Remark 4. If the assumption of the EF P to be regular is weakened to P being steep, Lemma 1 and Theorem 2 remain true; moreover, the formulas in Theorem 1 are valid for ζ lying in the interior of Ξ*. Steep EFs are full EFs in which the norm of the gradient ∇κ tends to infinity when approaching boundary points of Ξ* that belong to Ξ*. A prominent example is provided by the full EF of inverse normal distributions. For details, see, e.g., [13].

The quantity A_q in Formula (12) is the two-dimensional case of the weighted Matusita affinity
ρ_{w_1,...,w_n}(P_1, . . . , P_n) = ∫ f_1^{w_1} · · · f_n^{w_n} dµ (14)
for distributions P_1, . . . , P_n with µ-densities f_1, . . . , f_n, weights w_1, . . . , w_n > 0 satisfying Σ_{i=1}^n w_i = 1, and n ≥ 2; see [4] (p. 49) and [6]. ρ_{w_1,...,w_n}, in turn, is a generalization of the Matusita affinity [25,26]. Along the lines of the proof of Lemma 1, we find the representation
ρ_{w_1,...,w_n}(ζ^(1), . . . , ζ^(n)) = e^{κ(Σ_{i=1}^n w_i ζ^(i)) − Σ_{i=1}^n w_i κ(ζ^(i))} , ζ^(1), . . . , ζ^(n) ∈ Ξ* ,
for the EF P in Section 2; cf. [27]. In [4], the quantity in Formula (14) is termed Hellinger transform, and a representation within EFs is stated in Example 1.88. ρ_{w_1,...,w_n} can be used, for instance, as the basis of a homogeneity test (with null hypothesis H_0 : ζ^(1) = · · · = ζ^(n)) or in discriminant problems.
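For n exponential distributions, the weighted Matusita affinity also has an elementary closed form, ∏ α_i^{w_i} / Σ w_i α_i, against which the cumulant-function representation can be checked; a minimal sketch (the rates and weights are arbitrary):

```python
import math

# Weighted Matusita affinity / Hellinger transform for n exponential
# distributions: rho_w = exp(kappa(sum w_i zeta_i) - sum w_i kappa(zeta_i)).

kappa = lambda z: -math.log(-z)

def rho(weights, zetas):
    zbar = sum(w * z for w, z in zip(weights, zetas))
    return math.exp(kappa(zbar) - sum(w * kappa(z) for w, z in zip(weights, zetas)))

alphas = [1.0, 2.0, 4.0]
weights = [0.2, 0.3, 0.5]
direct = (math.prod(a**w for a, w in zip(alphas, weights))
          / sum(w * a for w, a in zip(weights, alphas)))
assert abs(rho(weights, [-a for a in alphas]) - direct) < 1e-12
# for identical parameters, the affinity attains its maximal value 1
assert abs(rho(weights, [-2.0, -2.0, -2.0]) - 1.0) < 1e-12
```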
For a representation of an extension of the Jeffrey distance to more than two distributions in an EF, the so-called Toussaint divergence, along with statistical applications, we refer to [8].

Entropy Measures
The literature on entropy measures, their applications, and their relations to divergence measures is broad. We focus on some selected results and state several simple representations of entropy measures within EFs.
Let the EF in Section 2 be given with h ≡ 1, which is the case, e.g., for the one-parameter EFs of geometric distributions and exponential distributions as well as for the two-parameter EF of univariate normal distributions. Formula (5) then yields that
∫ (f*_ζ)^r dµ = ∫ e^{rζᵗT − rκ(ζ)} dµ = e^{κ(rζ) − rκ(ζ)} = J_r(ζ) , say,
for r > 0 and ζ ∈ Ξ* with rζ ∈ Ξ*. Note that the latter condition is not that restrictive, since the natural parameter space of a regular EF is usually a cartesian product of intervals with endpoints 0, −∞, or ∞ and is then closed under multiplication by r > 0.

The Taneja entropy is then obtained as
H_T^r(ζ) = − ∫ (f*_ζ)^r ln(f*_ζ) dµ = J_r(ζ) ( κ(ζ) − ζᵗπ(rζ) )
for r > 0 and ζ ∈ Ξ* with rζ ∈ Ξ*, which includes the Shannon entropy H_S(ζ) = κ(ζ) − ζᵗπ(ζ) by setting r = 1; see [7,28]. Several other important entropy measures are functions of J_r and therefore admit respective representations in terms of the cumulant function of the EF. Two examples are provided by the Rényi entropy and the Havrda-Charvát entropy (or Tsallis entropy), which are given by
H_R^r(ζ) = ln(J_r(ζ))/(1 − r) = ( κ(rζ) − rκ(ζ) )/(1 − r)
and
H_HC^r(ζ) = ( J_r(ζ) − 1 )/(1 − r)
for r > 0, r ≠ 1, and ζ ∈ Ξ* with rζ ∈ Ξ*; for the definitions, see, e.g., [1]. More generally, the Sharma-Mittal entropy is seen to be
H_SM^{r,s}(ζ) = ( J_r(ζ)^{(1−s)/(1−r)} − 1 )/(1 − s)
for ζ ∈ Ξ* with rζ ∈ Ξ*, which yields the representation for H_S as r = s → 1, for H_R^r as s → 1, and for H_HC^r as s → r; see [29].

If the assumption h ≡ 1 is not met, the calculus of the entropies becomes more involved. The Shannon entropy, for instance, is then given by
H_S(ζ) = κ(ζ) − ζᵗπ(ζ) − E_ζ[ln(h)] ,
where the additional additive term E_ζ[ln(h)], as it is the mean of ln(h) under P*_ζ, will also depend on ζ, in general; see, e.g., [17]. Since
∫ (f*_ζ)^r dµ = J_r(ζ) E_{rζ}[h^{r−1}]
for r > 0 and ζ ∈ Ξ* with rζ ∈ Ξ* (cf. [29]), more complicated expressions result for other entropies and require to compute respective moments of h. Of course, we arrive at the same expressions as for the case h ≡ 1 if the entropies are introduced with respect to the dominating measure ν, which is neither a counting nor a Lebesgue measure, in general; see Section 2.
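For the exponential distribution (h ≡ 1, κ(ζ) = −ln(−ζ)), the entropy representations via J_r reduce to well-known closed forms; a minimal sketch:

```python
import math

# Entropies of Exp(alpha) from J_r(zeta) = exp(kappa(r*zeta) - r*kappa(zeta)):
# Shannon, Renyi, and Havrda-Charvat/Tsallis entropies, checked against the
# known closed forms for the exponential distribution.

kappa = lambda z: -math.log(-z)
pi = lambda z: -1.0 / z
J = lambda r, z: math.exp(kappa(r * z) - r * kappa(z))

def H_shannon(z):
    return kappa(z) - z * pi(z)

def H_renyi(r, z):
    return math.log(J(r, z)) / (1 - r)

def H_tsallis(r, z):
    return (J(r, z) - 1) / (1 - r)

alpha, r = 2.0, 3.0
z = -alpha
# Shannon entropy of Exp(alpha): 1 - ln(alpha)
assert abs(H_shannon(z) - (1 - math.log(alpha))) < 1e-12
# Renyi entropy of Exp(alpha): -ln(alpha) + ln(r)/(r - 1)
assert abs(H_renyi(r, z) - (-math.log(alpha) + math.log(r) / (r - 1))) < 1e-12
# Renyi -> Shannon as r -> 1
assert abs(H_renyi(1.000001, z) - H_shannon(z)) < 1e-4
```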
However, in contrast to divergence measures, entropies usually depend on the dominating measure, such that the resulting entropy values of the distributions will be different.
Representations of Rényi and Shannon entropies for various multivariate distributions including several EFs can be found in [30].

Application
As mentioned above, applications of divergence measures in statistical inference have been extensively discussed; see the references in the introduction. As an example, we make use of the representations of the symmetric divergences (distances) in Section 3 to construct confidence regions that are different from the standard rectangles for exponential parameters in a multi-sample situation.
Let n_1, . . . , n_k ∈ N and X_ij, 1 ≤ i ≤ k, 1 ≤ j ≤ n_i, be independent random variables, where X_i1, . . . , X_in_i follow an exponential distribution with (unknown) mean 1/α_i for 1 ≤ i ≤ k. The overall joint distribution P_α, say, has the density function
f_α(x) = exp( − Σ_{i=1}^k α_i T_i(x) + Σ_{i=1}^k n_i ln(α_i) ) , x ∈ (0, ∞)^n , (15)
with the k-dimensional statistic
T(x) = (T_1(x), . . . , T_k(x))ᵗ , T_i(x) = Σ_{j=1}^{n_i} x_ij , 1 ≤ i ≤ k ,
for x = (x_11, . . . , x_1n_1, . . . , x_k1, . . . , x_kn_k) ∈ (0, ∞)^n, the cumulant function
κ(ζ) = − Σ_{i=1}^k n_i ln(−ζ_i) , ζ = −α ∈ (−∞, 0)^k ,
and n = Σ_{i=1}^k n_i. It is easily verified that P = {P_α : α ∈ (0, ∞)^k} forms a regular EF with minimal canonical representation (15). The corresponding mean value function is given by
π(ζ) = (−n_1/ζ_1, . . . , −n_k/ζ_k)ᵗ = (n_1/α_1, . . . , n_k/α_k)ᵗ .
To construct confidence regions for α based on the Jeffrey distance D_J, the resistor-average distance D_RA, the distance D_GA, the Hellinger metric D_H, and the geometric Jensen-Shannon distance D_GJS, we first compute the KL-divergence D_KL and the affinity A_{1/2}. Note that, by Remark 3, constructing a confidence region based on D_H is equivalent to constructing a confidence region based on either A_{1/2}, D_R^{1/2}, or D_CR^{1/2}.
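As a sketch of how a Jeffrey-distance confidence region can be evaluated in this k-sample setting: Theorem 1 applied to the above EF gives the closed form D_J(α, β) = Σ_i n_i (α_i − β_i)²/(α_i β_i). The sample sizes, MLE values, and critical value c below are hypothetical placeholders; in practice, c would be calibrated (e.g., by Monte Carlo simulation of the pivot) to attain the desired confidence level.

```python
# Jeffrey-distance confidence region {alpha : D_J(alpha_hat, alpha) <= c}
# for k independent exponential samples; illustrative sketch with
# hypothetical sample sizes, MLEs, and critical value.

def D_J(n, alpha, beta):
    # D_J(alpha, beta) = sum_i n_i * (alpha_i - beta_i)^2 / (alpha_i * beta_i)
    return sum(ni * (a - b)**2 / (a * b) for ni, a, b in zip(n, alpha, beta))

def in_region(n, alpha_hat, alpha, c):
    # membership test for the confidence region
    return D_J(n, alpha_hat, alpha) <= c

n = [10, 15]             # sample sizes (hypothetical)
alpha_hat = [2.0, 0.5]   # MLEs, i.e., reciprocal sample means (hypothetical)
c = 6.0                  # placeholder critical value

assert in_region(n, alpha_hat, alpha_hat, c)   # the MLE always lies inside
# D_J is a distance, hence symmetric in its arguments
assert D_J(n, alpha_hat, [2.5, 0.5]) == D_J(n, [2.5, 0.5], alpha_hat)
```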
It is found that, over the sample sizes and realizations of α̂ considered, the confidence regions Ĉ_J, Ĉ_RA, Ĉ_GA, Ĉ_H, and Ĉ_GJS are similarly shaped but do not coincide, as the plots for different sample sizes show. In terms of (observed) area, all divergence-based confidence regions perform considerably better than the standard rectangle. This finding, however, depends on the parameter of interest, which here is the vector of exponential means; for the divergence-based confidence regions and the standard rectangle for α itself, the contrary assertion is true. Although the divergence-based confidence regions have a smaller area than the standard rectangle, this is not at the cost of large projection lengths with respect to the m_1- and m_2-axes, which serve as further characteristics for comparing confidence regions. Monte Carlo simulations may moreover be applied to compute the expected area and projection lengths as well as the coverage probabilities of false parameters for a more rigorous comparison of the performance of the confidence regions, which is beyond the scope of this article.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

Abbreviations
CR Cressie-Read
EF exponential family
KL Kullback-Leibler
MLE maximum likelihood estimator