Article

The Case for Shifting the Rényi Entropy

by
Francisco J. Valverde-Albacete
and
Carmen Peláez-Moreno
*,†
Department of Signal Theory and Communications, Universidad Carlos III de Madrid, 28911 Leganés, Spain
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2019, 21(1), 46; https://doi.org/10.3390/e21010046
Submission received: 15 November 2018 / Revised: 1 January 2019 / Accepted: 7 January 2019 / Published: 9 January 2019
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

We introduce a variant of the Rényi entropy definition that aligns it with the well-known Hölder mean: in the new formulation, the r-th order Rényi Entropy is the logarithm of the inverse of the r-th order Hölder mean. This brings about new insights into the relationship of the Rényi entropy to quantities close to it, like the information potential and the partition function of statistical mechanics. We also provide expressions that allow us to calculate the Rényi entropies from the Shannon cross-entropy and the escort probabilities. Finally, we discuss why shifting the Rényi entropy is fruitful in some applications.

1. Introduction

The suggestive framework for the description and assessment of information transmission that Shannon proposed and co-developed [1,2,3] soon took hold of the mind of a generation of scientists and overflowed its initial field of application, despite the cautions of the inceptor himself [4]. He had independently motivated and re-discovered the Boltzmann description for the thermodynamic entropy of a system with many micro-states [5]. His build-up of the concept starting from Hartley’s measure of information using the nowadays well-known axiomatic approach created a sub-science—perhaps a science—out of three papers. For information scientists, it is difficult to shatter the intellectual chains of Shannon’s entropy [5,6,7,8,9,10,11].
After Shannon’s introduction of his re-purposing of the Boltzmann entropy to analyze communication, many generalizations of it were proposed, among which Rényi’s [12], Havrda–Charvát–Tsallis’ [13] and Csiszár’s [14] seem to have found the widest echo. Reviews of information measures from different points of view can be found in [14,15].
In this paper we want to contribute to the characterization and popularization of the Rényi entropy as a proper generalization of the Shannon entropy. Rényi’s suggestion was obtained after noticing some limits to the axiomatic approach [16], later better analyzed by Aczél and Daróczy [17]. His critical realisation was that there are more ways to develop the means of the individual surprisals of a collection of events, and so he resorted to the Kolmogorov–Nagumo theory of the means [18,19,20]. In fact, Kolmogorov had been present in the history of Information Theory from foundational issues [18], to punctual clarification [21], to his own devising of a measure of entropy-complexity. The situation concerning the theory of the means at the time is described in [22].
Rényi was quite aware that entropy is a quantity related to the averages of the information function on a probability distribution: let $X \sim P_X$ be a random variable over a set of outcomes $\mathcal{X} = \{x_i \mid 1 \le i \le n\}$ and pmf $P_X$ defined in terms of the non-null values $p_i = P_X(x_i)$. The Rényi entropy for $X$ is defined in terms of that of $P_X$ as $H_\alpha(X) = H_\alpha(P_X)$ by a case analysis [12]
$$H_\alpha(P_X) = \frac{1}{1-\alpha}\log \sum_{i=1}^n p_i^\alpha, \quad \alpha \ne 1, \qquad \lim_{\alpha \to 1} H_\alpha(P_X) = H(P_X),$$
where $H(P_X) = -\sum_{i=1}^n p_i \log p_i$ is the Shannon entropy [1,2,3]. Similarly, the associated divergence when $Q \sim Q_X$ is substituted by $P \sim P_X$ on a compatible support is defined in terms of their pmfs $q_i = Q_X(x_i)$ and $p_i = P_X(x_i)$, respectively, as $D_\alpha(X \| Q) = D_\alpha(P_X \| Q_X)$ where
$$D_\alpha(P_X \| Q_X) = \frac{1}{\alpha-1}\log \sum_{i=1}^n p_i^\alpha q_i^{1-\alpha}, \quad \alpha \ne 1, \qquad \lim_{\alpha \to 1} D_\alpha(P_X \| Q_X) = D_{KL}(P_X \| Q_X),$$
and $D_{KL}(P_X \| Q_X) = \sum_{i=1}^n p_i \log \frac{p_i}{q_i}$ is the Kullback–Leibler divergence [23].
When trying to find the closed form for a generalization of the Shannon entropy that was compatible with all the Faddeev axioms except that of the linear average, Rényi found that the function $\varphi(x) = x^r$ could be used with the Kolmogorov–Nagumo average to obtain such a new form of entropy. Rather arbitrarily, he decided that the constant should be $\alpha = r + 1$, thus obtaining (1) and (2), but obscuring the relationship of the entropies of order $\alpha$ to the generalized power means.
We propose to shift the parameter in these definitions back to $r = \alpha - 1$, defining the shifted Rényi entropy of order r as the value
$$\tilde H_r(P_X) = -\log M_r(P_X, P_X)$$
and the shifted Rényi divergence of order r as the value
$$\tilde D_r(P_X \| Q_X) = \log M_r(P_X, P_X/Q_X)$$
where $M_r$ is the r-th order weighted generalized power mean or Hölder mean [24]:
$$M_r(\mathbf{w}, \mathbf{x}) = \left( \sum_{i=1}^n \frac{w_i}{\sum_k w_k}\, x_i^r \right)^{1/r}.$$
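To fix ideas, the following is a minimal numerical sketch (ours, not part of the paper) of these definitions in Python; the helper names power_mean, shifted_entropy and shifted_divergence are our own, and the choice of base-2 logarithms is arbitrary.

```python
import numpy as np

def power_mean(w, x, r):
    """Weighted Hölder mean M_r(w, x), with r = 0 and r = +/-inf taken as limits."""
    w = np.asarray(w, float) / np.sum(w)
    x = np.asarray(x, float)
    if r == 0:
        return float(np.prod(x ** w))               # weighted geometric mean
    if np.isinf(r):
        return float(x.max() if r > 0 else x.min())
    return float((w @ x ** r) ** (1.0 / r))

def shifted_entropy(p, r):
    """Shifted Renyi entropy H~_r(P) = -log M_r(P, P), in bits."""
    return -np.log2(power_mean(p, p, r))

def shifted_divergence(p, q, r):
    """Shifted Renyi divergence D~_r(P||Q) = log M_r(P, P/Q), in bits."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log2(power_mean(p, p / q, r))

p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])
print(shifted_entropy(p, 0))        # Shannon entropy: 1.75 bits
print(shifted_divergence(p, q, 0))  # Kullback-Leibler divergence: 0.25 bits
```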
In our opinion, this shifted version may be more fruitful than the original. However, since this could be deemed equally arbitrary, in this paper we argue that this statement of the Rényi entropy greatly clarifies its role vis-a-vis the Hölder means, viz. that most of the properties and special cases of the Rényi entropy arise from similar concerns in the Hölder means. We also provide a brief picture of how the theory surrounding the Rényi entropy would be modified with this change, as well as its relationship to some other magnitudes.

2. Preliminaries

2.1. The Generalized Power Means

Recall that the generalized power or Hölder mean of order r is defined as
$$M_r(\mathbf{w}, \mathbf{x}) = \left( \frac{\sum_{i=1}^n w_i\, x_i^r}{\sum_k w_k} \right)^{1/r} = \left( \sum_{i=1}^n \frac{w_i}{\sum_k w_k}\, x_i^r \right)^{1/r}$$
By formal identification, the generalized power mean is nothing but the weighted f-mean with f ( x ) = x r (see Appendix A). In this paper we use the notation where the weighting vector comes first—rather than the opposite, used in [24]—to align it with formulas in information theory, e.g., divergences and cross entropies. Reference [25] provides proof that this functional mean also has the Properties 1–3 of Proposition A1 and Associativity.
The evolution of $M_r(\mathbf{w}, \mathbf{x})$ with r is also called the Hölder path (of an $\mathbf{x}$). Important cases of this mean, for historical and practical reasons, are obtained by giving values to r (a numerical check follows the list):
  • The (weighted) geometric mean for $r = 0$:
    $M_0(\mathbf{w}, \mathbf{x}) = \lim_{r \to 0} M_r(\mathbf{w}, \mathbf{x}) = \left( \prod_{i=1}^n x_i^{w_i} \right)^{1/\sum_k w_k}$
  • The weighted arithmetic mean for $r = 1$:
    $M_1(\mathbf{w}, \mathbf{x}) = \sum_{i=1}^n \frac{w_i}{\sum_k w_k}\, x_i$
  • The weighted harmonic mean for $r = -1$:
    $M_{-1}(\mathbf{w}, \mathbf{x}) = \left( \sum_{i=1}^n \frac{w_i}{\sum_k w_k}\, x_i^{-1} \right)^{-1} = \frac{\sum_k w_k}{\sum_{i=1}^n w_i / x_i}$
  • The quadratic mean for $r = 2$:
    $M_2(\mathbf{w}, \mathbf{x}) = \left( \sum_{i=1}^n \frac{w_i}{\sum_k w_k}\, x_i^2 \right)^{1/2}$
  • Finally, the max- and min-means appear as the limits:
    $M_{\infty}(\mathbf{w}, \mathbf{x}) = \lim_{r \to \infty} M_r(\mathbf{w}, \mathbf{x}) = \max_i x_i, \qquad M_{-\infty}(\mathbf{w}, \mathbf{x}) = \lim_{r \to -\infty} M_r(\mathbf{w}, \mathbf{x}) = \min_i x_i$
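As a quick sanity check of the cases above, the following sketch (ours) compares the general formula with the closed forms; large finite orders stand in for the limits r → ±∞.

```python
import numpy as np

w = np.array([0.2, 0.3, 0.5])          # already-normalized weights
x = np.array([1.0, 4.0, 9.0])

def M(r):                               # weighted power mean for r != 0
    return (w @ x ** r) ** (1.0 / r)

print(np.isclose(M(1),  w @ x))                          # arithmetic mean
print(np.isclose(M(-1), 1.0 / (w @ (1.0 / x))))          # harmonic mean
print(np.isclose(M(2),  np.sqrt(w @ x ** 2)))            # quadratic mean
print(np.isclose(M(1e-9), np.prod(x ** w)))              # r -> 0: geometric mean
print(np.isclose(M(200),  x.max(), rtol=1e-2))           # r -> +inf: maximum
print(np.isclose(M(-200), x.min(), rtol=1e-2))           # r -> -inf: minimum
```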
They all show the following properties:
Proposition 1
(Properties of the weighted power means). Let $\mathbf{x}, \mathbf{w} \in (0,\infty)^n$ and $r, s \in (-\infty, \infty)$. Then the following formal identities hold, where $\mathbf{x}^r$ and $1/\mathbf{x}$ are to be understood entry-wise:
1. 
(0- and 1-order homogeneity in weights and values) If $k_1, k_2 \in \mathbb{R}_{\ge 0}$, then $M_r(k_1 \cdot \mathbf{w}, k_2 \cdot \mathbf{x}) = k_1^0 \cdot k_2^1 \cdot M_r(\mathbf{w}, \mathbf{x})$.
2. 
(Order factorization) If $r \ne 0 \ne s$, then $M_{rs}(\mathbf{w}, \mathbf{x}) = \left[ M_s(\mathbf{w}, \mathbf{x}^r) \right]^{1/r}$.
3. 
(Reduction to the arithmetic mean) If $r \ne 0$, then $M_r(\mathbf{w}, \mathbf{x}) = [M_1(\mathbf{w}, \mathbf{x}^r)]^{1/r}$.
4. 
(Reduction to the harmonic mean) If $r \ne 0$, then $M_{-r}(\mathbf{w}, \mathbf{x}) = [M_{-1}(\mathbf{w}, \mathbf{x}^r)]^{1/r} = [M_r(\mathbf{w}, 1/\mathbf{x})]^{-1} = [M_1(\mathbf{w}, 1/\mathbf{x}^r)]^{-1/r}$.
5. 
(Monotonicity in r) Furthermore, for $\mathbf{x} \in [0,\infty]^n$ and $r, s \in [-\infty, \infty]$,
$$\min_i x_i = M_{-\infty}(\mathbf{w}, \mathbf{x}) \le M_r(\mathbf{w}, \mathbf{x}) \le M_{\infty}(\mathbf{w}, \mathbf{x}) = \max_i x_i$$
and the mean is a strictly monotonic function of r, that is, $r < s$ implies $M_r(\mathbf{w}, \mathbf{x}) < M_s(\mathbf{w}, \mathbf{x})$, unless:
  • $x_i = k$ is constant, in which case $M_r(\mathbf{w}, \mathbf{x}) = M_s(\mathbf{w}, \mathbf{x}) = k$.
  • $s \le 0$ and some $x_i = 0$, in which case $0 = M_r(\mathbf{w}, \mathbf{x}) = M_s(\mathbf{w}, \mathbf{x})$.
  • $0 \le r$ and some $x_i = \infty$, in which case $M_r(\mathbf{w}, \mathbf{x}) = M_s(\mathbf{w}, \mathbf{x}) = \infty$.
6. 
(Non-null derivative) Call $\tilde q_r(\mathbf{w}, \mathbf{x}) = \left\{ \frac{w_k x_k^r}{\sum_i w_i x_i^r} \right\}_{k=1}^n$. Then
$$\frac{d}{dr} M_r(\mathbf{w}, \mathbf{x}) = \frac{1}{r} \cdot M_r(\mathbf{w}, \mathbf{x}) \ln \frac{M_0(\tilde q_r(\mathbf{w}, \mathbf{x}), \mathbf{x})}{M_r(\mathbf{w}, \mathbf{x})}$$
Proof. 
Property 1 follows from the commutativity, associativity and cancellation of sums and products in $\mathbb{R}_{\ge 0}$. Property 2 follows from identification in the definition, and then Properties 3 and 4 follow from it with $s = 1$ and $s = -1$ respectively. Property 5 and the special cases in it are well known and studied extensively in [24]. We next prove Property 6:
$$\frac{d}{dr} M_r(\mathbf{w}, \mathbf{x}) = \frac{d}{dr}\, e^{\frac{1}{r} \ln \sum_k \frac{w_k}{\sum_i w_i} x_k^r} = M_r(\mathbf{w}, \mathbf{x}) \left[ -\frac{1}{r^2} \ln \left( \sum_k \frac{w_k}{\sum_i w_i} x_k^r \right) + \frac{1}{r} \cdot \frac{\sum_k w_k x_k^r \ln x_k}{\sum_i w_i x_i^r} \right]$$
Note that if we call $\tilde q_r(\mathbf{w}, \mathbf{x}) = \{w'_k\}_{k=1}^n = \left\{ \frac{w_k x_k^r}{\sum_i w_i x_i^r} \right\}_{k=1}^n$, since this is a probability we may rewrite:
$$\sum_k \frac{w_k x_k^r}{\sum_i w_i x_i^r} \ln x_k = \sum_k w'_k \ln x_k = \ln \prod_k x_k^{w'_k} = \ln M_0(\tilde q_r(\mathbf{w}, \mathbf{x}), \mathbf{x})$$
whence
$$\frac{d}{dr} M_r(\mathbf{w}, \mathbf{x}) = M_r(\mathbf{w}, \mathbf{x}) \left[ \frac{1}{r} \ln M_0(\tilde q_r(\mathbf{w}, \mathbf{x}), \mathbf{x}) - \frac{1}{r} \ln M_r(\mathbf{w}, \mathbf{x}) \right] = \frac{1}{r} \cdot M_r(\mathbf{w}, \mathbf{x}) \ln \frac{M_0(\tilde q_r(\mathbf{w}, \mathbf{x}), \mathbf{x})}{M_r(\mathbf{w}, \mathbf{x})}.$$
 □
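Property 6 can be checked numerically; the sketch below (ours) compares the closed-form derivative with a central finite difference for an arbitrary choice of weights, values and order.

```python
import numpy as np

w = np.array([0.1, 0.4, 0.5])
x = np.array([2.0, 3.0, 7.0])
r = 1.7

def M(r):                                   # weighted power mean, r != 0
    return (w @ x ** r) ** (1.0 / r)

q = w * x ** r / (w @ x ** r)               # shifted escort weights q~_r(w, x)
geo = np.prod(x ** q)                       # M_0(q~_r(w, x), x)
analytic = (1.0 / r) * M(r) * np.log(geo / M(r))
numeric = (M(r + 1e-6) - M(r - 1e-6)) / 2e-6
print(analytic, numeric)                    # should agree to about six decimal places
```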
Remark 1.
The distribution $\tilde q_r(\mathbf{w}, \mathbf{x})$ when $\mathbf{w} = \mathbf{x}$ is extremely important in the theory of generalized entropy functions, where it is called a (shifted) escort distribution (of $\mathbf{w}$) [5], and we will prove below that its importance stems, at least partially, from this property.
Remark 2.
Notice that in the case where both conditions at the end of Property 1.5 hold—that is, for $i \ne j$ we have $x_i = 0$ and $x_j = \infty$—then $M_r(\mathbf{w}, \mathbf{x}) = 0$ for $r < 0$ and $M_r(\mathbf{w}, \mathbf{x}) = \infty$ for $r > 0$, whence $M_r(\mathbf{w}, \mathbf{x})$ has a discontinuity at $r = 0$.

2.2. Rényi’s Entropy

Although the following material is fairly standard, it bears directly on our discussion, hence we present it in full.

2.2.1. Probability Spaces, Random Variables and Expectations

Shannon and Rényi set out to find how much information can be gained on average by a single performance of an experiment under different suppositions. For that purpose, let $(\Omega, \Sigma_\Omega, P)$ be a measure space, with $\Omega = \{\omega_1, \dots, \omega_n\}$ the set of outcomes of a random experiment, $\Sigma_\Omega$ the sigma-algebra of this set and measure $P: \Omega \to \mathbb{R}_{\ge 0}$, $P(\omega_i) = p_i$, $1 \le i \le n$. We define the support of P as the set of outcomes with positive probability, $\operatorname{supp}(P) = \{\omega \in \Omega \mid P(\omega) > 0\}$.
Let $(\mathcal{X}, \Sigma_X)$ be a measurable space with $\mathcal{X}$ a domain and $\Sigma_X$ its sigma-algebra, and consider the random variable $X: \Omega \to \mathcal{X}$, that is, a measurable function such that for each set $B \in \Sigma_X$ we have $X^{-1}(B) \in \Sigma_\Omega$. Then P induces a measure $P_X$ on $(\mathcal{X}, \Sigma_X)$ with $P_X(x) = P(X = x) = P(X^{-1}(x))$ for every event $x \in \Sigma_X$, and $P_X(x) = \sum_{\omega_i \in X^{-1}(x)} P(\omega_i)$, whereby $(\mathcal{X}, \Sigma_X, P_X)$ becomes a measure space. We will mostly use $X \sim P_X$ to denote a random variable, instead of its measurable space. The reason for this is that, since information measures are defined on distributions, this is the more fundamental notion for us.
Sometimes co-occurring random variables are defined on the same sample space and sometimes on different ones. Hence, we will need another measure space sharing the same measurable space $(\Omega, \Sigma_\Omega)$ but with a different measure, $(\Omega, \Sigma_\Omega, Q)$ with $Q(\omega_i) = q_i$.
Remark 3.
In modern treatments, discrete distributions are sets or vectors of non-negative numbers adding up to 1, but Rényi developed his theory for “defective distributions”, that is, with $\sum_i P(\omega_i) \le 1$, which are better described as “positive measures”. In fact, we do not need to distinguish whether P is a probability measure in the $(n-1)$-simplex, $P \in \Delta^{n-1} \iff \sum_i P(\omega_i) = 1$, or, in general, a measure $P \in \mathbb{R}_{\ge 0}^n$, and nothing precludes using the latter to define entropies. Since it provides a bit of generalization, this is the road we will take below (see [12,26] on using incomplete distributions with $\sum_i p_i < 1$).

2.2.2. The Approach to Rényi’s Information Functions Based on Postulates

One of the most important applications of the generalized weighted means is to calculate the moments of (non-negative) random variables.
Lemma 1.
Let $X \sim P_X$ be a discrete random variable. Then the r-th moment of X is:
$$E_X\{X^r\} = \sum_i p_i\, x_i^r = M_r(P_X, X)^r$$
This is the concept that Shannon, and afterwards Rényi, used to quantify information by using the distribution as a random variable (Section 3.3).
The postulate approach to characterizing Shannon’s information measures can be found in Appendix B. Analogous generalized postulates lead to Rényi’s information functions but, importantly, he did not restrict himself to normalized measures, that is, those with $\sum_k p_k = 1$.
We follow [27] in stating the Rényi postulates:
  • The amount of information provided by a single random event $x_k$ should be a function of its probability $P_X(x_k) = p_k$, not of its value $x_k = X(\omega_k)$: $I: [0,1] \to \mathcal{I}$, where $\mathcal{I} \subseteq \mathbb{R}$ quantifies information.
  • This amount of information should be additive on independent events:
    $$I(p \cdot q) = I(p) + I(q)$$
  • The amount of information of a binary equiprobable decision is one bit:
    $$I(1/2) = 1$$
  • If different amounts of information occur with different probabilities the total amount of information I is an average of the individual information amounts weighted by the probability of occurrence.
These postulates may lead to the following consequences:
  • Postulates 1 and 2 fix Hartley’s function as the single possible amount of information of a basic event:
    $$I: [0,1] \to [0, \infty], \quad p \mapsto I(p) = -k \log p.$$
  • Postulate 3 fixes the base of the logarithm in Hartley’s formula to 2 by fixing $k = 1$. Any other value $k = 1/\log b$ fixes b as the base for the logarithm and changes the unit.
  • Postulate 4 defines an average amount of information, or entropy, properly speaking. Its basic formula is a form of the Kolmogorov–Nagumo formula or f-mean (A2) applied to information:
    $$H(P_X, \varphi, I) = \varphi^{-1}\left( \sum_{i=1}^n \frac{p_i}{\sum_k p_k}\, \varphi(I(p_i)) \right).$$
    Thus the “entropy” of Information Theory is, by definition, synonymous with “aggregate amount of information”, which departs from its physical etymology, despite the numerous analogies between both concepts.
    It has repeatedly been proven that only two forms of the function φ can actually be used in the Kolmogorov–Nagumo formula that respect the previous postulates [12,26,27]:
    -
    The one generating Shannon’s entropy:
    $\varphi(h) = a h + b$ with $a \ne 0$,
    -
    That originally used by Rényi himself:
    $\varphi(h) = 2^{(1-\alpha)h}$, with $\alpha \in [-\infty, \infty] \setminus \{1\}$.
Taking the first form (11) and plugging it into (10) leads to Shannon’s measure of information, and taking the second form leads to Rényi’s measure of information (1), so we actually have:
Definition 1
([12,26]). The Rényi entropy of order $\alpha$ for a discrete random variable $X \sim P_X$ is
$$H_\alpha(P_X) = \frac{1}{1-\alpha} \log \frac{\sum_{i=1}^n p_i^\alpha}{\sum_k p_k}, \quad \alpha \ne 1, \qquad \lim_{\alpha \to 1} H_\alpha(P_X) = H(P_X) = -\sum_i \frac{p_i}{\sum_k p_k} \log p_i,$$
where the fact that Shannon’s entropy is the Rényi entropy when α 1 in (1) is found by a continuity argument.
Rényi also used the postulate approach to define the following quantity:
Definition 2
([12,26]). The gain of information or divergence (between distributions) when $Y \sim P_Y$, $P_Y(y_i) = q_i$, is substituted by $X \sim P_X$, $P_X(x_i) = p_i$, being absolutely continuous with respect to the latter—that is, with $\operatorname{supp} Y \subseteq \operatorname{supp} X$—as
$$D_\alpha(P_X \| P_Y) = \frac{1}{\alpha - 1} \log \sum_{i=1}^n p_i^\alpha q_i^{1-\alpha}, \quad \alpha \ne 1, \qquad \lim_{\alpha \to 1} D_\alpha(P_X \| P_Y) = D_{KL}(P_X \| P_Y)$$
and the fact that Kullback–Leibler’s divergence emerges as the limit when α 1 follows from the same continuity argument as before. Such special cases will not be stated again, as motivated in Section 3.1.
As in the Shannon entropy case, the rest of the quantities arising in Information Theory can be defined in terms of the generalized entropy and its divergence [23,27].

3. Results

3.1. The Shifted Rényi Entropy and Divergence

To leverage the theory of generalized means to our advantage, we start with a correction to Rényi’s entropy definition: the investigation into the form of the transformation function for the Rényi entropy (12) is arbitrary in the parameter $\alpha$ that it chooses. In fact, we may substitute $r = \alpha - 1$ to obtain the pair of formulas:
$$\varphi(h) = b^{-rh} \qquad \varphi^{-1}(p) = -\frac{1}{r} \log_b p$$
Definition 3.
The shifted Rényi entropy of order $r \ne 0$ for a discrete random variable $X \sim P_X$ is the Kolmogorov–Nagumo $\varphi$-mean (10) of the information function $I^*(p) = -\ln p$ over the probability values:
$$\tilde H_r(P_X) = -\frac{1}{r} \log_b \sum_i \frac{p_i}{\sum_k p_k}\, p_i^r, \qquad \lim_{r \to 0} \tilde H_r(P_X) = H(P_X).$$
Note that:
  • For $r \ne 0$ this is motivated by:
    $$\tilde H_r(P_X) = -\frac{1}{r} \log_b \sum_i \frac{p_i}{\sum_k p_k}\, b^{r \log_b p_i} = -\frac{1}{r} \log_b \sum_i \frac{p_i}{\sum_k p_k} \left( b^{\log_b p_i} \right)^r = -\frac{1}{r} \log_b \sum_i \frac{p_i}{\sum_k p_k}\, p_i^r.$$
  • For $r = 0$ we can use the linear mean $\varphi(h) = a h + b$ with inverse $\varphi^{-1}(p) = \frac{1}{a}(p - b)$, as per the standard definition, leading to Shannon’s entropy.
Remark 4.
The base of the logarithm is not important as long as it is maintained in $\varphi(\cdot)$, $I^*(\cdot)$ and their inverses, hence we leave it implicit. For some calculations—e.g., the derivative below—we explicitly use a particular base—e.g., $\log_e x = \ln x$.
The shifted divergence can be obtained in the same manner—the way that Rényi followed himself [26].
Definition 4.
The shifted Rényi divergence between two distributions $P_X(x_i) = p_i$ and $Q_X(x_i) = q_i$ with compatible support is the following quantity:
$$\tilde D_r(P_X \| Q_X) = \frac{1}{r} \log \sum_i \frac{p_i}{\sum_k p_k} \left( \frac{p_i}{q_i} \right)^r, \qquad \lim_{r \to 0} \tilde D_r(P_X \| Q_X) = D_{KL}(P_X \| Q_X).$$
Of course, the values of the Rényi entropy and divergence are not modified by this shifting.
Lemma 2.
The Rényi entropy and the shifted Rényi entropy produce the same value, and similarly for their respective divergences.
Proof. 
If we consider a new parameter $r = \alpha - 1$ we have:
$$H_\alpha(P_X) = \frac{1}{1-\alpha} \log \frac{\sum_{i=1}^n p_i^\alpha}{\sum_k p_k} = -\frac{1}{r} \log \frac{\sum_{i=1}^n p_i^{r+1}}{\sum_k p_k} = -\frac{1}{r} \log \sum_{i=1}^n \frac{p_i}{\sum_k p_k}\, p_i^r = \tilde H_r(P_X),$$
and similarly for the divergence:
$$D_\alpha(P_X \| Q_X) = \frac{1}{\alpha-1} \log \frac{\sum_{i=1}^n p_i^\alpha q_i^{1-\alpha}}{\sum_k p_k} = \frac{1}{r} \log \frac{\sum_{i=1}^n p_i^{r+1} q_i^{-r}}{\sum_k p_k} = \frac{1}{r} \log \sum_{i=1}^n \frac{p_i}{\sum_k p_k} \left( \frac{p_i}{q_i} \right)^r = \tilde D_r(P_X \| Q_X).$$
The Shannon entropy and Kullback–Leibler divergences are clearly the limit cases. □
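A numerical illustration of Lemma 2 (ours, not part of the paper): the standard entropy of order α and the shifted entropy of order r = α − 1 coincide in value.

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
alpha = 2.5
r = alpha - 1.0

H_alpha = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)   # standard Renyi entropy (bits)
H_shift = -np.log2(np.sum(p * p ** r)) / r              # shifted Renyi entropy (bits)
print(np.isclose(H_alpha, H_shift))                     # True: same value, shifted order
```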

3.1.1. The Case for Shifting the Rényi Entropy

So what could be the reason for the shifting? First and foremost, it is a re-alignment with the more basic concept of generalized mean.
Proposition 2.
The Shifted Rényi Entropy and Divergence are logarithmic transformations of the generalized power means:
$$\tilde H_r(P_X) = \log \frac{1}{M_r(P_X, P_X)}$$
$$\tilde D_r(P_X \| Q_X) = \log M_r(P_X, P_X / Q_X)$$
Proof. 
Simple identification of (15) and (16) in the definition of the power means (3). □
Table 1 lists the shifting of these entropies and their relation both to the means and to the original Rényi definition in the parameter α .
Remark 5.
It is no longer necessary to make the distinction between the case $r = 0$—Shannon’s—and the rest, since the means are already defined with this caveat. This actually downplays the peculiar features of Shannon’s entropy, arising from the geometric mean when $\sum_i p_i = 1$:
$$\tilde H_0(P_X) = \log \frac{1}{M_0(P_X, P_X)} = -\log \prod_i p_i^{p_i} = -\sum_i p_i \log p_i$$
However, the prominence of the Shannon entropy will emerge once again in the context of rewriting entropies in terms of each other (Section 3.2).
Since the means are properly defined for all $r \in [-\infty, \infty]$, $\tilde H_r(P_X)$ is likewise properly defined for all $r \in [-\infty, \infty]$—and therefore so is the non-shifted version with $\alpha = r + 1$. This is probably the single strongest argument in favour of the shifting and motivates the following definition.
Definition 5
(The Rényi information spectrum). For fixed $P_X$ we will refer to $\tilde H_r(P_X)$ as its Rényi information spectrum over the parameter r.
Also, some relationships between magnitudes are clarified in the shifted enunciation with respect to the traditional one, for instance, the relation between the Rényi entropy and divergence.
Lemma 3.
The shifted formulation makes the entropy the self-divergence with a change of sign in the order:
$$\tilde H_r(P_X) = \tilde D_{-r}(P_X \| P_X \cdot P_X).$$
Proof. 
$$\tilde D_{-r}(P_X \| P_X \cdot P_X) = -\frac{1}{r} \log \sum_i p_i \left( \frac{p_i}{p_i\, p_i} \right)^{-r} = -\frac{1}{r} \log \sum_i p_i \left( \frac{1}{p_i} \right)^{-r} = -\frac{1}{r} \log \sum_i p_i\, p_i^{r} = \tilde H_r(P_X). \; \square$$
Recall that in the common formulation, $H_\alpha(P_X) = D_{2-\alpha}(P_X \| P_X \cdot P_X)$ [23].
Another simplification is the fact that the properties of the Rényi entropy and divergence stem from those of the means, inversion and logarithm.
Proposition 3
(Properties of the Rényi spectrum of $P_X$). Let $r, s \in \mathbb{R} \cup \{\pm\infty\}$, and $P_X, Q_X \in \Delta^{n-1}$, where $\Delta^{n-1}$ is the simplex over the support $\operatorname{supp} X$, with cardinal $|\operatorname{supp} X| = n$. Then,
1. 
(Monotonicity) The Rényi entropy is a non-increasing function of the order r:
$$s \le r \implies \tilde H_s(P_X) \ge \tilde H_r(P_X)$$
2. 
(Boundedness) The Rényi spectrum $\tilde H_r(P_X)$ is bounded by the limits
$$\tilde H_\infty(P_X) \le \tilde H_r(P_X) \le \tilde H_{-\infty}(P_X)$$
3. 
The entropy of the uniform pmf $U_X$ is constant over r:
$$\forall r \in \mathbb{R} \cup \{\pm\infty\}, \quad \tilde H_r(U_X) = \log n$$
4. 
The Hartley entropy ($r = -1$) is constant over the distribution simplex:
$$\tilde H_{-1}(P_X) = \log n$$
5. 
(Divergence from uniformity) The divergence of any distribution $P_X$ from the uniform $U_X$ can be written in terms of the entropies as:
$$\tilde D_r(P_X \| U_X) = \tilde H_r(U_X) - \tilde H_r(P_X).$$
6. 
(Derivative of the shifted entropy) The derivative in r of Rényi’s r-th order entropy is
$$\frac{d}{dr} \tilde H_r(P_X) = -\frac{1}{r^2}\, \tilde D_0(\tilde q_r(P_X) \| P_X) = -\frac{1}{r} \log \frac{M_0(\tilde q_r(P_X), P_X)}{M_r(P_X, P_X)},$$
where $\tilde q_r(P_X) = \left\{ \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} \right\}_{i=1}^n$ for $r \in \mathbb{R} \cup \{\pm\infty\}$ are the shifted escort distributions.
7. 
(Relationship with the moments of $P_X$) The shifted Rényi entropy of order r is the logarithm of the inverse of the r-th root of the r-th moment of $P_X$:
$$\tilde H_r(P_X) = -\frac{1}{r} \log E_{P_X}\{P_X^r\} = \log \frac{1}{\sqrt[r]{E_{P_X}\{P_X^r\}}}$$
Proof. 
Note that the properties used in the following are referred to by the proposition in which they are stated. Property 1 issues from Property 1.2 and Hartley’s information function being order-inverting, or antitone. Since the free parameter r is allowed to take values in $[-\infty, \infty]$, Property 2 follows directly from Property 1. With respect to Property 3, we have, from $U_X = 1/|\operatorname{supp} X| = 1/n$ and Property A1.3:
$$\tilde H_r(\tfrac{1}{n}) = -\log M_r(\tfrac{1}{n}, \tfrac{1}{n}) = -\log \tfrac{1}{n} = \log n.$$
For Property 4 we have:
$$\tilde H_{-1}(P_X) = -\log \Big( \sum_i p_i \cdot p_i^{-1} \Big)^{-1} = -\log (n)^{-1} = \log n,$$
while for Property 5,
$$\tilde D_r(P_X \| U_X) = \frac{1}{r} \log \sum_i p_i \Big( \frac{p_i}{u_i} \Big)^r = \frac{1}{r} \log \sum_i p_i \Big( \frac{p_i}{1/n} \Big)^r = \frac{1}{r} \log \Big( n^r \sum_i p_i\, p_i^r \Big) = \log n + \log \Big( \sum_i p_i\, p_i^r \Big)^{1/r} = \tilde H_r(U_X) - \tilde H_r(P_X).$$
For the third term of Property 6, we have from (17), with natural logarithm and with $P_X$ in the role of both $\mathbf{w}$ and $\mathbf{x}$,
$$\frac{d}{dr} \tilde H_r(P_X) = -\frac{d}{dr} \ln M_r(P_X, P_X) = -\frac{\frac{d}{dr} M_r(P_X, P_X)}{M_r(P_X, P_X)},$$
whence the property follows directly from (5). For the first identity, though, we have:
$$\frac{d \tilde H_r(P_X)}{dr} = \frac{d}{dr} \Big[ -\frac{1}{r} \ln \sum_i p_i\, p_i^r \Big] = \frac{1}{r^2} \ln \sum_i p_i\, p_i^r - \frac{1}{r} \frac{\sum_i p_i\, p_i^r \ln p_i}{\sum_i p_i\, p_i^r}.$$
If we introduce the abbreviation
$$\tilde q_r(P_X) = \tilde q_r(P_X, P_X) = \{ \tilde q_r(P_X)_i \}_{i=1}^n = \Big\{ \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} \Big\}_{i=1}^n,$$
noticing that $\ln \sum_k p_k\, p_k^r = \sum_i \tilde q_r(P_X)_i \ln \big( \sum_k p_k\, p_k^r \big)$, since $\tilde q_r(P_X)$ is a distribution, and factoring out $1/r^2$:
$$\frac{d \tilde H_r(P_X)}{dr} = \frac{1}{r^2} \Big[ \sum_i \tilde q_r(P_X)_i \ln \big( \textstyle\sum_k p_k\, p_k^r \big) - r \sum_i \tilde q_r(P_X)_i \ln p_i \pm \sum_i \tilde q_r(P_X)_i \ln p_i \Big]$$
$$= \frac{1}{r^2} \Big[ \sum_i \tilde q_r(P_X)_i \ln \big( \textstyle\sum_k p_k\, p_k^r \big) - (r+1) \sum_i \tilde q_r(P_X)_i \ln p_i + \sum_i \tilde q_r(P_X)_i \ln p_i \Big]$$
$$= -\frac{1}{r^2} \Big[ \sum_i \tilde q_r(P_X)_i \ln \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} - \sum_i \tilde q_r(P_X)_i \ln p_i \Big] = -\frac{1}{r^2} \sum_i \tilde q_r(P_X)_i \ln \frac{\tilde q_r(P_X)_i}{p_i},$$
and recalling the definition of the shifted divergence we have the result.
For Property 7, note in particular that the probability of any event is a function of the random variable, $P_X(x_i) = p_i$, whose r-th moment is
$$E_X\{P_X^r\} = \sum_i p_i\, p_i^r = M_r(P_X, P_X)^r.$$
The result follows by applying the definition of the shifted entropy in terms of the means. □
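The following sketch (ours, not part of the paper) checks some of these properties—monotonicity in r, the Hartley entropy, and the divergence from uniformity—on an arbitrary pmf.

```python
import numpy as np

p = np.array([0.45, 0.25, 0.2, 0.1]); n = len(p)

def H(r):                                       # shifted Renyi entropy (bits), r != 0
    return -np.log2(np.sum(p * p ** r)) / r

def D_from_uniform(r):                          # D~_r(P || U), r != 0
    u = np.full(n, 1.0 / n)
    return np.log2(np.sum(p * (p / u) ** r)) / r

rs = np.array([-4.0, -1.0, -0.5, 0.5, 1.0, 4.0])
Hs = np.array([H(r) for r in rs])
print(np.all(np.diff(Hs) <= 1e-12))                          # non-increasing in r
print(np.isclose(H(-1.0), np.log2(n)))                       # Hartley entropy: log n
print(np.isclose(D_from_uniform(2.0), np.log2(n) - H(2.0)))  # divergence from uniformity
```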
Remark 6.
In the preceding proof we have introduced the notion of the shifted escort probabilities $\tilde q_r(P_X)$, acting in the shifted Rényi entropies as the analogues of the escort probabilities in the standard definition (see [5] and Section 2.1). This notion of shifted escort probabilities is the one requested by Property 1.6 by instantiation of variables, $\tilde q_r(P_X) = \tilde q_r(P_X, P_X)$. But notice also that $(\tilde q_r(P_X))_i = \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} = \frac{p_i^\alpha}{\sum_k p_k^\alpha} = (q_\alpha(P_X))_i$ is just the shifting of the traditional escort probabilities [5].
Note that for $P_X \in \mathbb{R}_{\ge 0}^n$ (a numerical sketch follows the list):
  • $\tilde q_0(P_X)$ is the normalization of $P_X$. In fact, $P_X \in \Delta^{n-1}$ if and only if $\tilde q_0(P_X) = P_X$.
  • $\tilde q_{-1}(P_X)(x_i) = |\operatorname{supp} P_X|^{-1}$ if $x_i \in \operatorname{supp} P_X$ and 0 otherwise.
  • Furthermore, if $P_X$ has P maxima (M minima), then $\tilde q_\infty(P_X)$ ($\tilde q_{-\infty}(P_X)$) is an everywhere-null distribution except at the indices where the maxima (minima) of $P_X$ are situated:
    $$\tilde q_\infty(P_X)(x_i) = \begin{cases} \frac{1}{P} & x_i \in \arg\max P_X \\ 0 & \text{otherwise} \end{cases} \qquad \tilde q_{-\infty}(P_X)(x_i) = \begin{cases} \frac{1}{M} & x_i \in \arg\min P_X \\ 0 & \text{otherwise} \end{cases}$$
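Here is the announced sketch (ours, not part of the paper) of the shifted escort distributions at the special orders; large finite orders stand in for r → ±∞.

```python
import numpy as np

def escort(p, r):
    """Shifted escort distribution q~_r(P): proportional to p_i * p_i^r."""
    w = p * p ** r
    return w / w.sum()

p = np.array([0.5, 0.2, 0.2, 0.1])
print(escort(p, 0))       # r = 0: the (re)normalized distribution itself
print(escort(p, -1))      # r = -1: uniform over the support
print(escort(p, 50))      # large r: mass concentrates on the argmax of p
print(escort(p, -50))     # large negative r: mass concentrates on the argmin of p
```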
Another important point made clear by this relation to the means is the fact that all positive measures have a Rényi spectrum: although so far we have conceived the origin of information to be a probability function, nothing precludes applying the same procedure to non-negative, non-normalized quantities with $\sum_x f_X(x) \ne 1$, e.g., masses, sums, amounts of energy, etc.
It is well-understood that in this situation Rényi’s entropy has to be slightly modified to accept this procedure. The reason for this is Property 1.1 of the means: generalized means are 1-homogeneous in the numbers being averaged, but 0-homogeneous in the weights. In the Rényi spectrum both these roles are fulfilled by the pmf. Again the escort distributions allow us to analyze the measure:
Lemma 4.
Consider a random variable $X \sim M_X$ with non-normalized measure $M_X(x_i) = m_i$ such that $\sum_i m_i = M \ne 1$. Then the normalized probability measure $\tilde q_0(M_X) = \{ m_i / \sum_i m_i \}_{i=1}^n$ provides a Rényi spectrum that is displaced relative to that of the measure as:
$$\tilde H_r(M_X) = \tilde H_r(\tilde q_0(M_X)) - \log M.$$
Proof. 
$$\tilde H_r(\tilde q_0(M_X)) = -\log M_r(\tilde q_0(M_X), \tilde q_0(M_X)) = -\frac{1}{r} \log \sum_i \frac{m_i}{M} \Big( \frac{m_i}{M} \Big)^r = \log M - \frac{1}{r} \log \sum_i \frac{m_i}{M}\, m_i^r = \log M - \log M_r(M_X, M_X) = \log M + \tilde H_r(M_X)$$
 □
Remark 7.
When $M \ge 1$, $-\log M \le 0$ with equality for $M = 1$; if $M < 1$ then $-\log M > 0$. This last was the original setting Rényi envisioned and catered for in the definitions, but nothing precludes the extension provided by Lemma 4. In this paper, although $P_X$ can be interpreted as a pmf in the formulas, it can also be interpreted as a mass function as in the Lemma above. However, the escort probabilities are always pmfs.
Example 1.
This example uses the UCB admission data from [28]. We analyze the distribution of admissions with count vector $M_X = [933\; 585\; 918\; 792\; 584\; 714]$ and probabilities $\tilde q_0(M_X) \approx [0.21\; 0.13\; 0.20\; 0.17\; 0.13\; 0.16]$. The names of the departments are not important, due to the symmetry property. Figure 1a shows the Rényi spectrum extrapolated from a sample of some orders, which includes $r \in \{-\infty, -1, 0, 1, \infty\}$.
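The following sketch (ours, not from the paper) reproduces the kind of computation behind Figure 1a with these counts, also checking the displacement of Lemma 4; the finite orders are tabulated and the limits r = ±∞ are computed separately.

```python
import numpy as np

counts = np.array([933, 585, 918, 792, 584, 714], float)   # UCB admissions by department
M = counts.sum()                                            # total mass, M != 1
p = counts / M                                              # normalized distribution q~_0(M_X)

def H(v, r):
    """Shifted Renyi entropy (bits) of a positive measure v, used as weights and values."""
    w = v / v.sum()
    if r == 0:
        return -np.sum(w * np.log2(v))                      # geometric-mean case
    return -np.log2(np.sum(w * v ** r)) / r

for r in (-1, 0, 1):
    print(r, round(H(p, r), 4), round(H(counts, r) + np.log2(M), 4))   # Lemma 4: equal columns
print(-np.log2(p.max()), -np.log2(p.min()))                 # the r = +inf and r = -inf limits
```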

3.1.2. Shifting Other Concepts Related to the Entropies

Other entropy-related concepts may also be shifted. In particular, the cross-entropy has an almost direct translation.
Definition 6.
The shifted Rényi cross-entropy of order $r \in [-\infty, \infty]$ between two distributions $P_X(x_i) = p_i$ and $Q_X(x_i) = q_i$ with compatible support is
$$\tilde X_r(P_X \| Q_X) = \log \frac{1}{M_r(P_X, Q_X)}$$
Note that the case-based definition is redundant: the Shannon cross-entropy appears as $\tilde X_0(P_X \| Q_X) = -\log M_0(P_X, Q_X) = -\sum_i \frac{p_i}{\sum_k p_k} \log q_i$, while for $r \ne 0$ we have $\tilde X_r(P_X \| Q_X) = -\frac{1}{r} \log \sum_i \frac{p_i}{\sum_k p_k}\, q_i^r$, by virtue of the definition of the means again.
Perhaps the most fundamental magnitude is the cross-entropy since it is easy to see that:
Lemma 5.
In the shifted formulation both the entropy and the divergence are functions of the cross-entropy:
$$\tilde H_r(P_X) = \tilde X_r(P_X \| P_X) \qquad \tilde D_r(P_X \| Q_X) = \tilde X_{-r}(P_X \| Q_X / P_X)$$
Proof. 
The first equality is by comparison of definitions, while the second comes from:
$$\tilde D_r(P_X \| Q_X) = \frac{1}{r} \log \sum_i p_i \Big( \frac{p_i}{q_i} \Big)^r = -\frac{1}{-r} \log \sum_i p_i \Big( \frac{q_i}{p_i} \Big)^{-r} = \tilde X_{-r}(P_X \| Q_X / P_X)$$
 □
Note that if we accept the standard criterion used for Shannon’s entropy, $0 \times \log \frac{1}{0} = 0 \times \infty = 0$, then the previous expression for the cross-entropy is defined even if $p_i = 0$.
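The two identities of Lemma 5 are easy to check numerically; the following sketch (ours, not part of the paper) does so for arbitrary pmfs and an arbitrary order.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])
r = 1.5

def X(p, v, r):
    """Shifted cross-entropy X~_r(P||V) = -log M_r(P, V) (bits), r != 0."""
    w = p / p.sum()
    return -np.log2(np.sum(w * v ** r)) / r

H_r = -np.log2(np.sum(p * p ** r)) / r                  # shifted entropy
D_r = np.log2(np.sum(p * (p / q) ** r)) / r             # shifted divergence
print(np.isclose(H_r, X(p, p, r)))                      # H~_r(P) = X~_r(P||P)
print(np.isclose(D_r, X(p, q / p, -r)))                 # D~_r(P||Q) = X~_{-r}(P||Q/P)
```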

3.2. Writing Rényi Entropies in Terms of Each Other

Not every expression valid in the case of Shannon’s entropies can be translated into Rényi entropies: recall from the properties of the Kullback–Leibler divergence its expression in terms of the Shannon entropy and cross-entropy. We have:
$$\tilde D_0(P_X \| Q_X) = -\tilde H_0(P_X) + \tilde X_0(P_X \| Q_X),$$
but, in general, $\tilde D_r(P_X \| Q_X) \ne -\tilde H_r(P_X) + \tilde X_r(P_X \| Q_X)$.
However, the shifting sometimes helps in obtaining “derived expressions”. In particular, the (shifted) escort probabilities are ubiquitous in expressions dealing with Rényi entropies and divergences, and allow us to discover the deep relationships between their values for different r’s.
Lemma 6.
Let $r \in \mathbb{R} \cup \{\pm\infty\}$ and $P_X \in \Delta^{n-1}$, where $\Delta^{n-1}$ is the simplex over the support $\operatorname{supp} X$. Then,
$$\tilde H_r(P_X) = \frac{1}{r}\, \tilde D_0(\tilde q_r(P_X) \| P_X) + \tilde X_0(\tilde q_r(P_X) \| P_X)$$
$$\tilde H_r(P_X) = -\frac{1}{r}\, \tilde H_0(\tilde q_r(P_X)) + \frac{r+1}{r}\, \tilde X_0(\tilde q_r(P_X) \| P_X)$$
Proof. 
First, from the definitions of the shifted Rényi entropy and cross-entropy and Property 3.6 we have:
$$\frac{1}{r^2}\, \tilde D_0(\tilde q_r(P_X) \| P_X) = \frac{1}{r} \Big[ \tilde H_r(P_X) - \tilde X_0(\tilde q_r(P_X) \| P_X) \Big].$$
Solving for $\tilde H_r(P_X)$ obtains the first result. By applying (32) to $\tilde q_r(P_X)$ and $P_X$ we have:
$$\tilde D_0(\tilde q_r(P_X) \| P_X) = -\tilde H_0(\tilde q_r(P_X)) + \tilde X_0(\tilde q_r(P_X) \| P_X),$$
and putting this into (33) obtains the second result.
Another way to prove it is from the definition of
$$\tilde H_0(\tilde q_r(P_X)) = -\sum_i \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} \log \frac{p_i\, p_i^r}{\sum_k p_k\, p_k^r} = \sum_i \tilde q_r(P_X)_i \log \Big( \sum_k p_k\, p_k^r \Big) - \sum_i \tilde q_r(P_X)_i \log p_i^{r+1} = \log \Big( \sum_k p_k\, p_k^r \Big) - (r+1) \sum_i \tilde q_r(P_X)_i \log p_i = -r\, \tilde H_r(P_X) + (r+1)\, \tilde X_0(\tilde q_r(P_X) \| P_X)$$
and reorganize to obtain (34). Again inserting the definition of the Shannon divergence in terms of the cross-entropy (35), into (34) and reorganizing we get (33). □
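Equation (34) can also be verified numerically, as in the following sketch (ours, not part of the paper):

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
r = 2.0

q = p * p ** r / np.sum(p * p ** r)                 # shifted escort distribution q~_r(P)
H_r = -np.log2(np.sum(p * p ** r)) / r              # shifted Renyi entropy H~_r(P)
H0_q = -np.sum(q * np.log2(q))                      # Shannon entropy of the escort
X0_qp = -np.sum(q * np.log2(p))                     # Shannon cross-entropy X~_0(q~_r || P)
print(np.isclose(H_r, -H0_q / r + (r + 1) / r * X0_qp))   # Equation (34)
```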
On other occasions, using the shifted version does not help in simplifying expressions. For instance, skew symmetry reads in the standard case as $D_\alpha(P_X \| Q_X) = \frac{\alpha}{1-\alpha} D_{1-\alpha}(Q_X \| P_X)$, for any $0 < \alpha < 1$ ([23], Proposition 2). In the shifted case we have the slightly more general expression for $r \ne 0$:
Lemma 7.
When $Q_X$ is substituted by $P_X$, both probability distributions on a compatible support, then:
$$\tilde D_r(P_X \| Q_X) = -\frac{r+1}{r} \cdot \tilde D_{-(r+1)}(Q_X \| P_X)$$
Proof. 
By easy rewriting of the divergence $\tilde D_{-(r+1)}(Q_X \| P_X)$. □
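A numerical check of this skew symmetry (ours, not part of the paper):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])

def D(p, q, r):
    """Shifted Renyi divergence D~_r(P||Q) in bits, r != 0."""
    return np.log2(np.sum(p * (p / q) ** r)) / r

r = 0.7
print(np.isclose(D(p, q, r), -(r + 1) / r * D(q, p, -(r + 1))))   # skew symmetry
```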

3.3. Quantities Around the Shifted Rényi Entropy

On the one hand, the existence of Hartley’s information function (9) ties up information values to probabilities and vice-versa. On the other, Rényi’s averaging function and its inverse (14) also transform probabilities into information values and vice-versa. In this section we explore the relationship between certain quantities generated by these functions, probabilities and entropies.

3.3.1. The Equivalent Probability Function

Recall that, due to Hartley’s function, from every average measure of information an equivalent average probability emerges. To see this in a more general light, first define the extension of Hartley’s information function to non-negative numbers, $I^*(\cdot): [0, \infty] \to [-\infty, \infty]$, as $I^*(p) = -\ln p$. This is one-to-one from $[0, \infty]$ onto $[-\infty, \infty]$, with inverse $I^{*-1}(h) = e^{-h}$ for $h \in [-\infty, \infty]$.
Definition 7.
Let $X \sim P_X$ with Rényi spectrum $\tilde H_r(P_X)$. Then the equivalent probability function $\tilde P_r(P_X)$ is the Hartley inverse of $\tilde H_r(P_X)$ over all values of $r \in [-\infty, \infty]$:
$$\tilde P_r(P_X) = I^{*-1}(\tilde H_r(P_X))$$
Remark 8.
The equivalent probability function for a fixed probability distribution $P_X$ is a function of the parameter r—like the Rényi entropy—whose values are probabilities, in the sense that it produces values in $[0,1]$, but it is not a probability distribution.
Analogously, due to the extended definition of the Hartley information, this mechanism, when operating on a mass measure $M_X$, generates an equivalent mass function $\tilde P_r(M_X)$, which is not a mass measure.
Lemma 8.
Let $X \sim P_X$. The equivalent probability function $\tilde P_r(P_X)$ is the Hölder path of the probability function $P_X$ (as a set of numbers) using the same probability function as weights:
$$\tilde P_r(P_X) = M_r(P_X, P_X)$$
Proof. 
From the definition, using b as the base chosen for the logarithm in the information function:
$$\tilde P_r(P_X) = I^{*-1}(\tilde H_r(P_X)) = b^{-\tilde H_r(P_X)} = b^{\log_b M_r(P_X, P_X)} = M_r(P_X, P_X)$$
 □
Note that by Remark 8 these means apply, in general, to sets of non-negative numbers and not only to the probabilities in a distribution, given their homogeneity properties. In the light of Lemma 8, the following properties of the equivalent probability function are a corollary of those of the weighted generalized power means of Proposition 1 in Section 2.1.
Corollary 1.
Let $X \sim P_X$ be a random variable with equivalent probability function $\tilde P_r(P_X)$. Then:
1. 
For all $r \in [-\infty, \infty]$, there holds that
$$\min_k p_k = \tilde P_{-\infty}(P_X) \le \tilde P_r(P_X) \le \tilde P_{\infty}(P_X) = \max_k p_k$$
2. 
If $P_X \sim U_X$, the uniform over the same $\operatorname{supp} P_X$, then $\forall k$ and $\forall r \in [-\infty, \infty]$, $p_k = \tilde P_r(U_X) = \frac{1}{|\operatorname{supp} P_X|}$.
3. 
If $P_X \sim \delta_{X_k}$, the Kronecker delta centered on $x_k = X(\omega_k)$, then $\tilde P_r(\delta_{X_k}) = u(r)$, where $u(r)$ is the step function.
Proof. 
Claims 1 and 2 issue directly from the properties of the entropies and the inverse of the logarithm. The last claim follows from Remark 2. □
And so, in their turn, the properties of Rényi entropy can be proven from those of the equivalent probability function and Hartley’s generalized information function.
An interesting property might help recovering P X from the equivalent probability function:
Lemma 9.
Let $X \sim P_X$ be a random variable with equivalent probability function $\tilde P_r(P_X)$. Then for every $p_k$ in $P_X$ there exists an $r_k \in [-\infty, \infty]$ such that $p_k = \tilde P_{r_k}(P_X)$.
Proof. 
This follows from the continuity of the means with respect to its parameters w and x . □
So if we could actually find those values $r_k$, $1 \le k \le n$, which return $p_k = \tilde P_{r_k}(P_X)$, we would be able to retrieve $P_X$ by sampling the equivalent probability function at the appropriate orders, $P_X = \{\tilde P_{r_k}(P_X)\}_{k=1}^n$. Since $n \ge 2$ we know that at least two of these values are $r = \pm\infty$, retrieving the values of the highest and lowest probabilities, for $k = 1$ and $k = n$ when they are sorted by increasing probability value.
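Since the equivalent probability function is monotone in r, those orders r_k can be located by bisection; the following sketch (ours, not part of the paper) recovers the interior probabilities of a small pmf this way, with the search bounds and tolerances chosen arbitrarily.

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])

def P_equiv(r):
    """Equivalent probability function P~_r(P) = M_r(P, P)."""
    if r == 0:
        return np.prod(p ** p)
    return np.sum(p * p ** r) ** (1.0 / r)

def find_order(target, lo=-60.0, hi=60.0, iters=200):
    """Bisection for r_k with P~_{r_k}(P) = target; relies on monotonicity in r."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if P_equiv(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for pk in p[1:-1]:                      # interior values; the extremes need r = +/-inf
    rk = find_order(pk)
    print(pk, rk, P_equiv(rk))          # P~_{r_k} recovers p_k
```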
Example 2
(Continued). Figure 1b shows the equivalent probability function of the example in the previous section. The dual monotone behaviour with respect to that of the Rényi spectrum is clearly observable. We have also plotted over the axis at r = 0 the original probabilities of the distribution to set it in the context of the properties in Corollary 1 and Lemma 9.

3.3.2. The Information Potential

In the context of Information Theoretic Learning (ITL) the information potential is an important quantity ([29], Chapter 2).
Definition 8.
Let $X \sim P_X$. Then the information potential $\tilde V_r(P_X)$ is
$$\tilde V_r(P_X) = E_{P_X}\{P_X^r\} = \sum_i \frac{p_i}{\sum_k p_k}\, p_i^r$$
Note that the original definition of the information potential was presented in terms of the parameter $\alpha$ and for distributions with $\sum_k p_k = 1$, in which case $V_\alpha(P_X) = \tilde V_r(P_X)$ with $r = \alpha - 1$. Now, recall the conversion function in (14), $\varphi(h) = b^{-rh}$. The next lemma is immediate using it on (26).
Lemma 10.
Let $X \sim P_X$. The information potential is the $\varphi$-image of the shifted Rényi entropy:
$$\tilde V_r(P_X) = \varphi(\tilde H_r(P_X)) = b^{-r \tilde H_r(P_X)}$$
Proof. 
$$\tilde V_r(P_X) = b^{-r \tilde H_r(P_X)} = b^{\log_b \left( \sum_i \frac{p_i}{\sum_k p_k} p_i^r \right)} = \sum_i \frac{p_i}{\sum_k p_k}\, p_i^r = E_{P_X}\{P_X^r\} \; \square$$
Incidentally, (28) gives the relation of the information potential and the generalized weighted means.
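A small numerical confirmation (ours, not part of the paper) of Lemma 10 and of the α = 2 case familiar from ITL:

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
r = 1.0                                      # alpha = 2: the usual ITL information potential

V = np.sum(p * p ** r)                       # V~_r(P) = E_P{P^r}
H = -np.log2(V) / r                          # shifted Renyi entropy (bits)
print(np.isclose(V, 2.0 ** (-r * H)))        # V~_r(P) = b^{-r H~_r(P)} with b = 2
print(np.isclose(V, np.sum(p ** 2)))         # equals the alpha = 2 information potential
```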
Remark 9.
The quantity on the right-hand side of (40) is also the normalizing factor, or partition function, of the moments of the distribution and, as such, appears explicitly in the definition of the escort probabilities (27). Other partition functions usually appear in the estimation of densities based on overt information criteria—e.g., maximum entropy [6]—or covert ones—e.g., Ising models [5].

3.3.3. Summary

Table 2 offers a summary of the quantities mentioned above and their relationships, while the domain diagram in Figure 2 summarizes the actions of these functions to obtain the shifted Rényi entropy. A similar diagram is, of course, available for the standard entropy, using φ with the α parameter.
Note that these quantities have independent motivations: this is historically quite evident in the case of the means [24] and the Rényi information [12], and a little less so in the case of the information potential, which arose in the context of ITL [29], hence motivated by a desire to make Rényi’s entropies more useful. Both quantities are generated from/generate the entropy by means of independently motivated functions: Hartley’s transformation (9) and Rényi’s transformation (14), respectively.
Following the original axiomatic approach, it would seem that we first transform the probabilities into entropies using Hartley’s function and then use the φ function to work out an average of these by means of the Kolmogorov–Nagumo formula. But, due to the formulas for the information potential and the equivalent probability function, we know that this is a composition of transformations rather than a back-and-forth between entropies and probabilities. It is clear that the Hartley function and Rényi’s choice of averaging function are special for entropies, as seen from the postulate approach to their definition.

3.4. Discussion

A number of decisions taken in the paper might seem arbitrary. In the following, we try to discuss these issues as well as alternatives left for future work.

3.4.1. Other Reparameterization of the Rényi Entropy

Not only the value, but also the sign of the parameter is somewhat arbitrary in the form of (12). If we choose $r' = 1 - \alpha$, another generalization evolves that is, in a sense, symmetrical to the shifted Rényi entropy presented above, since $r' = -r$. This may be better or worse for the general formulas describing entropy, etc., but presents the problem that it no longer aligns with Shannon’s original choice of sign. The $r' = 0$ order entropy would then actually be Boltzmann’s negative entropy, or negentropy [30], and perhaps more suitable for applications in thermodynamics [5].
Yet another formulation suggests the use of $\alpha = 1/2$, equivalently $r = -1/2$, as the origin of the parameter [31]. From our perspective, this suggests that the origin of the Rényi entropy can be chosen adequately in each application.

3.4.2. Rényi Measures and the Means

The usefulness of the (weighted) means in relation to information-theoretic concerns was already noted and explored in [32]. However, the relationship is not explicitly set out there in terms of the identity of the Rényi entropies with logarithmic, weighted means of probabilities, but rather appears as part of establishing bounds on different quantities for discrete channel characterization.
A more direct approach is found in [33] which, inspired by [32], generalizes several results from there and from other authors concerning the Rényi entropies, divergences and the Rényi centers of a set of distributions. Unlike our proposal, this deep work adheres to the standard definition of Rényi entropies of order α and avoids the issue of negative orders. The focus there is on coding and channel theorems, while ours is a re-definition of the mathematical concept to make the similarities with the weighted means transparent, yet evident.

3.4.3. Other Magnitudes around the Rényi Entropy

Sometimes the p-norm is used as a magnitude related to the Rényi entropy, much as the information potential [29], or by directly seeing the relationship with the definition [5].
Definition 9.
For a set of non-negative numbers $\mathbf{x} = [x_i]_{i=1}^n \in [0, \infty)^n$, the p-norm, with $0 < p \le \infty$, is
$$\|\mathbf{x}\|_p = \Big( \sum_i x_i^p \Big)^{1/p}$$
A more general definition involves both positive and negative components for x , as in normed real spaces, but this is not relevant to our purposes for non-negative measures.
The p-norm has the evident problem that it is only defined for positive p whereas (14) proves that negative orders are meaningful and, indeed, interesting. A prior review of results for the negative orders can be found in [23].
We believe this is yet one more advantage of the shifting of the Rényi order: that the relation with the equivalent probability function and the information potential—the moments of the distribution—is properly highlighted.

3.4.4. Redundancy of the Rényi Entropy

Lemma 6 proves that the Rényi entropies are very redundant, in the sense that given the value for a particular order $r \ne 0$, the rest can be written in terms of entropies of different, but systematically related, orders (see Section 3.2).
In particular, Equations (33) and (34) in Lemma 6, and (31) in Lemma 5 allow us to use a good estimator of Shannon’s entropy to estimate the Rényi entropies and related magnitudes for all orders, special or not. Three interesting possibilities for this rewriting are:
  • That everything can be written in terms of r = 0 , e.g., in terms of Shannon’s entropy. This is made possible by the existence of estimators for Shannon’s entropy and divergence.
  • That everything can be written in terms of a finite $r \ne 0$, e.g., $r = 1$. This is possible by means of Properties 1.3 and 1.4 of the generalized power means. The work in [29] points this way (perhaps including also $r = -1$, a.k.a. Hartley’s), capitalizing on the fact that Rényi’s entropy for data is well estimated for $r = 1$, equivalently $\alpha = 2$ ([29], Section 2.6).
  • That everything can be written in terms of the extreme values of the entropy, e.g., $r = \pm\infty$. This is suggested by Properties 3.1 and 3.2, supposing we had a way to estimate either $\tilde H_\infty(P_X)$ or $\tilde H_{-\infty}(P_X)$. Then, by a divide-and-conquer type of approach, it would be feasible to extract all the probabilities of a distribution out of its Rényi entropy function.

3.4.5. The Algebra of Entropies

Technically, the completed non-negative reals $\overline{\mathbb{R}}_{\ge 0}$, where the means are defined, carry a complete positive semifield structure [34]. This is an algebra similar to a real-valued field but where the inverse operation to addition, e.g., subtraction, is missing.
There are some technicalities involved in writing the results of the operations on the extremes of the semifields—e.g., the multiplication of 0 and ∞—and this makes writing closed expressions for the means with extreme values of $\mathbf{w}$ or $\mathbf{x}$ complicated. A sample of this is the plethora of conditions on Property 1.5. An extended notation, pioneered by Moreau [35], is however capable of writing a closed expression for the means [36].
Furthermore, taking (minus) logarithms and raising to a real power are isomorphisms of semifields, so that the Rényi entropies inhabit a different positive semifield structure [36]. The graph of these isomorphic structures can be seen in Figure 2b. This means that some of the intuitions about operating with entropies are misguided. We believe that the failure to give a meaning to the Rényi entropies of negative order might have been caused by this.

3.4.6. Shifted Rényi Entropies on Continuous Distributions

The treatment we use here may be repeated on continuous measures, but the definitions of the Shannon [10,21] and Rényi [26] entropies in that case run into technical difficulties, solved, typically, by a process of discretization [27].
Actually, we believe that the shifting would also help in this process: a form for the generalized weighted continuous means was established long ago [20] and technically solved by a change of concept and Lebesgue–Stieltjes integration instead of summation ([24], Ch. VI).
Our preliminary analyses show that the relationship with the means given by (17) also holds, and this would mean that the shifting—by aligning the Rényi entropies with the (generalized weighted) continuous means—leverages the theoretical support of the latter to sustain the former.
Definition 10
(Continuous weighted f-mean). Let $\Phi(\xi)$ be a measure and let f be a monotonic function of $\xi$ with inverse $f^{-1}$. Then a continuous version of (A2) is:
$$M_f(\Phi, \xi) = f^{-1}\left( \int f(\xi)\, d\Phi(\xi) \right)$$
understood as a Lebesgue–Stieltjes integral.
This definition was already proposed by De Finetti [20], based upon the works of Bonferroni and Kolmogorov, and thoroughly developed in ([24], Ch. VI) in connection with the discrete means. With $f(x) = x^r$ the continuous Hölder means $M_r(\Phi, \xi)$ appear. Furthermore, De Finetti found ([20], #8) that the form of the continuous, monotone function f must be
$$f(x) = a \int \gamma(x)\, dx + b \quad \text{for arbitrary } a, b \; (a \ne 0),$$
similar to what Rényi found later for the Shannon entropy.
It is easy to see that an analogous definition of the shifted Rényi entropy for a continuous probability density $p_X$ with $dP_X(x) = p_X(x)\, dx$ [5,27] is
$$\tilde h_r(p_X) = -\frac{1}{r} \log \int p_X(x)\, p_X^r(x)\, dx = -\log M_r(p_X, p_X),$$
again with the distribution acting as weight and averaged quantity. Compare this to one of the standard forms of the differential Rényi entropy [23]:
$$h_\alpha(p_X) = \frac{1}{1-\alpha} \ln \int p_X^\alpha(x)\, dx$$
The investigation of the properties of (43) is left pending for future work, though.
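This preliminary claim can at least be exercised numerically on simple densities; the sketch below (ours, not part of the paper) computes (43) by quadrature for an exponential density, for which the integral has the closed form λ^r/(r+1); grid and truncation are arbitrary choices.

```python
import numpy as np

lam, r = 1.0, 2.0
x = np.linspace(0.0, 60.0, 600_001)               # grid truncating the tail of Exp(lam)
p = lam * np.exp(-lam * x)

y = p * p ** r
dx = x[1] - x[0]
integral = dx * (y[:-1] + y[1:]).sum() / 2.0      # trapezoidal rule for the integral of p*p^r
h_numeric = -np.log(integral) / r                 # shifted differential entropy (nats)
h_closed = -np.log(lam) + np.log(r + 1.0) / r     # closed form for the exponential density
print(h_numeric, h_closed)                        # agree to several decimal places
```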

3.4.7. Pervasiveness of Rényi Entropies

Apart from the evident applications to signal processing and communications [29], physics [5] and cognition [11], the Rényi entropy is used as a measure of diversity in several disciplines [37]. We believe that, if its applicability comes from the same properties stemming from the means that we have explored in this paper as applied to positive distributions—e.g., of wealth in a population, or of energy in a community—then the expression to be used is (29).

4. Conclusions

In this paper we have advocated for shifting the traditional Rényi entropy order from the parameter α to $r = \alpha - 1$. The shifting of the Rényi entropy and divergence is motivated by a number of results:
  • It aligns them with the power means and explains the appearance of the escort probabilities. Note that the importance of the escort probabilities is justified independently of their link to the means in the shifted version of the entropy [5].
  • It highlights the Shannon entropy $r = 0$ in the role of the “origin” of entropy orders, just as the geometric mean is a particular case of the weighted power means. This consideration is enhanced by the existence of a formula allowing us to rewrite every other order as a combination of Shannon entropies and cross-entropies of escort probabilities of the distribution.
  • The shifting of the Rényi entropy aligns it with the moments of the distribution, thus enabling new insights into the moments’ problem.
  • It makes the relation between the divergence and the entropy more “symmetrical”.
  • It highlights the “information spectrum” quality of the Rényi entropy measure for fixed P X .
The shifting might or might not be justified by applications. If the concept of the means is relevant in the application, we recommend the shifted formulation.

Author Contributions

Conceptualization, F.J.V.-A. and C.P.-M.; Formal analysis, F.J.V.-A. and C.P.-M.; Funding acquisition, C.P.-M.; Investigation, F.J.V.-A. and C.P.-M.; Methodology, F.J.V.-A. and C.P.-M.; Software, F.J.V.-A.; Supervision, C.P.-M.; Validation, F.J.V.-A. and C.P.-M.; Visualization, F.J.V.-A. and C.P.-M.; Writing—original draft, F.J.V.-A. and C.P.-M.; Writing—review & editing, F.J.V.-A. and C.P.-M.

Funding

This research was funded by the Spanish Government-MinECo projects TEC2014-53390-P and TEC2017-84395-P.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Kolmogorov-Mean and the Kolmogorov–Nagumo Formula

The following is well known since [18,19,20,24].
Definition A1.
Given an invertible real function $f: \mathbb{R} \to \mathbb{R}$, the Kolmogorov–Nagumo mean of a set of non-negative numbers $\mathbf{x} = [x_i]_{i=1}^n \in [0, \infty)^n$ is
$$KN_f(\mathbf{x}) = f^{-1}\left( \sum_{i=1}^n \frac{1}{n} f(x_i) \right).$$
Definition A1 is an instance of the following formula to work out the weighted f-mean with a set of finite, non-negative weights $\mathbf{w} \in [0, \infty)^n$:
$$M_f(\mathbf{w}, \mathbf{x}) = f^{-1}\left( \sum_{i=1}^n \frac{w_i}{\sum_k w_k} f(x_i) \right).$$
Our interest in (A2) lies in the fact that Shannon’s and Rényi’s entropies can be seen as special cases of it, which makes its properties especially interesting.
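A generic implementation of (A2) (ours, not part of the paper) makes the connection with the power means immediate: choosing f(t) = t^r recovers M_r.

```python
import numpy as np

def kn_mean(w, x, f, finv):
    """Weighted Kolmogorov-Nagumo f-mean, as in (A2)."""
    w = np.asarray(w, float) / np.sum(w)
    return finv(np.sum(w * f(np.asarray(x, float))))

w = np.array([1.0, 2.0, 3.0])
x = np.array([2.0, 5.0, 11.0])
r = 3.0
kn = kn_mean(w, x, lambda t: t ** r, lambda t: t ** (1.0 / r))     # f(t) = t^r
holder = (np.sum(w / w.sum() * x ** r)) ** (1.0 / r)               # weighted power mean
print(np.isclose(kn, holder))                                      # True
```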
Proposition A1
(Properties of the Kolmogorov–Nagumo means). Let $\mathbf{x}, \mathbf{w} \in [0, \infty)^n$. The following conditions hold if and only if there is a strictly monotonic and continuous function f such that (A1) holds.
1. 
Continuity and strict monotonicity in all coordinates.
2. 
(Symmetry or permutation invariance) Let $\sigma$ be a permutation; then $M_f(\mathbf{w}, \mathbf{x}) = M_f(\sigma(\mathbf{w}), \sigma(\mathbf{x}))$.
3. 
(Reflexivity) The mean of a series of constants is the constant itself:
$$M_f(\mathbf{w}, \{k\}_{i=1}^n) = k$$
4. 
(Blocking) The computation of the mean can be split into computations of equal size sub-blocks.
5. 
(Associativity) Replacing a k-subset of the x with their partial mean in the same multiplicity does not change the overall mean.
For a minimal axiomatization, Blocking and Associativity are redundant. A review of the axiomatization of these and other properties can be found in [22].

Appendix B. The Approach to Shannon’s Information Functions Based on Postulates

It is important to recall that Shannon set out to define the amount of information, discarding any notion of information itself. Both concepts should be distinguished clearly for methodological reasons, but can be ignored for applications that deal only with quantifying information.
Recall the Faddeev postulates for the generalization of Shannon’s entropy ([26], Chap. IX. §2):
  • The amount of information $H(P)$ of a sequence $P = [p_k]_{k=1}^n$ of n numbers is a symmetric function of this set of values, $H(P) = H(\sigma(P)) = H(\{p_k\}_{k=1}^n)$, where $\sigma$ is any permutation of n elements.
  • $H(\{p, 1-p\})$ is a continuous function of $p$, $0 \le p \le 1$.
  • $H(\{\tfrac{1}{2}, \tfrac{1}{2}\}) = 1$.
  • The following relation holds:
    $$H(\{p_1, p_2, \dots, p_n\}) = H(\{p_1 + p_2, p_3, \dots, p_n\}) + (p_1 + p_2)\, H\Big(\Big\{\frac{p_1}{p_1+p_2}, \frac{p_2}{p_1+p_2}\Big\}\Big)$$
These postulates lead to Shannon’s entropy for $X \sim P_X$ with binary logarithm [26]:
$$H(P_X) = E_{P_X}\{-\log P_X\} = -\sum_k p_k \log p_k$$
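The grouping postulate can be checked directly on this formula; the following sketch (ours, not part of the paper) does so for a small distribution.

```python
import numpy as np

def H(p):
    """Shannon entropy in bits."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log2(p))

p = np.array([0.1, 0.2, 0.3, 0.4])
lhs = H(p)
merged = np.array([p[0] + p[1], p[2], p[3]])        # group the first two outcomes
split = np.array([p[0], p[1]]) / (p[0] + p[1])      # conditional distribution inside the group
rhs = H(merged) + (p[0] + p[1]) * H(split)
print(np.isclose(lhs, rhs))                         # the grouping (branching) postulate
```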

References

  1. Shannon, C.E.; Weaver, W. A Mathematical Model of Communication; The University of Illinois Press: Champaign, IL, USA, 1949.
  2. Shannon, C.E. A mathematical theory of communication. Parts I and II. Bell Syst. Tech. J. 1948, XXVII, 379–423.
  3. Shannon, C.E. A mathematical theory of communication. Part III. Bell Syst. Tech. J. 1948, XXVII, 623–656.
  4. Shannon, C. The bandwagon. IRE Trans. Inf. Theory 1956, 2, 3.
  5. Beck, C.; Schögl, F. Thermodynamics of Chaotic Systems: An Introduction; Cambridge University Press: Cambridge, UK, 1995.
  6. Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 1996.
  7. Mayoral, M.M. Rényi’s entropy as an index of diversity in simple-stage cluster sampling. Inf. Sci. 1998, 105, 101–114.
  8. MacKay, D.J.C. Information Theory, Inference and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
  9. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial. Found. Trends Commun. Inf. Theory 2004, 1, 417–528.
  10. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006.
  11. Sayood, K. Information Theory and Cognition: A Review. Entropy 2018, 20, 706.
  12. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
  13. Havrda, J.; Charvát, F. Quantification method of classification processes. Concept of structural a-entropy. Kybernetika 1967, 3, 30–35.
  14. Csiszár, I. Axiomatic Characterizations of Information Measures. Entropy 2008, 10, 261–273.
  15. Arndt, C. Information Measures, 1st ed.; Information and Its Description in Science and Engineering; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2004.
  16. Rényi, A. On the Foundations of Information Theory. Revue de l’Institut International de Statistique/Rev. Int. Stat. Inst. 1965, 33, 1–14.
  17. Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press [Harcourt Brace Jovanovich, Publishers]: New York, NY, USA; London, UK, 1975.
  18. Kolmogorov, A.N. Sur la notion de la moyenne. Atti della Accademia Nazionale dei Lincei 1930, 12, 388–391.
  19. Nagumo, M. Über eine Klasse der Mittelwerte. Jpn. J. Math. Trans. Abstr. 1930, 7, 71–79.
  20. De Finetti, B. Sul concetto di media. Giornale dell’Istituto Italiano degli Attuari 1931, II, 369–396.
  21. Kolmogorov, A. On the Shannon theory of information transmission in the case of continuous signals. IRE Trans. Inf. Theory 1956, 2, 102–108.
  22. Muliere, P.; Parmigiani, G. Utility and means in the 1930s. Stat. Sci. 1993, 8, 421–432.
  23. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014.
  24. Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities; Cambridge University Press: Cambridge, UK, 1952.
  25. Kitagawa, T. On Some Class of Weighted Means. Proc. Phys.-Math. Soc. Jpn. 3rd Ser. 1934, 16, 117–126.
  26. Rényi, A. Probability Theory; Courier Dover Publications: Mineola, NY, USA, 1970.
  27. Jizba, P.; Arimitsu, T. The world according to Rényi: Thermodynamics of multifractal systems. Ann. Phys. 2004, 312, 17–59.
  28. Bickel, P.J.; Hammel, E.A.; O’Connell, J.W. Sex bias in graduate admissions: Data from Berkeley. Science 1975, 187, 398–403.
  29. Principe, J.C. Information Theoretic Learning; Information Science and Statistics; Springer: New York, NY, USA, 2010.
  30. Brillouin, L. Science and Information Theory, 2nd ed.; Academic Press, Inc.: New York, NY, USA, 1962.
  31. Harremoës, P. Interpretations of Rényi entropies and divergences. Phys. A Stat. Mech. Its Appl. 2005, 365, 57–62.
  32. Augustin, U. Noisy Channels. Ph.D. Thesis, Universität Erlangen, Erlangen, Germany, 1978.
  33. Nakiboglu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2018.
  34. Gondran, M.; Minoux, M. Graphs, Dioids and Semirings. New Models and Algorithms; Operations Research/Computer Science Interfaces Series; Springer: New York, NY, USA, 2008.
  35. Moreau, J.J. Inf-convolution, sous-additivité, convexité des fonctions numériques. J. Math. Pures Appl. 1970, 49, 109–154.
  36. Valverde-Albacete, F.J.; Peláez-Moreno, C. Entropy operates in non-linear semifields. arXiv 2017, arXiv:1710.04728.
  37. Zhang, Z.; Grabchak, M. Entropic representation and estimation of diversity indices. J. Nonparametr. Stat. 2016, 28, 563–575.
Figure 1. Rényi spectrum (a) and equivalent probability function (b)—also the Hölder path—of $\tilde q_0(M_X)$, the probability distribution of a simple mass measure $M_X$ with n = 6 (see Section 3.3). The values of the self-information (left) and probability (right) of the original distribution are also shown at r = 0 (hollow circles). Only 5 values seem to exist because the maximal information (minimal probability) is almost superposed on a second value.
Figure 2. Schematics of relationship between some magnitudes in the text and their domains of definition (see Section 3.4.5). (a) Between entropy-related quantities; (b) Between entropy-related domains.
Table 1. Relation between the most usual weighted power means, Rényi entropies and shifted versions of them.
| Mean Name | Mean $M_r(\mathbf{w}, \mathbf{x})$ | Shifted Entropy $\tilde H_r(P_X)$ | Entropy Name | $\alpha$ | $r$ |
|---|---|---|---|---|---|
| Maximum | $\max_i x_i$ | $\tilde H_{\infty} = -\log \max_i p_i$ | min-entropy | $\infty$ | $\infty$ |
| Arithmetic | $\sum_i w_i x_i$ | $\tilde H_1 = -\log \sum_i p_i^2$ | Rényi's quadratic | 2 | 1 |
| Geometric | $\prod_i x_i^{w_i}$ | $\tilde H_0 = -\sum_i p_i \log p_i$ | Shannon's | 1 | $-1$... 0 |
| Harmonic | $\big(\sum_i w_i \tfrac{1}{x_i}\big)^{-1}$ | $\tilde H_{-1} = \log n$ | Hartley's | 0 | $-1$ |
| Minimum | $\min_i x_i$ | $\tilde H_{-\infty} = -\log \min_i p_i$ | max-entropy | $-\infty$ | $-\infty$ |
Table 2. Quantities around the shifted Rényi entropy of a discrete distribution P X .
| Quantity, in terms of… | Rényi Entropy | Gen. Hölder Mean | Information Potential | Distribution |
|---|---|---|---|---|
| Rényi entropy | $\tilde H_r(P_X)$ | $-\log M_r(P_X, P_X)$ | $-\frac{1}{r}\log \tilde V_r(P_X)$ | $-\frac{1}{r}\log \sum_i \frac{p_i}{\sum_k p_k} p_i^r$ |
| Gen. Hölder mean | $\exp(-\tilde H_r(P_X))$ | $M_r(P_X, P_X)$ | $\tilde V_r(P_X)^{1/r}$ | $\big(\sum_i \frac{p_i}{\sum_k p_k} p_i^r\big)^{1/r}$ |
| Information potential | $\exp(-r \tilde H_r(P_X))$ | $M_r(P_X, P_X)^r$ | $\tilde V_r(P_X) = E_{P_X}\{P_X^r\}$ | $\sum_i \frac{p_i}{\sum_k p_k} p_i^r$ |
