Error Exponents and α-Mutual Information

Over the last six decades, the representation of error exponent functions for data transmission through noisy channels at rates below capacity has seen three distinct approaches: (1) Through Gallager’s E0 functions (with and without cost constraints); (2) large deviations form, in terms of conditional relative entropy and mutual information; (3) through the α-mutual information and the Augustin–Csiszár mutual information of order α derived from the Rényi divergence. While a fairly complete picture has emerged in the absence of cost constraints, there have remained gaps in the interrelationships between the three approaches in the general case of cost-constrained encoding. Furthermore, no systematic approach has been proposed to solve the attendant optimization problems by exploiting the specific structure of the information functions. This paper closes those gaps and proposes a simple method to maximize Augustin–Csiszár mutual information of order α under cost constraints by means of the maximization of the α-mutual information subject to an exponential average constraint.


Phase 1: The MIT School
The capacity C of a stationary memoryless channel is equal to the maximal symbolwise input-output mutual information. Not long after Shannon [1] established this result, Rice [2] observed that, when operating at any encoding rate R < C, there exist codes whose error probability vanishes exponentially with blocklength, with a speed of decay that decreases as R approaches C. This early observation moved the center of gravity of information theory research towards the quest for the reliability function, a term coined by Shannon [3] to refer to the maximal achievable exponential decay as a function of R. The MIT information theory school, and most notably Elias [4], Feinstein [5], Shannon [3,6], Fano [7], Gallager [8,9], and Shannon, Gallager and Berlekamp [10,11], succeeded in upper/lower bounding the reliability function by the sphere-packing error exponent function and the random coding error exponent function, respectively. Fortunately, these functions coincide for rates between a certain value, called the critical rate, and C, thereby determining the reliability function in that region. The influential 1968 textbook by Gallager [9] set down the major error exponent results obtained during Phase 1 of research on this topic, including the expurgation technique to improve upon the random coding error exponent lower bound. Two aspects of those early works (and of Dobrushin's contemporary papers [12,13] on the topic) stand out: (a) The error exponent functions were expressed as the result of the Karush-Kuhn-Tucker optimization of ad-hoc functions which, unlike mutual information, carried little insight. In particular, during the first phase, center stage is occupied by the parametrized function of the input distribution P_X and the random transformation (or "channel") P_{Y|X},

E_0(ρ, P_X) = −log Σ_{y∈B} ( Σ_{x∈A} P_X(x) P_{Y|X}^{1/(1+ρ)}(y|x) )^{1+ρ},    (1)

introduced by Gallager in [8].
(b) Despite the large-deviations nature of the setup, none of the tools from that then-nascent field (other than the Chernoff bound) found their way into the first phase of the work on error exponents; in particular, relative entropy, introduced by Kullback and Leibler [14], failed to put in an appearance.
To this day, the reliability function remains open at low rates even for the binary symmetric channel, despite a number of refined converse and achievability results (e.g., [15][16][17][18][19][20][21]) obtained since [9]. Our focus in this paper is not on converse/achievability techniques but on the role played by various information measures in the formulation of error exponent results.
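As a concrete numerical illustration (a minimal sketch, not taken from the paper), Gallager's E_0 function can be evaluated directly from its definition; the binary symmetric channel and input distribution below are hypothetical example values.

```python
import math

def gallager_E0(rho, P_X, W):
    """E0(rho, P_X) = -log sum_y ( sum_x P_X(x) W(y|x)^(1/(1+rho)) )^(1+rho),
    for a finite-alphabet channel W given as a row-stochastic matrix."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(P_X[x] * W[x][y] ** (1.0 / (1.0 + rho)) for x in range(len(P_X)))
        total += inner ** (1.0 + rho)
    return -math.log(total)

# Hypothetical example: BSC with crossover 0.1, equiprobable inputs
W = [[0.9, 0.1], [0.1, 0.9]]
e0 = gallager_E0(1.0, [0.5, 0.5], W)
```

Note that E_0(0, P_X) = 0 for any input distribution, since the inner sum then reduces to the output marginal.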

Phase 2: Relative Entropy
The second phase of the error exponent research was pioneered by Haroutunian [22] and Blahut [23], who infused the expressions for the error exponent functions with meaning by incorporating relative entropy. The sphere-packing error exponent function corresponding to a random transformation P_{Y|X} is given as

E_sp(R) = max_{P_X} min { D(Q_{Y|X} ‖ P_{Y|X} | P_X) : I(P_X, Q_{Y|X}) ≤ R },    (2)

where the minimum is over auxiliary random transformations Q_{Y|X}. Roughly speaking, optimal codes of rate R < C incur errors due to atypical channel behavior, and large deviations theory establishes that the overwhelmingly most likely such behavior can be explained as if the channel were supplanted by the one with mutual information bounded by R which is closest to the true channel in conditional relative entropy D(Q_{Y|X} ‖ P_{Y|X} | P_X). Within the confines of finite-alphabet memoryless channels, this direction opened the possibility of using the combinatorial method of types to obtain refined results robustifying the choice of the optimal code against incomplete knowledge of the channel. The 1981 textbook by Csiszár and Körner [24] summarizes the main results obtained during Phase 2.
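For the binary symmetric channel, the inner minimization above can be carried out in closed form: the least favorable channel is itself a BSC whose crossover probability q ≥ δ brings the mutual information down to R, and the exponent is the binary divergence d(q‖δ). A minimal sketch (function names are ours, not the paper's):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def d2(a, b):
    """Binary relative entropy d(a||b) in bits."""
    return a * math.log2(a / b) + (1 - a) * math.log2((1 - a) / (1 - b))

def Esp_bsc(R, delta):
    """Sphere-packing exponent of a BSC(delta) at rate R < 1 - h2(delta):
    find q in [delta, 1/2] with 1 - h2(q) = R by bisection, return d2(q || delta)."""
    lo, hi = delta, 0.5
    for _ in range(200):
        mid = (lo + hi) / 2
        if 1 - h2(mid) > R:   # 1 - h2 is decreasing on [delta, 1/2]
            lo = mid
        else:
            hi = mid
    q = (lo + hi) / 2
    return d2(q, delta)
```

As expected, the exponent vanishes as R approaches capacity and grows as R decreases.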

Phase 3: Rényi Information Measures
Entropy and relative entropy were generalized by Rényi [25], who introduced the notions of Rényi entropy and Rényi divergence of order α. He arrived at Rényi entropy by relaxing the axioms Shannon proposed in [1] and showed to be satisfied by no measure but entropy. Shortly after [25], Campbell [26] realized the operational role of Rényi entropy in variable-length data compression if the usual average encoding length criterion E[ℓ(c(X))] is replaced by an exponential average α^{-1} log E[exp(α ℓ(c(X)))]. Arimoto [27] put forward a generalized conditional entropy inspired by Rényi's measures (now known as the Arimoto-Rényi conditional entropy) and proposed a generalized mutual information by taking the difference between Rényi entropy and the Arimoto-Rényi conditional entropy. The role of the Arimoto-Rényi conditional entropy in the analysis of the error probability of Bayesian M-ary hypothesis testing problems has been recently shown in [28], tightening and generalizing a number of results dating back to Fano's inequality [29].
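Campbell's result is easy to check numerically: with idealized (non-integer) code lengths matched to the tilted distribution q ∝ p^β, β = 1/(1+α), the exponential-average criterion equals the Rényi entropy of order β exactly. This is a sketch with a hypothetical source; practical codes require integer lengths.

```python
import math

def renyi_entropy(beta, p):
    """Rényi entropy of order beta, in bits."""
    return math.log2(sum(pi ** beta for pi in p)) / (1 - beta)

p = [0.5, 0.25, 0.125, 0.125]   # hypothetical source pmf
alpha = 1.0                      # exponent in Campbell's criterion
beta = 1 / (1 + alpha)

# Idealized lengths matched to the tilted pmf q ∝ p^beta
C = sum(pi ** beta for pi in p)
lengths = [-math.log2(pi ** beta / C) for pi in p]

# Exponential-average length: (1/alpha) * log2 E[2^(alpha * L)]
exp_avg = math.log2(sum(pi * 2 ** (alpha * L) for pi, L in zip(p, lengths))) / alpha
```

The design choice of tilting the codeword-length distribution penalizes long codewords more heavily than the ordinary average does.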
Entropy 2021, 23, 199

Phase 3 of the error exponent research was pioneered by Csiszár [30], where he established a connection between Gallager's E_0 function and Rényi divergence by means of a Bayesian measure of the discrepancy among a finite collection of distributions introduced by Sibson [31]. Although [31] failed to realize its connection to mutual information, Csiszár [30,32] noticed that it could be viewed as a natural generalization of mutual information. Arimoto [27] also observed that the unconstrained maximization of his generalized mutual information measure with respect to the input distribution coincides with a scaled version of the maximal E_0 function. This resulted in an extension of the Arimoto-Blahut algorithm useful for the computation of error exponent functions [33] (see also [34]) for finite-alphabet memoryless channels.
Within Haroutunian's framework [22] applied in the context of the method of types, Poltyrev [35] proposed an alternative to Gallager's E_0 function, defined by means of a cumbersome maximization over a reverse random transformation. This measure turned out to coincide (modulo different parametrizations) with another generalized mutual information introduced four years earlier by Augustin in his unpublished thesis [36], by means of a minimization with respect to an output probability measure.
The key contribution in the development of this third phase is Csiszár's paper [32], where he makes a compelling case for the adoption of Rényi's information measures in the large deviations analysis of lossless data compression, hypothesis testing and data transmission. Recall that more than two decades earlier, Csiszár [30] had already established the connection between Gallager's E_0 function and the generalized mutual information inspired by Sibson [31], which, henceforth, we refer to as the α-mutual information. Therefore, its relevance to the error exponent analysis of error correcting codes had already been established. Incidentally, more recently, another operational role was found for α-mutual information in the context of the large deviations analysis of composite hypothesis testing [37]. In addition to α-mutual information, and always working with discrete alphabets, Csiszár [32] considers the generalized mutual informations due to Arimoto [27] and to Augustin [36], the latter of which we refer to as the Augustin-Csiszár mutual information of order α. Csiszár shows that all three generalizations of mutual information coincide upon their unconstrained maximization with respect to the input distribution. Further relationships among those Rényi-based generalized mutual informations have been obtained in recent years in [38][39][40][41][42][43][44][45]. In [32], the maximal α-mutual information, or generalized capacity of order α, finds an operational characterization as a generalized cutoff rate, an equivalent way to express the reliability function. This would have been the final word on the topic were it not for its limitation to discrete-alphabet channels and, more importantly, to encoding without cost constraints.

Cost Constraints
If the transmitted codebook is cost-constrained, i.e., every codeword (c_1, ..., c_n) is forced to satisfy Σ_{i=1}^n b(c_i) ≤ n θ for some nonnegative cost function b(·), then the channel capacity is equal to the input-output mutual information maximized over input probability measures restricted to satisfy E[b(X)] ≤ θ. Gallager [9] incorporated cost constraints in his treatment of error exponents by generalizing (1) to the function

E_0(ρ, P_X, r, θ) = −log Σ_{y∈B} ( Σ_{x∈A} P_X(x) exp(r b(x) − r θ) P_{Y|X}^{1/(1+ρ)}(y|x) )^{1+ρ},    (3)

with which he was able to prove an achievability result invoking Shannon's random coding technique [1]. Gallager also suggested in the footnote of page 329 of [9] that the converse technique of [10] is amenable to extension to prove a sphere-packing converse based on (3). However, an important limitation is that that technique only applies to constant-composition codes (all codewords have the same empirical distribution). A more powerful converse circumventing that limitation (at least for symmetric channels) was given by [46], also expressing the upper bound on the reliability function by optimizing (3) with respect to ρ, r and P_X. A notable success of the approach based on the optimization of (3) was the determination of the reliability function (for all rates below capacity) of the direct detection photon channel [47]. In contrast, the Phase 2 expression (2) for the sphere-packing error exponent for cost-constrained channels is much more natural and similar to the way the expression for channel capacity is impacted by cost constraints: we simply constrain the maximization in (2) to satisfy E[b(X)] ≤ θ. Unfortunately, no general methods to solve the ensuing optimization have been reported.
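The cost-constrained E_0 function is just as easy to evaluate numerically as the unconstrained one; the sketch below (our own function names, with a hypothetical binary-input channel and linear cost) reduces to the unconstrained E_0 when r = 0.

```python
import math

def gallager_E0_cost(rho, r, theta, P_X, b, W):
    """Cost-constrained E0: -log sum_y
    ( sum_x P_X(x) exp(r*b(x) - r*theta) W(y|x)^(1/(1+rho)) )^(1+rho)."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(P_X[x] * math.exp(r * (b[x] - theta)) * W[x][y] ** (1 / (1 + rho))
                    for x in range(len(P_X)))
        total += inner ** (1 + rho)
    return -math.log(total)

# Hypothetical example: BSC(0.1), cost b(x) = x, budget theta = 0.5
W = [[0.9, 0.1], [0.1, 0.9]]
b = [0.0, 1.0]
e0_free = gallager_E0_cost(1.0, 0.0, 0.5, [0.5, 0.5], b, W)  # r = 0: cost term drops out
```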
Once cost constraints are incorporated, the equivalence among the maximal α-mutual information, maximal order-α Augustin-Csiszár mutual information, and maximal Arimoto mutual information of order α breaks down. Of those three alternatives, it is the maximal Augustin-Csiszár mutual information under cost constraints that appears in the error exponent functions. The challenge is that Augustin-Csiszár mutual information is much harder to evaluate, let alone maximize, than α-mutual information. The Phase 3 effort to encompass cost constraints started by Augustin [36] and was continued recently by Nakiboglu [43]. Their focus was to find a way to express (3) in terms of Rényi information measures. Although, as we explain in Item 62, they did not quite succeed, their efforts were instrumental in developing key properties of the Augustin-Csiszár mutual information.

Organization
To enhance readability and ease of reference, the rest of this work is organized in 81 items, grouped into 13 sections and an appendix.
Basic notions and notation (including the key concept of α-response) are collected in Section 2. Unlike much of the literature on the topic, we do not restrict attention to discrete input/output alphabets, nor do we impose any topological structures on them.
The paper is essentially self-contained. Section 3 covers the required background material on relative entropy, Rényi divergence of order α, and their conditional versions, including a key representation of Rényi divergence in terms of relative entropies and a tilted probability measure, and additive decompositions of Rényi divergence involving the α-response.
Section 4 studies the basic properties of α-mutual information and order-α Augustin-Csiszár mutual information. This includes their variational representations in terms of conventional (non-Rényi) information measures such as conditional relative entropy and mutual information, which are particularly simple to show in the main range of interest in applications to error exponents, namely α ∈ (0, 1).
The interrelationships between α-mutual information and order-α Augustin-Csiszár mutual information are covered in Section 5, which introduces the dual notions of the α-adjunct and the ⟨α⟩-adjunct of an input probability measure.
The maximizations with respect to the input distribution of α-mutual information and order-α Augustin-Csiszár mutual information account for their role in the fundamental limits of data transmission through noisy channels. Section 6 gives a brief review of the results in [45] for the maximization of α-mutual information. For Augustin-Csiszár mutual information, Section 7 covers its unconstrained maximization, which coincides with its α-mutual information counterpart. Section 8 proposes an approach to find C_α^c(θ), the maximal Augustin-Csiszár mutual information of order α ∈ (0, 1) subject to E[b(X)] ≤ θ. Instead of trying to identify directly the input distribution that maximizes Augustin-Csiszár mutual information, the method seeks its ⟨α⟩-adjunct. This is tantamount to maximizing α-mutual information over a larger set of distributions.
Section 9 shows that the minimax of Gallager's E_0 function (3) with cost constraints is equal to the maximal Augustin-Csiszár mutual information, thereby bridging the existing gap between the Phase 1 and Phase 3 representations alluded to earlier in this introduction. As in [48], Section 10 defines the sphere-packing and random-coding error exponent functions in the natural canonical form of Phase 2 (e.g., (2)), and gives a very simple proof of the nexus between the Phase 2 and Phase 3 representations, with or without cost constraints. In this regard, we note that, although all the ingredients required were already present at the time the revised version of [24] was published three decades after the original, [48] does not cover the role of Rényi's information measures in channel error exponents. Examples illustrating the proposed method are given in Sections 11 and 12 for the additive Gaussian noise channel under a quadratic cost function and the additive exponential noise channel under a linear cost function, respectively. Simple parametric expressions are given for the error exponent functions, and the least favorable channels that account for the most likely error mechanism (Section 1.2) are identified in both cases.

Relative Information and Information Density
We begin with basic terminology and notation required for the subsequent development.

1.
If (A, F, P) is a probability space, X ∼ P indicates P[X ∈ F] = P(F) for all F ∈ F.

2.
If probability measures P and Q defined on the same measurable space (A, F) satisfy P(A) = 0 for all A ∈ F such that Q(A) = 0, we say that P is dominated by Q, denoted P ≪ Q. If P and Q dominate each other, we write P ≪≫ Q. If there is an event A such that P(A) = 0 and Q(A) = 1, we say that P and Q are mutually singular, and we write P ⊥ Q.

3.
If P ≪ Q, then dP/dQ is the Radon-Nikodym derivative of the dominated measure P with respect to the reference measure Q. Its logarithm is known as the relative information, namely, the random variable

ı_{P‖Q}(a) = log (dP/dQ)(a) ∈ [−∞, +∞), a ∈ A.
As with the Radon-Nikodym derivative, any identity involving relative informations can be changed on a set of measure zero under the reference measure without incurring any contradiction. If P ≪ Q ≪ R, then the chain rule of Radon-Nikodym derivatives yields

ı_{P‖Q}(a) + ı_{Q‖R}(a) = ı_{P‖R}(a), a ∈ A.    (7)
Throughout the paper, the base of exp and log is the same and chosen by the reader unless explicitly indicated otherwise. We frequently define a probability measure P from the specification of ı_{P‖Q} and Q ≫ P, since

P(A) = E[ exp( ı_{P‖Q}(Z) ) 1{Z ∈ A} ], Z ∼ Q, A ∈ F.    (9)

If X ∼ P and Y ∼ Q, it is often convenient to write ı_{X‖Y}(x) instead of ı_{P‖Q}(x).

Example 1. If X ∼ N(μ_X, σ_X²) (Gaussian with mean μ_X and variance σ_X²) and Y ∼ N(μ_Y, σ_Y²), then

ı_{X‖Y}(a) = log (σ_Y/σ_X) + ( (a − μ_Y)²/(2 σ_Y²) − (a − μ_X)²/(2 σ_X²) ) log e.

4. Let (A, F) and (B, G) be measurable spaces, known as the input and output spaces, respectively. Likewise, A and B are referred to as the input and output alphabets, respectively. The simplified notation P_{Y|X}: A → B denotes a random transformation from (A, F) to (B, G), i.e., for any x ∈ A, P_{Y|X=x}(·) is a probability measure on (B, G), and for any B ∈ G, P_{Y|X=·}(B) is an F-measurable function.

5.
We abbreviate by P_A the set of probability measures on (A, F), and by P_{A×B} the set of probability measures on (A × B, F ⊗ G). If P ∈ P_A and P_{Y|X}: A → B is a random transformation, the corresponding joint probability measure is denoted by P P_{Y|X} ∈ P_{A×B} (or, interchangeably, P_{Y|X} P). The notation P → P_{Y|X} → Q indicates that the output marginal of the joint probability measure P P_{Y|X} is denoted by Q ∈ P_B.

6.
If P_X → P_{Y|X} → P_Y and P_{Y|X=a} ≪ P_Y, the information density ı_{X;Y}: A × B → [−∞, ∞) is defined as

ı_{X;Y}(a; b) = ı_{P_{Y|X=a}‖P_Y}(b), (a, b) ∈ A × B.
Following Rényi's terminology [49], if P_X P_{Y|X} ≪ P_X × P_Y, the dependence between X and Y is said to be regular, and the information density can be defined on (x, y) ∈ A × B. Henceforth, we assume that P_{Y|X} is such that the dependence between its input and output is regular regardless of the input probability measure. For example, if X = Y ∈ R, then P_{Y|X=a}(A) = 1{a ∈ A}, and their dependence is not regular, since for any P_X with non-discrete components, P_{XY} is not dominated by P_X × P_Y.

7.
Let α > 0, and P_X → P_{Y|X} → P_Y. The α-response to P_X ∈ P_A is the output probability measure P_{Y[α]} ≪ P_Y with relative information given by

ı_{Y[α]‖Y}(y) = (1/α) ( log E[ exp( α ı_{X;Y}(X; y) ) ] − κ_α ), X ∼ P_X,    (13)

where κ_α is a scalar that guarantees that P_{Y[α]} is a probability measure. Invoking (9), we obtain

κ_α = α log ∫_B ( E[ exp( α ı_{X;Y}(X; y) ) ] )^{1/α} dP_Y(y), X ∼ P_X.    (14)

For brevity, the dependence of κ_α on P_X and P_{Y|X} is omitted. Jensen's inequality applied to (·)^α results in κ_α ≤ 0 for α ∈ (0, 1) and κ_α ≥ 0 for α > 1. Although the α-response has a long record of services to information theory, this terminology and notation were introduced recently in [45]. Alternative terminology and notation were proposed in [42], which refers to the α-response as the order-α Rényi mean. Note that κ_1 = 0 and the 1-response to P_X is P_Y. If p_{Y[α]} and p_{Y|X} denote the densities of P_{Y[α]} and P_{Y|X} with respect to some common dominating measure, then (13) becomes

p_{Y[α]}(y) = exp(−κ_α/α) ( E[ p_{Y|X}^α(y|X) ] )^{1/α}, X ∼ P_X.

For α > 1 (resp. α < 1) we can think of the normalized version of p_{Y|X}^α as a random transformation with less (resp. more) "noise" than p_{Y|X}. We will have opportunity to apply the following examples.

Example 2.
If Y " X`N, where X " N`µ X , σ 2 X˘i ndependent of N " N`µ N , σ 2 N˘, then the α-response to P X is Example 3. Suppose that Y " X`N, where N is exponential with mean ζ, independent of X, which is a mixed random variable with density with α µ ě ζ. Then, Yrαs, the α-response to P X , is exponential with mean α µ.

Relative Entropy and Rényi Divergence
Given a pair of probability measures (P, Q) ∈ P_A², relative entropy and Rényi divergence gauge the distinctness between P and Q.

9.
Provided P ≪ Q, the relative entropy is the expectation of the relative information with respect to the dominated measure,

D(P‖Q) = E[ ı_{P‖Q}(X) ] ≥ 0, X ∼ P,

with equality if and only if P = Q. If P is not dominated by Q, then D(P‖Q) = ∞. As in Item 3, if X ∼ P and Y ∼ Q, we may write D(X‖Y) instead of D(P‖Q), in the same spirit that the expectation and entropy of P are written as E[X] and H(X), respectively.
10. Arising in the sequel, a common optimization in information theory finds, among the probability measures satisfying an average cost constraint, the one that is closest to a given reference measure Q in the sense of D(·‖Q). For that purpose, the following result proves sufficient. Incidentally, we often refer to unconstrained maximizations over probability distributions. It should be understood that those optimizations are still constrained to the sets P_A or P_B. As customary in information theory, we abbreviate the maximization over P_X ∈ P_A by max_X or max_{P_X}.

Theorem 1. Let P_Z ∈ P_A and suppose that g: A → [0, ∞) is a Borel measurable mapping. Then,

min_{P_X ∈ P_A} { D(P_X‖P_Z) + E[g(X)] } = −log E[exp(−g(Z))],    (21)

achieved uniquely by P_{X*} ≪≫ P_Z defined by

ı_{X*‖Z}(a) = −g(a) − log E[exp(−g(Z))], a ∈ A.    (22)

Proof.
Note that since g is nonnegative, η = E[exp(−g(Z))] ∈ (0, 1]. Therefore, the subset of P_A for which the term in {·} in (21) is finite is nonempty. Fix any P_X from that subset (which therefore satisfies P_X ≪ P_Z ≪≫ P_{X*}) and invoke the chain rule (7) to write

D(P_X‖P_Z) + E[g(X)] = D(P_X‖P_{X*}) − log η,

which is uniquely minimized by letting P_X = P_{X*}. Note that for typographical convenience we have denoted X* ∼ P_{X*}.
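Theorem 1 can be verified numerically on a finite alphabet: the exponentially tilted distribution (22) attains −log E[exp(−g(Z))] exactly, while any other distribution does worse. A minimal sketch with made-up numbers:

```python
import math

def D(P, Q):
    """Relative entropy D(P||Q) in nats, finite alphabet."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P_Z = [0.2, 0.3, 0.5]          # hypothetical reference measure
g = [1.0, 0.5, 2.0]            # nonnegative cost g(a)

eta = sum(pz * math.exp(-ga) for pz, ga in zip(P_Z, g))
P_star = [pz * math.exp(-ga) / eta for pz, ga in zip(P_Z, g)]   # the minimizer (22)

def objective(P):
    """The bracketed term in (21): D(P||P_Z) + E[g]."""
    return D(P, P_Z) + sum(p * ga for p, ga in zip(P, g))
```

The equality at the minimizer is exact, not approximate: substituting (22) into the objective telescopes to −log η.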
11. Let p and q denote the Radon-Nikodym derivatives of probability measures P and Q, respectively, with respect to a common dominating σ-finite measure μ. The Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) between P and Q is defined as [25,50]

D_α(P‖Q) = (1/(α − 1)) log ∫ p^α q^{1−α} dμ    (26)
= (1/(α − 1)) log E[ exp( α ı_{P‖R}(Z) + (1 − α) ı_{Q‖R}(Z) ) ], Z ∼ R,    (27)
(α − 1) D_α(P‖Q) = log E[ exp( α ı_{P‖Q}(Y) ) ], Y ∼ Q,    (28)
(α − 1) D_α(P‖Q) = log E[ exp( (α − 1) ı_{P‖Q}(X) ) ], X ∼ P,    (29)

where (28) and (29) hold if P ≪ Q, and in (27), R is any probability measure that dominates both P and Q.

Note that (28) and (29) state that (t − 1) D_t(X‖Y) and t D_{1+t}(X‖Y) are the cumulant generating functions of the random variables ı_{X‖Y}(Y) and ı_{X‖Y}(X), respectively. The relative entropy is the limit of D_α(P‖Q) as α ↑ 1, so it is customary to let D_1(P‖Q) = D(P‖Q). For any α > 0, D_α(P‖Q) ≥ 0, with equality if and only if P = Q. Furthermore, D_α(P‖Q) is non-decreasing in α and satisfies the skew-symmetric property

(1 − α) D_α(P‖Q) = α D_{1−α}(Q‖P), α ∈ [0, 1].

12. The expressions in the following pair of examples will come in handy in Sections 11 and 12.
Example 4. Let P_i = N(μ_i, σ_i²), i ∈ {0, 1}. Suppose that σ_α² = α σ_1² + (1 − α) σ_0² > 0 and α ∈ (0, 1) ∪ (1, ∞). Then,

D_α(P_0‖P_1) = (α (μ_0 − μ_1)²)/(2 σ_α²) log e + (1/(2(α − 1))) log( σ_0^{2(1−α)} σ_1^{2α} / σ_α² ).

Example 5. Suppose Z is exponentially distributed with unit mean, i.e., its probability density function is e^{−t} 1{t ≥ 0}. For d_0 ≥ d_1 and α such that (1 − α) μ_0 + α μ_1 > 0, we obtain the corresponding Rényi divergence in parametric form.

13. Intimately connected with the notion of Rényi divergence is the tilted probability measure P_α defined, if D_α(P_1‖P_0) < ∞, by

ı_{P_α‖Q}(a) = α ı_{P_1‖Q}(a) + (1 − α) ı_{P_0‖Q}(a) + (1 − α) D_α(P_1‖P_0),    (37)

where Q is any probability measure that dominates both P_0 and P_1. Although (37) is defined in general, our main emphasis is on the range α ∈ (0, 1), in which, as long as P_0 and P_1 are not mutually singular, the tilted probability measure is defined and satisfies P_α ≪ P_0 and P_α ≪ P_1, with corresponding relative informations

ı_{P_α‖P_1}(a) = (α − 1) ı_{P_1‖Q}(a) + (1 − α) ı_{P_0‖Q}(a) + (1 − α) D_α(P_1‖P_0),
ı_{P_α‖P_0}(a) = α ı_{P_1‖Q}(a) − α ı_{P_0‖Q}(a) + (1 − α) D_α(P_1‖P_0),

where we have used the chain rule for P_α ≪ P_0 ≪ Q and P_α ≪ P_1 ≪ Q. Taking a linear combination of these identities we conclude that, for all a ∈ A,

α ı_{P_α‖P_1}(a) + (1 − α) ı_{P_α‖P_0}(a) = (1 − α) D_α(P_1‖P_0).    (42)

Henceforth, we focus particular attention on the case α ∈ (0, 1) since that is the region of interest in the application of Rényi information measures to the evaluation of error exponents in channel coding for codes whose rate is below capacity. In addition, proofs often simplify considerably for α ∈ (0, 1).

14. Much of the interplay between relative entropy and Rényi divergence hinges on the following identity, which appears, without proof, in (3) of [51].

Theorem 2.
Let α ∈ (0, 1) and assume that P_0 and P_1, defined on the same measurable space, are not mutually singular. Then, for any P ≪ P_1 and P ≪ P_0,

α D(P‖P_1) + (1 − α) D(P‖P_0) = D(P‖P_α) + (1 − α) D_α(P_1‖P_0),    (43)

where P_α is the tilted probability measure in (37), and (43) holds regardless of whether the relative entropies are finite. In particular, letting P = P_α,

α D(P_α‖P_1) + (1 − α) D(P_α‖P_0) = (1 − α) D_α(P_1‖P_0).
15. Relative entropy and Rényi divergence are related by the following fundamental variational representation.
Theorem 3. Fix α ∈ (0, 1) and (P_1, P_0) ∈ P_A². Then, the Rényi divergence between P_1 and P_0 satisfies

(1 − α) D_α(P_1‖P_0) = min_{P ∈ P_A} { α D(P‖P_1) + (1 − α) D(P‖P_0) }.    (47)

If P_0 and P_1 are not mutually singular, then the right side of (47) is attained by the tilted measure P_α, and the minimization can be restricted to the subset of probability measures which are dominated by both P_1 and P_0.

Proof.
If P_0 ⊥ P_1, then both sides of (47) are +∞, since there is no probability measure that is dominated by both P_0 and P_1. If P_0 and P_1 are not mutually singular, then minimizing both sides of (43) with respect to P yields (47), as well as the fact that the tilted probability measure attains the minimum therein.
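On a three-point alphabet, the variational representation in Theorem 3 can be checked numerically: the weighted objective evaluated at the tilted measure matches (1 − α) D_α(P_1‖P_0) exactly, other choices of P are worse, and the skew-symmetric property of Rényi divergence can be checked along the way. (A sketch; the distributions are hypothetical.)

```python
import math

def D(P, Q):
    """Relative entropy in nats."""
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

def renyi_div(alpha, P, Q):
    """Renyi divergence of order alpha in nats, full-support pmfs."""
    return math.log(sum(p ** alpha * q ** (1 - alpha) for p, q in zip(P, Q))) / (alpha - 1)

alpha = 0.4
P1 = [0.6, 0.3, 0.1]
P0 = [0.2, 0.5, 0.3]

# Tilted measure: P_alpha proportional to P1^alpha * P0^(1-alpha)
w = [a ** alpha * b ** (1 - alpha) for a, b in zip(P1, P0)]
P_tilted = [wi / sum(w) for wi in w]

lhs = (1 - alpha) * renyi_div(alpha, P1, P0)
at_tilted = alpha * D(P_tilted, P1) + (1 - alpha) * D(P_tilted, P0)
```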
The variational representation in (47) was observed in [39] in the finite-alphabet case and, contemporaneously, in full generality in [50]. Unlike Theorem 3, both of those references also deal with α > 1. The function d(α) = (1 − α) D_α(P_1‖P_0), with d(1) = lim_{α↑1} d(α), is concave in α because the right side of (47) is a minimum of affine functions of α.

16. Given random transformations P_{Y|X}: A → B, Q_{Y|X}: A → B, and a probability measure P_X ∈ P_A on the input space, the conditional relative entropy is

D(P_{Y|X}‖Q_{Y|X}|P_X) = ∫_A D(P_{Y|X=x}‖Q_{Y|X=x}) dP_X(x).    (49)

Analogously, the conditional Rényi divergence is defined as

D_α(P_{Y|X}‖Q_{Y|X}|P_X) = D_α(P_X P_{Y|X} ‖ P_X Q_{Y|X}).    (50)

A word of caution: the notation in (50) conforms to that in [38,45], but it is not universally adopted; e.g., [43] uses the left side of (50) to denote the Rényi generalization of the right side of (49). We can express the conditional Rényi divergence as

D_α(P_{Y|X}‖Q_{Y|X}|P_X) = (1/(α − 1)) log E[ exp( (α − 1) D_α(P_{Y|X=X}‖Q_{Y|X=X}) ) ], X ∼ P_X,    (51)

which holds if P_X P_{Y|X} ≪ P_X Q_{Y|X}. Jensen's inequality applied to (51) results in

D_α(P_{Y|X}‖Q_{Y|X}|P_X) ≤ E[ D_α(P_{Y|X=X}‖Q_{Y|X=X}) ], α ∈ (0, 1),    (53)
D_α(P_{Y|X}‖Q_{Y|X}|P_X) ≥ E[ D_α(P_{Y|X=X}‖Q_{Y|X=X}) ], α > 1.    (54)

Nevertheless, an immediate and crucial observation we can draw from (51) is that the unconstrained maximizations of the sides of (53) and of (54) over P_X do coincide: for all α > 0,

max_{P_X} D_α(P_{Y|X}‖Q_{Y|X}|P_X) = max_{P_X} E[ D_α(P_{Y|X=X}‖Q_{Y|X=X}) ].    (55)

17. Conditional Rényi divergence satisfies the following additive decomposition, originally pointed out, without proof, by Sibson [31] in the setting of finite A.
Theorem 4. Given P_X ∈ P_A, Q_Y ∈ P_B, P_{Y|X}: A → B, and α ∈ (0, 1) ∪ (1, ∞), we have

D_α(P_{Y|X}‖Q_Y|P_X) = D_α(P_{Y|X}‖P_{Y[α]}|P_X) + D_α(P_{Y[α]}‖Q_Y).    (57)

Furthermore, with κ_α as in (14),

D_α(P_{Y|X}‖P_{Y[α]}|P_X) = κ_α/(α − 1).    (58)

Proof.
Select an arbitrary probability measure R_Y ∈ P_B that dominates both Q_Y and P_Y and, therefore, P_{Y[α]} too. Letting (X, Z) ∼ P_X × R_Y, we have a chain of identities (59)-(63), where (61) follows from (13), and (62) follows from the chain rule of Radon-Nikodym derivatives applied to P_{Y[α]} ≪ P_Y ≪ R_Y. Then, (58) follows by specializing Q_Y = P_{Y[α]}, and the proof of (57) is complete upon plugging (58) into the right side of (63).
A proof of (57) in the discrete case can be found in Appendix A of [37].
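Theorem 4 can be verified numerically for a binary-input channel: for an arbitrary reference Q_Y, the conditional Rényi divergence splits into the term achieved by the α-response plus the divergence from the α-response to Q_Y. (A sketch; function names and the example channel are ours.)

```python
import math

def renyi_div(alpha, P, Q):
    """Renyi divergence in nats for full-support pmfs."""
    return math.log(sum(p ** alpha * q ** (1 - alpha) for p, q in zip(P, Q))) / (alpha - 1)

def cond_renyi_div(alpha, W, Q, P_X):
    """Conditional Renyi divergence D_alpha(P_{Y|X} || Q_Y | P_X),
    computed as the Renyi divergence between the joint measures."""
    s = sum(P_X[x] * sum(W[x][y] ** alpha * Q[y] ** (1 - alpha) for y in range(len(Q)))
            for x in range(len(P_X)))
    return math.log(s) / (alpha - 1)

alpha = 0.3
P_X = [0.4, 0.6]
W = [[0.8, 0.2], [0.25, 0.75]]          # hypothetical channel
raw = [sum(P_X[x] * W[x][y] ** alpha for x in range(2)) ** (1 / alpha) for y in range(2)]
Y_alpha = [r / sum(raw) for r in raw]   # alpha-response to P_X
Q_Y = [0.5, 0.5]                        # arbitrary reference output distribution

lhs = cond_renyi_div(alpha, W, Q_Y, P_X)
rhs = cond_renyi_div(alpha, W, Y_alpha, P_X) + renyi_div(alpha, Y_alpha, Q_Y)
```

In particular, the decomposition shows that the α-response minimizes the conditional Rényi divergence over output distributions.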
18. For all α > 0, given two inputs (P_X, Q_X) ∈ P_A² and one random transformation P_{Y|X}: A → B, Rényi divergence (and, in particular, relative entropy) satisfies the data processing inequality

D_α(P_X‖Q_X) ≥ D_α(P_Y‖Q_Y),    (64)

where P_X → P_{Y|X} → P_Y and Q_X → P_{Y|X} → Q_Y. The data processing inequality for Rényi divergence was observed by Csiszár [52] in the more general context of f-divergences. More recently, it was stated in [39,50]. Furthermore, given one input P_X ∈ P_A and two random transformations P_{Y|X}: A → B and Q_{Y|X}: A → B, conditioning cannot decrease Rényi divergence:

D_α(P_{Y|X}‖Q_{Y|X}|P_X) ≥ D_α(P_Y‖Q_Y),    (65)

where P_X → P_{Y|X} → P_Y and P_X → Q_{Y|X} → Q_Y. Since D_α(P_{Y|X}‖Q_{Y|X}|P_X) = D_α(P_X P_{Y|X}‖P_X Q_{Y|X}), (65) follows by applying (64) to the deterministic transformation which takes an input pair and outputs its second component. Inequalities (53) and (65) imply the convexity of D_α(P‖Q) in (P, Q) for α ∈ (0, 1].

4. Dependence Measures
In this paper we are interested in three information measures that quantify the dependence between random variables X and Y, with P_X → P_Y|X → P_Y: namely, mutual information and two of its generalizations, the α-mutual information and the Augustin-Csiszár mutual information of order α.

22. Theorem 4 and (72) result in the additive decomposition (77), valid for any Q_Y with D_α(P_Y[α] ‖ Q_Y) < ∞, thereby generalizing the well-known decomposition for mutual information, which, in contrast to (77), is a simple consequence of the chain rule whenever the dependence between X and Y is regular, and of Lemma A1 in general.
23. If α ∈ (0, 1), (47) and (69) result in (81); for α > 1, a proof of (81) is given in [39] for finite alphabets. 24. Unlike I(P_X, P_Y|X), we can express I_α(P_X, P_Y|X) directly in terms of its arguments, without involving the corresponding output distribution or the α-response to P_X. This is most evident in the case of discrete alphabets, in which (76) becomes (82). For example, if X is discrete and H_α(X) denotes the Rényi entropy of order α, then (83) holds for all α > 0. If X and Y are equiprobable binary random variables with P[X ≠ Y] = δ, then, in bits, I_α(X; Y) = 1 − h_α(δ), where h_α(δ) denotes the binary Rényi entropy. 25. In the main region of interest, namely α ∈ (0, 1), we frequently use a different parametrization in terms of ρ > 0, with α = 1/(1 + ρ).
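In the discrete case, (82) takes only a couple of lines of code. The following sketch (illustrative; the function names are ours) evaluates I_α via (82) and checks the binary example just mentioned, I_α(X; Y) = 1 − h_α(δ) bits:

```python
import numpy as np

def i_alpha(Px, W, a):
    """(82): I_a(P_X, P_Y|X) = a/(a-1) log2 sum_y (sum_x P_X(x) W(y|x)^a)^(1/a), in bits."""
    inner = (Px[:, None] * W**a).sum(axis=0) ** (1 / a)
    return a / (a - 1) * np.log2(inner.sum())

def h_alpha(d, a):
    """Binary Rényi entropy of order a, in bits."""
    return np.log2(d**a + (1 - d)**a) / (1 - a)

delta, a = 0.11, 0.4
Px = np.array([0.5, 0.5])                                # equiprobable X
W  = np.array([[1 - delta, delta], [delta, 1 - delta]])  # P[X != Y] = delta
assert abs(i_alpha(Px, W, a) - (1 - h_alpha(delta, a))) < 1e-12
```

Note that, as Item 24 observes, the input distribution enters (82) only directly, never through the output distribution.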

Theorem 5.
For any ρ > 0, we have the upper bound (86).
Just like (53), we will show in Section 7 that (86) becomes an equality upon the unconstrained maximization of both sides. 26. Before introducing the last dependence measure in this section, recall from Definition 7 and (58) that P_Y[α] ≪ P_Y, the α-response (of P_Y|X) to P_X, is defined by (91), where the expectation is with respect to X ∼ P_X. We proceed to define P_Y⟨α⟩ ≪ P_Y, the ⟨α⟩-response (of P_Y|X) to P_X, by means of (92), with X ∼ P_X. Note that P_Y⟨1⟩ = P_Y[1] = P_Y. 27. In the case of discrete alphabets, (92) becomes the implicit equation (93), which coincides with (9.24) in Fano's 1961 textbook [7], with s ← 1 − α, and is also given by Haroutunian in (19) of [22]. For example, if A = B is discrete and Y = X, then P_Y⟨α⟩ = P_X, while P_Y[α]^α(y) = c P_X(y), y ∈ A, for a normalizing constant c. 28. The ⟨α⟩-response satisfies the following identity, which can be regarded as the counterpart of (57) satisfied by the α-response.
Theorem 6. Fix P_X ∈ P_A, P_Y|X: A → B and Q_Y ∈ P_B. Then, (94) holds.

Proof.
For brevity we assume Q_Y ≪ P_Y; otherwise, the proof is similar, adopting a reference measure that dominates both Q_Y and P_Y. The definition of unconditional Rényi divergence in Item 11 implies that we can write (α − 1) times the exponential of the left side of (94) as (95)-(97), where (X, Y) ∼ P_X × P_Y, (96) follows from (92), and (97) follows from the definition of unconditional Rényi divergence in (27).
Taking the expectation with respect to X ∼ P_X of (106)-(108) yields (99) because of Lemma A1 and (105). If α ≥ 1, then Jensen's inequality applied to the right side of (94) results in (98) but with the opposite inequality. Moreover, (107) is reversed and the remainder of the proof holds verbatim.
In the case of finite input alphabets, a different proof of (99) is given in Appendix B of [54]. 29. Introduced in the unpublished dissertation [36] and rescued from oblivion in [32], the Augustin-Csiszár mutual information of order α is defined for α > 0 by (110), where (111) follows from (98) if α ∈ (0, 1], and from the reverse of (99) if α ≥ 1. We conform to the notation in [40], where I^a_α was used to denote the difference between entropy and Arimoto-Rényi conditional entropy. In [32,39,43] the Augustin-Csiszár mutual information of order α is denoted by I_α. In Augustin's original notation [36], I_ρ(P_X) means I^c_{1−ρ}(P_X, P_Y|X), ρ ∈ (0, 1). Independently of [36], Poltyrev [35] introduced a functional (expressed as a maximization over a reverse random transformation) which turns out to be ρ I^c_{1/(1+ρ)}(X; Y) and which he denoted by E_0(ρ, P_X), although in Gallager's notation that corresponds to ρ I_{1/(1+ρ)}(X; Y), as we will see in (233). I^c_0(X; Y) and I^c_∞(X; Y) are defined by taking the corresponding limits. 30. In the discrete case, (110) boils down to (112), which can be juxtaposed with the much simpler expression in (82) for I_α(X; Y), which involves no further optimization. Minimizing the Lagrangian, we can verify that the minimizer in (112) satisfies (93). With (X, Ȳ) ∼ P_X × Q_Y, we have (113)-(114), where the expectations are with respect to X. 31. The respective minimizers of (72) and (110), namely the α-response and the ⟨α⟩-response, are quite different. Most notably, in contrast to Item 7, an explicit expression for P_Y⟨α⟩ is unknown. Instead of defining P_Y⟨α⟩ through (92), [36] defines it, equivalently, as the fixed point of the operator T_α (dubbed the Augustin operator in [43]), which maps the set of probability measures on the output space to itself, where X ∼ P_X.
Although we do not rely on them, Lemma 34.2 of [36] (for α ∈ (0, 1)) and Lemma 13 of [43] (for α > 1) claim that the minimizer in (110), referred to in [43] as the Augustin mean of order α, is unique and is a fixed point of the operator T_α regardless of P_X. Moreover, Lemma 13(c) of [43] establishes that, for α ∈ (0, 1) and finite input alphabets, repeated iterations of the operator T_α with initial argument P_Y[α] converge to P_Y⟨α⟩. 32. It is interesting to contrast the next example with the formulas in Examples 2 and 6.
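Assuming the standard form of the Augustin operator (tilt each P_Y|X=x toward Q via W^α Q^{1−α}, renormalize, and average over P_X), the convergence claimed in Lemma 13(c) of [43] can be illustrated numerically: starting from the α-response, iterating T_α reaches a fixed point that also minimizes E[D_α(P_Y|X ‖ ·)], consistent with (110). A sketch:

```python
import numpy as np

def renyi_div(p, q, a):
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def augustin_operator(Q, Px, W, a):
    """T_a(Q): tilt each row P_{Y|X=x} toward Q, i.e. W(y|x)^a Q(y)^(1-a)
       renormalized, then average the tilted rows over X ~ P_X."""
    T = W**a * Q[None, :]**(1 - a)
    return Px @ (T / T.sum(axis=1, keepdims=True))

rng = np.random.default_rng(2)
Px = rng.dirichlet(np.ones(3))
W  = rng.dirichlet(np.ones(4), size=3)        # P_{Y|X}: 3 inputs, 4 outputs
a  = 0.6
Q = (Px @ W**a) ** (1 / a); Q /= Q.sum()      # initial argument: the alpha-response
for _ in range(5000):                          # Lemma 13(c): iterates converge
    Q = augustin_operator(Q, Px, W, a)
assert np.max(np.abs(Q - augustin_operator(Q, Px, W, a))) < 1e-9   # fixed point
# the fixed point minimizes E_X[D_a(P_{Y|X} || Q_Y)], per (110)
obj = lambda QY: sum(px * renyi_div(w, QY, a) for px, w in zip(Px, W))
assert obj(Q) <= obj(rng.dirichlet(np.ones(4))) + 1e-12
```

The iteration count and channel are arbitrary choices for the illustration.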
This result can be obtained by postulating a zero-mean Gaussian distribution with variance v²_α as P_Y⟨α⟩ and verifying that (92) is indeed satisfied if v²_α is chosen as in (116). The first step is to invoke (32), which yields (119)-(121), in which the abbreviation s² is introduced. Assembling (120) and (121), the right side of (92) becomes (122)-(125), where (124) follows by Gaussian integration, and the marvelous simplification in (125) holds provided that we choose (126). Comparing (122) and (125), we see that (92) is indeed satisfied with Y⟨α⟩ ∼ N(0, v²_α) if v²_α satisfies the quadratic equation (126), whose solution is given in (116)-(118). Invoking (32) and (116), we obtain (127). Beyond its role in evaluating the Augustin-Csiszár mutual information for Gaussian inputs, the Gaussian distribution in (116) has found some utility in the analysis of finite-blocklength fundamental limits for data transmission [55]. 33. This item gives a variational representation of the Augustin-Csiszár mutual information in terms of mutual information and conditional relative entropy (i.e., non-Rényi information measures). As we will see in Section 10, this representation accounts for the role played by the Augustin-Csiszár mutual information in expressing error exponent functions.
Theorem 8. For α ∈ (0, 1), the Augustin-Csiszár mutual information satisfies the variational representation (128) in terms of conditional relative entropy and mutual information, where the minimum is over all random transformations from the input space to the output space.

Proof.
Invoking (47) with (P_1, P_0) ← (P_Y|X=x, Q_Y), we obtain (129)-(131). Averaging over x ∼ P_X, followed by minimization with respect to Q_Y, yields (128) upon recalling (67).
In the finite-alphabet case with α ∈ (0, 1) ∪ (1, ∞), the representation in (128) is implicit in the appendix of [32] and stated explicitly in [39], where it is shown by means of a minimax theorem. This is one of the instances in which the proof is considerably easier for α ∈ (0, 1); we can take the following route to show (128) for α > 1. Neglecting to emphasize its dependence on P_X, denote the function f(Q_Y, R_Y|X). Invoking (47), we obtain (132). Averaging (132) with respect to P_X, followed by minimization over Q_Y, results in (133)-(134), which shows that ≥ holds in (128). If a minimax theorem can be invoked to show equality in (134), then (128) is established for α > 1. For that purpose, note that, for fixed R_Y|X, f(·, R_Y|X) is convex and lower semicontinuous in Q_Y on the set where it is finite. Rewriting it, we can see that f(Q_Y, ·) is upper semicontinuous and concave (if α > 1). A different and considerably more intricate route is taken in Lemma 13(d) of [43], which also gives (128) for α > 1, assuming finite input alphabets. 34. Unlike mutual information, neither I_α(X; Y) = I_α(Y; X) nor I^c_α(X; Y) = I^c_α(Y; X) holds in general.
35. It was shown in Theorem 5.2 of [38] that the α-mutual information satisfies the data processing lemma: if X and Z are conditionally independent given Y, then I_α(Z; X) ≤ min{I_α(Z; Y), I_α(Y; X)}.
37. The convexity/concavity properties of the generalized mutual informations are summarized next.
I(·, P_Y|X) and I^c_α(·, P_Y|X) are concave functions. The same holds for I_α(·, P_Y|X) if α > 1.

(c)
If α ∈ (0, 1), then I(P_X, ·), I_α(P_X, ·) and I^c_α(P_X, ·) are convex functions.
Proof. In general, it holds since (67) is the infimum of linear functions of P_X. The same reasoning applies to the Augustin-Csiszár mutual information in view of (110). For the α-mutual information with α > 1, notice from (51) that D_α(P_Y|X ‖ Q_Y | P_X) is concave in P_X if α > 1; therefore, so is its infimum over Q_Y. (c) The convexity of I(P_X, ·) and I_α(P_X, ·) follows from the convexity of D_α(P ‖ Q) in (P, Q) for α ∈ (0, 1], as we saw in Item 18. To show the convexity of I^c_α(P_X, ·) for α ∈ (0, 1), we apply (169) in Item 45 with P_Y|X = λ P¹_Y|X + (1 − λ) P⁰_Y|X and invoke the convexity of I_α(P_X, ·). Although not used in the sequel, we note, for completeness, that if α ∈ (0, 1) ∪ (1, ∞), [38] (see the corrected version in [41]) shows that exp((1 − 1/α) I_α(·, P_Y|X))/(α − 1) is concave.

5. Interplay between I_α(P_X, P_Y|X) and I^c_α(P_X, P_Y|X)

In this section we study the interplay between the two notions of mutual information of order α and, in particular, various variational representations of these information measures.
38. For given α ∈ (0, 1) ∪ (1, ∞) and P_Y|X: A → B, define Q_X[α], the α-adjunct of P_X (mutually absolutely continuous with P_X), by (152), with κ_α the constant in (14) and P_Y[α] the α-response to P_X. 39. Example 9. Let Y = X + N with X ∼ N(0, σ²_X) independent of N ∼ N(0, σ²_N), and snr = σ²_X/σ²_N. 40. Theorem 10. The ⟨α⟩-response to Q_X[α] is P_Y[α], the α-response to P_X.

Proof.
We just need to verify that (92) is satisfied if we substitute Y⟨α⟩ by Y[α] and, instead of taking the expectation on the right side with respect to X ∼ P_X, take it with respect to X̃ ∼ Q_X[α]. Then, (153)-(156) follow, where (154) holds by change of measure, (155) follows by substitution of (152), and (156) is the same as (13). 41. For given α ∈ (0, 1) ∪ (1, ∞) and P_Y|X: A → B, we define Q_X⟨α⟩, the ⟨α⟩-adjunct of an input probability measure P_X (mutually absolutely continuous with P_X), through (157), where P_Y⟨α⟩ is the ⟨α⟩-response to P_X and υ_α is a normalizing constant ensuring that Q_X⟨α⟩ is a probability measure. According to (9), we must have (158); hence, (159). 42. With the aid of the expression in Example 7, we obtain Example 10. Let Y = X + N with X ∼ N(0, σ²_X) independent of N ∼ N(0, σ²_N), and snr = σ²_X/σ²_N. Then, the ⟨α⟩-adjunct of the input is given by (160), which, in contrast to Q_X[α], has larger variance than σ²_X if α ∈ (0, 1).
43. The following result is the dual of Theorem 10.
Theorem 11. The α-response to Q_X⟨α⟩ is P_Y⟨α⟩, the ⟨α⟩-response to P_X. Therefore, (161) holds.

Proof.
The proof is similar to that of Theorem 10. We just need to verify that we obtain the right side of (92) if, on the right side of (91), we substitute P_X by Q_X⟨α⟩ and P_Y[α] by P_Y⟨α⟩. Letting X̄ ∼ Q_X⟨α⟩, (162)-(164) follow as in the proof of Theorem 10. 44. By recourse to a minimax theorem, the following representation is given for α ∈ (0, 1) ∪ (1, ∞) in the case of finite alphabets in [39], and dropping the restriction on the finiteness of the output space in [43]. As we show, a very simple and general proof is possible for α ∈ (0, 1).
Theorem 12. Fix α ∈ (0, 1), P_X ∈ P_A and P_Y|X: A → B. Then, (165) holds, where the minimum is attained by Q_X[α], the α-adjunct of P_X defined in (152).

Proof.
The variational representations in (81) and (128) result in (165). To show that the minimum is indeed attained by Q_X[α], recall from Theorem 10 that the ⟨α⟩-response to Q_X[α] is P_Y[α]. Therefore, evaluating the term in braces in (165) at Q_X ← Q_X[α] yields, with X̃ ∼ Q_X[α], the chain (166)-(168), where (167) follows from (152) and (168) establishes the claim.

Theorem 13. Fix α ∈ (0, 1), P_X ∈ P_A and P_Y|X: A → B. Then, (169) holds. The maximum is attained by Q_X⟨α⟩, the ⟨α⟩-adjunct of P_X defined by (157).

Proof.
First observe that (165) implies that ≥ holds in (169). Second, the term in braces on the right side of (169), evaluated at Q_X ← Q_X⟨α⟩, becomes

(1 − α) I_α(Q_X⟨α⟩, P_Y|X) − D(P_X ‖ Q_X⟨α⟩)
  = (1 − α) I_α(Q_X⟨α⟩, P_Y|X) + (1 − α) I^c_α(P_X, P_Y|X) + υ_α   (170)
  = (1 − α) I^c_α(P_X, P_Y|X),

where (170) follows by taking the expectation of minus (157) with respect to P_X. Therefore, ≤ also holds in (169) and the maximum is attained by Q_X⟨α⟩, as we wanted to show.
Hinging on Theorem 8, Theorems 12 and 13 are given for α ∈ (0, 1), which is the region of interest in the analysis of error exponents. Whenever (128) holds for α > 1, as in the finite-alphabet case, Theorems 12 and 13 also hold for α > 1.
Notice that, since the definition of Q_X⟨α⟩ involves P_Y⟨α⟩, the fact that it attains the maximum in (169) does not bring us any closer to finding I^c_α(X; Y) for a specific input probability measure P_X. Fortunately, as we will see in Section 8, (169) proves to be the gateway to the maximization of I^c_α(X; Y) in the presence of input-cost constraints. 46. Focusing on the main range of interest, α ∈ (0, 1), we can express (169) as (172)-(175), where we have defined in (176) a function (dependent on α, P_X, and P_Y|X), and ξ_α is the solution to İ(ξ_α) = 1/(1 − α).
Recall that the maxima over the input distribution in (172) and (175) are attained by the ⟨α⟩-adjunct Q_X⟨α⟩ defined in Item 41. 47. At this point it is convenient to summarize the notions of input and output probability measures that we have defined for a given α, random transformation P_Y|X, and input probability measure P_X:
• P_Y: the familiar output probability measure, P_X → P_Y|X → P_Y, defined in Item 5.
• P_Y[α]: the α-response to P_X, defined in Item 7. It is the unique achiever of the minimization in the definition of α-mutual information in (67).
• P_Y⟨α⟩: the ⟨α⟩-response to P_X, defined in Item 26. It is the unique achiever of the minimization in the definition of the Augustin-Csiszár mutual information in (110).
• Q_X[α]: the α-adjunct of P_X, defined in (152). The ⟨α⟩-response to Q_X[α] is P_Y[α]. Furthermore, Q_X[α] achieves the minimum in (165).
• Q_X⟨α⟩: the ⟨α⟩-adjunct of P_X, defined in (157). The α-response to Q_X⟨α⟩ is P_Y⟨α⟩. Furthermore, Q_X⟨α⟩ achieves the maximum in (169).

6. Maximization of I_α(X; Y)
Just like the maximization of mutual information with respect to the input distribution yields the channel capacity (of course, subject to conditions [57]), the maximization of I_α(X; Y) and of I^c_α(X; Y) arises in the analysis of error exponents, as we will see in Section 10. A recent in-depth treatment of the maximization of α-mutual information is given in [45]. As we see most clearly in (82) for the discrete case, when it comes to its optimization, one advantage of I_α(X; Y) over I(X; Y) is that the input distribution does not affect the expression through its influence on the output distribution. 48. The maximization of α-mutual information is facilitated by the following result.

Theorem 14 ([45]). Given α ∈ (0, 1) ∪ (1, ∞), a random transformation P_Y|X: A → B, and a convex set P ⊂ P_A, the following are equivalent:
(a) P*_X ∈ P attains the maximal α-mutual information on P, i.e., I_α(P*_X, P_Y|X) = max_{P ∈ P} I_α(P, P_Y|X) < ∞.
(b) For any P_X ∈ P and any output distribution Q_Y ∈ P_B, (178)-(179) hold, where P*_Y[α] is the α-response to P*_X.
Moreover, if P_Y[α] denotes the α-response to P_X, then D_α(P_Y[α] ‖ P*_Y[α]) ≤ I_α(P*_X, P_Y|X) − I_α(P_X, P_Y|X) < ∞.
Note that, while I_α(·, P_Y|X) may not be maximized by a unique (or, in fact, by any) input distribution, the resulting α-response P*_Y[α] is indeed unique. If P is such that none of its elements attains the maximal I_α, it is known [42,45] that the α-response to any asymptotically optimal sequence of input distributions converges to P*_Y[α]. This is the counterpart of a result by Kemperman [58] concerning mutual information. 49. The following example appears in [45]. Example 11. Let Y = X + N, where N ∼ N(0, σ²_N) is independent of X. Fix α ∈ (0, 1) and P > 0. Suppose that the set P ⊂ P_A of allowable input probability measures consists of those that satisfy the constraint (181). We can readily check that X* ∼ N(0, P) satisfies (181) with equality and, as we saw in Example 2, its α-response is P*_Y[α] = N(0, αP + σ²_N). Theorem 14 establishes that P*_X does indeed maximize the α-mutual information among all the distributions in P, yielding (182) (recall Example 6). Curiously, if, instead of the P defined by the constraint (181), we consider the more conventional P = {X : E[X²] ≤ P}, then the left side of (182) is unknown at present; numerical evidence shows that it can exceed the right side by employing non-Gaussian inputs. 50. Recalling (56) and (178), if P*_X attains the finite maximal unconstrained α-mutual information and its α-response is denoted by P*_Y[α], then

max_X I_α(X; Y) = max_{P ∈ P_A} I_α(P, P_Y|X) = max_{a ∈ A} D_α(P_Y|X=a ‖ P*_Y[α]),   (183)

which requires that P*_X(A*_α) = 1, with

A*_α = { x ∈ A : D_α(P_Y|X=x ‖ P*_Y[α]) = max_{a ∈ A} D_α(P_Y|X=a ‖ P*_Y[α]) }.

For discrete alphabets, this requires that P*_X(x) = 0 whenever x ∉ A*_α, which is tantamount to (184), with equality for all x ∈ A such that P*_X(x) > 0. For finite-alphabet random transformations, this observation is equivalent to Theorem 5.6.5 in [9]. 51. Getting slightly ahead of ourselves, we note that, in view of (128), an important consequence of Theorem 15 below is that, as anticipated in Item 25, the unconstrained maximization of I_α(X; Y) for α ∈ (0, 1) can be expressed in terms of the solution to an optimization problem involving only conventional mutual information and conditional relative entropy, parametrized by ρ ≥ 0.

7. Unconstrained Maximization of I^c_α(X; Y)

52. In view of the fact that it is much easier to determine the α-mutual information than the order-α Augustin-Csiszár information, it would be advantageous to show that the unconstrained maximum of I^c_α(X; Y) equals the unconstrained maximum of I_α(X; Y). In the finite-alphabet setting, in which it is possible to invoke a "minisup" theorem (e.g., see Section 7.1.7 of [59]), Csiszár [32] showed this result for α > 0. The assumption of finite output alphabets was dropped in Theorem 1 of [42], and further generalized in Theorem 3 of the same reference. As we see next, for α ∈ (0, 1), it is possible to give an elementary proof without restrictions on the alphabets.

Theorem 15. For α ∈ (0, 1),

sup_X I^c_α(X; Y) = sup_X I_α(X; Y).   (187)
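For a BSC, where symmetry makes the equiprobable input optimal, the optimality conditions (183)-(184) can be verified exactly: every input in the support attains D_α(P_Y|X=x ‖ P*_Y[α]) = I_α. A numerical sketch of ours (in bits):

```python
import numpy as np

def renyi_div(p, q, a):
    return np.log2(np.sum(p**a * q**(1 - a))) / (a - 1)   # bits

def i_alpha(Px, W, a):
    inner = (Px[:, None] * W**a).sum(axis=0) ** (1 / a)
    return a / (a - 1) * np.log2(inner.sum())              # (82), in bits

delta, a = 0.2, 0.35
W = np.array([[1 - delta, delta], [delta, 1 - delta]])     # BSC(delta)
Px_star = np.array([0.5, 0.5])                             # optimal by symmetry
Qy_star = (Px_star @ W**a) ** (1 / a); Qy_star /= Qy_star.sum()   # alpha-response
Ia = i_alpha(Px_star, W, a)
# (183)/(184): every input in the support attains the maximal divergence
for x in range(2):
    assert abs(renyi_div(W[x], Qy_star, a) - Ia) < 1e-12
# and no other input distribution does better (grid check over P_X = (p, 1-p))
grid = np.linspace(0, 1, 2001)
assert max(i_alpha(np.array([p, 1 - p]), W, a) for p in grid) <= Ia + 1e-12
```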

Proof.
In view of (143), ≥ holds in (187). To show ≤, we assume sup_X I_α(X; Y) < ∞ as, otherwise, there is nothing left to prove. The unconstrained maximization identity in (183) implies

sup_X I_α(X; Y) = sup_{a ∈ A} D_α(P_Y|X=a ‖ P*_Y[α]),   (188)

where P*_Y[α] is the unique α-response to any input that achieves the maximal α-mutual information; if there is no such input, it is the limit of the α-responses to any asymptotically optimal input sequence (Item 48).
Furthermore, if {X_n} is asymptotically optimal for I_α, i.e., lim_{n→∞} I_α(X_n; Y_n) = sup_X I_α(X; Y), then {X_n} is also asymptotically optimal for I^c_α because, for any δ > 0, we can find N such that the chain (189)-(191), culminating in ≥ I_α(X_n; Y_n), holds for all n > N.

8. Maximization of I^c_α(X; Y) Subject to Average Cost Constraints
This section is at the heart of the relevance of Rényi information measures to error exponent functions. 53. Given α ∈ (0, 1), P_Y|X: A → B, a cost function b: A → [0, ∞), and a real scalar θ ≥ 0, the objective is to maximize the Augustin-Csiszár mutual information allowing only those input probability measures that satisfy E[b(X)] ≤ θ, namely, (197). Unfortunately, identity (187) no longer holds when the maximizations over the input probability measure are cost-constrained and, in general, we can only claim (198), i.e., C^c_α(θ) ≥ sup I_α(P_X, P_Y|X), with the supremum over {P_X : E[b(X)] ≤ θ}. A conceptually simple approach to solve for C^c_α(θ) is to (a) postulate an input probability measure P*_X that achieves the supremum in (197); (b) solve for its ⟨α⟩-response P*_Y using (92); and (c) show that (P*_X, P*_Y) is a saddle point of the game with payoff function (199), where Q_Y ∈ P_B and P_X is chosen from the convex subset of P_A of probability measures that satisfy E[b(X)] ≤ θ. Since P*_Y is already known, by definition, to be the ⟨α⟩-response to P*_X, verifying the saddle point is tantamount to showing that B(P_X, P*_Y) is maximized by P*_X among {P_X ∈ P_A : E[b(X)] ≤ θ}. Theorem 1 of [43] guarantees the existence of a saddle point in the case of finite input alphabets. In addition to the fact that it is not always easy to guess the optimal input P*_X (see, e.g., Section 12), the main stumbling block is the difficulty of determining the ⟨α⟩-response to any candidate input distribution, although sometimes this is indeed feasible, as we saw in Example 7. 54. Naturally, Theorem 15 implies (200). If the unconstrained maximization of I^c_α(·, P_Y|X) is achieved by an input distribution X* that satisfies E[b(X*)] ≤ θ, then equality holds in (200), which, in turn, is equal to I^c_α(P*_X, P_Y|X). In that case, the average cost constraint is said to be inactive. For most cost functions and random transformations of practical interest, the cost constraint is active for all θ > 0.
To ascertain whether it is, we simply verify whether there exists an input achieving the right side of (200) which happens to satisfy the constraint. If so, C^c_α(θ) has been found. The same holds if we can find a sequence {X_n} such that E[b(X_n)] ≤ θ and I_α(X_n; Y_n) → sup_X I_α(X; Y). Otherwise, we proceed with the method described below; thus, henceforth, we assume that the cost constraint is active. 55. The approach proposed in this paper to solve for C^c_α(θ) for α ∈ (0, 1) hinges on the variational representation in (172), which allows us to sidestep having to find any ⟨α⟩-response. Note that, once we set out to maximize I^c_α(P_X, P_Y|X) over P = {P_X ∈ P_A : E[b(X)] ≤ θ}, the allowable Q_X in the maximization in (175) range over a ξ-blow-up of P defined by

Γ_ξ(P) = {Q_X ∈ P_A : ∃ P_X ∈ P such that D(P_X ‖ Q_X) ≤ ξ}.   (201)

As we show in Item 56, we can accomplish such an optimization by solving an unconstrained maximization of the sum of the α-mutual information and a term suitably derived from the cost function. 56. It will not be necessary to solve for (176), as our goal is to further maximize (172) over P_X subject to an average cost constraint. The Lagrangian corresponding to the constrained optimization in (197) is given in (202), where on the left side we have omitted, for brevity, the dependence on θ stemming from the last term on the right side. The Lagrange multiplier method (e.g., [60]) implies that, if X* achieves the supremum in (197), then there exists ν* ≥ 0 such that for all P_X on A and all ν ≥ 0, (203) holds. Note from (202) that the right inequality in (203) can only be achieved if (204) holds and, consequently, (205). The pivotal result enabling us to obtain C^c_α(θ) without the need to deal with the Augustin-Csiszár mutual information is the following.

Theorem 16. Given α ∈ (0, 1), ν ≥ 0, P_Y|X: A → B, and b: A → [0, ∞), define the function A_α(ν) as in (206). Then, (207) holds, and

C^c_α(θ) = min_{ν ≥ 0} {ν θ + A_α(ν)}.   (208)
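The precise form of A_α(ν) is the one in (206); the sketch below instead assumes the Lagrangian form suggested by the surrounding discussion, namely I_α plus the exponential average (1/(1−α)) log E[exp(−(1−α) ν b(X))], and checks the dual value min_ν {νθ + A_α(ν)} against the lower bound (198) on a toy binary channel. The channel, cost function, grids, and names are all our own illustrative choices:

```python
import numpy as np

def i_alpha(Px, W, a):
    """alpha-mutual information (82), in nats."""
    inner = (Px[:, None] * W**a).sum(axis=0) ** (1 / a)
    return a / (a - 1) * np.log(inner.sum())

a, theta = 0.5, 0.3                              # order and average-cost budget
W = np.array([[0.9, 0.1], [0.2, 0.8]])           # toy binary channel (our choice)
b = np.array([0.0, 1.0])                         # cost function b(x)

ps   = np.linspace(0, 1, 2001)                   # P_X = (1 - p, p)
Pmat = np.stack([1 - ps, ps], axis=1)
Ia   = np.array([i_alpha(P, W, a) for P in Pmat])

def A(nu):
    # Assumed reading of (206): unconstrained max over P_X of I_alpha plus the
    # exponential average of the cost, (1/(1-a)) log E[exp(-(1-a) nu b(X))].
    pen = np.log(Pmat @ np.exp(-(1 - a) * nu * b)) / (1 - a)
    return np.max(Ia + pen)

C = min(nu * theta + A(nu) for nu in np.linspace(0, 5, 501))   # (208)
feas_max = Ia[ps <= theta].max()                 # here E[b(X)] = p, so p <= theta
assert C >= feas_max - 1e-9                      # consistent with (198)
```

Per Item 58, the minimizing ν* would be pinned down by stationarity, θ matching minus the derivative of the exponential-average term.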
In conclusion, we have shown that the maximization of the Augustin-Csiszár mutual information of order α subject to E[b(X)] ≤ θ boils down to the unconstrained maximization of a Lagrangian consisting of the sum of the α-mutual information and an exponential average of the cost function. Circumventing the need to deal with ⟨α⟩-responses and with the Augustin-Csiszár mutual information of order α leads to a particularly simple optimization, as illustrated in Sections 11 and 12. 57. Theorem 16 solves for the maximal Augustin-Csiszár mutual information of order α under an average cost constraint without having to find either the input probability measure P*_X that attains it or its ⟨α⟩-response P*_Y (using the notation in Item 53); instead, it gives the solution as (209)-(211). Although we are not going to invoke a minimax theorem, with the aid of Theorem 9-(b) we can see that the functional within the inner brackets is concave in P_X; furthermore, if V ∈ (0, 1], then log E[V^ν] is easily seen to be convex in ν with the aid of the Cauchy-Schwarz inequality. Before we characterize the saddle point (ν*, Q*_X) of the game in (215), we note that (P*_X, P*_Y) can be readily obtained from (ν*, Q*_X).
where τ_α is a normalizing constant ensuring that P*_X is a probability measure.

Proof.

(a)
We had already established in Theorem 13 that the maximum on the right side of (210) is achieved by the ⟨α⟩-adjunct of P_X. In the special case ν = ν*, such P_X is P*_X. Therefore, Q*_X, the argument that achieves the maximum in (206) for ν = ν*, is the ⟨α⟩-adjunct of P*_X.
(b) According to Theorem 11, the α-response to Q*_X is the ⟨α⟩-response to P*_X, which is P*_Y by definition. (c) For ν = ν*, P*_X achieves the supremum in (209) and the infimum in (211). Therefore, (216) follows from Theorem 1 with Z = Q*_X and g(·) given by (214) particularized to ν = ν*.
Theorem 18. The saddle point of (215) admits the characterization in (217)-(218).

Proof.
First, we show that the scalar ν* ≥ 0 that minimizes (219) satisfies (217). If we abbreviate V = exp(−(1 − α) b(X*)) ∈ (0, 1], then the dominated convergence theorem yields (220)-(222) for the derivative df/dν. Therefore, (217) is equivalent to ḟ(ν*) = 0, which is all we need on account of the convexity of f(·). To show (218), notice that for all a ∈ A, (223)-(229) hold, where (223) is (216) and (224) is (157). With the same approach, we can postulate, for every ν ≥ 0, an input distribution R^ν_X whose α-response R^ν_Y[α] satisfies (230), where the only condition we place on c_α(ν) is that it not depend on a ∈ A. If this is indeed the case, then the same derivation as in (226)-(229) results in (231), and we determine ν* as the solution to θ = −ċ_α(ν*), in lieu of (217). Sections 11 and 12 illustrate the effortless nature of this approach to solving for A_α(ν). Incidentally, (230) can be seen as the α-generalization of the condition in Problem 8.2 of [48], elaborated later in [61].

9. Gallager's E_0 Functions and the Maximal Augustin-Csiszár Mutual Information
In keeping with Gallager's setting [9], we stick to discrete alphabets throughout this section. 59. In his derivation of an achievability result for discrete memoryless channels, Gallager [8] introduced the function (1), which we repeat for convenience:

E_0(ρ, P_X) = −log Σ_{y∈B} ( Σ_{x∈A} P_X(x) P_Y|X(y|x)^{1/(1+ρ)} )^{1+ρ}.   (232)

Comparing (82) and (232), we obtain

E_0(ρ, P_X) = ρ I_{1/(1+ρ)}(P_X, P_Y|X),   (233)

which, as we mentioned in Section 1, is the observation by Csiszár in [30] that triggered the third phase in the representation of error exponents. Popularized in [9], the E_0 function was employed by Shannon, Gallager and Berlekamp [10] for ρ ≥ 0, and by Arimoto [62] for ρ ∈ (−1, 0) in the derivation of converse results in data transmission, the latter considering rates above capacity, a region in which the error probability increases with blocklength, approaching one at an exponential rate. For the achievability part, [8] showed upper bounds on the error probability involving E_0(ρ, P_X) for ρ ∈ [0, 1]. Therefore, for rates below capacity, the α-mutual information only enters the picture for α ∈ (0, 1). One exception in which Rényi divergence of order greater than 1 plays a role at rates below capacity was found by Sason [63], where a refined achievability result is shown for binary linear codes over output-symmetric channels (a case in which the equiprobable P_X maximizes (233)), as a function of their Hamming weight distribution. Although Gallager did not have the benefit of the insight provided by the Rényi information measures, he did notice certain behaviors of E_0 reminiscent of mutual information. For example, the derivative of (233) with respect to ρ, at ρ ← 0, is equal to I(X; Y). As pointed out by Csiszár in [32], in the absence of cost constraints, Gallager's E_0 function in (232) satisfies

max_{P_X} E_0(ρ, P_X) = ρ max_X I_{1/(1+ρ)}(X; Y),   (234)

in view of (233) and (187).
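The identity (233) is an algebraic consequence of (82) and (232) and is easy to confirm numerically, along with Gallager's observation that the derivative of E_0 at ρ = 0 recovers I(X; Y). A sketch of ours:

```python
import numpy as np

def i_alpha(Px, W, a):
    """alpha-mutual information (82), in nats."""
    inner = (Px[:, None] * W**a).sum(axis=0) ** (1 / a)
    return a / (a - 1) * np.log(inner.sum())

def E0(rho, Px, W):
    """Gallager's function (232):
       E_0 = -log sum_y ( sum_x P_X(x) P_{Y|X}(y|x)^{1/(1+rho)} )^{1+rho}."""
    s = (Px[:, None] * W ** (1 / (1 + rho))).sum(axis=0)
    return -np.log(np.sum(s ** (1 + rho)))

rng = np.random.default_rng(3)
Px = rng.dirichlet(np.ones(4))
W  = rng.dirichlet(np.ones(5), size=4)
for rho in (0.25, 0.5, 1.0, 2.0):
    # the identity (233): E_0(rho, P_X) = rho * I_{1/(1+rho)}(P_X, P_Y|X)
    assert abs(E0(rho, Px, W) - rho * i_alpha(Px, W, 1 / (1 + rho))) < 1e-12
# the derivative of E_0 at rho = 0 recovers the mutual information I(X;Y)
Py = Px @ W
I = np.sum(Px[:, None] * W * np.log(W / Py[None, :]))
assert abs(E0(1e-6, Px, W) / 1e-6 - I) < 1e-3
```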
Recall that Gallager's modified E_0 function in the case of cost constraints is

E_0(ρ, P_X, r, θ) = −log Σ_{y∈B} ( Σ_{x∈A} P_X(x) exp(r b(x) − r θ) P_Y|X(y|x)^{1/(1+ρ)} )^{1+ρ},   (235)

which, like (232), he introduced in order to show an achievability result. Until now, no counterpart to (234) had been found linking cost constraints and (235). This is accomplished in the remainder of this section. 60. In the finite-alphabet case, the following result is useful to obtain a numerical solution for the functional in (206). More importantly, it is relevant to the discussion in Item 61.

Theorem 19.
In the special case of discrete alphabets, the function in (206) is equal to the maximum in (236), where the maximization is over all G: A → [0, ∞) such that

Σ_{a∈A} G(a) exp(−(1 − α) ν b(a)) = 1.   (237)
61. We can now proceed to close the circle between the maximization of the Augustin-Csiszár mutual information subject to average cost constraints (Phase 3 in Section 1) and Gallager's approach (Phase 1 in Section 1).

Theorem 20.
In the discrete-alphabet case, recalling the definitions in (202) and (235), for ρ > 0, max_{P_X} E_0(ρ, P_X, r, θ) equals ρ times the maximum in (242)-(243), where the maximizations are over P_A.

Proof.
Maximizing (235) with respect to the input probability measure yields (245)-(249), where:
• the maximization on the right side of (247) is over all G: A → [0, ∞) that satisfy (237), since that constraint is tantamount to enforcing P_X ∈ P_A on the left side of (247);
• (248) follows from Theorem 19;
• (249) follows from Theorem 16.
The proof of (242) is complete once (244) is invoked to substitute α and ν on the right side of (249). If we now minimize the outer sides of (245)-(249) with respect to r, we obtain (250)-(252), using (205) and (244). On p. 329 of [9], Gallager poses the unconstrained maximization (i.e., over P_X ∈ P_A) of the Lagrangian in (253). Note the apparent discrepancy between the optimizations in (243) and (253): the latter is parametrized by r and γ (in addition to ρ and θ), while the maximization on the right side of (243) does not enforce any average cost constraint. In fact, there is no disparity, since Gallager, loc. cit., finds serendipitously that γ = 0 regardless of r and θ, and, therefore, just one parameter is enough. 62. The raison d'être for Augustin's introduction of I^c_α in [36] was his quest to view Gallager's approach with average cost constraints under the optic of Rényi information measures. Contrasting (232) and (235), and inspired by the fact that, in the absence of cost constraints, (232) satisfies a variational characterization in view of (69) and (233), Augustin [36] dealt, not with (235), but with min_{Q_Y} D_α(P̃_Y|X ‖ Q_Y | P_X), where P̃_Y|X=x = P_Y|X=x exp(r₁ b(x)).
Assuming finite alphabets, Augustin was able to connect this quantity with the maximal I^c_α(X; Y) under cost constraints in an arcane analysis that invokes a minimax theorem. This line of work was continued in Section 5 of [43], which refers to min_{Q_Y} D_α(P̃_Y|X ‖ Q_Y | P_X) as the Rényi-Gallager information. Unfortunately, since P̃_Y|X is not a random transformation, the conditional pseudo-Rényi divergence D_α(P̃_Y|X ‖ Q_Y | P_X) need not satisfy the key additive decomposition in Theorem 4, so the approach of [36,43] fails to establish an identity equating the maximization of Gallager's function (235) with the maximization of the Augustin-Csiszár mutual information, which is what we have accomplished through a crisp and elementary analysis.

10. Error Exponent Functions
The central objects of interest in the error exponent analysis of data transmission through a random transformation P_Y|X: A → B are the functions E_sp(R, P_X) and E_r(R, P_X). Reflecting the three different phases referred to in Section 1, there is no unanimity in the definition of those functions. Following [48], we adopt the standard canonical Phase 2 (Section 1.2) definitions, which are given in Items 63 and 67.
63. If R ≥ 0 and P_X ∈ P_A, the sphere-packing error exponent function is (e.g., (10.19) of [48])

E_sp(R, P_X) = min_{Q_Y|X : I(P_X, Q_Y|X) ≤ R} D(Q_Y|X ‖ P_Y|X | P_X).   (254)

64. As a function of R ≥ 0, the basic properties of (254) for fixed (P_X, P_Y|X) are as follows.
(a) If R ≥ I(P_X, P_Y|X), then E_sp(R, P_X) = 0; (b) if R < I(P_X, P_Y|X), then E_sp(R, P_X) > 0; (c) the infimum of the arguments for which the sphere-packing error exponent function is finite is denoted by R_∞(P_X); (d) on the interval R ∈ (R_∞(P_X), I(P_X, P_Y|X)), E_sp(R, P_X) is convex, strictly decreasing, continuous, and equal to (254) with the constraint satisfied with equality. This implies that, for R belonging to that interval, we can find ρ_R ≥ 0 so that for all r ≥ 0, (255) holds. 65. In view of Theorem 8 and the definition in (254), it is not surprising that E_sp(R, P_X) is intimately related to the Augustin-Csiszár mutual information through the following key identity, (256)-(257).

Proof.
First note that ≥ holds in (256) because, from (128), we obtain, for all ρ ≥ 0,

ρ I^c_{1/(1+ρ)}(X;Y) − ρ R = min_{Q_{Y|X}} { D(Q_{Y|X} ‖ P_{Y|X} | P_X) + ρ (I(P_X, Q_{Y|X}) − R) }    (258)
≤ min_{Q_{Y|X} : I(P_X, Q_{Y|X}) ≤ R} { D(Q_{Y|X} ‖ P_{Y|X} | P_X) + ρ (I(P_X, Q_{Y|X}) − R) }    (259)
≤ E_sp(R, P_X),    (260)

where (260) follows from the definition in (254). To show ≤ in (256) for those R such that 0 < E_sp(R, P_X) < ∞, Property (d) in Item 64 allows us to write

E_sp(R, P_X) = min_{Q_{Y|X}} { D(Q_{Y|X} ‖ P_{Y|X} | P_X) + ρ_R (I(P_X, Q_{Y|X}) − R) }    (261)
= ρ_R I^c_{1/(1+ρ_R)}(X;Y) − ρ_R R ≤ sup_{ρ ≥ 0} { ρ I^c_{1/(1+ρ)}(X;Y) − ρ R },    (262)

where (262) follows from (255).
To determine the region where the sphere-packing error exponent is infinite and show (257), first note that if R < I^c_0(X;Y) = lim_{α↓0} I^c_α(X;Y), then E_sp(R, P_X) = ∞ because, for any ρ ≥ 0, the function in {·} on the right side of (256) satisfies

ρ I^c_{1/(1+ρ)}(X;Y) − ρ R ≥ ρ (I^c_0(X;Y) − R),    (263)-(264)

which grows without bound as ρ → ∞, where (264) follows from the monotonicity of I^c_α(X;Y) in α we saw in (143). Conversely, if I^c_0(X;Y) < R < ∞, there exists α ∈ (0, 1) such that I^c_α(X;Y) < R, which implies that in the minimization we may restrict to those Q_{Y|X} such that I(P_X, Q_{Y|X}) ≤ R, and consequently, I^c_α(X;Y) ≥ (α/(1−α)) E_sp(R, P_X). Therefore, to avoid a contradiction, we must have E_sp(R, P_X) < ∞. The remaining case is I^c_0(X;Y) = ∞. Again, the monotonicity of the Augustin-Csiszár mutual information implies that I^c_α(X;Y) = ∞ for all α > 0. So, (128) prescribes D(Q_{Y|X} ‖ P_{Y|X} | P_X) = ∞ for any Q_{Y|X} such that I(P_X, Q_{Y|X}) < ∞. Therefore, E_sp(R, P_X) = ∞ for all R ≥ 0, as we wanted to show.
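The identity (256) lends itself to a quick numerical check. The following sketch is not part of the paper: it assumes a binary symmetric channel with equiprobable input, for which the Augustin-Csiszár mutual information coincides with the α-mutual information, so that ρ I^c_{1/(1+ρ)}(X;Y) equals Gallager's E_0(ρ). It evaluates the Haroutunian form (254) by brute-force grid search over binary-input/binary-output transformations Q_{Y|X} and compares it with the supremum over ρ on the right side of (256):

```python
import numpy as np

delta, R = 0.1, 0.2                        # BSC crossover probability; rate in nats

def dkl(p, q):
    """Binary relative entropy d(p||q) in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return p*np.log(p/q) + (1 - p)*np.log((1 - p)/(1 - q))

# Haroutunian form (254): minimize D(Q_{Y|X}||P_{Y|X}|P_X) subject to I(P_X,Q_{Y|X}) <= R,
# with equiprobable P_X, Q_{Y|X=0} = Bernoulli(q0), Q_{Y|X=1} = Bernoulli(q1).
g = np.linspace(1e-3, 1 - 1e-3, 600)
q0, q1 = np.meshgrid(g, g)
py = 0.5*(q0 + q1)                         # output distribution under Q
I = 0.5*dkl(q0, py) + 0.5*dkl(q1, py)      # I(P_X, Q_{Y|X})
D = 0.5*dkl(q0, delta) + 0.5*dkl(q1, 1 - delta)
esp_grid = D[I <= R].min()

# Right side of (256): sup over rho of rho*I^c_{1/(1+rho)} - rho*R = sup E0(rho) - rho*R
rho = np.linspace(0, 10, 4001)
E0 = rho*np.log(2) - (1 + rho)*np.log((1 - delta)**(1/(1 + rho)) + delta**(1/(1 + rho)))
legendre = np.max(E0 - rho*R)

assert esp_grid > 0.01 and abs(esp_grid - legendre) < 0.01
```

Both evaluations agree up to the grid resolution (about 0.04 nats for these parameters), and both vanish as R approaches the capacity log 2 − h(0.1) ≈ 0.368 nats.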
Augustin [36] provided lower bounds on error probability for codes of type P_X as a function of I^c_α(X;Y) but did not state (256); neither did Csiszár in [32], as he was interested in a non-conventional parametrization (generalized cutoff rates) of the reliability function. As pointed out on p. 5605 of [64], the ingredients for the proof of (256) were already present in the hint of Problem 23 of Section II.5 of [24]. In the discrete case, an exponential lower bound on error probability for codes with constant composition P_X is given as a function of I^c_{1/(1+ρ)}(P_X, P_{Y|X}) in [44,64]. As in [64], Nakiboglu [65] gives (256) as the definition of the sphere-packing function and connects it with (254) in Lemma 3 therein, within the context of discrete input alphabets. In the discrete case, (257) is well known (e.g., [66]), and given by (83). As pointed out in [40], max_X I^c_0(X;Y) is the zero-error capacity with noiseless feedback found by Shannon [67], provided there is at least one pair (a_1, a_2) ∈ A² such that P_{Y|X=a_1} ⊥ P_{Y|X=a_2}. Otherwise, the zero-error capacity with feedback is zero.

66. The critical rate, R_c(P_X), is defined as the smallest abscissa at which the convex function E_sp(·, P_X) meets its supporting line of slope −1. According to (256),

E_sp(R_c(P_X), P_X) = I^c_{1/2}(X;Y) − R_c(P_X).    (266)

67. If R ≥ 0 and P_X ∈ P_A, the random-coding error exponent function is (e.g., (10.15) of [48])

E_r(R, P_X) = min_{Q_{Y|X}} { D(Q_{Y|X} ‖ P_{Y|X} | P_X) + [I(P_X, Q_{Y|X}) − R]⁺ },    (267)

with [t]⁺ = max{0, t}.

68. The random-coding error exponent function is determined by the sphere-packing error exponent function through the following relation, illustrated in Figure 1.
Figure 1. E_sp(·, P_X) and E_r(·, P_X).

E_r(R, P_X) = min_{r ≥ R} { E_sp(r, P_X) + r − R }    (268)
= max_{ρ ∈ [0,1]} { ρ I^c_{1/(1+ρ)}(X;Y) − ρ R }    (269)
= E_sp(R, P_X) for R ∈ [R_c(P_X), I(P_X, P_{Y|X})];  I^c_{1/2}(X;Y) − R for R ∈ [0, R_c(P_X)].    (270)

Proof. Identities (268) and (269) are well known (e.g., Lemma 10.4 and Corollary 10.4 in [48]). To show (270), note that (256) expresses E_sp(·, P_X) as the supremum of supporting lines parametrized by their slope −ρ. By definition of the critical rate (for brevity, we do not show explicitly its dependence on P_X), if R ∈ [R_c, I(P_X, P_{Y|X})], then E_sp(R, P_X) can be obtained by restricting the optimization in (256) to ρ ∈ [0, 1]. In that segment of values of R, E_sp(R, P_X) = E_r(R, P_X) according to (269). Moreover, on the interval R ∈ [0, R_c], we have

max_{ρ ∈ [0,1]} { ρ I^c_{1/(1+ρ)}(X;Y) − ρ R } = I^c_{1/2}(X;Y) − R,

where we have used (266) and (269).
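The two branches of (270) can be exhibited numerically. The following sketch is not from the paper: it assumes a binary symmetric channel with equiprobable input, for which ρ I^c_{1/(1+ρ)}(X;Y) reduces to Gallager's E_0(ρ); here the critical rate E_0'(1) is about 0.13 nats, so the two test rates 0.05 and 0.2 fall on the straight-line and curved branches, respectively:

```python
import numpy as np

delta = 0.1                                 # BSC crossover probability
ln2 = np.log(2)

def E0(rho):
    """Gallager's E0 for the BSC with equiprobable input, in nats."""
    return rho*ln2 - (1 + rho)*np.log((1 - delta)**(1/(1 + rho)) + delta**(1/(1 + rho)))

rhos = np.linspace(0, 1, 2001)

def Er(R):
    """Random-coding exponent via (269): max over rho in [0,1] of E0(rho) - rho R."""
    return np.max(E0(rhos) - rhos*R)

R0 = E0(1.0)                                # cutoff rate I^c_{1/2}(X;Y)
C = ln2 + delta*np.log(delta) + (1 - delta)*np.log(1 - delta)   # capacity in nats

assert abs(Er(0.05) - (R0 - 0.05)) < 1e-12  # straight line of slope -1 below R_c
assert Er(0.2) > R0 - 0.2                   # curved (sphere-packing) branch above R_c
assert abs(Er(C)) < 1e-12                   # exponent vanishes at capacity
assert Er(0.05) > Er(0.2) > Er(C)           # decreasing in R
```

The maximizing ρ sticks at 1 for all rates below the critical rate, which is exactly why E_r is linear with slope −1 there.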
The first explicit connection between E_r(R, P_X) and the Augustin-Csiszár mutual information was made by Poltyrev [35], although he used a different form for I^c_α(X;Y), as we discussed in (29).

69. The unconstrained maximizations over the input distribution of the sphere-packing and random-coding error exponent functions are denoted, respectively, by

E_sp(R) = max_{P_X ∈ P_A} E_sp(R, P_X),    (274)
E_r(R) = max_{P_X ∈ P_A} E_r(R, P_X).    (275)

Coding theorems [8-10,22,48] have shown that, when these functions coincide, they yield the reliability function (the optimum speed at which the error probability vanishes with blocklength) as a function of the rate R < max_X I(X;Y). The intuition is that, for the most favorable input distribution, errors occur when the channel behaves so atypically that codes of rate R are not reliable. There are many ways in which the channel may exhibit such behavior, and they are all unlikely, but the most likely among them is the one that achieves (254). It follows from (187), (256) and (270) that (274) and (275) can be expressed as

E_sp(R) = sup_{ρ ≥ 0} { ρ max_X I_{1/(1+ρ)}(X;Y) − ρ R },
E_r(R) = max_{ρ ∈ [0,1]} { ρ max_X I_{1/(1+ρ)}(X;Y) − ρ R }.

Therefore, we can sidestep working with the Augustin-Csiszár mutual information in the absence of cost constraints.

70. Shannon [1] showed that, operating at rates below the maximal mutual information, it is possible to find codes whose error probability vanishes with blocklength; for the converse, instead of error probability, Shannon measured reliability by the conditional entropy of the message given the channel output. That alternative reliability measure, as well as its generalization to the Arimoto-Rényi conditional entropy, is also useful in analyzing the average performance over code ensembles. It turns out (see, e.g., [28,68]) that, below capacity, those conditional entropies also vanish exponentially fast, in much the same way as the error probability, with bounds governed by E_sp(R) and E_r(R), thereby lending additional operational significance to those functions.

71.
We now introduce a cost function b : A → [0, ∞) and a real scalar θ ≥ 0, and reexamine the optimizations in (274) and (275), allowing only those probability measures that satisfy E[b(X)] ≤ θ. With a patent, but unavoidable, abuse of notation, we define

E_sp(R, θ) = sup_{P_X : E[b(X)] ≤ θ} E_sp(R, P_X)  and  E_r(R, θ) = sup_{P_X : E[b(X)] ≤ θ} E_r(R, P_X),

which (279)-(284) express in terms of the cost-constrained maximization of the Augustin-Csiszár mutual information, where (284) follows from (270). In particular, if we define the critical rate and the cutoff rate as the corresponding cost-constrained quantities, respectively, then it follows from (270) that the analogous piecewise relation holds between E_r(R, θ) and E_sp(R, θ). Summarizing, the evaluation of E_sp(R, θ) and E_r(R, θ) can be accomplished by the method proposed in Section 8, at the heart of which is the maximization in (206) involving α-mutual information instead of Augustin-Csiszár mutual information. In Sections 11 and 12, we illustrate the evaluation of the error exponent functions with two important additive-noise examples.

Additive Independent Gaussian Noise; Input Power Constraint
72. We illustrate the procedure in Item 58 by taking Example 6 considerably further.
73. Suppose A " B " R, bpxq " x 2 , and P Y|X"a " N`a, σ 2 N˘. We start by testing whether we can find R ν X P P A such that its α-response satisfies (230). Naturally, it makes sense to try R ν X " N`0, σ 2˘f or some yet to be determined σ 2 . As we saw in Example 6, this choice implies that its α-response is R ν Yrαs " N`0, α σ 2`σ2 N˘. Specializing Example 4, we obtain where (292) follows if we choose the variance of the auxiliary input as In (294) we have introduced an alternative, more convenient, parametrization for the Lagrange multiplier λ " 2 ν σ 2 N log e P p0, αq.

" where we denoted snr " θ Entropy 2021, 23, 199 39 of 52 In accordance with Theorem 16 all that remains is to minimize (297) with respect to ν, or equivalently, with respect to λ. Differentiating (297) with respect to λ, the minimum is achieved at λ˚satisfying whose only valid root (obtained by solving a quadratic equation) is with ∆ defined in (118). So, for α P p0, 1q, (208) becomes Letting α " 1 1`ρ , we obtain 74. Alternatively, it is instructive to apply Theorem 18 to the current Gaussian/quadratic cost setting. Suppose we let QX " N`0, σ˚2˘, where σ˚2 is to be determined. With the aid of the formulas where µ ě 0, and X " N`0, σ 2˘, (217) becomes upon substituting σ 2 Ð σ˚2 and Likewise (218) translates into (291) and (292) with pν, σ 2 q Ð pν˚, σ˚2q, namely, Entropy 2021, 23, 199 40 of 52 Eliminating σ˚2 from (305) by means of (308) results in (299) and the same derivation that led to (300) shows that it is equal to ν˚θ`c α pν˚q. 75. Applying Theorem 17, we can readily find the input distribution, PX, that attains C c α pθq as well as its xαy-response PY (recall the notation in Item 53). According to Example 2, PY , the α-response to QX is Gaussian with zero mean and variance where (309) follows from (308) and (310) follows by using the expression for ∆ in (118). Note from Example 7 that PY is nothing but the xαy-response to N`0, snr σ 2
We can easily verify from Theorem 17 that indeed P*_X = N(0, snr σ²_N), since in this case (216) becomes a condition which can only be satisfied by P*_X = N(0, snr σ²_N) in view of (305). As an independent confirmation, we can verify, after some algebra, that the right sides of (127) and (300) are identical. In fact, in the current Gaussian setting, we could start by postulating that the distribution that maximizes the Augustin-Csiszár mutual information under the second-moment constraint does not depend on α and is given by P*_X = N(0, θ). Its ⟨α⟩-response P*_{Y⟨α⟩} was already obtained in Example 7. Then, an alternative method to find C^c_α(θ), given in Section 6.2 of [43], is to follow the approach outlined in Item 53. To validate the choice of P*_X, we must show that it maximizes B(P_X, P*_{Y⟨α⟩}) (in the notation introduced in (199)) among the subset of P_A which satisfies E[X²] ≤ θ. This follows from the fact that D_α(P_{Y|X=x} ‖ P*_{Y⟨α⟩}) is an affine function of x².
76. Let us now use the result in Item 73 to evaluate, with a novel parametrization, the error exponent functions for the Gaussian channel under an average power constraint. Theorem 23. Let A = B = ℝ, b(x) = x², and P_{Y|X=a} = N(a, σ²_N). Then, for β ∈ [0, 1], the sphere-packing error exponent is given parametrically by (312) and (313). The critical rate and cutoff rate are given by (314) and (315), respectively.
Note that the parametric expression in (312) and (313) (shown in Figure 2) is, in fact, a closed-form expression for E_sp(R, snr σ²_N), since we can invert (313) to express β as a function of R. The random-coding error exponent is given by (326), with the critical rate R_c and cutoff rate R_0 in (314) and (315), respectively. It can be checked that (326) coincides with the expression given by Gallager [9] (p. 340), where he optimizes (235) with respect to ρ and r, but not with respect to P_X, which he just assumes to be P_X = N(0, θ). The expression for R_c in (314) can be found in (7.4.34) of [9]; R_0 in (315) is implicit in p. 340 of [9], and explicit in, e.g., [69].

77. The expression for E_sp(R, θ) in Theorem 23 has more structure than meets the eye. The analysis in Item 73 has shown that E_sp(R, P_X) is maximized over P_X with second moment not exceeding θ by P*_X = N(0, θ), regardless of R ∈ (0, ½ log(1 + snr)). The fact that we have found a closed-form expression for (254) when evaluated at such an input probability measure and P_{Y|X=a} = N(a, σ²_N) is indicative that the minimum therein is attained by a Gaussian random transformation Q*_{Y|X}. This is indeed the case: in comparison with the nominal random transformation P_{Y|X=a} = N(a, σ²_N), the minimizing channel attenuates the input and contaminates it with more powerful noise. Furthermore, invoking (33), we get (333), which is (312). Therefore, Q*_{Y|X} does indeed achieve the minimum in (254) if P_{Y|X=a} = N(a, σ²_N) and P*_X = N(0, θ). So, the most likely error mechanism is the result of atypically large noise strength and an attenuated received signal. Both effects cannot be combined into additional noise variance: there is no σ² > 0 such that Q_{Y|X=a} = N(a, σ²) achieves the minimum in (254).

Additive Independent Exponential Noise; Input-Mean Constraint
This section finds the sphere-packing error exponent for the additive independent exponential noise channel under an input-mean constraint.
78. It is shown in [70,71] that max_{X : E[X] ≤ θ} I(X; X + N) = log(1 + snr), achieved by a mixed random variable whose density combines an atom at 0 with an exponential component. To determine C^c_α(snr ζ), α ∈ (0, 1), we invoke Theorem 18. A sensible candidate for the auxiliary input distribution Q*_X is a mixed random variable with the density in (339), where Γ* ∈ (0, 1) is yet to be determined. This is an attractive choice because its α-response, Q*_{Y[α]}, is particularly simple: exponential with mean α μ = ζ/Γ*, as we can verify using Laplace transforms. Then, if Z is exponential with unit mean, with the aid of Example 5, we can write (341)-(343), so that (218) is satisfied. To evaluate (217), it is useful to note that, if γ > −1, the required exponential integral is available in closed form. Therefore, the left side of (217) specializes, with X* ∼ Q*_X, to an explicit function of Γ*, while the expectation on the right side of (217) is likewise explicit. Therefore, (217) yields (351) with ρ = (1 − α)/α. So, finally, (220), (344) and (345) give the closed-form expression

C^c_α(θ) = snr Γ* log e − log Γ* + (1/(1 − α)) log(α + (1 − α)Γ*).
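The Laplace-transform verification invoked above can be made concrete. As a sketch (with illustrative values of ζ and snr, and using the known form of the capacity-achieving input of [70,71]: an atom at 0 of mass 1/(1+snr) mixed with an exponential of mean ζ(1+snr)), the product of the input and noise transforms equals the transform of an exponential output of mean ζ(1+snr), and h(Y) − h(N) = log(1+snr):

```python
import numpy as np

zeta, snr = 2.0, 3.0
mu = zeta*(1 + snr)          # mean of the exponential output Y
p0 = 1/(1 + snr)             # mass of the atom at 0 in the input

def lt_X(s):
    """Laplace transform of the mixed input: atom at 0 plus Exp(mean mu)."""
    return p0 + (1 - p0)/(1 + mu*s)

def lt_N(s):
    """Laplace transform of exponential noise with mean zeta."""
    return 1/(1 + zeta*s)

for s in [0.05, 0.3, 1.0, 4.0]:
    # the transforms factorize: Y = X + N is exponential with mean mu
    assert abs(lt_X(s)*lt_N(s) - 1/(1 + mu*s)) < 1e-12

# capacity: h(Y) - h(N) = (log(mu) + 1) - (log(zeta) + 1) = log(1 + snr) nats
assert np.isclose(np.log(mu/zeta), np.log1p(snr))
```

The factorization of the transforms is what makes mixed inputs with an atom at 0 the natural candidates throughout this section.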
As in Item 73, we can postulate an auxiliary distribution that satisfies (230) for every ν ≥ 0. This is identical to what we did in (341)-(343), except that now (344) and (345) hold for generic ν and Γ. Then, (351) is the result of solving θ = −ċ_α(ν*), which is, in fact, somewhat simpler than obtaining it through (217).

79. We proceed to get a very simple parametric expression for E_sp(R, θ). Theorem 24. Let A = B = [0, ∞), b(x) = x, and Y = X + N, with N exponentially distributed, independent of X, and E[N] = ζ. Then, under the average cost constraint E[b(X)] ≤ ζ snr, the sphere-packing error exponent admits a parametric description in terms of a parameter η ∈ (0, 1].
Now we go ahead and express both ρ* and Γ* as functions of snr and R exclusively. We may rewrite (357)-(360) as expressions which, when plugged into (361), result in

ρ* = (1 + snr) exp(−R) − 1,

where we have introduced the parameter η in (372). Evidently, the left identity in (372) is the same as (355).
The critical rate and the cutoff rate are obtained by particularizing (360) and (356) to ρ* = 1 and ρ = 1, respectively. This yields (373) and (375). As in (326), the random-coding error exponent is

E_r(R, ζ snr) = E_sp(R, ζ snr) for R ∈ (R_c, log(1 + snr));  R_0 − R for R ∈ [0, R_c],

with the critical rate R_c and cutoff rate R_0 in (373) and (375), respectively. This function is shown along with E_sp(R, ζ snr) in Figure 3 for snr = 3.

80. In parallel to Item 77, we find the random transformation that explains the most likely mechanism to produce errors at every rate R, namely the minimizer of (254) when P_X = P*_X, the maximizer of the Augustin-Csiszár mutual information of order α. In this case, P*_X is not as trivial to guess as in Section 11, but since we already found Q*_X in (339) with Γ = Γ*, we can invoke Theorem 17 to show that the density of the P*_X achieving the maximal order-α Augustin-Csiszár mutual information is

p*_X(t) = (Γ*/(α + (1 − α)Γ*)) δ(t) + (1 − Γ*/(α + (1 − α)Γ*)) (Γ*/(α ζ)) e^{−t Γ*/(α ζ)} 1{t > 0},    (377)

whose mean is, as it should,

(α ζ/Γ*) (1 − Γ*/(α + (1 − α)Γ*)) = ζ snr = θ.
Let Q*_Y be exponential with mean θ + κ, and let Q*_{Y|X=a} have density

q*_{Y|X=a}(t) = (1/κ) e^{−(t−a)/κ} 1{t ≥ a},    (379)

with η as defined in (372). Using Laplace transforms, we can verify that P*_X → Q*_{Y|X} → Q*_Y, where P*_X is the probability measure with density in (377). Let Z be unit-mean exponentially distributed. Writing mutual information as the difference between the output differential entropy and the noise differential entropy, we get

I(P*_X, Q*_{Y|X}) = h((θ + κ) Z) − h(κ Z),    (381)

in view of (363). Furthermore, using (335) and (379),

D(Q*_{Y|X} ‖ P_{Y|X} | P*_X) = log(ζ/κ) + (κ/ζ − 1) log e    (384)
= log η + (1/η − 1) log e,    (385)

where we have used (380) and (354). Therefore, we have shown that Q*_{Y|X} is indeed the minimizer of (254). In this case, the most likely mechanism for errors to happen is that the channel adds independent exponential noise with mean ζ/η instead of the nominal mean ζ. In this respect, the behavior is reminiscent of the exponential timing channel, for which the error exponent is dominated (at least above the critical rate) by an exponential server that is slower than the nominal one [72].
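The divergence computation in (384)-(385) can be spot-checked numerically. The sketch below is illustrative (parameter values are arbitrary): using the shift invariance of the conditional divergence, it reduces to the divergence between two exponential noise densities, compares a quadrature evaluation against the closed form, and works in nats:

```python
import numpy as np

zeta, eta = 2.0, 0.6
kappa = zeta/eta                     # mean of the atypical (slower) exponential noise

t = np.linspace(0.0, 100.0, 1000001)
p = np.exp(-t/kappa)/kappa           # density of Exp(mean kappa)
q = np.exp(-t/zeta)/zeta             # density of Exp(mean zeta)
f = p*np.log(p/q)                    # integrand of D(Exp(kappa)||Exp(zeta))

dt = t[1] - t[0]
D_quad = np.sum(f[1:] + f[:-1])*dt/2         # trapezoidal quadrature

# closed form: D(Exp(a)||Exp(b)) = ln(b/a) + a/b - 1, cf. (384)-(385)
D_formula = np.log(zeta/kappa) + kappa/zeta - 1

assert abs(D_quad - D_formula) < 1e-6
assert np.isclose(D_formula, np.log(eta) + 1/eta - 1)
```

The second assertion is exactly (385): with κ = ζ/η, the divergence per unit input is log η + (1/η − 1) in nats.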

Recap
81. The analysis of the fundamental limits of noisy channels in the regime of vanishing error probability with blocklength growing without bound expresses channel capacity in terms of a basic information measure: the input-output mutual information maximized over the input distribution. In the regime of fixed nonzero error probability, the asymptotic fundamental limit is a function of not only capacity but channel dispersion [73], which is also expressible in terms of an information measure: the variance of the information density obtained with the capacity-achieving distribution. In the regime of exponentially decreasing error probability (at fixed rate below capacity) the analysis of the fundamental limits has gone through three distinct phases. No information measures were involved during the first phase and any optimization with respect to various auxiliary parameters and input distribution had to rely on standard convex optimization techniques, such as Karush-Kuhn-Tucker conditions, which not only are cumbersome to solve in this particular setting, but shed little light on the structure of the solution. The second phase firmly anchored the problem in a large deviations foundation, with the fundamental limits expressed in terms of conditional relative entropy as well as mutual information. Unfortunately, the associated maximinimization in (2) did not immediately lend itself to analytical progress. Thanks to Csiszár's realization of the relevance of Rényi's information measures to this problem, the third phase has found a way to, not only express the error exponent functions as a function of information measures, but to solve the associated optimization problems in a systematic way. While, in the absence of cost constraints, the problem reduces to finding the maximal α-mutual information, cost constraints make the problem much more challenging because of the difficulty in determining the order-α Augustin-Csiszár mutual information. 
Fortunately, thanks to the introduction of an auxiliary input distribution (the ⟨α⟩-adjunct of the distribution that maximizes I^c_α), we have shown that α-mutual information also comes to the rescue in the maximization of the order-α Augustin-Csiszár mutual information in the presence of average cost constraints. We have also finally ended the isolation of Gallager's E_0 function with cost constraints from the representations in Phases 2 and 3. The pursuit of such a link is what motivated Augustin in 1978 to define a generalized mutual information measure. Overall, the analysis has given yet another instance of the benefits of variational representations of information measures, leading to solutions based on saddle points. However, we have steered clear of off-the-shelf minimax theorems and their associated topological constraints. We have worked out two channel/cost-constraint pairs (additive Gaussian noise with quadratic cost, and additive exponential noise with linear cost) that admit closed-form error-exponent functions, most easily expressed in parametric form. Furthermore, in Items 77 and 80 we have illuminated the structure of those closed-form expressions by identifying the anomalous channel behavior responsible for most errors at every given rate. In the exponential noise case, the solution is simply a noisier exponential channel, while in the Gaussian case it is the result of both a noisier Gaussian channel and an attenuated input. These observations prompt the question of whether there might be an alternative general approach that eschews Rényi's information measures to arrive at not only the most likely anomalous channel behavior, but the error exponent functions themselves.

4.
Let (A, F) and (B, G) be measurable spaces, known as the input and output space, respectively. Likewise, A and B are referred to as the input and output alphabets, respectively. The simplified notation P_{Y|X} : A → B denotes a random transformation from (A, F) to (B, G), i.e., for any x ∈ A, P_{Y|X=x}(·) is a probability measure on (B, G), and for any B ∈ G, P_{Y|X=·}(B) is an F-measurable function. 5.
We abbreviate by P_A the set of probability measures on (A, F), and by P_{A×B} the set of probability measures on (A × B, F ⊗ G). If P ∈ P_A and P_{Y|X} : A → B is a random transformation, the corresponding joint probability measure is denoted by P P_{Y|X} ∈ P_{A×B} (or, interchangeably, P_{Y|X} P). The notation P → P_{Y|X} → Q simply indicates that the output marginal of the joint probability measure P P_{Y|X} is denoted by Q ∈ P_B, namely, Q(B) = ∫ P_{Y|X}(B|x) dP_X(x) = E[P_{Y|X}(B|X)], B ∈ G. (11)

6.
If P_X → P_{Y|X} → P_Y and P_{Y|X=a} ≪ P_Y, the information density ı_{X;Y} : A × B → [−∞, ∞) is defined as

ı_{X;Y}(a; b) = ı_{P_{Y|X=a}‖P_Y}(b), (a, b) ∈ A × B.    (12)

Following Rényi's terminology [49], if P_X P_{Y|X} ≪ P_X × P_Y, the dependence between X and Y is said to be regular, and the information density can be defined on (x, y) ∈ A × B. Henceforth, we assume that P_{Y|X} is such that the dependence between its input and output is regular regardless of the input probability measure. For example, if X = Y ∈ ℝ, then P_{Y|X=a}(A) = 1{a ∈ A}, and their dependence is not regular, since for any P_X with non-discrete components P_{XY} is not dominated by P_X × P_Y.
Proof. If P ≪ Q ≪ R, we may invoke the chain rule (7) to decompose

ı_{P‖R}(a) − ı_{Q‖R}(a) = ı_{P‖Q}(a).    (A2)

Then, the result follows by taking expectations of (A2) when a ← X ∼ P. To show that (A1) also holds when P is not dominated by Q, i.e., that the expectation on the left side is ∞, we invoke the Lebesgue decomposition theorem (e.g., p. 384 of [74]), which ensures that we can find α ∈ [0, 1), P_0 ⊥ Q and P_1 ≪ Q, such that P = α P_0 + (1 − α) P_1.

7.
Let α > 0, and P_X → P_{Y|X} → P_Y. The α-response to P_X ∈ P_A is the output probability measure P_{Y[α]} ≪ P_Y with relative information given by (13), regardless of whether the right side is finite, where κ_α is a scalar that guarantees that P_{Y[α]} is a probability measure. Invoking (9), we obtain

κ_α = α log E[ E^{1/α}[ exp(α ı_{X;Y}(X̄; Ȳ)) | Ȳ ] ],  (X̄, Ȳ) ∼ P_X × P_Y.    (14)

For brevity, the dependence of κ_α on P_X and P_{Y|X} is omitted. Jensen's inequality applied to (·)^α results in κ_α ≤ 0 for α ∈ (0, 1) and κ_α ≥ 0 for α > 1. Although the α-response has a long record of services to information theory, this terminology and notation were introduced recently in [45]. Alternative terminology and notation were proposed in [42], which refers to the α-response as the order-α Rényi mean. Note that κ_1 = 0 and the 1-response to P_X is P_Y. If p_{Y[α]} and p_{Y|X} denote the densities of P_{Y[α]} and P_{Y|X} with respect to some common dominating measure, then (13) becomes

p_{Y[α]}(y) ∝ ( E[ p^α_{Y|X}(y|X) ] )^{1/α}.

For α > 1 (resp. α < 1) we can think of the normalized version of p^α_{Y|X} as a random transformation with less (resp. more) "noise" than p_{Y|X}.