Some Families of Jensen-like Inequalities with Application to Information Theory

It is well known that the traditional Jensen inequality is proved by lower bounding the given convex function, f(x), by the tangential affine function that passes through the point (E{X}, f(E{X})), where E{X} is the expectation of the random variable X. While this tangential affine function yields the tightest lower bound among all lower bounds induced by affine functions that are tangential to f, it turns out that when the function f is just part of a more complicated expression whose expectation is to be bounded, the tightest lower bound might belong to a tangential affine function that passes through a point different from (E{X}, f(E{X})). In this paper, we take advantage of this observation by optimizing the point of tangency with regard to the specific given expression in a variety of cases, and we thereby derive several families of inequalities, henceforth referred to as "Jensen-like" inequalities, which, to the best of the author's knowledge, are new. The degree of tightness and the potential usefulness of these inequalities are demonstrated in several application examples related to information theory.


Introduction
As is well known, the Jensen inequality is one of the most fundamental and useful mathematical tools in a variety of fields, including information theory. Interestingly, it includes many other very well-known inequalities, which are important in their own right, as special cases. Among many examples, we mention the Cauchy-Schwarz inequality (which in turn supports uncertainty principles and the Cramér-Rao bound), the Lyapunov inequality, the Hölder inequality, and the inequalities among the harmonic, geometric, and arithmetic means. In the field of information theory, the Jensen inequality stands at the basis of the information inequality (i.e., the non-negativity of the relative entropy), the data processing inequality (which in turn leads to the Fano inequality), and the inequality between conditional and unconditional entropies. Moreover, it plays a central role in the derivation of single-letter formulas in Shannon theory and in the theory of maximum entropy under moment constraints (see, for example, Chapter 12 of [1]).
1. In many cases (such as the one above), the optimal value of the parameter(s) (e.g., the parameter a in the above discussion) can be found in closed form. In other cases, the resulting expressions may not lend themselves to closed-form optimization, and then we have two possibilities: (i) carry out the optimization numerically, or (ii) make an arbitrary choice of a and obtain a valid lower bound, bearing in mind that an educated guess can potentially result in a good bound.
2. Our inequalities provide two types of bounds: (i) bounds that require the calculation of the first two moments (or, equivalently, the first two cumulants) of X, and (ii) bounds that require the calculation of the moment-generating function (MGF) of X and its derivative, or, equivalently, the cumulant-generating function (CGF) of X and its derivative. All of these moments are often easily calculable in closed form, especially in situations where X is given by a sum of independent and identically distributed (i.i.d.) random variables, which is frequently encountered in information-theoretic applications.
3. Most of our derivations extend to convex functions of more than one variable.
4. The classes of Jensen-like inequalities that we consider allow enough flexibility to derive lower bounds on functions that are not necessarily convex, and even on some concave functions, and thereby open the door to another route toward reverse Jensen inequalities. This can be accomplished by representing the given function in one of the categories discussed (e.g., a product of a convex function and a non-negative function, a product of two non-negative convex functions, a composition of a monotone function and a convex function, etc.).
5. We demonstrate the utility of the Jensen-like inequalities in several examples of information-theoretic relevance. We also display numerical results that exemplify the degree of tightness of these bounds.
6. Our Jensen-like inequalities have the desirable property of becoming tighter as X becomes more and more concentrated around its mean, just like the ordinary Jensen inequality.
7. Throughout the paper, we confine ourselves to lower bounds on expectations of expressions that include a convex function f, but it should be understood that they all continue to apply if f is concave and the inequalities are reversed.
8. It should be understood that the classes of Jensen-like inequalities that we derive in this work are just examples that demonstrate the basic underlying idea of optimizing the point of tangency to the given convex function for the specific expression at hand. It is conceivable that the same idea can be applied to many more situations of theoretical and practical interest.
In all forthcoming derivations, it will be assumed that the functions involved are convex (not necessarily strictly) and differentiable. In other words, we will rely on the well-known fact that a differentiable convex function, f(x), is nowhere below the supporting line, ℓ(x) = f(a) + f'(a)(x − a), for every value of the parameter a in the domain of the independent variable, x [25] (p. 69, eq. (3.2)). In order to show that the point of zero-derivative of the lower bound (w.r.t. a) indeed yields a maximum (and not a minimum, etc.) of the lower bound, we will need to further assume that f is twice differentiable, but such an assumption will not limit the applicability of the claimed lower bound, because the lower bound applies to any value of a, including the point of zero-derivative, even if this point cannot be proved to yield the maximum of the lower bound using the standard methods. Similar comments apply when the lower bound depends on more than one parameter.
In the remaining part of this article, each section is devoted to a different class of Jensen-like inequalities, which corresponds to a different form of an expression that includes the convex function, f.

A Product of a Convex Function and a Non-Negative Function
In this section, we focus on lower bounding expressions of the form E{f(X)g(X)}, where f is convex and g is non-negative. Indeed, let f : R → R be a convex function and let g : R → R+ be a non-negative function. Then, for any a ∈ R,

E{f(X)g(X)} ≥ E{[f(a) + f'(a)(X − a)]g(X)} = f(a)·E{g(X)} + f'(a)·E{(X − a)g(X)}.

To find the value of a that maximizes the r.h.s., we equate the derivative to zero and obtain

f''(a)·[E{Xg(X)} − a·E{g(X)}] = 0,

whose solution is readily obtained as

a* = E{Xg(X)}/E{g(X)},

and it is easy to verify that the second derivative at a = a* is −f''(a*)·E{g(X)} < 0, which means that it is a maximum (at least a local one). The resulting lower bound on E{f(X)g(X)} is then given by

E{f(X)g(X)} ≥ f(E{Xg(X)}/E{g(X)}) · E{g(X)}.    (8)

This result extends straightforwardly to the case where X is a vector, provided that f is jointly convex and differentiable in all components of X. In particular, it extends to the case where f and g act on different random variables, X and Y, with a joint distribution:

E{f(X)g(Y)} ≥ f(E{Xg(Y)}/E{g(Y)}) · E{g(Y)}.

We next consider several examples.
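As a simple numerical sanity check, Inequality (8) can be simulated directly. In the following Python sketch, the choices f(x) = −ln x, g(x) = x (as in Example 1 below) and an exponentially distributed X are illustrative assumptions; the sketch compares the exact expectation with the bound at the closed-form optimizer a* and with a crude grid search over a.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.exponential(scale=1.0, size=1_000_000)   # X > 0, an illustrative choice

    f  = lambda t: -np.log(t)        # convex
    fp = lambda t: -1.0 / t          # its derivative
    g  = lambda t: t                 # non-negative

    lhs = np.mean(f(x) * g(x))                       # E{f(X)g(X)}
    Eg, Exg = np.mean(g(x)), np.mean(x * g(x))

    def lower_bound(a):
        # f(a)E{g(X)} + f'(a)E{(X - a)g(X)}, a valid lower bound for every a > 0
        return f(a) * Eg + fp(a) * (Exg - a * Eg)

    a_star = Exg / Eg                                # closed-form optimizer a* = E{Xg(X)}/E{g(X)}
    grid = np.linspace(0.1, 10.0, 1000)              # a numerical search gives the same answer
    print(lhs, lower_bound(a_star), lower_bound(grid).max())   # lhs dominates both bounds

The grid search illustrates the remark made in the Introduction: whenever the stationarity equation has no convenient closed form, the optimization over a can simply be delegated to a one-dimensional numerical search.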
Example 1. Let f(x) = −ln x and g(x) = x, x > 0. Applying Inequality (8),

E{−X ln X} ≥ −E{X}·ln(E{X²}/E{X}) = −E{X} ln E{X} − E{X} ln(1 + Var{X}/[E{X}]²).

Note that the function −x ln x is concave, rather than convex, yet we have here a lower bound (rather than an upper bound) to its expectation, namely, a reversed Jensen inequality. The first term on the right-most side is the (ordinary) Jensen upper bound on E{−X ln X}, and the second term is the gap, which depends not only on the expectation of X but also on its variance, which manifests the fluctuations around E{X}. Clearly, if Var{X} = 0, the second term vanishes, which makes sense, because when X is a degenerate random variable, Jensen's inequality is achieved with equality and there is no gap. This inequality has an immediate application for obtaining a lower bound to the expectation of the empirical entropy of a sequence drawn from a memoryless source, which is relevant in the context of universal source coding [26]. Each term of the empirical entropy is of the form −X ln X, where X = N(u)/N, N(u) is the number of occurrences of a letter u in a randomly drawn N-tuple from a memoryless source, P, with a finite alphabet, U. Clearly, each N(u) is a binomial random variable with N trials and probability of success, P(u). In this case, E{X} = P(u) and Var{X} = P(u)[1 − P(u)]/N. Thus, denoting the entropy and the empirical entropy, respectively, by

H = −Σ_{u∈U} P(u) ln P(u)   and   Ĥ = −Σ_{u∈U} [N(u)/N] ln[N(u)/N],

with the convention that 0 ln 0 = 0, we have:

E{Ĥ} ≥ Σ_{u∈U} {−P(u) ln P(u) − P(u) ln(1 + [1 − P(u)]/[N·P(u)])} ≥ H − (|U| − 1)/N,

where |U| is the cardinality of U and the second inequality follows from ln(1 + x) ≤ x. The use of the ordinary Jensen inequality yields an upper bound rather than a lower bound, E{Ĥ} ≤ H. We conclude that the expected empirical entropy, E{Ĥ}, is sandwiched between H and H − (|U| − 1)/N, which is reasonable because the variance of the empirical probabilities, N(u)/N, decays at the rate of 1/N.
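The sandwich H − (|U| − 1)/N ≤ E{Ĥ} ≤ H is easy to visualize by simulation. The following Python sketch uses an arbitrary four-letter source and N = 50; the specific distribution, sequence length, and number of Monte Carlo runs are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    P = np.array([0.5, 0.3, 0.15, 0.05])        # an illustrative memoryless source
    N, trials = 50, 200_000                      # sequence length and Monte Carlo runs

    H = -np.sum(P * np.log(P))                   # true entropy (nats)

    counts = rng.multinomial(N, P, size=trials)  # N(u) for each letter u, per trial
    freq = counts / N
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(freq > 0, -freq * np.log(freq), 0.0)   # convention 0 ln 0 = 0
    emp_H = terms.sum(axis=1).mean()             # Monte Carlo estimate of E{H_hat}

    print(H - (len(P) - 1) / N, emp_H, H)        # lower bound <= E{H_hat} <= H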

Example 2.
Let s and t be two real numbers whose difference, s − t, is either negative or larger than unity. Now, let g(x) = x^t and f(x) = x^{s−t}, where x > 0. Then,

E{X^s} ≥ [E{X^{t+1}}/E{X^t}]^{s−t} · E{X^t}.

In particular, for t = 1 and s ∉ (1, 2), this becomes

E{X^s} ≥ [E{X²}/E{X}]^{s−1} · E{X},

which is, once again, a bound that depends only on the first two moments of X. For s ∈ (0, 1), the function x^s is concave, and so, this is a reversed version of the Jensen inequality. For s ≤ 0 and s ≥ 2, the function x^s is convex, and so, this is an improved version of the Jensen inequality:

E{X^s} ≥ [E{X}]^s · [1 + Var{X}/[E{X}]²]^{s−1}.

While the first factor, [E{X}]^s, corresponds to the ordinary Jensen inequality, the second factor expresses the improvement, which depends on the relative fluctuation term, Var{X}/[E{X}]². The degree of improvement depends, of course, on the variance of X. If the variance vanishes, there is nothing to improve because the ordinary Jensen inequality becomes an equality. On the other hand, the larger the variance, the larger the gap between the ordinary Jensen bound, [E{X}]^s, and the improved one. Accordingly, this also demonstrates the role of the optimization of the parameter a as opposed to the default choice of a = E{X} of the ordinary Jensen inequality.
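As a quick numerical check of the last display, the two-moment bound can be compared with the exact moment and with the plain Jensen bound [E{X}]^s. In the Python sketch below, the exponential distribution and the particular values of s (all outside (1, 2)) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.exponential(scale=1.0, size=1_000_000)   # illustrative X with E{X} = Var{X} = 1
    m1, var = x.mean(), x.var()

    def moment_bound(s):
        # [E X]^s * [1 + Var X / (E X)^2]^(s-1), valid for s outside (1, 2)
        return m1**s * (1.0 + var / m1**2) ** (s - 1)

    for s in (3.0, 2.5, 0.5):
        print(s, np.mean(x**s), moment_bound(s), m1**s)   # exact E{X^s}, new bound, plain Jensen

For s = 3.0 and s = 2.5 the new bound improves on the plain Jensen lower bound, while for s = 0.5 it provides a lower bound on the expectation of a concave function, i.e., a reversed Jensen inequality.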
To particularize this example even further, consider the problem of randomized guessing under a distribution Q (see, e.g., [27] and many references therein). Then, the probability of a single success in guessing a discrete alphabet random variable, X, given that we know that X = x (but not the guesser), is Q(x). In sequential guessing until the first success, the number of guesses, G, is a geometric RV with parameter p = Q(x), whose mean and variance are 1/p and (1 − p)/p², respectively. For s ∈ (1, 2), the function x^{s−1} is concave, and so the same derivation, with the inequality reversed, bounds the s-th guessing moment, E{G^s}, from above in terms of these two moments.

Example 3.
Let f be an arbitrary convex function and let g(x) = e^{sx}, where s is a given real number. Then, Inequality (8) becomes:

E{f(X)e^{sX}} ≥ e^{ψ(s)} · f(ψ'(s)),

where ψ(s) = ln E{e^{sX}} is the CGF of X and ψ'(s) is its derivative. This gives a lower bound in terms of the CGF of X and its derivative. The ordinary Jensen inequality is obtained as the special case of s = 0, where ψ(0) = 0 and ψ'(0) = E{X}.
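The CGF-based bound of Example 3 is easy to test numerically whenever ψ is available in closed form. The following Python sketch assumes a binomial X (so that ψ(s) = n ln(pe^s + q)) and the illustrative convex choice f(x) = x²; all numerical parameters are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    n, p, s = 20, 0.3, 0.4                         # illustrative binomial X and exponential tilt s
    q = 1.0 - p
    x = rng.binomial(n, p, size=1_000_000)

    f = lambda t: t**2                             # an illustrative convex f

    psi  = n * np.log(p * np.exp(s) + q)           # CGF of X at s
    psi1 = n * p * np.exp(s) / (p * np.exp(s) + q) # its derivative at s

    lhs   = np.mean(f(x) * np.exp(s * x))          # E{f(X) e^{sX}}
    bound = np.exp(psi) * f(psi1)                  # e^{psi(s)} f(psi'(s))
    print(lhs, bound)                              # lhs should dominate the bound
    print(np.mean(f(x)), f(n * p))                 # s = 0 recovers the ordinary Jensen bound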

A Composition of a Monotone Function and a Convex Function
Another family of Jensen-like inequalities corresponds to the need to lower bound an expression of the form E{g[f(X)]}, where f is convex as before and g is a monotonically non-decreasing function. The general idea is to carry out the optimization of the r.h.s. of the following inequality, which holds for every a:

E{g[f(X)]} ≥ E{g[f(a) + f'(a)(X − a)]}.
In the important special case where g(x) = e^x, we have:

E{e^{f(X)}} ≥ E{e^{f(a) + f'(a)(X − a)}} = exp{f(a) − a·f'(a) + ψ(f'(a))},

where ψ(·) is again the CGF of X. The optimal value, a*, of a, is the solution to the equation obtained by equating the derivative of the exponent to zero, i.e.,

a* = ψ'[f'(a*)],

where ψ'(·) and ψ''(·) are the first and the second derivatives of ψ(·), respectively (ψ''(·) is used when verifying that this stationary point is indeed a maximum).
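When the fixed-point equation above has no convenient closed form, the exponent f(a) − a·f'(a) + ψ(f'(a)) can simply be maximized numerically over a. The following Python sketch assumes a Gaussian X (whose CGF is ψ(s) = µs + σ²s²/2) and the illustrative choice f(x) = x², with σ² < 1/2 so that E{e^{X²}} is finite; these particular choices are not taken from the paper.

    import numpy as np

    rng = np.random.default_rng(4)
    mu, sig2 = 0.5, 0.25                      # illustrative X ~ N(mu, sig2), with sig2 < 1/2
    x = rng.normal(mu, np.sqrt(sig2), size=1_000_000)

    f  = lambda t: t**2                       # an illustrative convex f
    fp = lambda t: 2.0 * t
    psi = lambda s: mu * s + 0.5 * sig2 * s**2    # Gaussian CGF

    def exponent(a):
        # f(a) - a f'(a) + psi(f'(a)); its exponential lower-bounds E{exp(f(X))} for every a
        return f(a) - a * fp(a) + psi(fp(a))

    grid = np.linspace(-3.0, 3.0, 6001)
    a_star = grid[np.argmax(exponent(grid))]  # numerical maximizer; solves a = psi'(f'(a))
    print(np.mean(np.exp(f(x))),              # Monte Carlo estimate of E{exp(X^2)}
          np.exp(exponent(a_star)),           # optimized lower bound
          np.exp(exponent(mu)))               # the naive choice a = E{X}, for comparison

The last printed value shows that the default choice a = E{X} is strictly looser than the optimized point of tangency, in line with the main theme of the paper.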

A Product of a Convex Function and a Monotone-Convex Composition
Yet another class of Jensen-like inequalities corresponds to lower bounding the expectation of the product of two functions, where one of them is convex and the other one is a composition of a non-negative, monotonically non-decreasing function and a convex function; that is, f and g are convex and h is monotonically non-decreasing and non-negative. For the case where h(x) = e^x, we end up with a bound that depends on the CGF of X and its derivative. Maximizing with respect to b while a is kept fixed yields b* = ψ'[f'(a)], and we obtain a bound with a single remaining free parameter, a.

Example 5.
Considering the case where f(x) = −ln x and g(x) = x ln x (so that the product under consideration is g(X)·e^{f(X)} = ln X), we may obtain a reversed Jensen-like inequality, namely, a lower bound to the expectation of the concave function ln X. Defining the MGF φ(s) = E{e^{sX}} = e^{ψ(s)}, we obtain a lower bound in terms of the MGF and its derivative (or, equivalently, the CGF and its derivative), which is appealing in cases where X is the sum of i.i.d. random variables.
Accordingly, we now particularize this example further by examining the case where X = Y_1² + . . . + Y_k², with Y_1, . . . , Y_k being independent, zero-mean Gaussian random variables, each with variance σ². The motivation for assessing an expression of the form E{ln(1 + Y_1² + . . . + Y_k²)} is two-fold. The first is that it is useful for bounding the ergodic capacity of the single-input, multiple-output (SIMO) channel, where {Y_i} designates random channel transfer coefficients (see, e.g., [22,28,29] and references therein). The second is that it is relevant for bounding the joint differential entropy associated with the multivariate Cauchy density. Here, (Y_1, . . . , Y_k) are not Gaussian as defined above, but their multivariate Cauchy density can be represented as a continuous mixture of i.i.d. zero-mean Gaussian random variables, where the mixture is taken over all possible variances; see [22] (Example 6) for the details. In this case, X is σ² times a chi-squared random variable with k degrees of freedom, so its MGF, and hence also its CGF and their derivatives, are available in closed form, and the resulting lower bound can be evaluated explicitly. The Jensen upper bound, ln(1 + kσ²), and this lower bound are displayed in Figure 1 for σ² = 1 and k = 1, 2, . . . , 100. As can be seen, the bounds are quite close. Interestingly, the choice α = 1/(kσ²) for the optimization parameter α of the bound yields results that are very close to those obtained with the optimal α.

Another instance of this example is the circularly symmetric complex Gaussian channel whose signal-to-noise ratio (SNR), Z, is a random variable (e.g., due to fading), which is known to both the transmitter and the receiver. The capacity is given by C = E{ln(1 + gZ)}, where g is a certain deterministic gain factor and the expectation is with respect to the randomness of Z. For simplicity, let us assume that Z is distributed exponentially, i.e., its density is f_Z(z) = θe^{−θz}, z ≥ 0, where the parameter θ > 0 is given. In this case, too, the MGF of Z and its derivative are available in closed form, and so the lower bound can be computed explicitly. In Figure 2, we plot this lower bound as a function of θ for g = 5 and compare it to the Jensen upper bound, ln(1 + g/θ) (red curve), and to the lower bound of [22] (Sect. 4.1, Example 1). As can be seen, the lower bound proposed here is considerably tighter, especially for small θ.

Example 6.
Yet another example of this family of Jensen-like inequalities applies to obtaining a lower bound to E{X^t}, where t is an arbitrary real. For a given t, let s ≥ 0 be either larger than 1 − t or smaller than −t, and consider the case where f(x) = x^{t+s}, g(x) = −s ln x and h(x) = e^x, so that f(X)·e^{g(X)} = X^t. Choosing b = ψ'(−s/a), and changing the optimization variable a into α = 1/a, we obtain a lower bound that depends on the MGF of X and its derivative, with α as a free parameter.

More specifically, if X is a binomial random variable with n trials and parameter of success p, then φ(s) = (pe^s + q)^n, where q = 1 − p, and the lower bound admits a closed form. The first factor is (E{X})^t. The second factor tends to unity as n grows, because pe^{−s/(np)} + q ≈ p(1 − s/(np)) + q = 1 − s/n, and so, (pe^{−s/(np)} + q)^n ≈ (1 − s/n)^n ≈ e^{−s}. For t ≥ 1 and t ≤ 0, the function f(x) = x^t is convex, and so, (E{X})^t is the ordinary Jensen lower bound. In this case, the bound is valuable if the multiplicative factor, e^s (pe^{−s/(np)} + q)^n e^{−s(t+s)/(np)} (pe^{−s/(np)} + q)^{t+s}, is larger than unity. If 0 < t < 1, the function f(x) = x^t is concave, and then (E{X})^t is an upper bound. Of course, the parameter s can be optimized, too. Some numerical results for t = 0.5 are depicted in Figure 3. As can be seen, the upper and the lower bounds are fairly close.

Another application of this example is related to estimation theory. Let θ ∈ R and let Y_1, . . . , Y_n be i.i.d., with mean θ and variance σ². Consider the t-th moment of the estimation error of the sample mean, E{|θ̂_n − θ|^t} = E{X^{t/2}}, where θ̂_n = (Y_1 + . . . + Y_n)/n and X = (θ̂_n − θ)², whose mean is σ²/n. The above machinery applies with exponent t/2, and so, with either s ≥ 1 − t/2 or s ≤ −t/2. For α = ζn/σ² (ζ > 0 being a constant), we obtain a lower bound in which, for t ∈ [0, 2], the first factor, σ^t/n^{t/2}, is the Jensen upper bound. The second factor, denoted here µ_t, is the gap between the Jensen upper bound and the proposed lower bound. In Figure 4, we display this factor. The result µ_2 = 1 is expected, because for t = 2 and s = 0, the calculation is trivially exact. Note that the maximization over ζ, for a given s, can be carried out in closed form by equating to zero the partial derivative of ln[(ζe)^s/(1 + 2ζs)^{(t+1)/2+s}] with respect to ζ. The optimal ζ turns out to be equal to 1/(t + 1), independently of s.

Finally, it should be pointed out that this family of Jensen-like bounds opens the door also to lower-bound calculations for expressions of the form E{f(X)/g(X)}, where f is non-negative and convex and g is non-negative and concave. Using the identity 1/s = ∫_0^∞ e^{−st} dt, we have

E{f(X)/g(X)} = ∫_0^∞ E{f(X)·e^{−t·g(X)}} dt,

and we can apply the same ideas as before to the integrand, having the freedom to optimize the bound parameters with possible dependence on t.
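The integral identity behind the last remark is straightforward to confirm numerically. The Python sketch below uses the illustrative choices f(x) = x², g(x) = x, and a uniform X bounded away from zero, and compares the direct expectation with the truncated, discretized integral over t.

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.uniform(1.0, 2.0, size=100_000)        # illustrative X bounded away from zero

    f = lambda u: u**2                             # non-negative convex (illustrative)
    g = lambda u: u                                # non-negative concave (illustrative)

    direct = np.mean(f(x) / g(x))                  # E{f(X)/g(X)}; here it equals E{X}

    t = np.linspace(0.0, 30.0, 1501)               # truncated, discretized integral over t
    inner = np.array([np.mean(f(x) * np.exp(-ti * g(x))) for ti in t])
    via_identity = np.sum(0.5 * (inner[1:] + inner[:-1]) * np.diff(t))   # trapezoid rule

    print(direct, via_identity)                    # the two should nearly coincide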

A Product of Two Non-Negative Convex Functions
The last family of Jensen-like bounds that we present in this work is associated with the product of two non-negative convex functions. Let both f and g be non-negative convex functions of x ≥ 0. Then, for any a,

E{f(X)g(X)} ≥ E{[f(a) + f'(a)(X − a)]g(X)} = [f(a) − a·f'(a)]·E{g(X)} + f'(a)·E{Xg(X)}.

If, in addition, f(a) − a·f'(a) ≥ 0 and f'(a) ≥ 0, then both expectations on the r.h.s. can be further lower bounded using the supporting lines of g at arbitrary points b and c:

E{g(X)} ≥ g(b) + g'(b)(E{X} − b),   E{Xg(X)} ≥ g(c)·E{X} + g'(c)·(E{X²} − c·E{X}).

The optimal b and c are b* = E{X} and c* = E{X²}/E{X}, respectively. Thus,

E{f(X)g(X)} ≥ [f(a) − a·f'(a)]·g(E{X}) + f'(a)·E{X}·g(E{X²}/E{X}).

Let

a* = E{X}·g(E{X²}/E{X})/g(E{X}),

and assume that f(a*) ≥ a*·f'(a*) ≥ 0. Then, a* is the optimal value of a, which yields

E{f(X)g(X)} ≥ f(E{X}·g(E{X²}/E{X})/g(E{X}))·g(E{X}).

More generally, when X and Y are two random variables with a joint distribution, the above derivation easily extends, with E{g(X)} and E{Xg(X)} replaced by E{g(Y)} and E{Xg(Y)}, respectively. If f and g are both concave, rather than convex, then the inequalities are reversed.
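A numerical check of the closing bound is given below; the choices f(x) = e^x, g(x) = x², and a uniform X on [0, 1] are illustrative assumptions (they satisfy the non-negativity and convexity requirements as well as the condition f(a*) ≥ a*·f'(a*) ≥ 0, which the sketch verifies explicitly).

    import numpy as np

    rng = np.random.default_rng(6)
    x = rng.uniform(0.0, 1.0, size=1_000_000)     # illustrative non-negative X

    f  = lambda u: np.exp(u)                      # non-negative convex (illustrative)
    fp = lambda u: np.exp(u)
    g  = lambda u: u**2                           # non-negative convex (illustrative)

    m1, m2 = x.mean(), (x**2).mean()
    a_star = m1 * g(m2 / m1) / g(m1)              # optimal point of tangency for f
    assert f(a_star) >= a_star * fp(a_star) >= 0  # condition required for the bound

    lhs = np.mean(f(x) * g(x))                    # E{f(X) g(X)}
    bound = f(a_star) * g(m1)                     # f(a*) g(E{X})
    print(lhs, bound)                             # lhs should dominate the bound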

Example 7.
Consider again the example of the capacity of the AWGN channel with a random SNR, c(Z) = ln(1 + gZ), and suppose that we wish to bound the variance of c(Z) in order to assess the fluctuations (e.g., for the purpose of bounding the outage probability). Then, obviously,

Var{c(Z)} = E{ln²(1 + gZ)} − [E{ln(1 + gZ)}]².

To upper bound Var{c(Z)}, we may derive an upper bound to E{ln²(1 + gZ)} and a lower bound to E{ln(1 + gZ)}. For the latter, a lower bound was already proposed earlier in Example 5. For the former, we may use the present inequality with the choice f(z) = g(z) = ln(1 + gz), which can easily be shown to satisfy the requirements. We then obtain the following upper bound, which depends merely on the first two moments of Z:

E{ln²(1 + gZ)} ≤ ln(1 + gE{Z}) · ln(1 + gE{Z}·[ln(1 + gE{Z²}/E{Z})/ln(1 + gE{Z})]).
Interestingly, the function ln 2 (1 + gx) is neither convex nor concave, yet our approach offers an upper bound, which is fairly easy to calculate provided that one can compute the first two moments of Z.
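This upper bound is also easy to compare against a Monte Carlo estimate. In the Python sketch below, Z is taken to be exponential with parameter θ (as in the channel instance of Example 5) and the gain is set to 5; these numerical values are illustrative.

    import numpy as np

    rng = np.random.default_rng(7)
    gain, theta = 5.0, 1.0                         # illustrative gain and exponential parameter
    z = rng.exponential(1.0 / theta, size=1_000_000)

    EZ, EZ2 = 1.0 / theta, 2.0 / theta**2          # exact first two moments of Z

    lhs = np.mean(np.log1p(gain * z) ** 2)         # E{ln^2(1 + gZ)} by Monte Carlo
    u = np.log1p(gain * EZ)
    ub = u * np.log1p(gain * EZ * np.log1p(gain * EZ2 / EZ) / u)   # the proposed upper bound
    print(lhs, ub)                                 # lhs should not exceed ub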

Conclusions
In this work, we have revisited the Jensen inequality by taking advantage of the freedom to optimize the choice of the supporting line that is tangential to the given convex function. This optimal choice might be different from the ordinary one when the convex function does not stand alone, but is rather only part of a more complicated expression. Such an expression can sometimes be created in an artificial manner, as in Examples 2, 5 and 6. The resulting bounds depend either on the first two moments of the independent variable, X, or on its MGF and its derivative. Both types of moments often lend themselves to relatively easy calculations. The proposed methodology can be used both for improving on the ordinary Jensen inequality (as in Examples 2 and 4) and for generating lower bounds on expectations of non-convex, or even concave (rather than convex), functions (as in Examples 1, 2, 5 and 7). Several families of Jensen-like inequalities have been derived, along with numerical examples with application to information theory, which also demonstrate the degree of tightness of the inequalities obtained.