Majorization, Csiszár divergence and Zipf-Mandelbrot law

In this paper we show how the Shannon entropy is connected to the theory of majorization. Both are linked to the measure of disorder in a system; however, the theory of majorization usually gives stronger criteria than the entropic inequalities. We give some generalized results for the majorization inequality using the Csiszár f-divergence. Applied to certain convex functions, this divergence yields the majorization results in terms of the Shannon entropy and the Kullback-Leibler divergence. We give several applications by using the Zipf-Mandelbrot law.


Introduction and preliminaries
Well over a century ago, measures were derived for assessing the distance between two models of probability distributions. Most relevant is Boltzmann's [] concept of generalized entropy in physics and thermodynamics (see Akaike [] for a brief review). Shannon [] employed entropy in his famous treatise on communication theory. Kullback and Leibler [] derived an information measure that happened to be the negative of Boltzmann's entropy, now referred to as the Kullback-Leibler (K-L) distance. The motivation for their work was to provide a rigorous definition of information in relation to Fisher's sufficient statistics. The K-L distance has also been called the K-L discrepancy, divergence, information and number; these terms are synonyms, and we use the term 'distance' in the material to follow.
A fundamental result related to the notion of the Shannon entropy is the following inequality (see []): for all positive real numbers $p_i$ and $q_i$ with $\sum_{i=1}^{n} q_i \le \sum_{i=1}^{n} p_i$,
$$\sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} \ge 0.$$
Here 'log' denotes the logarithmic function taken to a fixed base $b > 1$. Equality holds in () if $q_i = p_i$ for all $i$. For details, see []. This result, sometimes called the fundamental lemma of information theory, has extensive applications (see, for example, []). Matić et al. [, , ] and [] continuously worked on Shannon's inequality and related inequalities in probability theory and information science. In [, ] they studied and discussed several aspects of Shannon's inequality, in discrete as well as in integral forms, by presenting upper estimates of the difference between its two sides; applications to bounds in information theory were also given. We now introduce the main mathematical theory explored in the present work, the theory of majorization. It is a powerful and elegant mathematical tool which can be applied to a wide variety of problems, in particular in quantum mechanics. The theory of majorization is closely related to the notions of randomness and disorder: it allows us to compare two probability distributions in order to know which one is more random. Let us now give the most general definition of majorization.
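The gap $\sum_{i=1}^{n} p_i \log(p_i/q_i)$ in the fundamental lemma is easy to evaluate directly; a minimal numerical sketch (the tuples p and q below are arbitrary examples satisfying the sum condition):

```python
import math

def shannon_gap(p, q, base=2):
    # sum_i p_i * log_b(p_i / q_i): nonnegative whenever sum(q) <= sum(p),
    # and zero when q_i = p_i for all i (the fundamental lemma)
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]   # sum(q) = sum(p) = 1, so shannon_gap(p, q) >= 0
```

The same function with `q = p` returns zero, matching the equality case.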
The following definition is given in [].

Majorization: Let $x = (x_1, \ldots, x_n)$ and $y = (y_1, \ldots, y_n)$ be $n$-tuples of real numbers, and let $x_{[1]} \ge \cdots \ge x_{[n]}$ and $y_{[1]} \ge \cdots \ge y_{[n]}$ denote their decreasing rearrangements. Then we say that $y$ is majorized by $x$, or that $x$ majorizes $y$, in symbols $x \succ y$, if we have
$$\sum_{i=1}^{j} x_{[i]} \ge \sum_{i=1}^{j} y_{[i]} \quad \text{for } j = 1, \ldots, n-1, \qquad \text{and} \qquad \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i.$$
Note that () is equivalent to
$$\sum_{i=n-j+1}^{n} x_{[i]} \le \sum_{i=n-j+1}^{n} y_{[i]} \quad \text{for } j = 1, \ldots, n-1.$$
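The partial-sum conditions above are straightforward to test in computations; a small sketch (the tolerance is there only to absorb floating-point rounding in the sums):

```python
def majorizes(x, y, tol=1e-12):
    # x majorizes y: the partial sums of the decreasing rearrangement of x
    # dominate those of y, and the total sums are equal
    xs, ys = sorted(x, reverse=True), sorted(y, reverse=True)
    if abs(sum(xs) - sum(ys)) > tol:
        return False
    sx = sy = 0.0
    for a, b in zip(xs[:-1], ys[:-1]):
        sx += a
        sy += b
        if sx < sy - tol:
            return False
    return True
```

For example, `majorizes([3, 1, 0], [2, 1, 1])` holds, while the reverse does not.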
The following theorem, called the classical majorization theorem, is given in the monograph by Marshall et al. [] (see also []):

Theorem (Classical majorization theorem) Let $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$ be two real $n$-tuples such that $x_i, y_i \in J \subset \mathbb{R}$ for $i = 1, \ldots, n$. Then $x$ majorizes $y$ if and only if for every continuous convex function $f : J \to \mathbb{R}$ the following inequality holds:
$$\sum_{i=1}^{n} f(y_i) \le \sum_{i=1}^{n} f(x_i).$$

The following theorem is a generalization of the classical majorization theorem, known as the weighted majorization theorem, and was proved by Fuchs in [] (see also []):

Theorem (Weighted majorization theorem) Let $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$ be two decreasing real $n$-tuples such that $x_i, y_i \in J$ for $i = 1, \ldots, n$. Let $w = (w_1, \ldots, w_n)$ be a real $n$-tuple such that
$$\sum_{i=1}^{j} w_i y_i \le \sum_{i=1}^{j} w_i x_i \quad \text{for } j = 1, \ldots, n-1, \qquad \sum_{i=1}^{n} w_i y_i = \sum_{i=1}^{n} w_i x_i.$$
Then for every continuous convex function $f : J \to \mathbb{R}$,
$$\sum_{i=1}^{n} w_i f(y_i) \le \sum_{i=1}^{n} w_i f(x_i).$$

In [], they proved that
$$H_b(X) \le \log_b r,$$
which shows that the entropy function $H_b(X)$ reaches its maximum value on the discrete uniform probability distribution. They introduced this idea in a general setting by using the classical majorization theorem for the function $f(x) = x \log x$, which is convex and continuous on $\mathbb{R}_+$. Suppose $X$ and $Y$ are discrete random variables with finite ranges and probability distributions $p = \{p_i\}_{i=1}^{r}$ and $q = \{q_i\}_{i=1}^{r}$ ($\sum_{i=1}^{r} p_i = \sum_{i=1}^{r} q_i = 1$), such that $p \succ q$. Then, by the majorization theorem,
$$\sum_{i=1}^{r} q_i \log q_i \le \sum_{i=1}^{r} p_i \log p_i.$$
Since $p \succ (1/r, \ldots, 1/r)$ holds for every probability distribution $p$, substituting $q = (1/r, \ldots, 1/r)$ gives (). It is generally common to take log with base 2 in the introduced notions, but in our investigations this is not essential.
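The entropy-maximum argument above can be checked numerically: a probability distribution majorizes the uniform one, so applying the convex function $f(x) = x \log x$ termwise bounds the entropy by $\log r$ (the distribution `p` below is an arbitrary example):

```python
import math

def entropy(p, base=2):
    # H_b(p) = -sum_i p_i log_b p_i
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

r = 4
p = [0.6, 0.25, 0.1, 0.05]          # p majorizes (1/r, ..., 1/r)
uniform = [1.0 / r] * r

f = lambda x: x * math.log(x, 2)    # convex and continuous on (0, inf)
lhs = sum(f(ui) for ui in uniform)  # sum_i f(1/r) = -log_2 r
rhs = sum(f(pi) for pi in p)        # majorization theorem: lhs <= rhs
```

Rearranging `lhs <= rhs` gives exactly `entropy(p) <= log2(r)`, with equality for the uniform distribution.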
In Section , we present our main generalized results obtained from majorization inequality by using Csiszár f -divergence and then obtain corollaries in the form of Shannon entropy and the K-L distance. In Section , we give several applications using the Zipf-Mandelbrot law.

Csiszár introduced in [], and then discussed in [], the following notion.

Definition  Let $f : \mathbb{R}_+ \to \mathbb{R}_+$ be a convex function, and let $p := (p_1, \ldots, p_n)$ and $q := (q_1, \ldots, q_n)$ be positive probability distributions. The $f$-divergence functional is
$$I_f(p, q) := \sum_{i=1}^{n} q_i f\left(\frac{p_i}{q_i}\right).$$
It is possible to use non-negative probability distributions in the $f$-divergence functional, by defining
$$f(0) := \lim_{t \to 0^+} f(t), \qquad 0\, f\!\left(\frac{0}{0}\right) := 0, \qquad 0\, f\!\left(\frac{a}{0}\right) := \lim_{t \to 0^+} t\, f\!\left(\frac{a}{t}\right), \quad a > 0.$$

Horváth et al. [] considered the following functional, based on the previous definition.

Definition  Let $J \subset \mathbb{R}$ be an interval, and let $f : J \to \mathbb{R}$ be a function. Let $p := (p_1, \ldots, p_n) \in \mathbb{R}^n$ and $q := (q_1, \ldots, q_n) \in \,]0, \infty[^n$ be such that $p_i / q_i \in J$ for $i = 1, \ldots, n$. Then we denote
$$\hat{I}_f(p, q) := \sum_{i=1}^{n} q_i f\left(\frac{p_i}{q_i}\right).$$

Motivated by the ideas in [] and [], in this paper we study and discuss the majorization results in the form of divergences and entropies. The results below generalize the result given in [], i.e., (). For $n$-tuples $p = (p_1, \ldots, p_n)$ and $q = (q_1, \ldots, q_n)$ with $q_i > 0$ ($i = 1, \ldots, n$), we define
$$\frac{p}{q} := \left(\frac{p_1}{q_1}, \ldots, \frac{p_n}{q_n}\right).$$
The following theorem gives the connection between the Csiszár $f$-divergence and the weighted majorization inequality in the case when one sequence is monotonic.
Theorem  Assume J ⊂ R to be an interval, f : J → R to be a continuous convex function, p i , r i (i = , . . . , n) to be real numbers and q i (i = , . . . , n) to be positive real numbers, such that If f is a continuous concave function, then the reverse inequalities hold in () and ().
Proof (a): We use part (a) of the weighted majorization theorem for one monotone sequence, with the substitutions $x_i := p_i/q_i$, $y_i := r_i/q_i$ and $w_i := q_i > 0$ ($i = 1, \ldots, n$); this yields ().
Part (b) is proved with the same substitutions in part (b) of that theorem.
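A numerical sanity check of part (a); the tuples below are arbitrary examples chosen so that the partial-sum conditions hold and r/q is decreasing:

```python
def I_hat(f, p, q):
    # I^hat_f(p, q) = sum_i q_i * f(p_i / q_i)
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

q = [1.0, 1.0, 1.0]
r = [3.0, 2.0, 1.0]    # r/q = (3, 2, 1): decreasing
p = [4.0, 1.5, 0.5]    # partial sums: 3 <= 4, 5 <= 5.5, and totals 6 = 6

f = lambda x: x * x    # continuous and convex on R
```

Here `I_hat(f, r, q)` is 14 while `I_hat(f, p, q)` is 18.5, in agreement with part (a).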
Theorem  Assume J ⊂ R to be an interval, g : J → R to be a function, such that x → xg(x) (x ∈ J) to be a continuous convex function, p i and r i (i = , . . . , n) to be real numbers and q i (i = , . . . , n) to be positive real numbers satisfying () and () with (a) If r q is decreasing, then The theory of majorization and the notion of entropic measure of disorder are closely related. Based on this fact, the aim of this paper is to look for majorization relations with the connection to entropic inequalities. This was interesting to do for two main reasons. The first one is the fact that the majorization relations are usually stronger than the entropic inequalities, in the sense that they imply these entropic inequalities, but the converse is not true. The second reason is the fact that, when we dispose of majorization relations between two different quantum states, we know that we can transform one of the states into the other using some unitary transformation. The concept of entropy alone would not allow us to prove such a property. The Shannon entropy was introduced in the field of classical information. There are two ways of viewing the Shannon entropy. Suppose we have a random variable X, and we learn its value. In one point of view, the Shannon entropy quantifies the amount of information as regards the value of X (after measurement). In another point of view, the Shannon entropy tells us the amount of uncertainty about the variable of X before we learn its value (before measurement).
We mention two special cases of the previous result. The first case corresponds to the entropy of a discrete probability distribution.
(a) If $r/q$ is a decreasing $n$-tuple and the base of log is greater than 1, then the following estimate related to the Shannon entropy of $q$ holds:
$$\sum_{i=1}^{n} q_i \log\frac{r_i}{q_i} \ge \sum_{i=1}^{n} q_i \log\frac{p_i}{q_i}.$$
If the base of log is between 0 and 1, then the reverse inequality holds in ().
(b) If $p/q$ is an increasing $n$-tuple and the base of log is greater than 1, then the following estimate related to the Shannon entropy of $q$ holds:
$$\sum_{i=1}^{n} q_i \log\frac{p_i}{q_i} \le \sum_{i=1}^{n} q_i \log\frac{r_i}{q_i}.$$
If the base of log is between 0 and 1, then the reverse inequality holds in (). The second case corresponds to the relative entropy, or the K-L distance, between two probability distributions.
Definition  The K-L distance between the positive probability distributions p := (p  , . . . , p n ) and q := (q  , . . . , q n ) is defined by Corollary  Assume J ⊂ R to be an interval, and p i , r i and q i (i = , . . . , n) to be positive real numbers satisfying () and () with (a) If r q is a decreasing n-tuple and the base of log is greater than , then If the base of log is in between  and , then the reverse inequality holds in (). (a) If r q is a decreasing n-tuple and the base of log is greater than , then the following comparison inequality between K-L distance of (r, q) and (p, q) holds: If the base of log is in between  and , then the reverse inequality holds in ().
(b) If p q is an increasing n-tuple and the base of log is greater than , then the following comparison inequality between K-L distance of (r, q) and (p, q) holds: If the base of log is in between  and , then the reverse inequality holds in ().
Proof (a): Substituting $g(x) := \log x$ in part (a) of the preceding theorem gives (). Part (b) follows by substituting $g(x) := \log x$ in part (b) of the same theorem.
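The K-L comparison in part (a) can be verified on concrete probability distributions (chosen here purely for illustration: r/q is decreasing and the partial sums of p dominate those of r):

```python
import math

def kl(p, q, base=2):
    # D(p, q) = sum_i p_i * log_b(p_i / q_i)
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q))

q = [1/3, 1/3, 1/3]
r = [1/2, 1/3, 1/6]      # r/q = (1.5, 1.0, 0.5): decreasing
p = [0.6, 0.25, 0.15]    # 1/2 <= 0.6, 5/6 <= 0.85, totals 1 = 1
```

With these tuples `kl(r, q) <= kl(p, q)`, as the corollary predicts for a base greater than 1.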
Remark  We give the above results when one sequence is monotone by using Theorem , but we can give all the above results when both sequences are monotone by using the weighted majorization theorem, Theorem , for w i >  (i = , . . . , n).

Applications to the Zipf-Mandelbrot entropy
The term Zipfian distribution refers to a distribution of probabilities of occurrence that follows Zipf's law. Zipf's law is an experimental law, not a theoretical one; i.e., it describes an occurrence rather than predicting it from some kind of theory: the observation that, in many natural and man-made phenomena, the probability of occurrence of many random items starts high and tapers off. Thus, a few occur very often while many others occur rarely. The formal definition of this law is $P_n = 1/n^a$, where $P_n$ is the frequency of occurrence of the $n$th ranked item and $a$ is close to 1. Converted to language, this means that the rank of a word (in terms of its frequency) is approximately inversely proportional to its actual frequency, and so produces a hyperbolic distribution. To put Zipf's law another way (see [, ]): $f \cdot r = C$, where $r$ is the rank of a word, $f$ is the frequency of occurrence of that word, and $C$ is a constant (the value of which depends on the subject under consideration). Essentially this shows an inverse proportional relationship between a word's frequency and its frequency rank. Zipf called this curve the 'standard curve'. Texts from natural languages do not, of course, behave with such absolute mathematical precision. They cannot, because, for one thing, any curve representing empirical data from large texts will be a stepped graph, since many non-high-frequency words will share the same frequency. But the overall consensus is that texts match the standard curve significantly well. Li [] writes 'this distribution, also called Zipf's law, has been checked for accuracy for the standard corpus of the present-day English [Kučera and Francis] with very good results.' See Miller [] for a concise summary of the match between actual data and the standard curve.
Zipf also studied the relationship between the frequency of occurrence of a word and its length. In The Psycho-Biology of Language, he stated that 'it seems reasonably clear that shorter words are distinctly more favored in language than longer words.' Apart from the use of this law in information science and linguistics, Zipf's law is used in economics. This distribution in economics is known as Pareto's law, which analyzes the distribution of the wealthiest members of the community []. These two laws are the same in the mathematical sense, but they are applied in different contexts []. The same type of distribution that we have in Zipf's and Pareto's laws, also known as the power law, can also be found in other scientific disciplines, such as physics, biology, earth and planetary sciences, computer science, demography and the social sciences []. In probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable $X$, or just the distribution function of $X$, evaluated at $x$, is the probability that $X$ takes a value less than or equal to $x$:
$$F_X(x) = P(X \le x).$$
The cumulative distribution function is an important application of majorization.
In the case of a continuous distribution, it gives the area under the probability density function; it is also used to specify the distribution of multivariate random variables.
There are various applications of the CDF. For example, in learning to rank, the CDF arises naturally as a probability measure over inequality events of the type $\{X \le x\}$. The joint CDF lends itself to problems that are easily described in terms of inequality events in which statistical dependence relationships also exist among events. Examples of this type of problem include web search and document retrieval [-], predicting ratings of movies [] and predicting multiplayer game outcomes with a team structure []. In contrast to the canonical problems of classification or regression, in learning to rank we are required to learn some mapping from inputs to inter-dependent output variables, so that we may wish to model both stochastic orderings of variable states and statistical dependence relationships between variables.
In the following applications, we use two Zipf-Mandelbrot laws with different parameters. Recall that the Zipf-Mandelbrot law with parameters $n \in \mathbb{N}$, $t \ge 0$ and $s > 0$ assigns to the $i$th ranked item ($i = 1, \ldots, n$) the probability
$$f(i; n, t, s) = \frac{1}{(i+t)^s H_{n,t,s}}, \qquad H_{n,t,s} = \sum_{k=1}^{n} \frac{1}{(k+t)^s}.$$
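A minimal implementation of this law (the parameter values below are arbitrary examples):

```python
def zipf_mandelbrot(n, t, s):
    # f(i; n, t, s) = 1 / ((i + t)^s * H_{n,t,s}), i = 1, ..., n
    h = sum(1.0 / (k + t) ** s for k in range(1, n + 1))
    return [1.0 / ((i + t) ** s * h) for i in range(1, n + 1)]

pmf = zipf_mandelbrot(n=10, t=2.0, s=1.1)
```

The resulting probabilities sum to 1 and are strictly decreasing in the rank i, a fact used in the proofs below.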
Application  Assume $p$ and $r$ to be the Zipf-Mandelbrot laws with parameters $t_1, t_2 \ge 0$ and $s_1, s_2 > 0$, respectively, i.e., $p_i = 1/((i+t_1)^{s_1} H_{n,t_1,s_1})$ and $r_i = 1/((i+t_2)^{s_2} H_{n,t_2,s_2})$, and let $q_i$ ($i = 1, \ldots, n$) be positive real numbers satisfying () and ().
(a) If
$$\frac{(i+t_2)^{s_2}}{(i+1+t_2)^{s_2}} \le \frac{q_{i+1}}{q_i} \quad (i = 1, \ldots, n-1)$$
and the base of log is greater than 1, then
$$\sum_{i=1}^{n} \frac{1}{(i+t_2)^{s_2} H_{n,t_2,s_2}} \log\frac{1}{q_i (i+t_2)^{s_2} H_{n,t_2,s_2}} \le \sum_{i=1}^{n} \frac{1}{(i+t_1)^{s_1} H_{n,t_1,s_1}} \log\frac{1}{q_i (i+t_1)^{s_1} H_{n,t_1,s_1}}.$$
If the base of log is between 0 and 1, then the reverse inequality holds in ().
(b) If
$$\frac{(i+t_1)^{s_1}}{(i+1+t_1)^{s_1}} \ge \frac{q_{i+1}}{q_i} \quad (i = 1, \ldots, n-1)$$
and the base of log is greater than 1, then
$$\sum_{i=1}^{n} \frac{1}{(i+t_1)^{s_1} H_{n,t_1,s_1}} \log\frac{1}{q_i (i+t_1)^{s_1} H_{n,t_1,s_1}} \le \sum_{i=1}^{n} \frac{1}{(i+t_2)^{s_2} H_{n,t_2,s_2}} \log\frac{1}{q_i (i+t_2)^{s_2} H_{n,t_2,s_2}}.$$
If the base of log is between 0 and 1, then the reverse inequality holds in ().
Proof (a): We can easily check that $p_i = 1/((i+t_1)^{s_1} H_{n,t_1,s_1})$ is decreasing in $i = 1, \ldots, n$, and similarly for $r_i$. Now we investigate the behaviour of $r/q$ for $q_i > 0$ ($i = 1, \ldots, n$): taking
$$\frac{r_{i+1}/q_{i+1}}{r_i/q_i} = \frac{(i+t_2)^{s_2}}{(i+1+t_2)^{s_2}} \cdot \frac{q_i}{q_{i+1}} \le 1,$$
we see that $r/q$ is decreasing. So all the assumptions of part (a) of the above corollary hold, and by using () we get ().
(b) If we switch the roles of $r_i$ and $p_i$, then by using () in part (b) of the same corollary we get ().
The following application is a special case of the above result. If the base of log is greater than 1, then (). If the base of log is between 0 and 1, then the reverse inequality holds in ().
(a) If (i+t  ) s  (i++t  ) s  ≤ q i+ q i (i = , . . . , n) and the base of log is greater than , then n i= q i log  q i (i + t  ) s  H n,t  ,s  ≥ n i= q i log  q i (i + t  ) s  H ,t  ,s  .

(   )
If the base of log is in between  and , then the reverse inequality holds in (). (b) If (i+t  ) s  (i++t  ) s  ≥ q i+ q i (i = , . . . , n) and the base of log is greater than , then n i= q i log  q i (i + t  ) s  H n,t  ,s  ≤ n i= q i log  q i (i + t  ) s  H ,t  ,s  .
(   ) If the base of log is in between  and , then the reverse inequality holds in ().
Proof We can prove this by a method similar to that of the previous application, with the substitutions $p_i := 1/((i+t_1)^{s_1} H_{n,t_1,s_1})$ and $r_i := 1/((i+t_2)^{s_2} H_{n,t_2,s_2})$, using the corollary on the Shannon-entropy estimates instead of the corollary on the K-L distance, to get the required results.
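Part (a) of the application above can be sanity-checked numerically. Here q is taken uniform purely for illustration (then the ratio condition in (a) holds automatically, since each Zipf-Mandelbrot law is decreasing), with t_1 = t_2 = 0, s_1 = 2 and s_2 = 1 as example parameters:

```python
import math

def zipf_mandelbrot(n, t, s):
    # f(i; n, t, s) = 1 / ((i + t)^s * H_{n,t,s}), i = 1, ..., n
    h = sum(1.0 / (k + t) ** s for k in range(1, n + 1))
    return [1.0 / ((i + t) ** s * h) for i in range(1, n + 1)]

n = 5
p = zipf_mandelbrot(n, t=0.0, s=2.0)   # (t1, s1) = (0, 2)
r = zipf_mandelbrot(n, t=0.0, s=1.0)   # (t2, s2) = (0, 1)
q = [1.0 / n] * n                      # uniform: ratio condition in (a) holds

lhs = sum(qi * math.log(ri / qi, 2) for ri, qi in zip(r, q))   # r-side sum
rhs = sum(qi * math.log(pi / qi, 2) for pi, qi in zip(p, q))   # p-side sum
```

With s_1 > s_2 the partial sums of p dominate those of r, so the majorization hypotheses hold and `lhs >= rhs`, matching the stated inequality for a base greater than 1.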
The following result is a special case of the above application.
Application  Assume p and r to be the Zipf-Mandelbrot laws with parameters n ∈ {, , . . .}, t  , t  ≥  and s  , s  > , respectively, satisfying (). If the base of log is greater than , then