Zipf's law, power laws, and maximum entropy

Zipf's law, and power laws in general, have attracted and continue to attract considerable attention in a wide variety of disciplines - from astronomy to demographics to software structure to economics to linguistics to zoology, and even warfare. A recent model of random group formation [RGF] attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy applied to a particular choice of cost function. In the present article I argue that the cost function used in the RGF model is in fact unnecessarily complicated, and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy subject only to a single constraint: that the average of the logarithm of the observable quantity is specified.


Introduction
Zipf's law [1][2][3], and power laws in general [4][5][6], have and continue to attract considerable attention in a wide variety of disciplines-from astronomy to demographics to software structure to economics to zoology, and even to warfare [7]. Typically one is dealing with integer-valued observables (numbers of objects, people, cities, words, animals, corpses), with n ∈ {1, 2, 3, . . .}. Sometimes the range of values is allowed to be infinite (at least in principle), sometimes a hard upper bound N is fixed (e.g., total population if one is interested in subdividing a fixed population into sub-classes). Particularly interesting probability distributions are probability laws of the form • Zipf's law: p n ∝ 1/n; • power laws: p n ∝ 1/n z ; • hybrid geometric/power laws: p n ∝ w n /n z .
Specifically, a recent model of random group formation (RGF), see [8], attempts a general explanation of such phenomena based on Jaynes' notion of maximum entropy [9][10][11][12][13] applied to a particular choice of cost function [8]. (For recent related work largely in a demographic context see [14][15][16][17][18]. For related work in a fractal context implemented using an iterative framework see [19].) In the present paper I shall argue that the specific cost function used in the RGF model is in fact unnecessarily complicated (in fact RGF most typically leads to a hybrid geometric/power law, not a pure power law), and that power laws can be obtained in a much simpler way by applying maximum entropy ideas directly to the Shannon entropy itself [20,21] subject only to a single constraint: that the average of the logarithm of the observable quantity is specified. Similarly, I would argue that (at least as long as the main issue one is interested in is 'merely' the minimum requirements for obtaining a power law) the appeal to a fractal framework and the iterative model adopted by Pastor-Satorras and Wagensberg [19] is also unnecessarily complicated.
To place this observation in perspective, I will explore several variations on this theme, modifying both the relevant state space and the number of constraints, and will briefly discuss the relevant special functions of mathematical physics that one encounters (zeta functions, harmonic series, poly-logarithms). I shall also discuss an extremely general Gibbs-like model, and the use of non-Shannon entropies (the Rényi [22] and Tsallis [23] entropies and their generalizations). There is a very definite trade-off between simplicity and generality, and I shall very much focus on keeping the discussion as technically simple as possible, and on identifying the simplest model with minimalist assumptions.

Power laws in infinite state space
Let us define the set of observable quantities to be positive integers n ∈ {1, 2, 3, . . .}, without any a priori upper bound. The maximum entropy approach [9][10][11][12][13] seeks to estimate the probabilities p n by maximizing the Shannon entropy [20,21] S = − n p n ln p n (1) subject to a (small) number of constraints/cost functions-representing our limited state of knowledge regarding the underlying process. For example, the RGF model of Baek et al [8] uses one constraint and one (relatively complicated) cost function [8], in addition to the 'trivial' normalization constraint n p n = 1. Instead, let us consider the single constraint ln n ≡ ∞ n=1 p n ln n = χ.
Let us now maximize the Shannon entropy subject to this constraint. This is best done by introducing a Lagrange multiplier z corresponding to the constraint ln n , plus a second Lagrange multiplier λ corresponding to the 'trivial' normalization constraint, and considering the quantityŜ Of course there is no loss of generality in redefining the Lagrange multiplier λ → Z as follows: Varying with respect to the p n yields the extremality condition −z ln n − ln Z − ln p n = 0 with explicit solution Here ζ (z) is the Riemann zeta function [24][25][26][27], a relatively common and well-known special function, and the condition z > 1 is required to make the sum ∞ n=1 n −z = ζ (z) converge. This is enough to tell you that one will never exactly reproduce Zipf's law (z = 1) in this particular manner, though one can get arbitrarily close. The value of the Lagrange multiplier z (which becomes the exponent in the power law) is determined self-consistently in terms of χ by demanding Then z ∈ (1, ∞) while χ ∈ (0, ∞). For practical calculations it is often best to view the exponent z as the single free parameter and χ(z) as the derived quantity, but this viewpoint can easily be inverted if desired. Near z = 1 we have the analytic estimate where γ denotes Euler's constant. At maximum entropy, imposing the extremality condition and summing, we havê Near z = 1 we have the analytic estimatê For future reference, it is useful to observe that [27, see p 118] where the Stieltjes constants γ m satisfy A better estimate, using Euler-Mclaurin summation, is The quick lesson to take is this: by applying maximum entropy considerations to the single constraint ln n = χ you can get a pure power law with any exponent z > 1. Furthermore, note that the quantity is the geometric mean of the integers {1, 2, 3, . . .} with the exponents weighted by the probabilities p n . So one can just as easily obtain the pure power laws considered above by maximizing the entropy subject to the constraint that this geometric mean takes on a specified value.

Power laws in finite state space
If one desires an exact Zipf law (exponent z = 1, so that p n ∝ 1/n), then because the harmonic series diverges, ∞ n=1 1/n = ∞, it is clear that something in the above formulation needs to change. Perhaps the easiest thing to do is to introduce an explicit maximum value of n, call it N , so that we take the set of observables to be positive integers n ∈ {1, 2, 3, . . . , N }. (Physicists would call this an infra-red cutoff, or large-distance cutoff.) The maximum entropy approach now amounts to considerinĝ Varying with respect to the p n and maximizing again yields the same extremality condition but now implying Here H N (z) is the (reasonably well known) generalized harmonic function [27] Compared with the previous case the only real difference lies in the normalization function. However, because the sum is now always finite, there is no longer any constraint on the value of the exponent z, in fact we can have z ∈ (−∞, ∞). The case z = 1 is Zipf's law, while z = 0 is a uniform distribution, and z < 0 corresponds to an 'inverted hierarchy' where large values are more common than small values. The price paid for this extra flexibility is that the model now has two free parameters, which can be chosen to be z and N . One has the self-consistency constraint It is easy to check that χ is now bounded by χ ∈ (0, ln N ). At maximum entropy we now havê Because this is now a two-parameter model, it will always (naively) be a 'better' fit to observational data than a single-parameter model. Sometimes (for z 1) retreating to this twoparameter model is necessary, but for z > 1 the one-parameter model of the previous section (N → ∞) should be preferred.

Zipf's law in finite state space
If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), then the model reduces as follows: the state space is n ∈ {1, 2, 3, . . . , N } where N is now the only free parameter. Then explicitly forcing z → 1 one considerŝ This is completely equivalent to considerinĝ but writing the quantity to be maximized in this way hides the role of the Shannon entropy.
Varying with respect to the p n and maximizing now yields a (very) slightly different extremality condition and so Here H N is the (ordinary) harmonic number [27] Then Furthermore, at maximum entropŷ Now we have already seen Therefore and at maximum entropŷ Note that you can use this to estimate the size N of the state space one needs to adopt in order to be compatible with the (observed) logarithmic average ln n . Indeed This relates the size of the required state space N to the square of the geometric mean exp ln n .

Hybrid geometric/power models in infinite state space
To generate a hybrid geometric/power law model, similar in output to the RGF model [8], but with considerably simpler input assumptions, simply take both the logarithmic average ln n = χ, and the arithmetic average n = µ, to be specified-and then maximize the Shannon entropy subject to these two constraints, (plus the trivial normalization constraint). That is, introduce two Lagrange multipliers z and w, and maximizê Varying with respect to the p n yields −z ln n + n ln w − ln Z − ln p n = 0 (34) with solution Here the normalizing constant is the well-known poly-logarithm function [24][25][26][27] Li z (w) = ∞ n=1 w n n z , Li 1 (w) = − ln(1 − w). Then while Furthermore, at maximum entropŷ Note that the probability function arising in this model is fully as general as that arising in the RGF model [8], but with what is perhaps a somewhat clearer interpretation. This is because the RGF model uses what may be viewed as an unnecessarily complicated 'cost function', with an unnecessary degeneracy in the parameters. Indeed, from [8] one sees (in their notation, slightly different from current notation) That is So the RGF cost function [8] is simply a linear combination of Shannon entropy, the logarithmic mean ln k , and a redundant constant offset ln N . (Unfortunately the N of Baek et al [8] is not the same as the N used in this paper, the RGF parameter N corresponds to the number of independent realizations of the underlying statistical process one considers-it is the number of simulations, or number of universes in the statistical ensemble.) The additional complexity implicit in the RGF model can to some extent be viewed as a sideeffect of forcing data into discrete ordinal boxes when one does not necessarily have good physical/mathematical/demographic reasons for knowing which particular boxes are 'best', how the boxes are to be assigned ordinal numbers, and where the box boundaries should most profitably be placed. Apart from the issues raised above, one could in addition explicitly restrict the state space to be finite, adding yet another free parameter (M in the language of Baek et al [8], N in the language of this note), but there is little purpose in doing so-the key insight is this: once the data are assigned to ordinal boxes, hybrid geometric/power laws drop out automatically and straightforwardly by maximizing the Shannon entropy subject to the two very simple constraints ln n = χ and n = µ.

Zipf's law: geometric version in infinite state space
If for observational or theoretical reasons one is certain that z = 1 (Zipf's law), but for whatever reason feels a finite state space cutoff N is inappropriate, then a geometric version of Zipf's law can be extracted from the hybrid model. Setting z = 1 the model reduces as follows: the state space is now n ∈ {1, 2, 3, . . .} whilê This is completely equivalent to maximizinĝ but writing the quantity to be maximized in this way hides the role of the Shannon entropy.
Varying with respect to the p n yields − ln n + n ln w − ln Z − ln p n = 0 (46) with solution where now the normalizing constant is This can be viewed as a generalization/modification of the Gibbs distribution where we have explicitly pulled out some of the constraints (the g-type constraints) to make them look powerlaw-like, while the remaining constraints (the f -type constraints) are left in Boltzmann-like form. Then while Furthermore, at maximum entropŷ It is only once one specifies particular choices for the functions f a and g i that the model becomes concrete, and only at that stage might one need to focus on particular special functions of mathematical physics. The model is extremely general-the drawback is that, because it can fit almost anything, it can 'explain' almost anything, and so can predict almost nothing.

Non-Shannon entropies
Shannon's entropy is by far the best motivated of the entropy functions infesting the literature. Without making any particular commitment to the advisability of doing so, we can certainly ask what happens if we apply maximum entropy ideas to non-Shannon entropies (such as the Rényi [22] or Tsallis [23] entropies and their generalizations). Let us define an entropic zeta function by which certainly converges for s 1 and may converge on a wider region. Then where in both cases the Shannon entropy is recovered in the limit a → 0. More generally let us consider a generalized entropy of the form for an arbitrary smooth function f (·). Let us further impose a constraint on the bth moment With power laws being explicitly built in as input into both the generalized entropy and the constraint, it is perhaps not too surprising that we will manage to get power laws dropping out.
One is now interested in maximizinĝ 1 + a)) .
Varying the p n leads to with solution (An overall factor of (1 + a) f (ζ S (1 + a)) simply drops out of the calculation.) If the number of states is finite, then we cannot a priori discard the parameter w, and we have derived a distorted power law. (The probability distribution then interpolates between a pure power law for w = 0 and a uniform distribution for w = ∞.) If on the other hand, the number of states is infinite then normalizability enforces w → 0, and λ factors out. We then have a pure power law If the state space is the positive integers n ∈ {1, 2, 3, . . .} then Z → ζ (−b/a), the Riemann zeta function, and the sum converges only for −b/a > 1. So one of the two parameters (a, b) must be negative. In this situation now requiring both −b − b/a > 1 and −b/a > 1, while at maximum entropŷ So yes, one can also extract power laws from maximum entropy applied to non-Shannon entropies (in particular, the generalized Rényi-Tsallis entropies), but the process is (at best) rather clumsy, and seems an exercise in overkill. Apart from the whole question of whether or not non-Shannon entropies are particularly interesting, one should ask whether the result is particularly useful? This derivation does not seem to be in any way an improvement over the simpler one based directly on the Shannon entropy, so its utility is dubious.

Summary and discussion
The main point of this paper is that power laws (and their variants, including hybrid geometric/power laws) have a very natural and straightforward interpretation in terms of the maximum entropy formalism pioneered by Jaynes. The key to obtaining a pure power law in the simplest possible manner lies in maximizing the Shannon entropy while imposing the simple constraint ln n = χ. Depending on other features of the specific model under consideration, detailed analysis leads to certain of the special functions of mathematics (the Riemann zeta function, generalized harmonic functions, poly-logarithms or even ordinary logarithms), but these are relatively well-known mathematical objects, which are still tolerably simple to deal with. Adding additional features (finite size state space, extra constraints) can (and typically will) modify both the functional form and the normalization constants appearing in the probability distribution. As always there is a trade-off between simplicity and flexibility. A more complicated model (with more free parameters) has a more flexible probability distribution, but this comes at a real (if often unacknowledged) cost in terms of internal complexity. A rather general Gibbs-like model is laid out and briefly discussed. We also briefly discuss applying maximum entropy ideas to non-Shannon entropies (such as the Rényi and Tsallis entropies). There is very definitely a trade-off in both elegance and plausibility, and I would argue strongly that the simplest and most elegant model consists of the Shannon entropy, a constraint on ln n = χ, and a trivial normalization constraint on the sum of probabilities.
The fact that the logarithmic average ln n plays such an important role in power laws seems to have a connection with the fact that logarithmic scales are ubiquitous in classifying various natural and social phenomena. For instance • stellar magnitudes are logarithmic in stellar luminosity, • earthquake magnitudes (modified Richter scale) are logarithmic in energy release, • sound intensity decibels are logarithmic in pressure, • the acidity/alkalinity pH scale is logarithmic in hydrogen ion concentration, • musical octaves are logarithmic in frequency, • war severity can be characterized as being logarithmic in casualty count [7].
In many cases the utility of a logarithmic scale can be traced back to an approximate logarithmic sensitivity in human perceptual systems, but it is very easy to confound cause and effect. After all, in the presence of power-law distributed external stimuli, there is a significant disadvantage in having the human perceptual system overwhelmed by large numbers of low-impact events, suggesting an evolutionary pressure towards suppressing sensitivity to low-impact events. This suggests that logarithmic sensitivity in human (and animal) perceptual systems is evolutionarily preferred for those senses that are subject to an external bath of power-law distributed stimuli.
Fortunately, for the purposes of applying maximum entropy ideas one does not need to know which is the cause and which is the effect-one is 'merely' using Bayesian principle to estimate underlying probabilities in the presence of limited knowledge; for current purposes this is most typically the single piece of information that ln n = χ.