Legendre transformation and information geometry for the maximum entropy theory of ecology

Here I investigate some mathematical aspects of the maximum entropy theory of ecology (METE). In particular I address the geometrical structure of METE endowed by information geometry. As novel results, the macrostate entropy is calculated analytically by the Legendre transformation of the log-normalizer in METE. This result allows for the calculation of the metric terms in the information geometry arising from METE and, by consequence, the covariance matrix between METE variables.


Introduction
The method of maximum entropy (MaxEnt) is usually associated with Jaynes' work [1][2][3] connecting statistical physics and the information entropy proposed by Shannon [4], although its mathematics has been known since Gibbs [5]. It consists of selecting probability distributions by maximizing a functional, namely entropy, usually under a set of expected value constraints, arriving at what are known as Gibbs distributions. Since Shore and Johnson [6], MaxEnt has been understood as a general method for inference (see also [7][8][9]); hence it is not surprising that (i) Gibbs distributions are what is known in statistical theory as the exponential family, the only distributions for which sufficient statistics exist (see e.g. [10]), (ii) MaxEnt encompasses the methods of Bayesian statistics [11], and (iii) MaxEnt has found successful applications in several fields of science (e.g. [12][13][14][15][16][17][18][19][20][21][22]).
One of the scientific fields in which MaxEnt has been successfully applied is macroecology. The work of Harte and collaborators [23][24][25][26][27] presents what is known as the maximum entropy theory of ecology (METE). It consists of finding, through MaxEnt, a joint conditional distribution for the abundance of a species and the metabolic rate of its individuals. From the marginalization and expected values of the MaxEnt distribution, it is possible to obtain (i) the species abundance distribution (Fisher's log series), (ii) the species-area distribution, (iii) the distribution for metabolic rates over individuals, and (iv) the relationship between the metabolic rate of individuals in a species and that species' abundance; for a comprehensive confirmation of METE with experimental data see [28]. In a recent article, Harte [29] brings forward the need for dynamical models based on MaxEnt, as METE assumes the variables to be static (see footnote 1).
The field known as information geometry (IG) [33][34][35][36] assigns a Riemannian geometry structure to probability distributions. In information geometry the distances are given by the Fisher-Rao information metric (FRIM) [37,38], which is the only metric in accordance with the grouping property of probability distributions [39]. IG has found important applications for probabilistic dynamical systems [34][40][41][42][43]. Here the FRIM terms for the distributions arising from METE will be calculated. In a future publication I will build upon the results obtained here towards an entropic dynamics [43] using METE variables.
The layout of the paper is as follows: Section 2 presents MaxEnt in general terms, followed by the MaxEnt process in METE. In particular we obtain the macrostate entropy through the Legendre transformation, making use of the Lambert W special function [44,45], which is a novel result to the best of my knowledge. Section 3 presents some general results of IG and calculates the information metric terms for METE. Section 4 concludes the present article by commenting on possible applications and perspectives for IG in a dynamical theory of macroecology.

Maximum Entropy
In information theory, probability distributions encode the available information about a system's variables $x \in X$. MaxEnt consists of updating from a prior distribution $q(x)$, usually but not necessarily taken to be uniform, to a posterior $\rho(x)$ that maximizes the entropy functional under a set of constraints meant to represent the known information about the system. Usually these constraints are the expected values $A_i$ of a set of real-valued functions $\{a_i(x)\}$, namely the sufficient statistics. The distribution $\rho$ is found as the solution to the following optimization problem

$$\max_{\rho}\; H[\rho] = -\int dx\, \rho(x) \log\frac{\rho(x)}{q(x)} \quad \text{subject to} \quad \int dx\, \rho(x) = 1, \quad \int dx\, \rho(x)\, a_i(x) = A_i, \tag{1}$$

where $dx$ refers to the appropriate measure of the set $X$: if one is interested in a discrete set $X = \{x_\mu\}$, where $\mu$ corresponds to an enumeration of $X$, we have $\int dx = \sum_\mu$; if one is interested in a continuous subset of real variables, e.g. an interval $X = [a, b]$, we have the ordinary integral $\int dx = \int_a^b dx$. The solution is the Gibbs distribution

$$\rho(x|\lambda) = \frac{q(x)}{Z(\lambda)} \exp\left(-\lambda_i a_i(x)\right), \tag{2}$$

where $\lambda = \{\lambda_i\}$ is the set of Lagrange multipliers dual to the expected values $A = \{A_i\}$ and $Z(\lambda)$ is a normalization factor given by

$$Z(\lambda) = \int dx\, q(x) \exp\left(-\lambda_i a_i(x)\right). \tag{3}$$

Above, and in the remainder of this article, we use Einstein's summation notation, $\lambda_i a_i \equiv \sum_i \lambda_i a_i$. The expected values can be recovered as

$$A_i = -\frac{\partial}{\partial \lambda_i} \log Z(\lambda) = \frac{\partial F}{\partial \lambda_i}, \qquad F(\lambda) \doteq -\log Z(\lambda). \tag{4}$$

We will refer to $F$ as the log-normalizer, which plays a role similar to free energy in statistical mechanics. If one is able to invert the equations arising from (4), thus obtaining $\lambda_i(A)$, one can express the probability distributions in terms of the expected values, $\rho(x|A) = \rho(x|\lambda(A))$. This also allows one to calculate the entropy $H$ at its maximum (that is, $H[\rho(x|A)]$ for $\rho$ in (2)) as a function of the expected values, rather than a functional of $\rho$, obtaining

$$H(A) = \lambda_i(A)\, A_i - F(\lambda(A)). \tag{5}$$

We will refer to $H(A)$ as the macrostate entropy, which is what we refer to in statistical mechanics as thermodynamical entropy, meaning the one that appears in the laws of thermodynamics. One can see from (5) that $H(A)$ is the Legendre transformation [46] of $F(\lambda)$. It also follows that $\lambda_i = \frac{\partial H}{\partial A_i}$.

Footnote 1: It is relevant to say that Jaynes applied dynamical methods based on information theory for nonequilibrium statistical mechanics [30], leading to what is now known as maximum caliber [31,32]. However, maximum caliber assumes a Hamiltonian dynamics and, therefore, does not generalize to ecology and other complex systems.
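The relations between the Gibbs distribution, the log-normalizer and the macrostate entropy can be checked numerically on a small discrete set. The following is a minimal sketch, assuming the sign conventions $\rho \propto q\, e^{-\lambda_i a_i}$ and $F = -\log Z$; the toy state space, the statistics and the multiplier values are hypothetical and chosen only for illustration.

```python
import math

def gibbs(lams, stats, prior):
    """Gibbs distribution rho(x) = q(x) exp(-sum_i lam_i a_i(x)) / Z on a finite set.

    Returns the list of probabilities and the normalization factor Z.
    """
    weights = [q * math.exp(-sum(l * a for l, a in zip(lams, a_x)))
               for q, a_x in zip(prior, stats)]
    Z = sum(weights)
    return [w / Z for w in weights], Z

def log_normalizer(lams, stats, prior):
    """F(lambda) = -log Z(lambda) (sign convention assumed in this sketch)."""
    return -math.log(gibbs(lams, stats, prior)[1])

# Hypothetical toy system: x in {1,...,6}, statistics a_1(x) = x and a_2(x) = x^2
xs = list(range(1, 7))
stats = [(x, x * x) for x in xs]
prior = [1.0 / len(xs)] * len(xs)      # uniform prior q
lams = [0.3, 0.05]                     # arbitrary Lagrange multipliers

rho, Z = gibbs(lams, stats, prior)
A = [sum(p * a_x[i] for p, a_x in zip(rho, stats)) for i in range(2)]
F = -math.log(Z)

# Entropy computed from its definition and from the Legendre transformation
H_direct = -sum(p * math.log(p / q) for p, q in zip(rho, prior))
H_legendre = sum(l * a for l, a in zip(lams, A)) - F

# Central finite difference for dF/dlambda_1, to be compared with A_1
h = 1e-6
dF_dlam1 = (log_normalizer([lams[0] + h, lams[1]], stats, prior)
            - log_normalizer([lams[0] - h, lams[1]], stats, prior)) / (2 * h)

print("H (definition):", H_direct, " H (Legendre):", H_legendre)
print("A_1:", A[0], " dF/dlambda_1:", dF_dlam1)
```

The two entropy values should agree to rounding error, and the finite-difference derivative of $F$ should reproduce the first expected value.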

METE
The first step towards a MaxEnt description involves choosing the appropriate variables for the problem at hand. In METE [24] one assumes an ecosystem of $S$ species supporting $N$ individuals with a total metabolic rate $E$, meaning that in a unit of time the ecosystem consumes a quantity $E$ of energy. The state of the system $x$ in MaxEnt is defined for a single species as the number of individuals (abundance) $n$, $n \in \{1, 2, \ldots, N\}$, and the metabolic rate of an individual of that species $\varepsilon$, $\varepsilon \in [1, E]$; note that one can choose a system of units so that the smallest metabolic rate is the unit, $\varepsilon_{\min} = 1$. We represent the state as $x = (n, \varepsilon)$. The second step consists of assigning the sufficient statistics that appropriately capture the information about the system. In METE [24] the statistics chosen are the number of individuals in the species, $a_1(n, \varepsilon) \doteq n$, and the total metabolic rate, $a_2(n, \varepsilon) \doteq n\varepsilon$. Substituting these into the expected value constraints for the sufficient statistics in (1), we obtain a constraint on the average abundance per species,

$$\langle n \rangle = \sum_{n=1}^{N} \int_1^E d\varepsilon\; n\, \rho(n, \varepsilon) = \frac{N}{S} \doteq \mathcal{N}, \tag{6}$$

and a constraint on the average metabolic consumption per species,

$$\langle n\varepsilon \rangle = \sum_{n=1}^{N} \int_1^E d\varepsilon\; n\varepsilon\, \rho(n, \varepsilon) = \frac{E}{S} \doteq \mathcal{E}. \tag{7}$$

The variables $\mathcal{N}$ and $\mathcal{E}$ so defined will replace $A_1$ and $A_2$, respectively, when convenient.
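Having chosen the state variables and the sufficient statistics, we can compute all quantities defined in the previous subsection for the specific system defined by METE. With a uniform prior $q$, this leads to the canonical distribution (2) of the form

$$\rho(n, \varepsilon|\lambda_1, \lambda_2) = \frac{1}{Z(\lambda_1, \lambda_2)}\, e^{-\lambda_1 n}\, e^{-\lambda_2 n \varepsilon}, \tag{8}$$

where the normalization factor (3) is given by

$$Z(\lambda_1, \lambda_2) = \sum_{n=1}^{N} \int_1^E d\varepsilon\; e^{-\lambda_1 n} e^{-\lambda_2 n \varepsilon} = \sum_{n=1}^{N} \frac{e^{-\lambda_1 n}}{\lambda_2 n} \left( e^{-\lambda_2 n} - e^{-\lambda_2 n E} \right), \tag{9}$$

from which the expected values (4) can be calculated as

$$\mathcal{N} = \frac{1}{Z} \sum_{n=1}^{N} \frac{e^{-\lambda_1 n}}{\lambda_2} \left( e^{-\lambda_2 n} - e^{-\lambda_2 n E} \right), \tag{10a}$$

$$\mathcal{E} = \frac{1}{Z} \sum_{n=1}^{N} e^{-\lambda_1 n} \left[ e^{-\lambda_2 n} \left( \frac{1}{\lambda_2} + \frac{1}{\lambda_2^2 n} \right) - e^{-\lambda_2 n E} \left( \frac{E}{\lambda_2} + \frac{1}{\lambda_2^2 n} \right) \right]. \tag{10b}$$

These are complicated equations; however, some approximations make them more tractable.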
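A fair assumption, knowing what the variables are supposed to represent, is that there are far more individuals than species, $N \gg S$, and that the average metabolic rate per individual is far greater than the unit of metabolic rate, $E/N = \mathcal{E}/\mathcal{N} \gg 1$. This allows for a sequence of approximations that we will treat as assumptions here, namely (i) $e^{-\lambda_2 n E} \ll e^{-\lambda_2 n}$, (ii) $E e^{-\lambda_2 n E} \ll e^{-\lambda_2 n}$, (iii) $\lambda_1 + \lambda_2 \ll 1$, and (iv) $e^{-(\lambda_1 + \lambda_2) N} \ll 1$. Further explanation of the validity of these assumptions, under $S \ll N \ll E$, can be found in [24,26], and their confirmation by numerical calculation can be found in [24]. Under this understanding we can substitute (9) into (10a), obtaining

$$\mathcal{N} \approx \frac{\sum_{n=1}^{N} e^{-(\lambda_1 + \lambda_2) n}}{\sum_{n=1}^{N} \frac{1}{n} e^{-(\lambda_1 + \lambda_2) n}} \tag{11a}$$

$$\phantom{\mathcal{N}} \approx \frac{1}{(\lambda_1 + \lambda_2) \log\left( \frac{1}{\lambda_1 + \lambda_2} \right)}. \tag{11b}$$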
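The step from the finite sums to a closed form can be made explicit. The block below is a sketch of the standard geometric-series and logarithmic-series limits, written with the shorthand $\beta = \lambda_1 + \lambda_2$; it is a reconstruction from assumptions (iii) and (iv), not a quotation of the derivation in [24].

```latex
\begin{align}
\sum_{n=1}^{N} e^{-\beta n}
  &= \frac{e^{-\beta}\left(1 - e^{-\beta N}\right)}{1 - e^{-\beta}}
  \;\approx\; \frac{1}{\beta},
  && \text{using } e^{-\beta N} \ll 1 \text{ and } \beta \ll 1, \\
\sum_{n=1}^{N} \frac{e^{-\beta n}}{n}
  &\approx \sum_{n=1}^{\infty} \frac{e^{-\beta n}}{n}
  = -\log\left(1 - e^{-\beta}\right)
  \;\approx\; \log\frac{1}{\beta},
  && \text{using } 1 - e^{-\beta} \approx \beta, \\
\langle n \rangle = \frac{N}{S}
  &\approx \frac{\sum_{n} e^{-\beta n}}{\sum_{n} e^{-\beta n}/n}
  \;\approx\; \frac{1}{\beta \log(1/\beta)}.
\end{align}
```

For instance, with $\beta = 10^{-3}$ the last expression gives $N/S \approx 145$, which is the order of magnitude at which the hierarchy $S \ll N$ holds.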
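We can also rewrite (10b), under assumptions (i) and (ii), obtaining

$$\mathcal{E} \approx \mathcal{N} + \frac{1}{\lambda_2}. \tag{12}$$

In order to obtain the macrostate entropy (5) analytically, one needs to perform the Legendre transformation for METE, which includes inverting (11) and (12) to obtain $\lambda_1(\mathcal{N}, \mathcal{E})$ and $\lambda_2(\mathcal{N}, \mathcal{E})$. On page 149 of [24] this inversion is said to be unfeasible. However, it is possible to do so, obtaining

$$\lambda_1 = \beta(\mathcal{N}) - \frac{1}{\mathcal{E} - \mathcal{N}}, \qquad \lambda_2 = \frac{1}{\mathcal{E} - \mathcal{N}}, \tag{13}$$

where

$$\beta(\mathcal{N}) = \exp\left[ W_{-1}\left( -\frac{1}{\mathcal{N}} \right) \right] = -\frac{1}{\mathcal{N}\, W_{-1}\left( -\frac{1}{\mathcal{N}} \right)} \tag{14}$$

and $W_{-1}$ refers to the second main branch of the Lambert W function (see [44,45]). The details on how (13) inverts (11) and (12) are presented in Appendix A. The macrostate entropy can be calculated directly from (5) as

$$H(\mathcal{N}, \mathcal{E}) = 1 + \mathcal{N} \beta(\mathcal{N}) + \log(\mathcal{E} - \mathcal{N}) + \log\left[ -W_{-1}\left( -\frac{1}{\mathcal{N}} \right) \right]. \tag{15}$$

With the calculation of the macrostate entropy finished, we can move on to a geometric description of METE.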
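The inversion can be sketched numerically. The block below assumes the closed forms $\beta \log(1/\beta) = 1/\mathcal{N}$, $\lambda_2 = 1/(\mathcal{E} - \mathcal{N})$, $\lambda_1 = \beta - \lambda_2$ and $H = 1 + \mathcal{N}\beta + \log(\mathcal{E} - \mathcal{N}) + \log(-W_{-1}(-1/\mathcal{N}))$, which are reconstructed here from the approximations of this section and should be read as assumptions. The helper names and the example values of $\mathcal{N}$ and $\mathcal{E}$ are hypothetical. The $W_{-1}$ branch is evaluated with a small Newton iteration to keep the sketch dependency-free; `scipy.special.lambertw(x, k=-1)` is a library alternative.

```python
import math

def lambert_w_minus1(x, max_iter=100):
    """Real branch W_{-1}(x) for -1/e < x < 0.

    Newton iteration on g(w) = w + log(-w) - log(-x), which is the logarithm of
    w e^w = x restricted to w < -1. Not tuned for x extremely close to -1/e.
    """
    if not -1.0 / math.e < x < 0.0:
        raise ValueError("W_{-1}(x) is real only for -1/e < x < 0")
    log_minus_x = math.log(-x)
    w = log_minus_x - math.log(-log_minus_x)   # asymptotic starting guess, w < -1
    for _ in range(max_iter):
        step = (w + math.log(-w) - log_minus_x) / (1.0 + 1.0 / w)
        w -= step
        if abs(step) <= 1e-15 * abs(w):
            break
    return w

def mete_multipliers(N_bar, E_bar):
    """Lagrange multipliers (lam1, lam2) and beta = lam1 + lam2 (assumed closed forms)."""
    beta = math.exp(lambert_w_minus1(-1.0 / N_bar))
    lam2 = 1.0 / (E_bar - N_bar)
    return beta - lam2, lam2, beta

def macrostate_entropy(N_bar, E_bar):
    """Macrostate entropy H(N_bar, E_bar) under the assumed closed form."""
    w = lambert_w_minus1(-1.0 / N_bar)
    return 1.0 + N_bar * math.exp(w) + math.log(E_bar - N_bar) + math.log(-w)

# Hypothetical example: 150 individuals and 15000 units of metabolic rate per species
N_bar, E_bar = 150.0, 15000.0
lam1, lam2, beta = mete_multipliers(N_bar, E_bar)
H = macrostate_entropy(N_bar, E_bar)

# Consistency check: the multipliers should equal the partial derivatives of H
h = 1e-4
dH_dN = (macrostate_entropy(N_bar + h, E_bar) - macrostate_entropy(N_bar - h, E_bar)) / (2 * h)
dH_dE = (macrostate_entropy(N_bar, E_bar + h) - macrostate_entropy(N_bar, E_bar - h)) / (2 * h)

print("beta:", beta, " lambda_1:", lam1, " lambda_2:", lam2, " H:", H)
print("dH/dN - lambda_1:", dH_dN - lam1, " dH/dE - lambda_2:", dH_dE - lam2)
```

The finite-difference derivatives of $H$ reproducing $\lambda_1$ and $\lambda_2$ is the property $\lambda_i = \partial H / \partial A_i$ of the Legendre transformation, which gives an internal check on the assumed expressions.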

Information geometry
This section presents the elementary notions of IG (for more in-depth discussion and examples see e.g. [33][34][35][36]) and some useful identities for the IG of Gibbs distributions. IG consists of assigning a Riemannian geometry structure to the space of probability distributions, meaning that if a set of distributions $P(x|\theta)$ is parametrized by a finite number of coordinates, $\theta = \{\theta_i\}$, the distance $d\ell$ between the neighbouring distributions $P(x|\theta + d\theta)$ and $P(x|\theta)$, which is a measure of distinguishability, is given by $d\ell^2 = g_{ij}\, d\theta_i\, d\theta_j$. The work of Cencov [39] demonstrated that the only metric invariant under Markov embeddings, and therefore the only one adequate to represent a space of probability distributions, is the metric of the form

$$g_{ij} = \int dx\, P(x|\theta)\, \frac{\partial \log P(x|\theta)}{\partial \theta_i}\, \frac{\partial \log P(x|\theta)}{\partial \theta_j}, \tag{16}$$

known as FRIM.
Considering the MaxEnt results presented in the previous section, we can restrict our investigation to the Gibbs distributions using the expected values $A$ as coordinates, $\theta_i = A_i$ and $P(x|\theta) = \rho(x|A)$ as in (2). Two useful expressions arise in that case (for proofs see e.g. [33]). First, the metric terms are the Hessian of the negative of the macrostate entropy, meaning

$$g_{ij} = -\frac{\partial^2 H}{\partial A_i\, \partial A_j}. \tag{17}$$

Second, the covariance matrix between the sufficient statistics $a_i(x)$ is the inverse matrix of $g_{ij}$, meaning

$$C^{ij} \doteq \langle a_i a_j \rangle - A_i A_j, \qquad C^{ik} g_{kj} = \delta^i_j. \tag{18}$$

We can then see how these quantities are calculated for METE.
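The link between the two expressions can be sketched with the chain rule. The block below is a short version of the standard argument, assuming the sign conventions $\rho \propto q\, e^{-\lambda_i a_i}$, $F = -\log Z$ and $\lambda_i = \partial H / \partial A_i$; it is a sketch and not a substitute for the proofs in [33].

```latex
\begin{align}
\frac{\partial A_i}{\partial \lambda_j}
  &= \frac{\partial^2 F}{\partial \lambda_i\, \partial \lambda_j}
   = -\left( \langle a_i a_j \rangle - A_i A_j \right)
   = -C^{ij}, \\
\frac{\partial \lambda_i}{\partial A_j}
  &= \frac{\partial^2 H}{\partial A_i\, \partial A_j}
   = -g_{ij}, \\
\frac{\partial A_i}{\partial \lambda_k}\,
\frac{\partial \lambda_k}{\partial A_j}
  &= \delta_{ij}
  \quad\Longrightarrow\quad
  C^{ik}\, g_{kj} = \delta_{ij}.
\end{align}
```

The first line uses the fact that the second derivatives of $\log Z$ with respect to the multipliers give the covariances of the sufficient statistics; the last line is the statement that the Jacobians of the maps $\lambda \mapsto A$ and $A \mapsto \lambda$ are mutually inverse.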

Information geometry of METE
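By substituting the macrostate entropy for METE (15) in (17), with coordinates $(A_1, A_2) = (\mathcal{N}, \mathcal{E})$ and the shorthand $W \equiv W_{-1}(-1/\mathcal{N})$, we obtain the FRIM terms:

$$g_{11} = \frac{1}{(\mathcal{E} - \mathcal{N})^2} - \frac{1}{\mathcal{N}^2 (1 + W)}, \qquad g_{12} = g_{21} = -\frac{1}{(\mathcal{E} - \mathcal{N})^2}, \qquad g_{22} = \frac{1}{(\mathcal{E} - \mathcal{N})^2}, \qquad g = -\frac{1}{\mathcal{N}^2 (1 + W) (\mathcal{E} - \mathcal{N})^2}, \tag{19}$$

where $g = \det g_{ij}$, which is positive since $W < -1$ on the $W_{-1}$ branch. Per (18) and from the general form of the inverse of a $2 \times 2$ matrix, the covariance matrix terms can be calculated by directly inverting (19), obtaining

$$C^{11} = \frac{g_{22}}{g} = -\mathcal{N}^2 (1 + W), \qquad C^{12} = C^{21} = -\frac{g_{12}}{g} = -\mathcal{N}^2 (1 + W), \qquad C^{22} = \frac{g_{11}}{g} = (\mathcal{E} - \mathcal{N})^2 - \mathcal{N}^2 (1 + W), \tag{20}$$

completing the calculation. The matrix $C^{ij}$ can be interpreted directly as the covariance between a species' abundance and its total metabolic rate, the METE sufficient statistics. The information metric terms presented in (19) allow for further studies on dynamical ecology from an information theory background, as we will comment on in the following section.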
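As a sanity check, the covariance of the sufficient statistics $n$ and $n\varepsilon$ can be computed by brute force from the distribution $\rho \propto e^{-\lambda_1 n - \lambda_2 n \varepsilon}$ and compared with closed forms. In the sketch below the closed forms $\mathrm{Var}(n) \approx -\mathcal{N}^2 (1 + \log\beta)$, $\mathrm{Cov}(n, n\varepsilon) \approx \mathrm{Var}(n)$ and $\mathrm{Var}(n\varepsilon) \approx \mathrm{Var}(n) + 1/\lambda_2^2$, with $\beta = \lambda_1 + \lambda_2$ and $\mathcal{N} = 1/(\beta \log(1/\beta))$, are assumptions derived from the approximations of the METE section (note that $\log \beta$ coincides with $W_{-1}(-1/\mathcal{N})$ when $\beta < 1/e$). Function names and parameter values are hypothetical.

```python
import math

def eps_moments(a, E):
    """Integrals of eps^k * exp(-a * eps) over eps in [1, E], for k = 0, 1, 2."""
    def minus_antiderivative(e):
        ex = math.exp(-a * e)
        return (ex / a,
                ex * (e / a + 1.0 / a**2),
                ex * (e * e / a + 2.0 * e / a**2 + 2.0 / a**3))
    lower, upper = minus_antiderivative(1.0), minus_antiderivative(E)
    return tuple(l - u for l, u in zip(lower, upper))

def exact_covariance(lam1, lam2, N, E):
    """Covariance terms of (n, y = n*eps) by direct summation over n = 1..N."""
    Z = m_n = m_y = m_nn = m_ny = m_yy = 0.0
    for n in range(1, N + 1):
        i0, i1, i2 = eps_moments(lam2 * n, E)
        w = math.exp(-lam1 * n)
        Z += w * i0
        m_n += w * n * i0           # <n>
        m_y += w * n * i1           # <n eps>
        m_nn += w * n * n * i0      # <n^2>
        m_ny += w * n * n * i1      # <n * n eps>
        m_yy += w * n * n * i2      # <(n eps)^2>
    m_n, m_y, m_nn, m_ny, m_yy = (m / Z for m in (m_n, m_y, m_nn, m_ny, m_yy))
    return (m_nn - m_n**2, m_ny - m_n * m_y, m_yy - m_y**2)

def closed_form_covariance(lam1, lam2):
    """Assumed closed forms for (C11, C12, C22), valid for beta << 1."""
    beta = lam1 + lam2
    N_bar = 1.0 / (beta * math.log(1.0 / beta))
    var_n = -N_bar**2 * (1.0 + math.log(beta))
    return (var_n, var_n, var_n + 1.0 / lam2**2)

# Hypothetical parameters in the regime S << N << E (beta = 1e-3)
lam1, lam2, N, E = 9.3e-4, 7.0e-5, 20000, 2.0e6
exact = exact_covariance(lam1, lam2, N, E)
closed = closed_form_covariance(lam1, lam2)
for name, e, c in zip(("C11", "C12", "C22"), exact, closed):
    print(name, "exact:", e, "closed form:", c)
```

Agreement between the two columns at the sub-percent level would support the closed forms in this regime; larger $\beta$ or smaller $N$ should degrade it, since the approximations then no longer hold.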

Discussion and perspectives
The present article calculates the macrostate entropy (15) for METE. This was made possible by the analytical calculation of the Lagrange multipliers (13) as functions of the expected values (10), previously believed to be unfeasible. This allows for a complete description of METE in terms of the average abundance $\mathcal{N}$ and the expected metabolic rate $\mathcal{E}$ of each of the ecosystem's species, and opens a broad range of investigations that can be carried out by analytical calculation. In particular, the IG arising from METE is presented by calculating the FRIM terms in (19). Independently of any geometric interpretation, this is equivalent to calculating the covariance between the METE sufficient statistics (20). The variables that define an ecosystem's state are not expected to remain constant. Because of this, and of the growing relevance of IG in dynamical systems, the calculations made in the present article are an important step towards expanding maximum entropy ideas into further investigations in macroecology. One possible example, which I intend to explore in a future publication, is using the results presented here towards an entropic dynamics for ecology.

A On the Lambert W function
In this appendix we will explain how (13) inverts (11) and (12). The Lambert W function is defined as the solution of

$$W(x)\, e^{W(x)} = x. \tag{21}$$

The Python library SciPy [47] implements the numerical calculation of $W$. This relates to (11b) in the following manner: by defining the variable $\beta = \lambda_1 + \lambda_2$ we obtain

$$\beta \log \beta = -\frac{1}{\mathcal{N}} \quad \Longrightarrow \quad \beta = \exp\left[ W\left( -\frac{1}{\mathcal{N}} \right) \right]. \tag{22}$$

It is relevant to say that, from (21), $W(x)$ is multivalued; the terminology Lambert W 'function' is used loosely. The several single-valued functions that solve (21) are known as the different 'branches' of the Lambert W. In (13) and (14) only the $W_{-1}$ branch was taken into account. Given our object of study, we will restrict ourselves to functions that are guaranteed to give a $\beta$ that is real for large $\mathcal{N}$. As explained in [44], the two branches $W_0(x)$ and $W_{-1}(x)$ are real and analytic for $-e^{-1} < x < 0$, or equivalently $\beta$ is real for $\mathcal{N} > e$. This is coherent with the fact that (11) was derived for large $\mathcal{N}$. Fig. 1 presents the graphs of $\beta$ obtained from the $W_0(x)$ and $W_{-1}(x)$ branches, as well as a comparison to the $\beta$ obtained numerically from inverting (11a). Even though, per (22), the $\beta$ obtained from either branch inverts (11b), it can be seen from Fig. 1 that only the one obtained from $W_{-1}(x)$ approximates the inverse of (11a) for large $\mathcal{N}$ and, therefore, it is the only one appropriate for the present investigation.
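The branch comparison can be reproduced with a few lines of code. The sketch below assumes that (11b) has the form $\beta \log(1/\beta) = 1/\mathcal{N}$ and that (11a) is the ratio of the sums $\sum_n e^{-\beta n}$ and $\sum_n e^{-\beta n}/n$; under these assumptions the two real roots of (11b) lie on either side of $\beta = 1/e$ and correspond to the $W_{-1}$ branch ($\log\beta < -1$) and the $W_0$ branch ($-1 < \log\beta < 0$). All three inversions are done by plain bisection, and the values of $N$ and $\mathcal{N}$ are hypothetical.

```python
import math

def mean_abundance(beta, N):
    """Ratio of sums assumed for (11a): sum_n e^{-beta n} / sum_n e^{-beta n} / n."""
    num = sum(math.exp(-beta * n) for n in range(1, N + 1))
    den = sum(math.exp(-beta * n) / n for n in range(1, N + 1))
    return num / den

def bisect(f, lo, hi, iters=100):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

N_total, N_bar = 20000, 150.0           # hypothetical values with N_bar > e

def closed_form(beta):
    """Residual of the assumed form of (11b): beta log(1/beta) - 1/N_bar."""
    return beta * math.log(1.0 / beta) - 1.0 / N_bar

# The two real roots of (11b), one on each side of beta = 1/e
beta_minus1 = bisect(closed_form, 1e-12, 1.0 / math.e)        # W_{-1} branch
beta_0 = bisect(closed_form, 1.0 / math.e, 1.0 - 1e-12)       # W_0 branch

# Numerical inversion of the finite sums in (11a)
beta_numeric = bisect(lambda b: mean_abundance(b, N_total) - N_bar, 1e-6, 1.0, iters=40)

print("beta from W_-1 branch:", beta_minus1)
print("beta from W_0 branch: ", beta_0)
print("beta from (11a):      ", beta_numeric)
```

In this regime the root associated with $W_{-1}$ should be close to the numerical inverse of (11a), while the root associated with $W_0$ is close to 1 and violates assumption (iii), $\beta \ll 1$.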
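To complete the claim that $\lambda_1$ and $\lambda_2$ in (13) are calculated analytically, it is relevant to say that $W_{-1}\left(-\frac{1}{\mathcal{N}}\right)$ can be calculated using the series expansion (see page 153 in [44])

$$W_{-1}\left( -\frac{1}{\mathcal{N}} \right) = -\sum_{m=0}^{\infty} a_m z^m,$$

where $z = \sqrt{2(\log \mathcal{N} - 1)}$, and $a_m$ is defined recursively, starting from $a_0 = 1$ and $a_1 = 1$, with the higher coefficients given by the recursion relation in [44]. Note that real $z$ implies $\mathcal{N} > e$, which is coherent with the condition for $W_{-1}$ to be real.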