A methodology to determine the maximum value of weighted Gini–Simpson index

The weighted Gini–Simpson index is an analytical tool that promises to be widely used in biological and economic applications, where it assesses diversity through the compositional proportions of a system defined by a finite number of elementary states characterized by positive weights. This paper presents a review of the current literature on the theme and outlines the mathematical properties of the index, focusing on the location of the maximizer (maximum point) and the evaluation of the maximum value. Emphasis is placed on the role of the critical value of the Lagrange multiplier, which is closely related to the harmonic mean of the weights and is shown to be a barrier for the feasibility of the solution. Sequential procedures, either backward or forward, are presented to obtain the correct coordinates of the maximum point, thus allowing the correct maximum value of the index to be computed. New theoretical results are also provided, such as limits and partial derivatives of the critical solution, which are used to assess the effectiveness of the algorithms proposed and discussed here.


Simpson and Gini-Simpson indices
Considering a simplex of dimension m − 1 defined as $\Delta^{m-1} = \{p_i \ge 0,\ i = 1, \dots, m;\ \sum_{i=1}^{m} p_i = 1\}$, where the numbers $p_i$ denote relative extension measures, usually probabilities or proportions, Simpson's index, originally mentioned as a measure of the concentration of a classification (Simpson 1949), is evaluated with the formula $C = \sum_{i=1}^{m} p_i^2$. Its complement $D = 1 - C$ is usually named the Gini-Simpson index (e.g., Rao 1982) and is used as a measure of biological or phylogenetic diversity until today (e.g., Tryjanowski et al. 2015; Zaller et al. 2015; Brocchieri 2015), since the formula can be rewritten as $D = \sum_{i=1}^{m} p_i (1 - p_i)$ and interpreted as the probability that any two random individuals in a population are assigned to different populations or genetic clusters (e.g., Chybicki et al. 2014). In biological studies, more than seven decades ago, the term $p_i (1 - p_i)$ was already mentioned as the contribution to the sampling variance due to any one species being sometimes observed and sometimes not (Fisher et al. 1943), later stated as the probability of interspecific encounters (Hurlbert 1971), or the probability of drawing two individuals of different type from a given collection (Gregorius and Gillet 2008). Good (1953), in a paper inspired by Alan Turing, defined parametric measures of heterogeneity of populations of s species, evaluated as $c_{m,n} = \sum_{i=1}^{s} p_i^m (-\log p_i)^n$ with m, n = 0, 1, 2, …. Using this formalism, the case $c_{2,0}$ is Simpson's concentration index C, while $c_{1,1}$ is Shannon's (1948) statistical entropy. The attribution to Corrado Gini of the earliest formulation of index D, more than a century ago, was related to the themes of variability, devoted to the measurement of quantitative phenomena, and mutability, concerned with the measurement of qualitative phenomena.
It is mentioned that Gini presented about 13 versions of the index (Ceriani and Verme 2012) and measuring variability is considered to be at the core of his procedure (de Finetti 1931). Sen (2005) says that Gini index opened the avenue for research in diversity analysis of qualitative categorical data models.

Weighted Simpson and Gini-Simpson indices
It is not easy to find references to the use of the weighted Simpson's concentration index $C_w = \sum_{i=1}^{m} w_i p_i^2$, which seems to have been first explicitly stated and used as an inverse measure for antigenic diversity of a virus population (Nowak 1994); it was also used recently as a price-weighted biodiversity index of catch in freshwater fisheries in Malawi (Kasulo and Perrings 2006). Sharma et al. (1978) discussed a non-additive information measure they named "generalized useful information of degree α" relative to a utility information scheme with m positive real numbers $w_i$ also defined in the simplex $\Delta^{m-1}$, denoted parametrically as $I_\alpha(W)$, from which the weighted Gini-Simpson index can be retrieved by evaluating the semi-value of $I_\alpha(W)$ with α = 2, obtaining $\sum_{i=1}^{m} w_i p_i (1 - p_i)$. The weights $\{w_i\}_{i=1,\dots,m}$ allow for taking into account different features related to ecological or economic values of species or other components of a system characterized by the proportions $\{p_i\}_{i=1,\dots,m}$, including the sampling effort, the phylogenetic distances or conservation values, to name a few possible applications.
Weighted Gini-Simpson (WGS) index seems to have been formerly conceived and studied as an analytical tool addressing the diagnosis of landscape mosaics composition (Casquilho 1999) where the maximum point of the index and its maximum value were discussed using Lagrange multipliers method. It was also stated as an approach used to assess inequality measures under the scope of utility theory (Sen 1999). Guiasu and Guiasu (2003) outlined conditional and weighted measures of ecological diversity presenting the formulas for the maximum value of WGS index and the optimal proportions, results which were further retrieved and generalized for triads of species (Guiasu and Guiasu 2010). Also, Casquilho (2009) discussed an issue relative to habitats valuation with complex numbers, conceiving weighted Gini-Simpson index as a sum of variances of interdependent Bernoulli variables indexed by positive characteristic values, either ecological or economic.
Several theoretical developments followed. Concerning ecological or related biological fields, the following stand out: the application of weighted Gini-Simpson to assess ecomosaics compositional scenarios (Casquilho 2011); the application to biodiversity partitioning and measuring of diversity with respect to the pairs of species (Guiasu and Guiasu 2012); Ricotta et al. (2012), discussing Rao's quadratic index under the scope of functional rarefaction, claim that their method is suitable to be extended to any concave diversity measure including WGS index; also, weighted Gini-Simpson index was said to be closely related to a unified framework based on Hill numbers concerning specific, phylogenetic, functional and other diversity measures (Chiu and Chao 2012; Chao et al. 2014); Guiasu and Guiasu (2014) proceeded with developments concerning the use of the index as a biodiversity assessment tool for interdependent species; Pavoine and Izsák (2014) formulated a new parametric index of diversity related to Rao's quadratic entropy and discussed connections with other indices including WGS index; last, WGS index was combined with expected utility, generating a non-expected utility device (Casquilho 2015). Other empirical studies or applications using WGS index will be mentioned in the discussion of results.

Stating the problem
The problem addressed and discussed in this paper is that, in general, the maximum point coordinates of WGS index must be computed with a sequential procedure, because the formulas available in the most relevant literature concerning the issue (e.g., Guiasu and Guiasu 2003, 2010) are valid only within limited ranges of values of the set of predefined nonnegative weights $\{w_i\}_{i=1,\dots,m}$. If this remark is not scrutinized, the blind use of those formulas may inflate the proportions of the heaviest weighted components and lead to an erroneous "maximum value" evaluation, which can have pernicious consequences in subsequent normalization procedures or other inferences on the subject. Though the problem was previously mentioned (Casquilho 1999, 2009, 2011), it was not fully systematized and analytically focused, and one of the procedures proposed in this paper is new. In fact, the problem at stake has an old root, as Jaynes (1957) had already pointed out that the negative term $-p_i^2$ raises the difficulty that conditional maxima cannot be found by a stationary property involving Lagrange multipliers, because the results do not, in general, satisfy the axiomatic condition $p_i \ge 0$. This kind of problem is at the core of the subject discussed in the following. Next, the main mathematical properties of weighted Gini-Simpson index will be reviewed, focusing on the critical solution and the meaning of the Lagrange multiplier value as a parameter controlling the feasibility of the solution. Then, sequential backward and forward procedures or algorithms are outlined, associated with simple numerical examples illustrating the performance of the method, and last, results will be discussed.

Review of the theoretical framework
Renaming $v_i^2$ as $w_i$ (that is, $v_i^2 = w_i$), the weighted Gini-Simpson index, measuring the variability of a system with such a characterization, is defined with formula (1):
$$D_w = \sum_{i=1}^{m} w_i p_i (1 - p_i) \tag{1}$$
Index $D_w$ is a continuous real function whose domain is a compact and connected set, the (m − 1)-simplex, so the Weierstrass and Bolzano theorems ensure that the index attains maximum and minimum values, as well as all the intermediate values in its range. The inequality $D_w \ge 0$ holds because the index is a sum of nonnegative terms, hence the minimum value $D_w = 0$ is reached at every vertex of the simplex ($p_j = 1$ and $p_i = 0$ if i ≠ j). Moreover, index $D_w$ is a concave function (e.g. Casquilho 1999; Guiasu and Guiasu 2010).

Lagrange multiplier method and feasibility of solution
Also, index $D_w$ is a real differentiable function, hence one can use the auxiliary Lagrangian function, denoted as
$$L = \sum_{i=1}^{m} w_i p_i (1 - p_i) - \alpha \left( \sum_{i=1}^{m} p_i - 1 \right),$$
as an analytical tool for finding candidates to constrained extreme points of $D_w$ (e.g. Bertsekas 1996) located in the hyperplane defined by the equation $\sum_{i=1}^{m} p_i = 1$. The generic partial derivative with respect to the variable $p_i$ is
$$\frac{\partial L}{\partial p_i} = w_i (1 - 2 p_i) - \alpha .$$
Searching the critical or stationary point of function L implies the system of equations:
$$w_i (1 - 2 p_i) - \alpha = 0 \iff p_i = \frac{w_i - \alpha}{2 w_i}, \quad i = 1, \dots, m \tag{2}$$
As the weights are positive real numbers by hypothesis ($w_i > 0$), the optimal proportions must verify the conditions $p_i \ge 0 \iff w_i \ge \alpha$, from which it follows that the value of the Lagrange multiplier is a barrier, or limit value, concerning the feasibility of the solution evaluated with this method.
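Computing the associated closure condition $\sum_{i=1}^{m} p_i = 1$ using (2), one obtains the critical value of the Lagrange multiplier:
$$\alpha^* = \frac{m - 2}{\sum_{j=1}^{m} 1/w_j} \tag{3}$$
It follows that the critical point evaluated combining (2) and (3) is defined by the equations:
$$p_i^* = \frac{1}{2} \left( 1 - \frac{m - 2}{w_i \sum_{j=1}^{m} 1/w_j} \right), \quad i = 1, \dots, m \tag{4}$$
Formula (5), relative to the presumed maximum value of the index, is the result of evaluating (1) with $\{p_i\}_{i=1,\dots,m}$ replaced by the critical proportions defined in (4):
$$D_w^* = \frac{1}{4} \left( \sum_{i=1}^{m} w_i - \frac{(m - 2)^2}{\sum_{i=1}^{m} 1/w_i} \right) \tag{5}$$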
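The closed-form expressions referred to as (2) to (5) can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the stationarity condition $w_i (1 - 2 p_i) = \alpha$, so that $\alpha^* = (m - 2) / \sum_j 1/w_j$; the function names and the numeric weights are hypothetical and do not come from the paper.

```python
def critical_alpha(weights):
    """Critical Lagrange multiplier: (m - 2) over the sum of reciprocal weights."""
    m = len(weights)
    return (m - 2) / sum(1.0 / w for w in weights)


def critical_point(weights):
    """Critical proportions p_i = (w_i - alpha) / (2 w_i); entries may be negative."""
    alpha = critical_alpha(weights)
    return [(w - alpha) / (2.0 * w) for w in weights]


def wgs_index(weights, proportions):
    """Weighted Gini-Simpson index: sum of w_i * p_i * (1 - p_i)."""
    return sum(w * p * (1.0 - p) for w, p in zip(weights, proportions))


def presumed_maximum(weights):
    """Closed-form value at the critical point; a true maximum only if all w_i >= alpha."""
    m = len(weights)
    return 0.25 * (sum(weights) - (m - 2) ** 2 / sum(1.0 / w for w in weights))


weights = [3.0, 2.0, 1.0]          # illustrative weights, not taken from the paper
p_star = critical_point(weights)   # all entries are positive here because m = 3
print(p_star, sum(p_star))
print(wgs_index(weights, p_star), presumed_maximum(weights))
```

With these weights the two printed index values agree up to rounding, since the critical point is feasible; the agreement says nothing about feasibility in higher dimensions.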

Results and discussion
From Eq. (3) one can conclude immediately that for m ≥ 3 the inequality $\alpha^* > 0$ is verified, and that for m = 2 it reduces to $\alpha^* = 0$, which implies the trivial result when the simplex is 1-dimensional: $p_1^* = p_2^* = 0.5$. Also, the critical coordinates $p_i^*$ evaluated with (2) intrinsically verify the condition $p_i^* \le 1$, as the following equivalences show: $p_i^* \le 1 \iff w_i - \alpha^* \le 2 w_i \iff -\alpha^* \le w_i$, which is true because $\alpha^* \ge 0$, the equality sign holding only for the 1-simplex. Last, if $w_i = \alpha^*$ one gets the value $p_i^* = 0$. Next, it will be proved that the critical solution defined by Eq. (4) may not be feasible and, consequently, cannot be used straightforwardly to evaluate the maximum value of the index as defined by formula (5).

Analytical study of the critical point
Both formulas (4) and (5) are correct whenever $w_i \ge \alpha^*$ for i = 1, …, m, meaning that m inequalities must be verified simultaneously. If there is at least one weight that verifies $w_i < \alpha^*$, the evaluation of the optimal solution needs a revision in a sequential procedure, though that can only happen if m ≥ 4. In fact, successive replacements and simplifications allow for obtaining the equivalent results:
$$w_i \ge \alpha^* \iff w_i \sum_{j=1}^{m} \frac{1}{w_j} \ge m - 2 \iff \sum_{j \ne i} \frac{w_i}{w_j} \ge m - 3, \quad i = 1, \dots, m \tag{6}$$
Thus, for m = 3 the right-hand side vanishes, the condition is trivially verified because $w_i > 0$, and the critical proportions are properly defined as the maximizer coordinates:
$$p_i^* = \frac{1}{2} \left( 1 - \frac{1}{w_i \sum_{j=1}^{3} 1/w_j} \right) > 0, \quad i = 1, 2, 3 \tag{7}$$
Formula (5) follows the notation of Guiasu and Guiasu (2003); an equivalent formula with a different notation may be found in Casquilho (1999:121,122).
A direct inspection of formula (4), for which it may be helpful to rewrite the denominator as $w_i \sum_{j=1}^{m} \frac{1}{w_j} = 1 + \sum_{j \ne i} \frac{w_i}{w_j}$, shows that the critical proportion $p_i^*$ increases with the value of the corresponding $w_i$ when all the other weights remain fixed and, on the contrary, decreases with the increasing value(s) of other $w_j$ (j ≠ i).
The calculus of limits on formulas (4) for m ≥ 3 also clarifies the issue: $\lim_{w_i \to +\infty} p_i^* = \frac{1}{2}$, and from this result one can conclude that there is a supremum (least upper bound) for the optimal point coordinates: $p_i^* < 0.5$ in this context (and $p_i^* = 0.5$ if m = 2). Also, $\lim_{w_i \to 0^+} p_i^* = 3/2 - m/2$, which is the same result that is obtained when $w_i$ is fixed and all the other weights $w_k$ (k ≠ i) tend simultaneously to positive infinity. For example, if m = 5 we get the result $\lim_{\{w_k\}_{k \ne i} \to +\infty} p_i^* = -1$. This negative value shows that applying formulas (4) can yield non-feasible solutions, which become negative without bound as the dimension of the simplex increases.
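The two limits can be checked in one step each by using the rewritten denominator $w_i \sum_j 1/w_j = 1 + \sum_{j \ne i} w_i / w_j$. The following is a sketch of the intermediate steps under that rewriting, not a quotation of the original derivation:

```latex
\[
\lim_{w_i \to 0^+} p_i^*
  = \lim_{w_i \to 0^+} \frac{1}{2}\left(1 - \frac{m-2}{1 + \sum_{j \neq i} w_i / w_j}\right)
  = \frac{1}{2}\bigl(1 - (m-2)\bigr)
  = \frac{3}{2} - \frac{m}{2},
\]
\[
\lim_{w_i \to +\infty} p_i^*
  = \lim_{w_i \to +\infty} \frac{1}{2}\left(1 - \frac{m-2}{1 + \sum_{j \neq i} w_i / w_j}\right)
  = \frac{1}{2}\,(1 - 0)
  = \frac{1}{2}.
\]
```

In the first limit every ratio $w_i / w_j$ tends to zero, so the denominator tends to 1; in the second the denominator grows without bound, so the fraction vanishes.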
Also, the calculus of partial derivatives in Eq. (4) shows that the value $p_i^*$ increases with the corresponding weight $w_i$ and decreases when any other weight $w_k$ increases. In fact, one can see that, for m ≥ 3, the following inequalities hold:
$$\frac{\partial p_i^*}{\partial w_i} = \frac{(m - 2) \sum_{j \ne i} 1/w_j}{2 \left( w_i \sum_{j=1}^{m} 1/w_j \right)^2} > 0, \qquad \frac{\partial p_i^*}{\partial w_k} = - \frac{(m - 2)\, w_i / w_k^2}{2 \left( w_i \sum_{j=1}^{m} 1/w_j \right)^2} < 0 \quad (k \ne i).$$
So, what happens if there is any $w_i < \alpha^*$? Then the corresponding coordinate $p_i^*$ is negative, so the critical point, although lying in the hyperplane defined by the equation $\sum_{i=1}^{m} p_i^* = 1$, is not located in the simplex, and the solution is not feasible. In the general case with m ≥ 4 there will be m′ ≥ 3 non-null optimal proportions, as stated by Eq. (7), and m − m′ null coordinates. In some well balanced sets of weights it may happen that m′ = m, but this is a particular case, not the general one.
Now, formulas (4) and (5) may be used replacing m by m′ and calculating the optimal proportions and the maximum value of the index $D_w$ with the corresponding set of weights, with all the remaining optimal proportions being null and the respective weights discarded from the evaluation.

Sequential backward procedure
While the forward procedure previously discussed helps to check the consistency of the problem stated by the feasibility condition expressed in inequalities (6), a sequential backward procedure is more effective, as it applies Eq. (4) directly and nothing else. Adopting the same ordering $w_{(1)} \ge w_{(2)} \ge \dots \ge w_{(m-1)} \ge w_{(m)}$, begin by computing $p^*_{(m)}$. If $p^*_{(m)} > 0$, then Eq. (4) is proper, all the coordinates can be calculated directly, and Eq. (5) also applies straightforwardly. Otherwise, if $p^*_{(m)} < 0$, set $p^*_{(m)} = 0$, withdraw $w_{(m)}$ from further calculations, and compute $p^*_{(m-1)}$ with Eq. (4) modified with m′ = m − 1 and the corresponding set of weights. Proceed recursively with the same reasoning until an order (k) is found such that $p^*_{(k)} > 0$, then stop: set all null coordinates $p^*_{(k+1)} = \dots = p^*_{(m)} = 0$ and apply Eq. (4) with the dimension reset to m = k, evaluated with the corresponding weights $\{w_{(i)}\}_{i=1,\dots,k}$.
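The backward procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `backward_maximizer` and the example weights are hypothetical, the critical proportions are taken as $p_i = (w_i - \alpha)/(2 w_i)$ with $\alpha = (k - 2)/\sum_j 1/w_j$ over the k retained weights, and a proportion that is exactly zero is treated as feasible.

```python
def backward_maximizer(weights):
    """Backward sequential procedure for maximizing sum of w_i * p_i * (1 - p_i).

    Returns (proportions, maximum value), with proportions in the input order.
    Assumes every weight is strictly positive.
    """
    # Indices sorted by decreasing weight: w_(1) >= w_(2) >= ... >= w_(m).
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    k = len(order)
    while True:
        active = [weights[i] for i in order[:k]]
        alpha = (k - 2) / sum(1.0 / w for w in active)
        # Critical proportion of the smallest retained weight.
        p_last = (active[-1] - alpha) / (2.0 * active[-1])
        if p_last >= 0.0:   # assumption: an exact zero is kept as feasible
            break
        k -= 1              # discard the smallest weight and recompute

    proportions = [0.0] * len(weights)
    for i in order[:k]:
        proportions[i] = (weights[i] - alpha) / (2.0 * weights[i])
    value = sum(w * p * (1.0 - p) for w, p in zip(weights, proportions))
    return proportions, value


weights = [10.0, 10.0, 10.0, 10.0, 1.0]   # illustrative weights, not from the paper
p_opt, d_max = backward_maximizer(weights)
print(p_opt, d_max)
```

For these illustrative weights the unrestricted critical point has a negative fifth coordinate (about −0.57) and the closed-form value would be about 8.64, whereas the procedure discards the lightest weight and returns proportions of 0.25 for the four heavy components with a maximum value of 7.5. The loop always terminates, because with three or fewer retained weights every critical proportion is positive.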

Discussion
The index $D_w$ is a continuous real function defined in a compact domain and its range is $0 \le D_w \le D_w^*$, the minimum value $D_w = 0$ occurring at every vertex of the simplex $\Delta^{m-1}$. The maximum value of the index, denoted $D_w^*$, has to be evaluated verifying the feasibility condition expressed by inequalities (6) before applying Eq. (5) straightforwardly. In general, except for very specific and balanced sets of weights, the maximum point of $D_w$ will not occur in the interior of the simplex but in a k-face with 3 ≤ k < m, leading to some null optimal coordinates, as was shown by the theoretical results followed by sequential procedures and illustrated with numeric examples. Obviously, the maximum value of WGS index could also be computed as $D_w^* = \sum_{i=1}^{k} w_i p_i^* (1 - p_i^*)$, thus avoiding Eq. (5), but that still implies checking the feasibility condition, as the summing procedure applies only to positive proportions.
The optimal proportions $p_i^*$ are insensitive to a change of units: the positive linear transformation $u_i = c w_i$, c > 0, leaves the critical solution unchanged, and the value of the Lagrange multiplier is also linearly transformed to $\alpha^{**} = c \alpha^*$, so that the feasibility condition (6) remains unchanged. The optimal value of the Lagrange multiplier $\alpha^*$ defined in (3) is closely related to the harmonic mean H of the weights, since $\alpha^* = \frac{m - 2}{m} H$. How can we justify that $\alpha^*$ has numerator m − 2 instead of m? The most appealing interpretation seems to be that, as deduced when discussing result (3), for m = 2 the value $\alpha^*$ vanishes and the maximum point is fixed at $(p_1^*, p_2^*) = (0.5, 0.5)$, independent of the weights. So, m − 2 is the number of relevant weights that affect the subsequent calculation of the maximum point coordinates and the maximum value of the index.
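The invariance under a change of units and the link with the harmonic mean can be checked numerically. The sketch below assumes the critical multiplier $\alpha^* = (m - 2)/\sum_j 1/w_j$ and the critical proportions $p_i^* = (w_i - \alpha^*)/(2 w_i)$; the function names, the weights and the scale factor are hypothetical, chosen only for illustration.

```python
from statistics import harmonic_mean


def critical_alpha(weights):
    """Critical Lagrange multiplier: (m - 2) over the sum of reciprocal weights."""
    m = len(weights)
    return (m - 2) / sum(1.0 / w for w in weights)


def critical_point(weights):
    """Critical proportions p_i = (w_i - alpha) / (2 w_i)."""
    alpha = critical_alpha(weights)
    return [(w - alpha) / (2.0 * w) for w in weights]


weights = [4.0, 2.0, 1.0]            # illustrative weights, not from the paper
c = 8.0                              # arbitrary positive change of units
scaled = [c * w for w in weights]

# The critical proportions are unchanged, while the multiplier scales by c.
print(critical_point(weights), critical_point(scaled))
print(critical_alpha(weights), critical_alpha(scaled))

# Relation with the harmonic mean H of the weights: alpha* = (m - 2) / m * H.
m = len(weights)
print(critical_alpha(weights), (m - 2) / m * harmonic_mean(weights))
```

Because the feasibility condition compares each weight with the multiplier, and both scale by the same factor c, the set of retained components is unaffected by the choice of units.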
The problem addressed in this paper is particularly important when the evaluation of $D_w^*$ is to be used in further normalization assessments with range $0 \le D_w / D_w^* \le 1$, since an erroneous computation of the maximum value can induce wrong conclusions when comparing different compositional systems. Several empirical studies use the maximum value of WGS index as a reference for such normalization: besides the numeric examples of Guiasu and Guiasu (2010), such as the one relative to 10 species in two habitats with data retrieved from Jost et al. (2010), Subburayalu and Sydnor (2012) also used formula (5) when assessing street tree diversity in four Ohio communities and, probably, the feasibility condition discussed here was not checked. The weighted Gini-Simpson index continues to be mentioned (e.g., Niane et al. 2014), and the problem handled in the present article seems likely to become relevant in the near future.

Conclusions
This paper summarized an issue that seems to be relevant in the field of diversity measures: the proper evaluation of the maximum point and maximum value of the weighted Gini-Simpson index. The main literature on the subject does not refer to the feasibility condition discussed here, which can lead to wrong results in applications. New theoretical results concerning the analytical study of the critical solution are provided, such as limits and partial derivatives, and forward and backward procedures conceived to solve the issue at stake are sketched and illustrated with numeric examples.