Gini-stable Lorenz curves and their relation to the generalised Pareto distribution

We study an iterative discrete information production process (IPP) where we can extend ordered normalised vectors by new elements based on a simple affine transformation, while preserving the predefined level of inequality, $G$, as measured by the Gini index. Then, we derive the family of Lorenz curves of the corresponding vectors and prove that it is stochastically ordered with respect to both the sample size and $G$, which plays the role of the uncertainty parameter. A case study of family income data in nine countries shows a very good fit of our model. Moreover, we show that asymptotically, we obtain all, and only, Lorenz curves generated by a new, intuitive parametrisation of the finite-mean Generalised Pareto Distribution (GPD) that unifies three other families, namely: the Pareto Type II, exponential, and scaled beta ones. The family is not only ordered with respect to the parameter $G$, but also, thanks to our derivations, has a nice underlying interpretation. Our result may thus shed new light on the genesis of this family of distributions.


Introduction
The well-known information production process (IPP; see Egghe, 1990 and Egghe, 2005b, p. 8) is a mechanism in which "sources" (e.g., authors, universities) produce a series of "items" (e.g., articles, graduates) of different quality. In the "source-item" formulation, the variable of interest is typically non-negative and integer-valued for both the rank-size and rank-frequency modelling (Bertoli-Barsotti & Lando, 2019, Egghe & Waltman, 2011). In spite of this, in the informetric literature, continuous approximations are typically used because they facilitate the underlying calculations (Burrell, 2005, Egghe, 2005a). However, in this article, we consider a natural framework for empirical data stemming from a variety of applications in informetrics that entails the discrete setting.
Apart from presenting a new model of Lorenz curves $L^{(N,G)}$ of a vector of finite dimension $N$, we will be interested in studying its characteristics as the dimension diverges to infinity. In the limit as $N \to \infty$, the model becomes governed by only one parameter: the inequality index $G$. This limit defines a new parametric family of Lorenz curves $L^{(\infty,G)}$, ordered by $G$. Thus, if $G_1 < G_2$, then $L^{(\infty,G_1)}(x) \ge L^{(\infty,G_2)}(x)$ for all $x$. Interestingly, the family admits a nice characterisation result. Namely, we prove that it contains all, and only, Lorenz curves generated by a Generalised Pareto Distribution (for which the mean exists).
In order to define the model $L^{(N,G)}$, as an ancillary result, in Section 3, we introduce the notion of the Gini-stable process, i.e., a sequence of ordered and normalised vectors with a fixed (stable) value $G$ of the Gini index. In particular, we mention that $G$ is an "uncertainty parameter" of the Gini-stable process. In Section 4, we present the results concerning the limiting family of Lorenz curves. Then, we prove the aforementioned ordering properties of our curve set. To strengthen our theoretical derivations, we present a few case studies, including the cases of socioeconomic, bibliometric, informetric, and environmental data. We note that, more often than not, our model fits empirical data reasonably well.
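As is well known, the Lorenz order is defined in terms of nested Lorenz curves as follows. For any $\boldsymbol{p}, \boldsymbol{q} \in \wp^{(N)}_{+}$, $\boldsymbol{p} \le_L \boldsymbol{q}$ if and only if $L_{\boldsymbol{p}}(x) \ge L_{\boldsymbol{q}}(x)$ for all $0 \le x \le 1$.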
by the mean, $N^{-1}$ (as our vectors sum to 1), we get the average relative expected gain of each individual. Hence, the Gini index $G(\boldsymbol{p})$ can be seen as the sum of these average relative expected gains over the individuals $i = 1, 2, \ldots, N$.
Example 1. If $\boldsymbol{p} = (5/15, 4/15, 3/15, 2/15, 1/15)$ as in Fig. 1, we find the average relative expected gains: A useful alternative formula for the Gini index is: $G(\boldsymbol{p}) = \frac{2\mu - N - 1}{N - 1}$, where $\mu$ is the mean (expected) rank, that is, $\mu = \sum_{i=1}^{N} (N - i + 1)\, p_i$ (compare Arnold, 2015, p. 183). What is more, the Gini index is closely related to the Lorenz curve through the following formula with a clear geometrical meaning: $G(\boldsymbol{p}) = \frac{2N}{N-1}\, A(\boldsymbol{p})$, where $A(\boldsymbol{p}) = \int_0^1 \left(x - L_{\boldsymbol{p}}(x)\right) dx$ is the area between the Lorenz curve of the given vector and the graph of the identity line; compare Fig. 1.
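As a small numerical companion to Example 1, the sketch below computes the Gini index of the example vector and relates it to the area between the identity line and the piecewise-linear Lorenz curve. It assumes the normalised Gini index (division by $N-1$, so that the index spans $[0, 1]$); this is an assumption about formulas that are not legible in this part of the text. The function names are illustrative only.

```python
from fractions import Fraction


def gini(p):
    """Normalised Gini index of p, sorted in non-increasing order, sum(p) == 1.

    Assumption: the index is normalised by N - 1.
    """
    n = len(p)
    weighted = sum((n + 1 - 2 * i) * p_i for i, p_i in enumerate(p, start=1))
    return weighted / (n - 1)


def lorenz_points(p):
    """Vertices (k/N, share of the k smallest components), k = 0, ..., N."""
    n = len(p)
    points, total = [(Fraction(0), Fraction(0))], Fraction(0)
    for k, p_i in enumerate(sorted(p), start=1):
        total += p_i
        points.append((Fraction(k, n), total))
    return points


def area_above_lorenz(p):
    """Area between the identity line and the piecewise-linear Lorenz curve."""
    pts = lorenz_points(p)
    # Trapezoid rule is exact for a piecewise-linear curve.
    under = sum((x1 - x0) * (y0 + y1) / 2
                for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return Fraction(1, 2) - under


p = [Fraction(k, 15) for k in (5, 4, 3, 2, 1)]
N = len(p)
G = gini(p)
A = area_above_lorenz(p)
print(G, A, Fraction(2 * N, N - 1) * A)  # 1/3 2/15 1/3
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the two routes to the index can be compared with `==` rather than with a tolerance.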

Consider the affine mapping $h(\boldsymbol{p}) = (a p_1 + b, \ldots, a p_N + b)$ with $0 \le a \le 1$ and $b = \frac{1-a}{N}$, in order to have the components of $h(\boldsymbol{p})$ sum to 1. This function defines a reduce-and-distribute mapping from $\wp^{(N)}_{+}$ to $\wp^{(N)}_{+}$: it proportionally reduces each component by a factor of $a$ and distributes the same amount $b$ to every element. It is easy to see that this transformation is inequality-attenuating in that $h(\boldsymbol{p}) \le_L \boldsymbol{p}$ for every $\boldsymbol{p}$ (for a general characterisation result on inequality-attenuating functions, refer to Marshall et al., 2011, p. 727 and Arnold & Sarabia, 2018, p. 36-37). In particular, it decreases the expected gain of each individual by a constant factor. Indeed, the Gini index decreases: $G(h(\boldsymbol{p})) = a\, G(\boldsymbol{p})$.

3. The Gini-stable process and distribution

Gini-stable transformations
Let us now introduce a slightly different type of mapping: the one in which we extend the dimension of the vector, from $N$ to $N + 1$: $h(\boldsymbol{p}) = (a p_1 + b, \ldots, a p_N + b, b)$ (2), where $b = \frac{1-a}{N+1}$, in order to have the components of $h(\boldsymbol{p})$ sum to 1. The mapping given by Eq. (2) is from $\wp^{(N)}_{+}$ to $\wp^{(N+1)}_{+}$, and it may no longer have an inequality-attenuating effect; for example, as $a$ tends to 1, the Lorenz curves of $\boldsymbol{p}$ and $h(\boldsymbol{p})$ are nested. Instead, we can identify the constant $a = a_N$ (which in general depends on the dimension of the vector) uniquely determined by the constraint $G(h(\boldsymbol{p})) = G(\boldsymbol{p})$ that balances the natural inequality-attenuating effect of $h$, and the inequality-increasing effect due to the increase in the vector size.
In other words, our purpose is to determine the affine function $h$ which keeps the value $G$ of the Gini index unchanged when a new component is added to the vector.
In an economic context, the task at hand can be stated as the problem of finding the conditions under which the sum of the average expected gains in a population of $N$ individuals remains unchanged when the size of the population increases by one, from $N$ to $N + 1$. This issue is addressed by an iterative approach, which we will be referring to as the Gini-stable process.
All the above vectors have the same Gini index, $G = 1/3$.
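The iterative construction can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's own formula: it assumes the normalised Gini index (division by $N-1$), a starting vector $(1)$ of length one, and a scaling constant `a = N*G / (1 + (N-1)*G)` that is derived here by solving the Gini-preservation constraint for an extension of the form $(a p_1 + b, \ldots, a p_N + b, b)$. All function names are hypothetical.

```python
from fractions import Fraction


def gini(p):
    """Normalised Gini index (division by N - 1) of a non-increasing vector."""
    n = len(p)
    return sum((n + 1 - 2 * i) * p_i for i, p_i in enumerate(p, start=1)) / (n - 1)


def gini_stable_step(p, G):
    """Extend p (length N) to length N + 1 while keeping the Gini index at G."""
    n = len(p)
    a = n * G / (1 + (n - 1) * G)  # scaling constant (derived, see lead-in)
    b = (1 - a) / (n + 1)          # amount distributed to every component
    return [a * p_i + b for p_i in p] + [b]


def gini_stable_vector(N, G):
    """Iterate the step from the one-element vector (1) up to length N."""
    p = [Fraction(1)]
    while len(p) < N:
        p = gini_stable_step(p, G)
    return p


G = Fraction(1, 3)
for N in range(2, 6):
    p = gini_stable_vector(N, G)
    print(N, [str(x) for x in p], gini(p))
```

For $G = 1/3$ and $N = 5$, this reproduces the vector $(5/15, 4/15, 3/15, 2/15, 1/15)$ of Example 1, which suggests the sketch agrees with the process described in the text.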
It turns out (see Appendix A.1 for the derivation) that our iterative model given by Eq. (4) enjoys the analytic formula: where $k = 1, 2, \ldots, N$, and $H_N = \sum_{i=1}^{N} 1/i$ is the $N$-th harmonic number, $H_0 = 0$. Note that the case $G = 1/2$ must be considered separately, as otherwise it would lead to a division by 0.
We will call the vector $\boldsymbol{p}^{(N,G)}$ a Gini-stable distribution with parameters $N$ and $G$, denoted GSD($N$, $G$). Nevertheless, let us note that our process is completely deterministic. Hence, it should not be interpreted in probabilistic terms (without making further assumptions).
Take note of the limit cases: as $G \to 0$, we obtain the discrete uniform distribution with support $1, 2, \ldots, N$ (all components being equal), and as $G \to 1$, we get the degenerate distribution with all probability mass concentrated at the point $k = 1$.
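For the special case $G = 1/2$, the harmonic numbers mentioned above enter in a simple way. The sketch below uses the closed form $p_k = (H_N - H_{k-1})/N$, which is obtained here by solving the recursion by hand under the assumption of a normalised Gini index; it is offered as a check rather than as a restatement of the paper's Eq. (5). The names `harmonic` and `gsd_half` are hypothetical.

```python
from fractions import Fraction


def harmonic(n):
    """n-th harmonic number, with H_0 = 0."""
    return sum(Fraction(1, i) for i in range(1, n + 1))


def gsd_half(N):
    """Candidate Gini-stable vector for G = 1/2: p_k = (H_N - H_{k-1}) / N."""
    H = [harmonic(k) for k in range(N + 1)]
    return [(H[N] - H[k - 1]) / N for k in range(1, N + 1)]


def gini(p):
    """Normalised Gini index (division by N - 1) of a non-increasing vector."""
    n = len(p)
    return sum((n + 1 - 2 * i) * p_i for i, p_i in enumerate(p, start=1)) / (n - 1)


for N in (2, 3, 10):
    p = gsd_half(N)
    print(N, sum(p), gini(p))
```

Each printed row should show a total of 1 and a Gini index of 1/2, independently of $N$.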

Lorenz curves of the vectors generated by the Gini-stable process
An interesting property of the Gini-stable distribution is that it admits an explicit representation of its cumulative sums. This allows us to obtain the Lorenz curve corresponding to the vector $\boldsymbol{p}^{(N,G)}$. Namely, starting from Eq. (5), for $k = 1, 2, \ldots, N$, after some elementary transformations, we get: Based on the above, the Lorenz curves of $\boldsymbol{p}^{(N,G)}$ are defined for $k = 1, 2, \ldots, N$ as follows: with the values in-between linearly interpolating the above points as described in Section 2.
Example 3. Let us consider the Luxembourg Income Study (LIS) dataset as per Bishop et al. (1991, Table 2). It gives empirical Lorenz curves, across the deciles ($N = 10$), for the family incomes in nine different countries, $L^{(\mathrm{AU})}$, $L^{(\mathrm{CA})}$, etc. By computing the corresponding Gini indices $\hat{G}_{\mathrm{AU}}$, $\hat{G}_{\mathrm{CA}}$, …, we get the predicted models $L^{(N, \hat{G}_{\mathrm{AU}})}$, $L^{(N, \hat{G}_{\mathrm{CA}})}$, etc., using the derived formula. In Fig. 2, we see that our model fits most of the empirical curves very well: in five cases, the maximal root mean squared error (RMSE) is not greater than 0.013, and no RMSE is greater than 0.027.

Lorenz ordering for vectors of equal lengths
Let $\boldsymbol{p}^{(N,G)}$ be a vector generated by the Gini-stable process, i.e., with components given by Eq. (5), and let $L^{(N,G)}$ be its Lorenz curve, as defined by Eq. (7). The following proposition shows that, for any fixed $N$, an increase of the parameter $G$ corresponds to an increase in the Lorenz ordering (see Fig. 3).
Hence, Proposition 1 is equivalent to stating that for $G_1 < G_2$, it holds $\boldsymbol{p}^{(N,G_1)} \prec \boldsymbol{p}^{(N,G_2)}$, where $\prec$ is the majorisation ordering. This implies that $G$ is an "uncertainty parameter" of the Gini-stable distribution GSD($N$, $G$) in the sense of Hickey (1983), in that the Gini-stable distribution GSD($N$, $G_1$) possesses a degree of randomness greater than that of the Gini-stable distribution GSD($N$, $G_2$) (compare Marshall et al., 2011, p. 755).

Lorenz ordering for vectors of unequal lengths
The Lorenz order allows us to compare inequality in populations of different sizes. In our model, for fixed $G$, an increase of $N$ corresponds to an increase in the Lorenz ordering (see Fig. 4).
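Both ordering statements (with respect to $G$ for equal lengths, and with respect to $N$ for a fixed $G$) can be checked numerically on small cases. The sketch below is illustrative only: it assumes the normalised Gini index and the scaling constant `a = n*G / (1 + (n-1)*G)` derived from the Gini-preservation constraint, neither of which is quoted from the paper's equations. Because both Lorenz curves are piecewise linear, comparing them at the union of their vertices suffices.

```python
from fractions import Fraction


def gini_stable_vector(N, G):
    """Gini-stable vector of length N (assumed construction, see lead-in)."""
    p = [Fraction(1)]
    for n in range(1, N):
        a = n * G / (1 + (n - 1) * G)
        b = (1 - a) / (n + 1)
        p = [a * p_i + b for p_i in p] + [b]
    return p


def lorenz(p, x):
    """Piecewise-linear Lorenz curve of p evaluated at x in [0, 1]."""
    n = len(p)
    asc = sorted(p)
    k = min(int(x * n), n - 1)  # index of the segment containing x
    return sum(asc[:k]) + asc[k] * (x * n - k)


def lorenz_dominates(p, q):
    """True if L_p(x) >= L_q(x) at every vertex of either curve."""
    grid = {Fraction(k, len(p)) for k in range(len(p) + 1)}
    grid |= {Fraction(k, len(q)) for k in range(len(q) + 1)}
    return all(lorenz(p, x) >= lorenz(q, x) for x in grid)


N = 6
low = gini_stable_vector(N, Fraction(1, 4))
high = gini_stable_vector(N, Fraction(3, 4))
print(lorenz_dominates(low, high))  # equal lengths, G increases

G = Fraction(2, 5)
print(lorenz_dominates(gini_stable_vector(N, G), gini_stable_vector(N + 1, G)))
```

Exact rational arithmetic avoids spurious failures at points where the two curves touch.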

The limiting case as 𝑵 → ∞
4.1. Family of Lorenz curves, $L^{(\infty,G)}$
As $N$ grows indefinitely, the vector $\boldsymbol{p}^{(N,G)}$ becomes an infinite sequence of numbers $(p_1, p_2, \ldots)$. Then, the Lorenz curves take an asymptotic form that depends on only one parameter, $G$. Namely, we can prove (see Appendix A.4) that they can be expressed for $x \in [0, 1]$ by: See Fig. 4 again for an illustration.
Remark 2. We easily find that $2 \int_0^1 \left(x - L^{(\infty,G)}(x)\right) dx = G$, as expected in the continuous case.
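Remark 2 can be illustrated numerically. Since the limiting curve is not legible at this point of the text, the sketch below uses an expression derived independently: the Lorenz curve of a Generalised Pareto Distribution whose Gini index equals $G$, namely $\left[G\left(1 - (1-x)^{(1-G)/G}\right) - (1-G)\,x\right] / (2G - 1)$ for $G \ne 1/2$ and $x + (1-x)\ln(1-x)$ for $G = 1/2$. Treat this closed form, and the names `lorenz_limit` and `gini_of_limit_curve`, as assumptions of the sketch.

```python
import math


def lorenz_limit(x, G):
    """Assumed asymptotic Lorenz curve (GPD Lorenz curve with Gini index G)."""
    if abs(G - 0.5) < 1e-12:
        # Exponential case; the value at x = 1 is the limit 1.
        return 1.0 if x >= 1.0 else x + (1.0 - x) * math.log(1.0 - x)
    power = (1.0 - x) ** ((1.0 - G) / G)
    return (G * (1.0 - power) - (1.0 - G) * x) / (2.0 * G - 1.0)


def gini_of_limit_curve(G, n=20000):
    """Twice the area between the identity line and the curve (midpoint rule)."""
    h = 1.0 / n
    mids = ((k + 0.5) * h for k in range(n))
    return 2.0 * h * sum(x - lorenz_limit(x, G) for x in mids)


for G in (0.2, 0.5, 0.8):
    print(G, round(gini_of_limit_curve(G), 4))
```

The numerically integrated value should agree with the input $G$ up to the discretisation error of the midpoint rule.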
As an extension of Propositions 1 and 2, we have the result below.
Proof. The result (i) follows from the observation that $\partial L^{(\infty,G)}(x) / \partial G < 0$. The result (ii) follows from Proposition 2. □

We also include the DBLPv12 dataset that consists of citation counts to over 4 million papers in the field of computer and information sciences sourced from Tang et al. (2008), as well as the citation data from the RePEc database, which features almost 2 million papers in economics (only papers cited at least once were included). Moreover, we consider the following information about users and questions on the Stack Overflow site: the number of times each user profile and question was viewed, and the number of up- and down-votes each user has cast (only items greater than 0 were included, as new users cannot vote until they reach a certain reputation level).
Fig. 6 presents the empirical and fitted (based on the sample Gini indices) Lorenz curves. Let us note that the sample sizes $N$ of all datasets are of a considerable order of magnitude. Therefore, in all 15 cases, we can rely on the asymptotic formula (8) with little loss in precision.
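How quickly the finite-sample curve approaches its limit can be explored with a short experiment. The sketch below is an illustration under assumptions: the finite vectors are built with the scaling constant `a = n*G / (1 + (n-1)*G)` (derived from the Gini-preservation constraint for a normalised Gini index), and the limiting curve is taken to be the GPD Lorenz curve with Gini index $G$. It reports the largest vertical gap at the vertices $k/N$; no claim is made about the rate for values of $G$ other than those tried.

```python
import math


def gini_stable_vector(N, G):
    """Floating-point Gini-stable vector of length N (assumed construction)."""
    p = [1.0]
    for n in range(1, N):
        a = n * G / (1.0 + (n - 1) * G)
        b = (1.0 - a) / (n + 1)
        p = [a * p_i + b for p_i in p] + [b]
    return p


def lorenz_limit(x, G):
    """Assumed asymptotic Lorenz curve (GPD Lorenz curve with Gini index G)."""
    if abs(G - 0.5) < 1e-12:
        return 1.0 if x >= 1.0 else x + (1.0 - x) * math.log(1.0 - x)
    power = (1.0 - x) ** ((1.0 - G) / G)
    return (G * (1.0 - power) - (1.0 - G) * x) / (2.0 * G - 1.0)


def max_gap(N, G):
    """Largest |L^(N,G)(k/N) - L^(inf,G)(k/N)| over the vertices k = 1..N."""
    total, gap = 0.0, 0.0
    for k, p_k in enumerate(sorted(gini_stable_vector(N, G)), start=1):
        total += p_k
        gap = max(gap, abs(total - lorenz_limit(k / N, G)))
    return gap


for N in (10, 100, 1000):
    print(N, max_gap(N, 1.0 / 3.0), max_gap(N, 0.5))
```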
Our model is a very good fit (RMSE<0.02) in six cases, but is unsatisfactory (RMSE>0.045) in three instances.

Generalised Pareto distributions
Now the problem is to identify the probability distribution with CDF $F$ whose Lorenz curve coincides with that given by $L^{(\infty,G)}$, $0 < G < 1$. Let us recall (see Gastwirth, 1971) that: $L(x) = \frac{1}{\mu} \int_0^x F^{-1}(t)\, dt$ (9), where $\mu$ is the expected value, $\mu = \int_0^1 F^{-1}(t)\, dt$. The sought CDF $F$ can be obtained by inverting the quantile function $F^{-1}$ calculated by taking the derivative of $L^{(\infty,G)}(x)$ w.r.t. $x$.
L. M. Gagolewski, G. Siudem et al.
Fig. 6. The empirical and fitted curves for a wide range of datasets from different domains, including bibliometrics, informetrics, and environmental sciences; see Examples 4 and 6. Due to each $N$ being large, we can replace $L^{(N,G)}$ with $L^{(\infty,G)}$ with little loss in precision.
where , and $\mathbf{1}$ is the indicator function. Let us note that for $G < 1/2$ it holds − > 0. In this case, the support is finite.
The above corresponds to the CDF of the Pickands Generalised Pareto Distribution (GPD; Pickands, 1975). As is well known, this distribution family unifies three different models (see Hosking & Wallis, 1987, Arnold, 2015, p. 11, and Johnson et al., 1994, p. 614). Namely, depending on the value of the parameter, here are the possible cases for a random variable $X$ with the CDF given by Eq. (10): Here,  < 1 and so the support of the random variable becomes bounded. By taking $\mu \frac{1-G}{1-2G} > 0$ as the free scale parameter, we find the formula for the CDF: with the corresponding density function: where $B$ denotes the beta function. This can be viewed as a scaled Beta distribution of the first kind with shape parameters 1 and $\frac{G}{1-2G}$ (Marshall & Olkin, 2007, p. 494) or a uniform distribution with scale and frailty parameters (Marshall & Olkin, 2007, p. 672).
Note that under the Pareto Type II distribution, the Gini index cannot fall below the minimum level of 1∕2, which is consistent with our derivations (compare also Biró et al., 2023).
In conclusion, the family of Lorenz curves $L^{(\infty,G)}$, $0 < G < 1$, contains the Lorenz curves generated by all and only Pickands' Generalised Pareto Distributions for which the mean exists (as otherwise the Lorenz curve is not defined).
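The link between the Gini index and the GPD can be made concrete in the standard Pickands parametrisation with shape $\xi$ and scale $\sigma$ (location 0). The mapping used below, $\xi = 2 - 1/G$ and $\sigma = \mu (1 - G)/G$, is derived here by matching the Gini index and the mean; it is an assumption of this sketch and not necessarily the form of Eq. (10). The check relies on the identity $G = 1 - \frac{2}{\mu} \int_0^1 (1 - u)\, F^{-1}(u)\, du$, which follows from Gastwirth's definition of the Lorenz curve.

```python
import math


def gpd_params(G, mu=1.0):
    """Shape and scale of the GPD with mean mu and Gini index G (assumed mapping)."""
    return 2.0 - 1.0 / G, mu * (1.0 - G) / G


def gpd_quantile(u, xi, sigma):
    """Quantile function of the GPD with location 0."""
    if abs(xi) < 1e-12:
        return -sigma * math.log1p(-u)  # exponential case
    return sigma / xi * ((1.0 - u) ** (-xi) - 1.0)


def gpd_cdf(x, xi, sigma):
    """CDF of the GPD with location 0."""
    if x <= 0.0:
        return 0.0
    if abs(xi) < 1e-12:
        return 1.0 - math.exp(-x / sigma)
    z = 1.0 + xi * x / sigma
    if z <= 0.0:  # beyond the finite upper endpoint (xi < 0)
        return 1.0
    return 1.0 - z ** (-1.0 / xi)


def gini_from_quantile(xi, sigma, mu, n=20000):
    """G = 1 - (2/mu) * integral of (1 - u) * Q(u) over (0, 1), midpoint rule."""
    h = 1.0 / n
    integral = h * sum((1.0 - (k + 0.5) * h) * gpd_quantile((k + 0.5) * h, xi, sigma)
                       for k in range(n))
    return 1.0 - 2.0 * integral / mu


for G in (0.3, 0.5, 0.7):
    xi, sigma = gpd_params(G)
    print(G, xi, sigma, round(gini_from_quantile(xi, sigma, 1.0), 4))
```

Negative, zero, and positive shapes correspond to $G < 1/2$ (bounded support), $G = 1/2$ (exponential), and $G > 1/2$ (Pareto Type II), in line with the three cases discussed above.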
Note that these formulae generalise some of the cases studied in Balakrishnan et al. (2010).See also Siudem et al. (2022), where a different parametrisation of the Pareto Type II distribution was obtained in the limit.
Example 5. Continuing the countrywise family income study (Example 3), let us compare our new finite-sample model $L^{(N,G)}$ against its asymptotic version $L^{(\infty,G)}$, being an alternative parametrisation of GPD. Furthermore, let us consider the Lotkaian model (e.g., Egghe, 2005a), which in our context is equivalent to the power law (e.g., Clauset et al., 2009) and the classic Pareto (Type I) distribution (e.g., Arnold, 2015). Its Lorenz curve is given by $L^{(\alpha)}_{\mathrm{P1}}(x) = 1 - (1 - x)^{1 - 1/\alpha}$, which we can also reparameterise based on the Gini index knowing that $G = 1/(2\alpha - 1)$, leading to the family of curves we denote by $L^{(G)}_{\mathrm{P1}}$. Table 1 gives the root mean squared errors between the empirical and fitted Lorenz curves in the case where the estimation is solely based on the Gini index. Furthermore, in parentheses, we list the minimal possible RMSEs (where the model parameter is optimised numerically so as to minimise this error metric).
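The reparameterisation of the Lotkaian (Pareto Type I) Lorenz curve by the Gini index can be sketched as follows. Inverting $G = 1/(2\alpha - 1)$ gives $\alpha = (1 + G)/(2G)$; the function names are illustrative, and the numerical integration is only a sanity check that the reparameterised curve indeed has Gini index $G$.

```python
def alpha_from_gini(G):
    """Invert G = 1 / (2 * alpha - 1)."""
    return (1.0 + G) / (2.0 * G)


def lorenz_pareto1(x, alpha):
    """Lorenz curve of the Pareto Type I distribution with shape alpha > 1."""
    return 1.0 - (1.0 - x) ** (1.0 - 1.0 / alpha)


def lorenz_pareto1_gini(x, G):
    """The same curve, parameterised directly by the Gini index G."""
    return lorenz_pareto1(x, alpha_from_gini(G))


def numeric_gini(curve, n=20000):
    """Twice the area between the identity line and a Lorenz curve (midpoint rule)."""
    h = 1.0 / n
    return 2.0 * h * sum((k + 0.5) * h - curve((k + 0.5) * h) for k in range(n))


G = 0.4
print(alpha_from_gini(G), numeric_gini(lambda x: lorenz_pareto1_gini(x, G)))
```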
First, we note that in each case, the finite-sample model $L^{(N,G)}$ is better than the remaining ones. The obtained RMSEs are very close to the optimal ones. This indicates that the sample Gini index is a well-behaved estimator of the theoretical $G$ parameter. Second, we note that the $L^{(\infty,G)}$-based (GPD) model is better than the Lotkaian one, sometimes considerably. Also in this case, using the sample Gini index as an estimator of the $G$ parameter gives quite satisfactory results.

In Example 6, the sample sizes are large, hence $L^{(N,G)} \simeq L^{(\infty,G)}$. We shall thus only compare the asymptotic model $L^{(\infty,G)}$ with the Lotkaian one, $L^{(G)}_{\mathrm{P1}}$ (as given in Example 5). From Table 2 we see that where the $L^{(\infty,G)}$ model is a good fit, the RMSE in the case where $G$ is estimated by means of the sample Gini index is close to the optimal one. Moreover, in the three cases where our model does not fit data well (terrorism, surnames, metabolic), $L^{(G)}_{\mathrm{P1}}$ with $G$ taken from data is actually quite satisfactory.

Conclusion
In the context of a general information production process, which is a formal mechanism in which sources produce items (Egghe, 1990 and Egghe, 2005b, p. 8), the Gini-stable process can be viewed as a model where in each step, a new source is added under the special condition of keeping the concentration, as measured by the Gini index, unchanged. In this sense, the Gini-stable model derived herein can also be interpreted as a process that describes how production changes over time in the sense invoked by Burrell (1992).
It is interesting to note that a model mathematically equivalent to Eq. (5) for $G > 1/2$ (minus the scaling of the vectors' elements) was already featured in Siudem et al. (2020) and in Gagolewski et al. (2022); see Appendix A.2 for more details. Their parametrisation is defined via a positive parameter explaining the preferential attachment mechanism, i.e., an author's specific tendency to produce articles more or less attractive to citations. Also notice that the recent paper by Biró et al. (2023) suggests that $G > 1/2$ is the most prevalent case for bibliometric data; compare Table 2.
In the literature, many parametric families of continuous Lorenz curves are known.In the present article, we derived an explicit expression for the Lorenz curve of a probability vector generated by the Gini-stable process defined by Eq. ( 5), in its genuine discrete formulation.Our formula can also be seen as a model for an empirical Lorenz curve, that is, for Lorenz curves of empirical distributions of finite samples.In this sense, the family presented here is a unique model in its genre.Within the context of a general information production process, such Lorenz curves represent the cumulative proportion of total production of all sources as ordinate against the cumulative proportion of sources as abscissa (in particular, one obtains the Lorenz curve of the productivity if the sources are arranged in increasing order of productivity).It is important to emphasise the fact that we refer here to the case of the size-frequency domain (Egghe & Waltman, 2011).
We proved that the Lorenz curve gives a complete ordering relationship among the probability vectors of the Gini-stable process, with respect to its parameters $N$ and $G$, where the former represents the number of sources and the latter the value of the Gini index within its natural range (0, 1). In particular, probability vectors of unequal lengths turn out to be comparable with respect to the Lorenz ordering for suitable values of parameter $G$.
We also derived and studied the asymptotic form of the Lorenz curve when $N$ goes to infinity, obtaining a new parametric family of Lorenz curves that is ordered with respect to the parameter $G$. Up to a change in scale, each Lorenz curve in this family characterises an absolutely continuous distribution on (0, ∞), with finite mean, which we have shown to belong to the Pickands Generalised Pareto Distribution family with a new parametrisation that relies directly on $G$. This result sheds new light on the genesis of this family of distributions. Importantly, the total ordering of our Lorenz curve family is a crucial property thereof because, in general, the Lorenz curves can cross (the Lorenz order is only a partial order).
Lastly, we showed that our model fits a few informetric datasets well. The advantage of our parametrisation is that it only relies on the sample Gini index, which is very easy to compute. No sophisticated curve fitting is necessary.
In a follow-up work (Bertoli-Barsotti et al., 2023), we derive the formulae for the Bonferroni, Hoover, and De Vergottini indices in our new model so that we can recreate their values based on the value of the Gini index.Future research will involve the study of the processes preserving other inequality, evenness, fairness, or spread measures; compare, e.g., (Beliakov et al., 2016, Bu, 2022, Chan, 2022, Gagolewski, 2015, Prathap, 2023, Prathap & Rousseau, 2023).It will be interesting to compare our finite-sample model with other parametric families of Lorenz curves, e.g., ones reviewed in (Chotikapanich, 2008).

Declaration of competing interest
The authors certify that they have no affiliations with or involvement in any organisation or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

A.2. Proof of Proposition 1
It is sufficient to prove that for any $N$ and $G_1 < G_2$, it holds $\Sigma_k^{(N,G_1)} < \Sigma_k^{(N,G_2)}$ for all $k = 1, \ldots, N - 1$. In our setting, this holds if and only if all the partial derivatives $\partial \Sigma_k^{(N,G)} / \partial G$ are nonnegative. We have: We have  ,  , () , where $\psi(x) = \Gamma'(x)/\Gamma(x)$ is the digamma function. Since the gamma function is non-negative for positive arguments, and the first derivative of the digamma function $\psi'(x)$ is decreasing in $x$, the sign of the derivative depends on the term $2G - 1$. for $k = 1, \ldots, N - 1$. Since the Lorenz curve is defined as a linear interpolation between these points, the ordering of interest is preserved everywhere, and the proof is complete.
In our case, we find: The Leimkuhler order is  ≤   if and only if   () ≤   () for all 0 ≤  ≤ 1.As the Leimkuhler order is equivalent to the Lorenz order, the thesis of Proposition 2: By construction, the Leimkuhler curve  (,) is defined by straight-line segments with slopes given by the ratio between the component of the vector  (,) and the mean, that is, for  = 0, 1, … ,  : for  = 1 2 .To better illustrate the behaviour of the Leimkuhler curve as the number of the line segments grows from  to ( + 1), we will prove the following result about their slopes.
(a) We will show that the following inequalities hold: For $G = 1/2$, the inequality (12) is equivalent to $(H_N - H_{k-1}) < (H_{N+1} - H_{k-1})$, which is clearly true. On the other hand, for $G \ne 1/2$, the inequality (12) is equivalent to: ) , where  = 1∕ > 1. There are two cases. If ), the above inequality is equivalent to: which is in turn equivalent to: that is,  + 1 >  +  − 1, which holds true under the above condition on the parameter .
), the inequality to be proved holds if and only if: ) , that easily leads to  +  − 1 >  + 1, that holds true under the condition  > 2.
More precisely, the thesis (b) is that: ) , where the last terms in both equations are equal to  (+1,)
We prove the thesis of Proposition 2 by contradiction. Assume that the Leimkuhler curves of $\boldsymbol{p}^{(N,G)}$ and $\boldsymbol{p}^{(N+1,G)}$ intersect. Then, there would be at least one vertex of the former not belonging to the latter, but this contradicts (b).

A.5. Derivation of the CDF 𝐹
To obtain the CDF $F$ given by Eq. (10), we start with the derivation of the formula for $\partial L^{(\infty,G)}(x) / \partial x$. Taking the derivative of the function given by Eq. (8), we get: From Eq. (9), we see that $F^{-1}(x)$ is given simply by $F^{-1}(x) = \mu\, \partial L^{(\infty,G)}(x) / \partial x$

Furthermore
denote its $k$-th cumulative sum, with, for the brevity of the anticipated formulae, $\Sigma_0 = 0$. Assume we are given a set of numbers $\{p_1, p_2, \ldots, p_N\}$, which are featured as the components of some $\boldsymbol{p} \in \wp^{(N)}_{+}$. Then, the corresponding Lorenz curve is the function $L_{\boldsymbol{p}}(x)$, $0 \le x \le 1$, obtained by linear interpolation of the points (

Fig. 5
Fig. 5 depicts the limiting Lorenz curves for different values of the Gini index, $G$. Quite remarkably, these curves cover the entire family of distributions known as Pickands' Generalised Pareto Distribution, as we shall show next.

Table 1
Comparison of three models of Lorenz curves parameterised by the Gini index (where it is taken from each sample) in terms of the root mean squared error for the countrywise family income data (Examples 3 and 5). In round brackets, the minimum achievable RMSE for each model is given. The new finite-sample model outperforms the other ones. In this case, using the sample Gini index to estimate $G$ works almost equally well as

Table 2
Comparison of the asymptotic (GPD) and Lotkaian (power-law) models on a wide variety of datasets from different domains (Examples 4 and 6).
Example 6. Let us go back to Example 4, where we have studied 15 datasets from various domains, including informetrics and environmental sciences. In each case, the sample size $N$ is large, hence