On the MLE of the Waring distribution

The two-parameter Waring distribution is an important heavy-tailed discrete distribution that extends the well-known Yule-Simon distribution and provides more flexibility when modelling data. The commonly used EFF (Expectation-First Frequency) method for parameter estimation can only be applied when the first moment exists, and it uses only the information in the expectation and the first frequency, so it is less efficient than the maximum likelihood estimator (MLE). However, the MLE may not exist for some samples. We apply the profile method to the log-likelihood function and derive necessary and sufficient conditions for the existence of the MLE of the Waring parameters. We use extensive simulation studies to compare the MLE and EFF methods, and to conduct a goodness-of-fit comparison with the Yule-Simon distribution. We also apply the Waring distribution to fit an insurance data set.


Introduction
Power-law distributions are a class of heavy-tailed univariate distributions that describe a quantity whose probability decreases as a power of its magnitude; they are widely used in social science, network science and so on. Two commonly used discrete examples are the Zipf distribution and the Yule-Simon distribution (or Yule distribution). Zipf's law was found by the linguist Zipf when studying the words in a linguistic corpus: the frequency of a word is proportional to r^(−d), where r is the corresponding rank and d is some positive value. The Yule-Simon distribution is a highly skewed discrete probability distribution with a very long upper tail, named after Udny Yule and Herbert Simon, winner of the 1978 Nobel Prize in economics, with probability mass function

P(X = k) = α Γ(k) Γ(α + 1) / Γ(α + k + 1), α > 0, k = 1, 2, 3, . . . ,

where Γ(·) is the Gamma function and α is the parameter. Yule (1925) first proposed the distribution, applying it to model the number of species in biological genera. Simon (1955) rediscovered the 'Yule' distribution later, using it to examine city populations, income distributions, and word frequencies in publications (Mills, 2017). In Price (1965, 1976), Price, a famous American scientist, found that the number of citations of a publication follows the Yule distribution, when linking published papers with the papers they cite to form a directed network of scientific literature. It is a cumulative advantage distribution based on the mechanism of 'success breeds success'. The two-parameter Waring distribution is a generalization of the Yule-Simon distribution, which provides more flexibility than the commonly used one-parameter Zipf distribution, Yule-Simon distribution, negative binomial distribution, etc.
The Waring distribution can describe a wide variety of phenomena in actuarial science, network science, and library and information science, such as the number of shares purchased by each customer, the number of traffic accidents, the number of nodes in internet connections, and the frequency of authors who publish a certain number of papers (Huete-Morales & Marmolejo-Martín, 2020; Panaretos & Xekalaki, 1986; Seal, 1952; Xekalaki, 1983). The probability mass function of X ∼ W(α, β) is given by

P(X = k) = α Γ(α + β) Γ(β + k − 1) / (Γ(β) Γ(α + β + k)), α > 0, β > 0, k = 1, 2, 3, . . . ,   (1)

where α, β are the parameters of the Waring distribution. It is easy to prove that the Waring distribution is a heavy-tailed distribution, with a polynomial tail of order α + 1. We can also derive that E(X) = 1 + β/(α − 1) if α > 1, and var(X) = αβ(α + β − 1)/((α − 1)²(α − 2)) if α > 2. The Yule-Simon distribution is the special case of the Waring distribution with β = 1.
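As a quick numerical check of this parameterization, the pmf can be evaluated on the log scale and its mean compared with 1 + β/(α − 1). The following is a minimal Python sketch (the parameter values are chosen only for illustration, and the pmf formula is our rendering of the Waring law via Gamma functions):

```python
import math

def waring_pmf(k, alpha, beta):
    # P(X = k) = alpha * G(alpha+beta) * G(beta+k-1) / (G(beta) * G(alpha+beta+k)),
    # with G the Gamma function; computed via lgamma to avoid overflow.
    log_p = (math.log(alpha) + math.lgamma(alpha + beta) + math.lgamma(beta + k - 1)
             - math.lgamma(beta) - math.lgamma(alpha + beta + k))
    return math.exp(log_p)

alpha, beta = 3.0, 2.0                      # illustrative values with alpha > 2
total = sum(waring_pmf(k, alpha, beta) for k in range(1, 100001))
mean = sum(k * waring_pmf(k, alpha, beta) for k in range(1, 100001))
print(round(total, 3), round(mean, 3))      # ~1.0 and ~1 + beta/(alpha-1) = 2.0
```

Because the tail is of polynomial order α + 1 = 4, truncating the sums at 100,000 loses only a negligible amount of mass here; for α close to 1 a much longer truncation would be needed.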
Suppose that x₁, . . . , xₙ is a random sample from the Waring distribution W(α, β), and let m = max{x₁, . . . , xₙ} be the largest observed value, n_k be the number of observations equal to k, k = 1, . . . , m, with Σ_{k=1}^m n_k = n. Noting that Γ(β + k − 1)/Γ(β) = ∏_{j=0}^{k−2} (β + j) and Γ(α + β + k)/Γ(α + β) = ∏_{j=0}^{k−1} (α + β + j), based on the data {x₁, . . . , xₙ} we can easily derive the likelihood function as

L_n(α, β) = ∏_{k=1}^m [ α ∏_{j=0}^{k−2} (β + j) / ∏_{j=0}^{k−1} (α + β + j) ]^{n_k}.   (2)

Then the log-likelihood is

ℓ_n(α, β) = n log α + Σ_{k=1}^m n_k [ Σ_{j=0}^{k−2} log(β + j) − Σ_{j=0}^{k−1} log(α + β + j) ].

Taking partial derivatives with respect to α and β leads to the following maximum likelihood equations

(1/n) ∂ℓ_n(α, β)/∂α = 1/α − Σ_{k=1}^m p_k Σ_{j=0}^{k−1} 1/(α + β + j) = 0,   (3)

(1/n) ∂ℓ_n(α, β)/∂β = Σ_{k=1}^m p_k [ Σ_{j=0}^{k−2} 1/(β + j) − Σ_{j=0}^{k−1} 1/(α + β + j) ] = 0,   (4)

where p_k = n_k/n with p_m > 0. We first consider Equation (3), which can be treated as the conditional maximum likelihood equation of α given a positive β. When m = 1, that is, when all observed values equal 1, we have (1/n) ∂ℓ_n(α, β)/∂α = 1/α − 1/(α + β) = β/(α(α + β)) > 0, so there is no solution to the likelihood equation. We therefore focus on the situation where m ≥ 2.
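The score expressions in (3) and (4) can be validated against finite-difference derivatives of the log-likelihood. A small self-contained Python check, using hypothetical toy counts of our own choosing:

```python
import math

def loglik(alpha, beta, counts):
    # ell_n(alpha, beta) = sum_k n_k * log P(X = k), computed via lgamma.
    ll = 0.0
    for k, nk in counts.items():
        lp = (math.log(alpha) + math.lgamma(alpha + beta) + math.lgamma(beta + k - 1)
              - math.lgamma(beta) - math.lgamma(alpha + beta + k))
        ll += nk * lp
    return ll

def score(alpha, beta, counts):
    # Per-observation derivatives, i.e. the left-hand sides of (3) and (4).
    n = sum(counts.values())
    s_a, s_b = 1.0 / alpha, 0.0
    for k, nk in counts.items():
        pk = nk / n
        s_a -= pk * sum(1.0 / (alpha + beta + j) for j in range(k))
        s_b += pk * (sum(1.0 / (beta + j) for j in range(k - 1))
                     - sum(1.0 / (alpha + beta + j) for j in range(k)))
    return s_a, s_b

counts = {1: 50, 2: 20, 3: 10, 5: 3}        # hypothetical sample with m = 5
a, b, eps = 1.8, 1.2, 1e-6
n = sum(counts.values())
fd_a = (loglik(a + eps, b, counts) - loglik(a - eps, b, counts)) / (2 * eps * n)
fd_b = (loglik(a, b + eps, counts) - loglik(a, b - eps, counts)) / (2 * eps * n)
s_a, s_b = score(a, b, counts)
print(abs(fd_a - s_a) < 1e-5, abs(fd_b - s_b) < 1e-5)
```

The agreement between the analytic and numerical derivatives confirms that the score equations are internally consistent with the log-likelihood.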
In the following, we first consider the conditional maximum likelihood Equation (3) given any positive β, which can be regarded as a generalization of the Yule-Simon case, and we prove that the results for the Yule-Simon distribution (β = 1) also hold for any β > 0. More specifically, given a positive β, we denote the conditional MLE of α as α(β). According to (3), α(β) satisfies

1/α(β) = Σ_{k=1}^m p_k Σ_{j=0}^{k−1} 1/(α(β) + β + j).   (5)

For notational ease, we also introduce some auxiliary notation, and present the properties of α(β) in the following Proposition 2.1.
Proposition 2.1: Let α(β) be defined as in (5). We have the following properties.
Next we discuss the existence of the MLE of (α, β). By (3), we obtain the conditional MLE curve y = α(β). Combining (3) with (4), any joint solution must also satisfy

1/α = Σ_{k=2}^m p_k Σ_{j=0}^{k−2} 1/(β + j),   (8)

and we denote by h(β) the value of α solving (8). If the curves y = h(β) and y = α(β) intersect at some β > 0, there is a solution to the equation system (3)-(4). Later we prove that the intersection is unique and is the MLE of the Waring distribution.
To discuss whether y = h(β) and y = α(β) intersect at some β > 0, we first present the properties of h(β) in the following proposition.

Proposition 2.2: Let h(β) be defined as in (8). We have the following properties.

Property 3 * . h(β) is an increasing and concave function of β.
Based on Properties 1 and 4 of Proposition 2.1 and Properties 1* and 4* of Proposition 2.2, it is easy to derive that α(β) > h(β) when β is small. Therefore, if we can prove that α(β) < h(β) for some large β, then by the continuity of the two functions there must exist a solution to the equation system (3)-(4). This is the key idea in checking the existence of the MLE.
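The curve-crossing idea can be prototyped numerically: solve the conditional equation for α(β) by bisection and compare it with h(β) over a grid of β. The sketch below uses toy counts of our own, and the formulas follow our reading of (3), (5) and (8):

```python
import math

counts = {1: 50, 2: 20, 3: 10, 5: 3}        # hypothetical sample
n = sum(counts.values())
p = {k: nk / n for k, nk in counts.items()}

def score_alpha(alpha, beta):
    # Left-hand side of (3): 1/alpha - sum_k p_k sum_{j=0}^{k-1} 1/(alpha+beta+j).
    return 1.0 / alpha - sum(pk * sum(1.0 / (alpha + beta + j) for j in range(k))
                             for k, pk in p.items())

def alpha_profile(beta):
    # Bisection for alpha(beta): the score is positive near 0 and negative for
    # large alpha whenever the sample mean exceeds 1 (i.e. m >= 2).
    lo, hi = 1e-10, 1e10
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if score_alpha(mid, beta) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

def h(beta):
    # h(beta) = [sum_{k>=2} p_k sum_{j=0}^{k-2} 1/(beta+j)]^(-1).
    s = sum(pk * sum(1.0 / (beta + j) for j in range(k - 1))
            for k, pk in p.items() if k >= 2)
    return 1.0 / s

for beta in (0.01, 0.5, 1.0, 2.0, 5.0):
    print(beta, round(alpha_profile(beta), 4), round(h(beta), 4))
```

A sign change of α(β) − h(β) over the grid signals an intersection, hence an MLE; no sign change over a wide grid suggests the boundary case discussed below Theorem 2.5.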
Before presenting the main result, we first give two important lemmas.
Lemma 2.3 provides an alternative route to establishing the MLE based on the profile method, which is simpler than the conventional proof involving a complicated calculation of the Hessian matrix. Lemma 2.4: Assume that t₁(x) and t₂(x) are increasing and concave functions for x > 0, that the curves y = t₁(x) and y = t₂(x) intersect only finitely many times, and that the number of solutions to t_i(x) = Z(x) is finite for both i = 1, 2, where Z(x) is any polynomial or fractional function of x. Further assume that conditions (C) and (D) hold. Then, we have t₁(x) ≥ t₂(x) for all x ∈ (a, ∞).
Lemma 2.4 provides a general method to compare two increasing and concave functions without requiring their explicit forms; it not only simplifies the comparison of α(β) and h(β), but also has independent value in other applications.
Based on Propositions 2.1-2.2 and Lemmas 2.3-2.4, we summarize the existence of the MLE in the following Theorem 2.5.
Theorem 2.5: Suppose that {x₁, . . . , xₙ} is a random sample from the Waring distribution W(α, β), and m = max{x₁, . . . , xₙ}. Let p_k = n_k/n be the proportion of observations with x_i = k, with p_m > 0.

To derive the necessary and sufficient conditions for the existence of the MLE, we start from the conditional MLE of α for a given β, because it is easier to discuss the possible solutions through the intersection of the two curves determined by the estimating equations. Numerically, since we only have two parameters to estimate, it is quite efficient to solve the system with the 'optim' function in R.
Remark 2.1: Unlike the existing literature, which directly applies optimization algorithms to the log-likelihood function without verifying the existence of the MLE (Huete-Morales & Marmolejo-Martín, 2020; Rivas & Campos, 2021), we present necessary and sufficient conditions for the existence of the MLE of the Waring parameters, which is the first such attempt. It is easy to see that the sign of α_intercept − h_intercept equals the sign of an explicit function of the sample frequencies; for m = 2 this sign is such that the MLE of (α, β) does not exist, while for m ≥ 3 it depends on the data, and we can check the sign of the corresponding expression for a general m. For m = 2, 3, we also carefully check the existence of real-valued solutions to the equation system (3)-(4), and find that the sign of α_intercept − h_intercept indeed determines the existence of the MLE. Readers may contact the authors for the detailed verification.
One more comment on Theorem 2.5 is as follows. If α_intercept < h_intercept, or α_intercept = h_intercept with d_α < d_h, the MLE of the Waring parameters is a finite vector. The Waring distribution then fits the data better than the Yule-Simon distribution if the estimated β departs from 1, and comparably if the estimated β is close to 1. If α_intercept > h_intercept, or α_intercept = h_intercept with d_α > d_h, the likelihood function is maximized on the boundary of the parameter region, i.e., at infinity. Therefore, if we directly apply an optimization algorithm to the likelihood function, the output may be far from the true parameters; for example, in the real data application we get α̂ = 1,687,133.2 and β̂ = 675,078.4 for the group with central age 67.5 (ages 65 to 70), where in fact the MLE does not exist. In such cases, we can use the EFF method if the EFF estimates are on reasonable scales, and the Waring distribution will still fit the data better than the Yule-Simon distribution.
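The boundary phenomenon is easy to reproduce. For a hypothetical sample with m = 2 (say n₁ = 90, n₂ = 10), the log-likelihood keeps increasing along the ray α = 10t, β = t, so an unguarded optimizer drifts toward huge, meaningless estimates. A Python illustration (our own toy data and path):

```python
import math

def loglik_per_obs(alpha, beta, counts):
    # Average log-likelihood (1/n) * ell_n(alpha, beta), computed via lgamma.
    n = sum(counts.values())
    ll = 0.0
    for k, nk in counts.items():
        lp = (math.log(alpha) + math.lgamma(alpha + beta) + math.lgamma(beta + k - 1)
              - math.lgamma(beta) - math.lgamma(alpha + beta + k))
        ll += nk * lp
    return ll / n

counts = {1: 90, 2: 10}                     # m = 2: the MLE does not exist
vals = [loglik_per_obs(10.0 * t, t, counts) for t in (1.0, 10.0, 100.0, 1000.0)]
print([round(v, 6) for v in vals])          # strictly increasing along the ray
```

Here P(X = 1) = α/(α + β) = 10/11 is constant along the ray while P(X = 2) = αβ/((α + β)(α + β + 1)) keeps growing in t, so the supremum is approached only at infinity.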

Comparison of MLE and EFF
In this section, we present numerical studies comparing the MLE and the EFF method for Waring parameter estimation.
The Waring-distributed observations are generated by the function rWARING in the R package gamlss.dist. We need to mention that rWARING is parameterized by (μ, σ), and its probability mass function is that of Y = X − 1, where X ∼ W(α, β) as in (1). Comparing the two probability mass functions, we find that we need to add 1 to the values generated by rWARING, and the relationship between the parameters is α = 1 + 1/σ and β = μ/σ. Thus rWARING automatically restricts α > 1, and the EFF estimator exists. We consider 20 combinations of (α, β), where α = 2, 1.5, 1.2, 1.1, 1.05 and β = 0.5, 1, 1.5, 2, with sample sizes n = 100, 200, 400 and 1000. We generate 500 replicates for each case.
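For readers without R at hand, Waring variates can also be generated from the classical mixture representation: X is geometric on {1, 2, . . .} with success probability p, where p ∼ Beta(α, β). A Python sketch of such a sampler (our own substitute for rWARING; the seed and parameter values are arbitrary):

```python
import math
import random

def rwaring(n, alpha, beta, seed=1):
    # Waring(alpha, beta) on {1, 2, ...}: draw p ~ Beta(alpha, beta), then return
    # the trial index of the first success of a geometric with success prob p.
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        p = rng.betavariate(alpha, beta)
        u = 1.0 - rng.random()              # u in (0, 1], avoids log(0)
        out.append(1 + int(math.log(u) / math.log(1.0 - p)))
    return out

x = rwaring(20000, alpha=3.0, beta=2.0)
mean = sum(x) / len(x)
print(min(x) >= 1, abs(mean - 2.0) < 0.1)   # support check; E(X) = 1 + 2/(3-1) = 2
```

Integrating the geometric pmf p(1 − p)^(k−1) against the Beta(α, β) density recovers exactly the pmf in (1), which is what justifies this two-stage sampler.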
Probably due to the parameter specification and the restricted data-generating process of rWARING, we find that α_intercept < h_intercept is satisfied in all cases, except for two replicates in the case α = 2, β = 0.5 with the small sample size n = 100. By Remark 2.1, α_intercept < h_intercept is equivalent to condition (9), which can be written in terms of moments under the empirical distribution E_n. When 1 < α ≤ 2, E(X) exists while E(X²) diverges; thus (9) is very likely to hold, and the MLE exists. However, in real applications it is possible that α_intercept > h_intercept (Section 4). As mentioned immediately after Theorem 2.5, we use the 'optim' function to solve for the MLE after verifying its existence. We tried four methods to initialize the parameters: (i) small values, (α(0), β(0)) = (1.1, 0.1); (ii) large values, (α(0), β(0)) = (2.5, 3); (iii) the true parameter values plus a random N(0, 0.2²) perturbation, restricted so that α(0) ≥ 1.1 and β(0) ≥ 0.1; (iv) the EFF estimates. Extensive numerical studies show that these four initialization methods yield almost the same results, which indicates that the optimization is not sensitive to the initial values. Therefore, we use the EFF estimator for initialization if EFF produces positive estimates; otherwise, we set the initial values to (α(0), β(0)) = (1.1, 0.1).
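In Python, the same two-parameter search can be prototyped without 'optim' by profiling out α: for each β on a grid, solve (3) for α(β) by bisection and evaluate the log-likelihood, then take the best β. A rough sketch with hypothetical counts (the grid bounds are arbitrary, and existence should be verified first, as discussed above):

```python
import math

counts = {1: 120, 2: 45, 3: 20, 4: 9, 6: 4, 9: 2}   # hypothetical sample
n = sum(counts.values())
p = {k: nk / n for k, nk in counts.items()}

def loglik(alpha, beta):
    return sum(nk * (math.log(alpha) + math.lgamma(alpha + beta)
                     + math.lgamma(beta + k - 1) - math.lgamma(beta)
                     - math.lgamma(alpha + beta + k))
               for k, nk in counts.items())

def alpha_profile(beta):
    # Conditional MLE alpha(beta): bisection on the alpha-score, Equation (3).
    def s(a):
        return 1.0 / a - sum(pk * sum(1.0 / (a + beta + j) for j in range(k))
                             for k, pk in p.items())
    lo, hi = 1e-10, 1e10
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if s(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Profile search over a beta grid; the maximizer approximates the MLE when it
# lies in the interior of the grid (a boundary hit suggests non-existence).
betas = [0.1 * i for i in range(1, 201)]
best_beta = max(betas, key=lambda b: loglik(alpha_profile(b), b))
best_alpha = alpha_profile(best_beta)
print(round(best_alpha, 3), round(best_beta, 3))
```

Reducing the outer search to one dimension in this way tends to be more robust than a raw two-dimensional optimization when the likelihood surface is nearly flat along a ridge.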
Among all the cases, the EFF method produces a negative estimate in only one replicate, in the case α = 2, β = 0.5 with the small sample size n = 100; in another replicate the denominator P̂(X = 1) · x̄ − 1 is exactly 0, so the estimator does not exist. These two replicates are deleted for fair comparison. Since the parameters are on different scales (especially β, whose maximal value is four times the minimal one), for fair comparison we report the rBias (relative bias, defined as the bias divided by the true parameter value) and rStd (relative standard error, defined as the standard error divided by the true parameter value) in Tables 1 and 2. We find that, when the sample size is as small as n = 100, both MLE and EFF yield relatively poor estimates, with standard errors larger than or close to 50% of the true parameter value, which indicates that it is challenging to estimate the parameters accurately with small samples. We therefore focus on the comparison of MLE and EFF for n ≥ 200. First, the MLE always has much smaller biases than EFF. Although the rBias of EFF decreases as the sample size increases, it increases as the true α decreases, and is still around 10% even when n = 1000 for α ≤ 1.2; the rBias of the MLE decreases from 6%-7% to around 1% as n increases from 200 to 1000, regardless of the true α. Second, the MLE has rStd comparable to EFF for medium sample sizes (n = 200 and 400), but smaller rStd for n = 1000. Overall, the MLE performs better than EFF when α/β is not large or the sample size is large enough. The performance of EFF is relatively better when α/β is large, e.g., α/β ≥ 2. Our explanation is that, since P(X = 1) = α/(α + β) = (α/β)/(α/β + 1), if α/β is large then P(X = 1) is close to 1, so the first frequency used by EFF carries relatively more information than in the case with small α/β.
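The EFF closed form itself follows from matching E(X) = 1 + β/(α − 1) and P(X = 1) = α/(α + β) to the sample mean x̄ and the empirical first frequency p̂₁: solving the two equations gives α̂ = p̂₁(x̄ − 1)/(p̂₁x̄ − 1) and β̂ = α̂(1 − p̂₁)/p̂₁ (our own algebra; note that the denominator p̂₁x̄ − 1 is exactly the quantity that can vanish or go negative, as observed above). A sketch:

```python
def eff_estimate(x):
    # EFF: match the population mean and first frequency to their sample versions.
    n = len(x)
    xbar = sum(x) / n
    p1 = sum(1 for v in x if v == 1) / n
    denom = p1 * xbar - 1.0
    if p1 == 0.0 or denom <= 0.0:
        return None                          # EFF fails, as can happen for small n
    alpha = p1 * (xbar - 1.0) / denom
    beta = alpha * (1.0 - p1) / p1
    return alpha, beta

# A sample whose mean (2.0) and first frequency (0.6) equal the population
# values of W(3, 2) recovers alpha = 3, beta = 2 up to rounding:
x = [1] * 6 + [2, 3, 4, 5]
a, b = eff_estimate(x)
print(round(a, 6), round(b, 6))              # 3.0 2.0
```

The closed form makes clear why EFF requires α > 1: otherwise the first moment does not exist and x̄ has no population counterpart to match.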

Goodness-of-fit comparison with Yule-Simon distribution
In this section, we compare the Waring distribution and the Yule-Simon distribution in terms of goodness-of-fit to the data. We fix α = 1.5 and generate data from the Waring distribution with β = 1, 1.5, 2; the data are generated with the function rWARING as in Section 3.1. When β = 1, the model is exactly the Yule-Simon distribution, and when β departs from 1 the Yule-Simon assumption is violated. We consider 500 replicates with sample sizes n = 100, 200, 400, 1000. To initialize the optimization for the MLE of the Yule-Simon parameter α, we use the first frequency P(X = 1) = α/(α + 1), that is, α(0) = P̂(X = 1)/(1 − P̂(X = 1)). Figure 1 presents the box-plots of the likelihood ratio statistic

T_n = 2[ℓ_n(α̂, β̂) − ℓ*_n(α̂*)],

where ℓ_n(α̂, β̂) is the log-likelihood of the Waring fit evaluated at the MLE (α̂, β̂), and ℓ*_n(α̂*) is the log-likelihood of the Yule-Simon fit evaluated at the MLE α̂*. If the true β equals 1, the Yule-Simon distribution is correct, and it is easy to prove that T_n is asymptotically χ²₁ distributed; if the true β departs from 1, the Yule-Simon distribution is incorrect, and T_n will be large. The box-plots in Figure 1 confirm that the Waring distribution fits the data similarly to the Yule-Simon distribution when β = 1, and much better when β departs from 1. We further report, in Table 3, the proportion of replicates in which the Yule-Simon distribution is rejected at nominal level 0.05.


Real data application
Seal (1947, 1952) provided data on insurance shares for 12 different age periods. The original data concern male lives assured in a British life office, maintained for administrative purposes. The analysed data set is a random subset: every tenth name in the list was included until a total of 2000 was reached. The sampled lives are classified according to year of birth and the number of policies in force, and each group is represented by its central age. Seal (1952) fitted the data using the discrete Pareto law, with probability mass function

P(X = k) = k^(−d)/ζ(d), k = 1, 2, . . . ,

where ζ(d) is a normalization constant, and the parameter d is estimated by the MLE. Here we apply the Waring distribution to fit the data. For the age periods centred at 17.5 and 22.5, the maximal number of shares is 2; the EFF method leads to negative parameter estimates, while the MLE is proved not to exist, as in Remark 2.1. We focus on the remaining 10 groups, with central ages from 27.5 to 72.5. Among these 10 groups, for the group with central age 67.5, we have n = 45 and n₁ = 33, n₂ = 7, n₃ = 4, n₄ = 1, and it is easy to verify that (9) does not hold. Thus the MLE does not exist. If we directly apply the optimization algorithm, we get α̂ = 1,687,133.2, β̂ = 675,078.4, which is meaningless. However, if we use the EFF method, we get α̂ = 11, β̂ = 4, and the resulting fit is reasonably good. Thus, we need to be careful in using the MLE. Table 4 summarizes the comparison of the actual distribution with the discrete Pareto fit and the Waring fits via EFF and MLE; we find that the Waring distribution fits the data slightly better than the discrete Pareto law.
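As an illustration of the discrete Pareto benchmark, the exponent d can be fitted by one-dimensional maximum likelihood on the central-age-67.5 counts quoted above (n₁ = 33, n₂ = 7, n₃ = 4, n₄ = 1). The Python sketch below approximates ζ(d) by a truncated sum plus an integral tail correction; the grid bounds are our own arbitrary choices:

```python
import math

counts = {1: 33, 2: 7, 3: 4, 4: 1}           # central age 67.5, n = 45
n = sum(counts.values())

def zeta_approx(d, K=1000):
    # zeta(d) ~ sum_{k<=K} k^-d + integral_K^inf x^-d dx, accurate for d > 1.
    return sum(k ** -d for k in range(1, K + 1)) + K ** (1.0 - d) / (d - 1.0)

def loglik_pareto(d):
    # ell(d) = -d * sum_k n_k log k - n * log zeta(d).
    return (-d * sum(nk * math.log(k) for k, nk in counts.items())
            - n * math.log(zeta_approx(d)))

grid = [1.5 + 0.005 * i for i in range(1000)]   # d in [1.5, 6.5)
d_hat = max(grid, key=loglik_pareto)
print(round(d_hat, 3))
```

Because the log-likelihood depends on the data only through Σ n_k log k, a simple grid search is entirely adequate for this one-parameter fit.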

Discussion
To fit a given data set with the Waring distribution, we need to verify the existence condition for the MLE of the Waring parameters before using the MLE. If the existence condition is not satisfied, the likelihood is maximized on the boundary of the parameter region, i.e., at infinity. Therefore, if we directly apply an optimization algorithm to the likelihood function, the output may be far from the true parameters; for example, we get α̂ = 1,687,133.2, β̂ = 675,078.4 for the group with central age 67.5, where in fact the MLE does not exist. In such cases, we can use the EFF method if the EFF estimates are on reasonable scales. Based on the simulation studies and the real data analysis, we find that when the sample size or the maximum observed value is small, the MLE is less likely to exist, and when the sample size and the maximum observed value are large, the MLE is more likely to exist. Nevertheless, we always need to verify the existence condition for the MLE.

References
Xekalaki, E. (1983). The univariate generalized Waring distribution in relation to accident theory: Proneness, spells or contagion? Biometrics, 39(4), 887-895. https://doi.org/10.2307/2531324
Xekalaki, E. (1985). Factorial moment estimation for the bivariate generalized Waring distribution. Statistical Papers, 26(1), 115-129. https://doi.org/10.1007/BF02932525
Yule, G. U. (1925). A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society B, 213, 21-87.

Appendices
The appendix contains some useful lemmas and technical proofs.

Appendix 1. Some useful lemmas
Lemma A.1: Define g(x) = [ Σ_{i=1}^k a_i/(x + b_i) ]^{−1}, where a₁, . . . , a_k are positive and b₁, . . . , b_k are nonnegative. Then g(x) is an increasing and concave function.
Proof: Write S(x) = Σ_{i=1}^k a_i/(x + b_i), so that g(x) = 1/S(x). It is easy to derive that g′(x) = −S′(x)/S(x)² > 0, since S′(x) = −Σ_{i=1}^k a_i/(x + b_i)² < 0. Moreover, the Cauchy-Schwarz inequality gives 2S′(x)² ≤ S(x)S″(x), which indicates that g″(x) = (2S′(x)² − S(x)S″(x))/S(x)³ ≤ 0. The proof is completed.

Proof of Lemma 2.4:
We use proof by contradiction. If the conclusion is not correct, then there exists x₁ > a such that t₁(x₁) < t₂(x₁). By assumption (D), the curves y = t₁(x) and y = t₂(x) intersect again after (x₁, t₁(x₁)); i.e., there exists x₂ > x₁ such that t₁(x₂) = t₂(x₂), t₁(x) < t₂(x) for x ∈ (x₁, x₂) and t₁(x) > t₂(x) for x > x₂ (we suppose there is only one such x₂; otherwise, we consider the largest intersection). According to assumption (D), take a point x* ∈ (δ*₄, ∞) (which is of course greater than x₂), use (x₂, t₁(x₂)) as the starting point, and draw the ray through (x*, t₁(x*)). Let x* diverge to infinity, so that the point (x*, t₁(x*)) moves along the curve y = t₁(x). Since t₁(x) is increasing and concave, the ray through (x*, t₁(x*)) tilts downwards around the starting point (x₂, t₁(x₂)). By assumption (C), as x* → ∞ the slope of the ray converges to some constant c*. Thus the limit of the rays is a ray with starting point (x₂, t₁(x₂)) and slope c*, denoted as L, and the curve y = t₁(x) lies above L. Note that the starting point of L, (x₂, t₁(x₂)), is on the curve y = t₂(x). By assumption (D), there exists an x* which satisfies that the curve y = t₂(x) intersects L at (x*, t₂(x*)) and y = t₂(x) lies below L for x ∈ (x*, x* + δ*₅) with some positive δ*₅. Without loss of generality, we assume that x₂ is such a point, that is, y = t₂(x) lies below L for x ∈ (x₂, x₂ + δ*₅). Through the intersection (x₂, t₁(x₂)), we draw the tangent line to the curve y = t₂(x). If this tangent line coincides with the ray L, then take another point x** ∈ (x₂, x₂ + δ*₅) and draw the tangent line to y = t₂(x) through (x**, t₂(x**)). Since y = t₂(x) is increasing and concave, if the tangent line of y = t₂(x) through (x₂, t₁(x₂)) coincides with the ray L, the tangent line through (x**, t₂(x**)) does not coincide with L.
Note that the curve y = t₁(x) is above L, while y = t₂(x) is below the tangent line (a concave curve always lies below its tangent lines), which in turn lies below the ray L (by the above discussion, the tangent line that does not coincide with L must lie below it). Therefore, lim_{x→∞} t₂(x)/x < lim_{x→∞} t₁(x)/x, which contradicts assumption (C).
Based on Lemma A.2, a tedious calculation yields