A generalized mixture Pareto distribution and estimation of its parameters by the EM algorithm for complete and right-censored data

A new mixture generalized Pareto distribution is introduced, and some of its attributes are explored. The maximum likelihood method and the expectation-maximization (EM) algorithm are applied to estimate the parameters for complete and right-censored data. In a simulation study, the bias, absolute bias and mean squared error of the maximum likelihood estimator are compared with those of the EM estimator. The results show that the absolute bias and mean squared error of the EM estimator are smaller than the corresponding values for the maximum likelihood estimator. Finally, to illustrate its usefulness, the model is applied to describe real data sets.


I. INTRODUCTION
The Pareto distribution is a power-law distribution that was originally used to describe the distribution of wealth. It is also useful for describing observations from scientific, geophysical, actuarial, social and quality-control phenomena, among many other fields. The Pareto model may be applied to situations where an equilibrium is found in the distribution of "small" to "large" values. Examples include: the sizes of files transferred over the internet by TCP/IP, with many small files and few large ones; hard disk drive error rates, with many small error rates and few large ones; the sizes of human settlements, with many small values corresponding to hamlets and villages and few large values corresponding to cities; and the volumes of oil reserves in oil fields, with many small fields and few large ones. Among many studies in this field, Burroughs and Tebbens [1] analyzed observations from earthquakes and forest fire areas, and Schroeder et al. [2] fitted data on disk-drive sector errors to this distribution.
One generalized Pareto distribution, the GP(α, β, θ) distribution, can be defined by the probability density function (pdf)

f(x; α, β, θ) = [Γ(α+θ)/(Γ(α)Γ(θ))] β^α x^(θ−1)/(β+x)^(α+θ), x > 0, (1)

where α > 0, β > 0 and θ > 0; see Beirlant et al. [18] and Wiborg [19]. When α → ∞ while λ = α/β is held fixed, this model tends to the gamma distribution with pdf

g(x) = λ^θ x^(θ−1) e^(−λx)/Γ(θ), x > 0.

In this paper, we propose a new model based on a mixture of the GP distribution (1). The proposed model extends mixtures of the exponential, gamma, or Pareto distributions and has sufficient flexibility to describe many real situations. The paper is organized as follows. In Section II, the new model is defined, and some of its statistical and reliability attributes are studied. In Section III, the parameters of the model are estimated for complete and right-censored data by the maximum likelihood estimator (MLE) and the EM algorithm. The behavior of the estimators of the parameters is investigated by a simulation study in Section IV. In Section V, the proposed model is fitted to real data sets to show its applicability.
II. THE GENERALIZED PARETO MIXTURE MODEL

The generalized Pareto mixture (GPM) distribution, GPM(α, β, γ), is defined as the two-component mixture of GP(α, β, 1) and GP(α, β, 2) with mixing proportions γ/(γ+1) and 1/(γ+1), that is, with pdf

f(x) = [γ/(γ+1)] f₁(x) + [1/(γ+1)] f₂(x), x > 0, α, β, γ > 0,

where f₁ and f₂ denote the pdfs of GP(α, β, 1) and GP(α, β, 2), respectively. Since the GP model tends to the gamma distribution, a special limiting case of the GPM is the quasi Lindley distribution studied by Shanker [20]. For a random lifetime X, the most important function in reliability engineering and survival analysis is the reliability function R(x) = P(X ≥ x), which gives the probability that an object works (survives) at least until a specified time x. The reliability function of the GPM is

R(x) = [β/(β+x)]^α [γ(β+x) + β + (α+1)x] / [(γ+1)(β+x)].

Proposition 1. When α > k, the k-th moment of GPM(α, β, γ) is finite and equal to

E(X^k) = β^k k! Γ(α−k)(γ+k+1) / [Γ(α)(γ+1)].

For α ≤ k, it is infinite.
Proof. Because GPM(α, β, γ) is a mixture of X₁ ∼ GP(α, β, 1) and X₂ ∼ GP(α, β, 2), we have

E(X^k) = [γ/(γ+1)] E(X₁^k) + [1/(γ+1)] E(X₂^k).

But

E(X₁^k) = ∫₀^∞ x^k αβ^α/(β+x)^(α+1) dx. (7)

It is easy to check that for α ≤ k, (7) is infinite, but for α > k, it can be simplified as follows:

E(X₁^k) = β^k k! Γ(α−k)/Γ(α). (8)

Using a similar approach, we can check that for α ≤ k, E(X₂^k) is infinite, and for α > k, we have

E(X₂^k) = β^k (k+1)! Γ(α−k)/Γ(α). (9)

Then, the result follows by (7), (8) and (9).
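To make the model concrete, the following R sketch codes the pdf and cdf of the GPM from the mixture expressions above. The helper names dgpm and pgpm are ours for illustration; no package is assumed.

```r
# Density of GPM(alpha, beta, gam): mixture of GP(alpha, beta, 1)
# and GP(alpha, beta, 2) with weights gam/(gam+1) and 1/(gam+1).
dgpm <- function(x, alpha, beta, gam) {
  f1 <- alpha * beta^alpha / (beta + x)^(alpha + 1)
  f2 <- alpha * (alpha + 1) * beta^alpha * x / (beta + x)^(alpha + 2)
  (gam * f1 + f2) / (gam + 1)
}

# Cumulative distribution function, F(x) = 1 - R(x), with R(x) as above.
pgpm <- function(x, alpha, beta, gam) {
  R <- (beta / (beta + x))^alpha *
    (gam * (beta + x) + beta + (alpha + 1) * x) /
    ((gam + 1) * (beta + x))
  1 - R
}

# Numerical check of Proposition 1 for k = 1 (needs alpha > 1), e.g.
# integrate(function(x) x * dgpm(x, 3, 2, 1.5), 0, Inf) should match
# beta * (gam + 2) / ((alpha - 1) * (gam + 1)) = 1.4.
```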
The quantile function q(p) = F⁻¹(p) of GPM(α, β, γ) has no closed form and can be numerically computed by solving the following equation:

[β/(β+q(p))]^α [γ(β+q(p)) + β + (α+1)q(p)] / [(γ+1)(β+q(p))] = 1 − p. (10)

When the corresponding moments exist, Pearson's moment coefficient of skewness of a random variable X is defined to be

γ₁ = E[(X − µ)³]/σ³,

and the kurtosis of X is

γ₂ = E[(X − µ)⁴]/σ⁴,

where µ = E(X), and σ² is the variance of X.
Moreover, the skewness and kurtosis of a distribution can be described in terms of the quantile function. MacGillivray [21] suggested the following relation for skewness:

ρ(u) = [q(1−u) + q(u) − 2q(1/2)] / [q(1−u) − q(u)],

where u ∈ (0, 0.5). When u = 0.25, it reduces to Bowley's measure of skewness, Bowley [22]. Moreover, Moors [23] introduced a kurtosis measure in terms of the quantile function:

T = [q(7/8) − q(5/8) + q(3/8) − q(1/8)] / [q(6/8) − q(2/8)].
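Since q(p) solves (10) numerically, the quantile-based measures are straightforward to compute. A minimal R sketch, reusing pgpm() from the code above (qgpm is an illustrative name, and the parameter values are arbitrary):

```r
# Quantile function q(p): solve F(q) - p = 0 on (0, infinity);
# pgpm(0) = 0 and pgpm tends to 1, so a sign change is guaranteed.
qgpm <- function(p, alpha, beta, gam) {
  sapply(p, function(pp)
    uniroot(function(x) pgpm(x, alpha, beta, gam) - pp,
            lower = 0, upper = 1e9)$root)
}

# Bowley's skewness (MacGillivray's measure at u = 0.25) and Moors'
# kurtosis for, e.g., GPM(3, 2, 1.5):
qq <- function(p) qgpm(p, 3, 2, 1.5)
bowley <- (qq(0.75) + qq(0.25) - 2 * qq(0.5)) / (qq(0.75) - qq(0.25))
moors  <- (qq(7/8) - qq(5/8) + qq(3/8) - qq(1/8)) / (qq(6/8) - qq(2/8))
```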
In the economics literature, for a cumulative distribution function (cdf) F with mean µ, the Lorenz curve is

L(p) = µ_p/µ, where µ_p = ∫₀^p q(u) du, p ∈ [0, 1],

and provides a graphical representation of wealth inequality. In fact, L(p) shows the proportion of the overall income or wealth held by the 100 × p percent of people with the lowest income or wealth. The Lorenz curve is convex and joins the two points (0, 0) and (1, 1). For an ideal society where every person has identical income, it is the straight line joining these points. For GPM(α, β, γ), µ is not finite for α ≤ 1, but µ_p is finite, so L(p) is zero and does not describe the wealth distribution. Fortunately, Prendergast and Staudte [24] have proposed alternatives in terms of the quantile function, which replace µ by the median of the distribution and µ_p by q(p/2). Precisely,

L̃(p) = p q(p/2)/q(1/2).

A. RELIABILITY MEASURES
The failure rate (FR), mean residual life (MRL) and p-quantile residual life (p-QRL) concepts play important roles in describing the dynamic attributes of lifetime variables and have been studied from different perspectives in the reliability and survival analysis literature. The FR function of GPM(α, β, γ) is

h(x) = f(x)/R(x) = α[γ(β+x) + (α+1)x] / [(β+x)(γ(β+x) + β + (α+1)x)].
For a distribution with reliability function R, the MRL function is

m(x) = E(X − x | X > x) = (1/R(x)) ∫ₓ^∞ R(t) dt.

The following proposition gives the form of the MRL function for the proposed model.

Proposition 2. For α > 1, the MRL function of GPM(α, β, γ) is

m(x) = (β+x)[(γ+α+1)(β+x) − (α−1)β] / [(α−1)(γ(β+x) + β + (α+1)x)].
The p-QRL function of a distribution with reliability function R is defined to be

q_p(x) = q(1 − p̄R(x)) − x,

where p̄ = 1 − p, and q() is the quantile function defined by (10). Similar to the quantile function, the p-QRL function of the GPM has no closed form and should be numerically computed.
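Given the helpers above, the numerical computation is a one-liner; a sketch under the same assumptions as before:

```r
# p-quantile residual life at time x: q(1 - (1-p) * R(x)) - x,
# reusing pgpm() and qgpm() from the earlier sketches.
qrl <- function(p, x, alpha, beta, gam) {
  Rx <- 1 - pgpm(x, alpha, beta, gam)   # reliability at x
  qgpm(1 - (1 - p) * Rx, alpha, beta, gam) - x
}
```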

III. ESTIMATION OF THE PARAMETERS

A. THE MLE FOR COMPLETE DATA
Suppose that x_1, x_2, ..., x_n represent realizations from GPM(α, β, γ). The log-likelihood function of the parameters is

l(α, β, γ) = n ln α + nα ln β − n ln(γ+1) − (α+2) Σ_{i=1}^n ln(β+x_i) + Σ_{i=1}^n ln[γ(β+x_i) + (α+1)x_i], (15)

and the log-likelihood equations are

∂l/∂α = n/α + n ln β − Σ_{i=1}^n ln(β+x_i) + Σ_{i=1}^n x_i/[γ(β+x_i) + (α+1)x_i] = 0,
∂l/∂β = nα/β − (α+2) Σ_{i=1}^n 1/(β+x_i) + γ Σ_{i=1}^n 1/[γ(β+x_i) + (α+1)x_i] = 0,
∂l/∂γ = −n/(γ+1) + Σ_{i=1}^n (β+x_i)/[γ(β+x_i) + (α+1)x_i] = 0.

We calculate the MLE by directly maximizing the log-likelihood function (15) or by solving the likelihood equations.
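Direct maximization can be sketched in R with the built-in optim function and the Nelder-Mead method, as used later in Section IV; dgpm() is the illustrative helper defined in Section II, and mle_gpm is our name:

```r
# Negative log-likelihood (15) for complete data; optim() minimizes,
# so we return -l and reject non-positive parameter proposals.
negloglik <- function(par, x) {
  if (any(par <= 0)) return(Inf)
  -sum(log(dgpm(x, par[1], par[2], par[3])))
}

mle_gpm <- function(x, start) {
  optim(start, negloglik, x = x, method = "Nelder-Mead")
}
```

Running the optimizer from several random starting points, as described in Section IV, guards against convergence to a local maximum.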
Let l = ln f(X; Θ), where Θ = (α, β, γ). Then, the Fisher information matrix can be applied to obtain the asymptotic variance of the MLE. Here, it is defined as follows:

K = [K_{rs}], K_{rs} = −E(∂²l/∂θ_r ∂θ_s), r, s = 1, 2, 3.

The elements of this matrix are complicated expressions and should be numerically computed. For an iid random sample X_i, i = 1, 2, ..., n from GPM(α₀, β₀, γ₀), the MLE (α̂, β̂, γ̂) weakly converges to the multivariate normal N((α₀, β₀, γ₀), n⁻¹K⁻¹), where K⁻¹ is the inverse of the information matrix.

B. EM ALGORITHM FOR COMPLETE DATA
Assume that X_i, i = 1, 2, ..., n, is an iid random sample from GPM(α, β, γ). In the EM algorithm, an indicator latent random variable Z_i is considered, which determines whether X_i has been drawn from GP(α, β, 1) or GP(α, β, 2). Precisely,

P(Z_i = 1) = γ/(γ+1), P(Z_i = 2) = 1/(γ+1), (17)

and

X_i | Z_i = j ∼ GP(α, β, j), j = 1, 2.

Let Θ = (α, β, γ) for a brief representation. The likelihood function can be written in the following form:

L(Θ; x, z) = ∏_{i=1}^n ∏_{j=1}^2 [P(Z_i = j) f_j(x_i)]^{I(z_i = j)},
where the indicator I(z_i = j) is equal to 1 when z_i = j and 0 otherwise, and f_j denotes the pdf of GP(α, β, j), i.e.,

f₁(x) = αβ^α/(β+x)^(α+1), f₂(x) = α(α+1)β^α x/(β+x)^(α+2). (19)

The log-likelihood function is l(Θ; x, z) = ln L(Θ; x, z), which is simplified in (32); see Appendix A. The EM algorithm is an iterative method that involves two steps: expectation (E) and maximization (M).
In the E step, we compute the expectation of the complete-data log-likelihood given the observed data, with the latent variables integrated out using their conditional distribution under the current estimate of the parameters. In the M step, the parameters are re-estimated in the current iteration by maximizing the expectation prepared in the E step. The iterative process continues until an iteration no longer noticeably improves the expectation.

The E step
Suppose that the estimate of the parameters, Θ_t, is known at iteration t. Then, the conditional distribution of Z_i can be computed by the well-known Bayes theorem; it is simplified in Appendix A, (33). After the simplification, we have

p_{i1,t} = P(Z_i = 1 | x_i, Θ_t) = γ_t f₁(x_i; Θ_t) / [γ_t f₁(x_i; Θ_t) + f₂(x_i; Θ_t)],

and p_{i2,t} = 1 − p_{i1,t}. These probabilities are known as membership probabilities at iteration t and are used to construct the expectation function Q(Θ|Θ_t) = E_{Z|x,Θ_t}[l(Θ; x, Z)]. The simplified expression in Appendix A, (34), shows that the expectation is a sum of two expressions, one of which depends on α and β, and the other depends on γ. We can thus write

Q(Θ|Θ_t) = Q₁(α, β|Θ_t) + Q₂(γ|Θ_t), (20)

where

Q₁(α, β|Θ_t) = Σ_{i=1}^n [p_{i1,t} ln f₁(x_i; α, β) + p_{i2,t} ln f₂(x_i; α, β)]

and

Q₂(γ|Θ_t) = −n ln(γ+1) + ln γ Σ_{i=1}^n p_{i1,t}.

The M step

In this step, the parameters at iteration t+1 are estimated by maximizing Q(Θ|Θ_t) in terms of Θ. Specifically,

Θ_{t+1} = argmax_Θ Q(Θ|Θ_t).

By (20), this problem can be reduced to two separate maximization problems as follows:

(α_{t+1}, β_{t+1}) = argmax_{α,β} Q₁(α, β|Θ_t) (23)

and

γ_{t+1} = argmax_γ Q₂(γ|Θ_t).

The maximization problem (23) has no analytical solution and should be numerically solved. However, by solving the equation ∂Q₂(γ|Θ_t)/∂γ = 0, we obtain

γ_{t+1} = Σ_{i=1}^n p_{i1,t} / Σ_{i=1}^n p_{i2,t}.

The iterative process can be terminated if, for a predefined small ε > 0, Q(Θ_{t+1}|Θ_{t+1}) < Q(Θ_t|Θ_t) + ε.
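The two steps translate into a compact R sketch; em_gpm is our illustrative name, and a fixed iteration count is used, as adopted later in Section IV:

```r
# Complete-data EM for the GPM. E step: membership probabilities
# p_{i1,t}; M step: numerical maximization of Q1 over (alpha, beta)
# and the closed-form update of gamma.
em_gpm <- function(x, theta, iters = 20) {
  f1 <- function(x, a, b) a * b^a / (b + x)^(a + 1)
  f2 <- function(x, a, b) a * (a + 1) * b^a * x / (b + x)^(a + 2)
  for (t in seq_len(iters)) {
    a <- theta[1]; b <- theta[2]; g <- theta[3]
    # E step: P(Z_i = 1 | x_i, theta_t); the common 1/(g+1) cancels.
    p1 <- g * f1(x, a, b) / (g * f1(x, a, b) + f2(x, a, b))
    # M step, part 1: maximize Q1 numerically (minimize -Q1).
    Q1 <- function(ab) {
      if (any(ab <= 0)) return(Inf)
      -sum(p1 * log(f1(x, ab[1], ab[2])) +
           (1 - p1) * log(f2(x, ab[1], ab[2])))
    }
    ab <- optim(c(a, b), Q1, method = "Nelder-Mead")$par
    # M step, part 2: closed-form update of gamma.
    theta <- c(ab, sum(p1) / sum(1 - p1))
  }
  theta
}
```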

C. THE MLE FOR RIGHT-CENSORED DATA
Now, suppose that we have a right-censored iid random sample from GPM(α, β, γ). The random variable X_i is said to be censored from the right by a censorship random variable C_i when X_i > C_i, so we only know that the event time is greater than C_i. Thus, the observations are (t_i, d_i), i = 1, ..., n, where t_i = min(x_i, c_i) and d_i = I(x_i ≤ c_i) indicates a failure time (d_i = 1) or a censoring time (d_i = 0). The log-likelihood function is

l(Θ; t, d) = Σ_{i=1}^n [d_i ln f(t_i) + (1 − d_i) ln R(t_i)],

where f and R are the density and reliability functions of the GPM distribution, respectively. The log-likelihood function reduces to

l(Θ; t, d) = ln α Σ_{i=1}^n d_i + nα ln β − n ln(γ+1) − Σ_{i=1}^n (α+1+d_i) ln(β+t_i) + Σ_{i=1}^n d_i ln[γ(β+t_i) + (α+1)t_i] + Σ_{i=1}^n (1−d_i) ln[γ(β+t_i) + β + (α+1)t_i].
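A hedged R sketch of the censored log-likelihood, reusing dgpm() and pgpm() from Section II (negloglik_cens is our name):

```r
# Negative log-likelihood for right-censored data (t_i, d_i):
# the density enters for failures, the reliability for censored times.
negloglik_cens <- function(par, t, d) {
  if (any(par <= 0)) return(Inf)
  f <- dgpm(t, par[1], par[2], par[3])
  R <- 1 - pgpm(t, par[1], par[2], par[3])
  -sum(d * log(f) + (1 - d) * log(R))
}
# e.g. optim(start, negloglik_cens, t = t, d = d, method = "Nelder-Mead")
```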

D. EM ALGORITHM FOR RIGHT-CENSORED DATA
In the presence of the latent variable Z_i, defined in the previous sections, the likelihood function for the right-censored data is

L(Θ; t, d, z) = ∏_{i=1}^n ∏_{j=1}^2 [P(Z_i = j) (f_j(t_i))^{d_i} (R_j(t_i))^{1−d_i}]^{I(z_i = j)},

where P(Z_i = j) and f_j are defined by (17) and (19), respectively, and R_j denotes the reliability function of f_j. The log-likelihood function ln L(Θ; t, d, z) is simplified in Appendix A, (35).

The E step

Suppose that the estimate of the parameters at iteration t, Θ_t, is known; then, the conditional distribution of Z_i can be computed by the well-known Bayes theorem (see Appendix A, (36)).
Specifically, taking j = 1,

p_{i1,t} = γ_t (f₁(t_i))^{d_i}(R₁(t_i))^{1−d_i} / [γ_t (f₁(t_i))^{d_i}(R₁(t_i))^{1−d_i} + (f₂(t_i))^{d_i}(R₂(t_i))^{1−d_i}],

with all component functions evaluated at Θ_t, and p_{i2,t} = 1 − p_{i1,t}. Then, using (35), the expectation function at iteration t is Q(Θ|Θ_t) = E_{Z|t,d,Θ_t}[l(Θ; t, d, Z)]. In Appendix A, (37) and (38) show that Q(Θ|Θ_t) is a sum of two expressions, one of which solely depends on α and β, and the other depends on γ. In other words,

Q(Θ|Θ_t) = Q₁(α, β|Θ_t) + Q₂(γ|Θ_t), (27)

where

Q₁(α, β|Θ_t) = Σ_{i=1}^n Σ_{j=1}^2 p_{ij,t} [d_i ln f_j(t_i; α, β) + (1−d_i) ln R_j(t_i; α, β)]

and

Q₂(γ|Θ_t) = −n ln(γ+1) + ln γ Σ_{i=1}^n p_{i1,t}.

The M step

Similar to the uncensored case, the estimate of the parameters at iteration t+1 is computed by maximizing Q(Θ|Θ_t) in terms of Θ. By (27), Q(Θ|Θ_t) is a sum of two separate expressions, so it follows that

(α_{t+1}, β_{t+1}) = argmax_{α,β} Q₁(α, β|Θ_t) (30)

and

γ_{t+1} = argmax_γ Q₂(γ|Θ_t). (31)

The maximization problem (30) cannot be analytically solved and should be numerically solved. However, it is easy to check that the solution of (31) is

γ_{t+1} = Σ_{i=1}^n p_{i1,t} / Σ_{i=1}^n p_{i2,t}.
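Only the E step changes relative to the complete-data sketch; a minimal R fragment with our illustrative names, using the component densities and reliabilities stated above:

```r
# Component densities and reliabilities of GP(alpha, beta, 1) and
# GP(alpha, beta, 2), needed for the censored E step.
f1 <- function(x, a, b) a * b^a / (b + x)^(a + 1)
f2 <- function(x, a, b) a * (a + 1) * b^a * x / (b + x)^(a + 2)
R1 <- function(x, a, b) (b / (b + x))^a
R2 <- function(x, a, b) (b / (b + x))^a * (b + (a + 1) * x) / (b + x)

# E step: for a failure (d = 1) the density enters; for a censored
# time (d = 0) the component reliability enters instead.
p1_cens <- function(t, d, a, b, g) {
  w1 <- g * f1(t, a, b)^d * R1(t, a, b)^(1 - d)
  w2 <- f2(t, a, b)^d * R2(t, a, b)^(1 - d)
  w1 / (w1 + w2)
}
# The gamma update is unchanged: gamma_{t+1} = sum(p1) / sum(1 - p1);
# (alpha_{t+1}, beta_{t+1}) again maximize Q1 numerically.
```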

IV. SIMULATION STUDY
Because the GPM is a mixture model, we can apply the following steps to generate a random sample of size n from it (implemented in the R sketch below):

1. Generate one random instance of the binomial distribution with parameters n and γ/(γ+1). Let the generated instance be n₁.
2. Generate one random sample of size n₁ from GP(α, β, 1) and one random sample of size n₂ = n − n₁ from GP(α, β, 2). Combining these two samples provides one random sample from the GPM.
3. To generate a right-censored sample with censoring rate p, we take the random censorship variable C_i to follow the degenerate distribution with mean t*. Then, t* can be computed by solving R(t*) = p, i.e., t* = q(1 − p), where q() is the quantile function defined in (10).

The results of a simulation study to estimate the parameters of the GPM are summarized in Tables 1 and 2 for uncensored data and for right-censored data with censoring rate p = 0.2, respectively. In every run, r = 500 replicates of samples of sizes n = 80 and 100 were generated. Then, the parameters were estimated by the maximum likelihood method or the EM algorithm. The EM algorithm is an iterative method for finding maximum likelihood estimates and can be applied when the model depends on latent variables, e.g., in mixture or competing risks models. For the MLE, we use the built-in "optim" function of R with the default optimization method "Nelder-Mead". In both the MLE and EM approaches, for true parameters α, β and γ, the initial points were generated from the uniform distribution on the intervals (0, α), (0, β) and (0, γ), respectively. For real data, one should try several different initial values for the parameters and then compare the results by the AIC criterion and the Kolmogorov-Smirnov (K-S) statistic. Unfortunately, checking the condition for terminating the EM process in every EM iteration makes the simulation runs very slow and time consuming. Therefore, we ran the EM algorithm many times to find a fixed number of iterations that is sufficiently large for the EM iterations to converge; we found that 20 iterations are sufficient.
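The generation steps translate directly into R. The sketch below uses a beta representation as a shortcut of our own for sampling GP(α, β, θ): if B ∼ Beta(θ, α), then βB/(1−B) has pdf (1). The names rgp, rgpm and censor are illustrative:

```r
# Sample from GP(alpha, beta, theta) via the beta representation.
rgp <- function(n, alpha, beta, theta) {
  B <- rbeta(n, theta, alpha)
  beta * B / (1 - B)
}

# Steps 1-2: mixture sampling for GPM(alpha, beta, gam).
rgpm <- function(n, alpha, beta, gam) {
  n1 <- rbinom(1, n, gam / (gam + 1))
  sample(c(rgp(n1, alpha, beta, 1), rgp(n - n1, alpha, beta, 2)))
}

# Step 3: right-censoring at the fixed threshold
# tstar = qgpm(1 - p, alpha, beta, gam) for censoring rate p.
censor <- function(x, tstar) {
  list(t = pmin(x, tstar), d = as.numeric(x <= tstar))
}
```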
For each parameter, the bias (B), absolute bias (AB) and mean squared error (MSE) were computed. Every cell of these tables shows the B, AB and MSE for α, β and γ from top to bottom. The results related to the maximum likelihood method and the EM algorithm are presented on the left and right sides, respectively. The results show smaller values of AB and MSE for the EM algorithm, which indicates that the EM algorithm outperforms the maximum likelihood method.

V. APPLICATIONS
Table 3 shows the numbers of successive failures of the air conditioning system of 13 Boeing 720 jet airplanes, analyzed by Tahir et al. [25] and many others. We fit the GPM to this data set by the maximum likelihood method and the EM algorithm. Additionally, the Gumbel-Lomax (GuLx) and Weibull-Pareto (WP) distributions, which exhibit decreasing and upside-down bathtub-shaped FR functions, were fitted to this data set. Table 5 shows the estimates of the parameters, the K-S statistic

D = sup_x |F_n(x) − F̂(x)|,

where F_n is the empirical distribution function and F̂ is the fitted cdf, and its corresponding p-value. The Akaike information criterion (AIC) is also included in this table. Large values of the K-S statistic and/or the AIC indicate a poor fit, so we can compare the fitted models using these measures. The K-S statistic calculated for the GPM (in particular, as estimated by the EM algorithm) is smaller than those of the other models; based on this criterion, the GPM performs better than the others. However, based on the AIC, GuLx shows a better fit. The empirical and fitted CDFs are shown in the left panel of Figure 3, visually confirming that the estimated GPM distribution is very close to the empirical distribution function. Table 4 shows a data set of the remission times (in months) of a random sample of 128 bladder cancer patients, reported by Tahir et al. [26]. The results of fitting the GPM, GuLx and WP to this data set are also shown in Table 5.
The K-S statistic and the AIC for the GPM are smaller than those for GuLx and WP, indicating that the GPM is a good candidate to describe this data set. The right panel of Figure 3, which shows the empirical and fitted CDFs, confirms a very good fit.
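For reference, both comparison measures can be computed in a few lines of R, reusing pgpm() and negloglik() from the earlier sketches (gof_gpm is our name; note that ks.test accepts a cdf function directly):

```r
# Goodness-of-fit summary for a fitted GPM: K-S test against the
# fitted cdf and AIC = 2k - 2 * max log-likelihood with k = 3.
gof_gpm <- function(x, par) {
  ks  <- ks.test(x, pgpm, alpha = par[1], beta = par[2], gam = par[3])
  aic <- 2 * 3 + 2 * negloglik(par, x)
  c(KS = unname(ks$statistic), p.value = ks$p.value, AIC = aic)
}
```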

VI. CONCLUSION
A new generalized Pareto mixture model was introduced, and some of its statistical and reliability properties were investigated. The proposed model can be applied in a variety of real-world situations, such as reliability and survival data in which an equilibrium is found in the distribution of "small" to "large" values. The parameters of the proposed model were estimated using the MLE and the EM algorithm. Simulation results confirm that the EM algorithm performs better than the MLE.

ACKNOWLEDGMENT
The author thanks four anonymous reviewers for their useful comments that led to this improved version of the paper.

APPENDIX A SOME DETAILED INFORMATION
The log-likelihood function related to the EM algorithm for complete data can be simplified as follows:

l(Θ; x, z) = Σ_{i=1}^n I(z_i=1)[ln γ − ln(γ+1) + ln f₁(x_i)] + Σ_{i=1}^n I(z_i=2)[−ln(γ+1) + ln f₂(x_i)]. (32)

The conditional distribution of Z_i for complete data simplifies as

P(Z_i = 1 | x_i, Θ_t) = [γ_t/(γ_t+1)] f₁(x_i; Θ_t) / {[γ_t/(γ_t+1)] f₁(x_i; Θ_t) + [1/(γ_t+1)] f₂(x_i; Θ_t)} = γ_t f₁(x_i; Θ_t) / [γ_t f₁(x_i; Θ_t) + f₂(x_i; Θ_t)]. (33)

The expectation function of the EM algorithm for complete data can be written as follows:

Q(Θ|Θ_t) = Σ_{i=1}^n [p_{i1,t} ln f₁(x_i; α, β) + p_{i2,t} ln f₂(x_i; α, β)] + [ln γ Σ_{i=1}^n p_{i1,t} − n ln(γ+1)]. (34)

For right-censored data, the corresponding expressions are

l(Θ; t, d, z) = Σ_{i=1}^n Σ_{j=1}^2 I(z_i=j)[ln P(Z_i=j) + d_i ln f_j(t_i) + (1−d_i) ln R_j(t_i)], (35)

P(Z_i = 1 | t_i, d_i, Θ_t) = γ_t (f₁(t_i))^{d_i}(R₁(t_i))^{1−d_i} / [γ_t (f₁(t_i))^{d_i}(R₁(t_i))^{1−d_i} + (f₂(t_i))^{d_i}(R₂(t_i))^{1−d_i}], (36)

Q₁(α, β|Θ_t) = Σ_{i=1}^n Σ_{j=1}^2 p_{ij,t}[d_i ln f_j(t_i; α, β) + (1−d_i) ln R_j(t_i; α, β)], (37)

Q₂(γ|Θ_t) = −n ln(γ+1) + ln γ Σ_{i=1}^n p_{i1,t}. (38)