Parameter estimations for mixed generalized exponential distribution based on progressive type-I interval censoring

Abstract: This paper considers parameter estimation for a mixed generalized exponential distribution based on progressively type-I interval-censored data. Maximum likelihood estimation is used, but the estimators have no closed form, so the EM algorithm is applied to compute the maximum likelihood estimates. The performance of the estimates is assessed by a simulation study, and a real data-set is analyzed to illustrate the estimation method developed here.


PUBLIC INTEREST STATEMENT
Progressively type-I interval-censored data are a popular and widely studied data type in survival analysis and reliability analysis. This work models such data with a more flexible mixed generalized exponential distribution, which includes the exponential distribution and Weibull distribution. The results of this paper can be applied to lifetime analysis in clinical trials, engineering and other fields.

Introduction
In life testing, reliability studies and survival analysis, it is extremely common for items to be lost or removed from an experiment before failure. The most popular censoring schemes are type-I and type-II censoring. Under type-I censoring, the test continues until a pre-specified time; under type-II censoring, it continues until a pre-specified number of failures is observed. Progressive type-II censoring extends the traditional type-II censoring scheme and has been studied by, among others, Balakrishnan and Aggarwala (2000) and Balakrishnan, Kannan, Lin, and Wu (2004). Sometimes it is impossible to observe the whole life test continuously because of time and cost constraints. Then only the number of failures in an interval is observed, rather than the exact failure times; this is called interval censoring. Combining interval censoring with progressive censoring, Aggarwala (2001) introduced the progressive type-I interval censoring scheme and developed statistical inference for the exponential distribution based on progressively type-I interval-censored data. Ng and Wang (2009) discussed statistical inference for the Weibull distribution under progressive type-I interval censoring and compared different estimation methods for its parameters via simulation. Lin, Wu, and Balakrishnan (2009) considered the determination of optimal life testing plans with progressively type-I interval-censored data from the log-normal distribution. Chen and Lio (2010) considered parameter estimation for the generalized exponential distribution under progressive type-I interval censoring, obtaining estimates of the unknown parameters by the EM algorithm, the midpoint approximation method and the method of moments. Lio, Chen, and Tsai (2011) presented parameter estimation for the generalized Rayleigh distribution under the same censoring scheme. Peng and Yan (2011) considered the MLEs and moment estimators of the parameters of the gamma distribution based on progressive type-I interval censoring and compared their biases and mean square errors through simulation. Pradhan and Gijo (2013) considered inference for the unknown parameters of the log-normal distribution based on progressively type-I interval-censored data through both frequentist and Bayesian approaches.
In most simple life tests, the experimental data come from a single population and have only one type of failure. In practice, a mixed distribution can provide a flexible candidate for describing the time to failure. Especially in the analysis of biased life data, the failure hazard is quite high at the beginning and the failure rate decreases or stays constant as age increases. In this case, the population is not homogeneous but is made up of sub-populations. Mixed distributions have been studied by many authors, such as McClean (1986), Soliman (2006), Tian, He, and Chen (2008), Tian, Tian, and Chen (2012), Tian, Tian, and Zhu (2014) and Tian, Zhu, and Tian (2013). Nevertheless, to the best of our knowledge, no published paper addresses maximum likelihood estimation of the mixed generalized exponential distribution (MGED) under progressive type-I interval censoring. In this work, we consider the estimation of the parameters of the MGED based on progressively type-I interval-censored data. We give the likelihood equations of the unknown parameters and observe that the MLEs have no closed form. We therefore use the EM algorithm to compute the MLEs and assess the performance of the estimates by simulation.
The rest of the paper is organized as follows. The model and data description are provided in Section 2. The maximum likelihood estimators of the unknown parameters are discussed in Section 3. The EM algorithm procedure is given in Section 4. The performance of the estimators is investigated via simulation in Section 5. A data-set is analyzed in Section 6. Section 7 includes some concluding remarks.

The description of the model and data
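Consider the MGED with $m$ components, whose probability density function, cumulative distribution function and hazard function are, respectively,

$$f(t) = \sum_{k=1}^{m} p_k \alpha_k \lambda_k e^{-\lambda_k t}\left(1 - e^{-\lambda_k t}\right)^{\alpha_k - 1}, \quad t > 0, \tag{2.1}$$

$$F(t) = \sum_{k=1}^{m} p_k \left(1 - e^{-\lambda_k t}\right)^{\alpha_k}, \quad t > 0, \tag{2.2}$$

$$h(t) = \frac{f(t)}{1 - F(t)}, \quad t > 0,$$

where $p_k$ are the mixing proportions with $0 < p_k < 1$ and $\sum_{k=1}^{m} p_k = 1$, $\alpha_k > 0$ are the shape parameters and $\lambda_k > 0$ are the scale parameters, $k = 1, 2, \ldots, m$. The number of free parameters is $3m - 1$.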
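The three functions of the model can be evaluated directly. The following is a minimal sketch, assuming the usual generalized exponential component with distribution function $(1 - e^{-\lambda t})^{\alpha}$; the function names and the symbol `lam` for the scale parameter are illustrative choices, not notation taken from the paper.

```python
import math

def ged_cdf(t, alpha, lam):
    """CDF of one generalized exponential component (assumed form)."""
    if t <= 0:
        return 0.0
    return (1.0 - math.exp(-lam * t)) ** alpha

def ged_pdf(t, alpha, lam):
    """Density of one generalized exponential component (assumed form)."""
    if t <= 0:
        return 0.0
    base = 1.0 - math.exp(-lam * t)
    return alpha * lam * math.exp(-lam * t) * base ** (alpha - 1.0)

def mged_cdf(t, p, alpha, lam):
    """Mixture CDF: weighted sum of the component CDFs."""
    return sum(pk * ged_cdf(t, ak, lk) for pk, ak, lk in zip(p, alpha, lam))

def mged_pdf(t, p, alpha, lam):
    """Mixture density: weighted sum of the component densities."""
    return sum(pk * ged_pdf(t, ak, lk) for pk, ak, lk in zip(p, alpha, lam))

def mged_hazard(t, p, alpha, lam):
    """Hazard function f(t) / (1 - F(t))."""
    return mged_pdf(t, p, alpha, lam) / (1.0 - mged_cdf(t, p, alpha, lam))
```

With a single component and shape parameter 1 the model reduces to the exponential distribution, so `mged_hazard(2.0, [1.0], [1.0], [0.5])` returns the constant hazard 0.5 up to rounding.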
The progressive type-I interval censoring scheme is as follows. Let $n$ identical items be placed on test at time $t_0 = 0$, and pre-specify $N$ inspection times $t_1 < t_2 < \cdots < t_N$, where $t_N$ is the scheduled termination time of the experiment. Let $X_i$ be the number of failures within the interval $(t_{i-1}, t_i]$, $Y_i$ the number of items surviving at time $t_i$, and $R_i$ the number of items randomly removed from the surviving items at time $t_i$. The removals $R_1, R_2, \ldots, R_N$ can be pre-specified as percentages of the remaining surviving items: for pre-specified percentages $p_1, \ldots, p_{N-1}$ and $p_N = 1$, $R_i = \lfloor p_i Y_i \rfloor$, $i = 1, 2, \ldots, N$. Alternatively, $R_1, R_2, \ldots, R_N$ can be pre-specified positive integers. In this case, there must be at least $R_i$ units available for removal at time $t_i$. The proportions of surviving units removed at the inspection times are given in the simulation study. The progressively type-I interval-censored data are denoted by $\{X_i, R_i, t_i\}$, $i = 1, 2, \ldots, N$. If $R_i = 0$ for $i = 1, 2, \ldots, N - 1$, progressive type-I interval censoring reduces to traditional interval censoring, with data $X_1, X_2, \ldots, X_N, X_{N+1} = R_N$.
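The bookkeeping of the scheme can be illustrated with a small sketch. It assumes the percentage rule $R_i = \lfloor p_i Y_i \rfloor$ with $Y_i$ the number of survivors at $t_i$; the function name `removal_counts` and the example numbers are hypothetical.

```python
import math

def removal_counts(n, failures, percentages):
    """Return (survivors, removals) at each inspection time.

    Assumes R_i = floor(p_i * Y_i), where Y_i is the number of items
    still alive at t_i after the X_i failures in (t_{i-1}, t_i].
    """
    at_risk = n
    survivors, removals = [], []
    for x_i, p_i in zip(failures, percentages):
        y_i = at_risk - x_i             # items alive at t_i
        r_i = math.floor(p_i * y_i)     # items withdrawn at t_i
        survivors.append(y_i)
        removals.append(r_i)
        at_risk = y_i - r_i             # items entering the next interval
    return survivors, removals

# Hypothetical example: 20 items, three inspections, final percentage 1.
survivors, removals = removal_counts(20, [3, 2, 4], [0.25, 0.5, 1.0])
```

Because the last percentage is 1, every surviving item is withdrawn at $t_N$, so the failures and removals together account for all $n$ items.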

Maximum likelihood estimation
Let $\{X_i, R_i, t_i\}$, $i = 1, 2, \ldots, N$, be progressively type-I interval-censored data of sample size $n$ from a lifetime distribution with distribution function $F(t, \theta)$, where $\theta$ is the parameter vector. Then the likelihood function is (Aggarwala, 2001)

$$L(\theta) \propto \prod_{i=1}^{N} \left[F(t_i, \theta) - F(t_{i-1}, \theta)\right]^{X_i} \left[1 - F(t_i, \theta)\right]^{R_i}, \tag{3.1}$$

where $t_0 = 0$. The MLEs of the parameters are obtained by maximizing the likelihood function (3.1).
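For the MGED with distribution function (2.2), the likelihood function (3.1) becomes

$$L \propto \prod_{i=1}^{N} \left[\sum_{k=1}^{m} p_k \left\{\left(1 - e^{-\lambda_k t_i}\right)^{\alpha_k} - \left(1 - e^{-\lambda_k t_{i-1}}\right)^{\alpha_k}\right\}\right]^{X_i} \left[1 - \sum_{k=1}^{m} p_k \left(1 - e^{-\lambda_k t_i}\right)^{\alpha_k}\right]^{R_i},$$

and the log-likelihood function is, up to an additive constant,

$$\ln L = \sum_{i=1}^{N} X_i \ln\left[\sum_{k=1}^{m} p_k \left\{\left(1 - e^{-\lambda_k t_i}\right)^{\alpha_k} - \left(1 - e^{-\lambda_k t_{i-1}}\right)^{\alpha_k}\right\}\right] + \sum_{i=1}^{N} R_i \ln\left[1 - \sum_{k=1}^{m} p_k \left(1 - e^{-\lambda_k t_i}\right)^{\alpha_k}\right].$$

Setting the derivatives of the log-likelihood function with respect to $p_k$, $\alpha_k$, $\lambda_k$ to zero gives the score equations

$$\frac{\partial \ln L}{\partial p_k} = 0, \quad \frac{\partial \ln L}{\partial \alpha_k} = 0, \quad \frac{\partial \ln L}{\partial \lambda_k} = 0.$$

These equations are quite complex and have no closed-form solution, so the EM algorithm introduced in the next section is used to find the MLEs of $p_k$, $\alpha_k$, $\lambda_k$, $k = 1, 2, \ldots, m$.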
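The log-likelihood is straightforward to evaluate numerically even though it cannot be maximized in closed form. The sketch below assumes the likelihood of Aggarwala (2001), a product of interval probabilities raised to $X_i$ and survival probabilities raised to $R_i$, and the component distribution function $(1 - e^{-\lambda t})^{\alpha}$; the function names are illustrative.

```python
import math

def mged_cdf(t, p, alpha, lam):
    """Mixture CDF with components (1 - exp(-lam*t))**alpha (assumed form)."""
    if t <= 0:
        return 0.0
    return sum(pk * (1.0 - math.exp(-lk * t)) ** ak
               for pk, ak, lk in zip(p, alpha, lam))

def log_likelihood(times, x, r, p, alpha, lam):
    """Log-likelihood of progressively type-I interval-censored data.

    times: inspection times t_1 < ... < t_N
    x, r:  failure counts X_i and removal counts R_i
    The additive constant of the likelihood is omitted.
    """
    total = 0.0
    prev_cdf = 0.0                      # F(t_0) with t_0 = 0
    for t_i, x_i, r_i in zip(times, x, r):
        cur_cdf = mged_cdf(t_i, p, alpha, lam)
        if x_i > 0:
            total += x_i * math.log(cur_cdf - prev_cdf)
        if r_i > 0:
            total += r_i * math.log(1.0 - cur_cdf)
        prev_cdf = cur_cdf
    return total
```

A function of this kind can serve as a check on any iterative fitting procedure, since the log-likelihood should not decrease from one EM iteration to the next.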

EM algorithm
The expectation-maximization (EM) algorithm (Dempster, Laird, & Rubin, 1977) is a useful tool for estimating the parameters of a distribution from incomplete data. Some of the notation here is similar to that in Tian et al. (2014). To fit the MGED model under progressive type-I interval censoring, the EM algorithm is applied to estimate the unknown parameters. Suppose $\tau_1, \tau_2, \ldots, \tau_n$ are $n$ independent and identically distributed (iid) lifetimes from the MGED model, and denote

$$I_{ik} = \begin{cases} 1, & \text{if } \tau_i \text{ comes from the } k\text{-th component}, \\ 0, & \text{otherwise}, \end{cases}$$

where $k = 1, 2, \ldots, m$, $i = 1, 2, \ldots, n$.
Here, the sample $\tau_i$, $i = 1, 2, \ldots, n$, is divided into two parts, $\tau_{ij}$ and $\tau^*_{ij'}$. Let $\tau_{ij}$, $j = 1, 2, \ldots, X_i$, be the lifetimes of the items failing within the interval $(t_{i-1}, t_i]$, and let $\tau^*_{ij'}$, $j' = 1, 2, \ldots, R_i$, be the lifetimes of the items withdrawn at $t_i$, for $i = 1, 2, \ldots, N$. The total number of the $\tau_{ij}$ and $\tau^*_{ij'}$ is $\sum_{i=1}^{N} (X_i + R_i) = n$. The corresponding indicator vectors $I_{ij} = (I_{ij1}, I_{ij2}, \ldots, I_{ijm})$ and $I_{ij'} = (I_{ij'1}, I_{ij'2}, \ldots, I_{ij'm})$ follow the multinomial distribution. However, we do not know which component each lifetime comes from; that is, $I$ cannot be observed and is therefore regarded as missing data in the EM algorithm. In what follows, $I^{(1)}$ and $I^{(2)}$ denote the indicator vectors of the data $\tau_{ij}$ and $\tau^*_{ij'}$, respectively.
For a lifetime $\tau_{ij}$ within the interval $(t_{i-1}, t_i]$, the joint density of $\tau_{ij}$ and $I^{(1)}$ is $\prod_{k=1}^{m} \left[p_k f_k(\tau_{ij})\right]^{I_{ijk}}$, where $f_k$ is the density of the $k$-th component; the joint density of $\tau^*_{ij'}$ and $I^{(2)}$ has the same form. As in Section 2, denote the progressively type-I interval-censored sample as $\{X_i, R_i, t_i\}$, $i = 1, 2, \ldots, N$. In a progressive type-I interval censoring experiment, we can only observe the numbers of failures $X_i$ within the intervals $(t_{i-1}, t_i]$ and the numbers $R_i$ of items withdrawn at the censoring times $t_i$, $i = 1, 2, \ldots, N$. The observed data can therefore be denoted as $Y = (t_1, t_2, \ldots, t_N, X_1, X_2, \ldots, X_N, R_1, R_2, \ldots, R_N)$. The true failure times within the interval $(t_{i-1}, t_i]$, denoted $\tau_{ij}$, $i = 1, 2, \ldots, N$, $j = 1, 2, \ldots, X_i$, and the true failure times of the units censored at $t_i$, denoted $\tau^*_{ij'}$, $i = 1, 2, \ldots, N$, $j' = 1, 2, \ldots, R_i$, are regarded as missing data. All the missing data are denoted by $(\tau, I)$, and the complete data are denoted by $W = (\tau, I, Y)$.
The following procedure will give the MLEs of all unknown parameters via the EM algorithm consisting of two steps: E-step and M-step.
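First, the likelihood function of the MGED model under the complete data $W$ is given by

$$L_c(P, \alpha, \lambda; W) = \prod_{i=1}^{N} \left\{\prod_{j=1}^{X_i} \prod_{k=1}^{m} \left[p_k f_k(\tau_{ij})\right]^{I_{ijk}} \prod_{j'=1}^{R_i} \prod_{k=1}^{m} \left[p_k f_k(\tau^*_{ij'})\right]^{I_{ij'k}}\right\},$$

where $f_k(t) = \alpha_k \lambda_k e^{-\lambda_k t}(1 - e^{-\lambda_k t})^{\alpha_k - 1}$ is the density of the $k$-th component. The log-likelihood function of the complete data is

$$\ln L_c = \sum_{i=1}^{N} \sum_{k=1}^{m} \left\{\sum_{j=1}^{X_i} I_{ijk}\left[\ln p_k + \ln f_k(\tau_{ij})\right] + \sum_{j'=1}^{R_i} I_{ij'k}\left[\ln p_k + \ln f_k(\tau^*_{ij'})\right]\right\}.$$

Given initial values $P^{(0)}, \alpha^{(0)}, \lambda^{(0)}$ of the unknown parameter vectors, the parameter estimates of model (2.1) are obtained by the EM algorithm via the following two steps. The estimation performance depends strongly on the choice of the initial values, and different initial values may lead to different convergence rates. In the simulation, we suggest trying several sets of initial values and comparing the resulting estimates.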
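E-step: given the estimates $P^{(h-1)}, \alpha^{(h-1)}, \lambda^{(h-1)}$ from the previous iteration, compute the Q function, that is, the conditional expectation of the complete-data log-likelihood given the observed data $Y$, where each failure time is restricted to its interval, $x \in (t_{i-1}, t_i]$, and each withdrawn lifetime exceeds its censoring time, $x > t_i$.
M-step: maximize the Q function, approximated numerically in the E-step, with respect to the unknown parameters $P, \alpha, \lambda$ to obtain the updated estimates, denoted $P^{(h)}, \alpha^{(h)}, \lambda^{(h)}$.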
To obtain the estimates of the unknown parameters more conveniently in the M-step, the conditional expectations appearing in the Q function are first written in explicit form and substituted into the Q function. Taking the derivatives of the Q function with respect to the unknown parameters and setting them to zero gives Equations (4.1)-(4.3). Solving Equations (4.1) and (4.2) gives the updates (4.4) and (4.5), and solving Equation (4.3) gives (4.6). From Equation (4.6), the $h$-th iteration values of the parameters $p_1, \ldots, p_{m-1}$ are the solution of a system of linear equations, denoted $AP = b$, where $A$ is an invertible matrix. Therefore, the unique solution for the parameter vector $P$ at the $h$-th iteration of the M-step is $P^{(h)} = A^{-1} b$, which is Equation (4.7). From Equations (4.4), (4.5) and (4.7), we update $P^{(h)}, \hat{\alpha}^{(h)}, \hat{\lambda}^{(h)}$ by repeating the E-step and M-step until the total change in all estimated parameters falls below a pre-specified tolerance.
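As an illustration of one E-step and M-step pair, the sketch below computes the posterior component probabilities of the grouped observations and the resulting update of the mixing proportions, holding the shape and scale parameters fixed. It uses the textbook EM update for mixture weights, not necessarily the linear system $AP = b$ of the paper, and it assumes the component distribution function $(1 - e^{-\lambda t})^{\alpha}$; all function names are hypothetical.

```python
import math

def ged_cdf(t, alpha, lam):
    """Component CDF (1 - exp(-lam*t))**alpha (assumed form)."""
    return (1.0 - math.exp(-lam * t)) ** alpha if t > 0 else 0.0

def responsibilities(times, p, alpha, lam):
    """Posterior component probabilities given the current parameters.

    w[i][k]: probability that a failure in (t_{i-1}, t_i] is from component k
    v[i][k]: probability that an item withdrawn at t_i is from component k
    """
    w, v = [], []
    prev_t = 0.0
    for t_i in times:
        in_interval = [pk * (ged_cdf(t_i, ak, lk) - ged_cdf(prev_t, ak, lk))
                       for pk, ak, lk in zip(p, alpha, lam)]
        beyond = [pk * (1.0 - ged_cdf(t_i, ak, lk))
                  for pk, ak, lk in zip(p, alpha, lam)]
        w.append([u / sum(in_interval) for u in in_interval])
        v.append([u / sum(beyond) for u in beyond])
        prev_t = t_i
    return w, v

def update_proportions(times, x, r, p, alpha, lam):
    """Standard EM update: expected share of the n items in each component."""
    w, v = responsibilities(times, p, alpha, lam)
    n = sum(x) + sum(r)
    return [sum(x_i * w_i[k] + r_i * v_i[k]
                for x_i, r_i, w_i, v_i in zip(x, r, w, v)) / n
            for k in range(len(p))]
```

The updated proportions always sum to one, because each item's posterior probabilities sum to one and there are $n$ items in total. When all components share the same shape and scale, the data carry no information about the mixing proportions and the update leaves them unchanged.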

Simulation study
The purpose of the simulation study is to investigate the performance of the estimates of the MGED parameters in modeling progressively type-I interval-censored lifetime data. We use algorithm steps similar to those proposed in Aggarwala (2001). The simulation is conducted in the R language. To be self-contained, the algorithm is reproduced as follows. First, generate the numbers $X_i$ of failed items in each subinterval $(t_{i-1}, t_i]$, $i = 1, 2, \ldots, N$, from a sample of size $n$ put on life test at time $t_0 = 0$. Progressively type-I interval-censored data $(X_i, R_i, t_i)$, $i = 1, 2, \ldots, N$, from the MGED with distribution function (2.2) can be generated using the fact that

$$X_i \mid X_{i-1}, \ldots, X_0, R_{i-1}, \ldots, R_0 \sim \text{Binomial}\left(n - \sum_{j=0}^{i-1} (X_j + R_j), \; \frac{F(t_i) - F(t_{i-1})}{1 - F(t_{i-1})}\right)$$

and

$$R_i = \text{floor}\left(p_i \left(n - \sum_{j=0}^{i-1} (X_j + R_j) - X_i\right)\right),$$

with $X_0 = 0$ and $R_0 = 0$, for $i = 1, 2, \ldots, N$, where floor() returns the largest integer not greater than its argument in the R language and $0 = t_0 < t_1 < \cdots < t_N < \infty$ are the pre-scheduled times.
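The generation steps can be sketched as follows. The paper's simulation is in R; this Python version is only an illustration under stated assumptions: the number of failures in each interval is binomial with the conditional failure probability $[F(t_i) - F(t_{i-1})]/[1 - F(t_{i-1})]$, removals follow the floor rule, and the component distribution function is $(1 - e^{-\lambda t})^{\alpha}$. The function names and parameter values are hypothetical.

```python
import math
import random

def mged_cdf(t, p, alpha, lam):
    """Mixture CDF with components (1 - exp(-lam*t))**alpha (assumed form)."""
    if t <= 0:
        return 0.0
    return sum(pk * (1.0 - math.exp(-lk * t)) ** ak
               for pk, ak, lk in zip(p, alpha, lam))

def simulate_censored_data(n, times, removal_props, p, alpha, lam, rng):
    """Generate failure counts X_i and removal counts R_i."""
    x, r = [], []
    at_risk = n                  # items entering the current interval
    prev_cdf = 0.0               # F(t_0) with t_0 = 0
    for t_i, q_i in zip(times, removal_props):
        cur_cdf = mged_cdf(t_i, p, alpha, lam)
        # Conditional probability of failing in (t_{i-1}, t_i]
        prob = (cur_cdf - prev_cdf) / (1.0 - prev_cdf)
        # Binomial(at_risk, prob) drawn as a sum of Bernoulli trials
        x_i = sum(1 for _ in range(at_risk) if rng.random() < prob)
        r_i = math.floor(q_i * (at_risk - x_i))
        x.append(x_i)
        r.append(r_i)
        at_risk -= x_i + r_i
        prev_cdf = cur_cdf
    return x, r

rng = random.Random(2024)
x, r = simulate_censored_data(
    50, [1.0, 2.0, 3.0, 4.0, 5.0], [0.25, 0.25, 0.5, 0.5, 1.0],
    [0.4, 0.6], [0.9, 1.2], [0.1, 0.5], rng)
```

Since the last removal proportion is 1, all remaining items are withdrawn at $t_N$ and the generated counts sum to the sample size $n$.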
In this paper, we consider the following progressive interval censoring schemes: (5.1) ... the number of the items withdrawn from the experiment is increasing. The simulation results are mainly determined by the sample size $n$ and the censoring proportion. Across the censoring schemes, the smaller the censoring proportion used in the life test, the better the estimation results. Therefore, scheme 4 gives the most precise estimates. Further, for fixed $N$, the biases and MSEs decrease on the whole as $n$ increases.

Data analysis
To illustrate the effectiveness of the model and algorithm, we analyze a real data-set below. The data-set describes the survival times after surgery of a group of 374 patients who underwent operations in connection with a type of malignant disease (Berkson & Gage, 1950; Lawless, 2003). The data are given in Table 5.
According to the data above, the values of $d_j$ in the first five intervals differ markedly from those in the later intervals, which suggests that the data may not come from a homogeneous population. We therefore use the MGED (2.1) with two components to analyze this data-set. Using the method discussed above, the parameter estimates are $\hat{p}_1 = 0.4213$, $\hat{\alpha}_1 = 0.9216$, $\hat{\lambda}_1 = 0.0886$, $\hat{\alpha}_2 = 0.9118$, $\hat{\lambda}_2 = 0.4017$. The survival function and the hazard rate function of the model (2.1) are as follows:

$$S(t) = 1 - \sum_{k=1}^{2} p_k \left(1 - e^{-\lambda_k t}\right)^{\alpha_k}, \tag{6.1}$$

$$h(t) = \frac{\sum_{k=1}^{2} p_k \alpha_k \lambda_k e^{-\lambda_k t}\left(1 - e^{-\lambda_k t}\right)^{\alpha_k - 1}}{S(t)}. \tag{6.2}$$

Substituting the estimates into (6.1) and (6.2) gives the fitted survival function and hazard rate function of the real data, shown in Figure 1. In Figure 1, the hazard rate function is monotone decreasing. It changes greatly in the first few years and then levels off after eight years. Therefore, for this disease, the risk to patients is higher in the short period after the operation. To guard against recurrence during the recovery stage after the operation, we suggest regular check-ups.
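The fitted curves can be reproduced numerically from the reported estimates. This is a sketch under two assumptions that should be checked against the original paper: the component distribution function is $(1 - e^{-\lambda t})^{\alpha}$, and the five reported values are listed in the order mixing proportion, shape, scale for each component.

```python
import math

# Reported estimates; the pairing of values with shape (ALPHA) and
# scale (LAM) is an assumption based on the order p, shape, scale.
P = [0.4213, 1.0 - 0.4213]
ALPHA = [0.9216, 0.9118]
LAM = [0.0886, 0.4017]

def fitted_survival(t):
    """Fitted survival function 1 - F(t) of the two-component mixture."""
    return 1.0 - sum(p * (1.0 - math.exp(-l * t)) ** a
                     for p, a, l in zip(P, ALPHA, LAM))

def fitted_density(t):
    """Fitted mixture density."""
    return sum(p * a * l * math.exp(-l * t)
               * (1.0 - math.exp(-l * t)) ** (a - 1.0)
               for p, a, l in zip(P, ALPHA, LAM))

def fitted_hazard(t):
    """Fitted hazard rate f(t) / S(t)."""
    return fitted_density(t) / fitted_survival(t)

# Evaluate on a grid of years from 0.5 to 15 in steps of 0.5.
grid = [0.5 * i for i in range(1, 31)]
hazards = [fitted_hazard(t) for t in grid]
survivals = [fitted_survival(t) for t in grid]
```

Both shape estimates are below 1, so each component has a decreasing hazard, and a mixture of such components also has a decreasing hazard. The values in `hazards` therefore decrease along the grid, in line with the description of Figure 1.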
To check the validity of the model, we adopt the Kolmogorov-Smirnov goodness-of-fit test statistic for the fitted distribution $F(x)$. Define the maximum distance in the Kolmogorov-Smirnov goodness-of-fit test,

$$D_n(F) = \sup_x \left|F_n(x) - F(x \mid \hat{\theta})\right|, \tag{6.3}$$

as the distance between the empirical distribution $F_n(x)$ of the given complete data-set and the fitted distribution function $F(x \mid \hat{\theta})$, with $\hat{\theta}$ the MLE of the unknown parameter vector $\theta$. When progressively type-I interval-censored data are given, the empirical distribution in the formula for $D_n(F)$ is replaced by

$$\hat{F}_n(t_i) = 1 - \prod_{j=1}^{i} (1 - \hat{p}_j), \quad i = 1, \ldots, N. \tag{6.4}$$

Fitting the data-set in Table 5, we obtain the K-S distance 0.2406. So it is reasonable to say that the MGED provides a good fit for the given data-set in Table 5. Figure 2 compares the empirical distribution function with the fitted distribution function.
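The K-S distance for interval-censored data can be sketched as below. The source does not preserve the definition of the interval failure probabilities, so this sketch assumes the product-limit form in which $\hat{p}_j$ is the number of failures in the $j$-th interval divided by the number of items at risk entering it, and it evaluates the distance only at the inspection times. The function names are hypothetical.

```python
def empirical_cdf(n, x, r):
    """Product-limit estimate 1 - prod_{j<=i} (1 - p_hat_j) at each t_i.

    Assumes p_hat_j = X_j / (items at risk at the start of interval j).
    """
    at_risk = n
    surv = 1.0
    values = []
    for x_j, r_j in zip(x, r):
        p_hat = x_j / at_risk if at_risk > 0 else 0.0
        surv *= 1.0 - p_hat
        values.append(1.0 - surv)
        at_risk -= x_j + r_j        # failures and removals leave the test
    return values

def ks_distance(times, empirical, fitted_cdf):
    """Largest gap between empirical and fitted CDF at the inspection times."""
    return max(abs(e - fitted_cdf(t)) for t, e in zip(times, empirical))

# Hypothetical check: without removals the estimate equals the ordinary
# empirical distribution function (2/10 and 6/10 here).
emp = empirical_cdf(10, [2, 4], [0, 0])
dist = ks_distance([1.0, 2.0], emp, lambda t: 0.5)
```

When no items are withdrawn, the product-limit estimate coincides with the cumulative proportion of failures, which gives a simple way to validate an implementation before applying it to censored data.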