Simulation of generalized Gamma distribution with maximum likelihood estimation and expectation-maximization algorithm on right censored data type 1

The Generalized Gamma distribution is very suitable for modeling data with various forms of hazard (risk) functions, which makes the Generalized Gamma distribution useful in survival analysis. Survival analysis aims are to predict chances of survival, disease recurrence, death, and other events over a period of time. One characteristic of survival data is the possibility of sensors. Let X be the life span of the person being studied and the right censorship time of Cr, X is assumed to be independent with the probability density function f(x), the survival function S(x), and the hazard function h(x). A person's X life span will be known if X is less than or equal to Cr. If X is greater than Cr, the individual X survives or is censored right now. Statistical inference, especially parameter estimation is needed in analyzing empirical data. Obviously the estimation results obtained are expected to be a good estimator, namely to meet the nature of unbiased and minimum variance. This paper will discuss the results of the estimation of Generalized Gamma distribution parameters with type 1 right censored data through simulations using the Expectation Maximization method and the Maximum Likelihood Estimation method. The simulation is conducted by generating data with the sample size: 25, 50, 100, 200, 500, 1000, 1500 and 2000 as well as determining censored data of 10%, 20% and 30% by first setting the parameters used which are obtained from the data of patients with gastric cancer namely α = 1.0649, β = 1,072, θ = 59.766. Based on the results obtained from the simulations on the two estimation methods that the parameter estimation using the Maximum Likelihood Estimation method is better than the Expectation Maximization method because it provides a smaller bias and MSE value where the larger the sample size used, the estimated parameter value will get closer to the parameter in fact. In addition, the Expectation Maximization method can also be used as an alternative estimation of generalized gamma distribution parameters with type 1 right censored data because it has a bias value and MSE approaching the MLE method.

The Generalized Gamma distribution is very suitable for modeling data with various forms of hazard (risk) functions, which makes the Generalized Gamma distribution useful in survival analysis. Survival analysis aims are to predict chances of survival, disease recurrence, death, and other events over a period of time. One characteristic of survival data is the possibility of sensors. Let X be the life span of the person being studied and the right censorship time of Cr, X is assumed to be independent with the probability density function f(x), the survival function S(x), and the hazard function h(x). A person's X life span will be known if X is less than or equal to Cr. If X is greater than Cr, the individual X survives or is censored right now. Statistical inference, especially parameter estimation is needed in analyzing empirical data. Obviously the estimation results obtained are expected to be a good estimator, namely to meet the nature of unbiased and minimum variance. This paper will discuss the results of the estimation of Generalized Gamma distribution parameters with type 1 right censored data through simulations using the Expectation Maximization method and the Maximum Likelihood Estimation method. The simulation is conducted by generating data with the sample size: 25, 50, 100, 200, 500, 1000, 1500 and 2000 as well as determining censored data of 10%, 20% and 30% by first setting the parameters used which are obtained from the data of patients with gastric cancer namely α = 1.0649, β = 1,072, θ = 59.766. Based on the results obtained from the simulations on the two estimation methods that the parameter estimation using the Maximum Likelihood Estimation method is better than the Expectation Maximization method because it provides a smaller bias and MSE value where the larger the sample size used, the estimated parameter value will get closer to the parameter in fact. In addition, the Expectation Maximization method can also be used as an alternative estimation of generalized gamma distribution parameters with type 1 right censored data because it has a bias value and MSE approaching the MLE method. .

Introduction
Survival time is a data to measure the time to certain events or events, such as failure, death, relapse, or the development of certain diseases. One of the characteristics of survival data is the possibility of sensors. Suppose X is the life span of a person being studied and the right censorship time of Cr, X is assumed to be independent identical distribution, f(x), the survival function S(x), and the hazard function h(x). A person's X life span will be known if X is less than or equal to Cr. If X is greater than Cr, the individual X survives or is censored right (Klein & Moeschberger, 1997). The most popular generalized gamma distribution modeling is for analyzing skewed data. Generalized gamma was introduced by Stacy (1962), and was further discussed by Hoq et al. (1974), Lee and Gross (1991), Agarwal and Kalla (1996), Agarwal and Al-Saleh (2001), Kalla et al. (2001). The generalized distribution of gamma has been used in several research fields such as engineering, hydrology and survival analysis. Nadarajah and Gupta (2007) use this distribution with applications for drought data. Ortega et al. (2009) discuss the effects of diagnostics in generalized gamma regression models. Cox (2008) discusses and compares F-generalized families with generalized gamma models and Cox et al. (2007) present parametric survival analysis and taxonomies of generalized gamma hazard level functions.
Generalized gamma distribution is very suitable for modeling data with various forms of hazard (risk) function. This characteristic is useful for estimating a person's hazard function (risk) and relative risk and relative time, this is very much needed in survival analysis. To estimate the parameters of the generalized gamma distribution in survival analysis is using Maximum Likelihood Estimation (MLE) which is one of the methods of estimation that most widely used. According to Millar (2011) maximum likelihood as a tool for inference, including evaluation of statistical significance, calculation of confidence intervals, assessment of models, and estimation parameters. Maximum likelihood estimation has optimal properties for a large sample size. So that maximum likelihood is the most widely used methods of parametric inference. Some cases often found in the log likelihood function cannot be maximized analytically, so it is necessary to calculate the MLE by iteration such as by the Newton-Raphson Maximization Procedure, while the other alternative is the expectationmaximization algorithm (Mclachlan a& Krishnan, 2008;Wang & Cheng, 2009;Ruhi et al., 2015).
Expectation-maximization algorithm is an approach to calculate maximum likelihood estimation, it is useful in several of incomplete data problems. The procedure in the expectation-maximization algorithm, there are two steps namely, the expectation step (e-step) and the maximization step (m-step) with the basic idea of the expectation-maximization algorithm associating incomplete data problems with complete data problems to get the final result of maximization likelihood estimation. The e-step focuses on producing complete data, using a collection of observable data, so that the simple steps of m-step maximize the complete data (Mclachlan & Krishnan, 2008). This study will discuss the estimation of parameter of generalized gamma distribution by using the expectation-maximization algorithm on type 1 right censored data.

Survival analysis
Survival time is data to measure the time to certain events / events, such as failure, death, relapse, development of certain diseases, parole, or divorce. The survival time distribution is usually described by three functions, the S(x) survival function, the f(x) probability density function, and h(x) the hazard function (hazard). These three functions are used to describe various aspects of the data. The fundamental problem in survival data analysis is to estimate parameter from the sample one or more of these three functions and to draw conclusions about survival patterns in the population (Lee & Wang, 2003).

Survival function
According to Klein and Moeschberger (1997) the survival function is the basis used to describe the phenomenon of timeto-event, the chance of someone surviving beyond time x (experiencing events after time x). Survival function is defined as follows: In the context of the failure of the equipment or goods produced, S(x) is referred to as a reliability function. When X is a continuous random variable, the survival function is the complement of the cumulative distribution function, that is, where ( ) = Pr ( < ), also the survival function is integral of the probability density function, ( ) namely: So that

Hazard Function (Risk)
According to Klein and Moeschberger (1997) hazard functions are known as conditional failure rates in reliability, mortality strength in demographics, function of intensity in stochastic processes, age-specific failure rates in epidemiology. The level of risk is determined by If X is continues random variable, so that Cumulative hazard function (risk) H(x) is defined as follows: So, for a sustainable life

Generalized Gamma Distribution
According to Cordeiro et al. (2012) that the generalized gamma distribution is good for modeling data with various types of hazard rate functions (risk levels), adding, reducing, which makes it very useful for estimating each hazard function (risk). The generalized gamma distribution has been used in several fields of research such as engineering, hydrology, and survival analysis. The probability density function of the generalized gamma distribution is as follows: Parameter and is known as shape and parameter is called as scale parameter.

Maximum Likelihood Estimation
According to Hogg et al. (2013), suppose random variables X1, X2, ..., Xn are mutually independent with the probability density function ƒ(x; ψ), ψ ϵ Ω. The parameter ψ is unknown. The procedure for estimating the parameters are based on the likelihood function is as follows: where x = (x1, x2, … , xn )'. The log of the likelihood function is generally easier to use, i.e.
Argmax is a method such that L(ψ; X) attains its maximum values for ψ . To determine the maximum likelihood estimation, it often uses the log of the likelihood function and then determines the critical value (Hogg and Craig, 1995), because the logarithmic function is monotone increase at (0, ∞) which means it has the same critical value. For example L(ψ) =Log L(ψ), then solving the maximum likelihood estimation with the following equation,

Expectation-MaximizationAlgorithm
According to Mclachlan and Krishnan (2008), the expectation-maximization (EM) algorithm is a method that is widely used to calculate the maximum likelihood iteration estimation that is useful in various incomplete data, which is if using an algorithm such as the Newton-Raphson method, it might be more complicated in overcoming the incompleteness data. At each EM iteration algorithm, there are two steps namely: the expectation step or e-step and the maximization step or mstep. The basic idea of the Expectation-Maximization Algorithm is to associate incomplete data problems with complete data problems to get the final result of maximum likelihood estimation. The e-step focuses on generating complete data, using a collection of observable data, so that the simple steps of m-step maximize the complete data. Let Y be a random vector that matches the data observed y and has p.d.f. g (y; ψ), where ψ = (ψ1, ..., ψd) T is an unknown parameter vector with parameter space Ω. The EM algorithm is a widely applied algorithm that provides iterative procedures for calculating maximization likelihood estimation. In this context, the observed vector data as incomplete. The idea of 'incomplete data' includes the notion of being lost in conventional data. Let x denote vectors containing complete data called augmented data, and let z denote vectors containing additional data, referred to as data that cannot be observed or lost. Let gc(x; ψ) be p.d.f. from the random vector X that corresponds to the complete vector x data. Then the complete log likelihood data function that can be formed for ψ if x is fully observable is as follows: log L c (ψ)= log g c ( x ; ψ ).
Let's say there are two sample spaces x and y and a many one to one mapping from x to y. Vector data complete data x in x, while vector data incomplete vector y = y (x) in y. So that where x (y) is the part of x that determines by the equation y = y (x). The EM algorithm is an approach to solving the likelihood equation for incomplete data indirectly by performing an iterative calculation for the complete log likelihood data function, log L c (ψ). Because its cannot be observed, the incomplete data is replaced by the conditional expectation value given by y, using a match for ψ. Let ( ) be the initial value for ψ. Then in the first iteration, e-step requires the following calculation m-step, then maximization (ψ; ( ) ) related to ( ) on the parameter space Ω. Namely, we select ( ) such that for all ψ ϵ Ω. E-step and m-step process is rerepeat, but with ( ) replaced by the fit of ( ) . At the-(k + 1) iterations, estep and m-step is defined as follows, e-step calculate Q(ψ; ( ) ), where m-step select ( ) to find the value ψ ϵ Ω that maximize Q(ψ; ( ) ) namely, for all ψ ϵ Ω.

Censored Data
Censorship is the loss of observation / information on the life span variables observed in the study. In survival data, censorship often occurs for various reasons. In clinical trials about the effectiveness of medical treatment for a diseases, for example, patients may leave the hospital due to migration or health problems. Sensors are generally divided into certain types. If someone has entered the study but cannot be followed up, the actual time of the event is placed somewhere to the right of the censored time along the time axis. This type of sensor is called the right sensor. Because right censors occur far more frequently than other types and information can be included in the estimation of survival models (Liu, 2012). According to Klein and Moeschberger (1997), there are three types of censorship as follows:

Right Sensor
Let X be the life span of a person and the time of censoring is Cr. A person's lifetime X will be known if X ≤ Cr. If X is greater than Cr then the person is said to be surviving or the right sensor. For example, from Lee and Wang (2003), suppose that there are six mice that are exposed to carcinogens by injecting tumor cells into their foot pads. Time to develop tumors of a certain size is observed. Researchers decided to stop the experiment after 30 weeks. Figure 1 is a time graph of tumor development. Rats A, B, and D experienced tumors after 10, 15, and 25 weeks, respectively. C and E mice did not have a tumor at the end of the study, so the likelihood of getting a tumor is 30 weeks. F rat died without a tumor after 19 weeks of observation. So the survival time data are 10, 15, 30+, 25, 30+, and 19+ weeks. The plus sign indicates censored observations.

Left Sensor
Let X be the life span of a person in study. If there is a left Cl-sensor, it means that someone has experienced an event before it was observed in the study.

Interval sensor
A person's lifetime occurs at a time interval.

Unbiased
One property that must be fulfiled by a parameter estimator of a distribution is the characteristic of the unbiased of the estimator. The estimator ( ) = ( , , … , ) ,which satisfied the following equation: is called unbiased estimator of ( ) (Warsono et al., 2019).

Minimum Variance
An estimator is said to be a good estimator if in addition to having an unbiased property, it also has a minimum variance property. Bain and Engelhardt (1992) states that for example, U(X) is an unbiased estimator for g(θ) with minimum variance, if for any other unbiased estimator U1(X) of g(θ), is an inequality for minimum variance (Hogg and Craig, 1995). According to Hogg and Craig (1995), ( ; ) is called Fisher Information and written as ( ). Such that the Equation (20) above can be simplified as:

Data analysis
The data used in this study are data from Monte Carlo simulation by using generalized gamma distribution. So that the choice of parameter values approaches the value of the reliability data, the selection of parameter values in the simulation is based on the reliability data available in libraries. This simulation is based on data from patients with gastric cancer who have been tested first using EASYFIT software that the data has generalized gamma distribution with parameters α = 1.0649, β = 1.072, θ = 59.766. Data generation is performed for several different size of data, namely for of n: 25, 50, 100, 200, 500, 1000, 1500 and 2000 with 100 replications and using R software and censoring data with percentages p = 10%, p = 20% and p = 30%.

Complete Data Likelihood Function from Generalized Gamma Distribution
Warsono (2009) states that the random variable X is said to have a generalized gamma distribution with parameters α, β, and θ if and only if the probability density function of X is, Estimation of the parameters of generalized gamma distribution by using the maximum likelihood estimation method begins by determining the likelihood function as follows: To facilitate the derivative of the likelihood function of the Generalized Gamma distribution in the survival time data of patients suffering from gastric cancer, then equation (23) is changed to logarithmic form as follows: Based on the estimation results presented in Table 1, it can be seen that the two methods give different estimation results.
Where the estimation of Parameter β and Parameter θ by using the EM method the estimate values are the same with the real parameter value. Next to see the accuracy of the estimation of the two methods, we compared the average values of Bias and MSE of each estimator. To make it easier to see the difference in average bias values given by the two methods, Table 2 is visualized in graphical form in Figure 2 and Figure 3. Based on the graph of the average bias values in Fig. 2. it can be seen that the bias value generated from the EM method is greater than the bias value produced by MLE. It can be seen that the greater the value of n for each sensor percentage p = 10%, p = 20% and p = 30% results in the smaller bias value produced by the EM and MLE methods. But the greater the percentage sensor value in the data, the greater the bias.   Fig. 3. the greater the n used, the smaller the average MSE of the EM and MLE methods. The average MSE value of the EM method is greater than the MLE method on the graph for each sensor percentage value p = 10%, p = 20% and p = 30%. So that, from the average bias and MSE values it is known that the MLE method gives a smaller average bias and MSE value than that of the EM method. So that, it can be said that the MLE method provides better estimator values regardless of the amount of data (n) and the percentage of censored data (p).

Establishing the Survival Function and Hazard Function from the First generating Data
After estimating parameter by using the Maximum Likelihood Estimation and Expectation-Maximization method assisted by the Newton Raphson method, the result given in Table 1, the survival and hazard functions of the data generated by the survival time parameters of patients with gastric cancer with parameter α = 1.0649 , β = 1,072, and θ = 59,766.The survival function of the Generalized Gamma distribution is as follows: Hazard function of the Generalized Gamma distribution is as follows: From the survival function and hazard function the survival time of patients suffering from gastric cancer above can be found the value of survival probability and the probability of failure. Several setting of time will be tried when x = 50, x = 100, x = 150, x = 200, x = 250, and x = 1000. One of the estimation results of the survival and hazard functions by using the methods of MLE and EM are as follows:   Fig. 5 show that the survival function curve for the MLE and EM methods, where the x-axis shows the value of the survival time of patients with gastric cancer and on the y-axis shows the estimation results of the survival function. On the survival function curve it is seen that for every increment of time, the value of the survival function continues to decrease approaches zero. So that, it can be concluded that each additional time, the chances of survival of patients suffering from gastric cancer will continue to decrease until the occurrence of death. In the hazard function curve by using the MLE and EM methods, where the x-axis is the value of the survival time of patients with gastric cancer and on the y-axis is the estimated results of the hazard function. The lines on the hazard function curve appear to go up along with the increasing survival time of patients with gastric cancer. So it can be concluded that the hazard function of the survival time data of patients with gastric cancer has an increasing hazard rate that is the increasing survival time of an individual, the failure rate will increase.

Conclusion
Based on the results obtained by using data from patients with gastric cancer (in the Asaur package in Software R) has a generalized gamma of right-sensed type 1 distribution with parameters α = 1.0649, β = 1,072, and θ = 59,766. From the simulations that have been carried out it can be seen that the larger the sample size of the data used, the smaller the estimated value of the bias and MSE for both methods of EM and MLE. Estimation of Generalized Gamma distribution parameters with right censored data type 1 using Maximum Likelihood Estimation and Expectation Maximization produces estimators that cannot be solved analytically, so it needs to be solved numerically. From the two estimation methods used it can be concluded that the best method for estimating the Generalized Gamma distribution parameters with type 1 right censored data is the Maximum Likelihood Estimation method because it has the smallest average bias value and mean square error compared to EM method. Although the average bias and MSE values of the EM method are greater than the MLE method, the EM method still provides the results of the estimation of bias and MSE are close to the results of the MLE method. Thus, it can be said that the EM method can become an alternative estimation method for estimating parameters in the Generalized Gamma distribution with type 1 right censored data.