Parameter estimation for Weibull distribution With right censored data using em algorithm

The estimation process is supported by a number of statistical techniques, methods and procedures for analyzing the data on the variable of interest that may be the time that elapses from the welldefined initial instant, for example, installation of equipment, until the occurrence of a specific event, such as the failure of the equipment or component under consideration. The process aims to estimate the distribution parameters for modeling the system under review. The parameters are the distribution characteristics that show the behavior of a given population and therefore fixed for a specific system. Thus, the estimation of system parameters is obtained from the data collected from the population. System data analysis can be obtained from various possible sources, namely, laboratory tests or a recording of occurrences along its use (historical record). The parametric analysis assumes that the data fits a specific distribution, such as the Weibull distribution. The Weibull distribution has a wide application in various fields. These applications include its use to model the distribution phenomena of fatigue and the life of many devices, such as bearings, shaft and motor [1, 14, 20]. The popularity of the distribution of Weibull due to its great flexibility, i.e., as it can describe functions with failure intensity constant, increasing and decreasing, for different values of the shape parameter. Since the distribution of Weibull became widely recognized, various methods have been proposed to estimate its parameters [2,17,19]. However, the maximum-likelihood estimation method is currently one of the most used methods of estimation, for its versatility and provides reliable results [10]. The maximum-likelihood estimation method allows us to estimate the unknown parameters of a statistical model. These parameters are obtained by maximizing the likelihood function of the model in question [4].


Introduction
The estimation process is supported by a number of statistical techniques, methods and procedures for analyzing the data on the variable of interest that may be the time that elapses from the welldefined initial instant, for example, installation of equipment, until the occurrence of a specific event, such as the failure of the equipment or component under consideration.
The process aims to estimate the distribution parameters for modeling the system under review.The parameters are the distribution characteristics that show the behavior of a given population and therefore fixed for a specific system.Thus, the estimation of system parameters is obtained from the data collected from the population.
System data analysis can be obtained from various possible sources, namely, laboratory tests or a recording of occurrences along its use (historical record).
The parametric analysis assumes that the data fits a specific distribution, such as the Weibull distribution.
The Weibull distribution has a wide application in various fields.These applications include its use to model the distribution phenomena of fatigue and the life of many devices, such as bearings, shaft and motor [1,14,20].
The popularity of the distribution of Weibull due to its great flexibility, i.e., as it can describe functions with failure intensity constant, increasing and decreasing, for different values of the shape parameter.
Since the distribution of Weibull became widely recognized, various methods have been proposed to estimate its parameters [2,17,19].
However, the maximum-likelihood estimation method is currently one of the most used methods of estimation, for its versatility and provides reliable results [10].
The maximum-likelihood estimation method allows us to estimate the unknown parameters of a statistical model.These parameters are obtained by maximizing the likelihood function of the model in question [4].

Types of data
With the information from a historical record of data is possible to obtain indicators to estimate and understand the behavior of the equipment with respect to failures.Therefore, using appropriate methodologies, it will be possible to set the proper maintenance policies to any equipment and their components.

sciENcE aNd tEchNology
The data is considered complete when it is known the exact time of each system failure.In many cases the data contain uncertainties, i.e., it is not known the exact moment when the failure occurred.The data containing such uncertainty as to when the event occurred are regarded as incomplete or partial.Incomplete data can be classified into censored or truncated [8,13].
Incomplete data give only part of the information about the failure time of the units under review.However, this information should not be ignored or treated as failure.In the absence of such data, it would not be possible to make good estimation parameters and thus make a proper analysis.
One of the most common types of censored data, which may arise in real cases, is Type-1 right censored data [9].For Type-1 right censored data, all units of a system are observed up to the date of completion of the study.For this censorship scheme the time each unit is under observation is fixed, while the number of units that fail (uncensored observations) is random.
If T is a random variable representing the failure time and Cd another random variable independent of T which corresponds to the end of the registration information (observation time).It is said that the time to failure is right censored when one does not know its exact value, only that its value is greater than Cd, with regard to item i (i = 1, 2, ..., n).Therefore: The δ i variable (censorship indicator) indicates whether T i is censored or not.The obtained data can be represented by the pair (t i , δ i ) i.e. t i the failure time or censored time and δ i the variable that indicates whether it concerns a failure or censorship, that is, (2): In the right censored data the failure time of the units with censored data it is just known to be greater than the operating time of the conclusion of the registration information.These right censored data are further classified into Type-1 if the recording of information is interrupted at a predetermined time and Type-2 censure if registration is completed when a predetermined number of failures occur [12].

The maximum-likelihood estimation method
As we have seen, in the particular case of the right censored data classified as Type-1 censorship, the recording of information is interrupted at a predetermined time Cd > 0, such that t i is observed to occur before Cd, otherwise, one only knows that the failure time is greater than the observation time.
Let t i = t 1 , t 2 , ..., t n , n independent observations, where r records are failure times and (n -r) are censured information.
Let the probability density function, f(x, θ) and the cumulative probability function F(x, θ), with the distribution parameters denoted by θ, the likelihood function for right censored data Type-1 is given by [18]: Where (x 1 , x 2 , ..., x n ) is a sample of n independent observations of the random variable X and θ = (θ 1 , θ 2 , ..., θ k ) is the vector of unknown distribution parameters.
For the Weibull distribution with scale parameter η and shape parameter β, the likelihood function for the right censored data Type-1 is given by: In many situations it is easier to achieve the maximization of the log of the likelihood function and, since the logarithm function is a monotonically increasing function, is equivalent to maximizing the likelihood function or the log-likelihood function.
By applying logarithm to eq. 4, it becomes: As the maximum-likelihood equations in many cases have no analytical solution to determine their solutions, it is needed to use numerical optimization methods.Among the possible numerical solutions, in this paper the Expectation-Maximization (EM) algorithm was selected [3,5,15].

The Expectation-Maximization algorithm
The EM algorithm is an iterative process that can be used to calculate the maximum likelihood estimators in cases with incomplete data.
Let X be the set of complete data with the probability density function f c (x, θ) and Y the observed data.The corresponding loglikelihood function to the full sample is represented by: ln , , Each iteration of the EM algorithm involves two steps: step E (expectation) and step M (maximization), defined by [15], Step E: To calculate Q k θ θ , ( ) ( ) , where: The procedure is performed until the difference between the iteration k and iteration k+1: decrease to an acceptable value, with  > 0 In Step E the algorithm calculates the conditional expected value of the logarithm of the likelihood function for complete data given the observed sample and step M calculates its maximum.This algorithm requires an initial solution for the values of the distribution parameters, designated by θ (0) .The selection of this initial solution requires particular attention to the extent that the algorithm convergence speed may become extremely slow due to a poor choice.Another aspect to take into account is the maximum likelihood equation can have multiple solutions corresponding to local maxima, so the choice of the starting solution becomes important.
A comparative study of various strategies in the choice of initial values can be found in Karlis and Xekalaki [11].The results obtained by these authors demonstrate the importance of the choice of initial values, to have a reasonable convergence speed.
The figure 1 illustrates the EM algorithm procedure.
The EM algorithm has several advantages which stand out relative to other iterative algorithms: The EM algorithm converges in a large variety of conditions, that is, from an arbitrary data, θ (0) the algorithm usually finds a local maximum, except for a poor choice of initial solution θ (0)  or in the wrong formulation of the likelihood function.
The required analytical work is simpler than with other meth-ods, since only it is necessary to maximize the conditional expected value of the log-likelihood for complete data.
The EM algorithm is relatively easy to program and imple-ment.
During the iterations it is possible to control the convergence and programming errors However it also has some disadvantages: The EM algorithm can converge slowly, even in some appar-ently simple problems and on issues where there is a lot of incomplete information.The EM algorithm does not have an integrated process for pro-ducing an estimate of the covariance matrix of the estimated parameters.But this disadvantage can be overcome by using appropriate methodology.The EM algorithm does not guarantee convergence to the glo-bal maximum when there are multiple maximum sites.The obtained estimate depends on the initial solution

EM algorithm with right censored data
The observed data consists of (Y i , δ i ), with Y i = min (T i , Cd i ), where Cd i corresponds to the observation times, δ i = 1 (T i ≤ Cd i ), corresponds to the non-censored data and δ i = 0 (T i > Cd i ), to the censored data.
In the step E of the algorithm it is required the calculation of the conditional expected value of the log-likelihood for complete data, given the observed sample.In this case the log-likelihood function to complete data is given by equation (10), for n data collected: The θ = (η, β) are the distribution parameters, δ = (δ 1 , δ 2 , ..., δ n ) is the vector indicator of censorship and y = (y 1 , y 2 , ..., y n ) is the vector of observed data.
We have from equation ( 7): Where: For the conditional expected value is first necessary to determine the expression of the corresponding conditional probability density function, f (y|y') (y|y').For this case is given by [16]: So, Where: Γ p x u e du is the upper incomplete gamma function Introducing equations ( 15) and ( 16) in equations ( 12) and ( 13): Thus, the expression referring to step E of the algorithm EM, Q(θ, θ (k) ) with censored data to the right is given by the following equation: As mentioned above, the step M is intended to find the solution θ (k+1) which maximize Q(θ, θ (k) ).
In order to get the points that maximize the function it is necessary to solve the partial derivatives of the above equation and equal them to zero, as shown in the following equations: Thus, the solution for the estimation of the distribution parameters η is given by the following equation: With equation ( 22) for η parameter, it is possible to calculate the β parameter of the distribution.
The second derivative must be negative to ensure that the results obtained correspond to a maximum point.The second derivative equations are as follows: With this analysis it is possible to calculate the Weibull distribution parameters β and η, through the knowledge of the maximum values of function Q.

Case study
The case study concerns the history of five centrifugal pumps failures of a petrochemical company, used to pump oil with similar density for the period 2006 to 2013.
From the collected data it can be seen that one of the pumps components is responsible for an important number of failures.It is the mechanical seal which represents approximately 45% of the total number of failures.Thus, based on the assessment of these results, it justifies the need of further study on this component.
For the purposes of this study, we select only the data related to failures due to excessive leakage of fluid to the outside of the pumps.The reason for this consideration is due to the significant number of failures and to restrict the study for just one failure mode.
The possible failures between inspections are treated as complete data, as the centrifugal pumps in question are visually inspected by the user of the equipment at least every 8 hours.So, it was not considered the possible mistake between two inspections compared to the total time of the observation and it was assumed that the exact moment of failure was well known.
The last record in each pump does not correspond to a fault but at the end of the test, since the pumps continued in operation.Thus the last recording time of each pump was considered as right censored data.
For the estimated values of the Weibull distribution parameters by the maximum-likelihood method it was used the EM algorithm for right censored data as described in section 3.2.
For the algorithm implementation process it was used the statistical program R.
The autors used a manually written code in R software and also used some packages like for exemple, maxNR, boot and mass.
The initial solution θ (0) was attributed by the result obtained by the least squares method.The iterative process stopped when the difference between the iteration k and iteration k+1, given by equation (9), was less than 0,1.
The following Table 1 shows the expected value for β and η for each of the mechanical seals applying the maximum-likelihood method with the EM algorithm, as described in this paper.The respective confidence intervals are also presented.Because the sample of data is small, it was used the bootstrap-t method in determining the confidence intervals with a confidence level of 95% [6,7].
The bootstrap-t method allows the calculation of the confidence interval of the parameters of interest in particular in the case where the sample is small (n <30).
New samples are taken randomly from resampling the original sample.For the randomization of the process be minimized it is necessary to perform a large number of resampling, B. With the generated bootstrap samples it is possible to calculate the standard deviation of B repetitions that will be used in the confidence intervals: From the table of the t-Student distribution gets the value t c that: (28) Thus, the bootstrap-t confidence interval with confidence level 100 (1-α)% is given by: (29) The number of resampling for the bootstrap method is equal to 1000.From the results obtained and presented in table 1, it can be noted that, all mechanical seals feature the shape parameter β >1.The values of the scale parameter η vary between 345.31 and 403.68 days of operation.
With the information obtained by the bootstrap method, it is possible to represent the confidence interval around the probability density function of the estimated Weibull distribution, as shown in Figure 2 to seal 1.
The estimation of the unknown parameters of the Weibull distribution was also obtained, in the same system, with the least squares method, to compare with the results obtained by the EM algorithm.
In Rinne is indicated how to obtain the estimates of the unknown parameters using least squares method.
The following Table 2 shows the expected value for β and η for each of the mechanical seals applying the least squares method.
The shape parameter, β, is higher by maximum likelihood method (EM algorithm) than the least squares method and in both is higher than 1.
For the scale parameter, η, the values obtained by the two methods are similar.

Conclusion
The prediction of failures of equipment for a given time horizon can only be considered in probabilistic terms, because there is always uncertainty about the time they will happen.Thus, this work has focused on the presentation of statistical techniques inherent to an estimation process of the parameters of a theoretical probabilistic model that best fits the observed data.
Also, it showed the importance of the record of occurrences of failures during the use of equipment (historical record).This infor-

Fig. 2 .
Fig. 2. Confidence interval around the probability density function of the estimated Weibull distribution by the bootstrap method, seal 1

Parameter estimation for Weibull distribution With right censored data using em algorithm ZastosoWanie algorytmu maksymaliZacji Wartości ocZekiWanej do estymacji ParametróW roZkładu Weibulla W PrZyPadku danych obciętych PraWostronnie
The maximum-likelihood estimation (MLE) is a method of estimating the parameters of a statistical model for given data.This method allows us to estimate the unknown parameters of a statistical model.These parameters are obtained by maximizing the likelihood function of the model in question.In many practical situations the likelihood function is associated with complex models and the likelihood equation has no explicit analytical solution, it is only possible to have its resolution through numerical methods.The estimation of the parameters of the Weibull distribution by maximum-likelihood method based on information from a historical record with right censored data shows this difficulty.The solution presented in this article entails using the Expectation-Maximization (EM) algorithm.Actual data from the historical record of 5 centrifugal pumps failures of a petrochemical company were analyzed for application of the methodology.Keywords: EM algorithm, parameter estimation, maximum likelihood estimates, reliability.Metoda największej wiarygodności (MLE) służy do estymacji parametrów modelu statystycznego dla zadanych danych.Metoda ta pozwala na estymację nieznanych parametrów modelu statystycznego.Parametry te otrzymuje się poprzez maksymalizację funkcji wiarygodności rozważanego modelu.Często w praktyce metoda ta może jednak nastręczać trudności związane z wielomodalnością funkcji wiarygodności oraz niemożnością uzyskania jawnych analitycznych rozwiązań równań wiarygodności.Równania takie można jedynie rozwiązywać za pomocą metod numerycznych.

Table 1 .
Expected value of β and η (days) for each of the mechanical seals

Table 2 .
Expected value of β and η (days) for each of the mechanical seals obtained with the least squares method