Modeling Cancer Remission Time Data by Means of the Max Erlang Binomial Distribution

In this paper, a statistical simulation algorithm for the power series distribution, called the Max Erlang Binomial distribution, is proposed, analyzed, and tested for bladder cancer remission time data. In order to present the simulation technique, the EM algorithm for statistical estimation aimed at estimating the model parameters is described.


Introduction
The introduction of this new (generalized) distribution addresses reliability problems when lifetime can be expressed as the maximum or minimum of a sequence of independent and identically distributed (iid) random variables, which represents the system components' risk times. In recent years, some researchers have proposed a series of new distributions for the maximum and minimum of a sequence of iid random variables. For example, Adamidis and Loukas [1], Kus [2], Tahmasbi and Rezaei [3], Louzada et al. [4], and Cancho et al. [5] were interested in determining the maximum or minimum distribution when the components in a sequence of iid random variables are exponentially distributed, and a number of components are of a discrete type. Next, Flores et al. [6] treated the distribution of a vector's maximum with components that are exponentially distributed in a random number of a power series distribution type. This type of distribution is called the complementary exponential power series (CEPS) distribution. Also, Morais and Barreto-Souza [7] considered analyzing the Weibull distribution class by means of the power series distribution class (WPS). Recently, Louzada et al. [8] have developed a mathematical model that unifies the procedure for obtaining a distribution of the maximum and minimum of a sequence of iid random variables of the absolutely continuous type in a random number N characterized by the generating function. But the problem of determining the general formula when the random variable N forms a part of a power series distributions remains unsolved.
In this paper, the simulation algorithms for these family distributions are proposed. This study is intended as a completion of the research by Balkema and de Haan This work has the following structure: Section 2 defines the mathematical properties of the Max Erlang Binomial power series distribution (the cumulative distribution function, the probability density function, the mean, and variance). The simulation techniques targeting the Max Erlang Binomial distribution are analyzed and formulated in Section 3, with results validation via the Pearson test. In Section 4, the simulation algorithm for the Max Erlang Binomial distribution parameters is proposed and tested using the method of the maximum likelihood estimation. Section 5 discusses an application of the proposed distribution using a real-life dataset. Lastly, in Section 6, some useful conclusions are drawn.

Development of the Mathematical Model
In [11], the properties of a new power distribution type series, called the Max Erlang Binomial (MaxErlB), are introduced and researched. As a mathematical model, this distribution describes the probabilistic behavior of lifetimes, widely used in researching the reliability of systems. In [11], this distribution is presented as the distribution of the maximum value in a random volume sample Z from a statistical population, Erlang distributed, where Z is a binomially distributed, zero-truncated random variable. Formally, things are presented as follows.
Definition 1 ( [12]). We say that the random variable Z has a power series distribution if where a 1 , a 2 , ⋯ are nonnegative real numbers, τ is a positive number bounded by the convergence radius of power series (series function) AðΘÞ = ∑ z≥1 a z Θ z , ∀Θ ∈ ð0, τÞ, and Θ is the power parameter of the distribution (Table 1). PSD denotes the power series distribution function families. If the random variable Z has the distribution in Equation (1), then we write Z ∈ PSD.
We note that U Erl = max fX 1 , X 2 , ⋯, X Z g.
The results in this section are obtained using the general framework in [13], for which reason some proofs are not presented.
The numerical characteristics (mean, variance) of a random variable with a MaxErlB distribution, in a particular case (k = 2), are presented in the following result: Proposition 2. The mean and variance of the random variable U ErlB~M axErlBð2, λ, n, pÞ, λ > 0, n ∈ f1, 2 ⋯ g, p ∈ ð0, 1Þ, are characterized by the following relations: Proof. After Equation (3) and the definition of the mean, we obtain where as developed by Newton's binomial. Distribution

Computational and Mathematical Methods in Medicine
A sum of n-integrals then can be solved with elementary methods (method of integration by parts), which leads to Equation (4). Similarly, evaluating the second-order moment together with the definition of variance, leads us to Equation (5).
Remark 1. We notice that for k = 1, we obtain the complementary exponential distribution introduced by Flores et al. [6].

Statistical Simulation for the MaxErlB Distribution
Taking advantage of the fact that the random variable U ErlB~M axErlBðk, λ, n, pÞ, λ > 0, k, n ∈ f1, 2 ⋯ g, p ∈ ð0, 1Þ, has the same distribution as the random variable max 1≤i≤Z X i , where ðX i Þ i≥1 are iid random variables, X iẼ rlangðk, λÞ, k ∈ ℕ, k ≥ 1, λ > 0, and the value of the random variable Z~Binom * ðn, pÞ, p ∈ ð0, 1Þ, n ∈ f1, 2, ⋯g, coincide with the value of the random variable zerotruncated binomial distributed with the same parameters, but provided this is a nonzero value, we can briefly describe the following algorithm.

Statistical Simulation Algorithm for the MaxErlB Distribution
Step 1. We generate a value z ⋆ of the random variable Z ⋆ Binomðn, pÞ, p ∈ ð0, 1Þ, n ∈ f1, 2, ⋯g Step 2. If z ⋆ = 0, then GO TO Step 1; otherwise, z = z * Step 3. For the value z of the random variable Z (generated in Steps 1 and 2), simulate the values Step 4. It is considered u ErlB = max 1≤i≤z x i , STOP.
where n j , j = 1, r represents the number of observed values in the interval ½t j−1 , t j Þ, n 0 = ∑ r j=1 n j . The probabilities p j that the random variable U ErlB takes the values in the interval ½t j−1 , t j Þ are calculated using the following relation: where t j , j = 0, r1 represent the ends of each interval after they have been merged.
Based on the algorithm presented above, we can notice (see Table 2) that the mean and the empirical variance of the simulation results are well approximated by the mean and the theoretical variance of the random variable U ErlB MaxErlBð2,10,3, 0:2Þ (Proposition 2), and the Chisquare criterion validates each time the basic hypothesis according to which the simulated values are indeed governed by this distribution.
The histogram of the simulated data and the plot of the probability density function of the simulated distribution ( Figure 1) also confirm the validity of the basic hypothesis, but visually.

EM Algorithm for the MaxErlB Distribution
The EM algorithm introduced in 1977 in the paper [14] comes to perfect the maximum likelihood method which, in the case of processing incomplete statistical data, becomes practically unusable. Next, the algorithm is implemented for the MaxErlBð2, λ, 3, pÞ, λ > 0, p ∈ ð0, 1Þ distribution.
We consider the values of a sample ðx 1 , x 2 , ⋯, x m Þ of size m a statistical population govorned by a MaxErlB distribution with the probability density function u ErlB ðx, ΨÞ, x > 0, which depends on the parameter vector Ψ = ðλ, pÞ, given that the parameter n of the zero-truncated binomial distribution and the parameter k of the Erlang distribution are given. According to the definition of the maximum likelihood function and Equation (3) To obtain the maximum likelihood equations for the MaxErlB distribution regarding the estimation b λ,p for the parameters λ, p, we consider The parameters n and k being considered known, then the equations of the method for the maximum likelihood estimation function (MLE) are characterized by the nonlinear system SðΨÞ = 0, where SðΨÞ = ðð∂ ln L/∂λÞ, ð∂ ln L/∂pÞÞ. Developing the system of equations SðΨÞ = 0, we notice that it becomes difficult to solve in relation to the unknowns λ and p. We are thus in the situation in which the application of the EM algorithm explained and analyzed by Dempster et al. [14], then expanded by McLachlan and Krishnan [15] is required. In this algorithm, the random variable Z is considered a random variable latency, that is, the random variable which cannot be observed directly.
For this, we consider, formally, the following sample: by m observations of the random variable ðU ErlB , ZÞ. This shows that ððx 1 , z 1 Þ, ðx 2 , z 2 Þ, ⋯, ðx m , z m ÞÞ can be interpreted as a complete set of statistics, being, in this case, a sample of incomplete data. The description of the EM algorithm supposes a known conditional mean EðZjU ErlB ; ΨÞ, where Ψ = ðλ, pÞ.
The probability density function u ErlB ðx, ΨÞ, x > 0, of the random variable U ErlB wich corresponds to a complete set of data, is defined by the following relation according to the definition of probability density in the case of the maximum (see [13], Consequence 2.2): In these conditions, the probability density function u ErlB ðx, zÞ of the random variable ðU ErlB , ZÞ which corresponds to a complete set of data is given by where AðΘÞ = ð1 + ΘÞ n − 1, Θ = p/1 − p, p ∈ ð0, 1Þ, a z = n z ! , z ≤ n, f Erl ðxÞ, and F Erl ðxÞ, x > 0 are the probability density function, respectively, the cumulative distribution function which has the Erlangðk, λÞ, k ∈ ℕ, k ≥ 1, λ > 0 distribution. Then, the probability density function of the random variable Z conditioned by the random variable U ErlB has the following expression: Therefore, considering the obvious relation ∑ z≥1 z 2 a z Θ z−2 = A′′ðΘÞ + ð1/ΘÞ · A′ðΘÞ, the conditional mean becomes Since Z~Binom ⋆ ðn, pÞ ∈ PSD, k, n ∈ f1, 2, ⋯g with A ðΘÞ = ð1 + ΘÞ n − 1, Θ ∈ ð0,+∞Þ, Θ = p/1 − p, p ∈ ð0, 1Þ, we have We describe the EM algorithm for the MaxErlBð2, λ, 3, pÞ distribution as an iterative process of estimating the unknown parameter Ψ = ðλ, pÞ through Ψ ðhÞ = ðλ ðhÞ , p ðhÞ Þ calculated for a few steps h ≥ 1 such that the following condition is satisfied: or h = K be accomplished when ε > 0 and K represents the number of preset iterations. The steps of the EM algorithm for MaxErlB distribution are the following: Step 1. We take λ = λ ð0Þ , p = p ð0Þ , λ ð0Þ > 0, p ð0Þ ∈ ð0, 1Þ Step 2. (Expectation). To iterate h, h ≥ 1, we calculate the mean value of z ðh−1Þ j , j = 1, m according to Equation (17) for k = 2: Step 3. (Maximization). Through the maximum likelihood estimation (MLE) method, we take into consideration the following sample: with the maximum likelihood function: Thus, we can find iteration Ψ ðhÞ = ðλ ðhÞ , p ðhÞ Þ which estimates the parameters Ψ = ðλ, pÞ Step 4. We examine Equation (18). If NOT, then GO TO Step 2; otherwise, Ψ ≔ Ψ ðhÞ , STOP.

Given the function
: the maximum likelihood equations are characterized by the nonlinear system SðΨ ðh−1Þ Þ = ðð∂ ln L/∂λ ðh−1Þ Þ, ð∂ ln L/∂ p ðh−1Þ Þ, namely Table 3 shows the results obtained from the implementation of the EM algorithm (described above), in the Octave 1.5.4 GUI programming environment. We must also emphasize that for different sample sizes (m ∈ f100,1000,10000,100000,1000000g), we obtain very good approximations of the parameters λ and p that Computational and Mathematical Methods in Medicine characterize the MaxErlB distribution, when the parameters k and n are known.

Application
We will now consider a dataset which represents the remission times (in months) of a random sample of 128 bladder cancer patients. The dataset itself has previously been used in [16][17][18] Figure 2 provides the histogram of relative frequencies of a sample size which characterizes the remission times of bladder cancer, where the curve represents the pdf of the random variable U ErlB~M axErlBð2,10,3, 0:2Þ distribution defined by Equation (3).

Conclusion
The conclusions revealed by the present research are related to the study of power series distributions type of a maximum of a sequence of iid random variables which are found in a random number.
Also, the distribution of a maximum number of iid random variables through the PSD family, characterized by the number of the random variable in the sequence, was presented in a compact, coherent approach.
For this purpose, programs for the statistical simulation of the MaxErlB power series distributions type were developed. The validity of the maximum distributions was performed using Pearson's test of consistency and is reflected in Table 2. Describing the EM algorithm implemented in the GUI Octave 1.5.4 programming environment to estimate the parameters of the MaxErlB distribution is presented in Table 3.
A real data sequence on bladder cancer remission times was used to illustrate and compare the histogram of the relative frequencies of remission times and the probability density function plot of the remission time values that are governed by the MaxErlB distribution ( Figure 2).

Data Availability
All data are fully available without restriction.

Conflicts of Interest
The author declares no conflicts of interest.