A discrete Ramos-Louzada distribution for asymmetric and over-dispersed data with leptokurtic-shaped: Properties and various estimation techniques with inference

In this paper, a flexible probability mass function is proposed for modeling count data, especially asymmetric and over-dispersed observations. Some of its distributional properties are investigated. All of its statistical properties can be expressed in explicit form, so the proposed model can be used in time series and regression analysis. Different estimation approaches, including maximum likelihood, moments, least squares, Anderson-Darling, Cramér-von Mises, and maximum product of spacing estimators, are derived to find the best estimator for real data. The performance of these estimation techniques is assessed via a comprehensive simulation study. The flexibility of the new discrete distribution is assessed using four distinct real data sets (coronavirus deaths, flood peaks, forest fires, and leukemia survival times). Finally, the new probabilistic model can serve as an alternative to other competitive distributions available in the literature for modeling count data.


Introduction
Count data arise in many fields, for example the yearly number of destructive earthquakes, the number of patients with a specific disease in a hospital ward, the number of machine failures, the number of coronavirus patients, the number of monthly traffic accidents, hourly bacterial growth, and so on. Various discrete probability models have been used to model these kinds of data sets. The Poisson and negative binomial distributions are frequently used for modeling count observations. However, data generated in different fields are becoming more complex day by day, and existing discrete models do not always provide an efficient fit.
Although various distributions are available in the literature for analyzing count observations, there is still a need for more flexible distributions that remain suitable under different conditions. The fundamental purpose of this paper is to propose a discrete version of the Ramos-Louzada distribution, a one-parameter lifetime distribution introduced by [24]. The proposed one-parameter distribution has distinctive properties that make it a strong choice for modeling over-dispersed, positively skewed, and leptokurtic data. A continuous random variable $Y$ is said to have the Ramos-Louzada distribution if its probability density function (pdf) can be written as
$$f(y;\lambda)=\frac{1}{\lambda^{2}(\lambda-1)}\left(\lambda^{2}-2\lambda+y\right)e^{-y/\lambda},\quad y>0, \tag{1}$$
where $\lambda \geq 2$ is the shape parameter. The survival function (sf) corresponding to Eq (1) can be formulated as
$$S(y;\lambda)=\frac{\lambda^{2}-\lambda+y}{\lambda(\lambda-1)}\,e^{-y/\lambda},\quad y>0. \tag{2}$$
In this article, the discrete version of the Ramos-Louzada distribution is proposed and studied in detail. The proposed distribution has several interesting features. First, its statistical and reliability characteristics can be expressed in closed form. Second, its failure rate is increasing. Third, in the applications considered here, it fits time and count data sets better than the competing distributions. As a result, we believe that the proposed model is a good candidate for a wide range of applications.
The rest of the study is organized as follows: In Section 2, we introduce a new distribution using survival discretization methodology. Different mathematical properties are derived in Section 3. Parameter estimation and simulation study are presented in Section 4. Four data sets are utilized to show the flexibility of the proposed model in Section 5. Finally, Section 6 provides some conclusions.

Synthesis of discrete Ramos-Louzada distribution
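Let $Y$ be a continuous random variable with sf $S(y;\boldsymbol{\xi})$. The pmf of the discrete random variable $X=\lfloor Y\rfloor$ can then be expressed as
$$\Pr(X=x;\boldsymbol{\xi})=S(x;\boldsymbol{\xi})-S(x+1;\boldsymbol{\xi}),\quad x=0,1,2,\ldots, \tag{3}$$
where $\lfloor\cdot\rfloor$ denotes the floor function, which returns the largest integer less than or equal to its argument, and $\boldsymbol{\xi}$ is a $1\times k$ parameter vector. If the random variable $Y$ has the Ramos-Louzada (RL) distribution, then the pmf of the discrete RL (DRL) distribution can be written as
$$\Pr(X=x;\lambda)=\frac{e^{-x/\lambda}}{\lambda(\lambda-1)}\left[\lambda^{2}-\lambda+x-\left(\lambda^{2}-\lambda+x+1\right)e^{-1/\lambda}\right],\quad x=0,1,2,\ldots, \tag{4}$$
where $\lambda\geq 2$ is the shape parameter. The pmf $\Pr(X=x+1;\lambda)$ can be expressed as a weight of the pmf $\Pr(X=x;\lambda)$ as follows
$$\Pr(X=x+1;\lambda)=e^{-1/\lambda}\,\frac{\lambda^{2}-\lambda+x+1-\left(\lambda^{2}-\lambda+x+2\right)e^{-1/\lambda}}{\lambda^{2}-\lambda+x-\left(\lambda^{2}-\lambda+x+1\right)e^{-1/\lambda}}\,\Pr(X=x;\lambda). \tag{5}$$
Writing $F(x;\lambda)=\Pr(X\leq x;\lambda)$ for the cdf, the sf of the DRL model can be expressed as
$$S(x;\lambda)=\Pr(X>x;\lambda)=1-F(x;\lambda)=\frac{\lambda^{2}-\lambda+x+1}{\lambda(\lambda-1)}\,e^{-(x+1)/\lambda}. \tag{6}$$
The hazard rate function (hrf) of the DRL model is given by
$$h(x;\lambda)=\frac{\Pr(X=x;\lambda)}{1-F(x-1;\lambda)}=1-\frac{\lambda^{2}-\lambda+x+1}{\lambda^{2}-\lambda+x}\,e^{-1/\lambda}. \tag{7}$$
Since the ratio $(\lambda^{2}-\lambda+x+1)/(\lambda^{2}-\lambda+x)$ decreases in $x$, the hrf of the DRL model is always increasing, which makes it an effective statistical tool for modeling data, especially in the engineering and medical fields. Figure 2 shows the hrf plots of the new discrete model for various values of the model parameter. The reversed hazard rate function (rhrf) and the second rate of failure are given as
$$r(x;\lambda)=\frac{\Pr(X=x;\lambda)}{F(x;\lambda)}\quad\text{and}\quad h^{*}(x;\lambda)=\log\left[\frac{\Pr(X\geq x;\lambda)}{\Pr(X\geq x+1;\lambda)}\right]=-\log\left[1-h(x;\lambda)\right]. \tag{8}$$
After some simple algebra, it is found that the rhrf of the DRL model is always decreasing. Figure 3 shows some rhrf plots of the proposed model for specific parameter values.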
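The construction above can be checked numerically. The following is a minimal sketch, assuming the RL survival function $S(y;\lambda)=(\lambda^{2}-\lambda+y)e^{-y/\lambda}/[\lambda(\lambda-1)]$; the function names are illustrative and not taken from the paper.

```python
import math

def rl_sf(y, lam):
    """Survival function of the continuous RL distribution (assumed form)."""
    return (lam * lam - lam + y) * math.exp(-y / lam) / (lam * (lam - 1.0))

def drl_pmf(x, lam):
    """Pr(X = x) for X = floor(Y): difference of the RL sf at x and x + 1."""
    return rl_sf(x, lam) - rl_sf(x + 1, lam)

def drl_cdf(x, lam):
    """Pr(X <= x) = Pr(Y < x + 1) = 1 - S_Y(x + 1)."""
    return 1.0 - rl_sf(x + 1, lam)

def drl_hrf(x, lam):
    """h(x) = Pr(X = x) / Pr(X >= x), where Pr(X >= x) = S_Y(x)."""
    return drl_pmf(x, lam) / rl_sf(x, lam)
```

Summing `drl_pmf(x, lam)` over a long range of `x` should give a value very close to one, and `drl_hrf` should increase in `x` for any `lam >= 2`, in line with the discussion of the hrf.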

Distributional properties
In this section, the probability generating function (pgf) and the $r$th moment are investigated. Assume that the random variable $X$ follows the DRL model; then the pgf can be expressed as
$$G(z)=\sum_{x=0}^{\infty}z^{x}\Pr(X=x;\lambda)=1+(z-1)\,\frac{e^{-1/\lambda}\left[\lambda^{2}-\lambda+1-\lambda(\lambda-1)\,z\,e^{-1/\lambda}\right]}{\lambda(\lambda-1)\left(1-z\,e^{-1/\lambda}\right)^{2}}. \tag{9}$$
On replacing $z$ by $e^{t}$ in Eq (9), the moment generating function (mgf) is obtained. The first four moments about the origin $(\mu'_{1},\mu'_{2},\mu'_{3},\mu'_{4})$ follow by differentiating the mgf at $t=0$; the mean, for instance, is
$$\mu'_{1}=\frac{e^{-1/\lambda}}{1-e^{-1/\lambda}}+\frac{e^{-1/\lambda}}{\lambda(\lambda-1)\left(1-e^{-1/\lambda}\right)^{2}}.$$
Based on the raw moments, the variance can be expressed as $\mathrm{Var}(X)=\mu'_{2}-(\mu'_{1})^{2}$. The dispersion index (di) is defined as the variance-to-mean ratio. The di indicates whether a model is suitable for under-dispersed (di < 1), equi-dispersed (di = 1), or over-dispersed (di > 1) data sets. Using the derived moments, the coefficients of skewness and kurtosis can also be given in closed form. Some numerical computations of the mean, variance, di, skewness, and kurtosis for different values of the DRL parameter are listed in Table 1. According to Table 1, the DRL model can be used effectively to model over-dispersed data, since the di is greater than one, which makes it a suitable probability tool for actuarial data. Moreover, the new discrete probabilistic model can be used to analyze positively skewed and leptokurtic data.
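The quantities discussed above can be reproduced approximately by truncated summation. The sketch below assumes the DRL pmf obtained by survival discretization of the RL distribution; `drl_summary` and `drl_mean_closed` are hypothetical helper names, and the truncation point is a choice made for this illustration.

```python
import math

def drl_pmf(x, lam):
    """DRL pmf (assumed form): S_Y(x) - S_Y(x + 1) for the RL survival function."""
    q = math.exp(-1.0 / lam)
    a = lam * lam - lam
    return q ** x * (a + x - (a + x + 1) * q) / (lam * (lam - 1.0))

def drl_mean_closed(lam):
    """Closed-form mean, obtained by summing S_Y(x) over x = 1, 2, ..."""
    q = math.exp(-1.0 / lam)
    return q / (1 - q) + q / (lam * (lam - 1.0) * (1 - q) ** 2)

def drl_summary(lam, upper=None):
    """Mean, variance, dispersion index, skewness and kurtosis by truncated sums."""
    if upper is None:
        upper = int(60 * lam) + 50  # the tail beyond this point is negligible
    xs = range(upper)
    p = [drl_pmf(x, lam) for x in xs]
    mean = sum(x * px for x, px in zip(xs, p))
    var = sum((x - mean) ** 2 * px for x, px in zip(xs, p))
    m3 = sum((x - mean) ** 3 * px for x, px in zip(xs, p))
    m4 = sum((x - mean) ** 4 * px for x, px in zip(xs, p))
    return {
        "mean": mean,
        "variance": var,
        "di": var / mean,
        "skewness": m3 / var ** 1.5,
        "kurtosis": m4 / var ** 2,
    }
```

Calling `drl_summary(lam)` for a few values of `lam >= 2` gives a quick way to see that the dispersion index exceeds one, the skewness is positive, and the kurtosis exceeds three.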

Estimation methods
In this section, six estimation methods are used to estimate the unknown parameter of the DRL distribution. The methods considered are maximum likelihood estimation (mle), the method of moments (mom), least-squares estimation (lse), Anderson-Darling estimation (ade), Cramér-von Mises estimation (cvme), and the maximum product of spacing estimator (mpse).

Maximum likelihood estimation
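Assume a random sample $x_{1},x_{2},\ldots,x_{n}$ from the DRL model; then the log-likelihood function can be expressed as
$$\ell(\lambda)=-n\log\left[\lambda(\lambda-1)\right]-\frac{1}{\lambda}\sum_{i=1}^{n}x_{i}+\sum_{i=1}^{n}\log\left[\lambda^{2}-\lambda+x_{i}-\left(\lambda^{2}-\lambda+x_{i}+1\right)e^{-1/\lambda}\right]. \tag{12}$$
Differentiating Eq (12) with respect to the parameter $\lambda$ and setting the result to zero, we get the non-linear equation
$$-\frac{n(2\lambda-1)}{\lambda(\lambda-1)}+\frac{1}{\lambda^{2}}\sum_{i=1}^{n}x_{i}+\sum_{i=1}^{n}\frac{(2\lambda-1)\left(1-e^{-1/\lambda}\right)-\lambda^{-2}\left(\lambda^{2}-\lambda+x_{i}+1\right)e^{-1/\lambda}}{\lambda^{2}-\lambda+x_{i}-\left(\lambda^{2}-\lambda+x_{i}+1\right)e^{-1/\lambda}}=0. \tag{13}$$
An exact solution of Eq (13) is not available in closed form, so Eq (12) is maximized numerically, for example with the Newton-Raphson approach in R.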
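As an illustration of the numerical maximization, the following sketch minimizes the negative log-likelihood with a golden-section search instead of Newton-Raphson. It assumes the DRL pmf derived from the RL survival function and assumes the log-likelihood is unimodal on the search interval; the upper search bound `max(sample) + 3` is an arbitrary choice for this example.

```python
import math

def drl_logpmf(x, lam):
    """log Pr(X = x) for the DRL model (assumed pmf form)."""
    q = math.exp(-1.0 / lam)
    a = lam * lam - lam
    return -x / lam - math.log(lam * (lam - 1.0)) + math.log(a + x - (a + x + 1) * q)

def neg_loglik(lam, sample):
    """Negative log-likelihood of an i.i.d. sample."""
    return -sum(drl_logpmf(x, lam) for x in sample)

def golden_min(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimizer of a unimodal f on [lo, hi]."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    c = hi - invphi * (hi - lo)
    d = lo + invphi * (hi - lo)
    fc, fd = f(c), f(d)
    while hi - lo > tol:
        if fc < fd:
            hi, d, fd = d, c, fc
            c = hi - invphi * (hi - lo)
            fc = f(c)
        else:
            lo, c, fc = c, d, fd
            d = lo + invphi * (hi - lo)
            fd = f(d)
    return 0.5 * (lo + hi)

def drl_mle(sample, upper=None):
    """Maximum likelihood estimate of lambda, restricted to lambda >= 2."""
    if upper is None:
        upper = max(sample) + 3.0  # hypothetical upper bound for the search
    return golden_min(lambda lam: neg_loglik(lam, sample), 2.0, upper)
```

A derivative-free search is used here only to keep the sketch short and dependency-free; any bounded one-dimensional optimizer could be substituted.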

Method of moment estimation
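Following the definition of the mom, we equate the sample mean $\bar{x}$ to the corresponding population mean and then solve the resulting non-linear equation for the parameter $\lambda$:
$$\bar{x}=\frac{e^{-1/\lambda}}{1-e^{-1/\lambda}}+\frac{e^{-1/\lambda}}{\lambda(\lambda-1)\left(1-e^{-1/\lambda}\right)^{2}}. \tag{14}$$
To solve Eq (14), a numerical root finder such as the uniroot function in R can be used.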
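The root-finding step can be sketched with plain bisection. The sketch assumes the closed-form DRL mean $e^{-1/\lambda}/(1-e^{-1/\lambda})+e^{-1/\lambda}/[\lambda(\lambda-1)(1-e^{-1/\lambda})^{2}]$ and assumes that this mean increases in $\lambda$ on $\lambda\geq 2$; `drl_mom` is a hypothetical name. Because the mean at $\lambda=2$ is about 3.5, a smaller sample mean has no interior solution, and the boundary value is returned in that case.

```python
import math

def drl_mean(lam):
    """Closed-form mean of the DRL model (assumed form)."""
    q = math.exp(-1.0 / lam)
    return q / (1 - q) + q / (lam * (lam - 1.0) * (1 - q) ** 2)

def drl_mom(sample, tol=1e-10):
    """Method-of-moments estimate of lambda by bisection on drl_mean(lam) = xbar."""
    xbar = sum(sample) / len(sample)
    lo, hi = 2.0, xbar + 2.0  # drl_mean(lam) exceeds lam, so hi brackets the root
    if xbar <= drl_mean(lo):
        return lo  # sample mean below the smallest attainable mean: boundary estimate
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if drl_mean(mid) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```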

Least squares estimation
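The lse is a standard approach that estimates the parameter by minimizing a sum of squared residuals. Let $x_{(1)}\leq x_{(2)}\leq\cdots\leq x_{(n)}$ denote the ordered sample and $F(\cdot;\lambda)$ the cdf of the DRL distribution. The lse of the parameter of the DRL distribution can be obtained by minimizing
$$\mathrm{lse}(\lambda)=\sum_{i=1}^{n}\left[F\left(x_{(i)};\lambda\right)-\frac{i}{n+1}\right]^{2}$$
with respect to the parameter $\lambda$.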
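A minimal sketch of the least-squares fit is given below. It assumes the usual objective $\sum_{i}[F(x_{(i)};\lambda)-i/(n+1)]^{2}$ and the DRL cdf $F(x;\lambda)=1-(\lambda^{2}-\lambda+x+1)e^{-(x+1)/\lambda}/[\lambda(\lambda-1)]$; a simple grid search is used so that no shape assumption on the objective is needed. The grid size and upper bound are choices made for this illustration.

```python
import math

def drl_cdf(x, lam):
    """Pr(X <= x) for the DRL model (assumed form)."""
    return 1.0 - (lam * lam - lam + x + 1) * math.exp(-(x + 1) / lam) / (lam * (lam - 1.0))

def lse_objective(lam, sample):
    """Sum of squared differences between fitted cdf and plotting positions i/(n+1)."""
    xs = sorted(sample)
    n = len(xs)
    return sum((drl_cdf(x, lam) - i / (n + 1)) ** 2 for i, x in enumerate(xs, start=1))

def drl_lse(sample, upper=None, grid=4000):
    """Least-squares estimate of lambda by grid search over [2, upper]."""
    if upper is None:
        upper = max(sample) + 3.0  # hypothetical upper bound for the search
    candidates = [2.0 + (upper - 2.0) * k / grid for k in range(grid + 1)]
    return min(candidates, key=lambda lam: lse_objective(lam, sample))
```

The same pattern applies to the other minimum-distance estimators: only `lse_objective` needs to be replaced by the corresponding criterion.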

Anderson-Darling estimation
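The ade of the parameter $\lambda$ can be derived by minimizing
$$\mathrm{ade}(\lambda)=-n-\frac{1}{n}\sum_{i=1}^{n}(2i-1)\left\{\log F\left(x_{(i)};\lambda\right)+\log\left[1-F\left(x_{(n+1-i)};\lambda\right)\right]\right\}$$
with respect to the parameter $\lambda$, where $x_{(1)}\leq\cdots\leq x_{(n)}$ is the ordered sample and $F(\cdot;\lambda)$ is the cdf of the DRL distribution.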

Cramer von-Misses estimation
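The cvme is a minimum-distance estimation method based on the difference between the empirical cdf and the fitted cdf. It is obtained by minimizing
$$\mathrm{cvme}(\lambda)=\frac{1}{12n}+\sum_{i=1}^{n}\left[F\left(x_{(i)};\lambda\right)-\frac{2i-1}{2n}\right]^{2}$$
with respect to the parameter $\lambda$, where $x_{(1)}\leq\cdots\leq x_{(n)}$ is the ordered sample and $F(\cdot;\lambda)$ is the cdf of the DRL distribution.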

Simulation
In this section, we discuss the results of a simulation study that compares the performance of the proposed estimators for the DRL model. The performance of the estimators is evaluated via absolute bias and mean squared error. We simulate 10,000 random samples from the DRL distribution for sample sizes including n = 10, 20, 50, 100, and 200; the results are reported in Tables 2 and 3. Based on these criteria, all estimation approaches work well in estimating the parameter λ of the DRL distribution.
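A small-scale version of such a study can be sketched as follows. Several assumptions are made here and are not taken from the paper: the RL pdf is treated as a mixture of an exponential (weight $(\lambda-2)/(\lambda-1)$) and a gamma with shape 2 (weight $1/(\lambda-1)$), both with scale $\lambda$, which follows from splitting the assumed pdf into its two terms; only the moment estimator is evaluated; and the number of replications and the seed are arbitrary.

```python
import math
import random

def drl_mean(lam):
    """Closed-form mean of the DRL model (assumed form)."""
    q = math.exp(-1.0 / lam)
    return q / (1 - q) + q / (lam * (lam - 1.0) * (1 - q) ** 2)

def drl_mom(sample, tol=1e-10):
    """Method-of-moments estimate by bisection, with boundary value 2 as fallback."""
    xbar = sum(sample) / len(sample)
    lo, hi = 2.0, xbar + 2.0
    if xbar <= drl_mean(lo):
        return lo
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if drl_mean(mid) < xbar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def rdrl(n, lam, rng):
    """Draw n DRL variates as floor(Y), with Y sampled from the RL mixture form."""
    out = []
    for _ in range(n):
        y = rng.expovariate(1.0 / lam)          # exponential with mean lam
        if rng.random() < 1.0 / (lam - 1.0):
            y += rng.expovariate(1.0 / lam)     # second term gives gamma(2, scale lam)
        out.append(int(y))                      # floor, since y >= 0
    return out

def simulate(lam, n, reps=1000, seed=0):
    """Absolute bias and mean squared error of the moment estimator."""
    rng = random.Random(seed)
    est = [drl_mom(rdrl(n, lam, rng)) for _ in range(reps)]
    bias = abs(sum(est) / reps - lam)
    mse = sum((e - lam) ** 2 for e in est) / reps
    return bias, mse
```

Running `simulate` over increasing values of `n` shows how the bias and mean squared error shrink with the sample size; the other estimators can be compared by swapping `drl_mom` for the relevant fitting routine.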

Data analysis
In this section, the usefulness of the proposed distribution is illustrated using data sets from different areas. We compare the fit of the DRL distribution with that of several competitive distributions: the Poisson (Poi), discrete Pareto (DPr), discrete Rayleigh (DR), discrete inverse Rayleigh (DIR), discrete Burr-Hatke (DBH), discrete Bilal (DBi), discrete Lindley (DL), new discrete Lindley (NDL), and discrete Burr-XII (DBXII) distributions. The fitted probability distributions are compared using the negative log-likelihood ($-\ell$), the Akaike information criterion (aic), and the Kolmogorov-Smirnov (ks) test with its p-value.
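The comparison criteria can be computed along the following lines. This is a sketch under assumptions: the DRL pmf and cdf are taken in the forms obtained by discretizing the RL survival function, the aic is computed as $2k-2\ell$ with $k=1$ parameter, and the ks statistic is the largest absolute difference between the empirical and fitted cdfs over the integers up to the sample maximum. The p-value is omitted because its computation for discrete data is not specified here, and the sample in the usage note is hypothetical.

```python
import math

def drl_pmf(x, lam):
    """DRL pmf (assumed form)."""
    q = math.exp(-1.0 / lam)
    a = lam * lam - lam
    return q ** x * (a + x - (a + x + 1) * q) / (lam * (lam - 1.0))

def drl_cdf(x, lam):
    """DRL cdf (assumed form)."""
    return 1.0 - (lam * lam - lam + x + 1) * math.exp(-(x + 1) / lam) / (lam * (lam - 1.0))

def fit_measures(sample, lam, k=1):
    """Return (negative log-likelihood, aic, ks distance) for a fitted lambda."""
    n = len(sample)
    nll = -sum(math.log(drl_pmf(x, lam)) for x in sample)
    aic = 2 * k + 2 * nll
    ks = 0.0
    for x in range(max(sample) + 1):
        ecdf = sum(1 for v in sample if v <= x) / n
        ks = max(ks, abs(ecdf - drl_cdf(x, lam)))
    return nll, aic, ks
```

For example, `fit_measures([0, 1, 1, 2, 3, 3, 4, 5, 7, 8, 10, 12, 15, 21], 4.0)` evaluates the three measures for a made-up sample at the trial value `lam = 4.0`; smaller values of each measure indicate a better fit.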

Data set I: Covid-19 data in Pakistan
The first data set represents the number of deaths due to coronavirus in Pakistan during the period March 18, 2020, to April 30, 2020, obtained from the public reports of the National Institute of Health (NIH), Islamabad, Pakistan (https://covid.gov.pk/stats/pakistan). The mean, variance, and di of data set I are 9.4773, 102.39, and 10.804, respectively. The mle(s), along with standard error(s) "se(s)" and goodness-of-fit measures for this data set, are presented in Table 4. The results in Table 4 show that the DRL distribution provides a better fit than the other competing discrete models, since it has the minimum aic and ks values with the highest p-value. Figure 4 shows the probability-probability (pp) plots for all tested models, which support the empirical results listed in Table 4.

Data set II: Hydrology data
The second data set was reported in [25] and represents the exceedances of flood peaks (in m³/s) of the Wheaton River near Carcross in Yukon Territory, Canada, after discretization. The mean, variance, and di of this data set are 11.806, 152.38, and 12.908, respectively. The mle(s), se(s), and goodness-of-fit measures for data set II are reported in Table 5. It is observed that the DRL model provides the best fit among all competitive distributions. Figure 5 shows the pp plots for all tested distributions, which support the empirical results reported in Table 5.

Data set III: Forest fire in Greece
The third data set was listed in [26] and represents the number of fires in forest districts of Greece for the period from July 1, 1998, to August 31, 1998. The mean, variance, and di are 5.2, 32.382, and 6.2272, respectively. The mle(s), se(s), and goodness-of-fit measures for data set III are listed in Table 6. It is found that the new discrete model provides the best fit among all tested distributions. Figure 6 shows the pp plots for all competitive distributions, which support the empirical results listed in Table 6.

Data set IV: AG-positive leukemia
The fourth data set represents the time to death (in weeks) of AG-positive leukemia patients [27]. The mean, variance, and di are 62.471, 2954.3, and 47.29, respectively. The estimates and goodness-of-fit measures for all competitive distributions are listed in Table 7. It is noted that the DRL model provides the best fit for this data set. Figure 7 shows the pp plots for all tested distributions, which support the empirical results in Table 7.

Conclusions
In this article, a new one-parameter discrete model, called the discrete Ramos-Louzada (DRL) distribution, has been proposed. The new model can be used for modeling asymmetric and over-dispersed data. Some of its statistical properties have been derived. All of its properties can be expressed in closed form, so the new model can be used in different types of analysis, especially time series and regression. Various estimation techniques, including maximum likelihood, moments, least squares, Anderson-Darling, Cramér-von Mises, and maximum product of spacing estimators, have been investigated to find the best estimator for real data. The performance of these estimation techniques has been assessed via a comprehensive simulation study. The flexibility of the proposed discrete model has been tested using four distinct real data sets from various fields. Finally, we hope that the DRL distribution will attract a wider set of applications in different fields.