A novel approach for zero-inﬂated count regression model: Zero-inﬂated Poisson generalized-Lindley linear model with applications

: Count regression models are important statistical tools to model the discrete dependent variable with known covariates. When the dependent variable exhibits over-dispersion and inﬂation at zero point, the zero-inﬂated negative-binomial regression model is used. The presented paper o ﬀ ers a new model as an alternative to the zero-inﬂated negative-binomial regression model. To do this, Poisson generalized-Lindley distribution is re-parametrized and its parameter estimation problem is discussed via maximum likelihood estimation method. The proposed model is called as zero-inﬂated Poisson generalized Lindley regression model. The results regarding the e ﬃ ciency of parameter estimation of the proposed model are evaluated with two simulation studies. To evaluate the success of the proposed model in the case of zero inﬂation, two datasets are analyzed. According to the results obtained, the proposed model gives better results than the negative-binomial regression model both in case of over-dispersion and in the case of zero inﬂation.


Introduction
The Poisson distribution is a well-known distribution to model the count data sets. As widely known, the mean and variance of the Poisson distribution are equal to each other. This property of the Poisson distribution causes some problems in modeling the real-life data sets. In real-life data modeling, the data sets are generally over-dispersed which means that the empirical variance is greater than the empirical mean. In this case, the use of the Poisson distribution for these type of data sets yields the misspecification of the underlying probability distribution. Negative-Binomial (NB) distribution is the first choice for modeling the over-dispersed count data sets. However, we need more flexible discrete distributions to model highly over-dispersed count data sets. In the last decade, several authors have proposed alternative discrete distributions to handle this problem, such as Shoukri et al. [28], Shmueli et al. [26], Rodríguez-Avi et al. [25], Mahmoudi and Zakerzadeh [24], Lord and Geedipally [23], Cheng et al. [11], Gómez-Déniz [17], Sáez-Castillo and Conde-Sánchez [27], Zamani et al. [30], Gencturk and Yigiter [16], Bhati et al. [9], Imoto et al. [20], Wongrin and Bodhisuwan [31], Altun [2][3][4][5][6], El-Morshedy et al. [14,15], Altun et al. [1], Eliwa et al. [13].
The other phenomena in count data modeling is inflation. Inflation is seen generally at zero point and called as zero-inflation. Zero-inflation means that the underlying data set contains too many zero observations that cannot be represented by the corresponding distribution, such as Poisson and NB. This situation is commonly seen in insurance and health sciences, such as loss frequency, number of physicians visits, daily coronavirus cases etc. In this case, zero-inflated version of the Poisson and NB regression models are used. These are called as zero-inflated Poisson (ZIP) and zero-inflated negative-binomial (ZINB). Thanks to their software support, these models have been applied to reallife problems by many researchers. For instance, a comparison of over-dispersed count data sets were studied by Avci et al. [8]. Besides, Ismail and Zamani [19] conducted a study for applications of the ZIP and ZINB models on the Malaysian own damage claim data. One can also visit the work of Lord et al. [22] to see the application of these models on the crash data. Also, Ayati and Abbasi [7] investigated the suitability of these models for the accidents on urban highways.
Here, the main purpose is to develop a new sophisticated model for the zero-inflated and/or overdispersed data sets. To do this, we use the Poisson generalized-Lindley (PGL) distribution, introduced by Wongrin and Bodhisuwan [34]. The suitable re-parametrized version of the PGL distribution is introduced and its statistical properties are studied. The maximum likelihood estimation (MLE) method is preferred to estimate the unknown model parameters. The suitability of the MLE method for estimating the parameters of the proposed model is discussed with simulation study. Two real data sets are analyzed to prove the importance of the proposed distribution against the existing models such as Poisson, NB regression models and their zero-inflated models.
The other parts of the study are organized as follows: The re-parametrized PGL distribution is studied in Section 2. In Section 3, the parameter estimation problem is addressed with MLE and the simulation study is given. Section 4 is devoted to introduce a new regression model for both zero-inflated and over-dispersed cases. Section 5 contains the empirical results of the study. Some conclusion remarks are given in Section 6.
where y = 0, 1, 2, ..., α > 0, θ > 0 and µ > 0. The mean and variance of Y are Hereinafter, the random variable Y refers to the re-parametrized PGL distribution given in (2.1) and shortly denoted as PGL (α, θ, µ). Following the results of Wongrin and Bodhisuwan [34], the statistical properties of the re-parametrized PGL distribution could be obtained easily. The pmf shapes of the re-parametrized PGL distribution are displayed in Figure 1. From these figures, we conclude that this distribution could be used to model zero-inflated, bimodal and right skewed count data sets.
The dispersion index (DI) is defined as DI = Var(X)/E(X). The DI shows the flexibility of the distribution in modeling over(under)-dispersed data sets. When the DI is greater than one, it means that the data set exhibits over-dispersion. The opposite case (DI < 1) indicates the under-dispersion. The variance and DI plots of the PGL distribution are displayed in Figure 2 (for fixed α = 0.5). We conclude the following results from Figure 2: When the parameter µ increases, dispersion index and variance increase; when the parameter θ increases, dispersion index and variance decreases. The DI of the PGL is always greater than one. So, the PGL distribution is an appropriate choice to model overdispersed data sets. Figure 3 shows the results of dispersion index and variance of the PGL distribution for α = 1.5. The similar interpretation can be done as in the case of α = 0.5.    Figure 3. The dispersion index and variance of PGL distribution for for α = 1.5.

Estimation
In this section, the parameters of PGL distribution are obtained by MLE method. The appropriateness of the MLE method is evaluated by simulation study.

Maximum likelihood estimation
Assume that we have a random sample, y 1 , y 2 , . . . , y n , from the PGL distribution. Then, the loglikelihood function of the PGL distribution is (3.1) where τ τ τ = (α, θ, µ) is the unknown parameter vector. The score vector components can be obtained by taking partial derivatives of (3.1) with respect to α, θ, µ. The likelihood equation does not have the explicit solution. In this case, we should prefer the direct maximization of the log-likelihood function given in (3.1). For this purpose, the optimization toolboxes of the R, S-Plus or Matlab can be used. The nlm function of the R software is used in this study. To construct the asymptotic confidence intervals, we need the observed information matrix whose elements are given by Since the second derivatives of the log-likelihood function are complicated, these equations are omitted, however, these are upon request from the authors. The inverse of the observed information matrix evaluated atτ τ τ gives the asymptotic variance-covariance matrix. The asymptotic standard errors are obtained by the inverse of (3.2). Then, the asymptotic confidence intervals of the parameters are given by α ± z p/2 Var(α), θ ± z p/2 Var(θ), µ ± z p/2 Var(μ), where z p/2 represents the left quantile value of the standard normal distribution at p/2.

Simulation
Now, we conduct a simulation study to see the finite-sample performance of the MLEs of the parameters of the PGL distribution. The below simulation steps are implemented.
(1) Determine the sample size n, simulation replication N and the parameter values of the PGL distribution, α, θ and µ.
(2) Using the parameter settings in Step 1, generate the random variables from the PGL distribution using the inverse transform method.
(3) Using the generated sample in Step 2, obtain the MLEs of the parameters α, θ and µ.
(5) Using the MLEs and the true parameter values, compute the estimated values of biases, mean square errors (MSEs) and mean relative estimates (MREs). The required formulas for these measures can be found in Altun [5].
The simulation results are displayed in Figure 4. We determine the simulation replication N = 10, 000 and the sample size n = 50, 55, 60, . . . , 500. The true parameter values are τ = (2, 2, 5). We expect to see that when the sample size becomes larger, the biases and MSEs should be near zero and MRE should be near one. From the results given in Figure 4, we conclude that the estimated biases and MSEs are near the zero. Also, as expected, the MREs are near the one. These results show that the MLE is an appropriate method to estimate the unknown parameters of the PGL distribution.  Additionally, two different parameter settings are evaluated to check whether the similar results are obtained for the different parameter vectors. The results are reported in Table 1. As in previous simulation study, the biases are near zero and MSE and MRE approach their desired values. Consequently, MLE is effective parameter estimation method for the PGL distribution. Table 1. The simulation results of the PGL distribution for two different parameter settings.

Poisson generalized-Lindley regression model
As mentioned before, the Poisson regression model does not work well in case of over-dispersion. Dealing with the over-dispersed data set, the first choice is NB regression model. Now, we introduce an alternative regression model to the NB regression model for modeling the highly over-dispersed data sets. Assume that Y is a random variable distributed as a PGL distribution, given in (2.1). Since the mean of Y is E (Y|α, θ, µ) = µ, the log-link function can be used to link the covariates to the mean of the PGL distribution, as follows ..x ik ) represents the covariates and β β β = (β 0 , β 1 , β 2 , ...β k ) T represents the regression parameters. Note that the log-link function is used to link the covariates to the mean of the response variable. Since the mean of the response variable is defined on the positive domain, the link function should convert the observations defined on R to R + . However, the log-link function is not the only option to do this transformation. The softplus function, proposed by Weiss et al. [35], can be used as an alternative to the log-link function. Replacing µ in (2.1) with (4.1), the log-likelihood function of the PGL regression model is The parameters α and θ are the distributional parameters and β β β = (β 0 , β 1 , β 2 , ...β k ) is the vector of unknown regression parameters. These parameters are estimated by direct maximization of (4.2). The nlm function of R software is used to minimize the minus of (4.2), which is equivalent to the maximization of (4.2). The standard errors of the estimated parameters are obtained by means of hessian matrix evaluated at the MLEs of the parameters. The elements of the hessian matrix are computed with fdHess function of R software. The elements of the hessian matrix consist of the second-order partial derivatives of the log-likelihood function. Since these derivatives are complicated, they are omitted and not presented in the study.

Simulation of PGL regression model
Now, we evaluate the suitability of the MLE method for estimating the parameters of the PGL regression model. The simulation replication number N is determined as 10, 000 and four sample sizes are used: n = 50, 250, 500, 1000. Using the log-link function, we generate random variables using the ln (µ i ) = β 0 + β 1 x i1 + β 2 x i2 with parameters β 0 = 0.5, β 1 = 0.5, β 2 = 0.5 and θ = 1, α = 1. The covariates are generated from the standard uniform distribution. The response variable, y i , is generated based on the values of µ i , α and θ. The simulation results are listed in Table 2  Additionally, we compare the standard deviations of the estimators based on the simulated samples and fdHess function. The sample function of R software is use to generate bootstrap samples. The bootstrap samples are used to calculate the bootstrap standard errors of the model parameters. The bootstrap replication number is determined as 1, 000. The model parameters are β 0 = 0.5, β 1 = 0.5, β 2 = 0.5 and θ = 2, α = 2. The simulation results are reported in Table 3. The results show that the obtained standard errors using two different approaches are close to each other. Thus, it is verified that the fdHess function works well to obtain the asymptotic standard errors of the model parameters.

Zero-inflated PGL regression model
The ZIP and ZINB regression models are the most widely used models in case of the zero-inflation. ZINB regression model could be more appropriate choice in most cases since Poisson distribution does not model the over-dispersion. The zero-inflated Poisson distribution is given by where 0 ≤ w ≤ 1. It is easy to see that when the w = 0, the zero-inflated Poisson distribution reduces to Poisson distribution. As in PGL regression model, the mean of Poisson distribution, λ i , is linked to covariates by means of log-link function. The probability of zero counts, w i is linked to covariates by means of logit-link function which is given by where z z z T i = (1, z i1 , z i2 , ...z ik ) is the vector of covariates and γ γ γ = (γ 0 , γ 1 , γ 2 , ...γ k ) T is the unknown vector of regression coefficients for zero process. The log-likelihood function of ZIP regression model is given by (4.5) The log-likelihood function given in (4.5) can be maximized by means of nlm function of R. As mentioned before, when the corresponding data displays over-dispersion, the negative-binomial regression model should be used. The pmf of zero-inflated negative-binomial distribution is given by where τ is the shape parameter. When w = 0, the zero-inflated negative-binomial distribution reduces to negative-binomial distribution. The log-likelihood function of ZINB regression model is given by The log-likelihood given in (4.7) can be maximized with nlm function of R software. Here, an alternative zero-inflated regression model is introduced based on the zero-inflated PGL distribution. The pmf of zero-inflated PGL distribution is given by where 0 ≤ w ≤ 1 and α > 0, θ > 0 and µ > 0. Inserting (4.1) and (4.4) in (4.8), the log-likelihood function of zero-inflated PGL (ZIPGL) regression model is given by The unknown parameters, α, θ, β β β = (β 0 , β 1 , β 2 , ...β k ) and γ γ γ = (γ 0 , γ 1 , γ 2 , ...γ k ) T are obtained by maximizing the (4.9) with nlm function of R software.

Empirical study
In this section, two real data sets are analyzed to show the flexibility of the PGL regression model against the Poisson and NB regression models and also their zero-inflated counterparts. Also, we compare the PGL model with NPGL model, proposed by Altun [2]. In statistics literature, there are many discrete distributions to models the over or under-dispersed count data sets. Some of these distributions can be cited as follows: Mean Conway-Maxwell-Poisson distribution by Huang [18], zero-modified geometric by Kang et al. [21] and generalized COM-Poisson by Qian and Zhu [33]. The best model for the fitted data is chosen according to the results of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). The lowest values of the AIC and BIC indicate the best model for the data used.

Absences of high school students
The first data set comes from the daily number of absences of 316 high school students. We model the daily number of absences with some covariates such as gender and type of instructional program.
The same data set was analyzed by Altun [5]. The female individuals are coded 1 and male individuals are coded 0. The gender is represented by the covariate (x 1 ). The instructional program variable has three categories. These are general, academic and vocational. Therefore, one of them is determined as the baseline category and two dummy variables are created. The baseline category is selected as the vocational program. The general program (x 2 ) and academic program (x 3 ) are used as two dummy variables. The below regression structure is fitted to the data set.
The probability distribution of the response variable, number of absence, is displayed in Figure 5(a). The mean and variance of the response variable are 5.955 and 49.518, respectively. Since the DI is greater than one, it is concluded that the response variable exhibits over-dispersion. Cameron and Trivedi [10] proposed a test to evaluate the over-dispersion. The dispersiontest function of R software is used to perform over-dispersion test of Cameron and Trivedi [10]. The obtained test statistic value is z = 6.679 and corresponding p value is < 0.001. This result verifies the over-dispersion problem in response variable. The estimated parameters with corresponding standard errors (SEs) and goodness of fit statistics are listed in Table 4. From Table 4, since the PGL regression model has the lowest values of AIC and BIC, we conclude that the PGL performs better than Poisson, NB and NPGL regression models.  As seen from estimated regression coefficients of PGL regression model, we conclude that the gender, academic and instructional programs variables have statistically significant effects (at 1% level) on the days of absence for high school students. The number of absences for female students are exp (−0.233) = 0.792, that is 20.8% lower than male students. It means that male students have higher absences than female students. However, the number of absences for general and academic instructional program students are exp (1.206) = 3.340 that is 234% and exp (0.708) = 2.029 ,that is 102.9% higher than the vocational instructional program students.
The profile log-likelihood plots of the PGL regression model are displayed in Figure 6. These figures are helpful to evaluate the suitability of the estimated model parameters. Thanks to the profile log-likelihood plots, it is obvious that the estimated parameters of the PGL model are the maximizers of the log-likelihood function. The residual analysis is also applied to check the model suitability for the fitted data set. To do this, the randomized quantile residuals (rqs) are used. The rqs is where u i = F (y i ;μ i ) is uniformly distributed random variable between a i = lim y↑y i F (y;μ i ) and b i = F (y;μ i ) (Altun, [2]). Note that F (y; µ) is the cdf of PGL distribution. If the model is statistically valid for the data set, the rqs follows the standard normal distribution. The index and quantile-quantile plots of the PGL regression model are displayed in Figure 7. These figures show that the PGL regression model provides perfect fit to the data.

NMES
The second data comes from the US National Medical Expenditure Survey (NMES) in the years of 1987-1988. The data set has 4406 observations. It can also be found in the countreg package of R software ( see Zeileis et al. [32]). Here, our goal is to model the number of physician visits y, with following variables: number of hospital stay, (x 1 ), number of chronic conditions, (x 2 ), gender (female=0, male=1) (x 3 ), number of years of education, (x 4 ) and indicator of private insurance (yes=1, no=0), (x 5 ). The following model is fitted to NMES data set using the zero-inflated Poisson, negative binomial and PGL regression models.
The histogram of the number of physician visits are displayed in Figure 5(b). The mean and variance of the number of physician visits are 5.774 and 45.687, respectively. Since the DI of the response variable is greater than one, it exhibits over-dispersion. As in Section 5.1, the over-dispersion test of Cameron and Trivedi [10] is performed. The obtained test statistic value is z = 6.679 and corresponding p value is < 0.001. Therefore, the response variable has a over-dispersion. As seen from Figure 5(b), the response variable is highly peaked at zero. To assess the zero-inflation, the test proposed by Van den Broek [29] is used. The zero.test is used for this purpose. The test statistic follows a chi-square distribution with one degree of freedom. The obtained test statistic value is χ 2 = 33438.09 and its p-value is < 0.001. This result verifies that the frequency of zero process in response variable cannot be modeled by Poisson regression model. Therefore, zero-inflated regression models are needed to model such data sets. The AIC and BIC of the fitted count regression models are listed in Table 5.
Since the data exhibits both over-dispersion and zero-inflation, Poisson and NB regression models do not perform well. According to the AIC and BIC values, PGL and ZIPGL models perform better than other models for NMES data set. The estimated parameters of the fitted models as well as their standard errors are summarized in Table 6. Zero-inflated regression models have two parts to interpret. These parts are related the noninflated and zero-inflated processes. As mentioned before, non-inflated process is modeled with loglink function and zero-inflated process is modeled by logit-link function. Therefore, the regression coefficients of zero-inflated process can be interpreted as odd ratio. As seen from estimated regression coefficients of ZIPGL regression model, for non-inflated process, all variables have statistically significant (at 1% level) effect on number of physician visits. According to zero-inflated process, number of chronic conditions and privative insurance variables have statistically significant effects on number of physician visits. Having private insurance decreases the odds of not having the opportunity of physician visits by exp (−1.174) = 0.309, which is 69.1%.  As in the previous section, the profile log-likelihood plots are displayed to check the correctness of the estimated model parameters. According to the plots in Figure 8, the estimated model parameters of the ZIPGL model are the maximizers of the function, given in (4.9).

Conclusions
This study introduces a new count regression model for zero-inflated and over-dispersed count data sets based on the re-parametrization of the PGL distribution. The PGL regression model and its zeroinflated counterpart are studied. Two real data sets are analyzed to convince the readers in favor of the PGL regression model against the Poisson and NB regression models. Empirical findings show that PGL and ZIPGL regression models provide better fits than Poisson, ZIP, NB and ZINB regression models. As a future work of this study, one-inflated PGL regression model could be considered. The one-inflated regression models are useful for modeling the claim numbers in insurance. We hope that the PGL and ZIPGL regression models find a wider application area in different applied sciences.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.