Estimation Techniques for Regression Model with Zero-inflated Poisson Data

Researchers in many fields, including the biomedical sciences, often draw statistical inferences from count data that exhibit a substantially large proportion of zeros. Subjects in such research are broadly categorized into a low-risk group that produces only zero counts and a high-risk group whose counts can be modeled by a standard Poisson regression model. The aim of this study is to estimate the model parameters in the presence of covariates, some of which may not have significant effects on the magnitude of the counts when the proportion of zeros is large. The estimation procedures we propose are the pretest, shrinkage, and penalty methods, applicable when some of the covariates may be subject to certain restrictions. Properties of the pretest and shrinkage estimators are discussed in terms of their asymptotic distributional biases and risks. We show that if the dimension of the restricted parameters exceeds two, the risk of the shrinkage estimator is strictly less than that of the maximum likelihood estimator, and that the risk of the pretest estimator depends on the validity of the restrictions on the parameters. A Monte Carlo simulation study shows that the mean squared errors (MSE) of the shrinkage estimators are comparable to those of the penalty estimators, and that the shrinkage estimators perform better than the penalty estimators when the dimension of the restricted parameter space is large. For illustrative purposes, the methods are applied to a real life data set.


Introduction
Data with a large number of zeros are often encountered in many studies, including medical and public health research. Failure to account for the extra zeros may result in biased parameter estimation and misleading inference. In recent years, regression methods for modelling such data have become very popular. They are based on the assumption that the response is generated by a mixture of a degenerate distribution at zero and a standard Poisson distribution. For example, the National Medical Expenditure Survey data contain an excessive number of zeros in the hospital admissions, and the probability mass at zero exceeds that allowed under a standard parametric family of distributions. We will analyze this survey data in Section 5, where the number of hospital admissions ranges from 0 to 8 and 80.4% of the respondents have no hospital admission, suggesting possible zero-inflation. The zero-inflated Poisson regression (ZIPR) model can be used to analyze these data; it assumes that the population of respondents can be divided into two subpopulations. The first generates counts (numbers of hospital admissions) that follow a Poisson distribution with the usual parameter λ, and the second generates only zeros (no hospital admission) with a probability p that is modeled through a binomial distribution with a logit link.
The ZIPR models were first introduced by Mullahy (1986) in the econometric literature. Their use has become very broad since the publication of the pioneering work by Lambert (1992), in which manufacturing defects were considered. Ridout et al. (1998) cited examples of data with too many zeros from various disciplines, including medicine, agriculture, and the use of recreational facilities. Hall (2000) considered ZIPR and zero-inflated binomial models with a random intercept, and provided an EM algorithm for model estimation. Hinde & Demetrio (1998) reviewed the literature on the ZIPR model and cited examples from agriculture, econometrics, manufacturing, road safety, species abundance, and the medical sciences. Jansakul & Hinde (2002) extended the van den Broek (1995) score test to the more general situation where the zero probability is allowed to depend on covariates. Feng & Zhu (2011) studied a semiparametric ZIP mixture model, for which a Monte Carlo EM algorithm was provided for model estimation. For other aspects of ZIPR models, see, among others, Dietz & Bohning (2000) and Ridout et al. (2001).
The motivation for the present paper comes from the need to identify the significant covariates of the ZIPR model in the presence of a high proportion of zeros in the response variable. With this motivation, we consider an unrestricted model that includes all covariates and possible extraneous variables, and a restricted model that includes some covariates of concern while leaving out the extraneous variables. One way to deal with this framework is to use pretest procedures that test whether the coefficients of the extraneous variables are zero, and then estimate the parameters of the model implied by the outcome of the test. Another approach is to use Stein-type shrinkage estimators, where the estimated regression coefficient vector is shrunk in the direction of a linear restriction among the parameters. Sapra (2003) considered the pretest method for estimating the parameters of the Poisson regression model. For the properties of pretest and shrinkage estimation strategies in parametric and semiparametric linear models, see Nkurunziza (2013), Hossain et al. (2012), and Liang & Song (2009), among others.
This paper also studies penalty estimators, such as the least absolute shrinkage and selection operator (LASSO) and the adaptive LASSO, and compares their performance in terms of biases and mean squared errors with the shrinkage and pretest estimators through a simulation study. The LASSO was developed by Tibshirani (1996) for simultaneous variable selection and parameter estimation. Unlike the LASSO, the adaptive LASSO proposed in Zou (2006) permits different weights for different parameters. The adaptive LASSO has been shown to have the oracle property, that is, consistency in variable selection and asymptotic normality. Hossain & Ahmed (2012) considered shrinkage and penalty methods for estimating the Poisson regression model when some of the covariates may be inactive for the response. Zeng et al. (2014) proposed a variable selection approach for ZIPR models via the adaptive LASSO.
The rest of this paper is organized as follows. The model and the suggested estimators are introduced in Section 2. The asymptotic properties of the proposed estimators and their asymptotic distributional biases and risks are presented in Section 3. The results of a simulation study are reported in Section 4. An application to real life data and a comparison of our methods are described in Section 5. Finally, concluding remarks are given in Section 6.

Models and Estimation Method
Suppose that the counts $Y_i$, $i = 1, 2, \ldots, n$, are generated independently according to a zero-inflated Poisson distribution. The zeros are assumed to arise from two distinct underlying processes. The first process occurs with probability $p_i$ and produces only zeros, while the second process occurs with probability $1 - p_i$ and leads to a standard Poisson model with parameter $\lambda_i$, and hence a chance of further zeros. In general, the inevitable zeros from the first process are called structural zeros and those from the Poisson process are called sampling zeros; see, for example, Ridout et al. (1998) and Jansakul & Hinde (2002). These two processes give rise to a simple two-component mixture distribution with probability mass function

$$P(Y_i = 0) = p_i + (1 - p_i) e^{-\lambda_i} \quad \text{and} \quad P(Y_i = y_i) = (1 - p_i) \frac{e^{-\lambda_i} \lambda_i^{y_i}}{y_i!}, \quad y_i = 1, 2, \ldots \quad (1)$$

The ZIPR model reduces to a Poisson regression model when $p_i = 0$, and exhibits overdispersion when $p_i > 0$. Lambert (1992) incorporated covariates using a log link for $\lambda_i$ and a logit link for $p_i$:

$$\log(\lambda_i) = x_i^\top \beta \quad \text{and} \quad \log\!\left(\frac{p_i}{1 - p_i}\right) = z_i^\top \gamma, \quad (2)$$

where $x_i$ and $z_i$ are the covariate vectors, and $\beta$ and $\gamma$ are the $p \times 1$ and $q \times 1$ vectors of unknown regression parameters. Let us write $\theta = (\beta^\top, \gamma^\top)^\top$. The covariates that affect the Poisson parameter $\lambda_i$ may or may not be the same as those that affect the mixing probability $p_i$. When the covariates are the same and $\lambda_i$ and $p_i$ are not functionally related, then $x_i = z_i$, and in that case the ZIPR involves twice as many parameters as the Poisson regression. In model (2), we are interested in testing the hypothesis

$$H_0: H\theta = h,$$

where $H$ is an $r \times (p + q)$ matrix of full rank and $h$ is an $r \times 1$ vector of constant terms.
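The mixture structure above is easy to check numerically. The following sketch (in Python with NumPy/SciPy; the function names are our own) evaluates the ZIP probability mass function and draws samples from the two-component mixture:

```python
import numpy as np
from scipy.stats import poisson

def zip_pmf(y, lam, p):
    """P(Y = y) under the zero-inflated Poisson mixture:
    structural-zero probability p, Poisson mean lam."""
    base = (1 - p) * poisson.pmf(y, lam)
    return np.where(np.asarray(y) == 0, p + base, base)

def sample_zip(n, lam, p, rng):
    """Draw n counts: with probability p a structural zero, else Poisson(lam)."""
    structural = rng.random(n) < p
    return np.where(structural, 0, rng.poisson(lam, size=n))

rng = np.random.default_rng(0)
y = sample_zip(100_000, lam=2.0, p=0.4, rng=rng)
```

With these parameters the expected zero fraction is $p + (1 - p)e^{-\lambda}$, which combines structural and sampling zeros exactly as described above.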
For a random sample $y = (y_1, y_2, \ldots, y_n)$, the log-likelihood function is given by

$$\ell(\beta, \gamma) = \sum_{i=1}^{n} I_{(y_i = 0)} \log\!\left[ e^{z_i^\top \gamma} + \exp\!\left(-e^{x_i^\top \beta}\right) \right] + \sum_{i=1}^{n} \left(1 - I_{(y_i = 0)}\right) \left[ y_i x_i^\top \beta - e^{x_i^\top \beta} - \log(y_i!) \right] - \sum_{i=1}^{n} \log\!\left(1 + e^{z_i^\top \gamma}\right), \quad (3)$$

where $I_{(\cdot)}$ is an indicator function, which is equal to 1 if the event is true and 0 otherwise.
The log-likelihood function (3) of the ZIPR model is quite complicated, especially as the first term involves both $\beta$ and $\gamma$; moreover, the responses come from a mixture distribution that involves both sets of parameters, $p_i$ and $\lambda_i$.
The computation therefore becomes quite challenging, in terms of both the variance-covariance matrix and accuracy, when the Newton-Raphson algorithm is applied directly, owing to the number of parameters to be estimated and the complicated nature of the likelihood function (3). To avoid this complication, we use the EM algorithm to maximize the log-likelihood; see Hall (2000) and Lambert (1992).
The EM algorithm is based on a latent variable $U_i$, where $U_i = 1$ when $Y_i$ is from the perfect zero state (first process) and $U_i = 0$ when $Y_i$ is from the Poisson state (second process). To formulate the log-likelihood for the complete data, we use the conditional probability

$$u_i = P(U_i = 1 \mid y_i, \beta, \gamma) = \begin{cases} \dfrac{p_i}{p_i + (1 - p_i)\, e^{-\lambda_i}}, & y_i = 0, \\[2mm] 0, & y_i > 0. \end{cases}$$

Thus, the complete log-likelihood based on $(Y, U)$ is

$$L_c(\beta, \gamma; U, x_i, z_i) = \sum_{i=1}^{n} \left[ u_i z_i^\top \gamma - \log\!\left(1 + e^{z_i^\top \gamma}\right) \right] + \sum_{i=1}^{n} (1 - u_i) \left[ y_i x_i^\top \beta - e^{x_i^\top \beta} - \log(y_i!) \right], \quad (4)$$

which separates into a logistic part in $\gamma$ and a weighted Poisson part in $\beta$. To implement the EM algorithm, we first initialize $(\beta, \gamma)$ by letting $U_i = I_{(y_i = 0)}$. In the E-step, we use the current values of $(\beta, \gamma)$ to calculate the expectation of $U_i$ and use it as the estimate of $U_i$. In the M-step, we use the estimate of $U_i$ to maximize $L_c(\beta, \gamma; U, x_i, z_i)$, which gives the unrestricted maximum likelihood estimator (UMLE) of $\beta$ and $\gamma$. Iteration $(k + 1)$ of the EM algorithm thus consists of an E-step that computes $u_i^{(k+1)}$ from $(\beta^{(k)}, \gamma^{(k)})$, followed by an M-step that maximizes the two parts of $L_c$ separately: a weighted logistic regression of $u^{(k+1)}$ on $z_i$ for $\gamma^{(k+1)}$, and a weighted Poisson regression of $y_i$ on $x_i$ with weights $1 - u_i^{(k+1)}$ for $\beta^{(k+1)}$.
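For intuition, the E- and M-steps above can be sketched for the intercept-only case (no covariates), where the M-step updates have closed form; this is a simplified sketch, and the function name and starting values are our own:

```python
import numpy as np

def zip_em(y, tol=1e-8, max_iter=500):
    """EM for an intercept-only ZIP model: estimates the structural-zero
    probability p and the Poisson mean lam (a covariate-free sketch of
    the algorithm described in the text)."""
    y = np.asarray(y, dtype=float)
    p, lam = 0.5, max(y.mean(), 1e-8)          # crude starting values
    for _ in range(max_iter):
        # E-step: posterior probability that an observed zero is structural
        u = np.where(y == 0, p / (p + (1 - p) * np.exp(-lam)), 0.0)
        # M-step: closed-form weighted updates
        p_new = u.mean()
        lam_new = y.sum() / (len(y) - u.sum())
        done = abs(p_new - p) + abs(lam_new - lam) < tol
        p, lam = p_new, lam_new
        if done:
            break
    return p, lam
```

With covariates, the two M-step updates become a weighted logistic fit for $\gamma$ and a weighted Poisson fit for $\beta$, but the alternation is the same.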
Let $I_{(\beta,\gamma)}$ denote the information matrix of the estimator $\hat{\theta}$. If $\frac{1}{n} I_{(\beta,\gamma)}$ has a positive definite limit satisfying some regularity conditions, as in the work of McCullagh (1983), the quantity $\sqrt{n}(\hat{\theta} - \theta)$ is asymptotically normally distributed with mean vector $0$ and variance-covariance matrix $I_{(\beta,\gamma)}^{-1}$ (Lambert, 1992). The matrix $I_{(\beta,\gamma)}$ can be partitioned as

$$I_{(\beta,\gamma)} = \begin{pmatrix} I_{\beta,\beta} & I_{\beta,\gamma} \\ I_{\gamma,\beta} & I_{\gamma,\gamma} \end{pmatrix},$$

where $I_{\beta,\gamma} = I_{\gamma,\beta}^\top$; explicit expressions for the blocks are given in Lambert (1992). Suppose now that our interest centers on estimating the parameters $\beta$ and $\gamma$ from (4) under the linear restriction $H\theta = h$. The steps of the EM algorithm for estimating the parameters using the log-likelihood (4) under this restriction are the same, except that the M-step maximization is carried out subject to $H\theta = h$. The resulting estimator $\tilde{\theta} = (\tilde{\beta}^\top, \tilde{\gamma}^\top)^\top$ is known as the restricted maximum likelihood estimator (RMLE).
The likelihood ratio test statistic can be used to test $H_0: H\theta = h$ vs. $H_a: H\theta \neq h$. If $\tilde{\theta}$ maximizes the log-likelihood of the ZIPR model under $H_0$, of dimension $k - r$ with $k = p + q$, and $\hat{\theta}$ maximizes the log-likelihood under the nested alternative hypothesis $H_a$ of dimension $k$, then the test statistic $D_n$ is

$$D_n = 2\left[ \ell(\hat{\theta}) - \ell(\tilde{\theta}) \right]. \quad (7)$$

Under $H_0$ and suitable regularity conditions, $D_n$ is asymptotically distributed as chi-square with $r$ degrees of freedom (Lambert, 1992), so that a likelihood ratio test can be performed using approximate critical values from the chi-square distribution.

The Pretest and Shrinkage Estimators
The pretest and shrinkage estimators are based on the test statistic $D_n$ of (7) for testing $H_0: H\theta = h$. Specifically, the pretest estimator (PT) of $\theta$ is defined as

$$\hat{\theta}^{PT} = \hat{\theta} - (\hat{\theta} - \tilde{\theta})\, I\!\left(D_n < \chi^2_{r,\alpha}\right),$$

where $I(A)$ is the indicator function of a set $A$, and $\chi^2_{r,\alpha}$ is the $\alpha$-level critical value of the approximate distribution of $D_n$ under $H_0$. From this definition, one can see that if the data yield $D_n < \chi^2_{r,\alpha}$, then $\hat{\theta}^{PT} = \tilde{\theta}$; otherwise $\hat{\theta}^{PT} = \hat{\theta}$. So the PT is indeed a simple mixture of the UMLE and the RMLE. In an ordinary two-step procedure, one would first test the hypothesis $H_0: H\theta = h$ and then, based on the test result, decide which estimator to adopt. The PT simply combines these two steps into a single one; that is, testing and estimation are done simultaneously. For details, see Hossain et al. (2009), Ahmed et al. (2006), and others.
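A minimal numerical sketch of the pretest rule, assuming the UMLE, RMLE, and $D_n$ have already been computed (the helper name is our own):

```python
import numpy as np
from scipy.stats import chi2

def pretest_estimator(theta_unres, theta_res, D_n, r, alpha=0.05):
    """Pretest rule: keep the restricted MLE when the likelihood-ratio
    statistic D_n fails to reject H0 : H theta = h at level alpha,
    otherwise keep the unrestricted MLE."""
    critical = chi2.ppf(1 - alpha, df=r)       # chi^2_{r, alpha}
    return np.where(D_n < critical, theta_res, theta_unres)
```

The rule is all-or-nothing: the estimate jumps between the two extremes as $D_n$ crosses the critical value, which is exactly the discontinuity the shrinkage estimator below smooths out.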
Because of the extreme choice of either the UMLE or the RMLE, the pretest procedures are not admissible for some models, even though they may improve upon the UMLE, a well-documented fact in the literature (Judge & Bock, 1978). This motivates us to define a shrinkage estimator, which is a smoothed version of $\hat{\theta}^{PT}$:

$$\hat{\theta}^{SE} = \tilde{\theta} + \left(1 - \frac{r - 2}{D_n}\right)(\hat{\theta} - \tilde{\theta}), \quad r \geq 3. \quad (8)$$

This estimator is a weighted average of the UMLE $\hat{\theta}$ and the RMLE $\tilde{\theta}$, where the weights are a function of the test statistic for testing $H_0: H\theta = h$.
We note that when the test statistic $D_n$ is very small in comparison with $r - 2$, i.e., when the ratio $(r - 2)/D_n$ exceeds one, the shrinkage estimator $\hat{\theta}^{SE}$ over-shrinks $\hat{\theta}$ past $\tilde{\theta}$, reversing the sign of $\hat{\theta} - \tilde{\theta}$. Replacing the shrinkage factor $1 - (r - 2)/D_n$ in (8) by its positive part $\left(1 - (r - 2)/D_n\right)^{+}$, where $(x)^{+} = x\, I_{(x \geq 0)}$, yields the positive-part shrinkage estimator $\hat{\theta}^{PSE}$, which rectifies this problem. For details, see Ahmed & Fallahpour (2012).
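The shrinkage and positive-part estimators can be sketched directly from these definitions (the helper name is our own):

```python
import numpy as np

def shrinkage_estimators(theta_unres, theta_res, D_n, r):
    """Stein-type shrinkage estimator (8) and its positive-part version.
    The PSE truncates the shrinkage factor at zero, which prevents the
    estimator from over-shooting past the restricted MLE when D_n < r - 2."""
    theta_unres = np.asarray(theta_unres, dtype=float)
    theta_res = np.asarray(theta_res, dtype=float)
    factor = 1.0 - (r - 2) / D_n
    se = theta_res + factor * (theta_unres - theta_res)
    pse = theta_res + max(factor, 0.0) * (theta_unres - theta_res)
    return se, pse
```

When $D_n < r - 2$ the factor is negative and the SE lands on the far side of the RMLE, while the PSE simply stops at the RMLE, illustrating the sign-reversal problem described above.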

Penalty Estimators: LASSO and Adaptive LASSO
Let $\theta$ be the parameter of interest and $\ell(\theta)$ the log-likelihood function of the ZIPR model. The LASSO estimator (Tibshirani, 1996) is given by

$$\hat{\theta}_{lasso}(\tau) = \underset{\theta}{\arg\min} \left\{ -\ell(\theta) + \tau \sum_{j=1}^{p+q} |\theta_j| \right\},$$

where $\tau \geq 0$ is a penalization parameter controlling the amount of shrinkage of the regression coefficients. However, the LASSO applies the same penalty to all the coefficients, which over-penalizes the important ones and accordingly results in biased estimators (Zou, 2006).
The adaptive LASSO (ALASSO) (Zou, 2006) offers an effective way to reduce this bias. It has been shown that the ALASSO enjoys the oracle property; that is, the ALASSO is consistent in variable selection, and its estimators are asymptotically normal and unbiased. More explicitly, it assigns a higher penalty weight to the small coefficients and a lower weight to the large coefficients. The adaptive LASSO estimator is defined as

$$\hat{\theta}_{alasso}(\tau) = \underset{\theta}{\arg\min} \left\{ -\ell(\theta) + \tau \sum_{j=1}^{p+q} w_j |\theta_j| \right\}, \quad (9)$$

where $w = (w_1, w_2, \ldots, w_{p+q})$ are adaptive weights, usually set to $\hat{w}_j = 1/|\hat{\theta}_j|$, with $\hat{\theta}$ the maximum likelihood estimator. The ALASSO estimator is thus obtained by minimizing a penalized log-likelihood function.
First, we expand the log-likelihood in a Taylor series about $\hat{\theta}$:

$$\ell(\theta) \approx \ell(\hat{\theta}) + \ell'(\hat{\theta})^\top (\theta - \hat{\theta}) + \tfrac{1}{2} (\theta - \hat{\theta})^\top \ell''(\hat{\theta}) (\theta - \hat{\theta}),$$

where $\ell'$ and $\ell''$ are the first and second derivatives, respectively. Since $\ell(\hat{\theta})$ is constant and $\ell'(\hat{\theta}) = 0$, the penalized log-likelihood is equivalent to

$$\tfrac{1}{2} (\theta - \hat{\theta})^\top \hat{\Sigma}^{-1} (\theta - \hat{\theta}) + \tau \sum_{j=1}^{p+q} w_j |\theta_j|.$$

Let $\hat{\Sigma}$ be the estimated variance-covariance matrix of $\hat{\theta}$, which can be obtained from one of many statistical packages; it provides an estimate of the Hessian matrix through $\ell'' = -I_{\hat{\theta}} = -\hat{\Sigma}^{-1}$. Here we use the R package pscl (Jackman, 2012) to calculate $\hat{\Sigma}$. We then construct a pseudo-data set

$$X^* = \hat{\Sigma}^{-1/2}, \qquad Y^* = \hat{\Sigma}^{-1/2}\, \hat{\theta}. \quad (11)$$

The order of the square matrix $X^*$ is $p + q + 2$, and $Y^*$ is the vector corresponding to $X^*$. The ALASSO estimator (9) can now be re-written as

$$\hat{\theta}_{alasso}(\tau) \approx \underset{\theta}{\arg\min} \left\{ \tfrac{1}{2} \left\| Y^* - X^* \theta \right\|^2 + \tau \sum_{j=1}^{p+q} w_j |\theta_j| \right\}. \quad (12)$$

The estimator in equation (12) is similar to the least squares estimator with adaptive LASSO penalization (Zou, 2006). Various efficient algorithms, such as the least angle regression algorithm (Efron et al., 2004), the predictor-corrector algorithm (Park & Hastie, 2007), and the coordinate descent algorithm (Friedman et al., 2010), can be used to carry out the minimization in (12). Here we adopt the predictor-corrector algorithm, which computes solutions along the entire penalization path of the coefficient estimates as $\tau$ varies. Starting at $\tau = \tau_{max}$, beyond which all the coefficients of $\hat{\theta}(\tau)$ are zero, the algorithm computes a series of solutions, each time estimating the coefficients with a smaller $\tau$. The penalization parameter $\tau$ is selected by $k$-fold cross validation: for each fold, we obtain a series of models corresponding to the values of $\tau$, guided by Schwarz's Bayesian criterion (commonly referred to as the Bayesian information criterion), and compute the objective in (12) on the omitted fold. We then choose the value of $\tau$ for which the average cross-validation error in (12) is minimized.
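To illustrate the pseudo-data idea, the following sketch solves the weighted-L1 least-squares problem (12) by cyclic coordinate descent rather than the predictor-corrector algorithm used in the paper; all names are our own, and the adaptive weights assume no component of $\hat{\theta}$ is exactly zero:

```python
import numpy as np

def alasso_pseudo(theta_hat, Sigma_hat, tau, n_sweeps=200):
    """Adaptive-LASSO estimate from the quadratic approximation (12):
    build pseudo-data X* = Sigma^{-1/2}, Y* = X* theta_hat, then minimize
    0.5 ||Y* - X* theta||^2 + tau * sum_j w_j |theta_j| by coordinate descent."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    # symmetric inverse square root of the covariance estimate
    vals, vecs = np.linalg.eigh(Sigma_hat)
    X = vecs @ np.diag(vals ** -0.5) @ vecs.T
    y = X @ theta_hat
    w = 1.0 / np.abs(theta_hat)              # adaptive weights 1/|theta_hat_j|
    beta = theta_hat.copy()
    col_norm2 = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(beta.size):
            # partial residual excluding coordinate j
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r
            # soft-thresholding with coefficient-specific penalty tau * w_j
            beta[j] = np.sign(z) * max(abs(z) - tau * w[j], 0.0) / col_norm2[j]
    return beta
```

When $\hat{\Sigma}$ is the identity, the problem decouples and the solution is componentwise soft-thresholding of $\hat{\theta}$, so small coefficients (with large weights) are zeroed out while large ones are barely perturbed.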
In the case of pretest and shrinkage estimators, it is possible to determine theoretically when they have smaller asymptotic risks than the unrestricted MLE.Similar results are not available for LASSO and adaptive LASSO estimators.

Asymptotic Results
In this section we study the asymptotic behavior of the proposed estimators and develop the results in terms of $\theta^*$, which may be any one of the estimators considered in this paper: $\hat{\theta}$, $\tilde{\theta}$, $\hat{\theta}^{PT}$, $\hat{\theta}^{SE}$, and $\hat{\theta}^{PSE}$. To proceed with the asymptotic results, we first partition $\theta = (\theta_1^\top, \theta_2^\top)^\top$, where $\theta_2$ is the coefficient vector of the inactive covariates of the ZIPR model. The main focus here is to evaluate the performance of these estimators when $\theta_2$ is close to the null vector. To derive meaningful results we consider a sequence of local alternatives

$$K_n : H\theta = h + \frac{\omega}{\sqrt{n}}, \quad (13)$$

where $\omega = (\omega_1, \omega_2, \ldots, \omega_r)^\top \in \Re^r$ is a given vector of real numbers. In this framework $\theta = (\theta_1^\top, 0^\top)^\top$, and the quantity $\omega/\sqrt{n}$ is the magnitude of the distance between the true model and the restricted model. For any fixed $\omega$, this distance shrinks as the sample size increases.
To study the asymptotic distributional risks (ADR) of the estimators, we define a quadratic loss function using a positive definite matrix $W$, namely

$$\mathcal{L}(\theta^*, \theta) = n\, (\theta^* - \theta)^\top W (\theta^* - \theta),$$

where $\theta^*$ is any one of the estimators. The usual quadratic loss arises when $W$ is chosen as $I$, the identity matrix; a general $W$ yields a loss function that weights the components of $\theta$ differently.
If $V$ is the asymptotic variance-covariance matrix of $\theta^*$, the ADR of $\sqrt{n}(\theta^* - \theta)$ is given by $\mathrm{tr}(WV)$. We assume that the cumulative distribution function of $\theta^*$ under $K_n$ exists and can be written as

$$F(x) = \lim_{n \to \infty} P\!\left( \sqrt{n}(\theta^* - \theta) \leq x \mid K_n \right),$$

where $F(x)$ is a nondegenerate distribution function. The ADR of $\theta^*$ is then defined as

$$\mathrm{ADR}(\theta^*; W) = \mathrm{tr}(W \Gamma), \qquad \Gamma = \int \!\cdots\! \int x\, x^\top \, dF(x),$$

where $\Gamma$ is the dispersion matrix of the distribution function $F(x)$.
The shrinkage estimators are, in general, biased; the bias, however, is accompanied by a reduction in variance. The asymptotic distributional bias (ADB) of an estimator $\theta^*$ is defined as

$$\mathrm{ADB}(\theta^*) = \lim_{n \to \infty} E\!\left[ \sqrt{n}(\theta^* - \theta) \right].$$

Under the local alternatives (13), the following theorems enable the derivation and numerical computation of the ADB and ADR of the estimators.
Theorem 3.1 If $I_{(\beta,\gamma)}$ is nonsingular and $\Delta = \omega^\top \left(H I_{(\beta,\gamma)}^{-1} H^\top\right)^{-1} \omega$, then under the local alternatives $K_n$ in (13) and the usual regularity conditions, as $n \to \infty$:

1. $\sqrt{n}\,(H\hat{\theta} - h)$ converges in distribution to a normal vector with mean $\omega$ and variance-covariance matrix $H I_{(\beta,\gamma)}^{-1} H^\top$;

2. the test statistic $D_n$ in (7) converges to a non-central chi-squared distribution $\chi^2_r(\Delta)$ with $r$ degrees of freedom and non-centrality parameter $\Delta$.
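Because $D_n$ is asymptotically $\chi^2_r(\Delta)$, quantities such as the asymptotic power of the $\alpha$-level test are straightforward to compute with standard noncentral chi-square routines; the values of $r$ and $\Delta$ below are illustrative, not from the paper:

```python
from scipy.stats import chi2, ncx2

r, alpha = 4, 0.05                     # illustrative dimension and level
critical = chi2.ppf(1 - alpha, df=r)   # chi^2_{r, alpha} critical value
# Asymptotic rejection probability under the local alternative with
# non-centrality Delta = 8: P(chi^2_r(Delta) > chi^2_{r, alpha}).
power = 1 - ncx2.cdf(critical, df=r, nc=8.0)
# The distribution function H_{r+2}(x; Delta) appearing in the bias and
# risk expressions is likewise ncx2.cdf(x, df=r + 2, nc=Delta).
```

This is also how the $H_{r+2}(\cdot; \Delta)$ terms below can be evaluated numerically when plotting the bias and risk curves against $\Delta$.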
For a concise statement of the results, we use the notation

$$h_{j,\, r+i}(\Delta) = E\!\left[ \left\{ \chi^2_{r+i}(\Delta) \right\}^{-j} \right], \quad i = 2, 4, \quad j = 1, 2,$$

and for simplicity we assume that $I_{(\beta,\gamma)} = I$. Here $H_{r+2}(\cdot; \Delta)$ denotes the distribution function of the $\chi^2_{r+2}(\Delta)$ distribution, and we write $J = H^\top (H H^\top)^{-1}$.

Theorem 3.2 Using the definition of the ADB and Theorem 3.1, the ADBs of the estimators are

$$\mathrm{ADB}(\hat{\theta}) = 0, \qquad \mathrm{ADB}(\tilde{\theta}) = -J\omega,$$
$$\mathrm{ADB}(\hat{\theta}^{PT}) = -J\omega\, H_{r+2}\!\left(\chi^2_{r,\alpha}; \Delta\right),$$
$$\mathrm{ADB}(\hat{\theta}^{SE}) = -(r - 2)\, J\omega\, h_{1,\, r+2}(\Delta),$$
$$\mathrm{ADB}(\hat{\theta}^{PSE}) = -J\omega \left[ H_{r+2}(r - 2; \Delta) + (r - 2)\, E\!\left\{ \chi^{-2}_{r+2}(\Delta)\, I\!\left( \chi^2_{r+2}(\Delta) > r - 2 \right) \right\} \right].$$

The outline of the proof is similar to that of Theorem 3.1 of Ahmed & Fallahpour (2012).
Clearly, the bias of each estimator is a function of $\Delta$, so for bias comparisons it suffices to compare the estimators through the scalar $\Delta$ only. The ADB of the RMLE is an unbounded function of $\Delta$. The ADBs of $\hat{\theta}^{SE}$ and $\hat{\theta}^{PSE}$ start at the origin and, as $\Delta$ increases, rise to a maximum and then decrease toward 0. Note that $h_{1,\, r+2}(\Delta)$ is a decreasing log-convex function of $\Delta$, and the ADB of $\hat{\theta}^{PT}$ is a function of both $\Delta$ and $\alpha$: for a fixed $\alpha$, it starts at zero, increases to a point, and then decreases gradually to zero.
Theorem 3.3 Under the local alternatives $K_n$ in (13) and the assumptions of Theorem 3.1, the ADRs of the estimators are obtained in closed form as functions of $\Delta$, $H_{r+2}(\cdot; \Delta)$, and $h_{j,\, r+i}(\Delta)$. The outline of the proof is similar to that of Theorem 3.2 of Ahmed & Fallahpour (2012).
Comparing the ADRs of the estimators, we see that, as $\Delta$ moves away from 0, the risk of $\tilde{\theta}$ becomes unbounded; that is, the RMLE $\tilde{\theta}$ dominates the unrestricted estimator only at and near $\Delta = 0$. The risk of $\hat{\theta}^{PSE}$ is asymptotically superior to that of $\hat{\theta}^{SE}$ for all values of $\Delta$, with strict inequality holding for some $\Delta$. Thus, not only does $\hat{\theta}^{PSE}$ confirm the inadmissibility of $\hat{\theta}^{SE}$, it also provides a simple superior estimator. Further, the largest risk improvement of $\hat{\theta}^{PSE}$ over $\hat{\theta}^{SE}$ occurs at and near the null hypothesis. Also, by comparing the ADRs of $\hat{\theta}^{SE}$, $\hat{\theta}^{PSE}$, and $\hat{\theta}$, it can easily be shown that, under certain conditions, $\mathrm{ADR}(\hat{\theta}^{PSE}; W) \leq \mathrm{ADR}(\hat{\theta}^{SE}; W) \leq \mathrm{ADR}(\hat{\theta}; W)$ for all $\Delta \geq 0$. For a given $\alpha$, the PT performs better than the unrestricted estimator near the null hypothesis, but not uniformly over the whole range of $\Delta$; one may, however, determine an $\alpha$ such that the PT has a minimum guaranteed risk. For any given $H$, $I$, and $h$, the relative efficiency (inverse of risk) of the PT depends on the choice of $\alpha$.
Our Monte Carlo simulations confirm the theoretical performance of the shrinkage estimators relative to the UMLE, which is precisely the lesson of Theorems 3.2 and 3.3: shrinkage strictly improves the asymptotic risk, and the improvement can be especially strong in high-dimensional cases. This holds true across the $n$ and $r$ values considered.
For the LASSO and adaptive LASSO estimators, we use the local quadratic approximation algorithm of Ulbricht (2010) to compute the entire solution path, together with $k$-fold cross validation to choose the value of the penalization parameter $\tau$ that achieves the lowest BIC score. Table 1 shows that when $r < 6$ the LASSO and adaptive LASSO estimators are better than the PSE, while the PSE performs better for $r \geq 6$. The PSE is thus preferable in the presence of many inactive predictors in the model.

Application to Real Life Data
We now apply the proposed methods to the National Medical Expenditure Survey data; details of the data can be found in Deb & Trivedi (1997). In this example, the number of hospital stays (hosp) is used as the response variable, and the covariates are self-perceived health status (poorhlth and exclhlth), number of chronic conditions (numchron), gender (male), age, number of years of education (school), family income (faminc), and a private insurance indicator (privins). Approximately 80% of the responses are equal to zero, corresponding to patients with zero days of hospital stay. The count portion of the model refers to the respondents who have stayed at least one day in the hospital, while the logistic (zero-inflation) portion predicts whether the outcome is positive or zero. All covariates are included in both portions of the model to predict the number of hospital stays.
To assess the effect of the covariates on hosp in each portion, the ZIPR model is fitted. The AIC and BIC criteria for the mixture model (Wang et al., 1996) show that poorhlth, exclhlth, and numchron are the important factors for the Poisson part, and poorhlth, numchron, age, and male are the important factors for the logistic part of the ZIPR model. So the other six covariates in the Poisson part and five covariates in the logistic part are not significantly related to the response hosp. In this example, our hypothesis is $H\theta = 0$, where $H$ is a $7 \times 18$ matrix and $\theta$ is an $18 \times 1$ vector of regression coefficients.
We apply the bootstrap method (see Jung et al., 2005) to examine the performance of the proposed estimation strategies for estimating the coefficients of the other seven covariates. We draw 1000 bootstrap samples of size n = 2000 by sampling 2000 rows with replacement from the data matrix $(y_i, x_{ij})$, and compute the point estimates, standard errors, and relative efficiencies of the proposed estimators. Here, the empirical distribution function $\hat{F}$ based on the original 4406 individuals is regarded as the true distribution, and the coefficients of the ZIPR model closest to $\hat{F}$ are regarded as the true parameter values. The results given in Table 2 reveal that the restricted, pretest, shrinkage, positive-part shrinkage, LASSO, and adaptive LASSO estimators are superior to the maximum likelihood estimator, in agreement with our simulation results.
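A generic version of this row-resampling bootstrap can be sketched as follows; here, for simplicity, resamples are of the same size as the data, the fitting routine is passed in as a function, and all names are our own:

```python
import numpy as np

def bootstrap_se(data, estimator, B=1000, seed=0):
    """Nonparametric bootstrap: resample rows of `data` with replacement,
    re-fit, and report the standard error of each coefficient.
    `estimator` maps a data matrix to a coefficient vector."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    k = len(estimator(data))
    reps = np.empty((B, k))
    for b in range(B):
        idx = rng.integers(0, n, size=n)       # resampled row indices
        reps[b] = estimator(data[idx])
    return reps.std(axis=0, ddof=1)
```

In the application, `estimator` would be the ZIPR fit (unrestricted, restricted, pretest, shrinkage, or penalized), and the same resamples would be reused across all estimators so that their standard errors and relative efficiencies are directly comparable.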

Conclusion
In this paper, we have introduced the pretest, shrinkage, LASSO, and ALASSO estimators for the zero-inflated Poisson regression model when it is suspected that some of the regression coefficients may be restricted to a subspace. We have presented ADB and ADR expressions for the pretest and shrinkage estimators. We conducted a Monte Carlo simulation experiment to examine the performance of the proposed estimators, which showed that the RMLE outperforms the usual UMLE at or near the restriction. However, as we deviate from the restriction, the risk of the RMLE becomes unbounded. Near the restriction, the risk of the pretest estimator is less than that of the UMLE; as $\Delta$, the distance between the simulated and restricted models, increases, the risk of the pretest estimator crosses the risk of the UMLE, reaches a maximum, and then decreases monotonically to the risk of the UMLE. Furthermore, the shrinkage estimators with data-based weights perform well when the restriction is true. In fact, the shrinkage estimators outperform the UMLE over the entire parameter space when the number of inactive parameters is greater than two, and their performance relative to the UMLE improves as the number of inactive covariates increases. The penalty estimators are preferable when there are fewer inactive covariates. Thus, in the presence of many inactive covariates, the shrinkage estimators are attractive alternatives to the penalty estimators.
Finally, we applied the proposed strategies to a real life data set to evaluate the relative performance of the proposed estimators. The results are consistent with our analytical and simulated findings. The theoretical and numerical results for the zero-inflated Poisson model can be extended to the entire class of generalized linear models; we are currently working on this project.
In terms of recommendations, the shrinkage estimators should be used instead of the UMLE when the restriction on the regression coefficients agrees with the data. If we are uncertain about the quality of the auxiliary information on the covariates, we can still use shrinkage estimation, because it offers a lower, or at worst an equal, risk relative to the UMLE over the entire parameter space.

Table 1.
Simulated relative efficiency of the RE, SE, PSE, LASSO, and adaptive LASSO estimators with respect to $\hat{\theta}$ when the restricted parameter space $H\theta = h$ is correct ($\Delta = 0$). The data are from the National Medical Expenditure Survey (NMES), conducted in 1987 and 1988 to provide a comprehensive picture of how Americans use and pay for health services; the survey includes 4406 respondents aged 66 or older covered by the Medicare program.