Bayesian analysis of zero-inflated regression models

https://doi.org/10.1016/j.jspi.2004.10.008

Abstract

In modeling defect counts collected from an established manufacturing process, there are usually a relatively large number of zeros (non-defects). Commonly used models such as the Poisson or Geometric distributions can underestimate the zero-defect probability and hence make it difficult to identify significant covariate effects to improve production quality. This article introduces a flexible class of zero-inflated models that includes other familiar models, such as the Zero Inflated Poisson (ZIP) model, as special cases. A Bayesian estimation method is developed as an alternative to the traditionally used maximum likelihood based methods for analyzing such data. Simulation studies show that the proposed method has better finite sample performance than the classical method, with tighter interval estimates and better coverage probabilities. A real-life data set is analyzed to illustrate the practicability of the proposed method, which is easily implemented using WinBUGS.

Introduction

Statistical methods for analyzing count data with numerous zeros are very important in various scientific fields, including but not limited to industrial applications (e.g., Lambert, 1992) and biomedical applications (e.g., Heilbron and Gibson, 1990; Hall, 2000). As an illustration, consider the data set in Table 1, which presents the number of defects that resulted from an experiment for improving printed circuit board (PCB) manufacturing quality at Nortel, RTP, North Carolina. Out of 54 observations, 42 (78%) are zeros (no defects), while 8, 2 and 2 of them have one, two and three defects, respectively. In addition to such defect counts, we also obtained data on other controllable factors (covariates) that might explain the variation in the defect counts. Regression models with commonly used discrete distributions such as Poisson and Negative Binomial (see Miaou, 1994) may not fit these data well and may seriously underestimate the zero-defect probability, which is an important indicator of production quality. We use the Nortel data set as a motivating example to develop our models for count data with many zeros. However, the proposed methodology is by no means limited to this specific data set.
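The underestimation of the zero-defect probability can be illustrated with the marginal counts quoted above. The following is a minimal sketch, assuming an intercept-only Poisson model that ignores the covariates (a simplification made here for illustration, not the regression model of the article); the variable names are hypothetical.

```python
import math

# Marginal defect counts quoted in the text: 42 zeros, 8 ones, 2 twos, 2 threes.
counts = {0: 42, 1: 8, 2: 2, 3: 2}

n = sum(counts.values())
# For an intercept-only Poisson model, the MLE of the rate is the sample mean.
mean = sum(k * c for k, c in counts.items()) / n

observed_zero = counts[0] / n      # empirical proportion of zeros
poisson_zero = math.exp(-mean)     # Pr(Y = 0) implied by the fitted Poisson

print(f"n = {n}, sample mean = {mean:.4f}")
print(f"observed Pr(Y=0) = {observed_zero:.3f}")
print(f"Poisson  Pr(Y=0) = {poisson_zero:.3f}")
```

Under this simplification the fitted Poisson assigns less probability to zero than the observed proportion of zero-defect boards, which is the pattern that motivates zero-inflated models.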

The zero-inflated Poisson regression model proposed in Lambert (1992) is very useful for modeling discrete data with many zeros. We extend the model to include a broad class of distributions (e.g., power series distributions) and present an alternative approach to fitting such models to “zero-inflated data” with a relatively small to moderate sample size. The zero-inflated model has the interpretation that when the production process is in a near perfect state, zero defects are observed with high probability. However, due to changes in the manufacturing environment, the process moves to an imperfect state in which defective outcomes are possible, but not inevitable. The environmental changes are usually unobservable and random. This causes the process to move randomly back and forth between the perfect and the imperfect states. If the production process is reasonably good, the data that count the number of defects will consist of many zeros. Most data sets collected from Nortel manufacturing processes have this feature of many zeros. See Li et al. (1999) for an example. This poses a challenge in their quality improvement practice. For other interesting applications related to zero-inflated models see Dahiya and Gross (1973), Umbach (1981), Yip (1988), Gupta et al. (1996), Welsh et al. (1996), Gurmu (1997), and Hinde and Demetrio (1998). An overview of zero-inflated models can be found in Ridout et al. (1998).

Classical statistical methods based on the maximum likelihood estimate (MLE) and the likelihood ratio (LR) test for zero inflated Poisson regression can be found in Hall (2000). The approximation theory based on large samples usually serves as the basis for deriving classical inference for non-normal data and often requires the use of nonstandard asymptotic theory (see Self and Liang, 1987). It will be of interest to see how these procedures perform with finite samples, especially in estimating the zero-defect probability. In simulation studies based on a sample of size n=50 (see Section 4.1.1), it was found that the classical procedure performs reasonably well in cases where the zero-defect probability (Pr(Y=0)) was not chosen close to unity. However, when Pr(Y=0) was chosen closer to unity, the Bayesian estimates performed very well with respect to interval width and coverage probability. Motivated by such good finite sample performance of the Bayesian methods, this article develops Bayesian point and interval estimation methods for zero inflated regression models.

In a Bayesian approach, parameters are considered random and a joint probability model for both data and parameters is required. The joint posterior distribution of the parameters of the proposed models turns out to be analytically intractable; hence simulation-based methods (see Tierney, 1994), broadly known as Markov Chain Monte Carlo (MCMC), are required to obtain the point and interval estimates of the parameters. Simple code written in WinBUGS (Spiegelhalter et al., 1999) has been used to perform all the required computations (see Appendix). In Section 4.1, several models have been fitted to show that computing time should not be a major obstacle to implementing the proposed Bayesian methods in real-life operations.

Section 2 presents parametric formulations of zero-inflated models with a particular emphasis on the zero inflated power series (ZIPS) model. Section 3 presents a Bayesian analysis for the ZIPS regression models. The process states (perfect and imperfect) are viewed as missing data and the data augmentation method (Tanner and Wong, 1987) is integrated into the MCMC procedure to generate samples from the posterior distribution of parameters of interest. Section 4 illustrates the procedure with real-life data. Results from some simulations are also presented to compare interval estimates from Bayesian and large sample chi-square approximation methods. Section 5 concludes this study and addresses a few areas of future work.

Section snippets

Zero inflated power series (ZIPS) models

The random variable $Y$ in a zero-inflated model can be represented as $Y = V(1-B)$, where $B$ is a Bernoulli($p$) random variable and $V$, independently of $B$, has a discrete distribution such as Poisson($\theta$), NegBin($\theta, r$) or, more generally, power series, PS($\theta$). Notice that under this representation, the mean $E(Y)$ and variance $\mathrm{Var}(Y)$ are given by $E(Y) = (1-p)E(V)$ and $\mathrm{Var}(Y) = \frac{p}{1-p}[E(Y)]^2 + \delta E(Y)$, where $\delta = \mathrm{Var}(V)/E(V)$ denotes the coefficient of dispersion of the latent random variable $V$. Thus, it follows that if the
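The moment formulas for $Y = V(1-B)$ can be checked numerically. The following is a minimal sketch, assuming the Poisson case $V \sim$ Poisson($\theta$), for which $\delta = 1$; the function name `zip_pmf` and the values of `p` and `theta` are hypothetical choices made for illustration.

```python
import math

def zip_pmf(k, p, theta):
    """Pr(Y = k) for Y = V(1 - B), B ~ Bernoulli(p), V ~ Poisson(theta)."""
    # Poisson pmf computed on the log scale for numerical stability.
    poisson = math.exp(-theta + k * math.log(theta) - math.lgamma(k + 1))
    # B = 1 forces Y = 0, so the extra mass p sits at zero.
    return (p if k == 0 else 0.0) + (1.0 - p) * poisson

p, theta = 0.6, 1.5          # illustrative values, not taken from the article
support = range(200)         # truncation point; the tail mass beyond it is negligible

# Moments obtained by direct summation over the pmf.
mean = sum(k * zip_pmf(k, p, theta) for k in support)
var = sum((k - mean) ** 2 * zip_pmf(k, p, theta) for k in support)

# Moments from the closed-form expressions in the text.
delta = 1.0                  # Var(V)/E(V) for a Poisson latent variable
formula_mean = (1.0 - p) * theta
formula_var = p / (1.0 - p) * formula_mean ** 2 + delta * formula_mean

print(mean, formula_mean)
print(var, formula_var)
```

The two routes agree up to floating-point error, and the variance exceeds the mean whenever $p > 0$, which shows how zero inflation induces overdispersion even when the latent count is Poisson.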

Bayesian analysis

Bayesian analysis requires the specification of prior distributions for the parameters. Assume that the prior distributions for $p$ and $\theta$ are independent, and use the following conditional conjugate priors: $p \sim \mathrm{Beta}(b_1, b_2)$ and $\theta \sim \pi(\theta)$, where $\pi(\theta) \propto \theta^{a_1}/[c(\theta)]^{a_2}$ is a conjugate prior for the PS family. Note that prior independence does not necessarily imply posterior independence. The hyperparameters $a_1, a_2, b_1, b_2$ are assumed known. In particular, $b_1 = b_2 = 1$ gives the uniform prior on $(0,1)$ for $p$. Small values
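How these conditional conjugate priors combine with data augmentation can be sketched for the simplest case. The code below is a minimal illustration, assuming an intercept-only ZIP model without covariates, for which $c(\theta) = e^{\theta}$ and the prior on $\theta$ reduces to a Gamma density with shape $a_1 + 1$ and rate $a_2$. It is not the WinBUGS code of the Appendix; the function name `gibbs_zip`, the hyperparameter values, and the chain length are hypothetical choices.

```python
import math
import random

def gibbs_zip(y, b1=1.0, b2=1.0, a1=0.0, a2=0.1,
              n_iter=4000, burn_in=1000, seed=1):
    """Gibbs sampler with data augmentation for an intercept-only ZIP model.

    Model: Y_i = V_i (1 - B_i), B_i ~ Bernoulli(p), V_i ~ Poisson(theta).
    Priors: p ~ Beta(b1, b2); pi(theta) proportional to theta^a1 * exp(-a2 * theta).
    """
    rng = random.Random(seed)
    n, total = len(y), sum(y)
    p, theta = 0.5, max(total / n, 0.1)   # arbitrary starting values
    draws = []
    for it in range(n_iter):
        # Augmentation step: a positive count implies B_i = 0; an observed
        # zero came from the perfect state (B_i = 1) with probability w.
        w = p / (p + (1.0 - p) * math.exp(-theta))
        n_perfect = sum(1 for yi in y if yi == 0 and rng.random() < w)
        n_imperfect = n - n_perfect
        # p | B ~ Beta(b1 + #perfect, b2 + #imperfect)
        p = rng.betavariate(b1 + n_perfect, b2 + n_imperfect)
        # theta | B, y ~ Gamma(shape = a1 + sum(y) + 1, rate = a2 + #imperfect);
        # random.gammavariate takes a scale argument, hence 1 / rate.
        theta = rng.gammavariate(a1 + total + 1.0, 1.0 / (a2 + n_imperfect))
        if it >= burn_in:
            draws.append((p, theta))
    return draws

# Marginal counts from the introduction: 42 zeros, 8 ones, 2 twos, 2 threes.
y = [0] * 42 + [1] * 8 + [2] * 2 + [3] * 2
draws = gibbs_zip(y)

p_mean = sum(d[0] for d in draws) / len(draws)
theta_mean = sum(d[1] for d in draws) / len(draws)
# Posterior mean of the zero-defect probability Pr(Y = 0) = p + (1 - p) exp(-theta).
zero_prob = sum(d[0] + (1.0 - d[0]) * math.exp(-d[1]) for d in draws) / len(draws)
print(p_mean, theta_mean, zero_prob)
```

Treating the latent states as missing data is what makes both full conditionals standard distributions; with covariates entering through link functions, this closed form is generally lost and a more general MCMC step would be needed.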

Data analysis and simulation studies

In this section, we analyze the data listed in Table 1 to illustrate the procedures mentioned in Section 3 and also present some results based on a simulation study to compare the performance of the Bayes procedures with that of their frequentist counterparts. We use Poisson and Negative Binomial (NB) distributions from the PS family for all our illustrations. Notice that an NB($r, \theta$) distribution belongs to the PS family (as presented in (1)) with $b(k) = \binom{r+k-1}{k}$ and $c(\theta) = (1-\theta)^{-r}$, with $\mu(\theta) = r\theta/(1-\theta)$.
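The NB membership in the PS family can be verified numerically. The sketch below assumes the standard power series form $\Pr(V = k) = b(k)\theta^k / c(\theta)$ for equation (1), which is not reproduced in this excerpt; the helper `ps_pmf`, the integer value of `r`, and the value of `theta` are hypothetical choices for illustration.

```python
import math

def ps_pmf(k, theta, b, c):
    """Power series pmf, assumed here to have the form b(k) * theta^k / c(theta)."""
    return b(k) * theta ** k / c(theta)

r, theta = 3, 0.4            # illustrative values; r is an integer so math.comb applies

def b(k):
    # NB coefficient: binomial(r + k - 1, k)
    return math.comb(r + k - 1, k)

def c(t):
    # NB normalizing function: (1 - t)^(-r)
    return (1.0 - t) ** (-r)

support = range(400)         # truncation point; the tail mass beyond it is negligible
total = sum(ps_pmf(k, theta, b, c) for k in support)
mean = sum(k * ps_pmf(k, theta, b, c) for k in support)

print(total)                               # should be numerically 1
print(mean, r * theta / (1.0 - theta))     # summed mean versus r*theta/(1 - theta)
```

The same helper covers the Poisson case by taking $b(k) = 1/k!$ and $c(\theta) = e^{\theta}$.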

Conclusions

Zero-inflated models have been shown to be useful for modeling outcomes of manufacturing processes and other situations where count data with many zeros are encountered. In the presence of covariates, the zero-inflated regression model has been found to be useful for process optimization. In this article, Bayesian methodologies have been used to model such data, using sampling-based methods. From simulation studies, it is also evident that the proposed methods are quite effective in drawing

Acknowledgements

We thank Spiegelhalter et al. of the MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK for providing the program BUGS (Bayesian inference Using Gibbs Sampling) and its Windows version WinBUGS, available free of cost from their web site http://www.mrc-bsu.cam.ac.uk/bugs. The WinBUGS code for the ZIP regression model is given in the Appendix. Finally, we thank the editor, the associate editor, and the referees for their constructive comments and valuable suggestions that improved this

References (24)

  • S. Gurmu

    Semiparametric estimation of hurdle regression models with an application to Medicaid utilization

    J. Appl. Econometrics

    (1997)
  • D.B. Hall

    Zero-inflated Poisson and binomial regression with random effects: a case study

    Biometrics

    (2000)