Introducing COZIGAM : An R Package for Unconstrained and Constrained Zero-Inﬂated Generalized Additive Model Analysis

Zero-inﬂation problem is very common in ecological studies as well as other areas. Nonparametric regression with zero-inﬂated data may be studied via the zero-inﬂated generalized additive model (ZIGAM), which assumes that the zero-inﬂated responses come from a probabilistic mixture of zero and a regular component whose distribution belongs to the 1-parameter exponential family. With the further assumption that the probability of non-zero-inﬂation is some monotonic function of the mean of the regular component, we propose the constrained zero-inﬂated generalized additive model (COZIGAM) for analyzing zero-inﬂated data. When the hypothesized constraint obtains, the new approach provides a uniﬁed framework for modeling zero-inﬂated data, which is more parsimonious and eﬃcient than the unconstrained ZIGAM. We have developed an R package COZIGAM which contains functions that implement an iterative algorithm for ﬁtting ZIGAMs and COZIGAMs to zero-inﬂated data based on the penalized likelihood approach. Other functions included in the package are useful for model prediction and model selection. We demonstrate the use of the COZIGAM package via some simulation studies and a real application.


Introduction
Generalized additive models (GAMs) (Hastie and Tibshirani 1990;Wood 2006) are widely used in applied statistics, especially for modeling nonlinear effects of the covariates in scientific and quantitative studies.See, for instance, Ciannelli, Fauchald, Chan, Agostini, and Dingsør (2007b) and the references therein in ecological analysis.GAMs can be estimated by maxi-COZIGAM mizing the penalized likelihood which, in general, equals where η is the unknown regression function on the link scale, L(η) is the log-likelihood functional, J 2 (η) is some roughness penalty, and λ is the smoothing parameter that controls the trade-off between the goodness-of-fit and the smoothness of the function.The estimated regression functions are smoothing splines under mild regularity conditions.See Wahba (1990), Green and Silverman (1994), Wood (2000) and Gu (2002) for details on the penalized likelihood approach and smoothing splines.
Zero-inflated data abound in ecological studies as well as in other scientific and quantitative fields, where the data contain an excess of zero responses.The problem is known as zeroinflation.For example, fisheries trawl survey data often contain a large number of zero catches, due to the fact that fish swim in schools influenced by food availability and irregular current pattern, see Ciannelli et al. (2007b).Zero-inflated data are often analyzed via a mixture model specifying that the response variable comes from a probabilistic mixture of zero and a regular component whose distribution (referred to as the regular distribution below) belongs to the 1-parameter exponential family distribution.See Mullahy (1986), Lambert (1992) and Heilbron (1994) for discussions in the parametric setting.Nonparametric regression analysis of zero-inflated data can be studied via the zero-inflated generalized additive model (ZIGAM) (Chiogna and Gaetan 2007), where the mean of the regular component and the probability of non-zero-inflation are each modeled as some nonparametric smooth predictors, say, s µ (T ) and s p (T ) respectively with T as the covariate.An alternative approach to modeling zero-inflated data proceeds in two stages: (i) model the presence/absence pattern by a GAM and (ii) model the response given it is non-zero by another GAM (Barry and Welsh 2002).When the response variable has a continuous regular distribution, the two-stage approach is equivalent to the ZIGAM, otherwise the two approaches are generally different.In stage (ii), the twostage approach generally specifies the conditional response distribution given it is non-zero to belong to a zero-truncated 1-parameter exponential family, and hence its fitting involves very complex link functions and variance functions.Here, we mainly focus on the ZIGAM and its constrained versions.
If the process generating the non-zero-inflated responses and the zero-inflation process constitute distinct mechanisms, the functional forms of the two smooth predictors s µ (T ) and s p (T ) in a ZIGAM are unconstrained.However, in many ecological data, the two processes are coupled and bear some systematic relationship.For example, in trawl survey studies, zero-inflation often arises from the spatio-temporal aggregation of fish due to their schooling behavior.For such data, the probability of positive catch is positively correlated to the volume occupied by the schools of fish which generally increases with the mean (local) abundance of the fish.Therefore, in the situation involving spatio-temporally aggregated subjects, the probability of positive catch is likely a monotonic function of the mean (local) abundance of the study population.Liu and Chan (2008) considered the case of imposing a proportionality constraint on s µ (T ) and s p (T ) up to an additive constant, which leads to a constrained zero-inflated generalized additive model (COZIGAM); see below.The imposed constraint in a COZIGAM reflects the mechanistic nature of the zero-inflation process.Moreover, it promotes estimation efficiency by effectively reducing the number of model parameters.The ZIP(τ ) model proposed by Lambert (1992) in the parametric Poisson regression setting is a harbinger of our new approach.Here, the proportionality constraint may be relaxed by letting the proportionality constants be component-specific, which allows the non-zero-data generating process and the zero-inflation process to be partially coupled.
Another important issue is to assess the validity of the proportionality constraint imposed by the COZIGAM against the (unconstrained) ZIGAM.Liu and Chan (2008) proposed a Bayesian model selection criterion for choosing between the two competing models.The model selection approach can be readily extended for the purpose of choosing between a ZIGAM and a GAM, which we do here.
To implement the regression analysis via the ZIGAM and the COZIGAM in real applications, we have developed an R (R Development Core Team 2007) package COZIGAM, which can be downloaded from the Comprehensive R Archive Network at http://cran.r-project.org/.
The purpose of this paper is to introduce the COZIGAM and describe how to use this package.The structure of this paper is as follows.We introduce the model formulation of both the constrained and unconstrained ZIGAMs, and briefly discuss the model estimation and the proposed model selection criterion in Section 2. The use of the COZIGAM package is illustrated by both Monte-Carlo studies and a real data application in Section 3. We briefly conclude in Section 4.

Model Formulation and Estimation
In this section we briefly outline the model formulations of the constrained and the unconstrained ZIGAMs, see Liu and Chan (2008) for details.Next, we summarize a model estimation procedure which may involve the EM algorithm (Dempster, Laird, and Rubin 1977).
Then, We will review the Bayesian model selection criterion developed by Liu and Chan (2008) for choosing between the constrained and the unconstrained ZIGAMs, and extend it for choosing between a ZIGAM and a GAM, i.e. without zero-inflation.

Model Formulation
Let Y = (Y 1 , Y 2 , . . ., Y n ) T be the responses and T = (T 1 , T 2 , . . ., T n ) be the covariates where Y i is univariate and T i consists of m sub-vectors (T i = (T 1i , T 2i , . . ., T mi ) T ).Assume that given the covariates T = t, Y i 's are independent.A GAM (Wood 2006, Chapter 3) relating the response Y i to the covariate t i has the general form: where µ i = E(Y i ), g µ is a monotonic link function, and η is some unknown smooth function to be estimated.Assume further that η is additive in the m covariates: where each s is a centered unknown smooth function (could be distinct when operating on different arguments), and β 0 is the intercept.Moreover, the conditional distribution of the response variable Y i is assumed to belong to the 1-parameter exponential family, as in a generalized linear model (GLM), see Nelder and Wedderburn (1972).In particular, where f (y i |ϑ i ) is the probability density (mass) function of some 1-parameter exponential family distribution, which has the form: where ϑ i is the canonical parameter, ω i is known constant denoting the weight of the data case which is often equal to 1, and φ is the dispersion parameter.GAMs can be estimated by the penalized likelihood approach, see Wood (2006, Chapter 4) for details.
Due to its flexibility, GAMs have become widely used in various fields.Unfortunately, GAMs cannot be directly applicable for regression analysis with zero-inflated data due to the excess of zeroes.Instead, nonparametric regression with zero-inflated responses may be studied via the zero-inflated generalized additive models (ZIGAMs).The ZIGAM assumes that the response variable follows a probabilistic mixture distribution of a zero atom and a regular component whose distribution belongs to the 1-parameter exponential family: where the zero atom models the zero-inflation explicitly and f is defined in (4).Below we refer to f in the mixture model ( 5) as the regular pdf and its corresponding distribution the regular distribution, and µ i = E f (Y i ) as the regular mean which is assumed to link to the covariates as given by ( 2).The non-zero-inflation probability p i is linked to the covariate as follows: where g p is another link function, for instance, the logit function, and ξ is an unknown smooth function.If η and ξ are functionally orthogonal (infinite-dimensional) parameters, the model is an unconstrained ZIGAM in which case zero-inflation could be caused by a mechanism different from that underlying the non-zero-inflated responses.On the other hand, if the zero-inflation process is coupled with the process generating the non-zero-inflated data, we may expect some monotonic relationship between η and ξ.In particular, we consider the case that ξ is constrained to be a linear function of η: where α and δ are two unknown coefficients.We will refer to the zero-inflated model (5) with constraint (7) as the constrained zero-inflated generalized additive model (COZIGAM).
In some cases, the non-zero-inflated data generating process and the zero-inflation process are partially coupled so that the latter process may only depend on a subset of the smooth components affecting the non-zero-inflated response.In addition, these smooth components may affect the zero-inflation differently.Thus, the proportionality constraint (7) in the COZIGAM may be relaxed to allow possibly different proportionality coefficients for different covariates in the zero-inflation process.Specifically, we consider the component-specific proportionality constraint: where α and δ = (δ 1 , . . ., δ m ) T are unknown parameters.In constraint (8) we assign a specific proportionality coefficient for each additive component so that different smooth components may possibly have different effects on the zero-inflation process.Furthermore, in some applications, it may be desirable to fix some proportionality coefficients to be zero, which enforces that the corresponding component covariates do not affect the zero-inflation process.In Section 3 we will illustrate the use of the COZIGAM package for fitting ZIGAMs, and COZIGAMs with both constraints ( 7) and (8).

Model Estimation
We now briefly outline the method of penalized likelihood for estimating a COZIGAM, with constraint (7), that is proposed by Liu and Chan (2008).The method can be readily modified for estimating a COZIGAM with component-specific constraint (8) or fitting an unconstrained ZIGAM.According to the reproducing kernel Hilbert space theory, under some mild conditions and for finite sample size, we can reparametrize the infinite-dimensional parameter η by a vector parameter β, so that where X i is the i-th row of the design matrix X of some basis functions.Furthermore, the roughness penalty J 2 (η) can often be expressed as a quadratic form β T Sβ/2, where S is a penalty matrix, see Gu (2002) and Wood (2006).Define the binary variables E i , i = 1, . . ., n, with If the underlying regular exponential family distribution is continuous, for instance, Gaussian or Gamma, the penalized log-likelihood then equals where λ n is the smoothing parameter.
If the regular distribution assigns positive probability to zero, which is the case for many discrete distributions including Poisson and binomial, the penalized log-likelihood function becomes somewhat complex: Below, a COZIGAM will be referred to as a continuous (discrete) COZIGAM if its penalized likelihood function is given by Equation ( 9) (Equation ( 10)).Liu and Chan (2008) proposed an iterative algorithm for maximizing (9) or ( 10) with respect to the parameter θ = (α, δ, β T ) T for the case of known smoothing parameter, which is motivated by the Penalized Iteratively Re-weighted Least Squares (PIRLS) method (Wood 2006, pg. 169) and the Penalized Quasi-Likelihood (PQL) method (see, for instance, Green 1987;Breslow and Clayton 1993).
Direct maximization of (9) could be done via a modified PIRLS algorithm, and the smoothing parameter could be determined by generalized cross validation (GCV) or unbiased risk estimation (UBRE); see Wood (2006, Chapter 4) for further discussions about GCV and UBRE.However, for a discrete COZIGAM, direct optimization of the penalized likelihood ( 10) is challenging because it complicates the use of GCV or UBRE for choosing the smoothing parameter.In this case, if we augment the data by the binary variables Z i , i = 1, . . ., n, which are defined by the complete-data penalized log-likelihood equals which has the same form as ( 9) and can be optimized through the modified PIRLS.Note that the variable Z i defined by ( 11) is latent so that the EM algorithm is employed for estimating a discrete COZIGAM.The covariance matrix of the estimator can be approximately computed by inverting the observed Fisher information.See Liu and Chan (2008) for details.

Model Selection
In statistical analysis, one important issue is model selection or model comparison among multiple competing models.One of the widely used model selection criteria is the Bayesian Information Criterion (BIC) (Schwarz 1978), which selects the model with maximum posterior model probability.In the Bayesian framework, the posterior probability of model M i equals where P (M i ) is the prior probability of model M i , D denotes the data, and is the normalizing constant.P (D|M i ) is the marginal likelihood (also known as the evidence) of the model M i , and it equals where P (D|θ, M i ) is the likelihood of the parameter θ under the model M i , and P (θ|M i ) is the prior probability of θ under M i .Assume a flat prior that P (M i ) ∝ constant, the posterior model probability P (M i |D) is proportional to the marginal likelihood P (D|M i ).Just like the BIC, we will use the marginal likelihood as the model selection criterion which maximizes the posterior model probability.Preference will be given to models with larger marginal likelihoods.For the unconstrained ZIGAM and the COZIGAM, there is generally no closedform solution for the integral on the right side of (12).Laplace method (see, for example, Tierney and Kadane 1986) is used to approximately compute the marginal likelihood.Liu and Chan (2008) gave the following approximate formula of the logarithmic marginal likelihood for the COZIGAM: where θ = ( α, δ, β T ) T is the maximum penalized likelihood estimator, K = dim(β), S + is a diagonal matrix of dimension m with all the strictly positive eigenvalues of the penalty matrix S arranged in descending order on the leading diagonal, and H is the negative Hessian matrix of l p /n evaluated at θ.
For the ZIGAM, Liu and Chan (2008) provided the following approximation: where the unconstrained infinite-dimensional parameter ξ (defined in ( 6)) can be reparametrized by a parameter vector γ, ( β T , γ T ) T is the maximum penalized likelihood estimator, and S 2+ of dimensions m 1 and m 2 are two diagonal matrices consisting of the strictly positive eigenvalues of the penalty matrices associated with the parameters β and γ respectively, λ 1n , λ 2n are the smoothing parameters, and H is the negative Hessian matrix of the normalized penalized likelihood function evaluated at the maximizer.
Here, we extend the model selection approach of Liu and Chan (2008) for assessing the presence of zero inflation.This can be done by fitting a ZIGAM and a GAM to the data, and then compare their marginal likelihoods.Following Liu and Chan (2008), it can be shown that the logarithmic marginal likelihood of a GAM (without zero-inflation) is given by Note that the difference between ( 13) and ( 14) is that l p (α, δ, β) in ( 13) is the penalized loglikelihood of a COZIGAM; furthermore the COZIGAM adds two more degrees of freedom to the GAM, while l p (β) in ( 14) is the penalized log-likelihood of a GAM, and accordingly the negative Hessian matrices in the two formulas are different.A higher marginal likelihood from the ZIGAM would indicate that there is zero-inflation in the count data.Otherwise, fitting the data by a GAM instead of a ZIGAM is appropriate.

The COZIGAM Package
The R package COZIGAM facilitates the fitting of a ZIGAM or a COZIGAM to zero-inflated data.It requires the installation of the mgcv package (Wood 2008) whose magic() function is made use of in the implementation of the maximum penalized likelihood estimation algorithm.Some features of the mgcv package are also shared by the COZIGAM package.In this section, we illustrate the use of the COZIGAM package.First we demonstrate the use by fitting discrete COZIGAMs to simulated data.Then a real data analysis will be studied where the response variable follows a zero-inflated lognormal distribution.The main function for fitting a COZIGAM (ZIGAM) is cozigam() (zigam()), which calls the COZIGAM.dis()(ZIGAM.dis())function if it is a discrete COZIGAM (ZIGAM).Otherwise COZIGAM.cts()(ZIGAM.cts())function is used for model estimation of continuous COZIGAMs (ZIGAMs).Some other useful functions including visualizing and summarizing a fitted COZIGAM will be discussed.In addition, the model selection criterion for choosing between an unconstrained ZIGAM and a COZIGAM will be illustrated.The key R commands as well as outputs will be COZIGAM provided with associated graphics.All numerical illustrations reported below were computed using a PC with a CPU of 2.40×2 GHz and 3 GB RAM.

Simulated Data
The simulations are based on two test functions, denoted by s 1 and s 2 , which are taken from Wood (2006, pg. 197).The test function s 1 has a 1-dimensional argument, while s 2 has a 2-dimensional argument (see Figure 1).In the COZIGAM package, the two test functions are named f0 and test respectively.We will simulate some Poisson and binomial count data based on these functions and then use the simulated data to fit COZIGAMs and ZIGAMs.As mentioned earlier, because the underlying regular distributions in these examples are discrete, the EM algorithm is used to find the maximizer of the penalized log-likelihood function (10).
R> library(COZIGAM) R> set.seed(8)R> n <-500 R> t1 <-runif(n, 0, 1) R> t2 <-runif(n, 0, 1) R> t3 <-runif(n, 0, 1) Next, we simulate the latent Poisson count data without zero-inflation: Finally, the Poisson variates are then set to zero with probability 1 − p i .The zero-inflated Poisson data may be saved in a data frame, say named data1: Note that in the process of simulating the data, we actually have the information of the latent indicator variable Z i (defined by ( 11)).However, in model fitting, we will not use this information but treat the Z i 's as missing.
The simulated zero-inflated dataset comprises of 200 zero responses out of 500 observations (40%), see Figure 2.Among the 200 zero responses, some are due to zero-inflation and the rest are the zero realizations of the Poisson distribution (and we cannot tell them apart).
To Here y ~s(t1,t2)+s(t3) is a GAM formula (see the gam() function in the mgcv package) specifying the response and predictor variables structure; the argument constraint = "proportional" specifies the proportionality constraint (7); conv.crit.out is the preselected stopping criterion for the iterative estimation procedure (see below); the distribution of the regular component (the non-zero-inflated data) is specified via the argument family, which is similar to the family argument of the glm() function for fitting a GLM; the data argument points to the dataset where the responses and covariates are saved.For a full list of the arguments as well as the object returned by the cozigam() function, see its help manual by running the command ?cozigam.
At the end of each iteration, the iteration number and the maximum norm of the difference between the current estimate and the previous one is displayed on the console, which lets the user keep track of the progress of the estimation procedure.The maximum norm is defined as where α, δ are the current parameter estimates and α old , δ old are the estimates from the previous iteration.The iteration procedure is considered to have successfully converged if the maximum norm is sufficiently small, i.e. it is less than the value specified by the argument conv.crit.out,at which iterate the estimation algorithm stops.For this example, the estimation algorithm converged after 15 iterations which took less than 10 seconds.Furthermore, the function outputs the parameter estimates α, δ, with their standard errors enclosed in parentheses.The above summary consists of two parts: the first part reports the parametric estimation results which includes the estimate of the intercept term β 0 in (2), as well as those of the constraint parameters α and δ.The corresponding standard errors of the estimators and the Wald test results for testing whether the parameters are individually equal to 0 are also given.
The second part reports the estimation results of the nonparametric smooth components, which lists the efficient degrees of freedom (edf) for each smooth term and the approximate F tests for significance.See Wood (2006) for relevant discussions in the context of GAM.The last line in the summary reports the scale (dispersion) parameter estimate of the regular distribution or its true value (if it is known), and the sample size as well; for example, the scale parameter is known and equals 1 for Poisson distributions.The users can check the help manual on the object returned by the cozigam() function (in this example saved as "res1") for more information of the fitted COZIGAM.
The smooth function estimates can be displayed using the generic function plot().The commands R> par(mfrow=c(1,2)) R> plot(res1, shade.ci=TRUE,Rug=TRUE) produce two figures, one for each of two smooth components in the model res1, as shown in Figure 3.The plotting convention depends on the dimension of the argument of the function.For the case of 1-dimensional argument, the function estimate is plotted as a smooth function by connecting the point estimates over a grid by lines in the plot, with a 95% pointwise confidence band.Setting the argument shade.ci to TRUE shades the confidence band in grey but otherwise the confidence band is unshaded except that its upper and lower boundaries are drawn as dashed lines.The covariate values of each data case are drawn as a short stick on the bottom of the x-axis if Rug=TRUE.
For the 2-dimensional case, the function estimate is displayed in a contour plot by default, with the covariate values of each data case plotted as a dot if Rug=TRUE (the right panel of Figure 3).Alternatively, the function estimate can be drawn in a perspective plot by setting the argument plot.2d="persp".We could also require only the second smooth component s 2 to be plotted by letting select=2.The command to produce Figure 4 is listed below.Note that the test functions are scaled in the model and the estimated smooth functions are centered at 0. R> plot(res1, select=2, plot.2d="persp") Given a new set of covariates, we can use the generic function predict() to make predictions for the new data.Suppose we have two new observations with predictors t1 = (0.5, 0.2, 0.3) T and t2 = (0.8, 0.1, 0.7) T .To predict the response values at those two points, we first create a data frame named newdata containing the new data: R> newdata <-data.frame(t1=c(0.5,0.8),t2=c(0.2,0.1),t3=c(0.3,0.7))R> newdata t1 t2 t3 1 0.5 0.2 0.3 2 0.8 0.1 0.7 The names of the covariates in the new data set must match those in the fitted model.In the case of missing values in the covariate or if there is a mis-match in the covariate names, the predict() function will return an error message.Next, we call the function predict() to make predictions for the new observations: R> pred <-predict(res1, newdata=newdata, se.fit=TRUE, type="response") R> pred fit se p 1 6.112847 0.6563248 0.7263883 2 2.327128 0.2841570 0.5475469 With the option se.fit=TRUE, the standard errors of the point predictors are computed and reported in the output.The argument type="response" specifies that predictions are done on the original scale of the response, whereas if type="link", predictions on the link scale are returned.The returned object is a data frame that consists of three columns: the column fit gives the predicted response for each observation; the column se gives the standard errors of the point predictors; and the last column p gives the predicted non-zero-inflation probability.
For example, if the model has two smooth components s(t 1i ) and s(t 2i ), zero.delta=c(NA,0) would include only the first smooth component in the zero-inflation constraint, so that, g p (p i ) = α + δ 1 s(t 1i ).In the above example, we initially did not fix δ 2 , but let the data tell us which covariate may affect the zero-inflation process, as, in practice, there may be little information on which factors affects zero-inflation.Instead, we fitted a COZIGAM with all constraint coefficients being free parameters.The fitted model yields that δ 2 = −0.025with standard error 0.229, which is not significant.Hence, we fitted another COZIGAM with δ 2 fixed to be 0 (unreported).In practice we can use similar strategy or some prior information to determine which smooth components should be included in the zero-inflation constraint.

The Use of Model Selection Criterion
In Section 2 we have discussed the proposed model selection criterion for choosing between an unconstrained ZIGAM and a COZIGAM.We demonstrate its use here.Let us revisit the first example with zero-inflated Poisson responses.We have fitted a COZIGAM with the fitted model saved as res1.The validity of the proportionality constraint (7) can be checked via model comparison between the fitted COZIGAM and an unconstrained ZIGAM, the latter of which can be fitted by the zigam() function: R> res1.un<-zigam(y ~s(t1) + s(t2,t3), family = poisson, data = data1) We can then compare the (approximate) logarithmic marginal likelihoods of the two models: The COZIGAM has a greater marginal likelihood (−962.32)than the unconstrained ZIGAM (−969.39),which suggests that the more parsimonious COZIGAM is preferred by the model selection criterion.It is instructive to compare the non-zero-probability functions from the two model fits.Because the ZIGAM assumes no constraint on the smooth function of nonzero-inflation probability ξ, its estimated smooth components have much wider confidence intervals (Figure 3) as compared to their counterparts of the COZIGAM (Figure 3). Figure 6 plots the estimated non-zero-inflation probabilities versus their true counterpart with the left diagram for the fitted COZIGAM and the right digram for the fitted (unconstrained) ZIGAM, which shows that the ZIGAM results in much more variable estimates than the COZIGAM.The larger variablility in the ZIGAM estimates owes to the fact that the ZIGAM estimate of the non-zero-inflation probability function is based on the presence/absence binary data, which is generally less informative than the non-zero-inflated data.This confirms that fitting a COZIGAM gains efficiency when the constraint obtains (Liu and Chan 2008).[1] -1245.787 The logarithmic marginal likelihood of the fitted GAM (−1245.79)which does not incorporate zero-inflation is much lower than that of the (unconstrained) ZIGAM model, revealing the presence of zero-inflation.
Consider another example in which we simulated Poisson data that are not zero-inflated and the Poisson mean equals exp{s 1 (t 1 )/3 − 1} with sample size n = 300.The simulated Poisson responses have 98 zeroes.The histogram of the non-zero-inflated Poisson responses in Figure 7 looks very similar to Figure 2 where zero-inflation does exist.Therefore, in this case, we cannot easily tell whether zero-inflation is present in the data.However, the model selection approach provides a convenient way to assessing the presence of zero-inflation.
We could also compare the marginal likelihood of a GAM with that of a COZIGAM for checking the presence for zero-inflation.However, because the COZIGAM adds only two more degrees of freedom to the parameter space, the model selection criterion tends to choose the COZIGAM over the GAM even though the true model is non-zero-inflated, for relatively small sample size.We study the relative frequency of detecting the presence of zero inflation by choosing between a GAM and a COZIGAM, via simulations using the above model setting and with different sample sizes.The simulation results are summarized in Figure 8, which suggests that, for small to medium sample sizes, the model selection criterion is not so powerful in picking the true (non-zero-inflated) model.However, for large sample sizes, the proportion of choosing the true model increases from 80.2% when n = 600 to 94.5% when n = 1200.On the other hand, our limited simulation experience suggestions that if the model comparison is restricted to between the GAM and the ZIGAM, the model selection approach was found to yield very high probability (above 90%) of choosing the true model even with relatively small sample size (e.g., n = 200), whether the true model is zero-inflated or not.Therefore, in order to use the model selection approach to detect zero-inflation in the data, our suggestion is to compare the marginal likelihood of the GAM with that of the (unconstrained) ZIGAM, unless the sample size is sufficiently large, in which case we can also compare the GAM with the COZIGAM.

Real Data Application
Now we illustrate the use of the COZIGAM package with a real data application; see Liu and Chan (2008) for further discussion.The data analyzed in this example is part of an extensive survey data on walleye pollock egg density (numbers 10m −2 ) collected during the ichthyoplankton surveys of the Alaska Fisheries Science Center (AFSC, Seattle) in the Gulf of Alaska (GOA) from 1972 to 2000.Ciannelli, Bailey, Chan, and Stenseth (2007a) showed that the spatialtemporal distribution of the pollock egg in the GOA underwent a change around 1989-90.However, their analysis was confined to positive catch data and information from the zero catches were ignored.Here, we illustrate the use of the COZIGAM for extracting information from all data including zero catches.For simplicity, we only analyze the data from 1987 which contain 274 observations sampled from the 93th to the 116th Julian day over sites with bottom depth in the range of 28-5200m.This dataset is included in the COZIGAM package with name eggdata.To load the dataset into an R session, type

R> data(eggdata)
The first 6 observations are all zero catches: R> head(eggdata) The dataset contains six variables: bottom records the bottom depth (in meters) for each observation; lon and lat represent longitude and latitude respectively, i.e., the geographical location of each sampling site; the catch column contains the observed pollock egg abundance which is measured by CPUE (Catch Per Unit Effort); j.day is the Julian day information; and the last variable is year.
There are totally 274 observations in the year of 1987, among which 84 are zero catches making up over 30% of the data (see Figure 9).Because the survey in 1987 took place in a relatively short period (93th to 116th Julian day), preliminary analysis showed that the sampling date may and will be dropped from the analysis.Here, the main goal is to explore the spatial distribution of pollock spawning aggregations in the GOA.The response variable is the CPUE, and the covariates include location (longitude and latitude) and (log-transformed) bottom depth.Consider the model that the CPUE follows a COZIGAM with a zero-inflated lognormal distribution.Specifically, for the i-th observation, i = 1, . . ., 274, CP U E i |t i ∼ 0 with probability 1 − p i Lognormal(µ i , σ 2 ) with probability p i .
The mean response µ i of the (log) non-zero-inflated data is assumed to be additive in the covariates: with the following constraint on the non-zero-inflation probability p i : where β 0 , α, δ are parameters, s are assumed to be distinct smooth functions if they have distinct arguments; for model identifiability, the smooth functions are constrained to be of zero mean and hence the corresponding function estimates are centered over the data.
The function cozigam() was called to fit a COZIGAM to the pollock egg data: The argument log.tran=TRUE effects the log-transformation to all positive responses so that the normal family is specified in the model fit.
Before accepting the fitted COZIGAM, we need to assess the validity of the contraint on the non-zero-inflation probability.We do this by fitting an unconstrained ZIGAM to the data and comparing its logarithmic marginal likelihood with that of the COZIGAM: R> egg.res.un<-zigam(catch ~s(lon,lat) + s(log(bottom)), log.tran = TRUE, family = gaussian, data = eggdata) R> egg.res$logE; egg.res.un$logE [1] -454.6639 [1] -463.1898 which provides some justification for constraining the non-zero-inflation probability specified by ( 16).The parameter estimates for Equation ( 16) are α = −1.816(0.347) and δ = 0.489 (0.064), which is significantly positive.Thus, there is strong evidence indicating that zero-inflation is more likely to occur at locations with less egg density.Approximate F-tests show that the two smooth functions are highly significant.See Figure 10 for the plots of the estimated functions.
The validity of the lognormal regression assumption for the positive data may be explored with the model fit using only the residuals of the non-zero data.The model diagnostic plots including the Q-Q normal score plot of the residuals and the plot of residuals vs. fitted values (Figure 11) suggest that the model assumptions for the positive data are generally valid.Therefore the lognormal regression assumption is reasonable according to the model diagnostics.

Conclusion
In summary, we have presented a new approach for analyzing zero-inflated data, and introduced a corresponding package COZIGAM of R routines for fitting constrained and unconstrained zero-inflated generalized additive models.Some simulation studies and a real data application were used to illustrate the use of the COZIGAM package.Future work includes incorporating more general form of constraints on the non-zero-inflation probability, developing methods of model diagnostics for zero-inflated models using all data, and extending the package to fit threshold COZIGAM that can account for nonstationarity or nonlinearity.We plan to incorporate some of these features into later versions of the COZIGAM package.

Figure 1 :
Figure 1: Test Functions Used in Simulation Studies Figure 2: Histogram of the Simulated Zero-Inflated Poisson Responses

Figure 3 :
Figure 3: Plots of fitted smooth functions in Example 1.The left panel depicts the estimate of s 1 , and the right panel displays the estimated s 2 .

Figure 5 :
Figure 5: Plots of the smooth function components of the non-zero-inflation probability, on the logit scale, of the fitted ZIGAM with the data of Example 1.The left panel depicts the estimate of s 1 , and the right panel displays the estimated s 2 .

Figure 7 :
Figure 7: Histogram of the Simulated Non-Zero-Inflated Poisson Responses

Figure 8 :
Figure 8: Proportions of choosing the true model under different sample sizes.The simulation results are based on 1000 replications for each case.

Figure 10 :
Figure 10: Effects of Location and Bottom Depth: the left diagram shows the contour plot of s(lon, lat) on the right side of Equation (15); the right diagram depicts the bottom depth effect s(log(bottom)) with 95% pointwise confidence band.
Raw Data Plot of Pollock Egg Density.Blue circles denote zero catches; positive catches are displayed by red dots, whose sizes are proportional to logarithmic responses.
The fitted model is summarized as follows:R> summary(egg.res)