Zero Inflated Poisson and Zero Inflated Negative Binomial Models with Application to Number of Falls in the Elderly


Background
Generalized linear models (GLMs) are powerful tools for analyzing count data. The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. It is therefore important to test whether the variance exceeds the mean (overdispersion) and whether the number of zeros is greater than the Poisson distribution predicts.
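The two checks mentioned above can be sketched with a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `dispersion_summary` is hypothetical, and the input figures are the summary statistics for number of falls reported later in this paper. Under a Poisson model with mean m, the variance equals m and the expected proportion of zeros is exp(-m).

```python
import math

def dispersion_summary(n, mean, variance, n_zeros):
    """Compare a count variable's summary statistics with Poisson expectations."""
    expected_zero_prop = math.exp(-mean)  # P(Y = 0) under Poisson(mean)
    return {
        "dispersion_ratio": variance / mean,  # close to 1 under a Poisson model
        "observed_zero_prop": n_zeros / n,
        "expected_zero_prop": expected_zero_prop,
        "expected_zeros": n * expected_zero_prop,
    }

# Summary figures reported in this paper for number of falls.
summary = dispersion_summary(n=2118, mean=0.4759, variance=1.304, n_zeros=1643)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A dispersion ratio well above 1 together with an observed zero proportion above the Poisson expectation is the pattern that motivates zero-inflated models.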
When zeros occur in data sets, researchers or analysts often treat them as missing data and delete or impute them. At times, however, these zeros are meaningful and should be treated as such. A zero may mean that the subject or respondent did not experience the outcome, and such data should not be considered missing. In general, data should not be transformed to fit a linear model and, in particular, count data should not be log-transformed. Another common approach is to make the variable categorical by treating the zeros as "outcome absent" and all other observations as "outcome present". This is not a good option either, since the transformed variable no longer captures the number of occurrences. A standard Poisson model does not distinguish between the two processes that can produce a zero (one that yields only zeros and one that yields counts, some of which are zero), but a zero-inflated model accommodates this complication. When a dataset has an excessive number of zero outcomes and two possible processes that lead to a zero outcome, a zero-inflated model should be considered. A histogram of the response variable can be drawn to see whether the number of zeros is excessive. If two processes generated the zeros in the response variable but the number of zeros is not excessive, a zero-inflated model may or may not be needed.
For dependent variables with excess zeros, the generalized linear model (GLM) with Poisson distribution can be considered. However, the Poisson GLM does not handle overdispersion, particularly when it is due to excessive zeros [1]. The zero-inflated Poisson (ZIP) model, by contrast, deals with excessive zero responses in count data. Theory suggests that the excess zeros can be modeled independently. The ZIP model has two parts: a Poisson count model and a logit model for predicting excess zeros. The zero-inflated negative binomial (ZINB) regression also models count data with excessive zeros and is usually applied to overdispersed count outcome variables. Here too, theory suggests that the excess zeros are generated by a process separate from the count values and can be modeled independently [2].
In this analysis, we considered the number of falls, a count variable with many zero responses. If the respondent did not experience a fall, the only possible outcome is zero. If the respondent experienced a fall, the outcome follows a count process. Previous studies have examined falls as a categorical variable by identifying factors associated with its presence or absence [3][4][5]. Number of falls is a count variable and should be treated as such.
We determined the best fitting model between the zero inflated Poisson and the zero inflated negative binomial regression models in identifying the factors associated with number of falls in the elderly using data from the Ibadan Study for Ageing.

The Zero Inflated Poisson (ZIP) Regression Model
In zero-inflated Poisson regression, the responses Y = (Y_1, Y_2, ..., Y_n) are independent. The model assumes that with probability p the only possible observation is 0, and with probability (1 - p) a Poisson(λ) random variable is observed. Therefore Y_i follows the distribution:

$$
P(Y_i = y) =
\begin{cases}
p + (1 - p)\,e^{-\lambda}, & y = 0 \\
(1 - p)\,\dfrac{e^{-\lambda}\lambda^{y}}{y!}, & y = 1, 2, \ldots
\end{cases}
$$

The mean and variance of the ZIP distribution are respectively:

$$
E(Y_i) = (1 - p)\lambda, \qquad \mathrm{Var}(Y_i) = (1 - p)\lambda(1 + p\lambda).
$$
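As a quick numerical illustration of the ZIP distribution described above, the sketch below evaluates the standard ZIP probability mass function and compares its moments with the closed forms E(Y) = (1 - p)λ and Var(Y) = (1 - p)λ(1 + pλ). The function names and the parameter values p = 0.6 and λ = 1.5 are illustrative assumptions, not estimates from the study.

```python
import math

def zip_pmf(y, p, lam):
    """P(Y = y) for a zero-inflated Poisson: inflation probability p, rate lam."""
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros come from both the always-zero process and the Poisson process.
        return p + (1 - p) * poisson
    return (1 - p) * poisson

def zip_mean(p, lam):
    return (1 - p) * lam

def zip_variance(p, lam):
    return (1 - p) * lam * (1 + p * lam)

p, lam = 0.6, 1.5          # illustrative values only
support = range(60)        # the Poisson tail beyond 60 is negligible for lam = 1.5
total = sum(zip_pmf(y, p, lam) for y in support)
mean = sum(y * zip_pmf(y, p, lam) for y in support)
var = sum((y - mean) ** 2 * zip_pmf(y, p, lam) for y in support)

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f} (closed form {zip_mean(p, lam):.6f})")
print(f"variance: {var:.6f} (closed form {zip_variance(p, lam):.6f})")
```

Because the variance exceeds the mean whenever p > 0, zero inflation alone induces overdispersion relative to a plain Poisson model.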

The Zero Inflated Negative Binomial (ZINB) Regression Model
The zero-inflated negative binomial (ZINB) regression model assumes that there are two distinct data generation processes. The result of a Bernoulli trial determines which of the two processes is used. For observation i, with probability π_i the only possible response of the first process is a zero count, and with probability 1 - π_i the response of the second process is governed by a negative binomial distribution with mean λ_i. Zero counts are generated by both processes, and a probability is estimated for whether a zero comes from the first or the second process. The overall probability of a zero count is the combined probability of zeros from the two processes. Following Greene and Yau et al., a ZINB model for the response Y_i is written as:

$$
P(Y_i = y) =
\begin{cases}
\pi_i + (1 - \pi_i)(1 + k\lambda_i)^{-1/k}, & y = 0 \\
(1 - \pi_i)\,\dfrac{\Gamma(y + 1/k)}{y!\,\Gamma(1/k)}\,\dfrac{(k\lambda_i)^{y}}{(1 + k\lambda_i)^{y + 1/k}}, & y = 1, 2, \ldots
\end{cases}
$$

The mean and variance of Y_i are:

$$
E(Y_i) = (1 - \pi_i)\lambda_i, \qquad \mathrm{Var}(Y_i) = (1 - \pi_i)\lambda_i(1 + \pi_i\lambda_i + k\lambda_i),
$$

where λ_i is the mean of the underlying negative binomial distribution and k is the overdispersion parameter.

Respondents were elderly people aged 65 years or above who resided in the selected states. A multistage cluster sampling of households within enumeration areas (geographical units demarcated by the National Population Commission) within each state was employed. Only one respondent per household was selected, after providing consent, which was mostly verbal because of illiteracy or by choice. Information was collected using a questionnaire that covered socio-demographic, social engagement and health characteristics as well as questions on falls. At baseline in 2003/2004, 2149 respondents completed the study. Respondents were subsequently followed up yearly from 2007 to 2009. Only the baseline data were used for this analysis. The data were weighted to account for the clustering.
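Returning briefly to the ZINB distribution: the sketch below evaluates a ZINB probability mass function under the common parameterization in which the negative binomial part has mean λ and variance λ + kλ² (an assumption here, consistent with k being the overdispersion parameter). It then checks numerically that the moments match E(Y) = (1 - π)λ and Var(Y) = (1 - π)λ(1 + πλ + kλ). The function name and parameter values are illustrative, not estimates from the study.

```python
import math

def zinb_pmf(y, pi, lam, k):
    """P(Y = y) for a zero-inflated negative binomial.

    pi  : probability of the always-zero process
    lam : mean of the negative binomial process
    k   : overdispersion parameter (NB variance = lam + k * lam**2)
    """
    r = 1.0 / k
    # Work on the log scale to keep the gamma functions numerically stable.
    log_nb = (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
              + y * math.log(k * lam) - (y + r) * math.log(1 + k * lam))
    nb = math.exp(log_nb)
    if y == 0:
        return pi + (1 - pi) * nb
    return (1 - pi) * nb

pi, lam, k = 0.5, 2.0, 0.8   # illustrative values only
support = range(500)         # the tail beyond 500 is negligible for these values
total = sum(zinb_pmf(y, pi, lam, k) for y in support)
mean = sum(y * zinb_pmf(y, pi, lam, k) for y in support)
var = sum((y - mean) ** 2 * zinb_pmf(y, pi, lam, k) for y in support)

closed_mean = (1 - pi) * lam
closed_var = (1 - pi) * lam * (1 + pi * lam + k * lam)
print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f} (closed form {closed_mean:.6f})")
print(f"variance: {var:.6f} (closed form {closed_var:.6f})")
```

As k approaches 0 the negative binomial part approaches a Poisson distribution, so the ZIP model is the limiting case of the ZINB model.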
The outcome variable was the number of falls in the last 12 months. The independent variables were socio-demographic characteristics: age, sex, years of education, and a wealth index developed using principal component analysis and divided into 5 wealth quintiles (poorest, poorer, middle, richer, richest). The wealth index was constructed using the asset approach, whereby as many household possessions as possible are included. Items used for the wealth index were: wall clock, bucket, radio, cell phone, motor bike, fan, gas or electric stove, video cassette, bicycle, television, air conditioner, livestock, personal computer, clothes iron, deep freezer, refrigerator, and motor vehicle. The reliability, precision and accuracy of the wealth index increase when many items are used for its computation [6]. Other explanatory variables considered were the presence of chronic medical conditions such as arthritis and insomnia, and the presence of chronic pain. A total of 2118 respondents answered the question: how many times have you fallen in the last 12 months? Of these, 475 had fallen at least once (91 had fallen 2 times, 63 had fallen 3 times, and so on). The number of zeros was 1643 (77.57%), and the mean (0.4759) was smaller than the variance (1.304), confirming that the number of falls was overdispersed. These features of the data suggested the use of zero-inflated models. Figure 1 shows the number of falls. The ZIP and ZINB models were fitted. Model selection and comparison were done using the Akaike information criterion (AIC), the Bayesian information criterion (BIC), -2logL and the Vuong test.

Results
A total of 475 (22.4%) respondents reported having 1 or more falls, while 1643 had zero falls. About a third (30.6%) were aged above 80 years, slightly more than half (53.4%) were female, 25.8% lived in urban areas and about half (50.3%) were married. The majority (70.8%) had no medical condition, 79.1% had chronic pain and 25.6% were visually impaired. Table 1 shows the characteristics of the respondents.

Results from the ZIP model
The number of falls was approximately 23.2% lower among female respondents than among male respondents (IRR = 0.768, 95% CI: 0.594, 0.991). The number of falls among respondents from rural areas was 34.0% higher (IRR = 1.340, 95% CI: 1.025, 1.751) than among respondents from urban areas. Respondents who had a chronic medical condition had a 19.1% decrease in number of falls (IRR = 0.809, 95% CI: 0.668, 0.981) compared to respondents who had no medical condition. The number of falls was 28.8% higher among respondents who were visually impaired (IRR = 1.288, 95% CI: 1.076, 1.541) than among respondents who were not visually impaired; however, this was not significant.
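The percentages quoted above follow directly from the incidence rate ratios: an IRR is converted to a percent change as (IRR - 1) × 100, and a 95% confidence interval that excludes 1 corresponds to significance at the 5% level. The short sketch below, with hypothetical helper names, reproduces this arithmetic for three of the ZIP estimates reported in this section.

```python
def percent_change(irr):
    """Percent change in the expected count implied by an incidence rate ratio."""
    return (irr - 1.0) * 100.0

def ci_excludes_one(lower, upper):
    """True when a confidence interval for a ratio does not contain 1."""
    return lower > 1.0 or upper < 1.0

# (IRR, lower 95% limit, upper 95% limit) as reported for the ZIP count model.
estimates = {
    "female vs male": (0.768, 0.594, 0.991),
    "rural vs urban": (1.340, 1.025, 1.751),
    "chronic medical condition vs none": (0.809, 0.668, 0.981),
}

for label, (irr, lower, upper) in estimates.items():
    change = percent_change(irr)
    flag = "CI excludes 1" if ci_excludes_one(lower, upper) else "CI includes 1"
    print(f"{label}: {change:+.1f}% ({flag})")
```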
In the zero-inflation part of the model, the risk of being in the absolute zero group was 44.3% lower among female respondents (IRR = 0.557, 95% CI: 0.379, 0.818) than among male respondents. In addition, the risk of being in the absolute zero group was 39.2% lower among respondents with chronic pain (0.608, 95% CI: 0.421, 0.877) than among respondents with no chronic pain. The risk of being in the absolute zero group was 21.5% lower for respondents in semi-urban areas (0.785, 95% CI: 0.544, 1.134) and 5.2% higher for respondents in rural areas (1.052, 95% CI: 0.710, 1.558) than for respondents in urban areas; location was not significant. The risk of being in the absolute zero group for respondents who were visually impaired increased by 59.7% (1.597, 95% CI: 0.786, 1.398) compared to respondents who were not visually impaired, but this result was not significant. Table 2 shows the parameter estimates from the zero-inflated Poisson regression.

Results from the ZINB model
Respondents who had chronic pain had a 67.0% increase in number of falls (IRR = 1.670, 95% CI: 1.217, 2.292) compared to respondents who had no pain. Respondents who were visually impaired had a 36.4% increase in number of falls (IRR = 1.364, 95% CI: 1.042, 1.786) compared to respondents who were not visually impaired. The number of falls was approximately 35.4% higher among female respondents than among male respondents (IRR = 1.354, 95% CI: 0.961, 1.907). The number of falls among respondents from semi-urban and rural areas was higher by 26.7% (IRR = 1.267, 95% CI: 0.931, 1.725) and 29.4% (IRR = 1.294, 95% CI: 0.917, 1.825) respectively, compared to respondents from urban areas. Respondents who had an elementary occupation had an 18.1% increase in number of falls (IRR = 1.181, 95% CI: 0.775, 1.800) compared to respondents who had a skilled or higher-level occupation. However, respondents who were traders had a 21.1% decrease in number of falls (IRR = 0.789, 95% CI: 0.506, 1.231) compared to respondents who had a skilled or higher-level occupation. Sex, location, marital status and occupation were not significantly associated with number of falls. Table 3 shows the parameter estimates from the zero-inflated negative binomial regression.

Model Comparison and Selection
We used the AIC, BIC, -2logL and the Vuong statistic for model comparison and selection. For the ZIP model, the AIC was 3466.944, while for the ZINB model it was 3281.569. In addition, the -2logL was 3386.994 and 3237.57 for the ZIP and ZINB models respectively (Table 4). Both values were lower for the ZINB model, indicating that it fitted the data better.
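The comparison criteria named above can be sketched as follows. AIC and BIC are computed from -2logL and the number of estimated parameters, and the Vuong statistic is computed from per-observation log-likelihoods of two competing models. The function names are hypothetical, the parameter count of 22 is an assumption chosen only because it reproduces the reported ZINB AIC (the paper does not state it), and the log-likelihood lists are made-up toy values, not study data.

```python
import math
import statistics

def aic(neg2_loglik, n_params):
    """Akaike information criterion from -2logL."""
    return neg2_loglik + 2 * n_params

def bic(neg2_loglik, n_params, n_obs):
    """Bayesian information criterion from -2logL."""
    return neg2_loglik + n_params * math.log(n_obs)

def vuong_statistic(loglik_a, loglik_b):
    """Vuong z statistic; positive values favor model A over model B."""
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    # Population standard deviation of the pointwise log-likelihood differences.
    return math.sqrt(n) * statistics.mean(diffs) / statistics.pstdev(diffs)

# Reported AIC values: the model with the smaller AIC is preferred.
aic_zip, aic_zinb = 3466.944, 3281.569
delta_aic = aic_zip - aic_zinb
print(f"AIC difference (ZIP - ZINB): {delta_aic:.3f}")

# Assumed parameter count (not stated in the paper) applied to the reported -2logL.
print(f"AIC from -2logL with 22 parameters: {aic(3237.57, 22):.3f}")

# Toy per-observation log-likelihoods, for illustration only.
toy_a = [-0.5, -1.2, -0.3, -2.0]
toy_b = [-0.6, -1.5, -0.35, -2.6]
z = vuong_statistic(toy_a, toy_b)
print(f"toy Vuong z: {z:.3f}")
```

In practice the per-observation log-likelihoods would come from the fitted models, and the statistic is compared with standard normal critical values.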

Discussion
In this study we identified and quantified the relationships between frequency of falls in the elderly and selected variables in the presence of overdispersion by using zero-inflated regression models. The number of falls was overdispersed, and the zero-inflated negative binomial model was the preferred model for identifying the factors associated with falls. Mixed probability distributions were used to model the zero and non-zero counts. The estimated results are efficient and unbiased. Our results showed that the standard errors from the ZIP model were smaller than those from the ZINB model, which may explain why some regression parameters were not significant in the ZINB model. The larger standard errors in the ZINB model suggest that, in the presence of overdispersion, the ZIP model overstates the significance of the regression parameters and of the factors evaluated. This is compatible with findings from other studies [7][8][9][10]. The ZINB was selected as the best model based on the values of the model selection tests and criteria used.

About a quarter of the respondents reported experiencing 1 or more falls. This low rate of falls has been discussed extensively by Bekibele and Gureje [3]. In addition, this rate is comparable with reports from Japan [11] and China [12], but low when compared to Australia [13], Spain [14], Italy [15], Brazil and Iran [16][17][18][19]. The ZIP model showed that female sex, being separated or widowed, having a chronic medical condition, and being in the fourth wealth quintile were significantly associated with number of falls. However, the ZINB model identified chronic pain as the only predictor of falls in the elderly. Previous studies in which falls in the elderly were treated as a categorical variable have identified regular medication, depression, sufficient exercise, and wearing slippery shoes as significant factors for falls in the elderly [5].
However, in another analysis in which falls were treated as a categorical variable, chronic pain was the only significant factor associated with falls in a binary logistic regression [3]. Similarly, our findings using the ZINB model identified the presence of chronic pain as the only significant factor associated with number of falls in the elderly. We note, however, that in fitting the ZINB model we used the option "constant" for the inflate part in the analysis procedures in Stata, and we reported the modeling of the non-zero counts. We did not include covariates in the inflation part of the ZINB model because doing so prevented the model from converging; the covariates were therefore omitted from that part.