Double Generalized Beta-Binomial and Negative Binomial Regression Models

Overdispersion is a common phenomenon in count datasets that can greatly affect inferences about the model. In this paper, we develop three joint mean and dispersion regression models to fit overdispersed data. These models are based on reparameterizations of the beta-binomial and negative binomial distributions. Finally, we propose a Bayesian approach to estimate the parameters of the overdispersion regression models and use it to fit a school absenteeism dataset.


Introduction
The binomial and Poisson distributions are widely used to fit discrete count data. However, a serious complication arises when the variance of the response variable Y exceeds the nominal variance. This phenomenon, called overdispersion, can lead to underestimation of standard errors (Cox 1983) and to misleading inferences about the regression parameters, such as confidence intervals (Breslow 1984). In practice, this phenomenon is very common, and data exhibiting exactly the theoretical dispersion are sometimes considered the exception.
Many authors have proposed models and estimation methods to fit overdispersed binomial and Poisson data. Regarding binomial data, authors such as Williams (1982) and Collet (1991) studied extra-binomial variation, another name for overdispersion in binomial data, and proposed methods such as incorporating an extra-binomial variation component into the maximum likelihood estimation of the log-linear model. Moreover, Breslow (1984) analyzed extra-Poisson variation data (or overdispersed Poisson data) by extending the results of Williams (1982) to the Poisson case. Later, Lawless (1987) explored the robustness and efficiency of the methods used to deal with extra-Poisson variation in regression models. In turn, McCullagh & Nelder (1989) presented a more general discussion of overdispersion within the framework of generalized linear models. Demétrio & Hinde (1998) stated that the different models for overdispersion can be categorized by two general approaches:

1. Those including additional parameters for the variance function.

2. Those assuming a two-stage model, meaning the distribution parameter is itself a random variable.

Demétrio & Hinde (1998) proposed the Hinde-Demétrio (HD) regression models to handle overdispersed count data and discussed the effect of the dispersion parameters on the fit. On this topic, it is worth mentioning the work of Quintero-Sarmiento, Cepeda-Cuervo & Núñez-Antón (2012), who published a review of overdispersion and the different methods to model it, such as GLMs with random mean, quasi-likelihood functions, and double exponential families.
The aim of this paper is to introduce new reparameterizations of the beta-binomial and negative binomial distributions, to propose reparameterized overdispersed regression models, and to develop Bayesian and classic methods to fit the proposed models. The paper is organized as follows. In Section 2, we define the (µ, φ)-beta-binomial distribution and analyze its main statistical properties. Section 3 presents the (µ, α)-negative binomial distribution: its definition, mean, variance, graphs, density, and characteristic function. Section 4 develops the (µ, σ²)-negative binomial distribution. In Section 5, the definitions of the proposed overdispersion models are established. In Section 6, a Bayesian approach based on the Metropolis-Hastings algorithm is proposed to fit the proposed models. As an extension, Section 7 discusses the application of the Newton-Raphson method to fit the proposed models. Finally, in Section 8 we present the motivating data and analyze the factors affecting school absenteeism in rural New South Wales.

(µ, φ)-Beta Binomial Distribution
Many day-to-day phenomena show binomial behavior, so improving the fit of binomial distribution models is important. However, in real experiments most of these models exhibit a significant number of trials with estimated variance larger than that predicted by the theoretical binomial model (Demétrio & Hinde 1998). In this case, the beta-binomial model is an alternative to the binomial one, since it captures the overdispersion and thereby fits the observed data better (Williams 1975). To correct estimation problems caused by overdispersion in binomial families, the parameter p (proportion of successes) is assumed to be a random variable following a beta distribution.
To build a reparameterization of the beta-binomial distribution, we first refer to the reparameterization of the beta distribution, proposed by Jørgensen (1997) and later used in Cepeda-Cuervo (2001). If X ∼ Beta(α, β), an appropriate reparameterization of the beta distribution in terms of the mean and a dispersion parameter is given by

f(x; µ, φ) = [Γ(φ) / (Γ(µφ) Γ((1 − µ)φ))] x^(µφ − 1) (1 − x)^((1 − µ)φ − 1), 0 < x < 1,   (1)

where µ = α/(α + β) and φ = α + β (Ferrari & Cribari-Neto 2004). In (1), Γ(·) denotes the gamma function. Under this parameterization, the mean and variance are E(X) = µ and Var(X) = µ(1 − µ)/(φ + 1). In this paper we use p ∼ Beta(µ, φ) to denote that p follows a beta distribution with mean µ and dispersion parameter φ. Thus, the (µ, φ)-beta-binomial distribution is the distribution of a random variable Y that, conditional on p, has a binomial distribution Y | p ∼ Bin(m, p), where p is a random variable with beta distribution p ∼ Beta(µ, φ). We use the notation Y ∼ BB(µ, φ) to denote that Y has a beta-binomial distribution with mean µ and dispersion parameter φ.
The (µ, φ)-beta-binomial density function is given by

f(y; µ, φ) = C(m, y) B(y + µφ, m − y + (1 − µ)φ) / B(µφ, (1 − µ)φ), y = 0, 1, . . ., m,

where B(x, y) is defined in terms of the gamma function as B(x, y) := Γ(x)Γ(y)/Γ(x + y). The mean and variance of Y are E(Y) = mµ and

Var(Y) = mµ(1 − µ)[1 + (m − 1)/(φ + 1)].

Therefore, the variance of a (µ, φ)-beta-binomial variable exceeds the binomial variance mµ(1 − µ) whenever m > 1. The behavior of the beta-binomial density function is illustrated in Figure 1. Each of the four graphs shows the beta-binomial distribution for different mean and dispersion values, with m (the number of trials) held constant at 20. The title of each graph contains two numbers, which represent, respectively, the mean and dispersion parameters. For instance, in the first graph (top-left box), µ = 0.3 and φ = 2; thus the mean of the beta-binomial variable is mµ = 6, and since the dispersion parameter is small, the graph does not show a noticeable accumulation around the mean 6. On the other hand, the third distribution (bottom-left box), despite having the same mean as the first, shows greater accumulation around 6 because the dispersion parameter is larger (φ = 20). The other graphs can be interpreted similarly. The characteristic function of the beta-binomial distribution follows from that of the binomial distribution by integrating over p: since p ∼ Beta(µ, φ), E(p^(k+j)) is the (k + j)-th moment of the beta distribution. In fact, the characteristic function of the beta-binomial distribution can be written as φ_Y(t) = E_p[(1 − p + p e^(it))^m].
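As a numerical check of these formulas, the following Python sketch (our own illustration, not part of the original paper; the function name is an assumption) evaluates the (µ, φ)-beta-binomial pmf and recovers the mean and variance for the first panel of Figure 1 (m = 20, µ = 0.3, φ = 2):

```python
import math

def beta_binomial_pmf(y, m, mu, phi):
    """P(Y = y) for Y ~ BB(mu, phi): a Bin(m, p) pmf with
    p ~ Beta(mu*phi, (1 - mu)*phi) integrated out."""
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    a, b = mu * phi, (1 - mu) * phi
    return math.exp(math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
                    + log_beta(y + a, m - y + b) - log_beta(a, b))

m, mu, phi = 20, 0.3, 2.0                      # first panel of Figure 1
pmf = [beta_binomial_pmf(y, m, mu, phi) for y in range(m + 1)]
mean = sum(y * p for y, p in enumerate(pmf))
var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
# mean = m*mu = 6 and var = m*mu*(1-mu)*(1 + (m-1)/(phi+1)) = 30.8
```

The computed variance, 30.8, is far above the binomial variance mµ(1 − µ) = 4.2, which is exactly the overdispersion the model is meant to capture.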
(µ, α)-Negative Binomial Distribution
Cepeda-Cuervo (2001) and Cepeda-Cuervo & Gamerman (2005) propose a new parameterization of the gamma distribution in terms of the mean µ and the shape parameter α. Under this reparameterization, the gamma density function is given by

f(λ; µ, α) = (α/µ)^α λ^(α − 1) e^(−αλ/µ) / Γ(α), λ > 0.

Hereafter, we use λ ∼ G(µ, α) to mean that λ follows a gamma distribution with mean µ and shape parameter α. With this reparameterization, the negative binomial distribution, denoted NB(µ, α), has density function

f(x; µ, α) = [Γ(x + α) / (Γ(α) x!)] (α/(µ + α))^α (µ/(µ + α))^x, x = 0, 1, 2, . . .   (7)

The mean and variance of the random variable X ∼ NB(µ, α) are E(X) = µ and Var(X) = µ + µ²/α. Several phenomena can be studied using negative binomial distributions. Some examples of its applications are: the number of European red mites on apple leaves (Demétrio, Kokonendji & Zocchi 2007); the number of coin flips necessary to obtain a given value (whose domain is the integers from 2 to infinity); or the number of units inspected until exactly a given number of defective units is obtained from a production line. In sum, in the literature this distribution is widely interpreted as the number of independent Bernoulli experiments (success/failure events) required to achieve k successes.
To observe the behavior of the gamma-Poisson density function, we change one parameter at a time, obtaining Figure 2: each graph shows the negative binomial density function for different values of its parameters (µ and α). For instance, in the first (top-left) box there are two numbers, 5 and 8, meaning that the mean of the distribution is 5 and the shape parameter is 8. The graph shows larger density values near 5 and, in comparison with the case µ = 5, α = 4 (top-right box), this first graph has greater accumulation around the mean. The other graphs can be interpreted in the same way. From (7), the characteristic function of the reparameterized negative binomial distribution is given by

φ_X(t) = [α/(µ + α − µe^(it))]^α.

Moreover, setting r = α and p = µ/(µ + α) in equation (7), we get the usual parameterization of the negative binomial density function:

f(x; r, p) = [Γ(x + r)/(Γ(r) x!)] (1 − p)^r p^x, x = 0, 1, 2, . . .,   (11)

The notation X ∼ NB(r, p) is used to denote that a discrete random variable X follows a negative binomial distribution with the parameters r and p defined in (11), where X can intuitively be considered as the number of successes in a sequence of independent and identically distributed Bernoulli trials before r failures occur, with success probability p in each trial.
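The correspondence between the NB(µ, α) and the usual NB(r, p) parameterizations can be checked numerically. The sketch below (our own illustration, with a truncated support) evaluates the (µ, α)-density and recovers E(X) = µ and Var(X) = µ + µ²/α for the top-left panel of Figure 2:

```python
import math

def nb_mu_alpha_pmf(x, mu, alpha):
    """P(X = x) for X ~ NB(mu, alpha); equivalently NB(r, p) with
    r = alpha and p = mu / (mu + alpha)."""
    return math.exp(math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
                    + alpha * math.log(alpha / (mu + alpha))
                    + x * math.log(mu / (mu + alpha)))

mu, alpha = 5.0, 8.0                 # top-left panel of Figure 2
xs = range(200)                      # truncated support; the tail mass is negligible
pmf = [nb_mu_alpha_pmf(x, mu, alpha) for x in xs]
mean = sum(x * p for x, p in zip(xs, pmf))
var = sum((x - mean) ** 2 * p for x, p in zip(xs, pmf))
# mean = mu = 5 and var = mu + mu**2/alpha = 8.125
```

The variance µ + µ²/α exceeds the Poisson variance µ for every finite α, and the Poisson model is recovered as α → ∞.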
Proposed Overdispersion Regression Models
In the BB(µ, φ) regression model, the usual links are the logit function for the mean and the logarithmic function for the dispersion parameter.
In the NB(µ, α) regression model, the usual link functions h and g are both logarithmic.
The NB(µ, σ²) regression model is defined analogously to the NB(µ, α) regression model, with the variance σ² following the second regression structure in place of the shape parameter α.

Bayesian Estimation
In the following subsections we develop the Bayesian method for the three proposed two-stage models: the (µ, φ)-beta-binomial, the (µ, α)-negative binomial, and the (µ, σ²)-negative binomial models. In these models, there are two parameter vectors to estimate: the mean regression parameters β and the shape (or variance) regression parameters γ. The Bayesian method used to fit the proposed models is defined in the following points, after Cepeda-Cuervo (2001).
1. Let Y_i, i = 1, 2, . . ., n, be n independent observed values obtained from one of the two-parameter distributions.
3. The link function should be strictly monotonic and twice differentiable in the classic regression approach, and once differentiable in the Bayesian approach.
4. Without loss of generality, the mean and shape (or variance) regression parameters are assumed to have independent normal prior distributions, β ∼ N(b, B) and γ ∼ N(g, G).

7. Since the posterior conditional distribution π(β | γ) is analytically intractable, we propose to use working variables to build a kernel transition function that proposes posterior samples of the β parameter vector. These working variables are determined by the first-order Taylor approximation of h(·) around the current parameter value of µ_i:

ỹ_1i = h(µ_i) + (y_i − µ_i) h'(µ_i),   (15)

where E(y_i) = µ_i, so that E(ỹ_1i) = h(µ_i) = x_i'β and Var(ỹ_1i) = [h'(µ_i)]² Var(y_i). Therefore, if β^(c) and γ^(c) are the current values of the parameters, the working variables are evaluated at µ_i^(c) = h^(−1)(x_i'β^(c)).

8. For this proposal, we assume the kernel transition function given by

q_1(β | β^(c), γ^(c)) = N(b*, B*), B* = (B^(−1) + X'Σ^(−1)X)^(−1), b* = B*(B^(−1)b + X'Σ^(−1)Ỹ_1),   (16)

where Ỹ_1 = (ỹ_11, . . ., ỹ_1n)' and Σ = Diag(Var(ỹ_1i)). With the kernel function (16), the values of β that appear in the sample of the posterior distribution π(β, γ) are generated.

9. Since the posterior conditional distribution π(γ | β) is analytically intractable, it is necessary to use working variables and a second kernel function to generate posterior samples of the γ parameters. To do this, Cepeda-Cuervo (2001) assumes there are variables t_i such that E(t_i) = τ_i, where τ_i is the second parameter to be modeled, and that τ_i = g^(−1)(z_i'γ). The working variables related to the second parameter, denoted ỹ_2i, are given by the first-order Taylor approximation of g(·) around the current value of τ_i:

ỹ_2i = g(τ_i) + (t_i − τ_i) g'(τ_i),   (18)

with mean E(ỹ_2i) = g(τ_i) = z_i'γ and variance Var(ỹ_2i) = [g'(τ_i)]² Var(t_i).

10. The second kernel transition function, q_2, is

q_2(γ | β^(c), γ^(c)) = N(g*, G*), G* = (G^(−1) + Z'Ψ^(−1)Z)^(−1), g* = G*(G^(−1)g + Z'Ψ^(−1)Ỹ_2),

where Ỹ_2 = (ỹ_21, . . ., ỹ_2n)' and Ψ = Diag(Var(ỹ_2i)).
Once the kernel transition functions and working variables are established, the Metropolis-Hastings algorithm is defined by the following steps:

1. Begin the chain iteration counter at j = 1.
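To make the construction concrete, the following Python sketch (our own illustration, not the authors' code; the simulated dataset, variable names, and the choice to hold α fixed are all assumptions made so that only the working-variable kernel for β is shown) implements a Metropolis-Hastings sampler for the mean parameters of an NB(µ, α) log-linear model:

```python
import math
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from an NB(mu, alpha) log-linear model. alpha is held fixed
# so the sketch shows only the beta-update; the gamma-update is analogous,
# built from the second working variable.
n, alpha = 200, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
mu_true = np.exp(X @ beta_true)
# numpy's parameterization: mean = n*(1-p)/p, so p = alpha/(mu+alpha) gives mean mu
y = rng.negative_binomial(alpha, alpha / (mu_true + alpha))

def loglik(beta):
    mu = np.exp(np.clip(X @ beta, -20.0, 20.0))
    return sum(math.lgamma(yi + alpha) - math.lgamma(alpha) - math.lgamma(yi + 1)
               + alpha * math.log(alpha / (mi + alpha)) + yi * math.log(mi / (mi + alpha))
               for yi, mi in zip(y, mu))

b0, B0inv = np.zeros(2), np.eye(2) / 100.0   # normal prior N(0, 100 I) on beta

def kernel(beta_c):
    """Mean and covariance of the working-variable proposal, given beta_c."""
    mu = np.exp(np.clip(X @ beta_c, -20.0, 20.0))
    y_tilde = np.log(mu) + (y - mu) / mu        # first-order Taylor of log(mu_i)
    w = mu**2 / (mu + mu**2 / alpha)            # 1 / Var(y_tilde_i)
    Bn = np.linalg.inv(B0inv + X.T @ (X * w[:, None]))
    bn = Bn @ (B0inv @ b0 + X.T @ (w * y_tilde))
    return bn, Bn

def log_q(beta, bn, Bn):
    # multivariate normal log-density up to a constant (cancels in the ratio)
    d = beta - bn
    return -0.5 * (d @ np.linalg.solve(Bn, d) + np.linalg.slogdet(Bn)[1])

def log_post(beta):
    d = beta - b0
    return loglik(beta) - 0.5 * d @ B0inv @ d

beta_c = np.array([math.log(y.mean() + 1.0), 0.0])   # crude starting value
draws = []
for _ in range(2000):
    bn_c, Bn_c = kernel(beta_c)
    beta_p = rng.multivariate_normal(bn_c, Bn_c)     # propose from the kernel
    bn_p, Bn_p = kernel(beta_p)
    log_ratio = (log_post(beta_p) + log_q(beta_c, bn_p, Bn_p)
                 - log_post(beta_c) - log_q(beta_p, bn_c, Bn_c))
    if math.log(rng.uniform()) < log_ratio:          # Metropolis-Hastings acceptance
        beta_c = beta_p
    draws.append(beta_c)
post_mean = np.mean(draws[500:], axis=0)             # discard burn-in
```

Because the proposal recenters at a weighted least-squares fit of the working variables at every step, acceptance rates are typically high and the chain moves quickly toward the posterior mode.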

Working Variables in the BB(µ, φ) Regression Model
We assume the (µ, φ)-beta-binomial regression model defined in Section 5, with the logit link function for the mean and the logarithmic link function for the dispersion parameter. Thus, the working variables defining the kernel transition function that proposes samples of the mean regression parameters follow from equation (15). To propose a kernel transition function to obtain samples of the dispersion regression parameter vector γ, we use the working variables established in equation (18). These working variables were proposed by Cepeda-Cuervo, Migon, Garrido & Achcar (2014) in their framework of generalized linear models with random effects. To obtain them, we must find an expression t_i such that E(t_i) = φ_i; then, using equation (18), the working variable ỹ_2i is defined, evaluated at the current parameter values µ_i^(c) and φ_i^(c), and the variances of these working variables follow accordingly, for i = 1, . . ., n.

Working Variables in the NB(µ, α) Regression Model
In the NB(µ, α) regression model defined in Section 5, assuming the logarithmic function for both the mean and the shape parameter, and taking t_i = y_i in the first-order Taylor approximation, the working variables used to define the kernel transition function that proposes samples of the mean regression parameters are given by

ỹ_1i = log(µ_i) + (y_i − µ_i)/µ_i.   (20)

Following Cepeda-Cuervo et al. (2014), the working variables used to propose samples of the shape regression parameters require a variable t such that E(t) = α; thus t = αy/µ. Then, from the first-order Taylor approximation of the logarithmic function around α, the working variables are

ỹ_2i = log(α_i) + (t_i − α_i)/α_i = log(α_i) + (y_i − µ_i)/µ_i.   (21)

The variances of these working variables are, respectively, Var(ỹ_1i) = (µ_i + µ_i²/α_i)/µ_i² and Var(ỹ_2i) = Var(ỹ_1i).
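A small numerical check (our own illustration, using a truncated support) confirms that the working variable for the log link is unbiased for log µ when expanded at the true mean, and evaluates its variance:

```python
import math

def nb_pmf(x, mu, alpha):
    return math.exp(math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
                    + alpha * math.log(alpha / (mu + alpha))
                    + x * math.log(mu / (mu + alpha)))

def y_tilde(y, mu_c):
    """Working variable for the log link: first-order Taylor expansion of
    log(mu) around mu_c, with the observation y in place of its mean."""
    return math.log(mu_c) + (y - mu_c) / mu_c

mu, alpha = 5.0, 8.0
# Expectation of the working variable under NB(mu, alpha), expanding at the true mu:
e_y_tilde = sum(nb_pmf(x, mu, alpha) * y_tilde(x, mu) for x in range(400))
# Its variance: Var(Y)/mu**2 = (mu + mu**2/alpha) / mu**2
var_y_tilde = (mu + mu**2 / alpha) / mu**2
```

Since E(Y) = µ, the linear term averages out and E(ỹ_1) = log µ exactly; the truncation at 400 only introduces a negligible tail error.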

Working Variables in the N B(µ, σ 2 ) Regression Model
Assuming the logarithmic link function for the mean and variance parameters, the working variables ỹ_1i, obtained using t = Y, and ỹ_2i, obtained using t = σ²Y/µ, remain as in equations (20) and (21), respectively. Therefore, the variances of the working variables are Var(ỹ_1i) = Var(ỹ_2i) = σ_i²/µ_i².

Proposed Regression Models: A Classic Approach
To obtain the parameter estimates of the proposed models using the Newton-Raphson algorithm, in this section we develop the first-order and second-order partial derivatives of the logarithm of the likelihood functions for each of the proposed models. For each of the regression models defined in Section 5, the logarithm of the likelihood function is given by

l(θ) = Σ_{i=1}^{n} l_i(θ),   (22)

where l_i(θ) = log f(x_i | θ) and θ = (β, γ) is the vector of the regression parameters.
Therefore, the first-order partial derivatives of the logarithm of the likelihood function are obtained by the chain rule through µ_i and τ_i, and the second-order partial derivatives include the cross terms in x_ji z_si, j = 1, . . ., p, s = 1, . . ., r. Thus, denoting by H the Hessian matrix and by q the vector of first derivatives of the logarithm of the likelihood function, the Newton-Raphson updating equation

θ^(t+1) = θ^(t) − H^(−1)(θ^(t)) q(θ^(t))   (28)

is well defined. The parameter estimates are obtained by setting initial parameter values and applying the Newton-Raphson equation until a convergence criterion is satisfied.
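The update can be sketched generically. The Python fragment below (an illustration under assumed names; the toy example uses a Poisson log-linear model, the non-overdispersed baseline, rather than the BB or NB likelihoods) iterates θ ← θ − H⁻¹q until the step is small:

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=50):
    """Generic Newton-Raphson: theta <- theta - H^{-1} q until the step is small."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.max(np.abs(step)) < tol:
            break
    return theta

# Toy use: Poisson log-linear regression (the non-overdispersed baseline).
# The BB and NB models plug in their own gradient q and Hessian H instead.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
beta_true = np.array([0.5, -0.3])
y = rng.poisson(np.exp(X @ beta_true))

grad = lambda b: X.T @ (y - np.exp(X @ b))               # score vector q
hess = lambda b: -(X * np.exp(X @ b)[:, None]).T @ X     # Hessian H
beta_hat = newton_raphson(grad, hess, np.zeros(2))
```

At convergence the score is numerically zero, so beta_hat is the maximum likelihood estimate; for the proposed models only grad and hess change.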

Fitting the BB(µ, φ) Regression Model
To fit the BB(µ, φ) regression model, if Y_i ∼ BB(µ_i, φ_i), i = 1, . . ., n, is a sample of the variable of interest, the logarithm of the likelihood function is given by equation (22), where the i-th component l_i is

l_i = log C(m_i, y_i) + log B(y_i + µ_iφ_i, m_i − y_i + (1 − µ_i)φ_i) − log B(µ_iφ_i, (1 − µ_i)φ_i).
Thus, using the development in equations (23) and (24), the first-order derivatives of the logarithm of the likelihood function are expressed in terms of Ψ(z), the digamma function, Ψ(z) := (d/dz) ln Γ(z). The second-order partial derivatives are given by equations (25), (26), and (27). With these derivatives, the vector of first derivatives q and the Hessian matrix H of equation (28) are well defined, and thus the Newton-Raphson algorithm can be applied to obtain the maximum likelihood estimates of the model's regression parameters.
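As a sanity check on the digamma form of the score, the following sketch (our own; the numerical digamma is a finite-difference stand-in for scipy.special.digamma, and the derivative expression is the one obtained by differentiating the BB log-density in µ) compares the analytic µ-derivative with a central finite difference:

```python
import math

def log_bb(y, m, mu, phi):
    """Log-density of Y ~ BB(mu, phi)."""
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    a, b = mu * phi, (1 - mu) * phi
    return (math.lgamma(m + 1) - math.lgamma(y + 1) - math.lgamma(m - y + 1)
            + log_beta(y + a, m - y + b) - log_beta(a, b))

def digamma(z, h=1e-6):
    # numerical digamma via lgamma; scipy.special.digamma would serve as well
    return (math.lgamma(z + h) - math.lgamma(z - h)) / (2 * h)

y, m, mu, phi = 7, 20, 0.3, 2.0
# Differentiating log_bb with respect to mu leaves only digamma terms:
d_analytic = phi * (digamma(y + mu * phi) - digamma(mu * phi)
                    - digamma(m - y + (1 - mu) * phi) + digamma((1 - mu) * phi))
eps = 1e-6
d_numeric = (log_bb(y, m, mu + eps, phi) - log_bb(y, m, mu - eps, phi)) / (2 * eps)
```

The two values agree to several decimal places, which is a useful unit test when implementing the gradient q by hand.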

Fitting NB(µ, α) Regression Model
In the negative binomial regression model, if Y_i ∼ NB(µ_i, α_i), i = 1, . . ., n, is a sample from the negative binomial distribution, the logarithm of the likelihood function is given by equation (22), where the i-th component l_i is

l_i = log Γ(y_i + α_i) − log Γ(α_i) − log(y_i!) + α_i log(α_i/(µ_i + α_i)) + y_i log(µ_i/(µ_i + α_i)).
Thus, from equations (23) and (24), the first-order derivatives of the logarithm of the likelihood function are obtained, and from equations (25), (26), and (27), so are the second-order derivatives. With these derivatives, the Newton-Raphson equation (28) can be applied, and the maximum likelihood estimates of the regression parameters can be obtained using the Newton-Raphson algorithm.
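Differentiating the NB(µ, α) log-density in (7) with respect to µ gives ∂l_i/∂µ_i = y_i/µ_i − (y_i + α_i)/(µ_i + α_i); the short check below (our own illustration) verifies this against a finite difference:

```python
import math

def log_nb(x, mu, alpha):
    """Log-density of X ~ NB(mu, alpha), the i-th log-likelihood term."""
    return (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
            + alpha * math.log(alpha / (mu + alpha)) + x * math.log(mu / (mu + alpha)))

x, mu, alpha = 3, 5.0, 8.0
d_analytic = x / mu - (x + alpha) / (mu + alpha)   # score with respect to mu
eps = 1e-6
d_numeric = (log_nb(x, mu + eps, alpha) - log_nb(x, mu - eps, alpha)) / (2 * eps)
```

Summing this score over observations (through the chain rule of the link) yields the µ-part of the gradient q used by the Newton-Raphson iteration.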

Fitting the NB(µ, σ 2 ) Regression Model
If Y_i ∼ NB(µ_i, σ_i²), i = 1, . . ., n, is a sample of the variable of interest, then the logarithm of the likelihood function is given by equation (22). The second derivatives can be obtained from equations (25), (26), and (27), as in Sections 7.1 and 7.2, and the maximum likelihood estimates of the regression parameters can be obtained by applying the Newton-Raphson algorithm.

School Absenteeism Data
The data analyzed in this paper were presented in Quine (1975) and come from a sociological study of Australian Aboriginal and white children from Walgett, New South Wales. There are nearly equal numbers of the two sexes and of the two cultural groups. Children were classified by culture, age, sex, and learner status, and the number of days absent from school in a particular school year was recorded. The response variable of interest is the number of days a child was absent from school during the year; children who had suffered a serious illness during the year were excluded from this analysis.
Assuming a mean structure (29) without the age variable and the shape structure given by (30), the parameter estimates, standard deviations, and 95% credible intervals are reported in Table 2. Again, we calculated the BIC with both programs: the R software yielded a BIC equal to 829.83, while the BIC for the OpenBUGS estimation was 833.92.
Therefore, from the parameter estimates given in Table 2, it is possible to conclude that the variable Days.absent decreases as Cultural.background or Learning.ability increases. For slow learner status (0), the mean of days absent is 2.4927 for Aboriginal cultural background and 3.1571 for white cultural background. For average learner status (1), the mean of days absent is 8.61 for white cultural background and 16.73 for Aboriginal cultural background.

Conclusions
In this paper, new parameterizations of the beta-binomial and negative binomial distributions have been proposed. These formulations, in terms of the mean and shape parameters, are good options for explaining the behavior of overdispersed count data. From these reparameterizations, overdispersed regression models have been defined by assuming that the two parameters of these models follow regression structures.
We also propose Bayesian and classic methods to fit these models, which we apply to analyze school absenteeism data. In this application, the proposed methods show suitable performance, and the posterior inferences obtained by applying the Bayesian method in RStudio and in OpenBUGS agree. The results show that the mean depends on the cultural background and the learning ability, while the shape parameter depends on the cultural background.

Appendix
In the classic approach, some calculations related to the first and second derivatives of the different likelihood functions were omitted. In this section, some of them are presented.
• Some details of the derivatives of the BB(µ, φ) and negative binomial maximum likelihood functions used in the Newton-Raphson algorithm (Section 7.1).
• Some details of the derivatives of the logarithm of the NB(µ, α) maximum likelihood function that were omitted in Section 7.2.
• Finally, regarding Section 7.3, the first derivatives of the (µ, σ²)-NB log-likelihood function.
Revista Colombiana de Estadística 40 (2017) 141-163

In Figure 3, each graph represents the NB(µ, σ²) distribution for different values of the mean and variance. The first number at the top of each box refers to the mean µ and the second to the variance σ². For instance, in the first (top-left) box, the numbers 8 and 10 are, respectively, the mean and variance of the NB(µ, σ²) distribution.

Table 2: Parameter estimates of the beta-binomial regression model.