Overdispersed Nonlinear Regression Models

In this paper we propose nonlinear regression models in the biparametric family of distributions. Within this class we propose two new classes of overdispersed nonlinear regression models: the first defined from the overdispersion family of distributions proposed by Dey, Gelfand, and Peng (1994), and the second from a class of compound distributions. For these models, we develop a Bayesian method in which samples of the posterior distributions are obtained by applying an iterated Metropolis-Hastings algorithm, obtained by assuming two groups of parameters defined by the mean and dispersion regression structures. In the first subclass of models, to improve the performance of the iterated Metropolis-Hastings algorithm, we develop working variables from the application of the Fisher scoring algorithm to build the kernel transition functions. A Bayesian method to fit compound regression models is also proposed. Finally, we present an application to a neonatal mortality dataset to illustrate the use of the proposed models and the performance of the Bayesian fitting method.


Introduction
The binomial and Poisson distributions are usually used to model count datasets. However, count data frequently present overdispersion, meaning that the sample variance exceeds the nominal variance of the model. If an appropriate model is not considered, the analysis can present inferential problems, such as underestimation of the errors and variance, loss of efficiency in the estimates and incorrect confidence intervals for the regression parameters (Williams 1982; Cox 1983; Collett 1991). Multiple models have been developed to analyze overdispersed datasets, and they have been categorized into two classes (Hinde and Demétrio 1998): models in which a new independent parameter is added to model the variance and/or overdispersion, and those in which the parameter of the discrete distribution is assumed to follow an appropriate distribution function. The first class contains models developed by Williams (1982) for binomial count data and by Breslow (1984) for Poisson count data. To this class also belong the double exponential family of distributions presented by Efron (1986), the exponential dispersion models proposed by Jorgensen (1987) and the overdispersed generalized linear models presented by Dey et al. (1994). The second class includes the overdispersion models developed by Crowder (1978), who proposed regression analysis of proportions based on the beta-binomial distribution, and by Margolin, Kaplan, and Zeiger (1981), who formulated the negative binomial distribution for the analysis of data from the Ames test used for the detection of Salmonella.
The overdispersed models proposed by Dey et al. (1994) are linear in the sense that their systematic components are linear functions of the regression parameters. However, a linear mean regression structure is frequently not appropriate to describe the dataset's behavior, for instance when the mean of the variable of interest follows a nonlinear regression structure. Thus, in this paper we propose overdispersed nonlinear regression models in which the mean systematic component is given by

$$\eta_{1i} = S(x_i, \beta), \qquad (1)$$

where $x_i = (x_{1i}, \ldots, x_{pi})^T$ is the $i$-th vector of observations of $p$ covariates and $S(x_i, \cdot)$ is a known nonlinear function of $\beta = (\beta_1, \ldots, \beta_p)^T$, a vector of unknown regression parameters.
In the case of count data, Frome (1983) analyzed lung cancer mortality data among British physicians who were regular cigarette smokers, using a nonlinear Poisson regression model derived from the multistage theory of carcinogenesis, relating the mortality rate to the time of exposure (years of smoking) and the amount of carcinogen applied per unit time (cigarettes per day). McCullagh and Nelder (1989), in Chapter 11, presented an experiment in which the number of grasshoppers that die, out of an initial exposed number (a binomial response variable), is observed for different combinations of doses of an insecticide and a synergist (a substance that potentiates the insecticide), and analyzed some nonlinear functions to model the proportion of dead grasshoppers. Another study, reported in Donoso, Carvajal, Vera, and Poblete (2014), involved the relationship between maternal age and fetal, neonatal, infant and maternal mortality rates in Chile. They found that the lowest mortality rates for both newborns (under 28 days) and infants occurred when the mothers were between 25 and 30 years old, an intermediate age, generating a nonlinear relationship between the rate and maternal age. Other examples are given in Frome, Kutner, and Beauchamp (1973) and Frome and Dufrain (1986), where the effects of radiation on the survival of stem cells and the rate of neutron decay in a beryllium sample were analyzed using nonlinear regression functions.
In this paper, we propose nonlinear (linear) regression models in the biparametric family of distributions ($F_2$). In this class of models we propose two new classes of overdispersed nonlinear regression models. The first is defined from the family of overdispersed distributions proposed by Dey et al. (1994), and the second from a class of compound distributions obtained by assuming that the parameter of a one-parameter distribution follows an appropriate distribution belonging to $F_2$; in both, the mean systematic component is assumed to follow a nonlinear (linear) regression structure.
For these models we develop a Bayesian method in which samples of the posterior distributions, denoted by $\pi(\cdot)$, are obtained by applying an iterated Metropolis-Hastings (M-H) algorithm, obtained by assuming two groups of parameters defined by the mean and dispersion regression structures. To build the kernel transition functions used to propose samples of $\pi(\cdot)$ in the M-H algorithm, we follow the Bayesian methods proposed in Cepeda-Cuervo (2001) and Cepeda and Gamerman (2005), developing working variables from the first-order Taylor approximation of the nonlinear mean regression structures or from the application of the Fisher scoring algorithm. In the first subclass of models, defined from the overdispersed family of distributions of Dey et al. (1994), to improve the performance of the iterated M-H algorithm, we develop working variables from the Fisher scoring algorithm, used to obtain maximum likelihood estimates of the parameters, and use them to build the kernel transition functions.
Finally, we propose a Bayesian method to fit compound models. For these models, we also propose an iterated M-H algorithm, where the mean kernel transition function is obtained from the combination of the prior distribution and an observational model proposed from a working variable obtained from the first-order Taylor approximation of the mean nonlinear regression structure. To define the kernel transition function for the dispersion regression parameters, we propose to use the working observational variables obtained from the Fisher scoring algorithm of the overdispersed models, but applied to a sample of the distribution used to define the compound models, drawn at the current values of the mean and dispersion regression parameters.
The performance of the Bayesian method extended in this paper to fit nonlinear double generalized regression models has been demonstrated in the framework of generalized linear models (Cepeda-Cuervo 2001), heteroscedastic Weibull-normal mixture models (Cepeda-Cuervo, Migon, Garrido, and Achcar 2014), heteroscedastic normal-exponential mixture models (Garrido, Cepeda-Cuervo, and Achcar 2011) and heteroscedastic nonlinear regression models (Cepeda-Cuervo and Achcar 2010). Finally, we report some results of an application of the nonlinear negative binomial models in the analysis of a neonatal mortality dataset.
After this introduction, the paper is structured as follows. In Section 2, the nonlinear biparametric regression models are defined, including the class of overdispersed nonlinear regression models defined from the family of Dey et al. (1994) and the compound family of nonlinear regression models. In Section 3, a Bayesian method to fit nonlinear regression models in $F_2$ is defined; the working variables obtained from the application of the Fisher scoring algorithm to fit overdispersed nonlinear regression models are used to define this method. In Section 4, building on the preceding maximum likelihood and Bayesian developments, a Bayesian method to fit nonlinear compound regression models is proposed through the introduction of latent parameters. Section 5 presents an application to a neonatal mortality dataset. Section 7 is an appendix that includes a summary of the main results presented in Cepeda and Gamerman (2005), related to double generalized linear models, and of the maximum likelihood and Bayesian methods used to obtain parameter inferences.

Nonlinear biparametric regression models
In the framework of the biparametric family of distributions, Gelfand and Dalal (1990) discussed the biparametric exponential family of distributions given by

$$f(y|\delta, \tau) = b(y)\exp\{\delta y + \tau T(y) - \rho(\delta, \tau)\}, \qquad (2)$$

where, if $Y$ is a continuous random variable, $f$ is a density function with respect to the Lebesgue measure, while if $Y$ is a discrete random variable, $f$ is defined with respect to the counting measure. The components of $(\delta, \tau) \in S$ are the parameters of the distribution family (2), where $S \subseteq \mathbb{R} \times \mathbb{R}$ denotes their parametric space. To obtain orthogonality between the parameters of the distribution, Dey et al. (1994) proposed a new parameterization of (2) in terms of the mean of $Y$, $\mu := E(Y|\delta, \tau) = \frac{\partial}{\partial \delta}\rho(\delta, \tau)$, as in (3), where $\Psi^{(i,j)}$ denotes the mixed partial derivatives of the cumulant function. Some examples of distributions belonging to the biparametric exponential family of distributions (3), denoted by $F_2^e$, are the normal $N(\mu, \sigma^2)$, gamma $G(\mu, \alpha)$, lognormal $LN(\mu, \sigma^2)$ and inverse Gaussian $IG(\mu, \lambda)$, among others. The beta distribution is an example of a distribution that does not belong to the exponential family (2), but the density function of the random variable obtained by the logit transformation, $\tilde{Y} = \mathrm{logit}(Y)$, does belong to this family. The overdispersed exponential family of distributions proposed by Jorgensen (1987) can be seen as part of the double exponential family of distributions proposed by Efron (1986), which in turn is part of the biparametric exponential family proposed by Dey et al. (1994). When the dispersion parameter $\tau$ is known in (2), a one-parameter exponential family of distributions is obtained, which includes the exponential $\mathrm{Exp}(\lambda)$, Poisson $P(\lambda)$, binomial $\mathrm{Bin}(m, p)$ and, in general, all the power series distributions introduced by Noack (1950); see McCullagh and Nelder (1989).
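As a quick numerical illustration of the family (2), the cumulant function $\rho(\delta, \tau)$ is the log of the normalizing integral. The sketch below is not from the paper: it takes the special case $b(y) = 1$, $T(y) = y^2$ with $\tau < 0$ (a Gaussian-type member), for which $\rho$ has a closed form, and checks it against a trapezoidal approximation of $\log \int \exp\{\delta y + \tau y^2\}\,dy$.

```python
import math

def rho_closed_form(delta, tau):
    # For b(y) = 1, T(y) = y^2 and tau < 0:
    # int exp(delta*y + tau*y^2) dy = sqrt(pi / -tau) * exp(delta^2 / (-4*tau)),
    # so rho(delta, tau) = delta^2/(-4*tau) + 0.5*log(pi / -tau)
    return delta ** 2 / (-4.0 * tau) + 0.5 * math.log(math.pi / -tau)

def rho_numeric(delta, tau, lo=-50.0, hi=50.0, n=100000):
    # log of a trapezoidal approximation of the normalizing integral
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        y = lo + i * h
        w = 0.5 if i in (0, n) else 1.0
        total += w * math.exp(delta * y + tau * y * y)
    return math.log(total * h)
```

For, say, $\delta = 1$, $\tau = -0.5$, the two values agree to several decimal places, confirming that $\rho$ plays the role of the log normalizing constant.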
From the one- and two-parameter exponential families of distributions, we define a compound family of distributions by the combination of a distribution $F_1(\theta)$, belonging to the one-parameter exponential family, and an appropriate distribution $F_2^e(\mu, \tau)$, belonging to the two-parameter exponential family (3), by assuming that $\theta \sim F_2^e(\mu, \tau)$. Thus, if a random variable $Y$ follows a distribution $F_1(\theta)$ belonging to the one-parameter exponential family of distributions, where $\theta$ follows a distribution $F_2(\mu, \tau)$ belonging to the exponential family (3), with $\mu := E(\theta|\mu, \tau) > 0$ and $\tau \in \mathbb{R}$ a dispersion parameter, the compound distribution of $Y$, denoted $Y \sim F(\mu, \tau)$, is given by

$$f(y|\mu, \tau) = \int f_{1(\theta)}(y)\, f_2(\theta|\mu, \tau)\, d\theta, \qquad (4)$$

where $f_{1(\theta)}(y)$ is the density (probability) function of $F_1(\theta)$ and $f_2(\theta|\mu, \tau)$ is the density of the distribution function $F_2(\mu, \tau)$. Henceforth, we use $F(\mu, \tau)$ to denote the compound distribution functions, with density (probability) function given by (4). The compound distributions are also two-parameter distributions.
An important subfamily of the compound models comprises the compound distributions where $F_1$ is a discrete distribution. Some usual examples of compound discrete distributions are the Poisson normal distribution (Cameron and Trivedi 1998), the $(\mu, \alpha)$-negative binomial distribution and the $(\mu, \phi)$-beta-binomial distribution (Cepeda-Cuervo and Cifuentes-Amado 2017). The first is obtained from the combination of $Y|\lambda \sim P(\lambda)$ and $\lambda \sim \mathrm{Lognormal}(\mu, \tau)$; the second is a composition of the Poisson and gamma distributions, the latter belonging to the exponential family (3); and the third is a composition of the binomial and beta distributions (Hougaard, Lee, and Whitmore 1997).
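The Poisson-gamma composition mentioned above can be checked by simulation: if $\theta \sim G(\mu, \alpha)$ with $E(\theta) = \mu$ and $Y|\theta \sim P(\theta)$, the marginal moments are $E(Y) = \mu$ and $Var(Y) = \mu + \mu^2/\alpha$, the negative binomial moments. A minimal Monte Carlo sketch in plain Python (illustrative only, not part of the paper; the gamma is parameterized by shape $\alpha$ and scale $\mu/\alpha$):

```python
import math
import random

def rpois(lam, rng):
    # Knuth's multiplication sampler; adequate for moderate lam
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def rcompound_nb(mu, alpha, rng):
    # theta ~ Gamma(shape=alpha, scale=mu/alpha), so E(theta) = mu,
    # then Y | theta ~ Poisson(theta); marginally Y ~ NB(mu, alpha)
    theta = rng.gammavariate(alpha, mu / alpha)
    return rpois(theta, rng)

rng = random.Random(42)
n = 100000
ys = [rcompound_nb(5.0, 2.0, rng) for _ in range(n)]
m = sum(ys) / n
v = sum((y - m) ** 2 for y in ys) / (n - 1)
# m should be close to mu = 5.0; v close to mu + mu^2/alpha = 17.5
```

The inflation of the variance above the Poisson value $\mu$ is exactly the overdispersion that motivates the compound class.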
The link functions must be strictly monotonic and twice differentiable to obtain maximum likelihood parameter estimates, and once differentiable in a Bayesian approach. Some usual link functions are the logit, logarithmic and identity functions, depending on the parameter space restrictions of the distribution. The double generalized linear regression models are defined by the same three components, assuming that the systematic components are $\eta_{1i} = x_i^T \beta$ and $\eta_{2i} = z_i^T \gamma$ (Cepeda-Cuervo 2001; Cepeda and Gamerman 2005). Some examples of nonlinear biparametric regression models follow.

Poisson normal regression model. Let $Y_i$, $i = 1, \ldots, n$, be $n$ independent overdispersed random variables that follow a Poisson normal distribution. A nonlinear Poisson normal regression model can be defined by assuming that the systematic components are given by $\eta_{1i} = S(x_i, \beta) + \tau_i/2$ and $\eta_{2i} := z_i^T \gamma$, where $S(x_i, \beta)$ is a nonlinear function of $\beta$, as proposed, for example, in Fokianos (2012).

Beta-binomial regression model. Let $Y_i$, $i = 1, \ldots, n$, be $n$ independent overdispersed random variables that follow a beta-binomial distribution. A nonlinear beta-binomial regression model can be defined by assuming that the systematic components are given by $\eta_{1i} = S(x_i, \beta)$ and $\eta_{2i} = z_i^T \gamma$. The nonlinear regression structure of the systematic component $\eta_{1i}$ was proposed by McCullagh and Nelder (1989) to analyse the proportion of insects killed by exposure to the mixture of an insecticide and a synergist. Finally, given that $0 < \mu_i < 1$ and $\phi_i > 0$, possible link functions are $\mathrm{logit}(\mu_i) = \eta_{1i}$ and $\log(\phi_i) = \eta_{2i}$.
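The overdispersion induced by the beta-binomial composition can also be illustrated numerically. Under the $(\mu, \phi)$ parameterization with $a = \mu\phi$ and $b = (1-\mu)\phi$, the beta-binomial variance is $m\mu(1-\mu)[1 + (m-1)/(\phi+1)]$, which exceeds the nominal binomial variance $m\mu(1-\mu)$. A simulation sketch (illustrative only, not from the paper):

```python
import random

def rbetabinom(m, mu, phi, rng):
    # p ~ Beta(a, b) with a = mu*phi, b = (1-mu)*phi, then Y | p ~ Binomial(m, p)
    p = rng.betavariate(mu * phi, (1.0 - mu) * phi)
    return sum(1 for _ in range(m) if rng.random() < p)

rng = random.Random(7)
m, mu, phi, n = 20, 0.3, 2.0, 50000
ys = [rbetabinom(m, mu, phi, rng) for _ in range(n)]
mean = sum(ys) / n
var = sum((y - mean) ** 2 for y in ys) / (n - 1)
binom_var = m * mu * (1 - mu)                    # nominal binomial variance: 4.2
bb_var = binom_var * (1 + (m - 1) / (phi + 1))   # beta-binomial variance: 30.8
```

With a small $\phi$ the sample variance is several times the binomial value, which is the situation the dispersion regression structure $\log(\phi_i) = \eta_{2i}$ is designed to capture.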

Bayesian biparametric nonlinear regression models
In this section we extend the Bayesian method proposed by Gamerman (1997) and Cepeda-Cuervo (2001) to obtain samples from the posterior distribution of the regression parameters by applying the Metropolis-Hastings algorithm. This method was proposed by Gamerman (1997) in the framework of generalized linear mixed models, and extended by Cepeda and Gamerman (2000), Cepeda-Cuervo (2001) and Cepeda and Gamerman (2005) to obtain samples from the posterior distribution in the framework of double generalized linear regression models defined in the $F_2^e$ family of distributions, when functions of the mean and dispersion (variance) parameters follow linear regression structures. Usual examples of biparametric regression models where both parameters follow linear regression structures are the joint mean and dispersion beta regression models and the joint mean and shape gamma regression models, proposed in Cepeda-Cuervo (2001) and fitted by applying Bayesian methods. As in these works, to define Bayesian nonlinear overdispersed models we assign normal prior distributions to the regression parameters and, without loss of generality, assume an independent normal prior distribution $p(\beta, \gamma) = p(\beta)p(\gamma)$, given by

$$\beta \sim N(b, B), \qquad \gamma \sim N(g, G).$$

Thus, if $L(\beta, \gamma)$ denotes the likelihood function defined from a distribution belonging to the exponential family (3) or (4), the posterior distribution is given by $\pi(\beta, \gamma) \propto L(\beta, \gamma)p(\beta, \gamma)$. However, given that $\pi(\beta, \gamma)$ is analytically intractable, we propose to obtain samples of the posterior distribution from the posterior conditional distributions $\pi(\beta|\gamma)$ and $\pi(\gamma|\beta)$, which are also analytically intractable, except for $\pi(\beta|\gamma)$ in the case of normal linear regression models (Cepeda and Gamerman 2001). Thus, we propose to obtain samples from the posterior conditional distributions by applying the M-H algorithm, building kernel transition functions as in Gamerman (1997), Cepeda-Cuervo (2001) and Cepeda and Gamerman (2005) to propose samples of the posterior distributions. These kernels are obtained as the combination of a working observational model, defined by assuming that appropriate working variables follow a normal distribution, and the normal prior distributions of the regression parameters. The working variables are obtained from a linearized form of the mean links and nonlinear regression structures. These variables can be obtained from the Fisher scoring algorithm, as in McCullagh and Nelder (1989) for generalized linear models, or as in Cepeda and Gamerman (2000) and Cepeda and Gamerman (2005) for double generalized linear models. The first-order Taylor approximations of the link functions are used to define the kernel transition functions in the M-H algorithm applied to fit the joint mean and dispersion beta regression models proposed in Cepeda-Cuervo (2001). Examples of kernel transition functions in the framework of DGLMs are summarized in Appendix 7.1.
A general approach to obtain the working variables, from which the working observational models used to build the kernel transition functions are defined, is the Fisher scoring algorithm. In the biparametric nonlinear regression models, if the distribution belongs to the biparametric exponential family (3), the mean working variable is given by (28), as obtained in Section 7.2 by applying the Fisher scoring algorithm. Thus, the working observational model is obtained by assuming that the working variable

$$\tilde{Y}_i = \nabla S(x_i, \beta^{(c)})^T \beta^{(c)} + h'(\mu_i^{(c)})(Y_i - \mu_i^{(c)}), \qquad (6)$$

follows a normal distribution, where $\nabla S(x_i, \beta^{(c)})$ denotes the gradient of $S(x_i, \cdot)$ evaluated at the current value of $\beta$, $\beta^{(c)}$, and $h'(\cdot)$ the first-order derivative of $h(\cdot)$. The mean and variance of these random variables are given by $E(\tilde{Y}_i) = \nabla S(x_i, \beta^{(c)})^T \beta^{(c)}$ and $Var(\tilde{Y}_i) = [h'(\mu_i^{(c)})]^2 Var(Y_i)$. Thus, the kernel transition function $q_1$ to obtain samples of $\beta$ is the posterior distribution function obtained by the combination of the conditional normal prior distribution and the working observational model obtained by assuming that the working variable follows the normal distribution $\tilde{Y}_i \sim N(\nabla S(x_i, \beta^{(c)})^T \beta^{(c)}, Var^{(c)}(\tilde{Y}_i))$. So, the kernel transition function $q_1(\cdot)$ is given by

$$q_1(\beta) = N(b^*, B^*), \quad b^* = B^*(B^{-1}b + \tilde{X}^T \Sigma^{-1} \tilde{Y}), \quad B^* = (B^{-1} + \tilde{X}^T \Sigma^{-1} \tilde{X})^{-1}, \qquad (7)$$

where $\tilde{X}$ is the matrix with $i$-th row $\nabla S(x_i, \beta^{(c)})^T$, $\Sigma = \mathrm{Diag}(Var^{(c)}(\tilde{Y}_i))$ and $\tilde{Y}$ is the vector of observed values of (6).
If $U(\cdot, \cdot)$ in the definition of the systematic component is a linear function of $\gamma$, $g(\tau_i) = z_i^T \gamma$, the working variable used to define the kernel transition function to propose samples of $\gamma$ can be obtained by applying the Fisher scoring algorithm, as in Section 7, or from the first-order Taylor approximation of $g(\cdot)$. In the latter case, given a random variable $t_i$ such that $E(t_i) = \tau_i$, and with $\gamma^{(c)}$ denoting the current value of $\gamma$, the working dispersion variables are given by

$$\tilde{Y}_{2i} = g(\tau_i^{(c)}) + g'(\tau_i^{(c)})(t_i - \tau_i^{(c)}), \qquad (8)$$

where $g'(\cdot)$ denotes the first-order derivative of $g(\cdot)$. Thus, the kernel transition function $q_2(\cdot)$ to obtain samples of $\gamma$ is obtained from the combination of the conditional normal prior distribution, $\gamma|\beta \sim N(g, G)$, and the working observational model obtained from (8) by assuming that $\tilde{Y}_{2i} \sim N(z_i^T \gamma^{(c)}, Var^{(c)}(\tilde{Y}_{2i}))$, by applying Bayes' theorem. Thus, $q_2(\cdot)$ is given by

$$q_2(\gamma) = N(g^*, G^*), \quad g^* = G^*(G^{-1}g + Z^T \Sigma_2^{-1} \tilde{Y}_2), \quad G^* = (G^{-1} + Z^T \Sigma_2^{-1} Z)^{-1}, \qquad (9)$$

where $\Sigma_2 = \mathrm{diag}(Var^{(c)}(\tilde{Y}_{2i}))$, $Z$ is the matrix of values of the dispersion explanatory variables and $\tilde{Y}_2$ is the vector of observed working values. The kernel transition function (9) can also be obtained from the combination of a normal prior distribution of $\gamma$ and the working observational model given by (29), obtained from the Fisher scoring algorithm. With these kernel transition functions, the following iterated Bayesian algorithm is proposed to obtain samples of the posterior parameter distribution of the nonlinear regression models.
1. Begin the chain iteration counter at $j = 1$.
2. Propose a new value $\beta^*$ from the kernel transition function $q_1(\cdot)$, given the current values $(\beta^{(c)}, \gamma^{(c)})$.
3. Accept $\beta^*$ with the Metropolis-Hastings acceptance probability; otherwise keep $\beta^{(c)}$.
4. Propose a new value $\gamma^*$ from the kernel transition function $q_2(\cdot)$, given the current values of the parameters.
5. Accept $\gamma^*$ with the Metropolis-Hastings acceptance probability; otherwise keep $\gamma^{(c)}$.
6. Update the current parameter values and store the state of the chain.
7. Change the counter from $j$ to $j + 1$ and return to 2 until convergence is reached.
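The iterated two-block M-H scheme can be sketched generically. The helper below is illustrative Python, not the paper's implementation: `logpost`, `q1` and `q2` are user-supplied, and each kernel returns a proposal together with the log proposal-density correction $\log q(\text{current}) - \log q(\text{proposal})$ needed in the acceptance ratio. The toy check targets two independent standard normals with normal independence kernels.

```python
import math
import random

def two_block_mh(logpost, q1, q2, beta0, gamma0, n_iter, rng):
    # Iterated M-H: alternately update beta (given gamma) and gamma (given beta)
    beta, gamma = beta0, gamma0
    chain = []
    for _ in range(n_iter):
        prop, logq = q1(beta, gamma, rng)
        accept_log = logpost(prop, gamma) - logpost(beta, gamma) + logq
        if rng.random() < math.exp(min(0.0, accept_log)):
            beta = prop
        prop, logq = q2(beta, gamma, rng)
        accept_log = logpost(beta, prop) - logpost(beta, gamma) + logq
        if rng.random() < math.exp(min(0.0, accept_log)):
            gamma = prop
        chain.append((beta, gamma))
    return chain

# Toy target: independent standard normal "posterior" for (beta, gamma)
def logpost(b, g):
    return -0.5 * (b * b + g * g)

def make_kernel(which, s=1.5):
    # Independence kernel N(0, s^2); correction = log q(cur) - log q(prop)
    def q(beta, gamma, rng):
        cur = beta if which == 1 else gamma
        prop = rng.gauss(0.0, s)
        logq = (prop * prop - cur * cur) / (2.0 * s * s)
        return prop, logq
    return q

rng = random.Random(1)
chain = two_block_mh(logpost, make_kernel(1), make_kernel(2), 0.0, 0.0, 20000, rng)
betas = [b for b, _ in chain[2000:]]  # discard burn-in
```

In the paper's setting, the role of the toy kernels is played by the normal kernels (7) and (9) built from the working observational models, so that the proposals adapt to the current state through $\beta^{(c)}$ and $\gamma^{(c)}$.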
To fit the orthogonal nonlinear overdispersed regression models defined from the biparametric exponential family of distributions (3), we can apply the Fisher scoring algorithm to obtain the mean and dispersion working variables (28) and (29), used to build the kernel transition functions (7) and (9), respectively. Thus, samples of the posterior distributions of $\beta$ and $\gamma$ can be obtained by applying the same Bayesian Metropolis-Hastings algorithm defined above.
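The conjugate normal-normal step behind kernels such as (7) and (9) combines a $N(b, B)$ prior with working observations $\tilde{Y}_i \sim N(\tilde{x}_i \beta, \sigma_i^2)$ to give the proposal $N(b^*, B^*)$. A scalar-parameter sketch of this computation (illustrative, not the paper's code):

```python
def normal_kernel(b, B, xs, ytilde, variances):
    # Posterior N(bstar, Bstar) for a scalar coefficient beta:
    #   Bstar = (1/B + sum x_i^2 / var_i)^(-1)
    #   bstar = Bstar * (b/B + sum x_i * ytilde_i / var_i)
    prec = 1.0 / B + sum(x * x / v for x, v in zip(xs, variances))
    Bstar = 1.0 / prec
    bstar = Bstar * (b / B + sum(x * y / v for x, y, v in zip(xs, ytilde, variances)))
    return bstar, Bstar

# With a nearly flat prior, the kernel mean approaches the weighted
# least-squares estimate of the working regression.
bstar, Bstar = normal_kernel(0.0, 1e8, [1.0, 1.0], [2.0, 4.0], [1.0, 1.0])
```

In the multiparameter case the same formulas hold with matrices, as in (7) and (9); the scalar version makes the precision-weighting of prior and working observations explicit.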

Bayesian method to fit compound regression models
In this section we propose a Bayesian method to fit compound nonlinear (linear) regression models, defined as the biparametric nonlinear (linear) regression models, but with the compound distribution defined by (4). The idea is the same as presented in Section 3, where samples of the mean and dispersion regression parameters are proposed from the kernel transition functions (7) and (9). However, although the respective working variables can be obtained from first-order Taylor approximations, here the mean working variable is given by (28), the same obtained by the first-order Taylor approximation, while the working variable for the dispersion regression parameters is obtained from simulated values generated from a latent variable. Thus, to obtain a working variable associated with $\gamma$, simulated values of the random variable $\theta$ must be obtained, for which we propose the following procedure: given the current parameter values, a value $\theta_i$ is simulated from $F_2(\mu_i^{(c)}, \tau_i^{(c)})$ and assumed to be a sample of the response variable, from which values of the working variable (29) are obtained, with $y_i$ replaced by $\theta_i$.
With the introduction of this new working variable, the respective kernel transition functions are constructed as in Section 3, and the iterated Bayesian algorithm presented in that section is reformulated as follows:
1. Begin the chain iteration counter at $j = 1$.
2. Propose a new value $\beta^*$ from the kernel transition function $q_1(\cdot)$, given the current values $(\beta^{(c)}, \gamma^{(c)})$.
3. Accept $\beta^*$ with the Metropolis-Hastings acceptance probability; otherwise keep $\beta^{(c)}$.
4. Simulate latent values $\theta_i$ from $F_2(\mu_i^{(c)}, \tau_i^{(c)})$ and compute the dispersion working variables (29), with $y_i$ replaced by $\theta_i$.
5. Propose a new value $\gamma^*$ from the kernel transition function $q_2(\cdot)$, given the current values of the parameters.
6. Accept $\gamma^*$ with the Metropolis-Hastings acceptance probability; otherwise keep $\gamma^{(c)}$.
7. Update the current parameter values and store the state of the chain.
8. Change the counter from $j$ to $j + 1$ and return to 2 until convergence is reached.
Example 4.0.1. $(\mu, \alpha)$-negative binomial nonlinear regression models. If $y_i \sim NB(\mu_i, \alpha_i)$, $i = 1, \ldots, n$, where $\log(\mu_i) = S(x_i, \beta)$ and $\log(\alpha_i) = z_i^T \gamma$, then from (28) the mean working variables are given by

$$\tilde{y}_{1i} = \nabla S(x_i, \beta^{(c)})^T \beta^{(c)} + \frac{y_i - \mu_i^{(c)}}{\mu_i^{(c)}},$$

where $\nabla S(x_i, \beta^{(c)})$ denotes the gradient of $S(x_i, \cdot)$ evaluated at $\beta^{(c)}$. From the mean and variance of the negative binomial distribution, the mean and variance of $\tilde{y}_{1i}$ are given by $E(\tilde{y}_{1i}) = \nabla S(x_i, \beta^{(c)})^T \beta^{(c)}$ and $Var(\tilde{y}_{1i}) = (\mu_i + \mu_i^2/\alpha_i)/\mu_i^2$, evaluated at the current parameter values.
For the $\alpha$ regression structure, assuming that $\theta_i$ follows a gamma distribution $G(\mu_i^{(c)}, \alpha_i^{(c)})$, the working variables $\tilde{y}_{2i}$ are obtained from (29), where $\psi(\alpha_i^{(c)}) = \frac{d}{d\alpha_i}\log\Gamma(\alpha_i)$, the digamma function, is evaluated at $\alpha_i^{(c)}$. The mean and variance of this working variable follow from the properties of the gamma distribution, proven in Cepeda-Cuervo (2001).

If a distribution does not belong to the biparametric exponential family (3), such as the beta distribution, nonlinear regression models can also be defined by assuming a nonlinear regression structure for the mean and a linear (or nonlinear) structure for the dispersion parameter: $h(\mu_i) = S(x_i, \beta)$ and $g(\phi_i) = z_i^T \gamma$. In this case, the maximum likelihood estimates can be obtained by using the Fisher scoring algorithm, and the nonlinear Bayesian regression model can be fitted by application of an iterated Bayesian algorithm, where the kernel transition functions can be developed from working variables obtained by first-order Taylor approximation.
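For the negative binomial mean step of Example 4.0.1 with log link, the working variable from (28) takes the form $\tilde{y}_{1i} = \nabla S(x_i, \beta^{(c)})^T \beta^{(c)} + (y_i - \mu_i^{(c)})/\mu_i^{(c)}$, with variance $(\mu_i + \mu_i^2/\alpha_i)/\mu_i^2 = 1/\mu_i + 1/\alpha_i$. A small numeric sketch (illustrative only; the gradient, current coefficients and current mean are made-up inputs):

```python
import math

def nb_mean_working_variable(y, grad_S, beta_c, mu_c, alpha):
    # Linearized predictor grad_S . beta_c plus the log-link residual term
    # (h'(mu) = 1/mu for the log link)
    lin = sum(g * b for g, b in zip(grad_S, beta_c))
    ytilde = lin + (y - mu_c) / mu_c
    # Var(ytilde) = Var(Y)/mu^2 with Var(Y) = mu + mu^2/alpha, i.e. 1/mu + 1/alpha
    var = 1.0 / mu_c + 1.0 / alpha
    return ytilde, var

# Hypothetical inputs: gradient [1, 2], beta_c = [0.5, 0.25] (so lin = 1),
# current mean mu_c = exp(1), observation y = 4, current alpha = 2
ytilde, var = nb_mean_working_variable(4.0, [1.0, 2.0], [0.5, 0.25], math.exp(1.0), 2.0)
```

Observations with small $\mu_i$ or small $\alpha_i$ get larger working variances and thus less weight in the kernel (7), which is how the overdispersion enters the proposal.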

Applications: Neonatal mortality dataset
The age of a woman when she gives birth is a factor that influences the risk that the newborn baby will die in the first 28 days of life (neonatal death). To study the relation between neonatal death and mother's age, data on the annual number of live births and the number of neonatal deaths were tabulated between 1999 and 2016 by Donoso et al. (2014), with the following variables:
• Annual number of neonatal deaths (ND).
• Annual number of children born alive, in thousands (LB).
• Maternal age category (MA).
• Year of observation (T).
Figure 1 shows the annual neonatal mortality ($1000 \cdot ND/LB$) for each category of maternal age, for the 120 observations obtained from 1999 to 2016. From this figure, it is clear that the number of neonatal deaths decreased with maternal age from $MA = 1$ to $MA = 4$, after which it increased until $MA = 8$, corresponding to maternal ages greater than 45 years. Thus, from the nature of the data and their behavior, we assume the negative binomial model with mean and dispersion regression structures given by (10) and (11). In order to capture the change in mean neonatal deaths across the categories of MA, we take into account that the explanatory variable MA is obtained by a partition of the continuous variable "age of the mother", assumed to be between 10 and 50 years old, into consecutive intervals of equal length. The Bayesian nonlinear negative binomial regression model is defined by assuming $N(0, 100)$ normal prior distributions for all the parameters. These nonlinear negative binomial regression models were fitted by applying the Bayesian method proposed in Section 4, which ensures fast convergence of the chains.
In this model, the mean regression structure is proposed taking into account the behavior of ND as a function of MA, presented in Figure 1, which depends on the LB values, including a scale transformation, plus a linear component in time to detect a possible decreasing trend in mortality. Thus, $\beta_1$ is an intercept parameter; $\beta_2$ is related to the average rate of change in a parabola with vertex at $(\beta_3, \beta_4)$; and $\beta_5$ is the parameter related to a linear regression structure in $T$. Thus, taking into account the behavior of ND as a function of the explanatory variables, we assume a continuous model that can be a good approximation of the mean behavior of ND on the domain of the function, determined by the values of $(MA, LB, T)$.
In this fitting procedure, $n = 60{,}000$ posterior samples were generated, with a burn-in of 10,000 and keeping every 10th sample to reduce autocorrelation. The posterior parameter estimates, their standard deviations and their 95% credible intervals are summarized in Table 1, where the Monte Carlo (MC) error is also estimated for each parameter.
The DIC value for this model was 1013. From these results, we conclude that neonatal mortality decreased for each category up to category 4, corresponding to mothers aged between 25 and 29 years, and increased from category 5 to category 8, the latter corresponding to mothers between 45 and 49 years old. The estimate $\beta_4 = 4.515 > 0$ indicates that ND increased, within each category of MA, when the number of children born alive, LB, increased, being greatest in the MA categories with older ages.
Figure 2 shows the behavior of the standardized Bayesian residuals, indicating good performance of the model in fitting the neonatal deaths dataset: the residuals are distributed around zero, between -3 and 2, and do not show any trend with the observation numbers. The shape of the nonlinear negative binomial model with mean and dispersion regression structures given by (10) and (11) confirms the conclusion reported in Donoso et al. (2014), where a descriptive analysis of the data indicated a parabolic nonlinear relationship between neonatal mortality and maternal age in Chile. As in Villarroel del Pino (2003), this nonlinear model indicates a high risk of maternal and perinatal morbidity in pregnancies of adolescents and of women with advanced maternal age (Donoso and Villarroel 2003).

Conclusions
In this paper we propose nonlinear (linear) regression models in $F_2$. In this class of models, we propose two new classes of overdispersed nonlinear regression models. The first class is defined from the overdispersion family of distributions proposed by Dey et al. (1994) and the second from a class of compound distributions. For these models, we develop a Bayesian method, as an extension of those proposed in Cepeda-Cuervo (2001), in which samples of the posterior distributions are obtained by applying an iterated M-H algorithm, assuming two groups of parameters defined by the mean and dispersion regression structures. In the first class of models, defined in the overdispersed family of distributions of Dey et al. (1994), to propose the iterated M-H algorithm we obtain working variables by applying the Fisher scoring algorithm and use them to build the kernel transition functions from which samples of the posterior parameter distributions are proposed. A Bayesian method to fit compound models is also proposed and applied to fit simulated and real datasets.
Particular cases of the Bayesian methods proposed here to fit nonlinear biparametric regression models were developed in Cepeda-Cuervo and Achcar (2010), to fit heteroscedastic nonlinear regression models, and in Cepeda-Cuervo (2001), to fit DGLMs. In all of these cases, including the applications developed in this paper, the results show good performance of the proposed Bayesian methods. However, there are many possible extensions that can be developed by practitioners to illustrate the efficiency of the MCMC algorithms developed here. Multiple applications and simulation studies can be developed as extensions of this work to illustrate the use of the proposed models and the performance of the Bayesian method proposed to fit nonlinear regression models.
In the application, several chains were generated, starting from different values. All of them exhibited the same qualitative behaviour through the iterations after a small initial transient period, providing a rough indication of stationarity. The convergence of the posterior chains can be evaluated by applying Geweke's diagnostics (Geweke 1992).

Double generalized linear models (DGLMs)
This section is a summary of the main results presented in Cepeda and Gamerman (2005), where the double generalized linear regression models (DGLMs) are defined in the framework of the biparametric family of distributions. A particular case of these models is defined from the family of distributions of Dey et al. (1994), where the mean and dispersion parameters are orthogonal, in the sense of Box and Cox, and the systematic components are assumed to follow linear regression structures. Some examples of these models, considered in Cepeda-Cuervo (2001) and Cepeda and Gamerman (2005), are the beta regression models, where appropriate functions of the mean and dispersion parameters follow regression structures, and the gamma regression models, where appropriate functions of the mean and variance (or shape) parameters follow regression structures.

Model definition
The double generalized linear regression models are defined by the following three components:
• The random component: $Y_i$, $i = 1, \ldots, n$, are independent random variables with distribution belonging to the two-parameter exponential family of distributions ($F_2^e$).
• The systematic component: $\eta_{1i} = x_i^T \beta$ and $\eta_{2i} = z_i^T \gamma$.
• The link functions: $h(\cdot)$ and $g(\cdot)$, such that $h(\mu_i) = \eta_{1i}$ and $g(\tau_i) = \eta_{2i}$.
The link functions must be strictly monotonic and twice differentiable to obtain maximum likelihood parameter estimates, and once differentiable in the Bayesian approach.
1. Heteroscedastic normal regression models. Let $Y_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, \ldots, n$, be $n$ independent normal random variables. These models are defined by assuming that appropriate functions of $\mu_i$ and $\sigma_i^2$ follow the linear regression structures above.
2. Mean and shape gamma regression models. Let $Y_i \sim G(\mu_i, \alpha_i)$, $i = 1, \ldots, n$, be $n$ independent gamma random variables. These models are defined by assuming that appropriate functions of $\mu_i$ and $\alpha_i$ follow the linear regression structures above.
As examples of non-orthogonal two-parameter regression models, Cepeda-Cuervo (2001) and Cepeda and Gamerman (2005) proposed:
1. Mean and dispersion beta regression models. Let $Y_i \sim B(\mu_i, \phi_i)$, $i = 1, \ldots, n$, be $n$ independent beta random variables, where $\mu_i = a_i/(a_i + b_i)$ and $\phi_i = a_i + b_i$. These beta regression models are defined by assuming that appropriate functions of $\mu_i$ and $\phi_i$ follow the linear regression structures above.
2. Mean and variance gamma regression models. Let $Y_i \sim G(\mu_i, \sigma_i^2)$, $i = 1, \ldots, n$, be $n$ independent gamma random variables, where $\mu_i = E(Y_i)$ and $\sigma_i^2 = Var(Y_i)$. These gamma regression models are defined by assuming that appropriate functions of $\mu_i$ and $\sigma_i^2$ follow the linear regression structures above.
To obtain samples from the posterior conditional distributions by applying the M-H algorithm, the authors propose working variables that approximate $h(\mu)$ and $g(\tau)$ around the current parameter estimates, obtained from the equations of the Fisher scoring algorithm or from the first-order Taylor approximations of $h(\cdot)$ and $g(\cdot)$ around the current values of $\mu_i$ and $\tau_i$. Thus, the mean working variables are given by

$$\tilde{Y}_i = h(\mu_i^{(c)}) + h'(\mu_i^{(c)})(Y_i - \mu_i^{(c)}), \qquad (16)$$

where $h'(\cdot)$ denotes the first-order derivative of $h(\cdot)$ and $h(\mu_i^{(c)}) = x_i^T \beta^{(c)}$. From (16), a mean working observational model is obtained by assuming that the working variable $\tilde{Y}_i$, for which $E(\tilde{Y}_i) = x_i^T \beta^{(c)}$ and $Var(\tilde{Y}_i) = [h'(\mu_i^{(c)})]^2 Var(Y_i)$, follows a normal distribution. The mean kernel transition function $q_1$, to obtain samples of $\beta$, is defined as the posterior distribution obtained by the combination of the conditional prior distribution $\pi(\beta|\gamma)$ and the working observational model $\tilde{Y}_i \sim N(x_i^T \beta^{(c)}, Var^{(c)}(\tilde{Y}_i))$. Thus, the mean kernel transition function $q_1(\cdot)$ is given by

$$q_1(\beta) = N(b^*, B^*),$$

where $b^* = B^*(B^{-1}b + X^T \Sigma^{-1} \tilde{Y})$ and $B^* = (B^{-1} + X^T \Sigma^{-1} X)^{-1}$, with $\Sigma = \mathrm{Diag}(Var^{(c)}(\tilde{Y}_i))$ and $\tilde{Y}$ the vector of observed values of (16).
The full posterior conditional distribution π(γ|β) is also analytically intractable. Thus, samples from this distribution are obtained by applying the M-H algorithm, for which a kernel transition function is proposed. From a random variable t_i such that E(t_i) = τ_i, a working dispersion variable is defined by:

Ỹ_2i = g(τ_i^(c)) + (t_i − τ_i^(c)) g′(τ_i^(c)),

where E(Ỹ_2i) = z_i^T γ^(c) and Var(Ỹ_2i) = [g′(τ_i^(c))]² Var(t_i), by assuming that it follows a normal distribution. The dispersion (τ) kernel transition function q_2, used to obtain samples of γ, is defined as the posterior distribution obtained by combining the conditional prior distribution π(γ|β) with the working observational models Ỹ_2i ∼ N(z_i^T γ^(c), Var^(c)(Ỹ_2i)). Thus, the kernel transition function q_2(.) is given by:

q_2(γ) = N(g*, G*),

where g* = G*(G^(−1)g + Z^T Λ^(−1) Ỹ_2) and G* = (G^(−1) + Z^T Λ^(−1) Z)^(−1), with Λ = Diag(Var^(c)(Ỹ_2i)) and Ỹ_2 the vector of observed values of the working variables.
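Because q_1 and q_2 depend on the current state, the resulting proposals are asymmetric and the M-H acceptance ratio must include the kernel densities in both directions. A generic sketch of one such step (all names are ours; `log_q(x, y)` is the log-density of proposing y from x):

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_step(current, log_post, propose, log_q):
    """One Metropolis-Hastings step with a possibly asymmetric kernel.
    Accepts the candidate with probability
    min(1, pi(cand) q(cand -> current) / (pi(current) q(current -> cand)))."""
    cand = propose(current)
    log_alpha = (log_post(cand) - log_post(current)
                 + log_q(cand, current) - log_q(current, cand))
    if np.log(rng.uniform()) < log_alpha:
        return cand, True
    return current, False
```

In the iterated algorithm of the text, one such step is taken for β with kernel q_1 (conditioning on the current γ) and then one for γ with kernel q_2 (conditioning on the updated β).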
In the following examples we present the working variables developed in Cepeda-Cuervo (2001) to build the kernel transition functions used in the M-H algorithm to fit beta regression and gamma regression models.
Example 7.1.1. Mean and shape gamma regression models. In gamma regression models with h(µ_i) = x_i^T β and g(α_i) = z_i^T γ, if β^(c) denotes the current value of β, the mean working variables are given by:

Ỹ_1i = h(µ_i^(c)) + (y_i − µ_i^(c)) h′(µ_i^(c)),

for which h(µ_i^(c)) = x_i^T β^(c) and Var(Ỹ_1i) = [h′(µ_i^(c))]² Var(Y_i). To define a working variable to build a kernel transition function to propose samples of γ, the random variable t_i = λ_i Y_i is proposed, given that E(t_i) = α_i. Thus, if g(.) is the logarithmic function, from the first-order Taylor approximation of log(t_i) around the current value α_i^(c) = g^(−1)(z_i^T γ^(c)), the working shape variable is given by:

Ỹ_2i = log(α_i^(c)) + (t_i − α_i^(c)) / α_i^(c).

Example 7.1.2. Mean and dispersion beta regression models. In beta regression models with logit mean link function, h(µ_i) = logit(µ_i), and logarithmic dispersion link function, g(ϕ_i) = log(ϕ_i), where ϕ = a + b, the mean working variable, obtained as in Example 7.1.1 from the first-order Taylor approximation of logit(y_i) around the current value µ_i^(c), is:

Ỹ_1i = logit(µ_i^(c)) + (y_i − µ_i^(c)) / [µ_i^(c)(1 − µ_i^(c))],

for which g(ϕ_i^(c)) = z_i^T γ^(c). To define a working variable to build a kernel transition function to propose samples of γ, the author starts by defining the random variable t_i = ((a+b)²/a) Y_i, given that E(t_i) = ϕ_i. Thus, from the first-order Taylor approximation of log(t_i) around the current value ϕ_i^(c), the working dispersion variable is given by:

Ỹ_2i = log(ϕ_i^(c)) + (t_i − ϕ_i^(c)) / ϕ_i^(c),

and its associated variance is Var(Ỹ_2i) = Var(t_i) / (ϕ_i^(c))².
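The working variables of Example 7.1.2 are straightforward to compute. A sketch (the helper name is ours; `mu_c` and `phi_c` stand for the current values µ^(c) and ϕ^(c)):

```python
import numpy as np

def beta_working_variables(y, mu_c, phi_c):
    """Working variables for the M-H kernels in mean and dispersion
    beta regression (logit mean link, log dispersion link)."""
    # Mean working variable: first-order Taylor expansion of logit(y)
    # around mu^(c): y1 = logit(mu) + (y - mu) / (mu (1 - mu)).
    y1 = np.log(mu_c / (1.0 - mu_c)) + (y - mu_c) / (mu_c * (1.0 - mu_c))
    # Var(y1) = Var(Y) / (mu(1-mu))^2 = 1 / ((phi + 1) mu (1 - mu)).
    v1 = 1.0 / ((phi_c + 1.0) * mu_c * (1.0 - mu_c))
    # Dispersion working variable: t = ((a+b)^2 / a) y = (phi / mu) y
    # has E(t) = phi, so expanding log(t) around phi gives
    # y2 = log(phi) + (t - phi) / phi.
    t = (phi_c / mu_c) * y
    y2 = np.log(phi_c) + (t - phi_c) / phi_c
    return y1, v1, y2
```

Note that when y_i equals the current mean µ_i^(c), both working variables collapse to the current linear predictors logit(µ_i^(c)) and log(ϕ_i^(c)), so the kernels are centered near the current state.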

Frequentist parameter estimates of nonlinear DGLMs
The Fisher scoring algorithm is defined from the first- and second-order derivatives of the logarithm of the likelihood function (25), as follows. Let y_i, i = 1, 2, ..., n, be n independent observations of n random variables Y_i ∼ F²_e(µ_i, τ_i), belonging to the exponential family (3). The logarithm of the likelihood function, ℓ = log L, is given by ℓ = Σ_{i=1}^n ℓ_i, where ℓ_i is the logarithm of the i-th likelihood component, h(µ_i) = S(x_i^T, β) and g(τ_i) = z_i^T γ. The Hessian matrix is obtained from the second-order derivatives of ℓ with respect to β and γ, where k, j = 1, ..., p for β and k, j = 1, ..., r for γ, and the Fisher information matrix is obtained as the expected value of the negative of these second-order derivatives. From these results, the Fisher information matrix is block diagonal, the parameter vectors β and γ are globally orthogonal (Cox and Reid 1987) and their maximum likelihood estimates are asymptotically independent.
Thus, given the k-th estimates of the regression parameters, β^(k) and γ^(k), by the Fisher scoring algorithm the (k+1)-th estimate of β is given by the equation

I_β^(k) β^(k+1) = X^T W^(k) Ỹ_1,

where I_β^(k) is the β-block of the Fisher information matrix, the entries of X are given by x_ij = ∂S(x_j, β^(k))/∂β_i, W^(k) = Diag(w_i^(k)), and Ỹ_1 is an (n × 1) working vector with components ỹ_1i, i = 1, 2, ..., n, obtained from the first-order expansion of S(x_i, β) around β^(k). Then, given the k-th estimates β^(k) and γ^(k), the (k+1)-th estimate of γ is given by the analogous scoring equation for the dispersion block of the information matrix, with working components involving T(y_i) and Ψ^(0,1)(µ_i). With these results an iterated maximum likelihood algorithm can be proposed to obtain maximum likelihood estimates of the regression parameters. To fit the Bayesian model obtained by assuming a normal prior distribution for the regression parameters, we can apply the Bayesian method proposed in Section 3, where the mean kernel transition function is given by (19), built with the working variable (28).
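To make the iterated scoring scheme concrete, the following sketch implements it for the simplest member of the class, the heteroscedastic normal model Y_i ∼ N(x_i^T β, exp(z_i^T γ)) with a linear mean. This is not the paper's general nonlinear DGLM scoring equations; it is a simplified instance (function name and details are ours) that shows the alternation between the β and γ scoring steps:

```python
import numpy as np

def fit_hetero_normal(y, X, Z, n_iter=50):
    """Iterated Fisher scoring for Y_i ~ N(x_i' beta, exp(z_i' gamma)).
    Alternates a weighted least squares step for beta with a scoring
    step for gamma based on a working response."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # OLS start
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        sigma2 = np.exp(Z @ gamma)
        # Scoring step for beta: weighted least squares, weights 1/sigma2.
        XtW = X.T * (1.0 / sigma2)
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        # Scoring step for gamma: the score is z_i ((y_i - mu_i)^2 / sigma2_i - 1) / 2
        # and the information is Z'Z / 2, which reduces to an OLS fit of the
        # working response y2_i = z_i' gamma + (y_i - mu_i)^2 / sigma2_i - 1 on Z.
        resid2 = (y - X @ beta) ** 2
        y2 = Z @ gamma + resid2 / sigma2 - 1.0
        gamma = np.linalg.solve(Z.T @ Z, Z.T @ y2)
    return beta, gamma
```

Because the information matrix is block diagonal, each step leaves the other block's estimate fixed, and the alternation converges to the joint maximum likelihood estimate.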

Figure 1: Neonatal mortality versus maternal age in Chile, between 1999 and 2016

Figure 2: Plot of standardized Bayesian residuals versus fitted values and histogram of residuals

The dataset was obtained from the Vital Statistics Yearbooks of the National Statistics Institute of Chile. In this application, the variable of interest or dependent variable is ND per year, while the explanatory variables are: