On the existence of maximum likelihood estimators in Poisson-gamma HGLM and negative binomial regression model

: A breakthrough is provided in the study of the existence prob- lem for maximum likelihood estimators (MLE) in the hierarchical generalized linear model (HGLM) of Poisson-gamma type, as well as in the negative binomial regression model. Any more than the uniqueness problem associated, the existence problem of MLE for these models has not yet been studied except in the very special case of the sample. This issue is addressed here for the Poisson-gamma HGLM, and a suﬃcient condition is obtained to ensure the MLE existence in that case. It is also shown that this condition has the same eﬀect in the negative binomial regression model with the index parameter considered as unknown. In the latter model, the obtained condition appears as a natural extension of the necessary and suﬃcient condition well known for solving the existence and uniqueness problems for the index parameter MLE in the sample case.

The Poisson-gamma HGLM are members of the hierarchical generalized linear model family (Lee and Nelder [14]), which is an extension of the generalized linear model family and of the generalized linear mixed model family. HGLM have many fields of application, and are specifically adapted for representing longitudinal data that are generally correlated (Cameron and Trivedi [4], Hilbe [12], Lee and Nelder [14], Molas and Lesaffre [19]). For instance, Poisson-gamma HGLM are used in practice to describe longitudinal count data, as the claim numbers in non-life insurance (Dionne and Vanasse [6], Boucher et al. [3]), among others. It should be noted that a Poisson-gamma HGLM considered at a single moment follows a negative binomial regression model (Cameron and Trivedi [4], Hilbe [12]), since a mixture of Poisson distributions by a gamma distribution gives a negative binomial distribution (Johnson et al. [13]). Therefore, without the regression part, a very special case of Poisson-gamma HGLM is the negative binomial sample. To estimate parameters in HGLM, several methods have been proposed in the literature, see, for instance, Lee and Nelder [14], Lee et al. [15], Molas and Lesaffre [19], or Liang and Zeger [18] and Zeger et al. [24]. In this work, we only focus on the classical maximum likelihood estimation approach.
The existence and uniqueness problems for maximum likelihood estimators (MLE) in Poisson-gamma HGLM are not considered in the literature, except in the case of the negative binomial sample (see Levin and Reeds [17] for a correct solution in this case). And it may be observed that this very special case has had a "long" history before being entirely resolved (see Fisher [7] and Haldane [11] for the beginning). The aim of this paper is to provide a breakthrough in the study of the existence problem for MLE in Poisson-gamma HGLM. The same issue for generalized linear models and log linear models is discussed for instance in Christensen [5], Haberman [10], Santner and Duffy [22], a special case of these types of models consisting of Poisson regression models.
However, it should be noticed that the problem addressed is only partially solved here. A condition on the observations is presented, and it is only shown that it is sufficient to solve the MLE existence problem. At present, interesting extensions of this work seem not to be easy to achieve, despite the fact that the presented condition appears to be a good candidate to be a necessary and sufficient condition for solving the existence and uniqueness problems of MLE for the HGLM Poisson-gamma. On the contrary, the transposition of the obtained result to the negative binomial regression model is trivial.
To complete this introduction, it is probably useful to discuss the importance to be given to the fact that maximum likelihood estimators do not always exist. First, it may be observed that two conditions must be introduced to ensure that these estimators exist for Poisson-gamma HGLM. The former is typical of the presence of covariables in a Poisson regression model and ensures that there is a solution to the likelihood equations associated (Haberman [10]). The second is concerned with the additional parameter which exists in the Poisson-gamma HGLM and fixes the longitudinal structure of the model. Also note that the need for such conditions is well referenced, for example in the special case of samples of discrete distributions. Moreover, in this latter case, it is known that these conditions are not satisfied on a set of observations of nonzero probability, even if the probability of this set converges to 0 when the sample size goes to infinity. Now, some additional details specifying the role of these conditions are given by separating theoretical and practical aspects.
The theoretical aspects are considered discussing a well-known example which is among the simplest possible to be described. Consider a sample of negative binomial distribution N B(r, p), with parameters r ∈]0, ∞[ and p ∈]0, 1[, and probability mass function f given by f (y) = (Γ(y + r)/(y! Γ(r))) p y (1 − p) r for y ∈ N. The empirical mean and variance are calculated for this sample (the empirical variance has to be normalized by the sample size). The necessary and sufficient condition for solving the existence and uniqueness problems of MLE for the parameter r and p is then that the empirical variance is strictly greater than the empirical mean. This condition has long been known (Fisher [7] and Haldane [11]), but its role was fully demonstrated nearly forty years later (Levin and Reeds [17]); moreover, incorrect demonstrations have even been published since this last publication. Under the condition set out the MLE of r and p exist in ]0, ∞[×]0, 1[. What will happen when this condition is not satisfied? A laborious calculation shows that a solution that maximizes the likelihood function exists on the boundary of the area where the parameters are defined. This solution is such that r is equal to infinity, p is equal to 0 and rp is equal to the empirical mean (and the likelihood function is finite). Then, recalling the convergence result of the negative binomial distribution N B(r, p) to the Poisson distribution P(λ) when λ is positive, r goes to infinity, p goes to 0 and rp goes to λ, the "boundary" solution may be connected to the MLE of parameter λ in the Poisson sample when the empirical mean is positive. And it remains to handle the case where the sample mean is zero, again a change of statistical models can be considered for this purpose. However, from a theoretical point of view, a comfortable approach is to not change the statistical model and to seek estimators within their domain of definition. It is a choice which is usually done and this is the one retained in this paper: the conditions of existence presented below ensure that the MLE will exist in the area where the parameters are defined for the model used and not on the boundary of this area. However, it is important to note that the same boundary phenomenon as the one just described for the negative binomial sample arises for the Poisson-gamma HGLM. The Poisson regression models are then the boundary statistical models having to be considered in that case. As well, the convergence of Poisson-gamma HGLM to Poisson regression models is a known fact (Gning [8], Rodríguez-Avi et al. [21]) that is used in this paper in proving the main result presented. Indeed, this convergence result allows to obtain the behavior of the likelihood function when the parameters are at the border of their domain of definition and this is crucial to conclude on the existence of MLE in the sense that has been reminded.
From a practical point of view, several numerical methods already exist to find estimators in the context of HGLM and these algorithms address these models in a wider context than envisaged here. Some of them, however, rely on the concept of maximum likelihood (Molas and Lesaffre [19]), other methods are also used (see, for instance, Lesaffre and Spiessens [16]). What meaning should be given to the conditions of existence obtained? To the extent that it would be shown that these conditions are necessary and sufficient to ensure that MLE exist, it seems inevitable that an algorithm takes into account their function to change statistical models when they are not satisfied by the observations, and also when they are close to not be satisfied. Otherwise, errors and instabilities will appear in a numerical research of MLE. Moreover, it may be indicated that, at least for the case of a simple negative binomial sample, other estimation methods, such as the method of moments, encounter the same problem and there exist recent publications still questioning methods for estimating both parameters r and p when the condition of existence of MLE is not satisfied (see, for instance, Al-Khasawneh [2], Robinson and Smyth [20]). Finally, the following point should be emphasized. It will be easy to see that the two conditions obtained in the case of Poisson-gamma HGLM become less easy to check when the number of covariables in the regression part of the model increases, therefore, the subset of observations satisfying these conditions is likely to become smaller in this case.
The rest of this paper is organized as follows. In the first section, we present the Poisson-gamma HGLM and introduce notation that will be useful later. In the second section, we state our main result, and also preliminary results that will be used to prove the main result. This section concludes with a conjecture, since it seems quite likely that our condition is also a necessary and sufficient condition for the MLE existence and uniqueness in the models considered. All the results are proved in the third section. And some possible extensions of this work are discussed in the last section.

Poisson-gamma HGLM
Consider a sample of n individuals kept under observation over T periods. For the individual k during the period t, a random count variable Y kt is observed and represents the dependent variable, while the J deterministic characteristics x kt = (x kt1 , . . . , x ktJ ) ′ ∈ R J are known and represent the covariables. For individual k, it is also assumed that there is a real and positive random characteristic Θ k , which is attached but is not observed. Finally, the regression structure will use the vector of regression parameters β = (β 1 . . . β J ) ′ ∈ R J , the quantities λ kt = exp(x ′ kt β), for k = 1, . . . , n and t = 1, . . . , T , and the nT × J regression matrix X = (x 11 x 12 · · · x nT ) ′ .
In this setting, denoting by P(λ) the Poisson distribution of parameter λ, λ > 0, and, by gamma(a, b) the gamma distribution of parameters a and b, a > 0, b > 0, the Poisson-gamma HGLM requires the three following assumptions: (H1) the individual k is represented by the random vector (Θ k Y k1 . . . Y kT ) ′ , and these vectors are independent for k = 1, . . . , n; (H2) for each k and θ k > 0, given Θ k = θ k , the variables Y k1 , . . . , Y kT are independent and for each t the conditional distribution of Y kt is the Poisson distribution P(λ kt θ k ); (H3) for each k the distribution of Θ k is the gamma distribution gamma(a, a).
For the presentation of the results obtained, one additional assumption must still be made: (H4) X is a matrix of full rank, and nT ≥ J.
From these assumptions, it follows that the model defined is identifiable and satisfies the following constraint: log λ ∈ F , with λ = (λ 11 λ 12 · · · λ nT ) ′ ∈ R nT , log λ = (log λ 11 log λ 12 · · · log λ nT ) ′ ∈ R nT and F = Im(X) ⊂ R nT which is the range of X. Observe also that the non-conditional distribution of Y kt is the negative binomial distribution N B(a, λ kt /(a + λ kt )), since, by (H1) and (H2), for y kt ∈ N: Therefore, when T = 1, the Y k1 , k = 1, . . . , n, constitute one of the main forms of negative binomial regression models, with β as regression parameter and a as index parameter (see Hilbe [12]).
An extension of the Poisson-gamma HGLM, useful for instance in insurance, is to assume that individuals are not necessarily observed for the same number of periods. All results presented seem to be easily extended to cover this case.

Existence results for MLE
Starting with the calculus of the log-likelihood expression, this section introduces the main results obtained. Several preliminary results are also stated.

The log-likelihood function
For any k = 1, . . . , n, given Θ k , the count variables Y k1 , . . . , Y kT are assumed independent. Therefore, their joint probability function is given for (y k1 · · · y kT ) ′ ∈ N T by: where s k = T t=1 y kt and µ k = T t=1 λ kt . So that the log-likelihood function of the model differs of the following function ℓ y by a constant term, where y = (y 11 . . . y nT ) ′ ∈ N nT is the vector of observation: where, for k = 1, . . . , n, j=0 log(a + j) = 0 whenever s k = 0 (for properties of the function gamma, see Abramovitz and Stegun [1]). Finally, the following notation is introduced. Let r = 1 a and Φ be the function defined from ]0, +∞[×R J to R by: (2.1) Therefore, passing to the limit as r tends to 0, the function Φ(0, ·) can be defined from R J to R by: And it must be noticed that the function Φ(0, ·) differs of the log-likelihood function of a Poisson regression model by an additive term which is independent of the model parameter β. Indeed, that is the model where the Y kt , k = 1, . . . , n, t = 1, . . . , T , are independent and the distribution of Y kt is the Poisson distribution P(λ kt ). Moreover, the following condition is known to be a necessary and sufficient condition for solving the existence and uniqueness problems of MLE for this Poisson regression model (see Haberman [10]): (C1) there exists δ ∈ F ⊥ such that y kt + δ kt > 0, k = 1, . . . , n, t = 1, . . . , T.
Under (C1), the vector y is nonzero, and the MLE of β in this model is denoted in the following byβ(0). The following remarks can be made on the condition (C1). Recall that all y kt are natural numbers, so that condition (C1) requires that there is not too many of these observations of the dependent variable that are equal to zero. In the particular case of a Poisson sample, the subspace F is generated by the vector in R nT with all coordinates equal to 1 and condition (C1) is then equivalent to at least one of the y kt has to be different from zero (which is also equivalent to their sum has to be different from zero). On the other hand, when all covariates are categorical and the Poisson regression model has a multiplicative structure (Haberman [10], Santner and Dufy [22]), condition (C1) is equivalent to an elementary form. Indeed, in that case, it can be shown (Gning and Pierre-Loti-Viaud [9]) that (C1) is equivalent to the fact that for any covariate and any category of this covariate, there is an observation with this characteristic and a nonzero dependent variable (in other words, all sums on the margins of the observations of the dependent variable are different from zero). In general, in a less accurate way, the dimension of the subspace F gets larger when the number of covariables increases, which in turn requires an augmentation of the number of nonzero observations of the dependent variable.

Main result
Now, we can state the sufficient condition for the existence of MLE of the Poisson-gamma HGLM introduced in Section 1, letting, whenβ(0) exists: Remark 1 (Negative binomial regression model). By taking T = 1, Theorem 1 applies to one of the main forms of negative binomial regression models. Assuming further that all λ k1 are equal, so that Y k1 , k = 1, . . . , n, is a negative binomial sample, then, the condition (C2) becomes: the empirical variance is strictly greater than the empirical mean. The latter condition is also necessary and sufficient for the existence and uniqueness of MLE of the index parameter in a negative binomial sample (Levin and Reeds [17]).

Remark 2.
(C2) can be interpreted as the empirical variance "between class" of the Poisson regression model related to our model has to be greater than the empirical mean of the observations. Moreover, proceeding as in the first section, it is easily seen that for k = 1, . . . , n the sum S k = T t=1 Y kt follows the negative binomial distribution N B(a, µ k /(a + µ k )), where µ k = T t=1 λ kt . Since the variance of a negative binomial distribution is strictly greater than the mean, a theoretical counterpart of (C2) is thus obtained in writing that the variance of S k is strictly greater than its mean and summing in k = 1, . . . , n. Indeed, the variance of S k can be rewritten as E((S k − µ k ) 2 ) = E((S k − T t=1 λ kt ) 2 ), so that, the Poisson-gamma HGLM introduced in Section 1 satisfies the inequality:

Conjecture
The foregoing subsections, and especially Remark 1, lead us to introduce the following conjecture.
Conjecture The MLE of the parameters of Poisson-gamma HGLM, and therefore that of the related negative binomial regression models, exist and are unique if and only if (C1) and (C2) are satisfied.

Proofs
The proofs of the results stated previously are established, starting with the proofs of the preliminary results of Section 2.3.

Proof of the preliminary results
Proof of Lemma 1.
Then, since we have 0 < 1 − log(1+u) u ≤ u 2 for u > 0, we obtain, for all r > 0 and all β ∈ R J : Finally, observing that A k (β) ≤ T t=1 e x kt β , where . denotes the euclidean norm on R J , clearly, the last inequality leads to the identity: Proof of Lemma 2. See Haberman [10] for the case r = 0 and Gning [8] or Gning and Pierre-Loti-Viaud [9] for the case r > 0.
Proof of Lemma 3. Let γ > 0, and letB γ be the closed ball in R J of radius γ, centered atβ(0). Moreover, for 0 < γ 1 < γ, let B γ1 be the open ball of radius γ 1 , centered atβ(0). Under (C1), we have by Lemma 2: Then, according to Lemma 1, there exists r 0 > 0 small enough such that: from which we deduce the following inequalities: and: Hence, for 0 < r < r 0 , the smooth function Φ(r, .) has a local maximum in B γ1 and this local maximum is necessarily the global maximumβ(r) because of the strict concavity of Φ(r, .).
Proof of Lemma 4. The proof is a direct consequence of Lemmas 2 and 3.
Proof of Lemma 5. We have to prove that the functionβ is of class C 1 in a right neighborhood of 0, and we must establish a first order Taylor expansion at this point.
Consider the function f defined from [0, +∞[×R J to R J by: If (C1) is satisfied, then Lemma 2 yields that for r ≥ 0,β(r) exists and is the unique solution of the equation f (r, β) = 0, with: This will show that the functionβ is of class C 1 by using the Implicit Function Theorem (see, for example, Spivak [23]). This theorem is applied at 0, in a right neighborhood of 0. The hypotheses of the Implicit Function Theorem are now checked.
First, note that f is of class C 1 , and let D β f denote its partial derivative with respect to β, and D r f denote its partial derivative with respect to r. Then, it is easily obtained that, for r ≥ 0: Now, we have: f (0,β(0)) = ∇ β Φ(0,β(0)) = 0, and: where D β f (0,β(0)) is invertible since X is a full-rank matrix and nT ≥ J by (H4). It thus follows by using the Implicit Function Theorem that there exist r 1 > r 0 , V an open neighborhood ofβ(0) and h a function of class C 1 defined from [0, r 1 [ to V, such that, for all 0 ≤ r < r 1 : moreover, for all 0 ≤ r < r 1 : Finally, since there exists a unique solution for the equation f (r, β) = 0 for r ≥ 0, we haveβ(r) = h(r) for 0 ≤ r < r 1 . Hence, the functionβ = h is of class C 1 in [0, r 1 [ , and its first order Taylor expansion at 0 is given by: where some straightforward calculations lead to the expression:

Proof of Theorem 1
First, under (C1), and for each r ≥ 0, the problem argmax β∈R J Φ(r, β) is solved according to Lemma 2 and its unique solution is denoted byβ(r). In addition, it must be recalled thatβ(r) is also the unique solution of ∇ β Φ(r, β) = 0. The completion of the proof of Theorem 1 then consists in showing that (C2) is a sufficient condition for the existence of solutions of the problem argmax r∈]0,+∞[ Φ(r,β(r)), given that, for each r > 0, ∇ β Φ(r,β(r)) = 0.
In fact, thanks to these properties, if the function Ψ is strictly above its limit in a neighborhood of 0, and since it is a continuous function, then it reaches its maximum in ]0, +∞[ .

Limit of Ψ at +∞
First, observe that if δ satisfies (C1), then for 0 < α < 1 any αδ satisfies too (C1) since all y kt are natural numbers. We thus consider a δ of the form δ = ∆ r , where ∆ = (∆ 11 . . . ∆ nT ) ′ ∈ R nT and ∆ kt < 1/T for k = 1, . . . , n and t = 1, . . . , T . Then, Ψ may be rewritten as follows: We start by observing that obviously: In addition, it is easy to show that: Indeed, if (C1) is satisfied, the observation that y is nonzero is crucial, so that the set A = {k ∈ {1, . . . , n} : s k > 0} is not empty. Now, since for k ∈ A, the presence of the term j = 0 yields: while for the other k, by definition we have: the proof of (3.4) is therefore complete. Then, the final step consists in showing that the remaining term defining Ψ in (3.2) is bounded above when r → +∞. For that purpose, letη(r) = (η 11 (r) · · ·η nT (r)) ∈ R nT , whereη kt (r) = x ′ ktβ (r), k = 1, . . . , n, t = 1, . . . , T , and let, for η = (η 11 . . . η nT ) ′ ∈ R nT and r > 0: We observe that the function ϕ is not explicitly related to the log-likelihood of some statistical model (because of the presence of the ∆ kt ), but this function retains the properties summarized in the following lemma, whose proof is deferred to the next sub-section.
Lemma 6. Assuming that ∆ kt < 1/T , k = 1, . . . , n, t = 1, . . . , T , then, for each r > 0, the function ϕ(·, r) is of class C ∞ and strictly concave in R nT . Moreover, its maximum is reached at a unique pointη(r) ∈ R nT , which is the unique solution of ∇ η ϕ(η, r) = 0 and is given byη(r) = (η 11 · · ·η nT ) ′ with: From this, the following upper bound is obtained: where: Moreover, making use of the convention z log z = 0 when z = 0, it is easily seen that: Behavior of Ψ in a neighborhood of 0 In this step, we take δ = 0 in the expression (3.1) of Ψ. A Taylor expansion at order 1 will be established for Ψ in a right neighborhood of 0. And it should be noted that all the expansions in what follows are given for r in a right neighborhood of 0.
We start from the result of Lemma 5, by combining this result with the Taylor expansions: log(1 + rj) = rj + o(r), j ∈ N, and, sinceβ(r) is bounded in a right neighborhood of 0, for k = 1, . . . , n: So that, we also have, for k = 1, . . . , n: where the Taylor expansion e u = 1 + u + o(u) for u in a neighborhood of 0, has been used to establish the last equality.
Combining these Taylor expansions leads to the following expansion for the function Ψ: where we recall thatλ kt (0) = e x ′ ktβ (0) , k = 1, . . . , n, t = 1, . . . , T , and that: Thus, the position of Ψ with respect to its limit Φ(0,β(0)) = Ψ(0) is given by the sign of C, where: So that, if C > 0, then in a right neighborhood of 0, Ψ is strictly above its limit in 0. Hence, there exists a maximum of Ψ that is attained in ]0, ∞[ when (C2) is satisfied .
It should be noted that under our assumptions, we have T u=1 ∆ ku < 1 for k = 1 . . . , n.
On the other hand, for each k = 1, . . . , n, let a k = (a k1 · · · a kT ) ′ ∈ R T be such that a kt = e η kt /(1/r + T u=1 e η ku ), t = 1, . . . , T , and let D k be the T ×T diagonal matrix in which the diagonal entries are the a kt , t = 1, . . . , T . Then, denoting by H η ϕ(η, r) the Hessian matrix with respect to the vector variable η, this matrix is block diagonal and equal to: where, for each k = 1, . . . , n, H k is the T ×T symmetric matrix satisfying: Moreover, for all k = 1, . . . , n, we have a kt > 0, t = 1, . . . , T , and T u=1 a ku < 1, so that it can easily be shown that a k a ′ k − D k , and therefore H k , is a negativedefinite matrix. Which obviously implies that the Hessian matrix is negativedefinite and that ϕ(., r) is a strictly concave function on R J .
Finally, the function ϕ(., r) being strictly concave, and having a valueη(r) that cancels its gradient, the function ϕ(., r) reaches its maximum at the same value.

Conclusion
When a parametric statistical model is used in the case of count observations, in general there is a nonzero probability subset of the observations on which the MLE are not found in the area where the parameters are defined. The existence of this problem is well referenced, at least for the samples, and for the GLM or log linear models. It is also known when other methods of estimation are used, as is the case when the moment method is applied to a sample of negative binomial distribution. The study of this problem in the context of HGLM is addressed in this work for the estimators obtained by the maximum likelihood method, but considering only the case of HGLM Poisson-gamma having regard to the difficulty encountered in its solving. In contrast, and although extensions are still required, the result obtained allow to evaluate the importance of this phenomenon. In particular, an augmentation of the importance of the regression component, with its own constraint, seems to increase the influence of the second constraint obtained. And one can reasonably expect that such a phenomenon exists for extensions of the model presented, always for count data, but also for other methods of research of estimators. The present study begin to measure its importance in the HGLM setting.
It may then be noted that among the HGLM that deserve to be studied, the ones where the Poisson distribution is replaced by the negative binomial distribution can be cited because another complication arises in this case. The work done here has greatly benefited of the fact that the multidimensional distribution of the count data has an explicit form, but this fact is not true for most extensions that can be considered and notably in the ones just mentioned. It can also be observed that, although numerical procedures currently exist for determining the MLE in the case of Poisson-gamma HGLM, it is important to be informed of the conditions under which such MLE exist. When these conditions are not satisfied by the data, errors and numerical instabilities will have a negative impact on the quality of estimator evaluation to be obtained.
At that time several extensions of the work done can be considered. The main objective being to complete the proof of the conjecture that the (C1) and (C2) conditions are necessary and sufficient for the existence and uniqueness of the MLE in the Poisson-gamma HGLM. In addition, by referring to the full case study of a negative binomial sample, it should be observed that the work done is the easiest part among those necessary to achieve this objective. The study of others HGLM is also an important aim. On the other hand, small extensions of the results obtained can be easily considered. For example, if the n individuals are not observed on the same number of periods, an extension of Theorem 1 appears clearly as possible following the same pattern of demonstration. For the particular case where x kt = x k for k = 1, . . . , n, t = 1, . . . , T , a detailed study is given in Gning [8].