Computation of the Distribution of the Sum of Independent Negative Binomial Random Variables

The distribution of the sum of independent negative binomial random variables plays a special role in insurance mathematics, actuarial sciences, and ecology. Two methods to compute this distribution have been published: a finite-sum exact expression and a series expression by convolution. We compare both methods, as well as a new normalized saddlepoint approximation, and normal and single-distribution negative binomial approximations. We show that the finite-sum exact expression uses very large amounts of memory when the number of random variables is high (>7). The normalized saddlepoint approximation gives an output with a high relative error (around 3–5%), which can be a problem in some situations. The convolution method is a good compromise for applied practitioners, considering the amount of memory used, the computing time, and the precision of the estimates. However, a simplistic implementation of the algorithm can produce incorrect results due to the non-monotonicity of the convergence rate. The tolerance limit must be chosen according to the expected order of magnitude of the estimate, for which we used the value generated by the saddlepoint approximation. Finally, the normal and negative binomial approximations should not be used, as they produce outputs with very low accuracy.


Introduction
The negative binomial (NB) distribution is a discrete probability distribution that models counts [1]. It is widely used in statistics, from accident statistics [2] to animal counts [3]. The NB distribution can be used to describe the distribution of the number of successes or failures. Suppose that there is a sequence of independent Bernoulli trials, each trial having two potential outcomes, called "success" and "failure". The probability of success is p and of failure q = 1 − p. We observe this sequence until a predefined number, r, of successes has occurred. Then, the random number of failures has the NB distribution, X ∼ NB(r; p), with density P(X = x), x being a particular realization of X:

P(X = x) = ((x + r − 1)! / (x! (r − 1)!)) p^r (1 − p)^x, (1)

with 0 < p < 1, and x and r being integers > 0. The mean is µ = r(1 − p)/p. The moment-generating function of the NB distribution is:

M(t) = (p / (1 − (1 − p) e^t))^r. (2)

An alternative parametrization, X ∼ NB(µ, θ), can also be derived by assuming that the mean parameter of a Poisson distribution has a gamma distribution:

P(X = x) = (Γ(x + θ) / (Γ(θ) x!)) (θ / (θ + µ))^θ (µ / (θ + µ))^x, (3)

with µ > 0 and θ > 0. Note that θ is not necessarily an integer, contrary to r in (1); hence, the gamma function is used in (3) instead of a factorial, with Γ(x) = (x − 1)! for integer x. The variance of the NB distribution is µ(1 + µ/θ). As θ approaches infinity, the NB distribution tends to the Poisson distribution with mean µ.
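The definitions above can be checked numerically. The sketch below uses Python with scipy (the paper's own implementation is in R); scipy's `nbinom` follows the same failures-before-the-r-th-success convention as Equation (1), and `nb_pmf` is our illustrative name:

```python
from math import comb, isclose

from scipy.stats import nbinom

def nb_pmf(x, r, p):
    """Equation (1): probability of x failures before the r-th success."""
    return comb(x + r - 1, x) * p ** r * (1 - p) ** x

r, p = 5, 0.3
# scipy's nbinom uses the same convention as Equation (1).
assert all(isclose(nb_pmf(x, r, p), float(nbinom.pmf(x, r, p)))
           for x in range(50))

# Mean r(1 - p)/p and variance r(1 - p)/p^2, as stated in the text.
mean, var = (float(m) for m in nbinom.stats(r, p, moments="mv"))
```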

Sum of Negative Binomials
The sum of independent NB variables is of special interest in different contexts, such as the study of animal distributions [4,5], fecal egg counts in infected goats [6], the number of emergency medical calls [7], the empirical distribution of the duration of wet periods in days [8], or insurance risk [9]. When the sum of several independent NB counts is available, determining the distribution of ∑ X_i with X_i ∼ NB(r_i; p_i) is a problem. When the p_i are all the same and equal to p, a classical result is ∑ X_i ∼ NB(∑ r_i; p) [10], but more general forms without this constraint are often needed. For example, if counts are available for various spatial or temporal units of the form X_i ∼ NB(µ_i; r_i), p_i being r_i/(µ_i + r_i), the p_i are not all the same, because µ_i varies among the units [4].
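The classical shared-p result can be checked by simulation; a minimal Python/numpy sketch (parameters and sample size are arbitrary choices, not from the paper):

```python
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(1)
# Shared p: NB(r1, p) + NB(r2, p) ~ NB(r1 + r2, p).
r1, r2, p = 3, 4, 0.4
n_draws = 100_000
draws = (rng.negative_binomial(r1, p, n_draws)
         + rng.negative_binomial(r2, p, n_draws))
x = np.arange(30)
# Empirical frequencies of the sum vs. the NB(r1 + r2, p) pmf.
empirical = np.bincount(draws, minlength=30)[:30] / n_draws
max_gap = float(np.max(np.abs(empirical - nbinom.pmf(x, r1 + r2, p))))
```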
With the mean and variance of the NB(r; p) distribution being r(1 − p)/p and r(1 − p)/p^2, respectively, it follows that the mean and variance of the sum of n independent NB variables are, respectively:

mean(S_n) = ∑_{i=1}^{n} r_i (1 − p_i)/p_i and var(S_n) = ∑_{i=1}^{n} r_i (1 − p_i)/p_i^2. (4)

Our paper develops some novel methods for the practical computation and use of the convolution approach [9]. It also collects five different methods and presents them in one place, a useful resource for the working data scientist or statistician. We describe and reference these methods and outline the computational difficulties in getting them to work. We also point the reader to the freely available R software that implements each of the methods (plus a sixth method based purely on simulation).
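Equation (4) can be checked against a Monte Carlo simulation of the sum; a short sketch using numpy's `negative_binomial` (which counts failures, as in Equation (1)), with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
r = np.array([2, 5, 10])
p = np.array([0.3, 0.5, 0.7])

# Equation (4): moments of the sum are sums of the individual moments.
mean_S = float(np.sum(r * (1 - p) / p))
var_S = float(np.sum(r * (1 - p) / p ** 2))

# Monte Carlo check: draw each X_i, sum row-wise, compare moments.
draws = rng.negative_binomial(r, p, size=(200_000, 3)).sum(axis=1)
```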
Two methods have been published to compute the distribution of the sum of independent NB variables: a finite-sum exact expression [11] and the convolution method [9]. However, no computer implementation of either method was available, and we have detected potential problems when a practitioner implements them. The computer implementation of the finite-sum exact expression is relatively straightforward, but memory overflow can occur, and the computing time increases as a function of the factorial of the number of observations, x. This point was not made in the original publication [11]. The convolution method is very complex to implement and has been described as "cumbersome" [12]; indeed, we found that its implementation was not straightforward and was even counterintuitive. The method uses a sum to infinity, and the condition to stop the recursion was not defined in the original publication.
Our solution for the computation of the convolution method, presented here, is novel and has proven robust in extensive testing. A naïve tolerance condition had been used by one of the authors of this note (MG) (recursion stops when the change is lower than the tolerance limit), as in [4,5], but the other author (JB) found that the outputs can be strongly biased in some conditions. This was the beginning of a collaboration between the two authors to understand and solve the origin of this bias. We detected two problems: (1) the tolerance check must be applied only when the first-order change of the estimate is negative (the convergence criterion being adaptive), and (2) the value of the tolerance must be proportional to the expected estimate. It was therefore necessary to have an estimate of the density in order to set the tolerance used to better estimate that same density. To solve this, we used the saddlepoint approximation of the density. We show that the relative error of this approximation can be on the order of 5%, too high for many applications, but sufficiently low to define a correct tolerance to be used with the convolution method.

Normal and Negative Binomial Approximations
When working on the sum of variables, the first thought is to use the central limit theorem [13], which establishes that, in many situations, the distribution of the sum of independent random variables tends toward a normal distribution. An alternative is to model the distribution of the sum of NB variables as a single NB distribution, according to Theorem 2 proposed by Makun, Abdulganiyu, Shaibu, Otaru, Okubanjo, Kudi, and Notter [6]; however, the sum is in fact a mixture of NB distributions [9].

Finite-Sum Exact Expression
An exact form for the distribution of the sum of NBs is:

P(S_n = x) = ∑_{m_1 + ··· + m_n = x} ∏_{j=1}^{n} P(X_j = m_j), (5)

where the sum is over all n-tuples of non-negative integers (m_1, . . . , m_n) with m_1 + ··· + m_n = x. The expression (5) is compact, and the exact value can be computed [11].

Approximation by Convolution
When X_i ∼ NB(r_i; p_i), with i from 1 to n, the distribution of S_n = ∑ X_i is a mixture of NBs [9], with the probability mass function being approximated by Equation (6). Equation (6) is used iteratively, with k being the counter of the rank of the iterations, but a condition to stop the iterations when a certain level of approximation is reached was not defined in the original publication [9].
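Although Furman's recursion itself is not reproduced here, the distribution it approximates can be checked by brute force: direct numerical convolution of the truncated individual pmfs. The sketch below is such a brute-force check in Python/scipy, not Furman's algorithm; `sum_nb_pmf` is our illustrative name:

```python
import numpy as np
from scipy.stats import nbinom

def sum_nb_pmf(r, p, x_max):
    """P(S_n = x) for x = 0..x_max by direct numerical convolution of the
    individual NB pmfs (a brute-force check, not Furman's recursion)."""
    support = np.arange(x_max + 1)
    pmf = nbinom.pmf(support, r[0], p[0])
    for ri, pi in zip(r[1:], p[1:]):
        pmf = np.convolve(pmf, nbinom.pmf(support, ri, pi))[: x_max + 1]
    return pmf

# Heterogeneous p_i: no simple closed form, but the mass must sum to ~1
# once x_max is far beyond the mean of the sum.
pmf = sum_nb_pmf([1, 2], [0.1, 0.2], 200)

# Sanity check: with equal p the result must match NB(r1 + r2, p).
pmf_eq = sum_nb_pmf([3, 4], [0.4, 0.4], 100)
```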

Saddlepoint Approximation
The saddlepoint approximation method provides a highly accurate approximation of the probability density function (for a continuous distribution) or probability mass function (for a discrete distribution), based on the moment-generating function [14].
Taking the log of the moment-generating function of the NB distribution (2) and summing over n independent NB variables, the cumulant generating function of the sum of NBs is:

K(t) = ∑_{i=1}^{n} r_i [ln p_i − ln(1 − (1 − p_i) e^t)].

The first and second derivatives of K(t) are:

K′(t) = ∑_{i=1}^{n} r_i (1 − p_i) e^t / (1 − (1 − p_i) e^t) and K″(t) = ∑_{i=1}^{n} r_i (1 − p_i) e^t / (1 − (1 − p_i) e^t)^2.

The saddlepoint, s_x, is found by solving K′(s_x) = x. Once s_x is found, P(S_n = x) can be approximated by:

P(S_n = x) ≈ exp(K(s_x) − s_x x) / √(2π K″(s_x)).

The values P(S_n = x) are then normalized to ensure that ∑_x P(S_n = x) = 1.
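The recipe above can be sketched directly. The following Python code assumes scipy's `brentq` for the root-finding (the paper also uses Brent's algorithm) and, as a choice of ours, handles x = 0 exactly via P(S_n = 0) = ∏ p_i^{r_i}, since K′(s) = x has no finite solution at x = 0; the parameters are illustrative, not the paper's examples:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import nbinom

r = np.array([2.0, 5.0])
p = np.array([0.3, 0.5])
q = 1 - p

def K(t):
    """Cumulant generating function of S_n, summed from the NB MGFs (2)."""
    return float(np.sum(r * (np.log(p) - np.log(1 - q * np.exp(t)))))

def K1(t):
    """First derivative K'(t)."""
    e = np.exp(t)
    return float(np.sum(r * q * e / (1 - q * e)))

def K2(t):
    """Second derivative K''(t)."""
    e = np.exp(t)
    return float(np.sum(r * q * e / (1 - q * e) ** 2))

# K(t) is finite only while q_i * exp(t) < 1 for every i.
t_hi = np.log(1 / q.max()) - 1e-9

def saddlepoint_pmf(x):
    """Unnormalized saddlepoint approximation of P(S_n = x), x >= 1."""
    s = brentq(lambda t: K1(t) - x, -50.0, t_hi)  # solve K'(s) = x
    return np.exp(K(s) - s * x) / np.sqrt(2 * np.pi * K2(s))

# Normalize so the total mass is 1; x = 0 is handled exactly (our choice).
p0 = float(np.prod(p ** r))                  # exact P(S_n = 0)
xs = np.arange(1, 200)
raw = np.array([saddlepoint_pmf(x) for x in xs])
pmf = raw * (1 - p0) / raw.sum()

# Reference value by direct numerical convolution of the two pmfs.
support = np.arange(200)
exact = np.convolve(nbinom.pmf(support, r[0], p[0]),
                    nbinom.pmf(support, r[1], p[1]))[:200]
rel_err = abs(pmf[9] - exact[10]) / exact[10]  # x = 10, near the mean
```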
In the remainder of this note, we describe the computational problems that applied statisticians or practitioners face in implementing the distribution of the sum of independent NB variables using the finite-sum exact expression [11], the convolution method [9], the saddlepoint approximation, or the approximation by normal and NB distributions. We describe how these have been overcome in the publicly available R package (HelpersMG package version 5.9 and higher; https://CRAN.R-project.org/package=HelpersMG, accessed on 6 February 2023). The code can be checked after loading this package with the command ?dSnbinom. Figure 1 gives two examples of the sum S of independent NB random variables, and how these distributions are approximated using the four methods (convolution, saddlepoint, single normal, single NB) outlined in this note. In (A), we use n = 10, j = 1 . . . n, p_j = 0.4 + j/10 and r_j = j × 10, and in (B), n = 2, p_j = j/10, and r_j = j.

Normal and Negative Binomial Approximations
When n is large and the standard deviation is small compared to the mean, the normal approximation, with P(S_n = x) = ∫_{x−0.5}^{x+0.5} N(µ, σ) du, where N(µ, σ) is the normal probability density function with µ = mean(S_n) and σ = √var(S_n), can be used as an approximation for the distribution of the sum of independent NB random variables (Figure 1A). However, for a small n or a large standard deviation compared to the mean (corresponding to a highly skewed distribution), the approximation can be very poor (Figure 1B). The NB distribution modeled with the probability density function NB(µ, θ), such that µ = mean(S_n) and θ = mean(S_n)^2/(var(S_n) − mean(S_n)), better fits the exact distribution of the sum of NB variables, but still with a bias (Figure 1B). This confirms that the distribution of the sum of independent NB variables is not an NB, as wrongly stated in [6]. It is, rather, a mixture of NBs (see below) [9]. In summary, the normal and NB approximations generate the highest errors (>30% in some cases), and they should not be used, especially as there are better alternatives.
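Both moment-matched approximations can be written down in a few lines; the sketch below uses illustrative parameters chosen to give a strongly skewed sum (not those of Figure 1), and checks that the single-NB match is closer to the exact pmf than the normal one at x = 0:

```python
import numpy as np
from scipy.stats import nbinom, norm

# Illustrative parameters: small n, standard deviation close to the mean.
r = np.array([1.0, 2.0])
p = np.array([0.1, 0.2])
mu = float(np.sum(r * (1 - p) / p))        # mean(S_n), Equation (4)
var = float(np.sum(r * (1 - p) / p ** 2))  # var(S_n), Equation (4)
sd = np.sqrt(var)

x = np.arange(100)
# Normal approximation with continuity correction: integrate N(mu, sd)
# between x - 0.5 and x + 0.5.
approx_norm = norm.cdf(x + 0.5, mu, sd) - norm.cdf(x - 0.5, mu, sd)

# Moment-matched single NB: theta = mu^2 / (var - mu), so the matched NB
# has the same mean and variance as S_n.
theta = mu ** 2 / (var - mu)
approx_nb = nbinom.pmf(x, theta, theta / (theta + mu))

# Exact pmf by direct numerical convolution, for comparison.
exact = np.convolve(nbinom.pmf(x, r[0], p[0]),
                    nbinom.pmf(x, r[1], p[1]))[:100]
```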

Finite-Sum Exact Expression
This method permits the calculation of the exact value of P(S_n = x). It will therefore be used as a reference here.
For the finite-sum exact expression method [11], a table of n columns with all the combinations of integers from 0 to x that produce a sum of x (m_1 + ··· + m_n = x) must first be established. The number of different ways to distribute x indistinguishable objects into n distinguishable categories is C(x + n − 1, n − 1). This is the memory-consuming part of the Vellaisamy and Upadhye [11] method. The density P(X = x) in Equation (1) is then calculated n times for each of these combinations in the final table (the ∏_{j=1}^{n} part of Equation (5)). This is the computationally time-consuming part of the method.
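The combination table and product step can be sketched as follows; `compositions` and `exact_sum_pmf` are our illustrative names, not from the package:

```python
from math import comb, prod

from scipy.stats import nbinom

def compositions(x, n):
    """All n-tuples of non-negative integers summing to x."""
    if n == 1:
        yield (x,)
        return
    for first in range(x + 1):
        for rest in compositions(x - first, n - 1):
            yield (first,) + rest

def exact_sum_pmf(x, r, p):
    """Equation (5): sum, over all compositions of x, of the product of
    the individual NB pmfs."""
    return sum(prod(float(nbinom.pmf(m, ri, pi))
                    for m, ri, pi in zip(m_vec, r, p))
               for m_vec in compositions(x, len(r)))

# With equal p the sum is NB(r1 + r2, p), so (5) must agree with it.
value = exact_sum_pmf(5, [3, 4], [0.4, 0.4])
# The number of terms is C(x + n - 1, n - 1): the memory-hungry part.
n_terms = sum(1 for _ in compositions(5, 2))
```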


Approximation by Convolution
The coefficients of Equation (6) are iteratively defined, and we rewrite the published formula to make the computation more efficient using the recursion in Equation (11). Intermediate estimates in (11) use the log of the expressions to prevent computational overflow. The conditions to stop the iterations were not defined in Furman's original publication [9].
A typical method in such situations is to stop the recursion when the change in the final output is below a defined tolerance. However, this cannot be used in the context of Equations (6) or (11) because, at the beginning of the iterations, the change in P(S_n = x) is sometimes so small that the recursion would be stopped immediately and the resulting probability P(S_n = x) would be biased. An example of this is shown in Figure 2A, which shows the value of P(S_7 = 6)_k (n = 7, j = 1 . . . n, p_j = j/10, and r_j = j) as a function of the recursive iterations k from Equation (6). For the first eight iterations, the change in P(S_7 = 6)_k is very small. To alleviate this problem, many iterations can be used, but without being sure that they are sufficient, and at the expense of running time. This solution was chosen with at least 1000 iterations in [5], but this number of iterations is not always large enough to ensure a correct estimate when n or x are large. A better approach came from studying the trend of the rate of change of P(S_n = x) according to the rank of iteration k: P_k − P_{k−1} vs. P_{k+1} − P_k, where P_k denotes the value of P(S_n = x) at iteration k. In its initial phase, the rate of change of P is positive, with P_{k+1} − P_k > P_k − P_{k−1}; it then shows a peak and becomes negative, with P_{k+1} − P_k < P_k − P_{k−1} (Figure 2B). The tolerance threshold must be applied only after the occurrence of the peak, to ensure that the phase of rapid change of P has been passed. The number of iterations before the peak depends on the values of n, x, r_i and p_i, and cannot be easily anticipated at the beginning of the iterations.
We have therefore developed an adaptive strategy that stops the recursion when two conditions are met: the rate of change of P(S_n = x) is negative, and the change in P(S_n = x) is less than the user-defined tolerance. The tolerance value must be lower than P(S_n = x), or the output will be biased. As an example, if µ = (0.01, 0.02, 0.03) and θ = (2, 2, 2), then P(S_3 = 20) = 7.73139 × 10^−35 using the exact method. If the Furman method is used with the tolerance set to 10^−12, P(S_3 = 20) = 3.879379 × 10^−35, which is two times lower than the exact answer. The solution is to define a tolerance much lower than the anticipated result; for example, here, with the tolerance being 10^−45, P(S_3 = 20) = 7.73139 × 10^−35, which is the correct probability. This can be done using the saddlepoint estimate (see below).
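The two-condition stopping rule can be isolated as a small helper; `should_stop` is a hypothetical name, and the numeric series below is synthetic, mimicking the plateau-then-rise behavior of Figure 2A:

```python
def should_stop(history, tol):
    """Adaptive rule sketched from the text: stop only once the increments
    are decreasing (the peak of the rate of change has passed) AND the
    latest change is below the tolerance."""
    if len(history) < 3:
        return False
    d_prev = history[-2] - history[-3]
    d_last = history[-1] - history[-2]
    return d_last < d_prev and abs(d_last) < tol

# Synthetic running estimates: a long near-flat plateau, then rapid
# growth, then convergence.
series = [0.0, 1e-15, 2e-15, 1e-3, 3e-3, 4e-3, 4.5e-3, 4.5001e-3]
stopped_at = None
history = []
for k, value in enumerate(series):
    history.append(value)
    if should_stop(history, tol=1e-6):
        stopped_at = k
        break
# A naive |change| < tol rule would have stopped during the plateau
# (k = 2), yielding a badly biased estimate; the adaptive rule waits
# until the increments are both decreasing and below tolerance.
```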
The comparison of the results obtained by Equation (6), with adaptive stopping of the recursion and the tolerance set up using the saddlepoint approximation (see below), and by Equation (5), is shown in Table 1, with the corresponding computing times. This table is similar to Tables 1 and 2 in Vellaisamy and Upadhye [11].

Saddlepoint Approximation
The saddlepoint approximation (we used Brent's algorithm [15] for the minimization needed to find the saddlepoint) is computationally fast. However, the estimate must be normalized so that the density function sums to 1 [16]. The normalization used the sum of P(S_n = x) for x from 0 to mean(S_n) + Max × √var(S_n), with mean(S_n) and var(S_n) from Equation (4) and Max = 20. A test was performed to ensure that P(S_n = mean(S_n) + Max × √var(S_n)) was 0; otherwise, Max was increased until this condition was reached. The relative difference between the exact value of P(S_n = x) and the saddlepoint approximation can sometimes be on the order of 5% (Figure 1). On the other hand, this approximation is good enough to set the tolerance of the approximation by convolution, using a tolerance equal to P_saddlepoint(S_n = x) × 10^−10.
The tolerance value used to stop the iterations of the approximate Furman [9] method must be of the same order as the value of P(S_n = x) multiplied by 10^−10, to have an estimate precise up to the 10th digit. The difficulty is that P(S_n = x) is precisely what needs to be estimated. The chosen solution was to first obtain a rough estimate of P(S_n = x) from the very fast saddlepoint approximation method. This approach proved to be very efficient, because the estimates of the approximate Furman [9] method are exactly the same as those of the exact method (Table 1A).
Equation (5) has the advantage of being parallelizable, but for large n and x (see Table 1A), it requires a very large number of terms and a large amount of memory. Equations (6) and (11), however, do not suffer from these problems. Vellaisamy and Upadhye [11] indicated that Equation (5) required less computing time than Equation (6), even for n = 7 and n = 15. This would be true only if the authors used a very large number of iterations in Equation (6), or if their conditions to stop the iterations were sub-optimal.
As a general conclusion, we consider that the approximate form of the distribution of the sum of independent NB variables proposed by Furman [9] should be used in all contexts, whatever the parameters n, x, p_i or r_i. The tolerance can be set using the value of P(S_n = x) estimated by the saddlepoint approximation method. This solution is used by default in the R package HelpersMG (version > 5.9), available in CRAN: The Comprehensive R Archive Network [17].

Data Availability Statement: All methods used in this note, as well as an approximation using the generation of random numbers, are available in the functions dSnbinom, pSnbinom, qSnbinom, and rSnbinom in the R package HelpersMG (version > 5.9), available in CRAN: The Comprehensive R Archive Network [17].