A FOUR-PARAMETER NEGATIVE BINOMIAL-LINDLEY REGRESSION MODEL TO ANALYZE FACTORS INFLUENCING THE NUMBER OF CANCER DEATHS USING BAYESIAN INFERENCE

TONGGUMNEAD, KLINJAN, TANPRAYOON, ARYUYUEN


INTRODUCTION
Statistical models are essential in engineering, medical, biological science, management and public health fields, providing helpful information to draw conclusions and make decisions. Linear models describe a continuous response variable as a function of one or more predictor variables.
Although widely used, linear models have certain limitations; for example, the dependent variable must be continuous (quantitative), the errors are assumed to be normally distributed and the observations are assumed to be independent of one another. Generalized linear models (GLM), being more flexible, were developed from the general linear model to offer better coverage. A general linear model can be described as a linear regression model for a continuous response variable with continuous or categorical predictors. For a count response variable, the general linear model is not appropriate because the dependent variable must be continuous. Therefore, the GLM is used to model discrete response variables. Several statistical models have been developed to better understand count response variables, such as the Poisson (Pois) and negative binomial (NB) models. The Pois model is an appropriate choice for repeated count data. However, this model is often unrealistic because of the restriction that the mean and variance are equal, whereas the NB model effectively manages overdispersion of longitudinal data [1] and also overcomes the constraints of modeling count data with overdispersion [2][3][4][5]. The NB distribution is suitable for overdispersed count data that are not necessarily heavy-tailed, although heavy-tailed distributions often exhibit overdispersion [6].
Later, new distributions were developed to provide more flexibility and coverage. One of the most widely used is the mixed NB distribution. Many mixed NB distributions have been introduced such as the NB-Lindley [7], NB-beta exponential [8], NB-generalized exponential [9], NB-gamma [10], NB-Sushila [11] and NB-generalized Lindley [12]. Recently, [13] proposed a new mixed NB distribution, namely a four-parameter negative binomial-Lindley (NBL) distribution for describing over- and underdispersed count data. Mixed NB distributions are applied to model count data arising in real life, such as in actuarial and insurance models [7,10,14,15], medical or industrial models [14] or in the fields of ecology and biodiversity [16][17][18].
Parameters in mixed NB regression models have been estimated using the Bayesian framework as a more flexible approach than maximum likelihood estimation [19][20]. The difference between the maximum likelihood and Bayesian methods is that, in the Bayesian method, the underlying parameters are considered random variables characterized by a prior distribution [21]. Bayesian inference has also been studied for mixed NB models such as the NB-Lindley [22], NB-generalized exponential [23], NB-Sushila [24], NB-generalized Lindley [12], NB-Quasi Lindley [17], NB-modified Quasi Lindley [16] and exponential [21] linear regression models.
This article analyzed factors influencing the number of cancer deaths in Thailand using the regression model for one mixed NB distribution, namely a four-parameter negative binomial-Lindley (NBL) distribution proposed by [13]. Their results showed that the four-parameter NBL distribution outperformed the Pois and NB distributions when fitting count data with overdispersion and a large number of zeros. However, the four-parameter NBL distribution has never been developed as a regression model. Therefore, this study applied the four-parameter NBL distribution under the GLM framework. Parameters of the proposed regression model were estimated using the Bayesian approach to compare the efficiency against some traditional regression models.
The rest of this paper is organized as follows. Section 2 presents an overview of the Pois, NB and four-parameter NBL distributions and also describes the generalized linear regression model.

PRELIMINARIES
This section introduces an overview of traditional distributions for count data such as the Pois, NB and four-parameter NBL distributions. The generalized linear regression model is also described.

The Pois distribution
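Let $Y$ be a random variable following the Pois distribution with parameter $\lambda$, denoted by $Y \sim \mathrm{Pois}(\lambda)$. Its probability mass function (pmf) is $f(y; \lambda) = \dfrac{e^{-\lambda} \lambda^{y}}{y!}$ for $y = 0, 1, 2, \ldots$ and $\lambda > 0$.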
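The equality of the mean and variance under the Pois distribution can be checked numerically. The following is a minimal sketch, assuming the standard Pois pmf $e^{-\lambda}\lambda^{y}/y!$; the function names, the rate value and the truncation point are illustrative choices, not taken from the paper.

```python
import math


def pois_pmf(y: int, lam: float) -> float:
    """Poisson pmf exp(-lam) * lam**y / y!, evaluated on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def moments(pmf, upper: int):
    """Mean and variance of a count pmf, truncating the support at `upper`."""
    ys = range(upper + 1)
    mean = sum(y * pmf(y) for y in ys)
    var = sum((y - mean) ** 2 * pmf(y) for y in ys)
    return mean, var


lam = 3.5  # hypothetical rate, chosen only for illustration
pois_mean, pois_var = moments(lambda y: pois_pmf(y, lam), upper=200)
print(pois_mean, pois_var)  # both should be numerically equal to lam
```

Observed counts whose sample variance clearly exceeds the sample mean therefore contradict the Pois assumption, which motivates the NB and mixed NB alternatives.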

The NB distribution
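Let $Y$ be a random variable following the NB distribution with parameters $r$ and $p$, denoted by $Y \sim \mathrm{NB}(r, p)$. Then its pmf is given by $f(y; r, p) = \dbinom{r + y - 1}{y} p^{r} (1 - p)^{y}$ for $y = 0, 1, 2, \ldots$, $r > 0$ and $0 < p < 1$.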
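How the NB distribution accommodates overdispersion can be illustrated with a short numerical check. This sketch assumes the parameterization in which $p$ multiplies as $p^{r}(1-p)^{y}$, so that the mean is $r(1-p)/p$ and the variance is $r(1-p)/p^{2}$; the parameter values are hypothetical.

```python
import math


def nb_pmf(y: int, r: float, p: float) -> float:
    """NB pmf Gamma(r+y) / (Gamma(r) * y!) * p**r * (1-p)**y, for 0 < p < 1."""
    log_coef = math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1)
    return math.exp(log_coef + r * math.log(p) + y * math.log1p(-p))


r, p = 2.0, 0.4  # hypothetical values, chosen only for illustration
ys = range(2000)  # truncation point; the remaining tail mass is negligible here
nb_mean = sum(y * nb_pmf(y, r, p) for y in ys)
nb_var = sum((y - nb_mean) ** 2 * nb_pmf(y, r, p) for y in ys)

print(nb_mean, r * (1 - p) / p)      # numerical mean vs. r(1-p)/p
print(nb_var, r * (1 - p) / p ** 2)  # numerical variance vs. r(1-p)/p^2
```

Because the variance equals the mean divided by $p$, it always exceeds the mean for $0 < p < 1$.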

The four-parameter NBL distribution
The four-parameter NBL distribution was proposed by [13] in 2022 as a mixture between the NB distribution and a three-parameter Lindley (L3) distribution with parameters $a$, $b$ and $c$, i.e. $Y \mid r, \lambda \sim \mathrm{NB}(r, p = \exp(-\lambda))$ and $\lambda \sim \mathrm{L3}(a, b, c)$. The L3 distribution was proposed by [25] with a probability density function (pdf) as follows:
Let $Y$ be a random variable following the four-parameter NBL distribution with parameters $r$, $a$, $b$ and $c$. Then its pmf is given by
The four-parameter NBL distribution is versatile as it nests several distributions when specific parameters are fixed [13]. These special cases are (i) a three-parameter NBL distribution (for $a = 1$) proposed by [26] and (ii) an NBL distribution (for $a = b$) proposed by [7].
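The mixture construction can be sketched numerically for the simplest nested case. The code below does not use the L3 density of [25]; it substitutes the one-parameter Lindley density $\theta^{2}(1+\lambda)e^{-\theta\lambda}/(\theta+1)$ as the mixing distribution, which corresponds to the NBL special case of [7]. It integrates $\mathrm{NB}(r, p = \exp(-\lambda))$ against that density and compares the result with the closed form obtained from the binomial expansion of $(1 - e^{-\lambda})^{y}$. All function names and parameter values are hypothetical.

```python
import math


def nb_pmf(y, r, p):
    """NB pmf with p entering as p**r * (1-p)**y (assumed parameterization)."""
    coef = math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1))
    return coef * p ** r * (1.0 - p) ** y


def lindley_pdf(lam, theta):
    """One-parameter Lindley density, used here in place of the L3 density."""
    return theta ** 2 / (theta + 1.0) * (1.0 + lam) * math.exp(-theta * lam)


def nbl_pmf_numeric(y, r, theta, upper=40.0, n=20000):
    """Mixture pmf by composite Simpson integration over lam in [0, upper]."""
    h = upper / n  # n must be even for Simpson's rule

    def g(lam):
        return nb_pmf(y, r, math.exp(-lam)) * lindley_pdf(lam, theta)

    total = g(0.0) + g(upper)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(i * h)
    return total * h / 3.0


def nbl_pmf_closed(y, r, theta):
    """Closed form: alternating sum over the Lindley Laplace transform."""
    coef = math.exp(math.lgamma(r + y) - math.lgamma(r) - math.lgamma(y + 1))
    s = sum(
        math.comb(y, j) * (-1) ** j
        * theta ** 2 / (theta + 1.0)
        * (theta + r + j + 1.0) / (theta + r + j) ** 2
        for j in range(y + 1)
    )
    return coef * s


r, theta = 1.5, 2.0  # hypothetical values
pairs = [(nbl_pmf_numeric(y, r, theta), nbl_pmf_closed(y, r, theta)) for y in range(5)]
for y, (numeric, closed) in enumerate(pairs):
    print(y, numeric, closed)
```

The same numerical integration would apply to the four-parameter case once `lindley_pdf` is replaced by the L3 density, at the cost of losing this particular closed form.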

The generalized linear regression model
The GLM, originally introduced by [27], allows modeling of a wide range of probability distributions for response variables such as binomial, Pois and exponential distributions. The difference between a traditional linear regression model and the GLM is that the mean of the response variable in a GLM is related to the linear predictor through a link function, and the response is not assumed to be normally distributed. This link function allows for modeling non-normal response variables while using a linear combination of the predictor variables.
The GLM consists of three main components as follows:
(1) A random component, which specifies the conditional distribution of a response variable $Y_i$, for the $i$th of $n$ independently sampled observations, given the values of the explanatory variables. The response, with mean $\mathrm{E}(Y_i) = \mu_i$, is assumed to follow a certain probability distribution such as the binomial, Pois or NB. The choice of distribution depends on the type of response variable and the nature of the data.
(2) A systematic component, which specifies a linear predictor $\eta_i$ as a linear combination of the $k$ explanatory variables $X_{i1}, \ldots, X_{ik}$. The relationship is assumed to be $\eta_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_k X_{ik}$, where $\beta_0, \beta_1, \ldots, \beta_k$ are unknown regression coefficients to be estimated.
(3) A link function $g(\cdot)$, which relates the mean of the response to the linear predictor, $g(\mu_i) = \eta_i$. Its corresponding inverse transformation is $\mu_i = g^{-1}(\eta_i) = g^{-1}(\mathbf{x}_i^{T} \boldsymbol{\beta})$, where $\mathbf{x}_i = (1, x_{i1}, x_{i2}, \ldots, x_{ik})^{T}$ is a vector of length $(k+1)$ and $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)^{T}$ is the vector of regression coefficients.
Under this framework, the pmf of $Y_i$ can be rewritten in terms of the linear predictor, together with the mean and variance of $Y_i$. The traditional NB distribution [2] can be rewritten to show its pmf in the same way. If $\mu_i = \exp(\mathbf{x}_i^{T} \boldsymbol{\beta})$, the mean of the response can be calculated using the conditional expectation, where $\mathrm{E}(\lambda)$ and $\mathrm{E}(\lambda^{2})$ are the first and second moments about the origin of the L3 random variable [25].
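The relationship between the linear predictor and the mean under a log link can be sketched as follows. The coefficient and covariate values are hypothetical, and the helper names are illustrative only.

```python
import math


def linear_predictor(x, beta):
    """eta_i = x_i^T beta, where x_i starts with 1.0 for the intercept."""
    return sum(b * v for b, v in zip(beta, x))


def log_link(mu):
    """Link function g(mu) = log(mu)."""
    return math.log(mu)


def inverse_log_link(eta):
    """Inverse link mu = exp(eta); guarantees a positive mean for counts."""
    return math.exp(eta)


beta = [0.5, 0.2, -0.1]  # hypothetical (beta_0, beta_1, beta_2)
x_i = [1.0, 2.0, 3.0]    # hypothetical covariates with leading intercept term

eta_i = linear_predictor(x_i, beta)  # 0.5 + 0.2*2 - 0.1*3
mu_i = inverse_log_link(eta_i)
print(eta_i, mu_i)
```

With the log link, a one-unit increase in a covariate multiplies the mean by the exponential of its coefficient, which is why coefficients in count regression are usually interpreted as rate ratios.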

Bayesian inference for the four-parameter NBL regression model
The vector of unknown parameters $\boldsymbol{\Omega}$ in equation (12) was estimated using the Bayesian approach, which incorporates prior information into parameter estimation. The Bayesian approach was implemented for the four-parameter NBL regression model through hierarchical Bayesian modeling relying on Markov chain Monte Carlo (MCMC) techniques [23,28].
Accordingly, under a squared error loss function, the Bayesian estimator of $\boldsymbol{\Omega}$ is the posterior mean $\mathrm{E}(\boldsymbol{\Omega} \mid \mathbf{y})$.
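The statement that the posterior mean is the Bayes estimator under squared error loss can be illustrated with a handful of posterior draws. The draws below are hypothetical values for a single scalar parameter; the sketch only shows that shifting the estimate away from the sample mean increases the average squared loss.

```python
import statistics


def expected_sq_loss(estimate, draws):
    """Monte Carlo estimate of the posterior expected squared error loss."""
    return sum((d - estimate) ** 2 for d in draws) / len(draws)


# Hypothetical posterior draws of one parameter (not from the paper).
draws = [0.8, 1.1, 0.9, 1.4, 1.0, 1.2, 0.7, 1.3]
post_mean = statistics.fmean(draws)

# Compare the posterior mean with nearby candidate estimates.
candidates = [post_mean + shift for shift in (-0.2, -0.05, 0.0, 0.05, 0.2)]
losses = [expected_sq_loss(c, draws) for c in candidates]
best = candidates[losses.index(min(losses))]
print(post_mean, best)
```

In MCMC practice, this is why each parameter is summarized by the average of its retained draws.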
The four-parameter NBL distribution is conditional upon the unobserved site-specific frailty term $\lambda$ in equation (11), which describes the additional heterogeneity [22]. Consequently, the hierarchical framework can be represented as in equation (15), in which the frailty term follows the L3 distribution with parameters $a$, $b$ and $c$. The unknown parameters $r$, $a$, $b$, $c$ and $\boldsymbol{\beta}$ in equation (12) are assigned prior distributions, where the prior for $\boldsymbol{\beta}$ involves a $(k+1) \times (k+1)$ known non-negative specific matrix. If each parameter is independently distributed, the joint prior distribution of all unknown parameters can be written as the product of the individual prior densities. From equations (13) and (16), the posterior distribution is derived. Because the posterior distribution does not have an explicit form, a Gibbs sampler, one of the best-known MCMC sampling algorithms, was used in this study to find $\mathrm{E}(\boldsymbol{\Omega} \mid \mathbf{y})$. The model parameter $\boldsymbol{\Omega}$ was then estimated by the Bayesian method [32][33].
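The idea of Gibbs sampling in a hierarchical count model with a latent frailty term can be sketched on a much simpler model than the one used in this study. The toy model below is an assumption made purely for illustration: $y_i \mid \lambda_i \sim \mathrm{Pois}(\lambda_i)$, $\lambda_i \mid b \sim \mathrm{Gamma}(r, \text{rate } b)$ with $r$ fixed, and $b \sim \mathrm{Gamma}(a_0, \text{rate } b_0)$. Both full conditionals are gamma distributions, so each can be sampled directly. The data and hyperparameters are hypothetical, and this is not the JAGS model fitted in the paper.

```python
import random
import statistics


def gibbs_poisson_gamma(y, r, a0, b0, n_iter, seed=1):
    """Gibbs sampler for a toy Poisson-gamma hierarchy (illustration only)."""
    rng = random.Random(seed)
    n = len(y)
    b = 1.0  # initial value for the rate parameter
    draws_b = []
    for _ in range(n_iter):
        # lam_i | y_i, b ~ Gamma(shape = r + y_i, rate = b + 1);
        # random.gammavariate takes (shape, scale), so scale = 1 / rate.
        lam = [rng.gammavariate(r + yi, 1.0 / (b + 1.0)) for yi in y]
        # b | lam ~ Gamma(shape = a0 + n * r, rate = b0 + sum(lam))
        b = rng.gammavariate(a0 + n * r, 1.0 / (b0 + sum(lam)))
        draws_b.append(b)
    return draws_b


y = [0, 1, 3, 0, 2, 7, 1, 0, 4, 2]  # hypothetical counts
draws = gibbs_poisson_gamma(y, r=2.0, a0=1.0, b0=1.0, n_iter=4000)
kept = draws[2000:]              # discard the first half as burn-in
b_hat = statistics.fmean(kept)   # posterior mean estimate of b
print(b_hat)
```

When a full conditional is not a standard distribution, as is likely for the L3 mixing parameters, general-purpose software such as JAGS chooses a suitable sampling method for that node automatically.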
Model performance was compared based on three criteria, including the deviance and the deviance information criterion (DIC). The DIC is regarded as a generalization of Akaike's information criterion and the Bayesian information criterion, and is widely used as a goodness-of-fit measure in the Bayesian approach. The DIC is particularly useful for Bayesian model comparison problems where the posterior distributions have been obtained by MCMC simulations [5,34]. The DIC is defined as $\mathrm{DIC} = \bar{D} + p_D$, where $\bar{D}$ is the posterior mean of the deviance and $p_D$ is the effective number of parameters.
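The computation of the DIC from MCMC output can be sketched for a simple case. The example assumes the common definition $\mathrm{DIC} = \bar{D} + p_D$ with $p_D = \bar{D} - D(\bar{\theta})$, where $D(\theta) = -2 \log L(\mathbf{y} \mid \theta)$, and applies it to an i.i.d. Pois model with a single rate. The counts and posterior draws are hypothetical.

```python
import math
import statistics


def poisson_deviance(y, lam):
    """Deviance D(lam) = -2 * log-likelihood of i.i.d. Poisson counts."""
    loglik = sum(-lam + yi * math.log(lam) - math.lgamma(yi + 1) for yi in y)
    return -2.0 * loglik


def dic(y, draws):
    """Return (D_bar, p_D, DIC) from posterior draws of the Poisson rate."""
    d_bar = statistics.fmean([poisson_deviance(y, lam) for lam in draws])
    d_hat = poisson_deviance(y, statistics.fmean(draws))  # deviance at posterior mean
    p_d = d_bar - d_hat                                   # effective number of parameters
    return d_bar, p_d, d_bar + p_d


y = [0, 1, 3, 0, 2, 7, 1, 0, 4, 2]            # hypothetical counts
draws = [1.6, 2.1, 1.9, 2.4, 2.0, 2.3, 1.8]   # hypothetical posterior draws
d_bar, p_d, dic_value = dic(y, draws)
print(d_bar, p_d, dic_value)
```

Lower DIC values indicate a better trade-off between fit and complexity, so candidate models fitted to the same data are ranked by this value.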

Statistical modeling for empirical data
This section first describes the characteristics of the real dataset and then presents the modeling results using a Bayesian approach for the four-parameter NBL regression model, comparing its performance with that of the other models.

Data description
Data used as the dependent variable in this study were the number of cancer deaths in each province in Thailand in 2021 ($Y$) [35]. The five independent variables are described below.

Modeling results
Three parallel, independent MCMC chains of 100,000 iterations each were generated for each parameter based on these prior densities, discarding the first 50,000 iterations as burn-in. The posterior expectations of the parameters were calculated using the jags function in the R2jags package of the R language [32][33]. The deviance, DIC and
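The handling of burn-in across parallel chains can be sketched as follows. The chains are simulated with an artificial initial transient rather than produced by a real sampler, the chain length is reduced for speed, and the Gelman-Rubin statistic is included as one common convergence check; the paper does not state which diagnostic, if any, was used.

```python
import math
import random
import statistics


def discard_burn_in(chains, burn_in):
    """Drop the first `burn_in` iterations from every chain."""
    return [chain[burn_in:] for chain in chains]


def gelman_rubin(chains):
    """Potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    w = statistics.fmean([statistics.variance(c) for c in chains])  # within-chain
    b = n * statistics.variance(chain_means)                        # between-chain
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)


rng = random.Random(7)
n_iter, burn_in = 2000, 1000  # reduced from 100,000 and 50,000 for speed
chains = []
for start in (5.0, -5.0, 0.0):  # three dispersed starting offsets
    # Simulated trace: a transient that decays towards 1.0, plus noise.
    chain = [1.0 + start * math.exp(-t / 50.0) + rng.gauss(0.0, 0.2)
             for t in range(n_iter)]
    chains.append(chain)

kept = discard_burn_in(chains, burn_in)
pooled = [v for c in kept for v in c]  # draws used for posterior summaries
r_hat = gelman_rubin(kept)
retained_in_paper = 3 * (100_000 - 50_000)  # draws retained under the stated settings
print(len(pooled), r_hat, retained_in_paper)
```

Values of the statistic close to 1 suggest that the chains have mixed after burn-in, whereas computing it on the full chains, transient included, would give a much larger value.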

CONCLUSIONS
A regression model based on the four-parameter negative binomial-Lindley (NBL) distribution was developed under the GLM framework. The four-parameter NBL regression model was applied to the number of cancer deaths in Thailand using different factors, in a setting where the dependent variable was overdispersed and heavy-tailed. The proposed model was compared with the traditional Pois and NB models. Results showed that the four-parameter NBL model had the highest efficiency and was more suitable than the NB and Pois models, with the lowest values of the comparison criteria, including the deviance and DIC.