Influence Diagnostic Methods in the Poisson Regression Model with the Liu Estimator

There is a long history of interest in modeling count data with Poisson regression across different fields of study. The focus of this work is on handling the issues that occur after modeling count data. For the prediction and analysis of count data, it is valuable to study the factors that influence the performance of the model and the decisions based on its analysis. In regression analysis, multicollinearity and influential observations, separately and jointly, affect model estimation and inference. In this article, we address multicollinearity and influential observations simultaneously. To evaluate the reliability and quality of regression estimates and to overcome problems in model fitting, we propose new diagnostic methods based on the Sherman–Morrison–Woodbury (SMW) theorem that detect influential observations using approximate deletion formulas for the Poisson regression model with the Liu estimator. A Monte Carlo simulation is conducted to assess the proposed diagnostic methods, and real data are also considered for their evaluation. Results show the superiority of the proposed diagnostic methods in detecting unusual observations in the presence of multicollinearity compared to the traditional maximum likelihood estimation method.


Introduction
Nowadays, several distributions are available in the literature that can be used to remove noise and then predict data. Similarly, there is a long record of interest in modeling count data, which has several applications in biosciences and other disciplines [1][2][3][4]. The focus of this effort is on dealing with the issues that occur after modeling count data. For the prediction and analysis of count data, it is valuable to study the factors that influence the performance of the model and the decisions based on its analysis. Considering suitable statistical modeling, when the dependent variable is count data, one of the most used statistical models is the Poisson regression model (PRM). For accurate statistical inference, standard ordinary least squares (OLS) regression sets some important assumptions on the model's errors [5]. Generally, numerous problems may arise when a count-variable model is estimated by the OLS method because of the level of noise. For the analysis of count data, the PRM provides the most relevant results. According to McCullagh and Nelder [6], the PRM belongs to the family of generalized linear models (GLMs). The maximum likelihood (ML) estimation method is used to estimate the PRM coefficients instead of the OLS method.
In the PRM, when the explanatory variables are linearly correlated, the ML method is very sensitive [7]. Several biased estimators have been introduced in the literature to handle multicollinearity, e.g., the Stein, ridge, Lasso, regularization, and Liu estimators; see [1,3] and [8][9][10] for more details. The most popular is the ridge estimator, but it has some limitations, e.g., selecting the ridge parameter, where the ridge rule is based on two normal distributions; it is a shrinkage rule because it depends on the slope, whereas the Lasso is based on the slope and the intercept. The best choice is to adopt a Liu estimator to avoid the hindrances of the ridge estimator. The Liu estimator excels in this regard as it avoids the disadvantages of the ridge estimator [10]; its main advantage is that it is easy to use and can be written in an explicit, closed form. In the literature, various studies are available for the PRM to overcome the presence of collinearity [7,[11][12][13][14][15][16]].
To evaluate the reliability and quality of regression estimates and to overcome problems in model fitting, diagnostic techniques have been developed. Although regression diagnostics have been developed methodologically and theoretically for linear regression models together with multicollinearity (see [17][18][19][20][21][22][23][24]), only some studies on influence diagnostics in the GLM with uncorrelated explanatory variables are available in the literature. Pregibon [25] proposed influence diagnostics for logistic regression using one-step methods. For further discussion of influence diagnostics in the GLM, see [26][27][28][29][30][31][32].
Work on influence diagnostics in the GLM with correlated explanatory variables is very limited. Özkale et al. [33] proposed the first study on influence diagnostics for logistic ridge regression. Amin et al. [34] worked on influence diagnostics for the gamma ridge regression model. Khan et al. [35] assessed the performance of influence diagnostics in the PRM with a ridge estimator. Recently, Khan et al. [36] examined the superiority of influence diagnostics in the PRM with a two-parameter estimator and, further, Amin et al. [37] discussed influence diagnostics for the inverse Gaussian ridge regression model. The available literature shows that no study of influence diagnostics with the Liu estimator is available in the GLM.
However, Poisson Liu regression (PLR) diagnostics have received no thoughtful attention up till now. Thus, our present work is an effort to fill this gap: we propose diagnostic methods for the PRM under the Liu estimator, which prove to be competitive methods. The remainder of the study is organized as follows: we first focus on the formulation of influence diagnostic measures for the PRM under the Liu estimator (LE). Next, in Sections 4 and 5, we conduct a Monte Carlo study using two, four, and six independent variables to examine the detection percentage of the newly developed diagnostic measures and, finally, we demonstrate the efficacy of the proposed measures with the help of a real-world application.

Model Specification and the Liu Estimator

Suppose the linear model can be written as y = Xβ + ε, where y = (y_i : i = 1, 2, . . . , n) is the vector of observations, X = (x_ij : i = 1, 2, . . . , n, j = 1, 2, . . . , p) is the design matrix, β = (β_j : j = 1, 2, . . . , p) is the vector of unknown parameters, and ε ∼ N_n(0, Iσ²) with ε_i and ε_j (i ≠ j) independent. We assume the observations arise from the systematic component Xβ. The PRM is applicable to real data, especially when the response variable y_i comes in the form of count data. Let y_i follow a Poisson distribution with parameter μ_i. The probability mass function used to describe the relationship when the response variable y_i occurs as count data is P(y_i) = e^(−μ_i) μ_i^(y_i) / y_i!, y_i = 0, 1, 2, . . . .
The PRM belongs to the GLM family with the log link function log(μ_i) = x_i′β = b_0 + b_1 x_i1 + · · · + b_p x_ip, where b_0 is the intercept and b_1, b_2, . . . , b_p are the coefficients. The estimated mean function is μ_i = exp(x_i′β). Here x_i is the i-th row of the matrix of independent variables X_(n×p) and β_(p×1) is the vector of coefficients, where p represents the number of explanatory variables.
Assuming that all y_i are independent, the joint log-likelihood is ℓ(β) = Σ_(i=1)^(n) [y_i x_i′β − exp(x_i′β) − log(y_i!)]. To find the best value of β, we solve ∂ℓ(β)/∂β = Σ_(i=1)^(n) (y_i − exp(x_i′β)) x_i = 0. Since this system of equations is nonlinear, the ML estimator is obtained with the iteratively reweighted least squares (IRLS) algorithm, giving the explicit formula β_ML = (X′WX)^(−1) X′Wz, where W = diag(μ_i) and z is the adjusted response with elements z_i = log(μ_i) + (y_i − μ_i)/μ_i. In the presence of multicollinearity, the X′WX matrix becomes ill conditioned, and because of this problem it is complicated to draw effective inferences. To overcome these effects of multicollinearity, we use the generalization of Liu [6] to define the Poisson Liu regression estimator (PLRE): β_d = (X′WX + I)^(−1)(X′WX + dI) β_ML, 0 < d < 1.
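The IRLS recursion above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation; the function name and numpy-based formulation are our own, and the update is exactly β_new = (X′WX)^(−1) X′Wz with W = diag(μ) and working response z = η + (y − μ)/μ.

```python
import numpy as np

def irls_poisson(X, y, tol=1e-8, max_iter=100):
    """Fit a Poisson regression (log link) by IRLS: a sketch of the
    standard GLM algorithm, beta_new = (X'WX)^{-1} X'Wz."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # adjusted (working) response
        Xw = X * mu[:, None]             # W X without forming diag(mu)
        beta_new = np.linalg.solve(X.T @ Xw, Xw.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

At convergence the score equations Σ (y_i − exp(x_i′β)) x_i = 0 are satisfied, which is a convenient self-check.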

Hat Matrix, Leverage, and Residuals for the PRM.
The hat matrix H is a common measure used to compute leverages. According to Davison and Tsai [39], the hat matrix in the PRM is H = W^(1/2) X (X′WX)^(−1) X′W^(1/2). The diagonal elements of H are interpreted as leverages, i.e., h_ii = diag(H). For the computation of regression diagnostic measures, residuals play the most important role (Belsley et al. [18]). Let χ_i symbolize the Pearson residual; for the PRM we define it as χ_i = (y_i − μ_i)/√μ_i. Similarly, the standardized Pearson residual is χ_i/√(1 − h_ii). Another useful residual that proves to be of great help for detecting unusual observations is the deviance residual. The i-th deviance residual for the PRM is defined by d_i = s_i √(2[y_i log(y_i/μ_i) − (y_i − μ_i)]), where s_i is the sign function [31]. Pregibon [25] was the first to work on logistic regression diagnostic tools and proposed influence diagnostic measures using one-step approximations. The proposed influence diagnostics include Cook's distance, the change in deviance, and the change in Pearson χ². For the PRM, Cook's distance C_i is suggested as C_i = χ_i² h_ii / (p(1 − h_ii)²).
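The hat matrix and residual definitions above can be computed directly. The sketch below assumes ML estimates are already available; the function name is our own, and the formulas are the standard PRM ones quoted in the text (H = W^(1/2)X(X′WX)^(−1)X′W^(1/2), Pearson residual, standardized Pearson residual).

```python
import numpy as np

def prm_hat_and_residuals(X, y, beta_hat):
    """Hat matrix, leverages h_ii, and (standardized) Pearson residuals
    for a fitted Poisson regression model -- an illustrative sketch."""
    mu = np.exp(X @ beta_hat)
    Xw = X * np.sqrt(mu)[:, None]                 # W^{1/2} X
    H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)     # W^{1/2}X(X'WX)^{-1}X'W^{1/2}
    h = np.diag(H)                                # leverages
    chi = (y - mu) / np.sqrt(mu)                  # Pearson residuals chi_i
    chi_std = chi / np.sqrt(1 - h)                # standardized version
    return H, h, chi, chi_std
```

Since H is a projection in the weighted space, its trace equals the number of columns of X, a useful sanity check on the implementation.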

Influence Diagnostic Methods.
The i-th Cook's distance C_i measures the overall change in the fitted model when the i-th observation is deleted from the model. The one-step approximation for the expression (β_ML − β_ML(i)) is β_ML − β_ML(i) = (X′WX)^(−1) x_i (y_i − μ_i)/(1 − h_ii), where W_ii is the i-th diagonal element of the weight matrix after the removal of the i-th observation. Furthermore, substituting this one-step difference into (13) gives an approximation for C_i in terms of χ_i and h_ii alone. Hardin and Hilbe [40] suggested the cut point 4/(n − 1) for detecting unusual observations in the GLM; this threshold is used to specify the detection window in the GLM.
Pregibon [25] suggested the change in Pearson χ² as another influence diagnostic measure to detect influential observations. Applying the one-step approximation, we define Δχ²_i = χ²_0 − χ²_(i) ≈ χ_i²/(1 − h_ii), where χ²_0 represents the squared Pearson residuals of the complete data set and χ²_(i) signifies the squared Pearson residuals of the data set without the i-th observation, respectively. This statistic is employed to study the effect of the i-th observation on the goodness of fit of the model and the estimates. On similar grounds, Pregibon [25] suggested that another statistic for measuring the impact of the i-th observation on the goodness of fit of a model is the change in deviance statistic. Its one-step linear approximation is Δd²_i = d²_0 − d²_(i) ≈ d_i² + h_ii χ_i²/(1 − h_ii), where for the complete data set d²_0 represents the squared deviance residuals and d²_(i) the squared deviance residuals found after the removal of the i-th observation, respectively. We suggest a simplified form of equation (17), replacing χ_i² by d_i², as Δd²_i ≈ d_i²/(1 − h_ii). The cut-off value for the change in deviance statistic to detect unusual observations is 3.84 [25]. The difference of fits (DFFITS) suggested by Belsley et al. [18] is another common influence measure. After deleting the i-th observation, DFFITS assesses the change in the fit of the model.
Computational Intelligence and Neuroscience
For the GLM, it is given in terms of μ_i and μ_i(i), where μ_i represents the predicted regressand of the complete data set and μ_i(i) represents the predicted regressand after deleting the i-th case. By using the SMW theorem, (20) is rewritten as DFFITS_i = √(h_ii/(1 − h_ii)) χ_(i), where χ_(i) = χ_i/√(1 − h_ii) is termed the jackknife Pearson residual, and DFFITS_i > 2√((p + 1)/n) indicates that the i-th observation is influential.
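The diagnostics of this section, together with their cut-offs, can be gathered into one routine. The sketch below uses the standard GLM one-step approximations quoted above (Cook's distance, Δχ², Δd², DFFITS); the function and variable names are our own, and the exact expressions in the paper's displayed equations may differ in minor constants.

```python
import numpy as np

def prm_diagnostics(X, y, beta_hat):
    """One-step influence diagnostics for a fitted PRM (a sketch):
    Cook's C_i, change in Pearson chi-square, change in deviance,
    DFFITS, and the cut-offs quoted in the text."""
    n, p = X.shape
    mu = np.exp(X @ beta_hat)
    Xw = X * np.sqrt(mu)[:, None]
    h = np.diag(Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T))   # leverages
    chi = (y - mu) / np.sqrt(mu)                         # Pearson residuals
    # deviance residuals; y*log(y/mu) is taken as 0 when y = 0
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1) / mu), 0.0)
    d = np.sign(y - mu) * np.sqrt(2 * (term - (y - mu)))
    cook = chi**2 * h / (p * (1 - h) ** 2)               # C_i
    dchi2 = chi**2 / (1 - h)                             # Delta chi^2_i
    ddev = d**2 + h * chi**2 / (1 - h)                   # Delta d^2_i
    dffits = np.sqrt(h / (1 - h)) * chi / np.sqrt(1 - h) # DFFITS_i
    flags = {
        "cook": cook > 4 / (n - 1),                      # Hardin-Hilbe cut point
        "dev": ddev > 3.84,
        "dffits": np.abs(dffits) > 2 * np.sqrt((p + 1) / n),
    }
    return cook, dchi2, ddev, dffits, flags
```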
The second hat matrix will be introduced in the next section.

Influence Measures in the Poisson Liu Regression Model (PLRM)

Hat Matrix, Leverages, and Residual in PLRM.
The hat matrix H_d for the PLRM is defined as H_d = W^(1/2) X (X′WX + I)^(−1)(X′WX + dI)(X′WX)^(−1) X′W^(1/2). The leverages are the Liu hat diagonals h_di = diag(H_d), which prove helpful in detecting influential cases with some modifications: for 0 < d < 1, h_di < h_i for i = 1, 2, . . . , n, and h_di decreases monotonically as d decreases.
Using the Liu estimator, the Pearson residuals for the PLRM are defined as χ_di = (y_i − μ_di)/√μ_di, where μ_di = exp(x_i′β_d). The standardized form of the Pearson residuals with multicollinear independent variables is χ_di/√(1 − h_di).
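The Liu estimator and its hat diagonals can be computed from the ML fit alone. The sketch below assumes the PLRE form β_d = (X′WX + I)^(−1)(X′WX + dI)β_ML with 0 < d < 1 and the H_d given above; the function name is our own.

```python
import numpy as np

def liu_poisson(X, beta_ml, d):
    """Poisson Liu estimator and Liu hat diagonals (illustrative sketch):
    beta_d = (X'WX + I)^{-1}(X'WX + dI) beta_ML,
    H_d = W^{1/2}X (X'WX + I)^{-1}(X'WX + dI)(X'WX)^{-1} X'W^{1/2}."""
    mu = np.exp(X @ beta_ml)
    Xw = X * np.sqrt(mu)[:, None]                 # W^{1/2} X
    G = Xw.T @ Xw                                 # X'WX
    I = np.eye(X.shape[1])
    beta_d = np.linalg.solve(G + I, (G + d * I) @ beta_ml)
    Hd = Xw @ np.linalg.solve(G + I, (G + d * I) @ np.linalg.solve(G, Xw.T))
    return beta_d, np.diag(Hd)
```

At d = 1 the Liu estimator reduces to the ML estimator and H_d reduces to H, which gives a quick correctness check; for d < 1 the Liu leverages shrink below the ML leverages.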

Influence Diagnostics for PLRM.
The approximate case-deletion formulas based on the SMW theorem [41] are derived for the identification of influential observations.

Theorem 1. After the deletion of the i-th observation, the PLRE becomes β_d(i) = (X′_(i)W_(i)X_(i) + I)^(−1)(X′_(i)W_(i)X_(i) + dI) β_ML(i), where X_(i) represents the X matrix without the i-th row. Using the SMW theorem, we approximate β_d(i).
Proof. Let K = W^(1/2)X, v = W^(1/2)z, and s = y − μ; then β_ML and β_d stated by (6) and (7) become β_ML = (K′K)^(−1)K′v and β_d = (K′K + I)^(−1)(K′K + dI)β_ML. Let β_ML(i) and β_d(i) represent the ML estimate and the PLRE of β after deleting the i-th observation, respectively. With the help of the SMW theorem, β_ML(i) can be expressed without refitting, where m_di = k_i(K′K + I)^(−1)k_i′ and k_i is the i-th row of K. Combining these expressions yields the stated form of β_d(i). Hence, the theorem is proved.
Following [42] for the PLRE, Cook's distance is redefined as C_di = (1/p)(β_d(i) − β_d)′(X′WX)(β_d(i) − β_d). The i-th observation is considered influential if the distance between β_d(i) and β_d is large. Another version can be expressed in terms of the Liu residuals and leverages. Using the Liu estimator, we define the change in Pearson chi-square as Δχ²_di = χ²_d0 − χ²_d(i), where the squared Liu Pearson residuals χ²_d0 are computed on the complete data set and χ²_d(i) without the i-th observation. Correspondingly, with the Liu estimator, we formulate the change in deviance statistic as Δd²_di = d²_d0 − d²_d(i), where d²_d0 and d²_d(i) represent the squared Liu deviance residuals with the complete data and without the i-th observation, respectively, and where s_i is the sign function of (y_i − exp(x_i′β_d)). Following (19), we give the DFFITS for the PLRM in terms of μ_d0 and μ_d(i), which represent the predicted regressand of the complete data set and the predicted regressand after deleting the i-th case.
Using the SMW theorem, we simplify (37) as DFFITS_di = √(h_di/(1 − h_di)) χ_d(i), where χ_d(i) = χ_di/√(1 − h_di) is the i-th Pearson jackknife residual with the Liu estimator.
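The case-deletion formulas of this section rest on the SMW identity: deleting the i-th row k_i of K turns K′K into the rank-one downdate K′K − k_i k_i′, whose inverse is available in closed form without refitting. A minimal numerical check of that identity (with our own variable names, on synthetic data):

```python
import numpy as np

# Verify (K'K - k k')^{-1} = G^{-1} + G^{-1} k k' G^{-1} / (1 - k' G^{-1} k),
# the Sherman-Morrison special case of the SMW theorem used for deletion.
rng = np.random.default_rng(4)
K = rng.normal(size=(30, 3))
G = K.T @ K
k = K[0]                                  # row to be deleted
m = k @ np.linalg.solve(G, k)             # leverage-type quantity k' G^{-1} k
Ginv = np.linalg.inv(G)
smw = Ginv + np.outer(Ginv @ k, Ginv @ k) / (1 - m)
direct = np.linalg.inv(G - np.outer(k, k))
assert np.allclose(smw, direct)
```

The same one rank-one update gives the deleted-case estimates in O(p²) per observation instead of a full refit.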

Simulation Study
In this section, we summarize the results of the PRM and PLRM influence diagnostics using a Monte Carlo simulation scheme. We follow the simulation scheme used by many researchers; see [43,44]. The response variable y is generated from the Poisson distribution with mean function μ_i = exp(β_0 + β_1 x_i1 + β_2 x_i2 + · · · + β_p x_ip), i = 1, 2, . . . , n.

We run simulations with p = 2, 4, 6 explanatory variables, various sample sizes, and mild to severe levels of collinearity.
We assume sample sizes n = 25, 50, 100, 150, 200. Moreover, we generate the regressors using the following formula: x_ij = (1 − ρ²)^(1/2) z_ij + ρ z_i(p+1), i = 1, 2, . . . , n, j = 1, 2, . . . , p, where the z_ij are independent standard normal pseudo-random numbers. We consider the collinearity levels ρ² = 0.75, 0.85, 0.95, 0.99, and we choose arbitrary values of the regression coefficients such that Σ_(j=1)^(p) β²_j = 1. We then introduce a few influential observations in the regressors using the expression X_ij = X_ij + α_0, i = 15 and j = 1, 2, . . . , p, where α_0 = X_j + 6. All the analyses are performed using the R software with 1000 replications.
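The regressor-generation step can be sketched as follows. This is the standard collinearity scheme cited via [43,44], written in Python rather than the R used in the paper; the function name and seed handling are our own.

```python
import numpy as np

def gen_collinear_X(n, p, rho2, seed=0):
    """Generate n x p regressors with pairwise correlation rho2 via
    x_ij = sqrt(1 - rho^2) z_ij + rho z_{i,p+1}, z_ij ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, p + 1))
    rho = np.sqrt(rho2)
    return np.sqrt(1 - rho2) * z[:, :p] + rho * z[:, [p]]
```

By construction each column has unit variance and any two columns have covariance ρ², so the empirical correlation approaches the nominal level as n grows.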

Results and Discussion.
The results of the identification of unusual observations with the LE in the presence of mild to severe multicollinearity are provided in Tables 1-6 for p = 2, 4, and 6 with the defined optimum d_1 and d_2. From Tables 1-3 with p = 2, it is clear that the C_di method performs well compared to the C_i method for different sample sizes under multicollinearity. The influence detection of the Δχ²_di and Δd²_di methods is identical, and they perform significantly better than Δχ²_i and Δd²_i, respectively. However, their performance is not better than that of the C_i method for all combinations of n, p, and ρ. Comparable effects are observed for the DFFITS method: the detection percentage of influential observations by the DFFITS_i method is better than that of C_i, while the performance of DFFITS_di relative to C_di is equally good. Furthermore, as we increase the sample size, the detection percentage of the developed measures increases accordingly. Moreover, from Tables 4-6, we observe that the newly developed diagnostic measures perform more efficiently with d_2, but d_1 gives a better detection percentage than d_2. Furthermore, varying the number of regressors affects the performance of the C_di and DFFITS_di methods, respectively.

Application: English League Football Data
To illustrate the proposed diagnostic methods, we analyze the English League football data set, which is also available in Table 7. The data comprise n = 20 observations with one response variable, the number of matches won (y), and p = 5 explanatory variables: the number of yellow cards (X_1), the number of red cards (X_2), goals scored (X_3), goals conceded (X_4), and the number of points earned (X_5). Algamal and Alanaz [11] also used this data set. A χ² goodness-of-fit test shows that the response is well fitted by the Poisson distribution.
The data are multicollinear, as the condition index is CI = 31.274.
From Table 8, it is found that all methods commonly identify the 1st observation as influential. The change in chi-square statistic and the change in deviance statistic with the ML estimator do not detect any observation as influential. Furthermore, the 19th observation is detected as influential by DFFITS without the Liu estimator and by all of the proposed diagnostics. The effect of deleting the highlighted observations on the estimates of the PRM and PLRM is presented in Table 9. We find the maximum change in the PRM and PLRM estimates after the removal of the 1st observation, which was detected by all selected and proposed measures. The second observation, identified by just DFFITS_i and all proposed measures, is the 19th. After the deletion of the detected observations, we find the maximum change in β_2 and β_3. Examining these results, we note that in the presence of multicollinearity, the PLRM diagnostic measures efficiently detect the influential observations. Furthermore, we provide index plots to summarize the efficacy of the proposed measures in Figure 4.

Conclusion
This study introduces diagnostic measures for Poisson Liu regression using a biased estimator to handle influential observations and multicollinearity simultaneously in the PRM. As discussed earlier, multicollinearity affects the performance of the traditional ML estimator in the PRM. Therefore, we adopted the Liu estimator, with its efficient statistical properties, to address multicollinearity and influential observations in the PRM. The simulation results support the performance of the new diagnostic measures, as the detection percentage of the ML estimator and the existing measures turns out to be the worst with increasing sample size, number of regressors, and level of multicollinearity. The results prove that the suggested measures are more beneficial for the identification of influential observations in the presence of multicollinearity. Hence, these proposed measures efficiently guide the user in handling the issue of multicollinearity with robust estimator support.

Data Availability
All data are included in the paper with their links.

Conflicts of Interest
The authors declare there are no conflicts of interest.