Determination of the best multivariate adaptive geographically weighted generalized Poisson regression splines model employing generalized cross-validation in dengue fever cases

Review Highlights • We present MAGWGPRS, a new model that is a combination and development of count response MARS and spatial regression.• Selecting the optimal bandwidth using an adaptive Gaussian kernel function based on the CV method.• Grouping districts/cities in Java Indonesia based on inter-regional disaggregating variables in modeling DHF cases.

This article constructs a new model based on multivariate adaptive generalized Poisson regression splines (MAGPRS) and geographically weighted generalized Poisson regression (GWGPR), which is known as multivariate adaptive geographically weighted generalized Poisson regression splines (MAGWGPRS). The article elaborates the steps of weighted maximum likelihood estimation (weighted-MLE) to obtain the estimated values of its parameters. MAGWGPRS and MAGPRS were applied to the number of dengue hemorrhagic fever (DHF) cases in 119 districts or cities in Java, Indonesia, in 2020, to compare their performance. The fitted value plot versus actual data and a comparison of the mean square error (MSE) value demonstrate the goodness of the two models. The best MAGWGPRS model for each location was obtained, and only one the best MAGPRS model for all locations was acquired. Based on the plot results of the fitted value with the actual data and MSE value, MAGWGPRS is determined to be superior to MAGPRS.
Previous research has not considered the spatial variation precipitated by geographical location in the study area. In fact, there are many problems associated with spatial variation, such as a relationship between the response and predictor that varies depending on geographic location, and parameters that are not constant across the study area [17] . The spatial variation occurs because each location has different characteristics such as geographical, cultural, and socioeconomic differences. As a result, the same predictor variable can have different effects at different locations. One of the statistical techniques used to overcome spatial variation is geographically weighted Poisson regression (GWPR) by Collins (2010) and Nakaya et al. (2005) [ 17 , 18 ]. To overcome cases of overdispersion or underdispersion, Adryanta et al.  [19][20][21] .
The similarity between MAGPRS and GWGPR is that both are localized regressions. The basis functions of MAGPRS are local series which are employed to model complex (non-linear) relationships. As a result, the global model of MAGPRS is a weighted sum of the local models [ 1 , 22 ]. The locality of GWGPR is due to differences in characteristics between observation areas (spatial non-stationarity). As a result, each location has a spatial weight, and the parameter estimates of the resulting regression model will differ depending on its geographical location (latitude and longitude coordinates). The MAGPRS model is extended in this article by considering spatial variation between observation areas, a modification of MAGPRS [ 15 , 16 ] and GWGPR [19][20][21] . This development model is significantly different from the previous one. The structure of the model has changed significantly. The modified model causes the basis function and its parameters to be localized at the same time. As a result, the model form, algorithm, and analysis are more complex than in the previous model.
The performance of the proposed model was implemented to the number of DHF cases in Java, Indonesia in 2020. The research units are districts or cities in Java, Indonesia. DHF is an infectious disease caused by the dengue virus and transmitted by the bite of Aedes aegypti or albopicus mosquitoes. The host (human), the agent (virus), and the environment are all essential factors in the growth and spread of DHF, and they differ by region. As a result, DHF is spatial data related to geographical location [23][24][25][26][27][28] . The spatial weight matrix is generated using an adaptive Gaussian kernel function, and the optimal bandwidth is determined using the cross-validation (CV) method. Finally, the best MAGWGPRS model is examined using the generalized cross-validation (GCV) method.

MARS model
In summary, the MARS model can be elaborated as follows. Given for each response variable , = 1 , 2 , ..., and predictor variables = ( 1 , 2 , ..., ) , where is the number of observations, and suppose that the estimated function satisfies the regression model where is error random with mean 0 and variance 2 .
Assume the function in Eq. (1) is a linear combination of the basis functions ( ) , = 1 , 2 , ..., : with 0 is the coefficient of the parent basis function, is the coefficient of the ℎ non-constant basis function. Each basis function is defined as a truncated spline function: where is a function in Eq. (2) is identified the MARS model [ 1 , 9 , 29 ]. Here, is the degree of interaction, is the sign of the basis function at the ℎ interaction and the ℎ basis function , and ( , ) is the ℎ predictor variable, in which v is the index of the predictor variable associated with the ℎ interaction and the ℎ basis function at the ℎ observation, is the knot value at the ℎ interaction and the ℎ basis function of the predictor variable ( , ) .
The MARS algorithm comprises of forward and backward stepwise [ 1 , 9 , 29 ]. Forward stepwise construct the MARS model by adding truncated spline basis functions (knots and interactions) until the model has the maximum number of basis functions. Following the completion of the forward stepwise process, a backward stepwise is performed to determine the number of feasible basis functions in the model. The basis function that contributes the least to the estimated response value based on the minimum GCV value is eliminated backward stepwise.
The basis function * ( ) in Eq. (4) and Eq. (5) is the parent basis function, which is a member of the set of basis functions existing before the addition of the new basis function pair, ( , ) is a predictor variable that does not exist in the parent basis function. The addition of new basis function pairs in the forward stepwise is carried out until the maximum number of basis functions is reached and is selected based on the minimum MSE.

Backward
Stepwise. It is conducted after the maximum number of basis functions is obtained by forward stepwise.
a. Selecting the forward basis functions one by one, with the exception of the parent basis function, and removing it if the GCV value decreases when the basis function is deleted. b. The deletion process is repeated until the GCV value does not decrease despite the fact that the remaining basis function is discarded. The optimal basis function is the basis function that remains after this backward stepwise procedure.
Friedman (1991) employed the GCV method in a backward stepwise procedure to select the optimal basis function in the MARS algorithm. The GCV method was developed from [ 30 , 31 ] where ( ̃ ) is a complex function, ( ) is the number of parameters to be estimated, is the degree of interaction, with the optimum value within the interval 2 ≤ ≤ 4 , and ̂ ( ) is the estimated value of the response variable on the basis function M and the ℎ observation.

MAGPRS model
Given a Generalized Poisson (GP) probability function from [32][33][34]: where is the mean of an event and ( ≠ 0 ) is the dispersion parameter. If the response variable in Eq. (1) is GP distributed with the probability function given in Eq. (8) , then the MAGPRS model: GWGPR model is the coordinate point with is latitude and is the longitude at ℎ location, then the GWGPR model can be written as follows [ 19 , 20 ].
is the intercept parameter at ℎ location and ( , ) are the parameters model for each ℎ predictor variable at ℎ location.

MAGWGPRS model
Model Eq. (10) has global parameters but local basis functions, resulting in a single regression model for all observations (locations). As a result, we would like to create a model Eq. (10) that is location-dependent, particularly regarding MAGWGPRS, so that the new model has different parameters for each location. The geographical location at ℎ location, denoted ( , ) , is defined as in GWGPR section.
with 0 ( , ) is the parameter of the parent basis function at ℎ location and ( , ) is the parameter of the ℎ non-constant basis function at ℎ location.

Parameter estimation of the magwgprs model
The weighted MLE method estimates the MAGWGPRS model parameters at each location by assigning geographic weights. This method maximizes the log-likelihood function by solving a gradient function equal to zero [35] . The following theorem is provided to estimate the parameters of the MAGWGPRS model. Theorem 1. Given the MAGWGPRS model in Eq. (12) . When we use the weighted MLE method to calculate the model parameters, we obtain the parameter estimator equation, which is not closed form, i.e. , and − ∑ * =1 * * e * ( , ) Proof of Theorem 1 . Based on the probability function in Eq. (8) , the likelihood function for MAGWGPRS model The natural logarithm function of Eq. (15) is To estimate the parameters of the MAGWGPRS model at ℎ location requires spatial weighting. It utilizes distance information from one location to another. Let = 1 , 2 , … , represent the location in which the local parameter estimates are generated (i.e., regression points) and * = 1 , 2 , … , represent the location in which the data has been observed (i.e., observation points). The spatial weight, * , represents the weight assigned to the ℎ observation in the calibration of the MAGWGPRS model for the ℎ location.
This research employs the Gaussian kernel adaptive weighting function in [36] , which is defined as 2 is the Euclidean distance between the * ℎ observation and the ℎ regression point and ℎ is the bandwidth at the ℎ regression point.
According to Nakaya et al. (2005), the bandwidth governs the rate at which the datum weight decreases as the distance between the observation location and the regression point increases. If the bandwidth value is very small, the variance increases, thus, the number of observations within radius ℎ will be small. If the bandwidth value is tremendously large (close to infinity), the variance decreases and the resulting weight * between locations approaches 1, hence the estimated parameters will be homogeneous, and the spatial model will be similar to the global regression model [18] . The CV method is employed in [36] to obtain the optimum bandwidth: where ̂ ≠ ( ℎ ) is the estimated value of the observation for bandwidth ℎ except at the ℎ location. Next, input * into Eq. (16) based on [37] : .
Eq. (17) is derived partially for each parameter and equalized to zero using the weighted MLE method. The description is as follow The first partial derivative of Eq. (17) with respect to ( , ) gives Similarly, the first partial derivative of Eq. (17) with respect to ( , ) gives Eq. (18) and Eq. (19) are not closed form, then Theorem 1 is proven. □ Since Eq. (18) and Eq. (19) are not closed-form, they are solved by numerical methods, namely the Berndt Hall Hausman (BHHH) method. The advantages of the BHHH method compared to other numerical methods for parameter estimation include its robustness to misspecification of the underlying distribution of the data, its simplicity, its convergence properties, and its computational efficiency. Suppose the parameters in Eq. (18) and Eq. (19) are written in the form of = ( ( , ) , ( , ) ) , then the BHHH iteration process stops when ‖̂( * +1) − ̂( * ) ‖ ≤ * . The BHHH iteration equation is ) .

Steps in the research
Stages of research analysis:

Model application
The MAGWGPRS model was implemented to data on DHF cases in 119 districts or cities in Java, Indonesia. For each district or city in 2020, data was obtained from Badan Pusat Statistik (BPS) [38][39][40][41][42][43] . The research variables consisted of one response variable and six predictor variables. The response variable is the number of DHF cases in 119 districts/cities. The predictor variables: 1 is population density (ha/person), 2 is the percentage of households that possess access to proper sanitation, 3 is the percentage of households which own access to proper drinking water sources, 4 is the percentage of poor population, 5 is the ratio of medical personnel, and 6 is the ratio of health centers.
The map below depicts the distribution of DHF cases in Java, Indonesia. According to Fig. 1 , the highest number of DHF cases were discovered in West Java Province (18 out of 27 districts/cities) and DKI Jakarta (4 out of 6 districts/cities).
To begin the analysis with MAGWGPRS, there are two steps performed. First, conduct an equidispersion test. According to [44] , if the quotient of Pearson Chi-Square or deviance with free degrees is equal to one, the data is said to be equidispersion or ( = 1 ) , overdispersion if > 1 , and underdispersion for other than that. In this research data, the deviance value is 34,807.71 with 112 free degrees, so the dispersion value is 310.783. This indicates that the data has overdispersion. Then, it is performed a spatial heterogeneity test. This test is employed to assess whether there are differences in characteristics between locations. The Breusch-Pagan (BP) test statistic is one that can be utilized. The BP value and p-value in this study are 19.996 and 0.002773, respectively. Because the p-value is less than 0.05, it can be concluded that spatial heterogeneity exists. As a result, the MAGWGPRS model has been validated for predicting the spread of DHF in 119 districts/cities in Java, Indonesia.

Table 1
Grouping of districts/cities based on inter-regional disaggregating variables in the MAGWGPRS model. Next, estimate the parameters of the MAGWGPRS model for each location. The first step is to determine the optimum bandwidth, BF, MI, and MO. Based on the spatial weights corresponding to the location and the combination of BF, MI, and MO, 36 MAGWGPRS models are generated for each location. Then, for each location, select the best model from the 36 available models based on the smallest GCV or the largest 2 value. The obtained results are 119 best MAGWGPRS models. Finally, the predictor variables are categorized as inter-regional disaggregates, that is, predictor variables that affect the model, as illustrated by Table 1 . Table 1 demonstrates six groups of districts /cities based on the disaggregating variables between regions. District/city locations are symbolized by nonnegative integer codes, which can be perceived in Appendix A. For example, the groups of districts/cities where DHF cases are influenced by 1 , 2 , 3 , 4 , 5 , 6 are districts/cities 1-26, 28-36, 40, 44- 47, 49, 52, 59-62, 67, 68, 74-77, 86, 92-95,  98, 100, 103, 109, 112-119. Districts in the same group therefore have the same MAGWGPRS model structure in the inter-regional disaggregating variables. Fig. 2 depicts the visual grouping of districts/cities based on the separating variables between regions. Fig. 2 depicts six groups of districts with adjacent areas that have similar characteristics. The first district Nganjuk (code 91), where population density and the percentage of households with access to safe drinking water have no effect on the number of DHF cases. In two districts (codes 50 and 82), the percentage of households with access to a safe drinking water source has no effect on the number of DHF cases.
As an illustration, to obtain the MAGWGPRS model, we provide examples of Central Jakarta and Surabaya cities. The possible MAGWGPRS models for Central Jakarta and Surabaya cities in accordance with a combination of BF, MI, and MO are presented in Table 2 . The best model parameter estimates are presented in Table 3 and Table 4 , respectively.
Based on Table 2 , the best model of Central Jakarta city is discovered in the 36th model with BF, MI, MO, GCV, and R 2 are 24, 3, 3, 0.823, and 0.768 successively. Table 3 implies the parameter estimation of the best model for Central Jakarta with the inter-region

Fig. 2.
Grouping of districts/cities based on inter-regional disaggregating variables in the MAGWGPRS model.

Table 4
Parameter estimation for the best model of Surabaya City.  14 Furthermore, the MAGPRS model was compared to the performance of the best MAGWGPRS model for Central Jakarta city, Surabaya city, and 117 other districts/cities. Comparisons were made between the actual number of DHF cases and the fitted values of the best MAGWGPRS and MAGPRS models. The MSE value of each model may also be employed to compare the goodness of the two models. Fig. 3 illustrates the data plot of the actual number of DHF cases, the fitted value of the MAGWGPRS model, and the fitted value of the MAGPRS model. In general, the MAGWGPRS fitted values are closer to the actual values than the MAGPRS fitted values. According to the graph, the MAGWGPRS curve pattern is more similar to and closer to the actual curve pattern than the MAGPRS curve pattern. The MSE value of the MAGWGPRS model is less than the MSE value of the MAGPRS model, as illustrated by Table 5 .

Conclusion
The MAGWGPRS model, a modification of the MAGPRS model and GWGPR spatial regression, was proposed in this study. We demonstrated a step-by-step procedure for obtaining the estimated parameters of the MAGWGPRS model's estimated parameters. Aside from the benefit of displaying the regression coefficients for each location, MAGWGPRS is a complex model that necessitates a difficult coding program. As a result, determining the best model form becomes more complicated than the MAGPRS model.
Furthermore, the MAGWGPRS model was applied to dengue case data in 119 districts or cities on the Indonesian island of Java. The best model was obtained for each location based on the optimal bandwidth and spatial weighting with adaptive Gaussian kernel and the combination of BF, MI, and MO, specifically providing examples for the Central Jakarta and Surabaya cities. There are 119 best models, but if we employ the MAGPRS model to analyze them, we only obtain one of the best models for all locations. Based on the graphical comparison of actual values with MAGPRS and MAGWGPRS fit values and MSE values, the MAGWGPRS model is better than the MAGPRS model in this case.
The limitation of this article is that there is no hypothesis testing of the proposed model. Therefore, future research can test hypotheses. Furthermore, for various case studies, this model can be extended with other distributions or kernel functions.

Ethics statements
The data used in this research are secondary data derived from the official website of BPS Provinces in Java, Indonesia.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
The authors do not have permission to share data.

Table A1
Codes and names of districts/cities on the island of Java, Indonesia.

Code and Name
Code and Name Code and Name Code and Name