SPATIAL MODELING OF CONFIRMED COVID-19 PANDEMIC IN EAST JAVA PROVINCE BY GEOGRAPHICALLY WEIGHTED NEGATIVE BINOMIAL REGRESSION

In East Java, 11,910 confirmed incidences for COVID-19 were registered as of 30 June 2020. We propose a Geographically Weighted Negative Binomial Regression (GWNBR) model to evaluate the effect of population density and daily average temperature on COVID-19 transmission. Our results reveal that the areas with high population density have much higher incidences than the areas with a low population density. This result indicates the COVID-19 spread quickly in locations with high population density. So, achieving a reduction in the contact rate between uninfected and infected individuals by quarantined susceptible individuals can effectively reduce disease transmission. However, the average temperature affect spatially only in several areas which shows that there is not enough evidence to explain the effect temperature on COVID-19 cases.


INTRODUCTION
Since December 2019, the world has been attacked by an outbreak of mysterious case of pneumonia in Wuhan, the capital of Hubei Province in China [1]. The outbreak rapidly spread from Wuhan into all provinces of China and then spread widely to other countries. WHO announced officially "Coronavirus Disease (COVID-19)" as the name of this new disease on 11 February 2020 which caused by virus name as "severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2)" [2]. On 30 June 2020, there were 10,185,374 cases confirmed globally which spread worldwide included Africa, Americas, Eastern Mediterranean, Europe, South-East Asia, and Western Pacific [3].
The COVID-19 pandemic has been transmitted to Indonesia since 2 March 2020 [4]. On 9 April, the pandemic had spread to all 34 provinces and by 24 June, almost half of them had more than 500 cases. East Java province announced the first incidence of COVID-19 on 17 March 2020.
There were six confirmed COVID-19 incidences founded in Surabaya [5]. Then, on 9 May 2020, East Java was recorded as the top province with the highest daily incidence at the national level [6]. The daily confirmed COVID-19 incidences increased significantly and reached 12,118 cumulative incidence at 30 June 2020 [7].
The study of COVID-19 pandemic transmission in Indonesia especially in East Java Province are limited. However, understanding the COVID-19 outbreak is required to prevent transmission.
Previous studies of spatial COVID-19 outbreak in China examined whether a spatial association existed with using Moran's I statistics. The result showed a significant spatial dependency on the number of newly confirmed case. Furthermore, more people are likely to be infected with the virus in densely populated regions, which leads to the active spread of COVID-19 to other areas [8].
The COVID-19 outbreak can be influenced several factors. Data on 163 infected countries were analyzed latest on 3 April 2020 which reveals that countries with higher average temperature, lower international trade, and weaker democracy tend to experience comparatively smaller number of cases of infection per million people [9]. The air quality index significantly effects on the spread of the COVID-19 in China [10]. In Latin America, demographics and environmental factors 3 SPATIAL MODELING OF CONFIRMED COVID-19 PANDEMIC IN EAST JAVA BY GWNBR influence the current COVID-19 infection [11].
The number of confirmed COVID-19 incidences is categorized as count data. Poisson regression model is commonly used to analyze the count data. The basic assumption of Poisson regression is equidispersion (mean of the response variable equal to its variance). Frequently, the conditional variance of data exceeds the conditional mean, which is usually referred to overdispersion [12]. The alternative approach to deal with this problem is Negative Binomial regression which is derived from a Poisson-Gamma mixture distribution where the Gamma distribution is used to overcome the overdispersion in the Poisson regression [13]& [14].
According [15], the spatial data analysis is proper to explain the relationship between the confirmed COVID-19 incidences with several factors by considering the spatial dependency and heterogeneity. The spatial dependence occurs because of the transmission of COVID-19 depend on space. Nearby areas tend to have the same risk. The primary transmission route of COVID-19 is through person-to-person contact and through direct contact with respiratory droplets generated when an infected person coughs or sneezes [16]. Hence, an area with COVID-19 endemic will definitely affect its surrounding area. Meanwhile, spatial heterogeneity can occur due to differences in geographically, climatic factor, socio-cultural that are owned by each region.
The classic regression analysis with global coefficients less accurate and reliable to model spatial dependency and heterogeneity phenomenon [15]& [17]. The global coefficients may be not represent detail local variation in the data adequately. If the local coefficients allow to vary in space, it can be used to overcome the spatial non-stationarity [18]. The Geographically Weighted Regression (GWR) models allow the regression coefficients vary across space which is more familiar as local regression model. The GWR model estimates different regression model for each location that explains the spatial diversity [17]. The GWR models dealing with overdispersion are known as the Geographically Weighted Negative Binomial Regression (GWNBR). This study is developed to analysis the relationship between risk factors and confirmed COVID-19 incidences at district levels in East Java Province by means GWNBR model.

MATERIAL AND METHOD
This study used cumulative confirmed COVID-19 incidences on 30 June 2020 as response variable ( ). This data were collected from officially website of East Java Province Government (http://infoCOVID19.jatimprov.go.id/) which contains daily cases COVID-19 by 38 district/city in East Java. The predictor variables are population density ( 1 ) and average temperature( 2 ).
The population density data were collected from https://jatim.bps.go.id/ and average temperature from website https://www.accuweather.com/ respectively.

Poisson Regression. Poisson regression is a model with the response variable following
Poisson distribution which is one of the Generalized Linear Model (GLM) because it is included in exponential family of distributions [13]& [19]. In the GLM, there is a linear function which is linking the expectation value of response variable with predictor variables as namely link function.
The link function is log, so it can be written as: The Poisson regression model stated by Myers [20] as: where as an error parameter for the -th observation and is the expectation value of .
Overdispersion in Poisson model occurs when the response variance is greater than the mean [13]. It is caused by positive correlation between responses or by an excess variation between response probabilities or counts. Overdispersion may cause standard errors of the estimates to be deflated or underestimated, i.e. a variable may appear to be significant predictor when it is in fact not significant. Furthermore, overdispersion can be recognized by dividing Pearson Chi-square statistic by the degrees of freedom [14]. The division result states the dispersion parameter (θ). If θ > 1 means overdispersion, θ < 1 means underdispersion, while θ = 1 indicates the condition of equidispersion.
Negative Binomial regression model is also belongs to the GLM so the predictor variable can be written trough a linear combination with the natural logarithm link function as eq.(1), so it also can be written as eq. (2) [14].

Spatial Effects.
Spatial dependency is the extent to which the value of an attribute in one location depends on the value of the attribute in nearby locations [17]. One of the spatial autocorrelation measure is Moran's Index that describe the similarity between observations that is close to each other.
Spatial heterogeneity as a condition in which a global regression model cannot explain the relationship between variables due to the characteristics of observational areas that spatially vary [17]. Detecting of spatial heterogeneity in the data can be done by the Breusch-Pagan test [15].

Spatial Weighting.
Bandwidth is a discrete number identifying the number of neighbours to include in the local regression [21]. The optimum bandwidth selection is one of the important things because it will affect the accuracy of the estimation. If the bandwidth becomes larger, the closer will be the model solution to that of OLS. Conversely, as the bandwidth becomes smaller, the parameter estimates will increasingly depend on observations in close proximity and hence will have increased variance [22]. The optimal bandwidth is the value that minimize the cross validation (CV) score [21] with the formula: wherê≠ ( ) is the fitted value of with the observations for point omitted from the calibration process [17].
Each observation has a weight of unity in the global regression analysis that is = 1 ∀ , [17]. Whereas, in Geographically Weighted Regression (GWR) model an observation is weighted 6 RINDA FITRIANI, I. GEDE NYOMAN MINDRA JAYA in accordance with its proximity to point so that the weighting of an observation is no longer constant in the calibration but varies with [22]. The weighting process in the parameter estimation of GWR model follows Tobler's first law of Geography that the closer data to the location will have stronger influence in predicting the parameter on location compared to the farther data. Furthermore, to estimate parameter of GWNBR model in a point ( , ) , the weighting function used in this study is Adaptive Bi-square Kernel function that can be written as follows: where = √ ( − ) 2 + ( − ) 2 is euclidean distance between location ( , ) and location ( , ), while ( ) is optimum bandwidth value in each location.

Geographically Weighted Negative Binomial Regression (GWNBR). The GWNBR model
is a method that modelling of non-stationary count spatial data with overdispersion [23]. This model will provide different local parameter estimates for each location. GWNBR is extension of the global or non-spatial model, which allows the spatial variation of the parameters and .
This local model is described as the following [23]: where ( , ) are the locations (coordinates) of the data points , for = 1, … , and is an offset variable.
The parameter estimation of GWNBR model is performed interactively with the combination of the Iteratively Reweighted Least Squares (IRLS) and Newton-Raphson (NR) methods [23]. The local log-likelihood of GWNBR at location can be approximated as observation at regression point , which depends on the distance between them. A kernel function can be used to determine these weights that has described before.
The simultaneous test can be done by used the Maximum Likelihood Ratio Test (MLRT). If test statistics is greater than ( , ) 2 means that there is at least one of GWNBR parameter that has a significant effect on the response variable. The partial test was done by a -test, if test statistics is greater than 2 , which means that the parameter has significant effect on the response variable.

Descriptive Analysis.
On 30 June 2020, there were 11,910 confirmed COVID-19 incidences across 38 districts in East Java, which contributes 21.4 percent on the national incidences. The highest incidence was detected in Surabaya district with 5,815 incidences and followed by Sidoarjo with 1,579 incidences. The spatial distribution of COVID-19 incidences in East Java is presented in Fig.1.    Based on Figure 2 and 3, it can be seen that the score of explanatory variables from districts/cities that are close to each other has similar values which are shown by the color similarity on the graph. This indicates that the variables in a district/city are not mutually independent, there is spatial dependence.

Identification of Multicollinearity.
Before modelling, we must check multicollinearity as the regression modelling assumption which assume that between explanatory variables must be mutually independent. The result shows that VIF value of variables population density ( 1 ) and average temperature( 2 ) is 1.0107 which is less than 10, therefore it can be concluded that there is no multicollinearity among predictor variables. Table 2    Whereas, the p-value of Breusch-Pagan is 0.000078 which it can be concluded that there is spatial heterogeneity at the significance level of 0.05. The existence of spatial dependencies and heterogeneity shows that modelling of confirmed COVID-19 cases in East Java Province cannot use global Negative Binomial regression, because it will cause parameter estimation becomes inefficient. Thus local modelling is needed to take into account to be able to explain the spatial effects on the data.

GWNBR Modeling.
Spatial modelling assume that each observation location relates or influences other observation locations. This indicates that the closer range proximity of one location to another location will affect the strength of the relationship so that a weighting is needed.
Before determining the spatial weights matrix, first we need to determine how much bandwidth is appropriate, in order not to produce oversmooth or undersmooth models.
The optimum bandwidth is obtained with cross validation technique (CV). In this study the minimum CV score is 44,936,550 with optimum bandwidth score in every locations. The optimum bandwidth score indicates that the observation location which is in the radius of its score degrees from the observation location still has an influence on confirmed COVID-19 cases at the observation location. Furthermore, the Euclidean distance and bandwidth generated were used to calculate spatial weights matrix by using the Adaptive Bi-Square kernel function.
After the spatial weights matrix was obtained, the next step was modeling using GWNBR. To obtain a convergent parameter, iteration was done 20 times for each region. The GWNBR model generated 152 coefficients estimates for 38 districts/cities in East Java Province. The coefficient determined the magnitude of the changes in the response variables for each change of the predictor variables.
Based on simultaneous testing at the significance level of 0.05, indicates that 43.52055 of Deviance value is greater than (0.05;2) 2 as 5.991465 which means that there is at least one of GWNBR parameter was significant in the model. Table 4 shows the minimum, median, and maximum score of the GWNBR model parameters estimation from each observation location. It can be seen that the variable population density have a positive with the response variable in all observation locations, it is seen that the minimum and maximum score of parameter estimation from this variable is positive. On the other side variable temperature has negative relation in one area and positive in other locations on response variable. Theta represents dispersion parameters.    It could be seen from Table 6 that the GWNBR model had the smallest AIC value. Therefore, it could be conclude that GWNBR method was better for modeling the confirmed COVID-19 cases in East Java Province.

CONCLUSION
In this study, the Poisson regression for modeling confirmed COVID-19 cases in East Java has overdispersion problem so the Negative Binomial regression is appropriate approach. The global Negative Binomial regression model shows that there is one significant predictor variables namely population density variable. Analysis of the GWNBR model has the smallest AIC value and provides different result with global regression, so that the GWNBR model is suitable to be used. In general, the variable that influence the model is population density, while the average temperature affect spatially on confirmed COVID-19 cases only in several regions in East Java province.