BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELS FOR EXCESS ZEROS DATA: REVIEW AND APPLICATION TO FEMALE LYMPHATIC FILARIASIS CASES

: Many cases of epidemiological data reported has an excessive value of zero. The number of excess zeros can be more than half, even up to 80% of all existing data. This can occur in cases of rare diseases that do not cause significant symptoms at the start of infection. Therefore, the number and the rise of reported cases becomes difficult to detect. This paper proposes several Bayesian methods which are mixture of several distributions, namely binomial, Poisson and zero-inflated Poisson, and discuss the extension of these models to spatial data with excess zeros. Spatial data is implemented with Bayesian hierarchical framework, using Besag-York-Mollié re-parameterization (BYM2) model for spatial random effects, and penalized complexity prior for latent level process in the mixture models of different types of zeros. Bayesian inference uses INLA (Integrated Nested Laplace Approximation) for more accurate and faster results for spatially based hierarchical data. We further review recent implementation of proposed Bayesian mixture models using female lymphatic filariasis cases in 2019 at 27 district city level in West Java, Indonesia, and its elevation as explanatory variable. Mixture models were compared using DIC, and the results obtained indicate that


INTRODUCTION
Many observed events contain a large number of zero values (excess zeros).This incident does not only occur in the field of epidemiology, but also occur in other fields such as environmental studies.Observing the number of rain events in several areas in a particular season can produce excess zeros, which allows to occur throughout the season or even the year (e.g.see [1], [2]).More cases of excess zeros are found in discrete/count data in epidemiology, including cases of rare disease.Rare diseases often do not cause specific symptoms so the number of reported cases is often undetectable and ends up with the number of cases being zero or unreported.A minimal number of cases, even zero, can reduce serious attention to rare but contagious, this is a serious event that must be avoided.Having large proportion of zeros is inevitable and cannot be omitted from the analysis.Because zeros in the data structure have an implied meaning which also contains important information.Therefore, it is very necessary to have a model that can properly handle the case of excess zeros.This is because standard single distribution models such as the binomial, Poisson, negative binomial and even zero-inflated Poisson (ZIP) cannot fit proficiently for data with excess zeros problem [3].Spatial or spatio-temporal count data are often based on single Poisson distribution.However, the problem that is often encountered in Poisson distribution modeling is overdispersion which can result in invalid model conclusions.One way to take care overdispersion problem is with a ZIP model.ZIP can be used to overcome overdispersion which comes from excess zeros data.ZIP is a model that combines zero values with a certain probability and non-zero events with a Poisson probability distribution [4].Hence, Feng [5] compared and highlighted differences of zero-inflated BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELS and hurdle models for modeling zero-inflated count data and did the simulation process.From the simulation result, hurdle and zero-inflated models perform almost equivalently in the overall model fit when there are no or few zero deflations.Then [6] used zero-inflated models such as Poisson, negative binomial, ZIP, zero-inflated negative binomial and hurdle regression to analyse number of antennal care service visits during pregnancy months in Ethiopia.The model's results hurdle model was better fitted excess zeros data than any other ZIP models.
Spatial and spatio-temporal modeling for zero-inflated data makes modeling increasingly complex [7] [8].This is because the number of observations in each regional unit is increasing and also involves spatial (and/or temporal) influences between regions.Thus, in many recent publications on spatial and spatio-temporal modeling, models for excess zeros are carried out using a single distribution method, such as [9], [10], [11].Single distribution methods such as zero inflated models are widely used and are known considered to fit well.However, on the other hand, if the response comes from two different distributions (occurrence and number of occurrence), and each can be modelled with the same factors (namely spatial influences), then the more appropriate model to use is a mixture model [4].Mixture models are one way to make the model fit better, control various sources of uncertainty from each process, and certainly to obtain more accurate results.
Mixture models, which are a combination of several distributions, can fit excellently on data with excess zeros rather than single distribution methods [12].Recently, there are only a few publications that use mixture models in Bayesian spatial modeling for excess zeros data.Several mixture model publications are Lyme disease mapping in the Eastern United States [3] and male breast cancer in Ethiopia [4].Lyme disease spatial and spatio-temporal data were implemented using INLA inferences with covariance Matérn model to describe spatial structure in each spatial coordinate.Mixture of two distributions produce logistic regression for probability of occurrence and log-linear regression for the number of occurrences that fit excess zeros proficiently.Bayesian spatial joint model also implemented to investigate rare event disease mapping such as male breast cancer in Iran [4].Male breast cancer data are mapped using INLA inferences of mixture binomial and zero-inflated Poisson models with BYM2 in order to define the spatial random effects in the level process.In Bayesian spatial mixture model, response from each distribution is arranged hierarchically.However, inference from the posterior distribution using the classical MCMC method can cause problems [13], especially convergence and time issues.Therefore, posterior inference from the complex spatial model is performed using INLA approach [14] [15].In modeling with excess zero data, the binomial-Poisson mixture model is rarely used compared to ZIP model.However, in fact, the combination of these distributions is the basis of distribution for occurrence and number of occurrences, so its use can really be considered to handle cases of excess zeros data.This paper proposes mixture models and assumes responses of zero-inflated data come from two distributions: Binomial-Poisson and Binomial-ZIP with different types of zero.Each regression model for response corresponds to its link function (logit for binomial and log-linear for Poisson) which includes BYM2 as spatial random effects [16].Parameters in mixture distribution are modelled using penalized complexity priors [17] and non-informative prior distribution [13].As an application, excess zeros data are derived from female lymphatic filariasis cases in West Java and compare mixture models according to the deviance information criteria (DIC).The remainder of this paper is structured as follows: Section 2 provides a review of mixture models for data with excess zeros in Bayesian spatial framework.Section 3 discusses mixture model implementation in spatial count data at district city level of female lymphatic filariasis cases.
The implementation divided in two cases according with and without elevation as explanatory variable.Finally, section 4 provides some conclusions, remarks and possible model developments for future publications.

HIERARCHICAL SPATIAL BAYESIAN MIXTURE MODELS
This section proposes and reviews some mixture models that can be implemented with excess zero response.We propose Binomial-Poisson, Binomial-ZIP mixtures with different types of zero.
In zero-inflated data modelling, there are two types of zeros that can occur, i.e. the structural/true BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELS zeros and sampling zeros.Structural zeros are conditions where true zeros/absent from an event actually occurs, while sampling zeros come from number of occurrences in area is reported zero based on change or mistake [4].Although these three models' approaches are similar, there is a fundamental difference between three models which lies in the probability form of Poisson distribution.

Mixture Models
Under Binomial-Poisson model, it is assumed to be two-stage process that generate zero and non-zero data.Binomial-Poisson mixture model for the set of  independent and identically distributed observations of region ,   ,  = 1,2,3 … ,  can be described as the mixture of a point mass at zero with Binomial distribution with  = 1 and probability of successive zero is , and a Poisson distribution for number of occurrences without concerning any type of zero: If there is only structural zero may occur in data, then mixture model used is Binomial-ZIP type0 which assume to follow probability as: where   denotes response of region  with  =     as the mean of truncated Poisson distribution.If there are assumed that two types of zero may occur through processes, either by structural or sampling zero, then mixture model used is Binomial-ZIP type1 which assumed to follow probability as: This probability function is a mixture of proportion of structural zeros (  ) and sampling zeros The interest now is in modeling the latent fields   and   using canonical link function.This paper uses a Bayesian framework in spatial modeling for excess zero data, so there are three hierarchies that show the levels of modeling.The three levels modelling process can be written as: with ▪ Parameters Level At data level we divided in two data/responses i.e. probability of successive event which defined as Binomial probability distribution and number of occurrences as Poisson distribution.
is expected case in specific region  to the whole population, and   is relative risk of region  to its standard population [16].In process level probability of successive event modelled using  [16], [17].More explanation about BYM2 and penalized complexity (PC) prior we refer to [18] and [17].
and the number of occurrences   as Therefore, matrix response  in mixture model can be written as However, literature that specifically studies the influence of demographics and geographic conditions in the context of statistical spatial Bayesian mixture modeling for excess zeros data is very rare.Therefore, in this paper the modeling will be expanded by including elevation as an explanatory variable, so equations ( 7) and ( 8) at the process level can be written as logit(  ) =   +  ,elev  elev +   (14) log(  ) =   +   +  ,elev  elev + log(  ) which  ,elev and  ,elev state the influence of elevation in each region respectively.

INLA (Integrated nested Laplace approximation) makes it possible to perform approximate
Bayesian inference on Gaussian latent models which are part of generalized linear mixed spatial models.Specifically, the model at the data level has the following form: | ~ ((), () − ),

𝚯 ~ π(𝚯)
where  are the observed data,  represents a Gaussian field, and  are parameters and hyperparameters.() and () each represent the mean and the precision (inverse of covariance) matrix.In many situations, observation in region ,   are assumed to belong an exponential family with mean   =  −1 (  ).  is the canonical link function of response in exponential family.Linear predictor   can be written in additive form to accounts effects as: Here  is the intercept, {  }'s quantify the linear (fixed) effects of covariates {  } on the response, and { () (. )}'s are a set of random effects defined in terms of some covariates {  }.
Analytical approximation and numerical integration are combined to obtained approximated mean posterior distribution of the parameters.Let  = (  ,   ,  ,elev ,  ,elev , )~(, 1/√) denote the vector of latent Gaussian and  * = (  , ) denote the vector parameter of the random components uses penalized complexity prior as stated in (10).For a more in-depth explanation on INLA inference we recommend referring to [16].We implemented the mixture model with R-

CASE STUDY: FEMALE LYMPHATIC FILARIASIS CASES IN WEST JAVA
Lymphatic filariasis still exists in Indonesia, especially in specific provinces such as Papua, East Nusa Tenggara, and West Java.According to data from the Ministry of Health of the Republic of Indonesia, almost 13,000 cases of lymphatic filariasis have been recorded [19].Lymphatic filariasis is a type of dangerous infectious disease and can cause physical disabilities for the sufferers.Lymphatic filariasis is caused by Filaria Worms which are transmitted by various types of mosquitoes.In Indonesia, it is currently known that there are 23 species of mosquitoes from the genera Anopheles, Culex, Mansonia, Aedes and Armigeres which can act as vectors for transmitting Filariasis.This disease is chronic and if sufferer do not receive prompt and optimal treatment it can cause permanent disabilities in form of enlargement of the legs, arms and genitals in both women and men [20].
Here we consider modeling female lymphatic filariasis disease counts in West Java, Indonesia.
Data was taken from West Java provincial government website [21], in 2019 at district city observation unit level in 27 regions.Many cases of zero sufferers have been reported (with 67% zeros proportion), but there are several regions that show a very high number of cases compared to other regions.Some regions do not report cases, so in this case they are assumed to have zero (sampling) cases.We fitted three mixture models and reported mean posterior probability for fixed and random effects, standard deviation and DIC in Table  In models with elevation, three mixture models do not have significant intercepts for either logit or log-linear regressions.Likewise with elevation, three models state that there is not strong enough evidence to state that elevation has significant influence.From a geographical background, West Java has very diverse regional altitudes.Regions with high number of extreme cases stand at very different altitudes, such as Tasikmalaya (15 cases) at 411.66 meters above sea level (masl), Depok city (11 cases) at 87.8 masl, while Cimahi city (zero cases) at 794.36 masl.This supports evidence that in this case study no significant relationship was found between regional altitudes and number of female lymphatic filariasis cases.The standard deviation,  ̂, for fixed effects has the smallest value in Binomial-ZIP type 0, while other two mixture models have almost the same standard deviation.
The posterior mean for random effects shows that Binomial-ZIP type 0 model has the highest mixing parameter value,  ̂.This shows that influence of spatial structure dependence in this case is around 17.6% and 18.7%, while spatial structure independence is around 82.4% and 81.3% for modeling with and without elevation respectively. ̂ near to 1 shows the spatial pattern of   model in the proposed mixture models provide very good results.This is because BYM2 prior parameters are reparametrized using a scaled precision matrix.Therefore, the interpretation of parameters is very clear, precision parameter indicates marginal deviation from the initial risk (intercept), and mixing parameter expresses the variation between spatial dependence and spatial independence of the data.In addition, BYM2 uses a PC prior which provides the advantage of a simpler model so that the spatial random component scaling technique becomes more efficient.
Because there are two types of zeros in data, namely structural zero and sampling zero, we propose a combined Binomial-ZIP model of type 0 and type 1.In many case studies regarding data with excess zeros, the number of non-zero cases that occur has a value that is not very high or around zero.However, in this case study of lymphatic filariasis, the number of extreme events occurred in several regions, there were even some regions with 11 to Even though Binomial-Poisson mixture model has the best performance in this case study, exploration using comprehensive simulations still needs to be done.This is to further investigate the characteristics of zero-inflated data that are suitable for this model.In this case the proportion of zero data is around 67% with extreme event values in several regions which are very suitable for the Binomial-Poisson model.The mixture model used in this paper can be developed using other zero-inflated models, including the negative binomial distribution [24], or can even use the beta distribution [25].Recently, studies on lymphatic filariasis using a differential equation approach have been carried out as in [26].The development of modeling based on differential equations can also be developed into spatial modeling with regionalization based on stochastic differential equations on point spatial data.Moreover, the development of spatio-temporal and multivariate modeling for zero-inflated data with mixture models is also very interesting to study in future research.

Figure 1 (Figure 1 .
Figure 1.The histogram and the map (b) female filariasis cases in West Java at county level.

ZIP type 0 with 1 √𝜏̂𝑢 = 4 .
occurrence and number for occurrences are similar in mixture model.Both with and without elevation modeling,  ̂ is close to 1 with the most similar random effect values for both regressions is being Binomial-ZIP type 0. Meanwhile, 1/√û shows the marginal deviation from regression intercept (initial risk) , independent in the graph.The highest marginal deviation is in Binomial-866, while the Binomial-Poisson and Binomial-ZIP type 1 models have almost the same small marginal deviation or in other word the models have large precision values.The mean posterior of zero probability for logistic regression are estimated the same ̂= 0.653 for Binomial-ZIP type 0 for both with and without elevation, and 0.066 and 0.062 for Binomial-ZIP type 1 for with and without elevation modeling respectively.Binomial-ZIP type 1 gives lower probability owing to some zeros covered by Poisson distribution.According to DIC, Binomial-Poisson and Binomial-ZIP type 1 performs better compared to Binomial-ZIP type 0. Low DIC values are shown by Binomial-Poisson and Binomial-ZIP type 1 models.To investigate in more detail the estimator values for the two models, Table 2 (without elevation RACHMAWATI, RAHMAN, PUSPONEGORO Disease mapping is based on the smallest DIC values in Binomial-Poisson model.Figure 2(a) presents a mapping of spatial random effects estimator which is the BYM2 component in regression equation, (b) presents estimated relative risk for occurrence probability and (c) estimated relative risk for number of occurrences.Left and right side of Figure 2 state for modeling without and with elevation respectively.In Figure 2(a), spatial random component is very capable of representing the spatial pattern of lymphatic filariasis cases very well.The model can detect region with a high number of extreme cases such as Tasikmalaya, Depok city and Subang which are marked with a strong red color.Likewise, the model very well detects regions with a low number of cases colored in white and green.The red color indicates disease potential is above average, whereas the green color indicates disease potential of region is below average with an average value of 0, and the average is indicated with white color.White region is clearly visible in model with elevation, namely Bandung which is in the central region of West Java.

Figure 2 (
Figure 2(b) shows the same spatial pattern.However, if we see in more detail according to Tables2 and 3, the chance of occurrence probability in model without elevation is Tasikmalaya, namely 0.955, while in model with elevation region with the largest occurrence probability is Depok city with a chance of 0.943.Figure2(c) also shows the same spatial pattern with Tasikmalaya as the region with the largest number of occurrences, but in general model with elevation has a higher number of occurrences in all areas.

Figure 2 .
Figure 2. Binomial-Poisson without elevation (left), with elevation (right) logistics regression and log-linear regression for number of occurrences, with   ,   are the intercept,   is BYM2 used to model spatial random effects.BYM2 model is re-parameterization of BYM model with mixing parameter  ∈ [0,1] and the precision parameter   .BYM2 consist of spatially structured component   and an unstructured component   , while  represents a shared spatial random component between two regressions.In parameters level   ,   are set to have normal prior distribution with large variance,  is set to normal prior with mean 0 and  2 = 0.01.Mixing  and precision parameters are set to penalized complexity prior BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELSMixture models have two likelihoods, one for successive event for region  with Binomial distribution,   ~Binomial(  ,   = 1) , and one for number of occurrences with (truncated) Poisson distribution,   ~(Trunc)Poisson(    ).Because the linear predictors for each response are different, the response matrix must also be adjusted to combine two responses with larger matrix dimensions.Response matrix can be formed from the combination of two column vectors for each process, namely number of occurrences in region  can be zero or positive number, so the disease occurrence   in region  is defined as

Table 1 .
̂, has significant influence on Binomial-ZIP type 1 mixture model.BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELS Mean posterior probability, standard deviation and DIC . We divide modeling into two large parts, namely without elevation and with elevation as an explanatory variable.In modeling without elevation, fixed effect in form of intercepts,  ̂ and  ̂ have significant influence (with credibility interval (CI) does not contain zero) on Binomial-Poisson mixture model with negative values.However, on contrary, intercepts have significant influence on Binomial-ZIP type 0, and only intercept in logit regression,

ZIP type 1 Binomial-ZIP type 1 with Elevation Mean (95% CI)
15 cases in 1 year for rare types of disease.For this reason, this paper also proposes a combined Binomial-Poisson distribution model.The Poisson distribution in this mixture model is expected to be able to overcome the phenomenon of regions that have a number of extreme cases.Inference uses INLA to avoid convergence issue of parameters posterior distribution, and selection of the best model uses smallest DIC.Finally, Bayesian spatial mixture model was applied to female lymphatic filariasis data in 27 districts cities in West Java.The three proposed models can map lymphatic filariasis cases very well.Binomial-ZIP type 0 and Binomial-ZIP type 1 models have similar spatial estimation patterns but type 0 has a larger DIC value.This indicates that filariasis data contains not only structural zero but also sampling zero which must be taken into account in the analysis.The Binomial-Poisson model has the smallest DIC value, this indicates that extreme case events can still be handled very well by Binomial-Possion model.For this reason, the Binomial-Poisson distribution to model zero-inflated data with the number of extreme cases in several regions can be considered for use in modeling.Tasikmalaya has the highest spatial risk compared to other counties.The probability of a lymphatic filariasis case occurring in this region is around 95% and the relative risk of lymphatic filariasis cases in Tasikmalaya can reach 27 times compared to its standard population.Furthermore, in this case there was no significant relationship between elevation and the number of female lymphatic filariasis cases in West Java.BAYESIAN SPATIAL HIERARCHICAL MIXTURE MODELS