CLUSTERING OF CHILDHOOD DIARRHEA DISEASES USING GAUSSIAN MIXTURE MODEL

DEFI YUSTI FAIDAH, ASHILLA MAULA HUDZAIFA


INTRODUCTION
Diarrhea remains one of the leading causes of childhood morbidity and mortality worldwide, especially in developing countries. It is a digestive disorder characterized by passing stools more frequently than usual. Diarrhea has many causes, including Escherichia coli bacteria, viruses such as Rotavirus, parasites, food allergies, Crohn's disease, and side effects of certain drugs. It can be classified by the frequency and duration of bowel movements and the characteristics of the faeces [1]. By duration, the classes are acute diarrhea lasting less than 14 days, persistent diarrhea lasting more than 14 days, and chronic diarrhea lasting more than 30 days [2] [18]. Severe diarrhea can lead to malnutrition and, in the most severe cases, death from loss of salt and water in the body [19] [20]. Indonesia, the fourth most populous country in the world, is home to around 22 million children under five. The Health Ministry of the Republic of Indonesia identifies diarrhea as the leading cause of death in children under 5 years of age, with a mortality rate of 10.7% in 2019 [3]. In addition to causing death, prolonged diarrhea can lead to malnutrition and stunting in children [4].
Diarrhea remains a major public health problem in Indonesia, with 40% of cases occurring in children under five years old. Bandung is one of the major cities and has the sixth highest number of diarrhea cases in West Java. The high incidence of diarrhea in Bandung City is a serious concern for public health efforts, especially those aimed at children under five. In 2022, 6,376 cases of diarrhea in children under five were recorded as treated in Bandung City. Figure 1 shows that Babakan Ciparay is the sub-district with the most diarrhea cases in Bandung City.
Various factors cause diarrheal diseases, one of which is poor hygiene, including improper sanitation facilities [17]. Germs that cause diarrhea spread easily from one person to another through contaminated water, food, or objects [5]. Food hygiene is associated with the development of diarrhea and malnutrition in children of low socioeconomic status [6]. In addition, limited and inadequate sanitation facilities tend to go with poor hygiene and may increase the risk of diarrhea. Various programmes have been carried out to reduce the occurrence of diarrhea, one of which is the provision of clean water and sanitation in areas that still have difficulty accessing clean water. The provision of sanitation and clean water is Goal 6 of the SDGs [31]. However, morbidity and mortality rates from diarrhea remain high because the contributing factors are still highly prevalent. Therefore, this study aims to cluster sub-districts in Bandung City based on the percentage of diarrhea prevalence, the percentage of households with healthy latrine facilities, the percentage of households with clean and healthy living behaviours, and the population density per hectare, in order to provide better insight into the distribution of diarrhea cases for the achievement of the Sustainable Development Goals (SDGs).

The results of this study can highlight the importance of interventions and efforts by the Bandung City government to prevent high numbers of diarrhea cases and to improve the health of children under five.

Data
The data used in this study are secondary data obtained from the data portal and health profile of Bandung City. The variables are the prevalence of diarrhea in children under five, the percentage of households with clean and healthy living behaviour, the percentage of households with healthy latrine facilities, the population density per hectare, and the number of babies under 6 months old who are exclusively breastfed. The study uses 30 observations, one for each sub-district in Bandung City.

Prevalence of diarrhea in children under five
The prevalence of diarrhea in children under five is based on records of the number of diarrhea cases identified and treated. In this study, it was measured for each sub-district in Bandung City.

Percentage of households with clean and healthy living behaviour
Clean and Healthy Living Behaviour (PHBS) covers all health behaviours practised out of awareness, together with taking an active role in community health activities [7]. PHBS involves several settings, namely households, schools, workplaces, health facilities, and public places (Health Profile of Bandung City 2022). The data used in this study are the percentage of households identified as practising clean and healthy living behaviours.

Percentage of households with healthy latrine facilities
Healthy latrines are proper sanitation facilities that help prevent various diseases. The operational definition of a healthy latrine requires a latrine building that is closed and has a non-slip floor, no odor and no visible dirt, a septic tank distance ≥ 10 metres, available cleaning tools, and freedom from insects [8]. The data used in this study are the percentage of households that have healthy latrine facilities.

Population density of a sub-district per hectare
Population density is the number of people per unit area, here per hectare (ha). It is an indicator used to measure how densely an area is populated [28].

Number of babies under 6 months old who are exclusively breastfed
Breast milk is the ideal food for babies up to 6 months old in terms of physical and psychological health [9]. Exclusive breastfeeding until the baby is 6 months old supports the optimal development of the child's intelligence potential [10]. Exclusive breastfeeding before 6 months of age is very important for reducing the risk of various diseases, and breast milk can accelerate recovery in sick children [29][30].

Data Standardization
Standardization is a technique that transforms data so that they have a mean of 0 and a standard deviation of 1. It is used in data analysis when the observed variables have different scales or distributions [27]. The standardized value is computed as

$$ z = \frac{x - \mu}{\sigma} $$

where z is the standardized value, x is the actual data value, µ is the mean of the data, and σ is the standard deviation of the data.
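As a minimal sketch of the standardization step, the Python snippet below computes z-scores for a single variable. The function name `standardize` and the sample values are hypothetical and are not study data. The snippet assumes the population standard deviation; if the sample standard deviation (divisor n - 1) is intended, `statistics.stdev` can be swapped in.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores (x - mu) / sigma for a list of numbers."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population SD (divisor n)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(x - mu) / sigma for x in values]


# Hypothetical population densities (people per hectare), not study data.
density = [120.0, 250.0, 399.0, 80.0, 151.0]
z_density = standardize(density)
print([round(z, 3) for z in z_density])
```

After this transformation every variable contributes on the same scale, which matters because the clustering variables mix percentages, densities, and counts.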

Variance Inflation Factor
Variance Inflation Factor (VIF) is a measure of the severity of multicollinearity in multiple linear regression models involving more than one predictor variable. It compares the variance of an estimate when there is multicollinearity between the predictor variables with the variance when there is none. The formula for calculating VIF is as follows [11]:

$$ VIF_i = \frac{1}{1 - R_i^2} $$

where $R_i^2$ is the coefficient of determination of the i-th variable, obtained by regressing that variable on the other predictor variables. Multicollinearity is indicated by a VIF value > 10, and the greater the VIF, the more serious the multicollinearity [12].
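The VIF calculation can be illustrated with a short Python sketch. It relies on a standard identity: the i-th VIF equals the i-th diagonal element of the inverse of the predictors' correlation matrix, which gives the same value as 1 / (1 - R_i^2). All function names and the sample data are hypothetical; this is not the code used in the study.

```python
from math import sqrt
from statistics import mean


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equally long columns."""
    centered = []
    for col in columns:
        m = mean(col)
        centered.append([x - m for x in col])
    norms = [sqrt(sum(v * v for v in col)) for col in centered]
    p = len(columns)
    return [[sum(a * b for a, b in zip(centered[i], centered[j]))
             / (norms[i] * norms[j]) for j in range(p)] for i in range(p)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: perfect multicollinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical predictors; their correlation is 0.8, so each VIF is
# about 1 / (1 - 0.64), roughly 2.78, well below the threshold of 10.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
print(vif([x1, x2]))
```

A value near 1 means a variable is almost unrelated to the others, while values above 10 would flag it for removal or for a method such as PCA.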

Gaussian Mixture Model
Gaussian Mixture Model (GMM) is a statistical model used to represent complex data distributions by combining multiple Gaussian distributions [21] [22]. In this model, the data are considered to come from several different Gaussian components. GMM is used in various fields, including clustering [26], dimensionality reduction, data distribution modelling, image restoration, and others. In this research, GMM is used to perform clustering.
GMM assumes that the number of Gaussian components equals the number of clusters formed. Each Gaussian is represented by its mean and variance. The purpose of clustering with GMM is to determine the model parameters (mean and covariance matrix) that best fit the data [13]. The models used for clustering differ in the geometry of the Gaussian components, which is determined by their parameters [14], as shown in Table 1.

The probability density function for a one-dimensional Gaussian distribution is:

$$ f(x \mid \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$

where µ and σ represent the mean and the standard deviation of the distribution. The probability density function for d-dimensional multivariate data is expressed as follows:

$$ N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right) $$

where µ denotes the mean of the distribution as a d-dimensional vector, Σ is the covariance matrix of X, T denotes the transpose, and -1 denotes the inverse of the matrix [15].

To maximize the likelihood of the data under the GMM, the Expectation-Maximization (EM) algorithm can be used. The steps are as follows [13][23]:

1) Initialize the values of $\mu_k$, $\Sigma_k$, and $\pi_k$ randomly for all clusters, where π is the mixture coefficient and k is the index of the cluster. The mixture density, a linear combination of the cluster densities, is:

$$ p(x) = \sum_{k=1}^{K} \pi_k \, N(x \mid \mu_k, \Sigma_k) $$

2) E-Step: evaluate the log-likelihood using the current parameters $\mu_k$, $\Sigma_k$, and $\pi_k$. Suppose cluster $C_k$ is represented by a Gaussian distribution $(\mu_k, \Sigma_k)$, so that $P(x_i \mid C_k) = N(x_i \mid \mu_k, \Sigma_k)$ and $P(C_k) = \pi_k$. The probability of $x_i$ belonging to cluster $C_k$ can be calculated by:

$$ P(C_k \mid x_i) = \frac{P(x_i \mid C_k)\,P(C_k)}{P(x_i)} $$

with the likelihood value:

$$ P(x_i) = \sum_{k=1}^{K} P(x_i \mid C_k)\,P(C_k) $$

3) M-Step: update the values of $\mu_k$, $\Sigma_k$, and $P(C_k)$ with the following calculations:

$$ \mu_k = \frac{\sum_{i=1}^{n} P(C_k \mid x_i)\, x_i}{\sum_{i=1}^{n} P(C_k \mid x_i)} $$

$$ \Sigma_k = \frac{\sum_{i=1}^{n} P(C_k \mid x_i)\,(x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} P(C_k \mid x_i)} $$

$$ P(C_k) = \frac{1}{n}\sum_{i=1}^{n} P(C_k \mid x_i) $$

4) Repeat steps 2 and 3 until the convergence criteria are met. For this purpose, a threshold is set for the change in mean and variance between successive iterations. After convergence, each observation is assigned to a cluster by the Maximum a Posteriori (MAP) method, that is, to the cluster with the highest posterior probability.
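The EM steps above can be sketched in Python for the one-dimensional case, which keeps the arithmetic readable. This is an illustrative simplification, not the multivariate EEV model fitted in the study. The function names are hypothetical, the initial means are spread over the sorted data instead of being drawn randomly so that the run is reproducible, and the sample data are invented.

```python
from math import exp, log, pi, sqrt


def normal_pdf(x, mu, var):
    """One-dimensional Gaussian density with mean mu and variance var."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


def e_step(data, weights, means, variances):
    """Return posterior probabilities P(C_k | x_i) and the log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        joint = [w * normal_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # P(x_i)
        loglik += log(total)
        resp.append([j / total for j in joint])
    return resp, loglik


def fit_gmm_1d(data, k, tol=1e-6, max_iter=500):
    n = len(data)
    ordered = sorted(data)
    # Step 1: initialise (assumption: deterministic spread, not random).
    means = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    grand_mean = sum(data) / n
    variances = [sum((x - grand_mean) ** 2 for x in data) / n] * k
    weights = [1.0 / k] * k
    for _ in range(max_iter):
        resp, _loglik = e_step(data, weights, means, variances)  # Step 2
        change = 0.0
        for j in range(k):  # Step 3: update mean, variance, mixing weight
            nk = sum(r[j] for r in resp)
            new_mean = sum(r[j] * x for r, x in zip(resp, data)) / nk
            new_var = sum(r[j] * (x - new_mean) ** 2
                          for r, x in zip(resp, data)) / nk
            new_var = max(new_var, 1e-6)  # floor to avoid collapse
            change = max(change, abs(new_mean - means[j]),
                         abs(new_var - variances[j]))
            means[j], variances[j], weights[j] = new_mean, new_var, nk / n
        if change < tol:  # Step 4: parameters have stopped moving
            break
    resp, loglik = e_step(data, weights, means, variances)
    # MAP assignment: the cluster with the highest posterior probability.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, variances, weights, labels, loglik


# Hypothetical, well separated one-dimensional data (not study data).
sample = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
means, variances, weights, labels, loglik = fit_gmm_1d(sample, k=2)
print(means, labels)
```

The multivariate case replaces the scalar variance with a covariance matrix whose structure is restricted by the chosen model in Table 1.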
The best model in an analysis using the Gaussian Mixture Model (GMM) method is selected with the general approach based on the Bayes Information Criterion (BIC), which is evaluated for each combination of model parameterization and number of clusters [16][24] [25]:

$$ BIC_k = 2 \log P(y \mid \hat{\theta}_k, M_k) - V_k \log(n) $$

where
$P(y \mid \hat{\theta}_k, M_k)$ : the maximized mixture likelihood for the $M_k$ model,
$V_k$ : the number of independent parameters estimated in the $M_k$ model,
$n$ : the number of observations.
The best model and number of clusters are determined based on the highest BIC value.
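A small Python sketch of this selection rule is given below, using the "higher is better" form 2 log L - V log n. The candidate names, log-likelihoods, and parameter counts are invented for illustration and are not results from the study. Some software reports BIC with the opposite sign, in which case the lowest value is preferred.

```python
from math import log


def bic(log_likelihood, n_parameters, n_observations):
    """BIC in the 'higher is better' form: 2 log L - V log n."""
    return 2.0 * log_likelihood - n_parameters * log(n_observations)


# Hypothetical candidates: name -> (maximized log-likelihood, parameters).
candidates = {
    "model_A_k2": (-150.0, 10),
    "model_B_k3": (-120.0, 25),
    "model_C_k5": (-118.0, 40),
}
n_obs = 30  # one observation per sub-district, as in the study

scores = {name: bic(ll, v, n_obs) for name, (ll, v) in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

In this invented example the most complex candidate has the highest likelihood but loses on BIC, because the penalty term grows with the number of estimated parameters.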

RESULTS
Clustering analysis was conducted in R software with sub-districts as the observations. The variables were the prevalence of diarrhea in children under five, the percentage of households with clean and healthy living behaviours, the percentage of households with healthy latrine facilities, the population density per hectare, and the number of babies under 6 months old who are exclusively breastfed. The mapping of sub-districts in Bandung City by each variable can be seen in Figures 2 to 6. The highest population density, 399 people per hectare, is in the sub-district of Bojong Kaler. Five sub-districts in Bandung City have more than 400 exclusively breastfed babies under 6 months old, namely Andir, Coblong, Bandung Kulon, Sukajadi, and Ujung Berung.

Multicollinearity Test
Multicollinearity between the variables is tested as an initial stage of variable selection. Based on Table 2, no VIF value exceeds 10. It can therefore be concluded that there is no multicollinearity among the variables, and all variables are used in the cluster analysis.

Identification of BIC Value
The BIC value is identified to determine the best model and the number of clusters formed. Based on the results, the sub-districts in Bandung City can be grouped into 5 clusters based on the predetermined variables.

Clustering
The grouping of sub-districts into 5 clusters based on the EEV model places 6 sub-districts in cluster 1, 9 in cluster 2, 5 in cluster 3, 4 in cluster 4, and 6 in cluster 5. The characteristics of each cluster can be viewed in Table 4. The grouping of the 30 sub-districts into 5 clusters can be viewed in Table 5, and the cluster mapping can be viewed in Figure 8.

Figure 1. Number of Diarrhea Cases in Children Under Five in Bandung City.

Figure 2. Mapping of Sub-districts based on Prevalence Rate of Diarrhea in Bandung City.

Figure 3.

Figure 5. Mapping of Sub-districts based on Population per Hectare in Bandung City.

Figure 6. Mapping of Sub-districts based on Number of Babies under 6 Months Old who are Exclusively Breastfed in Bandung City.
Based on the analysis results, a comparison chart of the BIC values of the various models was obtained. According to Figure 7, the Ellipsoidal, Equal Volume and Shape (EEV) model has the highest BIC value.

Figure 7. BIC Value of GMM results.

The identification results of the Gaussian Mixture Model EEV based on the Expectation-Maximization (EM) algorithm show a model with five components, which can be viewed in Table 3.

Figure 9. Cluster map of diarrhea in children under five in Bandung City.

Based on Figure 9, the different colours of the regions indicate the different clusters. Areas coloured black are characterised by a high prevalence of diarrhea.

Table 1. Covariance matrix and geometry formed by Mclust in the Gaussian Mixture Model.

Table 2. VIF value

If there is multicollinearity in the variables used, variables need to be selected with a suitable method, such as Principal Component Analysis (PCA). The results of multicollinearity testing using the Variance Inflation Factor (VIF) can be viewed in Table 2.

Table 3. Five components of the EEV model

Table 4. Means of cluster

... each other, indicating that the distribution of sub-districts in each cluster is based on the extent to which sub-district characteristics are close to the centre of a particular cluster.

Table 5. Clustering and characteristics of each cluster