Clustering Provinces in Indonesia Based on Daily Covid-19 Cases

Coronavirus Disease 2019 (Covid-19) is the most recently identified coronavirus disease. It was first detected in Wuhan, Hubei Province, China, and spread very rapidly: from Wuhan to the surrounding provinces, then to all provinces in China, and on to other countries, including Indonesia. As of August 17, 2020, the Indonesian Ministry of Health (KEMENKES RI) reported 141,370 positive cases, 94,458 recoveries, and 6,207 deaths across the 34 Indonesian provinces, which corresponds to a case fatality rate of 4.39%. Because the pandemic has a major impact on all sectors, it is important to explore the data and to cluster the provinces in support of government policy making, such as decisions on work from home (WFH), large-scale social restrictions (PSBB), New Normal, or lockdown. Clustering groups provinces whose Covid-19 case data are similar (homogeneous), so that provinces within a cluster share similar characteristics while different clusters have different characteristics. The clustering is expected to identify areas of high, medium, and low risk and to describe their characteristics. Applying hierarchical clustering to the Covid-19 data yields 4 clusters: cluster 1 (Jakarta and East Java), cluster 2 (Central Java, West Java, and South Sulawesi), cluster 3 (South Kalimantan, North Sumatra, South Sumatra, Papua, Bali, and East Kalimantan), and cluster 4 (all other provinces). Cluster validation shows that hierarchical clustering (average linkage) with the Dynamic Time Warping (DTW) distance gives a good classification, with an average silhouette value of 0.70.


Introduction
A new type of coronavirus, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has emerged [1]. It was discovered in Wuhan, Hubei Province, China, as a mysterious pneumonia in December 2019, and the disease it causes is therefore called Coronavirus Disease 2019 (Covid-19). The disease spread rapidly from Wuhan to other regions and has now reached almost all countries and territories. The World Health Organization (WHO) declared the Covid-19 outbreak a pandemic on March 11, 2020 [2]. On March 2, 2020, the first confirmed positive Covid-19 case appeared in Indonesia (DKI Jakarta Province). As in other countries, Covid-19 then spread rapidly to all provinces in Indonesia [3]. As of August 17, 2020, the Indonesian Ministry of Health reported 141,370 positive cases, 94,458 recoveries, and 6,207 deaths across the 34 Indonesian provinces. This means that the case fatality rate (CFR) in Indonesia is 4.39%.
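The reported CFR can be checked directly from the figures above, assuming it is computed in the usual way as deaths divided by confirmed positive cases:

```latex
\[
\mathrm{CFR}
  = \frac{\text{deaths}}{\text{confirmed positive cases}} \times 100\%
  = \frac{6{,}207}{141{,}370} \times 100\%
  \approx 4.39\%
\]
```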
The Covid-19 pandemic has a major impact on all sectors, so it is very important to describe and cluster the data in support of central government policy making, such as decisions on Work From Home (WFH), New Normal, or lockdown [4]. One of the difficulties the government faces in handling Covid-19 is that the level of emergency and the policies implemented differ among local governments. The central government recognizes that each region has different characteristics of Covid-19 cases, so knowledge about the similarities among regions is needed to handle the pandemic. Therefore, the aim of this study is to cluster provinces using time series data on active cases of Covid-19 in Indonesia. Time series clustering has been shown to be efficient in providing useful information; unlike static data, the time series of a feature contains values that change over time [5]. We therefore use this method to cluster provinces based on daily Covid-19 data.
Time series clustering has been applied in several earlier studies in Indonesia: in 2018, to develop time series models with cluster analysis for the proportion of area attacked by the main plant-disturbing organisms of food crops [6]; in 2019, to cluster Indonesian provinces based on rice production [7]; and in 2020, to model rice prices in western Indonesia [8].

Time Series Clustering
Clustering is an unsupervised method of grouping data patterns into clusters so that patterns within a cluster are closely similar to each other but different from patterns in other clusters [9]. Time series clustering is often needed to solve real problems from different domains and to provide useful knowledge [10]. Clustering is appropriate in the absence of labeled data, irrespective of whether the data are nominal, ordinal, interval, ratio, textual, spatial, temporal, spatio-temporal, image, multimedia, or mixtures of these data types.
Time series clustering groups objects based on their time series patterns [5]. Data are considered static if their feature values do not change over time, or change only negligibly. Hierarchical methods are one family of clustering methods developed for handling various kinds of static data. A hierarchical clustering technique operates by grouping data items into a cluster tree. Time series clustering, like static data clustering, requires a clustering algorithm or procedure to form clusters from a collection of unlabeled data objects, but the choice of distance measure and clustering method must suit the structure of time series data, which is highly dynamic in nature [5]. Several general-purpose clustering algorithms and procedures have been used in time series clustering studies.

Distance Measurement
A distance measure is used to quantify the similarity or dissimilarity between two time series. Two commonly used measures are the Euclidean distance and the Dynamic Time Warping (DTW) distance [11]. We used DTW because it is a similarity measure designed for time series: it relaxes the one-to-one alignment constraint and also accepts time series of unequal length. The DTW distance is obtained from a cumulative distance matrix in which each element adds the local distance to the minimum of its three neighboring cumulative distances [8]. The goal of the DTW distance is to find a mapping $r$ between the series $X_T$ and $Y_T$ that minimizes a particular distance measure between the coupled observations $(X_{a_i}, Y_{b_i})$. Let $M$ be the set of all possible sequences of $m$ pairs that preserve the order of observations, of the form $r = ((X_{a_1}, Y_{b_1}), \ldots, (X_{a_m}, Y_{b_m}))$ with $a_i, b_j \in \{1, \ldots, T\}$ such that $a_1 = b_1 = 1$, $a_m = b_m = T$, and $a_{i+1} = a_i$ or $a_i + 1$ and $b_{i+1} = b_i$ or $b_i + 1$, for $i \in \{1, \ldots, m - 1\}$. The DTW distance is defined as [12]

$$d_{DTW}(X_T, Y_T) = \min_{r \in M} \left( \sum_{i=1}^{m} |X_{a_i} - Y_{b_i}| \right)$$
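The recursion behind the DTW distance can be illustrated with a minimal Python sketch. It assumes the absolute difference as the local cost and places no window constraint on the warping path; the function names `dtw_distance` and `dtw_matrix` are hypothetical and are not taken from the study, which reports using R.

```python
import numpy as np


def dtw_distance(x, y):
    """DTW distance between two 1-D series (absolute-difference local cost)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # acc[i, j] = minimal cumulative cost of aligning x[:i] with y[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Local cost plus the cheapest of the three neighboring cells.
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return float(acc[n, m])


def dtw_matrix(series):
    """Symmetric matrix of pairwise DTW distances for a list of series."""
    k = len(series)
    dist = np.zeros((k, k))
    for a in range(k):
        for b in range(a + 1, k):
            dist[a, b] = dist[b, a] = dtw_distance(series[a], series[b])
    return dist
```

Because the warping path may repeat observations, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero even though the two series differ in length, which a pointwise Euclidean distance cannot express.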

Clustering tools: Hierarchical clustering
Hierarchical clustering approaches organize data into a hierarchical structure on the basis of a suitable proximity measure, that is, a similarity index or a dissimilarity (distance) measure [13]. There are two types of hierarchical clustering: agglomerative and divisive. Agglomerative hierarchical clustering is more common than divisive clustering [5]. Agglomerative clustering begins with I singleton clusters, each containing exactly one data point. A sequence of merge operations then follows, eventually forcing all objects into the same group [13]. The general agglomerative procedure can be summarized as follows [14]:
1. Begin with I singleton clusters.
2. Find the minimum distance in the distance matrix and merge the corresponding pair of clusters.
3. Update the distance matrix by calculating the distances between the cluster formed in step 2 and the other clusters.
4. Repeat steps 2 and 3 until there is just one cluster.
Several strategies are available for defining the distance between clusters: single linkage (SL), complete linkage (CL), and average linkage (AL). These methods use all points of a pair of clusters when calculating the inter-cluster distance, and they are also called graph methods [13].
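The agglomerative procedure and the three linkage rules can be sketched as follows. This is a deliberately naive illustration, not the implementation used in the study: the function name `agglomerate` is hypothetical, and instead of updating the distance matrix in place it recomputes each inter-cluster distance from the original pairwise distances, which gives the same result for these three linkages.

```python
import numpy as np


def agglomerate(dist, linkage="average"):
    """Naive agglomerative clustering on a square distance matrix.

    Returns the merge history as (members_a, members_b, merge_height) tuples.
    """
    dist = np.asarray(dist, dtype=float)
    # Single, complete, and average linkage use the min, max, and mean of
    # all pairwise distances between the two clusters.
    combine = {"single": np.min, "complete": np.max, "average": np.mean}[linkage]
    clusters = [[i] for i in range(len(dist))]  # step 1: singleton clusters
    merges = []
    while len(clusters) > 1:  # step 4: repeat until one cluster remains
        best = None
        for a in range(len(clusters)):  # step 2: find the closest pair
            for b in range(a + 1, len(clusters)):
                d = combine(dist[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        # Step 3: replace the merged pair by their union.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges


# Toy example: four points on a line forming two obvious groups.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
history = agglomerate(toy_dist, linkage="average")
```

On the toy data the two close pairs merge first at height 1, and the final merge height depends on the linkage: the smallest, largest, or mean distance between the two groups.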

Evaluation of Technical Dissimilarity Measures
One measure for evaluating the dissimilarity measures used in a clustering technique is the cophenetic correlation coefficient [15]. It is the correlation between the elements of the original dissimilarity matrix (Euclidean distance) and the elements of the cophenetic matrix generated by the dendrogram, which depends on the distance measure and the linkage method used. The cophenetic correlation coefficient is

$$r_{Coph} = \frac{\sum_{i<j} (d_{ij} - \bar{d})(c_{ij} - \bar{c})}{\sqrt{\sum_{i<j} (d_{ij} - \bar{d})^2 \sum_{i<j} (c_{ij} - \bar{c})^2}}$$

where:
$r_{Coph}$ = cophenetic correlation coefficient
$d_{ij}$ = Euclidean distance between the $i$-th and $j$-th objects
$\bar{d}$ = average of $d_{ij}$
$c_{ij}$ = cophenetic distance between the $i$-th and $j$-th objects
$\bar{c}$ = average of $c_{ij}$
For the DTW distance measure, the linkage method that produces the largest cophenetic correlation is taken as the best solution for the hierarchical clustering method [7].
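The selection rule can be sketched with SciPy, whose `linkage` and `cophenet` functions build the dendrogram and return the cophenetic correlation. The helper name `cophenetic_by_linkage` and the toy distance matrix are assumptions for illustration only; in the study the input would be the DTW distance matrix of the provinces.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import squareform


def cophenetic_by_linkage(dist_matrix, methods=("single", "complete", "average")):
    """Cophenetic correlation of each linkage method for one distance matrix."""
    # SciPy expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(np.asarray(dist_matrix, dtype=float))
    scores = {}
    for method in methods:
        tree = linkage(condensed, method=method)
        # With the original distances supplied, cophenet returns the
        # correlation and the condensed cophenetic distances.
        corr, _ = cophenet(tree, condensed)
        scores[method] = float(corr)
    return scores


# Toy distance matrix from five points on a line (illustrative data only).
points = np.array([0.0, 1.0, 4.0, 10.0, 11.0])
toy = np.abs(points[:, None] - points[None, :])
scores = cophenetic_by_linkage(toy)
best_method = max(scores, key=scores.get)  # largest cophenetic correlation
```

The returned coefficient is the Pearson correlation between the original and the cophenetic distances, so it can be cross-checked with `np.corrcoef` on the two condensed vectors.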
Let $a_{ic}$ denote the average Euclidean (or squared Euclidean) distance of the $i$-th unit to all the other units belonging to its cluster $c$. Also, let $d_{iq}$ denote this unit's average distance to all the units belonging to some other cluster $q$ ($q \neq c$). Finally, let $b_{ic}$ be the minimum of $d_{iq}$ over $q = 1, \ldots, C$, $q \neq c$, reflecting the $i$-th unit's dissimilarity to its nearest neighboring cluster. Then the silhouette of the $i$-th object is defined as

$$S_i = \frac{b_{ic} - a_{ic}}{\max\{a_{ic}, b_{ic}\}}$$

where the denominator acts as a normalization term. The greater the value of $S_i$, the better the assignment of the $i$-th unit to the $c$-th cluster. The overall silhouette is the mean of $S_i$ over $i = 1, \ldots, I$ [13]:

$$S = \frac{1}{I} \sum_{i=1}^{I} S_i$$
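The silhouette can be computed directly from a precomputed distance matrix, as in the following sketch. The function name `silhouette_values` is hypothetical, and assigning a silhouette of 0 to units in singleton clusters is an assumed convention that the definition above does not cover.

```python
import numpy as np


def silhouette_values(dist, labels):
    """Per-unit silhouette values from a square distance matrix and labels."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    values = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        own[i] = False  # exclude the unit itself from its own cluster
        if not own.any():
            continue  # singleton cluster: silhouette left at 0 (assumption)
        a = dist[i, own].mean()  # average distance within the own cluster
        # Smallest average distance to any other cluster.
        b = min(
            dist[i, labels == q].mean()
            for q in np.unique(labels)
            if q != labels[i]
        )
        values[i] = (b - a) / max(a, b)
    return values


# Toy example: four points on a line, two well-separated clusters.
points = np.array([0.0, 1.0, 10.0, 11.0])
toy_dist = np.abs(points[:, None] - points[None, :])
s_i = silhouette_values(toy_dist, [0, 0, 1, 1])
average_silhouette = s_i.mean()
```

For the toy data every unit is much closer to its own cluster than to the other one, so all values are close to 1; values near 0 or below would indicate units lying between clusters or assigned to the wrong one.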

Methods
The data used in this study are the total daily positive cases of Covid-19 in Indonesia, obtained from the KawalCOVID19 information site (https://kawalcovid19.id/). The total daily positive cases form time series data. This study used data from March 18, 2020 to September 17, 2020.
The initial step in this study is to describe the daily positive cases of Covid-19 in the 34 provinces. The next step is cluster analysis for time series data, which consists of several stages: the distance is measured with Dynamic Time Warping (DTW); clustering uses hierarchical methods (single linkage (SL), complete linkage (CL), and average linkage (AL)); the dissimilarity measures are evaluated with the cophenetic correlation coefficient; and cluster validity is assessed with the silhouette criterion. The result of this stage is a clustering of the 34 provinces in Indonesia based on the total daily positive cases of Covid-19.
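The stages up to the cluster assignment can be chained as in the sketch below. It is only an illustration in Python with SciPy and does not reproduce the R computations of the study: the series are synthetic stand-ins for provincial cumulative case counts, the helper names `dtw` and `cluster_series` are hypothetical, and the silhouette validation step is left out.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist


def dtw(x, y):
    """DTW distance with absolute-difference local cost (no window)."""
    n, m = len(x), len(y)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]


def cluster_series(series, n_clusters=4, methods=("single", "complete", "average")):
    """DTW distances, then the linkage with the largest cophenetic correlation."""
    # pdist accepts a callable metric and returns condensed distances.
    condensed = pdist(np.asarray(series, dtype=float), metric=dtw)
    results = {}
    for method in methods:
        tree = linkage(condensed, method=method)
        corr, _ = cophenet(tree, condensed)
        results[method] = (float(corr), tree)
    best = max(results, key=lambda name: results[name][0])
    # Cut the best dendrogram into at most n_clusters groups.
    labels = fcluster(results[best][1], t=n_clusters, criterion="maxclust")
    return best, labels, {name: r[0] for name, r in results.items()}


# Synthetic data: eight linear cumulative series in four pairs of similar slope.
days = np.arange(1, 31)
slopes = [10, 11, 40, 41, 70, 71, 100, 101]
series = [slope * days for slope in slopes]
best_method, labels, cophenetic = cluster_series(series, n_clusters=4)
```

With these synthetic slopes the distances within each pair are far smaller than the distances between pairs, so every linkage method recovers the four pairs; real case data would not separate this cleanly.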

Descriptive Statistic
The total number of Covid-19 cases in Indonesia over the period March 18, 2020 to September 17, 2020 reached 232,628. These cases were spread across the 34 provinces. A plot of the number of daily Covid-19 cases can be seen in Figure 1. The plot shows that the number of Covid-19 cases in Indonesia increased every day. Jakarta, the capital of Indonesia, has by far the highest number of Covid-19 cases and differs markedly from the other provinces. The descriptive statistics of Covid-19 cases in Indonesia:

Clustering
The initial step in time series clustering is to calculate the distances between the series, for which the DTW distance is used. Examples of the resulting DTW distances between the Covid-19 case series of the 34 provinces are Aceh-Bali (89,575) and Banten-Aceh (8,295). The minimum DTW distance is between Sulteng and Babel (555), and the maximum is between Sulteng and Jakarta (2,758,122). The first clustering step therefore merges Sulteng and Babel into one cluster.
The next step is to cluster the data with hierarchical methods (single linkage (SL), complete linkage (CL), and average linkage (AL)). The dendrogram obtained from the calculations in R is shown in Figure 2, which indicates that the Covid-19 cases form 4 clusters. The cophenetic value is then calculated to evaluate the dissimilarity measures. The cophenetic values of the three linkage methods are given in Table 2. These results indicate that average linkage is the best method because its cophenetic value is the largest (0.69). For comparison, we then used an alternative clustering with no weight for the dynamic behavior (raw DTW). The resulting dendrogram is shown in Figure 3, which again indicates that the daily Covid-19 cases form 4 clusters. The corresponding cophenetic values are given in Table 3, which indicates that average linkage is the best method because its cophenetic value is the largest (0.95).
One way to estimate cluster quality is to look at the silhouette value. Based on the number of positive Covid-19 cases in the provinces of Indonesia, the grouping indicates that high-mobility regions such as Jakarta and East Java are in one cluster. This cluster can also be regarded as the cluster with the highest Covid-19 risk. The fourth cluster consists of provinces where the Covid-19 risk is quite low, because the provinces in this group are small and have low mobility.

Conclusion
A time series clustering approach applied to the provinces of Indonesia based on daily Covid-19 cases yields 4 clusters: cluster 1 (Jakarta and East Java), cluster 2 (Central Java, West Java, and South Sulawesi), cluster 3 (South Kalimantan, North Sumatra, South Sumatra, Papua, Bali, and East Kalimantan), and cluster 4 (all other provinces). Cluster validation shows that hierarchical clustering (average linkage) with the Dynamic Time Warping (DTW) distance gives a good classification, with an average silhouette value of 0.70.