The ensemble distance on model-based clustering for regions clustering based on rainfall: The case of rainfall in West Java Indonesia

Time series clustering is an active area of research, and its development is driven largely by the choice of distance metric. The ARIMA model is one of the models that can be employed in model-based clustering, although differing model selection criteria can lead to uncertainty about which model to use. In this investigation


Introduction
Cluster analysis, a technique that groups similar objects together while keeping dissimilar ones apart (Gan et al., 2007), has become a pivotal tool in data analysis and pattern recognition. In recent years, its application has extended to a diverse range of data types. Notably, time series data, a category of dynamic data in which observations evolve sequentially, have gained prominence in various domains. Time series clustering offers valuable insight into the temporal evolution of phenomena and plays an important role in many applications, including the detection of data relationships, predictive modeling, recommendation systems, and the discovery of intricate data patterns (Aghabozorgi et al., 2015). Researchers have applied time series clustering techniques across an array of fields. Gullo et al. (2012), for example, combined the Dynamic Time Warping distance with k-means to cluster healthcare data. Caiado et al. (2006) explored time series clustering in the economic domain, specifically for industrial production index data, and Corduas and Piccolo (2008) extended time series clustering to domains as diverse as economics and medicine.
Several researchers have worked on refining the clustering of time series data. Aghabozorgi et al. (2015), Liao (2005), Rani and Sikka (2012), and Ergüner Özkoç (2021) have published comprehensive reviews of the existing research, and these reviews reveal a remarkable diversity of approaches and techniques. For instance, Javed et al. (2020) compared clustering algorithms in three categories (partitional, hierarchical, and density-based) while also evaluating the suitability of various distance measures, including the Euclidean, Dynamic Time Warping (DTW), and shape-based measures. The complexity of time series data is a common challenge. Because of their temporal structure and high dimensionality, most time series data defy the straightforward application of conventional clustering algorithms. Consequently, researchers have directed their efforts towards developing new measures of similarity and dissimilarity that account for temporal dependence and large dimensions (Keogh & Kasetty, 2003; Rani & Sikka, 2012). As a result, the search for effective distance metrics and customized similarity measures has become a cornerstone of time series clustering research. Clustering methods have also been adapted: either traditional clustering algorithms have been extended to fit the time series format, or the time series have been transformed into a more amenable structure so that standard clustering techniques can be applied (Liao, 2005). Keogh and Kasetty (2003) underscore the crucial role of similarity and dissimilarity measures, emphasizing that they lie at the heart of clustering algorithms and shape the patterns that emerge from the data. In some cases, domain experts have crafted custom distance metrics; Biabiany et al. (2020), for example, devised an expert distance metric for climate clustering.
Time series clustering approaches can typically be categorized into three primary methods: raw-data-based, feature-based, and model-based clustering (Liao 2005). The choice of method depends on the characteristics of the data and the goals of the analysis. Model-based clustering has gained traction because it can accommodate time series with varying observation periods. One prominent model used in model-based time series clustering is the Autoregressive Integrated Moving Average (ARIMA) model. ARIMA-based clustering has been used by several researchers as an effective means of measuring the similarity between time series, including Piccolo (1990), Piccolo (2010), Maharaj (2000), Kalpakis et al. (2001), Corduas and Piccolo (2008), and Triacca (2016).
One central challenge in employing ARIMA models for time series clustering is selecting the most appropriate model from the many possibilities. From a given time series, it is possible to derive multiple ARIMA models, each potentially suited to the data but distinct in complexity and representation. The crux of the issue lies in the diverse selection criteria used to identify the optimal model, including well-known measures such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Different criteria may select different models, which creates uncertainty about which model to use for the analysis. This has direct implications for time series clustering: because the distance between series is computed from the selected models, a different model choice can lead to markedly different cluster results. The present study addresses this problem by devising a robust and consistent approach to model-based clustering for rainfall patterns in West Java, Indonesia, in the presence of the variability introduced by model selection criteria.
While existing approaches have explored clustering time series data using various methods, this research pursues a distinct objective. Hendrawati et al. (2020) developed a method for clustering time series data based on ensemble parameters of the ARIMA model. In contrast, the primary aim of this study is to develop a method for time series clustering based on the concept of ensemble distance. The ensemble distance method combines multiple models, each selected by a different model selection criterion, rather than adhering to a single model choice. This approach circumvents the limitations of relying on one selected model, which can yield inaccurate or unreliable clustering results. The following sections describe the ensemble distance method in detail and apply it to rainfall pattern clustering in West Java, Indonesia.

ARIMA Model
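Rainfall data are time series data and can therefore be modeled using ARIMA. The autoregressive model of order p, denoted AR(p), is
$$Z_t = \phi_1 Z_{t-1} + \phi_2 Z_{t-2} + \dots + \phi_p Z_{t-p} + a_t$$
The autoregressive moving average model, denoted ARMA(p, q), is
$$Z_t = \phi_1 Z_{t-1} + \dots + \phi_p Z_{t-p} + a_t - \theta_1 a_{t-1} - \dots - \theta_q a_{t-q}$$
The autoregressive integrated moving average model, abbreviated ARIMA(p, d, q), is
$$\phi_p(B)\,(1 - B)^d Z_t = \theta_q(B)\, a_t$$
where $\phi_p(B) = 1 - \phi_1 B - \dots - \phi_p B^p$; $\theta_q(B) = 1 - \theta_1 B - \dots - \theta_q B^q$; $B$ is the backshift operator; $a_t$ is the white noise error at time t; p is the order of AR, d is the order of differencing, and q is the order of MA (Wei 2006; Hendrawati et al. 2020).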
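The recursion behind these models can be illustrated with a short sketch. It assumes the sign convention in which the MA terms are subtracted, treats all values before the start of the series as zero, and uses hypothetical function names (`arma_from_shocks`, `difference`); it is an illustration of the model definitions, not the software used in the study.

```python
import random


def arma_from_shocks(phi, theta, shocks):
    """Build Z_t = sum(phi_i * Z_{t-i}) + a_t - sum(theta_j * a_{t-j}).

    `shocks` holds the white-noise values a_t. Values before the start of
    the series are taken as zero (an assumption made for this sketch).
    """
    z = []
    for t, a_t in enumerate(shocks):
        ar_part = sum(phi[i] * z[t - 1 - i]
                      for i in range(len(phi)) if t - 1 - i >= 0)
        ma_part = sum(theta[j] * shocks[t - 1 - j]
                      for j in range(len(theta)) if t - 1 - j >= 0)
        z.append(ar_part + a_t - ma_part)
    return z


def difference(series, d=1):
    """Apply (1 - B)^d, the differencing step of ARIMA(p, d, q)."""
    for _ in range(d):
        series = [series[t] - series[t - 1] for t in range(1, len(series))]
    return series


# Example: 200 observations from an AR(2) process with Gaussian shocks.
rng = random.Random(1)
shocks = [rng.gauss(0.0, 1.0) for _ in range(200)]
ar2_series = arma_from_shocks([0.4, 0.5], [], shocks)
```

Passing a fixed list of shocks makes the recursion easy to check by hand: a single unit shock fed into an AR(1) model with coefficient 0.5 decays as 1, 0.5, 0.25.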

Model Selection Criteria
Several criteria are available for selecting the best model: Akaike's Information Criterion (AIC); the Bayesian Information Criterion (BIC); the bias-corrected Akaike's Information Criterion (AICc); the Mean Absolute Percentage Error (MAPE); and the Root Mean Squared Error (RMSE).
In these criteria, $\hat{\sigma}^2$ denotes the estimator of the error variance; $k$ denotes the number of parameters; $n$ denotes the number of observations; $Z_t$ denotes the observed value at time t; and $\hat{Z}_t$ denotes the fitted value at time t (Cryer & Chan, 2008; Montgomery et al., 2008).
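As a rough illustration of how the five criteria can be computed for one fitted model, the sketch below uses commonly quoted Gaussian forms: AIC = n ln(σ̂²) + 2k, BIC = n ln(σ̂²) + k ln(n), and AICc = AIC + 2k(k + 1)/(n - k - 1). These exact forms are an assumption, since the formulas used in the study are not recoverable from the text, and the function name `selection_criteria` is hypothetical.

```python
import math


def selection_criteria(observed, fitted, k):
    """Return AIC, BIC, AICc, MAPE and RMSE for one fitted model.

    Assumed forms (not confirmed by the source text):
      AIC  = n * ln(sigma2_hat) + 2k
      BIC  = n * ln(sigma2_hat) + k * ln(n)
      AICc = AIC + 2k(k + 1) / (n - k - 1)
    MAPE is undefined when an observed value is zero, which matters for
    dry months in rainfall data; this sketch does not handle that case.
    """
    n = len(observed)
    residuals = [z - f for z, f in zip(observed, fitted)]
    sigma2_hat = sum(e * e for e in residuals) / n
    aic = n * math.log(sigma2_hat) + 2 * k
    bic = n * math.log(sigma2_hat) + k * math.log(n)
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    rmse = math.sqrt(sigma2_hat)
    mape = 100.0 / n * sum(abs(e / z) for e, z in zip(residuals, observed))
    return {"AIC": aic, "BIC": bic, "AICc": aicc, "MAPE": mape, "RMSE": rmse}


scores = selection_criteria([10, 20, 30, 40], [11, 19, 31, 39], k=1)
```

AIC, BIC, and AICc penalize the number of parameters, whereas MAPE and RMSE measure fit alone, so the five criteria need not agree on the best model.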

Ward Method
Ward's method joins objects into groups so that the variance within groups is minimized. At each stage, the pair of clusters whose merger produces the smallest increase in the Error Sum of Squares (ESS) is combined.
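The merge rule can be sketched for points in Euclidean space, where the increase in ESS caused by merging clusters A and B equals n_A n_B / (n_A + n_B) times the squared distance between their centroids. How the study feeds time series distances into Ward's method is not stated, so the point-based setting and the function names below are assumptions made for illustration.

```python
def centroid(cluster):
    """Coordinate-wise mean of a list of points."""
    n = len(cluster)
    return [sum(point[d] for point in cluster) / n
            for d in range(len(cluster[0]))]


def ess(cluster):
    """Error Sum of Squares: squared deviations from the cluster centroid."""
    c = centroid(cluster)
    return sum((x - m) ** 2 for point in cluster for x, m in zip(point, c))


def ward_merge_cost(a, b):
    """Increase in total ESS if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


def best_merge(clusters):
    """Return (cost, i, j) for the pair with the smallest ESS increase."""
    pairs = [(ward_merge_cost(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))]
    return min(pairs)


clusters = [[[0, 0], [2, 0]], [[5, 0]], [[2.5, 0]]]
cost, i, j = best_merge(clusters)  # merges clusters 0 and 2
```

The closed-form cost agrees with the direct computation `ess(a + b) - ess(a) - ess(b)`, which is a convenient check on any implementation.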

Optimization of clusters number
To determine the optimal number of clusters, we use the average silhouette coefficient (SC). For each object, the SC compares the average distance to the other objects in its own cluster with the average distance to the objects in the nearest neighboring cluster:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
where $a(i)$ denotes the average distance between the i-th object and the other objects in its own cluster, and $b(i)$ denotes the average distance between the i-th object and the objects in its nearest neighboring cluster (Everitt et al. 2011; Jain & Dubes 1988; Kaufman & Rousseeuw 1990).
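A minimal sketch of the average silhouette coefficient computed from a precomputed distance matrix is given below. It assumes at least two clusters, assigns a score of zero to singleton clusters (a common convention, not taken from the source), and uses the hypothetical function name `silhouette_average`.

```python
def silhouette_average(dist, labels):
    """Average silhouette coefficient from a square distance matrix.

    dist[i][j] is the distance between objects i and j; labels[i] is the
    cluster of object i. Assumes at least two distinct clusters.
    """
    n = len(labels)
    cluster_ids = sorted(set(labels))
    scores = []
    for i in range(n):
        own = [dist[i][j] for j in range(n)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score set to zero
            continue
        a = sum(own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in cluster_ids if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n


# Four objects on a line at positions 0, 1, 10 and 11, in two clusters.
positions = [0, 1, 10, 11]
dist = [[abs(p - q) for q in positions] for p in positions]
sc = silhouette_average(dist, [0, 0, 1, 1])
```

In practice the coefficient would be evaluated for each candidate number of clusters, and the number giving the largest average value would be retained.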

Cluster Model
Each cluster is represented by a single time series. This prototype series is obtained by averaging the data of the cluster's members.
The accuracy of the clustering method is assessed after the clusters have been formed. Each cluster is compared with the reference (ground-truth) grouping: series that originate from the same underlying group should be placed in the same cluster. When series from the same underlying group are assigned to different clusters, this is an error in the clustering process, referred to as a misclassification error. The accuracy is computed as
$$\text{Accuracy} = \frac{n - e}{n} \times 100\% \qquad (12)$$
where $n$ is the number of objects and $e$ is the number of misclassified objects.
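The accuracy calculation can be sketched as follows. Because cluster labels are arbitrary, the sketch matches clusters to ground-truth groups by trying every one-to-one assignment and keeping the one with the fewest misclassified objects. This matching rule is an assumption, as the text does not say how clusters are paired with groups; the sketch also assumes there are no more clusters than groups, and `clustering_accuracy` is a hypothetical name.

```python
from itertools import permutations


def clustering_accuracy(true_groups, cluster_labels):
    """Percentage of correctly clustered objects, (n - e) / n * 100.

    Clusters are matched to ground-truth groups by the one-to-one
    assignment with the fewest errors (an assumed matching rule).
    Requires the number of clusters not to exceed the number of groups.
    """
    n = len(true_groups)
    groups = sorted(set(true_groups))
    clusters = sorted(set(cluster_labels))
    fewest_errors = n
    for assignment in permutations(groups, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        errors = sum(1 for g, c in zip(true_groups, cluster_labels)
                     if mapping[c] != g)
        fewest_errors = min(fewest_errors, errors)
    return 100.0 * (n - fewest_errors) / n


truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
found = [2, 2, 2, 0, 0, 1, 1, 1, 1]   # one series from group 1 is misplaced
accuracy = clustering_accuracy(truth, found)
```

With three clusters only six assignments need to be checked, so the exhaustive search is cheap in the setting considered here.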

Distances for Time-series Data
In 1990, Piccolo introduced a method for quantifying the similarity between two time series using the distance formula presented in Eq. (13). A collection of time series can be represented through various ARIMA models, with the model selected according to a specific criterion, for instance the smallest AIC or BIC. The Piccolo method uses the ARIMA model approach to compute the similarity between two time series. For invertible ARIMA processes, $X_t$ and $Y_t$ can each be expressed as an autoregressive model of infinite order, AR(∞). The distance between the ARIMA processes $X_t$ and $Y_t$ is the distance between their AR(∞) coefficients $\pi_j$:
$$d(X_t, Y_t) = \sqrt{\sum_{j=1}^{\infty} \left(\pi_{j,x} - \pi_{j,y}\right)^2} \qquad (13)$$
where $\pi_{j,x}$ and $\pi_{j,y}$ are the coefficients of the AR(∞) model for the time series $X_t$ and $Y_t$, respectively. The distance $d(X_t, Y_t)$ serves as a metric for quantifying the structural similarity between two ARIMA processes (Corduas & Piccolo, 2008; Piccolo, 1990, 2010).
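The distance can be sketched for stationary, invertible ARMA models. The sketch assumes the convention Z_t = φ_1 Z_{t-1} + ... + a_t - θ_1 a_{t-1} - ..., under which the AR(∞) weights follow the recursion π_j = φ_j - θ_j + Σ_{i<j} θ_i π_{j-i}. The infinite sum is truncated at a finite number of lags, which is an approximation introduced here, and the function names are hypothetical.

```python
import math


def pi_weights(phi, theta, n_lags):
    """First n_lags AR(infinity) weights of an invertible ARMA model.

    Uses pi_j = phi_j - theta_j + sum_{i=1}^{j-1} theta_i * pi_{j-i},
    with phi_j = 0 for j > p and theta_j = 0 for j > q.
    """
    pi = []
    for j in range(1, n_lags + 1):
        phi_j = phi[j - 1] if j <= len(phi) else 0.0
        theta_j = theta[j - 1] if j <= len(theta) else 0.0
        carry = sum(theta[i - 1] * pi[j - i - 1]
                    for i in range(1, min(j - 1, len(theta)) + 1))
        pi.append(phi_j - theta_j + carry)
    return pi


def piccolo_distance(pi_x, pi_y):
    """Euclidean distance between two weight sequences, zero-padded."""
    m = max(len(pi_x), len(pi_y))
    x = list(pi_x) + [0.0] * (m - len(pi_x))
    y = list(pi_y) + [0.0] * (m - len(pi_y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# For pure AR models the weights are the AR coefficients themselves.
d_ab = piccolo_distance([0.2, 0.1], [0.4, 0.5])
# ARMA(1, 1) with phi = 0.5 and theta = 0.2: weights (phi - theta) * theta^(j-1).
weights = pi_weights([0.5], [0.2], 3)
```

For pure AR(p) models, as in the simulation study, no truncation is needed because the weights vanish beyond lag p.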

Proposed Method
There is ongoing debate about the most suitable criterion for choosing a model. According to some authors, choosing the best model should not rely solely on a specific information criterion but should also consider alternative approaches (Anderson & Burnham, 2002; Brewer et al., 2016; Burnham & Anderson, 2004). Claeskens (2016) argued that if multiple estimators of the model parameters are derived from the same population, combining these estimators may yield a superior estimator. Several researchers advocate the use of multiple models (multimodels) rather than a single model, as a way to avoid being constrained by a potentially incorrect model (Burnham and Anderson 2004). One approach for combining a collection of models is the averaging method (Claeskens, 2016).
In this research, the chosen methodology is the ensemble distance, specifically the average distance. Five distances are obtained by applying different model selection criteria, namely AIC, BIC, AICc, RMSE, and MAPE. For instance, Model A is selected using the AIC criterion, while Models B, C, D, and E are selected using the BIC, AICc, RMSE, and MAPE criteria, respectively. The distance associated with each model measures the dissimilarity between two time series under the corresponding model selection criterion.
To represent these five distances collectively, their average is used:
$$d_{E} = \frac{d_{AIC} + d_{BIC} + d_{AICc} + d_{RMSE} + d_{MAPE}}{5} \qquad (14)$$
where $d_{AIC}$, $d_{BIC}$, $d_{AICc}$, $d_{RMSE}$, and $d_{MAPE}$ are the distances of Eq. (13) computed from the models selected by the AIC, BIC, AICc, RMSE, and MAPE criteria, respectively.
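The averaging step amounts to an element-wise mean of the five distance matrices. The sketch below uses made-up distances between two series purely for illustration; the function name `ensemble_distance` and the numbers are not from the study.

```python
def ensemble_distance(distance_matrices):
    """Element-wise average of several square distance matrices.

    `distance_matrices` maps a criterion name to its distance matrix.
    """
    matrices = list(distance_matrices.values())
    size = len(matrices[0])
    return [[sum(m[i][j] for m in matrices) / len(matrices)
             for j in range(size)]
            for i in range(size)]


# Illustrative (invented) distances between two series under each criterion.
per_criterion = {
    "AIC":  [[0.0, 0.30], [0.30, 0.0]],
    "BIC":  [[0.0, 0.20], [0.20, 0.0]],
    "AICc": [[0.0, 0.25], [0.25, 0.0]],
    "RMSE": [[0.0, 0.40], [0.40, 0.0]],
    "MAPE": [[0.0, 0.35], [0.35, 0.0]],
}
d_ensemble = ensemble_distance(per_criterion)
```

Because each input matrix is symmetric with a zero diagonal, the averaged matrix keeps both properties and can be passed to a hierarchical clustering routine in the same way as a single Piccolo distance matrix.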

Simulation
The simulation generates data from three distinct clusters, each of which follows an autoregressive AR(2) model. The parameter values for models A, B, and C are (0.2, 0.1), (0.4, 0.5), and (0.6, 0.2), respectively. Each cluster generates 10 series, giving 30 generated series in total. The observation period (t) of the generated data takes six different values: 50, 75, 100, 150, 200, and 300 (Kumar and Patel 2008; Hendrawati et al. 2020).
The generated data were clustered using both the Piccolo distance method (Hendrawati et al. 2020) and the ensemble distance approach. The data were modeled using the autoregressive AR(p) model, with p ranging from 1 to 15, and the model was first selected by minimizing the AIC. The distances between time series were then computed using the distance formula in Eq. (13). This process was repeated with the smallest BIC, AICc, RMSE, and MAPE as the model selection criteria (Cryer & Chan, 2008; Montgomery et al., 2008). Finally, the average of the five resulting distances was calculated using Eq. (14).
Clusters were then determined using Ward's method (Eszergár-Kiss & Caesar, 2017; Everitt et al., 2011; Jain & Dubes, 1988; Murtagh & Legendre, 2014; Kaufman & Rousseeuw, 1990). The quality of the clustering results was evaluated by calculating the percentage of correct cluster membership across 100 replications (Hendrawati et al., 2020). The simulation procedure is as follows:
1. Generate a time series dataset consisting of three clusters according to the specified rules.
2. Model the generated time series using the AR(p) model, where p = 1, 2, …, 15. The optimal model is the one that minimizes Akaike's Information Criterion (AIC).
3. Compute the distances between time series using the distance formula in Eq. (13).
4. Repeat steps 2 and 3 with the other model selection criteria: the smallest BIC, AICc, RMSE, and MAPE.
5. Calculate the average of the five distances obtained in the previous steps using Eq. (14).
6. Cluster the time series using Ward's method.
7. Evaluate the accuracy of the clustering results using Eq. (12).
8. Repeat steps 1 to 7 a total of 100 times.
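Steps 1 to 5 of the procedure can be sketched in a reduced form. The sketch below makes several substitutions that should be read as assumptions rather than as the authors' implementation: AR(p) models are fitted by Yule-Walker estimation through the Levinson-Durbin recursion, only AIC and BIC are used (in their Gaussian forms), the ensemble is the average of those two distances instead of five, and Ward clustering and the 100 replications are omitted. All function names are hypothetical.

```python
import math
import random


def simulate_ar2(phi1, phi2, n, rng, burn_in=100):
    """Generate n observations of Z_t = phi1*Z_{t-1} + phi2*Z_{t-2} + a_t."""
    z = [0.0, 0.0]
    for _ in range(burn_in + n):
        z.append(phi1 * z[-1] + phi2 * z[-2] + rng.gauss(0.0, 1.0))
    return z[-n:]


def autocovariances(series, max_lag):
    """Biased sample autocovariances at lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    c = [v - mean for v in series]
    return [sum(c[t] * c[t - lag] for t in range(lag, n)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(gamma, max_order):
    """Yule-Walker AR fits for orders 1..max_order.

    Returns a list of (coefficients, innovation_variance) pairs.
    """
    fits, phi, variance = [], [], gamma[0]
    for order in range(1, max_order + 1):
        acc = gamma[order] - sum(phi[j] * gamma[order - 1 - j]
                                 for j in range(order - 1))
        k = acc / variance
        phi = [phi[j] - k * phi[order - 2 - j] for j in range(order - 1)] + [k]
        variance *= 1.0 - k * k
        fits.append((list(phi), variance))
    return fits


def select_coefficients(fits, n, criterion):
    """AR coefficients of the order minimising AIC or BIC (assumed forms)."""
    def score(order):
        penalty = 2 * order if criterion == "AIC" else order * math.log(n)
        return n * math.log(fits[order - 1][1]) + penalty
    best = min(range(1, len(fits) + 1), key=score)
    return fits[best - 1][0]


def piccolo(phi_x, phi_y):
    """Piccolo distance between two AR models (zero-padded coefficients)."""
    m = max(len(phi_x), len(phi_y))
    x = phi_x + [0.0] * (m - len(phi_x))
    y = phi_y + [0.0] * (m - len(phi_y))
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(series_list, criterion, max_order=15):
    """Pairwise Piccolo distances after per-series model selection."""
    coefs = [select_coefficients(
                 levinson_durbin(autocovariances(s, max_order), max_order),
                 len(s), criterion)
             for s in series_list]
    return [[piccolo(a, b) for b in coefs] for a in coefs]


rng = random.Random(0)
params = [(0.2, 0.1), (0.4, 0.5), (0.6, 0.2)]   # models A, B and C
series = [simulate_ar2(p1, p2, 100, rng) for p1, p2 in params for _ in range(10)]
d_aic = distance_matrix(series, "AIC")
d_bic = distance_matrix(series, "BIC")
ensemble = [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(d_aic, d_bic)]
```

The resulting `ensemble` matrix would then be passed to a Ward clustering routine, cut at three clusters, and scored against the known cluster of origin; repeating the whole run with fresh random draws gives the replications described in step 8.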

Results
Table 1 displays the percentage of correct cluster membership using the ensemble distance and the Piccolo method. For the Piccolo distance method, the AIC, BIC, AICc, RMSE, and MAPE criteria were each applied. When the observation period length (t) was 50, the lowest percentage of correct cluster membership, 72.47%, was observed for the Piccolo method with the RMSE criterion. For t = 75, 100, 150, 200, and 300, the Piccolo method with the MAPE criterion consistently had the lowest percentage of correct cluster membership among the criteria. As illustrated in Fig. 1, the ensemble distance method consistently yielded higher percentages of correct cluster membership than the Piccolo method. For t = 50, the ensemble distance method increased the correct cluster membership by 15.16% compared with the Piccolo method with the RMSE criterion. For t = 75, 100, 150, 200, and 300, it raised the correct cluster membership by 12.54%, 12.34%, 11.63%, 11.46%, and 11.63%, respectively, compared with the Piccolo method with the MAPE criterion.

Application for Rainfall Data
This study used secondary data, specifically monthly rainfall data (in millimeters) from 26 rainfall monitoring stations located in West Java. The data span the years 2000 to 2009 and were obtained from the Meteorology, Climatology, and Geophysics Agency (BMKG) of Indonesia.
The rainfall data were processed using the R programming language and clustered using the ensemble distance method. The data were first divided into training data and testing data. The training data were used for clustering and modeling, whereas the testing data were used for model evaluation. Specifically, 96 observations were allocated for training and 24 for testing. The dendrogram that illustrates the clustering process is shown in Fig. 1. The optimal number of clusters, determined using the silhouette index (Kaufman & Rousseeuw, 1990), was found to be three. The clusters and their members are detailed in Table 3, which shows the formation of three distinct clusters, denoted Cluster A, B, and C. Cluster A consists of 16 member stations, Cluster B comprises five members, and Cluster C also includes five members. Each cluster is characterized by a prototype, represented by its average value. A visual representation of the regional clustering can be seen in Fig. 2. Fig. 3 presents the prototype plot along with the members of each cluster. Within each cluster, there is a noticeable similarity in the pattern of rainfall. Specifically:
• In Cluster A, frequent rainfall events are observed, with very high intensity (> 500 mm), high intensity (300-500 mm), medium intensity (100-300 mm), and low intensity (0-100 mm). This cluster experiences a wide range of rainfall intensities.
• Cluster B exhibits regular rainfall patterns with occurrences of high, medium, and low-intensity rainfall. However, rainfall events with very high intensity are relatively rare in this cluster.
• Cluster C often experiences rainfall events with low, medium, and high intensity, but very high-intensity rainfall is infrequent in this cluster.
These findings suggest that the clusters are defined by their distinctive rainfall patterns, with varying levels of intensity and frequency. The cluster models are represented by their respective prototype ARIMA models, as outlined in Table 4. The model for cluster A is ARIMA(0, 0, 0)(1, 1, 0); the model for cluster B is ARIMA(1, 0, 2)(1, 1, 0); and the model for cluster C is ARIMA(1, 0, 0)(1, 1, 0). These ARIMA models provide a statistical framework for understanding and forecasting the rainfall patterns within each cluster.
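Reading the two sets of orders as (p, d, q)(P, D, Q) in the usual multiplicative seasonal notation, the three cluster models can be written out as below. The seasonal period s is not stated in the text and is left as a symbol; the sign convention (MA terms subtracted) and the symbols Φ, φ, and θ are likewise assumptions.

```latex
\begin{align*}
% Cluster A: ARIMA(0,0,0)(1,1,0)_s
(1 - \Phi_1 B^{s})(1 - B^{s})\, Z_t &= a_t \\
% Cluster B: ARIMA(1,0,2)(1,1,0)_s
(1 - \phi_1 B)(1 - \Phi_1 B^{s})(1 - B^{s})\, Z_t &= (1 - \theta_1 B - \theta_2 B^{2})\, a_t \\
% Cluster C: ARIMA(1,0,0)(1,1,0)_s
(1 - \phi_1 B)(1 - \Phi_1 B^{s})(1 - B^{s})\, Z_t &= a_t
\end{align*}
```

All three models share one seasonal difference and one seasonal AR term, and differ only in their non-seasonal part.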
The clustering results were evaluated by comparing the RMSE values of models with and without clustering. For this comparison, a mean difference test was used at a significance level (α) of 0.05. Two RMSE values were considered: the prediction RMSE, computed from the model predictions on the training data, and the forecasting RMSE, computed from the model predictions on the testing data. The p-value is 0.525 for the prediction RMSE and 0.464 for the forecasting RMSE. Both p-values exceed the significance level of 0.05, so there is insufficient evidence to reject the null hypothesis (H0). In other words, the models with and without clustering do not differ significantly. This implies that the individual models within a cluster can be adequately represented by the cluster model.

Discussion
In this paper, several model selection criteria are applied to determine the best ARIMA model. Hendrawati et al. (2020) created a parameter ensemble method which, compared with the Piccolo method (Piccolo, 1990; Piccolo, 2010), increased the clustering accuracy by more than 10%. This research extends the Piccolo method by using the average distance; the resulting method is called the ensemble distance method. The simulation results in Table 1 show that the ensemble distance method increases the clustering accuracy by more than 11% compared with the Piccolo method. In other words, this method performs better than both the Piccolo method (Piccolo, 1990; Piccolo, 2010) and the ensemble parameter method (Hendrawati et al., 2020).
This research uses the ARIMA model with the AIC, BIC, AICc, RMSE, and MAPE model selection criteria. Many other criteria can be used to assess the goodness of a model. Future research should therefore explore other models and more recent model selection criteria.

Conclusion
This study develops a clustering method for time series data using the ensemble distance approach. The simulation results demonstrate the superiority of the ensemble distance method, which uses the average distance of five models, over the Piccolo method, which relies on a single model. The clustering accuracy of the ensemble distance method increases as the observation period (t) is extended.
The simulations show that the ensemble distance method can improve the clustering accuracy by more than 11%. In the application to monthly rainfall data for the West Java region, the optimal number of clusters was found to be three. The stations within each cluster exhibit similar rainfall patterns, and the cluster models are effective in representing the individual models within their respective clusters.

Fig. 1. Dendrogram of the rainfall data clustered using the ensemble distance
Fig. 3. Prototype plots and clusters of rainfall data in West Java using the ensemble distance method: cluster A (a), cluster B (b), and cluster C (c)

Table 1
Percentage of correct cluster membership using the ensemble distance and Piccolo method

Table 2
Monthly rainfall characteristics and Geographical location of the rainfall monitoring stations in West Java

Table 4
Cluster

Table 1 shows the time series clustering results using the Piccolo method with different model selection criteria. The performance of clustering under the different model selection criteria was evaluated by simulation. The simulation results show that the clustering accuracy obtained with the AIC, BIC, AICc, RMSE, and MAPE criteria varies. Clustering with the BIC or AICc criterion gives better results than clustering with AIC, RMSE, or MAPE, and clustering with the MAPE criterion gives the lowest accuracy. As the observation period (t) becomes longer, the AIC, BIC, AICc, and RMSE criteria give similar clustering accuracy, whereas MAPE behaves differently from the others. Based on the simulation, it can be concluded that the accuracy of clustering with the Piccolo method (Piccolo, 1990; Piccolo, 2010) is influenced by the model selection criterion and the length of the observation period (t). These results are in line with Rahkmawati et al. (2019), where the AIC, BIC, and AICc criteria have similar accuracy patterns. RMSE has a pattern that is quite similar to those of AIC, BIC, and AICc, while MAPE does not have a pattern similar to the other criteria.