A medoid-based deviation ratio index to determine the number of clusters in a dataset



Abstract
Most existing methods of determining the number of groups apply to particular data types or are calculated based on the distance matrix for all object pairs. In this paper, we propose a medoid-based Deviation Ratio Index (DRI) to determine the number of clusters. The DRI is calculated based on the distance matrix for each object to the k final medoids. These final medoids are produced by the block-based k-medoids algorithm (BlockD-KM). We choose a specific transformation and a suitable distance for certain variables before executing the BlockD-KM. We illustrate the detailed stages of the DRI on secondary data from the 2022 environmental index of Asia Pacific countries, so that they are easy to reproduce. We use eight real datasets, namely the Breast Cancer, Heart Disease, Iris, Wine, Soybean, Ionosphere, Vote, and Credit Approval data, to validate the DRI method. We compare the DRI method with the Calinski-Harabasz (CH) index and the Silhouette index. The experimental results show that the DRI is 100% correct in predicting the number of clusters, while the CH index is correct for 62.5% of the datasets and the Silhouette index for 75%. We also generated three kinds of artificial data to evaluate the proposed method, and 76.7% of the experiments were predicted correctly.
• The medoid-based deviation ratio index aids the researcher in determining the number of clusters
• The DRI method is applicable to any medoid-based partitioning algorithm
• This method is suitable for all data types (categorical, numerical, and mixed)

Specifications Table

Subject area: Mathematics and Statistics
More specific subject area: Medoid-based clustering
Name of your method: Medoid-based deviation ratio index to determine the number of clusters
Name and reference of original method: T. Calinski
Notation: n, the number of objects; k, the number of actual clusters; and, for each dataset, the number of categorical variables and the number of numerical variables.

Data types in cluster analysis and datasets to validate the medoid-based Deviation Ratio Index
There are four scale data types in cluster analysis: nominal, ordinal, interval, and ratio scale [10] . Cluster analysis can also be applied to text, images, graphs, and time series. The proposed method is limited to data with categorical (nominal and ordinal), numerical (interval and ratio), and mixed data. The mixed data consists of nominal and ordinal, nominal and numerical, ordinal and numerical, and a mixture of the three, namely nominal, ordinal and numeric.
In this paper, we implement three kinds of data to explain and evaluate the proposed method. To explain in detail how to run our proposed method, we use secondary data about the environmental health of the Asia Pacific region. These data consist of 25 countries observed on four environmental issues with numerical data, namely Air Quality (AQ), Water and Sanitation (WS), Heavy Metals (HM), and Waste Management (WM) scores [11]. Then we apply eight real datasets to evaluate the proposed method, i.e. the Breast Cancer, Heart Disease, Iris, Wine, Soybean (small), Ionosphere, Vote, and Credit Approval data [12]. The Breast Cancer and Heart Disease datasets are well known in medical research and pattern recognition. The Breast Cancer dataset contains 569 samples of breast cancer tissue, each with 30 numerical variables related to the characteristics of the tissue. The dataset is commonly used for classification tasks and contains two classes of breast cancer tissue: malignant and benign. Meanwhile, the Heart Disease dataset consists of 303 samples of patients, each with five numerical and eight categorical features related to their medical history and examination results. The dataset contains two classes of patients: those with heart disease and those without. The Iris and Wine datasets are well known in pattern recognition and multivariate statistics. The Iris dataset contains 150 samples of iris flowers, each with four features: sepal length, sepal width, petal length, and petal width. It is commonly used for classification tasks and contains three clusters of iris flowers: Iris Setosa, Iris Virginica, and Iris Versicolor. The Wine dataset contains 178 samples, each with 13 numerical features related to the chemical properties of the wine; these data are grouped into three classes. The Soybean (small) dataset is well known in agriculture and pattern recognition.
It contains 47 samples of soybean plants, each with 35 binary features indicating the presence or absence of certain disease symptoms, grouped into four classes. The Ionosphere dataset is usually used in the fields of pattern recognition and signal processing. It consists of 351 samples of radar return signals, each with 34 numerical variables related to the properties of the signal. The dataset contains two clusters of radar signals: good and bad. Furthermore, the Vote dataset comes from political science and pattern recognition. It covers 232 (non-missing) members of the U.S. House of Representatives with 16 binary attributes recording key votes. The last dataset, the Credit Approval data, is well known in finance and pattern recognition. It contains 653 non-missing samples of credit applications, each with nine binary and six continuous variables related to the applicant's credit history and financial situation. The dataset contains two groups of credit applications: those that were approved and those that were denied. All real datasets are publicly available in the University of California, Irvine (UCI) Machine Learning Repository. A summary of the eight real datasets is given in Table 1.
Additionally, we created three different artificial datasets and executed each of them fifty times. This experiment is intended to evaluate the performance of the medoid-based Deviation Ratio Index (the proposed method).

Pre-processing for the medoid-based Deviation Ratio Index
Pre-processing data is an essential stage in cluster analysis, as it can significantly affect the results. There are different ways of pre-processing data; the best choice depends on the specific data and clustering algorithm used. Pre-processing typically includes cleaning, transforming, and scaling the data to ensure that it is in a suitable format for the clustering algorithm to work effectively. Data cleaning involves removing missing or incorrect values and dealing with outliers. Meanwhile, data transformation can include normalizing or standardizing the data or converting categorical variables into numerical ones. Scaling is also an important step in pre-processing, ensuring that all the dataset variables are on the same scale. This makes the variables comparable and the data more suitable for clustering algorithms. We suggest pre-processing in a standardized form to achieve comparability for numerical or mixed data; this standardization unifies the scales of the variables. Eighteen methods for standardizing variables in cluster analysis have been investigated based on distance and normalized data matrices [13].
The general normalization of numerical data is a linear transformation with a standardization formula as follows [14]:

$$z_{ij} = \frac{x_{ij} - a_j}{b_j}, \quad (1)$$

where $x_{ij}$ ($z_{ij}$) denotes the value (standardized value) of the $j$-th variable for the $i$-th object. A choice that can be used in cluster analysis is $a_j = \min_i x_{ij}$ and $b_j = \max_i x_{ij} - \min_i x_{ij}$, which gives the range standardization

$$z_{ij} = \frac{x_{ij} - \min_i x_{ij}}{\max_i x_{ij} - \min_i x_{ij}}. \quad (2)$$

For ordinal data, Reference [10] page 30 suggests the formula

$$z_{ij} = \frac{r_{ij} - 1}{M_j - 1}, \quad (3)$$

where $r_{ij}$ is the rank of the $i$-th object on the $j$-th variable and $M_j$ is the highest rank of the $j$-th variable.
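As a concrete sketch, the two standardizations above can be written as follows; the function names are illustrative, not from the original paper.

```python
# Hedged sketch of Eq. (2) (range standardization for one numerical
# variable) and Eq. (3) (rank standardization for one ordinal variable).

def range_standardize(column):
    """z_ij = (x_ij - min_j) / (max_j - min_j) for one numerical variable."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant column: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def ordinal_standardize(ranks, highest_rank):
    """z_ij = (r_ij - 1) / (M_j - 1) for one ordinal variable."""
    return [(r - 1) / (highest_rank - 1) for r in ranks]

# Example: four score-type observations and a 5-level ordinal variable.
print(range_standardize([10.0, 20.0, 15.0, 30.0]))   # [0.0, 0.5, 0.25, 1.0]
print(ordinal_standardize([1, 3, 5], 5))             # [0.0, 0.5, 1.0]
```

Both transformations map every variable into [0, 1], which is the comparability the pre-processing stage aims for.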

Proximity measure for the medoid-based Deviation Ratio Index
A proximity measure quantifies the similarity or distance between objects in a dataset. It is important to note that some proximity measures are sensitive to the scale of the variables, so it may be necessary to standardize the data before applying the distance or similarity measure. The choice of proximity measure depends on the specific data and the type of clustering algorithm being used. Several similarity coefficients exist for binary data, such as the Simple Matching coefficient, the Jaccard coefficient, the Rogers and Tanimoto coefficient, the Sneath and Sokal coefficient, the Gower and Legendre coefficient, and others. The Euclidean distance, Manhattan distance, Minkowski distance, Pearson correlation, and Canberra distance are commonly used as proximity measures for continuous variables [10,15].
The similarity measure for categorical data with more than two levels can be handled similarly to binary data, with each level of a variable regarded as a single binary variable [15]. One approach is to allocate a score $s_{ijk}$ of zero or one on each variable $k$, depending on whether the two objects $i$ and $j$ take the same value on that variable. These scores are then simply averaged over all $p$ variables to give the required similarity coefficient, as in Reference [15] page 48:

$$s_{ij} = \frac{1}{p} \sum_{k=1}^{p} s_{ijk}. \quad (4)$$

The Euclidean distance is formulated as follows [15]:

$$d_{ij} = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}, \quad (5)$$

where $d_{ij}$ is the distance between object $i$ and object $j$. The Canberra distance is formulated as follows [15]:

$$d_{ij} = \sum_{k=1}^{p} \frac{|x_{ik} - x_{jk}|}{|x_{ik}| + |x_{jk}|}. \quad (6)$$

Furthermore, a generalized distance function (GDF) between objects $i$ and $j$ for non-missing mixed data combines binary, categorical, and numerical components [10,15]:

$$d_{ij} = w \left( \delta_b \, d_{ij}^{(b)} + \delta_c \, d_{ij}^{(c)} + \delta_n \, d_{ij}^{(n)} \right), \quad (7)$$

where $\delta_b$, $\delta_c$, and $\delta_n$ are the weights for the binary, categorical, and numerical variables, and $w$ is the weight for the whole function, respectively. In this paper, we apply the same weight, i.e. $\delta_b = \delta_c = \delta_n = w = 1$.
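A minimal sketch of the equal-weight mixed-data distance described above, assuming a simple-matching component for the categorical variables and a Euclidean component for the numerical ones. The component split and the function names are illustrative assumptions, not the paper's exact formulation.

```python
import math

def matching_distance(a, b):
    """Proportion of categorical variables on which two objects differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mixed_distance(cat_i, cat_j, num_i, num_j):
    """Equal-weight (all weights = 1) combination of the two components."""
    d = 0.0
    if cat_i:                       # categorical/binary part
        d += matching_distance(cat_i, cat_j)
    if num_i:                       # numerical part
        d += euclidean(num_i, num_j)
    return d

print(mixed_distance(["y", "n"], ["y", "y"], [0.2], [0.6]))  # ≈ 0.9
```

Standardizing the numerical variables first (as in the pre-processing section) keeps the two components on comparable scales.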

Medoids-based partitioning method for the medoid-based Deviation Ratio Index
The new method, a medoid-based Deviation Ratio Index to determine the number of clusters, is flexible for any medoid-based partitioning method. Since Kaufman and Rousseeuw first introduced the Partitioning Around Medoids (PAM) algorithm, often called the K-Medoids (KM) algorithm, in 1987 [10], this method has inspired many researchers to improve the performance of KM. Examples include simple and fast KM [16], ranked KM [17], simple KM [18], initialization of the flexible KM using deviation [19], fast and eager KM [20], minimization of the number of iterations in KM [21], the crow search algorithm for KM [22], block-based KM with standardized data [23], and many more. In this paper, we use k-medoids based on the block of the deviation and the sum of variable values (BlockD-KM). In the BlockD-KM, we substitute the first phase of simple and fast KM [16] with the first stage of the flexible KM [19]. The procedure of BlockD-KM is as follows [23].
1. Calculate the sum of the $p$ variable values and the standard deviation of the $p$ variables for each object $i$ ($i = 1, 2, \cdots, n$):

$$s_i = \sqrt{\frac{\sum_{j=1}^{p} (x_{ij} - \bar{x}_i)^2}{p - 1}}, \quad (8)$$

where $\bar{x}_i = T_i / p$, with $T_i$ the sum of the variable values:

$$T_i = \sum_{j=1}^{p} x_{ij}. \quad (9)$$

2. Sort the objects in ascending order, first based on Eq. (8), $s_i$; then, within each block of identical standard deviations (if any), arrange the objects based on Eq. (9), $T_i$, also in ascending order.
3. Select the first $k$ objects as initial medoids from the first $k$ blocks of the combination of $s_i$ and $T_i$ (or possibly only the blocks of $s_i$).
4. Determine the members of the initial groups based on the distance of each object to the nearest medoid.
5. Update the current medoid in each cluster to the object that minimizes the average distance to the other objects in its group. The average distance within the $g$-th cluster, which has $n_g$ members, for an object $i$, $\bar{d}_{ig}$, is defined as

$$\bar{d}_{ig} = \frac{1}{n_g - 1} \sum_{j \in C_g,\, j \neq i} d_{ij}. \quad (10)$$

6. Obtain the clusters by assigning each object to its closest medoid. Then calculate the total deviation within groups, i.e. the sum of the distances from all objects to their medoids, $E(k)$:

$$E(k) = \sum_{i=1}^{n} d(i, m_i), \quad (11)$$

where $m_i$ is the medoid of the group containing object $i$.
7. Repeat steps 5 and 6 until $E(k)$ is equal to the previous value, or the set of medoids does not change, or a pre-determined number of iterations is reached.
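The steps above can be sketched compactly as follows, assuming Euclidean distance, at least two variables per object, and plain Python lists. This is an illustrative reimplementation, not the authors' reference code.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def block_init(data, k):
    """Steps 1-3: sort objects by (row std dev, row sum), take the first k."""
    def key(obj):
        p = len(obj)                       # assumes p >= 2 variables
        mean = sum(obj) / p
        sd = math.sqrt(sum((x - mean) ** 2 for x in obj) / (p - 1))
        return (sd, sum(obj))
    order = sorted(range(len(data)), key=lambda i: key(data[i]))
    return order[:k]

def assign(data, medoids):
    """Steps 4/6: each object joins the cluster of its nearest medoid."""
    return [min(medoids, key=lambda m: euclidean(obj, data[m])) for obj in data]

def update_medoids(data, labels, medoids):
    """Step 5: the new medoid minimizes the total (hence average)
    distance to the other members of its group."""
    new = []
    for m in medoids:
        members = [i for i, lab in enumerate(labels) if lab == m]
        new.append(min(members, key=lambda i:
                       sum(euclidean(data[i], data[j]) for j in members)))
    return new

def blockd_km(data, k, max_iter=100):
    medoids = block_init(data, k)
    for _ in range(max_iter):              # step 7: iterate to stability
        labels = assign(data, medoids)
        new_medoids = update_medoids(data, labels, medoids)
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, assign(data, medoids)

# Two well-separated 2-D groups.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.2)]
medoids, labels = blockd_km(pts, 2)
print(sorted(medoids))  # the two zero-deviation points, one per group
```

The deterministic initialization in steps 1-3 is what distinguishes BlockD-KM from randomly seeded k-medoids variants.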
Even though the medoid-based deviation ratio index is designed for the k-medoids algorithm, we can apply it to other clustering methods by first finding the central object of each group as its medoid. This medoid can easily be obtained as the object with the smallest average distance to the other members of its group.
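The paragraph above says any clustering result can feed the index once each group's medoid is known. A minimal sketch of that step, under the assumption of Euclidean distance (`group_medoid` is an illustrative name):

```python
import math

def group_medoid(data, members, dist=math.dist):
    """Return the index (into `data`) of the group member with the
    smallest total (hence average) distance to the other members."""
    return min(members, key=lambda i:
               sum(dist(data[i], data[j]) for j in members if j != i))

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.2, 0.2)]
print(group_medoid(pts, [0, 1, 2, 3]))  # 3: the most central point
```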

Validation in cluster analysis and validation for the medoid-based Deviation Ratio Index
One method for validation in cluster analysis is external validation, which evaluates the quality and performance of clustering results. External validation measures the similarity between the clusters obtained by the clustering algorithm and the true class labels of the data, if they are available [24]. This can be done by comparing the clusters obtained by the algorithm to the true cluster labels using measures such as clustering accuracy, the Fowlkes-Mallows index, or the adjusted Rand index. The value range is between 0 and 1, with 1 indicating a perfect match between the clusters and the true class labels; the larger this value, the better the accuracy. The clustering accuracy is defined as follows [25]:

$$AC = \frac{1}{n} \sum_{g=1}^{k} a_g,$$

where $n$ is the number of objects, $k$ is the number of clusters, and $a_g$ is the number of objects in group $g$ correctly assigned to the actual clusters.
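The clustering accuracy above can be computed as in the following sketch, assuming that "correctly assigned" means each obtained cluster is matched to its majority true class (an interpretation on our part, not spelled out in the source):

```python
from collections import Counter

def clustering_accuracy(pred_labels, true_labels):
    """Sum, over obtained clusters, the count of the majority true class
    in that cluster, then divide by the total number of objects n."""
    correct = 0
    for g in set(pred_labels):
        members = [t for p, t in zip(pred_labels, true_labels) if p == g]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)

pred = [0, 0, 0, 1, 1, 1]
true = ["a", "a", "b", "b", "b", "b"]
print(clustering_accuracy(pred, true))  # (2 + 3) / 6
```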
Furthermore, we also compare the result of our proposed method with others. We use the Variance Ratio Criterion based on distance [9] and the Silhouette index [2] as comparisons. The widely used VRC formula (often called the CH index) to determine the number of clusters is [7]:

$$VRC = \frac{BGSS/(k-1)}{WGSS/(n-k)},$$

where $BGSS$ and $WGSS$ are the between-cluster and within-cluster sums of squared errors. Calinski and Harabasz also formulated the VRC based on the distance matrix [9].
In that distance-based version, the within-group term is computed from the mean squared pairwise distances within each group, and the between-group term is a weighted mean of the difference between the general and the within-group mean squared distances [9]. For an observation $i$, let $a(i)$ be the average distance of object $i$ to all other objects within its cluster, and $b(i)$ the smallest average distance of object $i$ to the objects of another cluster, where the nearest cluster is the one minimizing this average distance. The silhouette index is then calculated as [2]:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$

A point is well clustered if $s(i)$ is large, and the best-separated clusters have an index close to one. The group size that produces the highest average silhouette index is chosen as the best number of clusters. Meanwhile, the medoid-based shadow value (MSV) for an object $i$ is defined in terms of $d(i, m(i))$, the distance between object $i$ and its first closest medoid, and $d(i, m'(i))$, the distance between object $i$ and its second closest medoid [26].
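A small, self-contained version of the silhouette computation described above; this is an illustrative sketch, not the implementation used in the paper.

```python
import math

def silhouette(data, labels, dist=math.dist):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all objects."""
    scores = []
    for i, (xi, li) in enumerate(zip(data, labels)):
        # a(i): mean distance to the other members of i's own cluster
        own = [dist(xi, data[j]) for j, lj in enumerate(labels)
               if lj == li and j != i]
        a = sum(own) / len(own) if own else 0.0
        # b(i): smallest mean distance to the members of another cluster
        b = min(
            sum(dist(xi, xj) for xj, lj in zip(data, labels) if lj == g)
            / sum(1 for lj in labels if lj == g)
            for g in set(labels) if g != li
        )
        scores.append((b - a) / max(a, b))  # > 0 denominator for distinct points
    return sum(scores) / len(scores)

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette(pts, [0, 0, 1, 1]))  # close to 1: well-separated clusters
```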

A medoid-based Deviation Ratio Index to determine the number of clusters (proposed method)
We introduce a novel procedure to determine the number of clusters in a dataset through the medoid-based deviation ratio index. We were inspired by the Variance Ratio Criterion (VRC), which uses the distance matrix of all pairs of objects as its basis [7]. We also consider the medoid-based shadow value (MSV) concept, which uses the first and second closest medoids [26]; the MSV in turn adapts the Silhouette index [2] and the centroid-based shadow value. We only use a distance matrix of size $n \times k$, i.e. the distances of all objects to each final medoid, to construct our proposed method. Suppose there are $n$ objects with observations on the same $p$ variables for each individual, separated into $k$ groups. Based on the $k$ final medoids for a specific cluster size $k$, the deviation ratio, $R(k)$, is formulated as follows.
$$R(k) = \frac{E(k)}{B(k)},$$

where $E(k)$ is the sum of the distances from all objects to their own medoids (within-group), as in Eq. (11), while $B(k)$ is the sum of the distances from all objects to the medoids other than their own (between-group). The formula of $B(k)$ is

$$B(k) = \sum_{i=1}^{n} \sum_{m' \neq m_i} d(i, m'),$$

where $m'$ denotes a medoid of another group. Furthermore, the deviation ratio index $DRI(k)$ is defined as the comparison of the deviation ratio of a cluster size of $k$ to that of a size of $(k+1)$:

$$DRI(k) = \frac{R(k)}{R(k+1)}.$$

The optimal number of groups is determined as the smallest $k$ such that the deviation ratio index, $DRI(k)$, is less than one. Equivalently, we may start with $k = 2$ and add a cluster until the value of $R(k)$ is less than $R(k+1)$, i.e. until $DRI(k) < 1$. This approach can be faster than the fuzzy C-means (FCM) clustering algorithm, which searches the range between two and $\sqrt{n}$ as the basis for selecting the optimal number of groups [4]. The reason we formulate $R(k)$ with $B(k)$ as the denominator is to anticipate extreme data: if each group is perfectly separated, the value of $E(k)$ is zero. Meanwhile, as the group size increases, the deviation within groups, $E(k)$, tends to decrease, while the value of $B(k)$ tends to grow. According to this behaviour, we formulate $DRI(k)$ as the comparison between the $R(k)$ of $k$ groups and the $R(k+1)$ of one larger group size. Therefore, based on the parsimony principle, the best group size is the first smallest group size with a $DRI(k)$ value of less than one. This index is not defined for $k = 1$.
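Under the reading given above ($R(k) = E(k)/B(k)$, computed from the $n \times k$ matrix of object-to-medoid distances), the index can be sketched as follows. The medoids would normally come from a k-medoids run such as BlockD-KM; here they are passed in directly to keep the sketch self-contained, and the function names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def deviation_ratio(data, medoids, dist=euclidean):
    """R(k) = E(k) / B(k) from each object's distances to the k medoids."""
    E = 0.0   # within-group: sum of distances to the own (nearest) medoid
    B = 0.0   # between-group: sum of distances to the other medoids
    for obj in data:
        d = [dist(obj, data[m]) for m in medoids]
        E += min(d)
        B += sum(d) - min(d)
    return E / B  # assumes k >= 2, so B > 0 for non-degenerate data

def dri(data, medoids_k, medoids_k1):
    """DRI(k) = R(k) / R(k+1), given final medoids for k and k+1 clusters."""
    return deviation_ratio(data, medoids_k) / deviation_ratio(data, medoids_k1)

# Two tight, well-separated groups; medoids as a k-medoids run might find them.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (9.0, 9.0), (9.1, 9.0), (9.0, 9.1)]
print(deviation_ratio(pts, [0, 3]))  # small: E(2) tiny relative to B(2)
```

How $R(k)$ evolves with $k$, and therefore where $DRI(k)$ first drops below one, depends on the dataset; on the paper's datasets the smallest such $k$ matched the true number of clusters.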

An illustrative example
The primary purpose of presenting the following example is to clarify the implementation of our proposed method, making it easy to replicate on any dataset. We use secondary data from Reference [11] on the policy sub-section about environmental health in Asia Pacific regions, and we will group it into three clusters. In this example, we standardize the data by subtracting the smallest value and dividing by the range, as in Eq. (2) [14]. The results of standardizing the four environmental health scores, i.e. air quality (AQ), water and sanitation (WS), heavy metals (HM), and waste management (WM) [11], are in columns (3) to (6) in Table 2. Then we apply the k-medoids based on block deviation (BlockD-KM). Table 2 shows the summary of the group enumerations based on the deviation ratio index criterion. Steps 1 and 2 of BlockD-KM generate the ordered objects based on the standard deviation, as in columns (2), (7), and (8). We intentionally display the standard deviation to four decimal places in column (7), to make it easier to observe the differences among the smallest standard deviations of the objects. We can skip sorting by the sum of the data values (column (8)), because no two of the three objects with the smallest standard deviations are identical. In step 3, based on the deviation blocks, we obtain object one (Timor Leste, TLS), object two (Japan, JPN), and object three (Philippines, PHL) as initial medoids, as in column (8). Using the Euclidean distance of each object to the nearest medoid, step 4 of BlockD-KM produces the initial groups (IG) in column (9), with a total deviation, E(3), of 8.0. To obtain a set of medoids that no longer changes, we update the current medoid of each group to the object that minimizes the average distance to the other objects in its group.
The first iteration in the fifth and sixth steps of BlockD-KM does not change the medoids of groups one and three. Meanwhile, for group two, the object with the smallest average distance to the members of the group is the 15th object, Indonesia (IDN). This new medoid changes the membership of each group, as in column (10), with an E(3) of 7.1. Then we update the medoids again, starting from the result of iteration one. In iteration two, the medoid of group two changes again, from Indonesia (object 15) to China (object 13), while the medoids of groups one and three remain the same. The set of new medoids from iteration two produces the group memberships in column (11), with an E(3) value of 6.7. In the same way, the medoid update continues in the third iteration by looking for the objects with the smallest average distance to their group members. Iteration three does not change the medoids, so the BlockD-KM stage is complete; this dataset needed three iterations to reach an unchanged set of final medoids. Furthermore, to obtain the deviation ratio criterion values, we calculate the E(k) values in column (15) and the B(k) values in column (16). From the deviation ratio formula, with the input E(3) value of 6.71 and the B(3) value in column (16), the deviation ratio is 0.013. In the same way, the values of R(k) and R(k+1) for two to ten groups are shown in Fig. 1, and the deviation ratio index DRI(k) values are shown in Fig. 2. According to Fig. 1 or Fig. 2, the optimal number of clusters for the environmental health data of Asia Pacific countries is three groups.
The final set of medoids consists of the first object, Timor Leste (TLS); object three, Philippines (PHL); and object thirteen, China (CHN). Then, based on the distance of each object to the closest final medoid, we obtain the members of each cluster.

Method validation
We use eight real datasets to evaluate the proposed method, all taken from the University of California, Irvine (UCI) Machine Learning Repository: the Breast Cancer, Heart Disease, Iris, Wine, Soybean (small), Ionosphere, Vote, and Credit Approval data [12]. As in the illustrative example, we use k-medoids based on block deviation (BlockD-KM) to cluster all datasets. Even though we could stop the process as soon as R(k+1) is greater than R(k), in this paper we run the procedure for two to ten clusters for each dataset. We applied the Euclidean distance for the Breast Cancer and Wine data; the Canberra distance for the Iris and Ionosphere data; the Simple Matching distance for the Vote and Soybean data; and the Gower distance for the Heart Disease and Credit Approval data. Under these settings, we obtain the deviation ratio index for each dataset, as shown in Fig. 3 to Fig. 10.
According to Fig. 3, two groups is the smallest group size that produces a deviation ratio index below one. Therefore, we conclude that the Breast Cancer data are classified into two groups, which corresponds to the actual number of classes. Referring to Fig. 4, we obtain three groups for the Wine data, since three is the smallest group size with a DRI value of less than one; this size also coincides with the known groups. Then Fig. 5 shows the plot of the deviation ratio index for the Iris data: the smallest size that produces a DRI of less than one is three groups, conforming with the real clusters. For the Ionosphere data, we conclude that the number of groups is two, as shown in Fig. 6.
The values of the medoid-based deviation ratio index for the Soybean (small) data are shown in Fig. 7; we conclude that the optimal number of clusters for these data is four. Meanwhile, we conclude that the group size is two for the categorical Vote data, since the smallest group size producing a DRI value below one is two, as shown in Fig. 8. As with the four numerical datasets above, the numbers of clusters formed for the Soybean (small) and Vote data also conform to the actual groups. The plot of the deviation ratio index for the Heart Disease data is shown in Fig. 9; here the method again obtains the actual number of clusters, i.e. two groups. Furthermore, Fig. 10 shows that the smallest group size with a DRI value below one for the Credit Approval data is two, which dovetails with the actual clusters.
To complete our proposed method, under the same settings, we also calculate the maximum of the VRC and the average silhouette index for the eight real datasets, as shown in columns (6) and (7) of Table 3. We also present other methods for determining the number of clusters that use the same real datasets, as in column (8). The MSV focuses on visualizing and validating grouping results, not specifically on determining the number of groups. However, because this index is similar to the silhouette index [26], we also display the group size with the highest average MSV in column (9).
According to column (3) in Table 3, the clustering accuracy for the Breast Cancer, Wine, Iris, and Soybean data reaches more than 90%. The clustering accuracy produced by the BlockD-KM is very good compared to other grouping methods [23]. Even though the clustering accuracy for the Ionosphere, Vote, Heart Disease, and Credit Approval data is lower, the new method still precisely predicts the number of groups. Meanwhile, the Calinski-Harabasz index incorrectly predicted three out of eight (37.5%) datasets, mispredicting the number of clusters for the Wine, Iris, and Ionosphere data. Two out of eight (25%) datasets, namely the Wine and Iris datasets, were also mispredicted by the silhouette index. The Krzanowski-Lai index and the PCA approach for determining the number of clusters also misidentified the Breast Cancer, Wine, and Iris data, as shown in column (8). If we use the MSV value to evaluate the number of clusters, then, similarly to the silhouette index, the MSV is incorrect in predicting the Wine, Iris, and Soybean data. In comparison, the new method, the medoid-based Deviation Ratio Index, correctly predicts all eight datasets.
In agreement with the eight real datasets, i.e. the Breast Cancer, Wine, Iris, Ionosphere, Soybean, Vote, Heart Disease, and Credit Approval data, we can see that the proposed method is better than the other methods.

Table 3 notes: (b) Reference [3]; (c) Reference [8]; (d) Reference [4].

Though there is no satisfactory probabilistic theory to

Table 4
The number of clusters in three artificial data.

Columns: data type; the number of trials producing each number of clusters. n.a.: not applicable.
justify the use of R(k) or DRI(k), the criterion has encouraging results, as shown in column (5) of Table 3. It should be noted that different proximity measures or normalization methods may produce different estimates of the number of groups.
Furthermore, to evaluate the medoid-based DRI, we construct three kinds of artificial datasets. The first synthetic dataset consists of two binary variables in two groups, generated to resemble the Vote data. The second artificial dataset comprises three clusters in two dimensions, where the dimensions are standard normal variables centred at (0,0), (0,5), and (5,-3). The last artificial dataset has mixed variables: two binary variables, one ordinal variable, and one numerical variable. We apply the same settings as in the method validation section. The summary of the number of clusters based on the medoid-based deviation ratio index is in Table 4, which shows that the number of groups was correctly identified (column 8) in 115 out of 150 (76.67%) of the artificial runs. Of the 35 incorrect decisions, 23 estimates were too large by one, 8 were too large by two or more, three were too small by one, and one was too small by two or more. Accumulating the correct predictions and those off by only one group gives 94% (141 of 150). The number of groups determined by our proposed method is therefore quite encouraging.
Finally, we conclude that the newly proposed method, the medoid-based Deviation Ratio Index, is comparable to other methods. Comparisons of the DRI method with the Calinski-Harabasz index and the Silhouette index on eight real datasets show that the new method outperforms both. The experimental results also show that the medoid-based Deviation Ratio Index effectively determines the number of clusters. The strengths of the medoid-based DRI are that it is easy to calculate, suitable for all data types, applicable to small or large datasets, and flexible for any clustering method. It should be noted that, when using the medoid-based DRI with the BlockD-KM method, the choice of transformation method and proximity measure will affect the prediction accuracy of the number of clusters.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.