Comparative Analysis Clustering Algorithm for Government’s Budget Performance Data

The government's budget performance is a benchmark of its success in using public money to achieve national goals. Although performance measurement has reached the Work Unit level, the resulting data have no specific grouping and are, in that sense, unstructured. The purpose of this research is to find the best clustering algorithm for grouping budget performance data. The data used are budget performance data for 19,460 Indonesian Government Work Units, sourced from the SMART application and the OM SPAN application. This research uses a comparative study approach for the K-Means, DBSCAN, and agglomerative hierarchical clustering (AHC) algorithms. The clustering results are evaluated using the Davies-Bouldin Index (DBI) method. The AHC algorithm with k = 6 achieved the lowest DBI value of 0.3583472. The DBI value for the DBSCAN algorithm with MinPts = 10 is 0.5398259. However, the AHC algorithm is not good in terms of ease of implementation. Therefore, the K-Means algorithm with parameter k = 10, which obtains a DBI value of 1.052678 and produces 10 clusters, is the best alternative. Based on knowledge extraction, clusters 2 and 5 are the ideal clusters in terms of budget performance, while the clusters that require attention are clusters 1, 3, 4, and 8.


INTRODUCTION
Government budget performance refers to the evaluation of the utilization of ministry or agency budgets as recorded in budget documents. The government's budget achievements serve as a yardstick for evaluating the government's ability to efficiently utilize public funds to accomplish national objectives. Spending performance measurements serve the purpose of retrospective analysis as well as future forecasting. Retrospectively, the government archives and preserves historical data regarding past activity. Evaluating historical budget performance can serve as the foundation for future policy implementation.
Budget performance is a measure of the effectiveness of the government's fiscal policies [13]. To obtain a credible measurement, it is necessary to pay attention to the characteristics of government organizations. In 2021, the government of Indonesia had 19,460 Work Units carrying out government tasks. The average national spending performance score at the Work Unit level reached 87.40 and was categorized as "Good". This result is slightly lower than the average expenditure performance score at the Ministry/Agency level, which reached 92.34 and was categorized as "Excellent".
Even though performance measurement has reached the Work Unit level, the resulting data still have no specific grouping and are, in that sense, unstructured. This makes it difficult for regulators to adopt fiscal policies that fit the characteristics of the Work Units so that performance achievement remains at an optimum level. It is at this point that data science is needed. Clustering algorithms in data science group data so that points in the same cluster have the greatest similarities and points in different clusters have the greatest differences [26]. Theoretically, clustering algorithms are divided into centroid-based clustering, density-based clustering, distribution-based clustering, and hierarchical clustering.
In the centroid-based clustering category, K-means is a popular algorithm. The K-means algorithm groups N data points into k clusters by minimizing the sum of the squared distances between each point and its centroid (the mean of the nearest cluster) [24]. Determining the value of k is crucial in this algorithm. Several previous studies have suggested improving the K-means algorithm with attribute reduction, a better initialization technique [24], the canopy algorithm [25], k-means [26], ball k-means [27], and firefly algorithms [28].
Density-based clustering organizes data based on the density of points in the data space, rather than just areas of the same density. However, this approach has trouble with data that have varying densities and high dimensions. The DBSCAN algorithm is an alternative that is often used [15]. The advantage of this algorithm is its ability to detect outliers [17]. Previous fields of study that used this algorithm include inductive technology [11], urban rail passenger aggregation distribution [12], and crowdsourcing logistics pricing [14]. Enhancements to DBSCAN have been made with neighbor similarity, a fast nearest neighbor query [15], and network space [16].
Hierarchical clustering is a mathematical model or exploratory tool that organizes large volumes of data into groups, arranged in a tree form, based on their similarities and without prior knowledge [3]. Hierarchical clustering is divided into two groups, agglomerative (AHC) and divisive hierarchical clustering (DHC) [6]. AHC has been developed in many previous studies in the areas of hotspot clustering [1], student activity [2], and judicial practice [4]. Several recent studies emphasize the development of more efficient algorithms [5] [6] [7].
This research focuses on using unsupervised learning to get the best grouping of budget performance measurement data that do not yet have cluster labels. Previous studies have generally focused on using only one clustering algorithm in handling data. For example, research [18] uses K-Means to classify GRDP Growth Rate data, research [12] uses DBSCAN for urban rail passenger aggregation distribution, and research [2] uses AHC to categorize learning activities in online learning. The data studied were generally observational data in the fields of education, health, and law. This study investigates budget performance data, which has previously been reported only within a limited scope.

RESEARCH METHOD
The data used for the clustering analysis is the Government of Indonesia's budget data at the Work Unit level in 2021. The data is secondary data owned by the Ministry of Finance in the SMART application. The raw data set consists of 19,460 observations with 10 attributes, namely Work Unit code (kd_ori_wu), Work Unit location (loc_wu), personnel expenditure budget (b51_wu), goods expenditure budget (b52_wu), capital expenditure budget (b53_wu), budget absorption (n_real), consistency of fund withdrawal plan (n_consist), achievement of output volume (n_cro), and efficiency value (n_ne). To obtain the government's budget performance clustering, this research tests three clustering algorithms, namely the K-Means algorithm, the DBSCAN algorithm, and the AHC algorithm.
The K-Means algorithm is a popular clustering algorithm that divides a data set into several clusters according to the proximity of data points. The K-Means algorithm is run based on the following steps:
1. The initial input data includes variable D as a collection of input data, D = {x1, x2, ..., xn}, where the i-th data point xi ∈ R^d (d-dimensional space).
2. Initial parameters, namely K as the desired number of clusters, C as a collection of cluster centers, C = {c1, c2, ..., cK}, and m as the number of iterations or convergence criterion.
3. Define the cluster center ck as the k-th cluster center, ck ∈ R^d.
4. Define a cluster using the formula:
$$S_k = \{\, x_i : \lVert x_i - c_k \rVert^2 \le \lVert x_i - c_j \rVert^2 \ \ \forall j,\ 1 \le j \le K \,\} \quad (1)$$
Sk : k-th cluster
xi : i-th data point
cj : j-th cluster center
K : number of clusters
5. Perform the algorithm iteration for each t value:
a. Assign data to clusters:
$$S_k^{(t)} = \{\, x_i : \lVert x_i - c_k^{(t)} \rVert^2 \le \lVert x_i - c_j^{(t)} \rVert^2 \ \ \forall j,\ 1 \le j \le K \,\} \quad (2)$$
b. Update cluster centers:
$$c_k^{(t+1)} = \frac{1}{\lvert S_k^{(t)} \rvert} \sum_{x_i \in S_k^{(t)}} x_i \quad (3)$$
In the AHC algorithm, the iteration step ends by eliminating the old clusters, Ca and Cb, from the cluster list and adding Cnew to the cluster list. These steps are repeated until only one cluster remains.
5. The results obtained are in the form of a cluster hierarchy that forms an agglomeration tree.
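The K-Means steps listed in this section can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the study: the function names, the random choice of initial centers from the data, the `seed` argument, and the rule for empty clusters are all assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(data, k, max_iter=100, seed=0):
    """Return (labels, centers) after at most max_iter assignment/update rounds."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(point) for point in rng.sample(data, k)]
    labels = []
    for _ in range(max_iter):
        # Step 5a: assign each point to the cluster with the nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in data
        ]
        # Step 5b: move each center to the mean of its assigned points.
        new_centers = []
        for j in range(k):
            members = [x for x, label in zip(data, labels) if label == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:  # convergence: centers no longer move
            break
        centers = new_centers
    return labels, centers
```

Calling `kmeans(points, 10)` on a list of attribute tuples would return one label per Work Unit and the ten cluster centers; the cluster sizes can then be counted from the labels.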
To evaluate the results of clustering, one of the recommended methods is the Davies-Bouldin Index (DBI) method. The DBI method uses cohesion and separation values to generate an index. Cohesion is the closeness of the data to the centroid of the cluster to which they belong. Separation is the distance between the centroids of the clusters. The smaller the DBI value (as long as it is greater than zero), the better the cluster formation. The formula for calculating DBI is as follows:
$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \ne i} R_{i,j} \quad (4)$$
where R_{i,j} is obtained through the following equation:
$$R_{i,j} = \frac{SSW_i + SSW_j}{SSB_{i,j}}$$
with SSW_i, SSW_j, and SSB_{i,j} obtained through the following equations:
$$SSW_i = \frac{1}{m_i} \sum_{l=1}^{m_i} d(x_l, c_i), \qquad SSB_{i,j} = d(c_i, c_j)$$
With:
R_{i,j} : the ratio between cluster i and cluster j
SSW_i, SSW_j : sum of squares within cluster i and cluster j
SSB_{i,j} : sum of squares between cluster i and cluster j
d(x_l, c_i) : distance of the l-th data point in cluster i to the i-th centroid
d(c_i, c_j) : distance from the i-th centroid to the j-th centroid
k : number of clusters
m_i, m_j : number of data points in the i-th and j-th clusters
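As an illustration of the DBI calculation described above, the following sketch computes the index from a list of points and their cluster labels. It assumes Euclidean distance and at least two clusters with distinct centroids; the function names are hypothetical, and how DBSCAN noise points should be treated is left to the caller because the text does not specify it.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def davies_bouldin_index(data, labels):
    """Mean over clusters of the worst-case ratio R_ij = (SSW_i + SSW_j) / SSB_ij."""
    # Group the points by cluster label.
    clusters = {}
    for x, label in zip(data, labels):
        clusters.setdefault(label, []).append(x)
    ids = sorted(clusters)

    # Centroid of each cluster (attribute-wise mean).
    centroids = {
        c: [sum(col) / len(points) for col in zip(*points)]
        for c, points in clusters.items()
    }
    # Cohesion: mean distance of the cluster's points to its centroid.
    ssw = {
        c: sum(euclidean(x, centroids[c]) for x in points) / len(points)
        for c, points in clusters.items()
    }

    total = 0.0
    for i in ids:
        # Separation: distance between the centroids of clusters i and j.
        total += max(
            (ssw[i] + ssw[j]) / euclidean(centroids[i], centroids[j])
            for j in ids
            if j != i
        )
    return total / len(ids)
```

Lower values indicate more compact and better separated clusters, which is why the model with the smallest DBI is preferred from a purely statistical point of view.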

Problem Identification and Literature Review
Based on observations of the official website of the Directorate General of Budget, Ministry of Finance, and of Minister of Finance Regulation Number 22/PMK.02/2021, no grouping of budget performance data was found for the Work Unit level, although 19,460 Work Units carried out government tasks in 2021. This makes budget performance data unstructured. The implication is that budget performance data cannot be used to generate knowledge that will later be useful for decision-making.
To handle unstructured data, unsupervised learning methods can be used. Clustering algorithms are the best alternative to handle this data. Based on previous research, there are three popularly used algorithms: K-Means [24]

Data Collection
Based on the SMART application data for 2021, 19,460 Work Units were obtained. The Work Units consist of 1,474 Work Units located in Jakarta (code 1), 17,784 Work Units spread across 34 provinces (codes 2-35), and 211 Work Units overseas (codes 50-59). Each Work Unit has three main attributes, namely the location of the Work Unit, budget attributes, and budget performance attributes. The OM SPAN application report is the source of data records for budget attributes. Data records for budget performance attributes are obtained from the SMART application reports. Data is initially stored in tabular form and subsequently converted to .csv format. The budget attributes are translated into personnel expenditure attributes (b51_wu), goods expenditure attributes (b52_wu), capital expenditure attributes (b53_wu), total budget (budget_wu), and budget blocks (block_wu). Meanwhile, the budget performance attributes consist of budget realization (n_real), disbursement plan consistency (n_consist), achievement of output realization (n_cro), and efficiency value (n_ne).

Data Pre-processing
Before the data are used in the clustering algorithms, data pre-processing is carried out. The first stage is data cleaning, which selects complete data records so that records containing N/A are not processed further. At this stage, 294 incomplete data records were found, leaving 19,166 data records. In the second stage, the most relevant attributes are selected to form the basis of the clustering algorithm. For the budget attributes, the total budget (budget_wu) and budget block (block_wu) were selected. For the budget performance attributes, achievement of output realization (n_cro) and efficiency value (n_ne) were selected.
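The two pre-processing stages can be sketched with the Python standard library as shown below. The column names follow the attribute codes given in the text, but the sample rows, the `preprocess` function, and the exact markers treated as missing are hypothetical and serve only to illustrate the idea.

```python
import csv
import io

# Attributes selected in the second stage, as named in the text.
SELECTED = ["budget_wu", "block_wu", "n_cro", "n_ne"]

# Hypothetical sample rows; the values are made up for illustration.
SAMPLE_CSV = """kd_ori_wu,loc_wu,budget_wu,block_wu,n_cro,n_ne
A001,1,1500.0,0.0,98.5,20.0
A002,2,800.0,50.0,N/A,15.0
A003,35,1200.0,10.0,100.0,-5.0
"""


def preprocess(csv_text, selected=SELECTED, missing=("", "N/A")):
    """Drop records with any missing field, then keep the selected attributes."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    dropped = 0
    for record in reader:
        # Stage 1: data cleaning, skip the whole record if any field is missing.
        if any(str(value or "").strip() in missing for value in record.values()):
            dropped += 1
            continue
        # Stage 2: attribute selection, converted to numbers for clustering.
        rows.append([float(record[name]) for name in selected])
    return rows, dropped
```

With the sample above, `preprocess(SAMPLE_CSV)` keeps two records and reports one dropped record, mirroring on a small scale how the incomplete records are removed before clustering.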

Determination of Parameters k, eps, and MinPts
For the centroid-based clustering algorithm, determining the value of k is crucial. In the K-Means algorithm, the value of k can affect the performance of the clusters formed [28]. The k parameter is also used in the AHC algorithm to determine cluster boundaries. The optimum k parameter can be determined with the Elbow Method and the Silhouette Method. Figure 2 shows the results of calculating the k parameters from the dataset. Based on Figure 2, the Elbow Method shows the optimum k parameter when k = 10, while the Silhouette Method shows the optimum k parameter when k = 6. Therefore, both k values will be used in the K-Means and AHC algorithms to get the best clusters.
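The two quantities behind these methods can be sketched as follows: the Elbow Method looks for the bend in the within-cluster sum of squares as k grows, and the Silhouette Method looks for the k with the highest mean silhouette coefficient. The sketch below assumes Euclidean distance, at least two clusters, and points that are not all identical; the function names are hypothetical. The pairwise loop is quadratic in the number of points, so it is meant for illustration rather than for the full dataset.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def wcss(data, labels):
    """Within-cluster sum of squared distances to each cluster centroid (Elbow)."""
    clusters = {}
    for x, label in zip(data, labels):
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for points in clusters.values():
        centroid = [sum(col) / len(points) for col in zip(*points)]
        total += sum(euclidean(x, centroid) ** 2 for x in points)
    return total


def mean_silhouette(data, labels):
    """Mean silhouette coefficient over all points (requires >= 2 clusters)."""
    scores = []
    for i in range(len(data)):
        # Distances from point i to every other point, grouped by cluster.
        dist_by_cluster = {}
        for j in range(len(data)):
            if i != j:
                dist_by_cluster.setdefault(labels[j], []).append(
                    euclidean(data[i], data[j])
                )
        own = dist_by_cluster.get(labels[i])
        if not own:
            scores.append(0.0)  # a singleton cluster scores 0 by convention
            continue
        a = sum(own) / len(own)  # mean distance inside the own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(d) / len(d)
            for label, d in dist_by_cluster.items()
            if label != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Running a clustering algorithm for a range of k values and recording both quantities for each labeling gives the two curves from which the candidate k values are read.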
The DBSCAN algorithm does not use the k parameter, but eps and MinPts. To determine the optimum eps and MinPts values, the Knee Method can be used. DBSCAN empirically employs MinPts = 4 [12]. However, the minimum MinPts value is d + 1; in this case, MinPts = 6. The values k = 6 and k = 10 obtained in the previous calculation will be used to determine the optimum eps and MinPts values. Based on Figure 3 [...]
In the K-Means algorithm, for k = 6, the smallest cluster size is 15 and the largest cluster size is 6,651. When k = 10, the smallest cluster size is 6 and the largest cluster size is 4,782. These results indicate that at k = 10 the data distribution tends to be better because the data are spread more evenly across the clusters, even though two clusters differ greatly in size from the other clusters.
The second algorithm used is the DBSCAN algorithm. The results obtained after running the DBSCAN algorithm are as follows: At MinPts = 6, 3 clusters are formed, with 151 data points of noise. When MinPts = 10, the number of clusters formed increases to 5, with 309 data points of noise. At both MinPts values, cluster 1 has the largest cluster size.
The third algorithm used is the AHC algorithm. The results obtained after running the AHC algorithm are as follows: The number of clusters formed using the AHC algorithm is the same as the number of clusters formed using the K-Means algorithm. In terms of cluster size, the AHC algorithm behaves like the DBSCAN algorithm: the largest cluster is cluster 1, with a size that differs greatly from that of the other clusters.

Evaluation Model
In the previous stage, six clustering models were obtained. To evaluate the results of clustering, one of the recommended methods is the Davies-Bouldin Index (DBI) method. The DBI values of the three algorithms used are as follows: Based on Table 9, the parameter k = 10 generally produces a smaller DBI value, except for AHC. Ranked by smallest DBI value, the best algorithms are AHC, DBSCAN, and then K-Means. According to the statistical approach, the AHC algorithm with parameter k = 6 produces the best clusters for the budget performance dataset. This result arises because the AHC algorithm forms cluster 1, which has the same characteristics as the data as a whole; the frequency distribution in cluster 1 is close to the total data (19,148 of 19,166 data points). When compared to the study [8], which obtained three large clusters, the AHC algorithm results are still not optimal.
The DBI value for the K-Means algorithm is higher than those of the other two algorithms. Based on the statistical approach, the resulting clusters are not as good as those of the other two algorithms. However, the K-Means algorithm has the advantage of a more even frequency distribution across clusters. For k = 10, only cluster 4 and cluster 10 have a very small frequency. The DBSCAN algorithm is moderate, with the advantage of being able to detect noise (outliers) in the dataset. This feature of the DBSCAN algorithm can be used to build the next stage of machine learning [22].
These results must be tested further with regard to ease of implementation for regulators. Clusters resulting from the AHC algorithm have the potential to cause noncooperative behaviors because, in the context of large-scale group decision-making, policies are taken based only on characteristics that are too general [21]. A study [9], which proposed a simple baseline for low-budget active learning on complex data such as image data, confirms that the K-Means algorithm is an alternative for classification purposes. The ten clusters formed by the K-Means algorithm are more complex in terms of defining the characteristics of each cluster. However, this is an advantage because it lets regulators make policies that fit the specifics of the clusters in question. Furthermore, the K-Means algorithm has been widely implemented to cluster data related to state finances, including Personal Income Levels in Romania [10], capital allocation for Small and Medium-Sized Enterprises (MSMEs) [19], operating cash flow [20], and determinants of SMEs' performance [23]. Based on these considerations, we choose the K-Means algorithm as the alternative for obtaining knowledge from the budget performance dataset.
[...] has a low efficiency value, even though the actual output is high. These results need further identification. Regulators need to check the validity and completeness of the achievement data inputted by the Work Unit, because the low score may be due to administrative negligence in inputting performance achievement data. The low efficiency score in cluster 8 also needs to be viewed with skepticism so that regulators are not mistaken in formulating policies.

CONCLUSIONS
Comparing the three clustering techniques, the AHC algorithm with k = 6 has the lowest DBI, 0.3583472. However, the AHC method is difficult to implement. The AHC results accumulate in cluster 1 (19,148 out of 19,166 data points). The same holds for the DBSCAN results, whose frequency distribution also accumulates in cluster 1. Policy design may therefore be difficult because decision-makers have trouble distinguishing data features, which may lead to biased policy. Thus, the optimal option is K-Means with parameter k = 10. The K-Means algorithm's DBI is 1.052678, and it creates 10 clusters.
Based on knowledge extraction, cluster 2 and cluster 5 are the ideal clusters in terms of budget performance, while the clusters that require attention are cluster 1, cluster 3, cluster 4, and cluster 8. We suggest further identification of the completeness of the Work Unit performance achievement data to find out whether administrative errors occurred during the input of budget performance achievements.
For further research, we suggest comparing the results with data from subsequent years to measure the consistency of the clustering algorithm. Research can be continued by using the clustering results to predict performance values with classification algorithms. Associative algorithms can also be used to determine the best policy mix in the budgeting sector.

IJCCS 4 .
The DBSCAN algorithm is a clustering algorithm that groups data according to density. The DBSCAN algorithm can be explained in the following steps:
1. The initial input data includes variable D as a collection of input data, D = {x1, x2, ..., xn}, where the i-th data point xi ∈ R^d (d-dimensional space).
2. Initial parameters, namely ε, the maximum distance between two adjacent data points, and MinPts, the minimum number of points in the ε-circle of a point so that the [...]
6. Point Q is a direct neighbor of point P if Q is a neighbor of P and P is a core point.
7. Define boundary points (border): P is a boundary point if P is not a core point but has a direct neighbor that is a core point.
8. Define noise: P is noise if P is not a core point and there are no other points that are direct neighbors of P.
9. Algorithm iteration: select points P from D randomly. If P has not been reached, calculate Nε(P). If |Nε(P)| < MinPts, mark P as noise. If |Nε(P)| ≥ MinPts, form a new cluster and add all reachable points of P to the cluster. Repeat this step for all points newly added to the cluster.
The AHC algorithm can be explained in the following steps:
1. The initial input data includes variable D as a collection of input data, D = {x1, x2, ..., xn}, where the i-th data point xi ∈ R^d (d-dimensional space).
2. Determine the distance matrix Ddist(i,j) to express the distance between xi and xj.
3. Define initial clusters, with each point xi initially considered a separate cluster.
4. Algorithm iteration: select the two closest clusters, Ca and Cb, which are the two clusters that have the closest distance based on Ddist. Merge the two clusters into a new cluster, Cnew = Ca ∪ Cb. Update Ddist to account for Cnew as a single entity. Eliminate the old clusters, Ca and Cb, from the cluster list.
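The DBSCAN and AHC iterations described in these steps can be sketched in plain Python as follows. Several choices here are assumptions made for the example and are not stated in the text: Euclidean distance, visiting points in index order rather than randomly, counting a point as part of its own ε-neighborhood, single linkage for the cluster distance in AHC, and stopping the merging at k clusters (which corresponds to cutting the agglomeration tree at k). Distances are recomputed at each merge instead of maintaining the matrix Ddist, which is simpler but slower.

```python
import math

NOISE = -1  # label used here for noise points (an illustrative convention)


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def eps_neighborhood(data, p, eps):
    """Indices of all points within distance eps of point p (p included)."""
    return [q for q in range(len(data)) if euclidean(data[p], data[q]) <= eps]


def dbscan(data, eps, min_pts):
    """Return one label per point: 1, 2, ... for clusters, NOISE otherwise."""
    labels = [None] * len(data)  # None means "not yet reached"
    cluster = 0
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = eps_neighborhood(data, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # p is a core point: start a new cluster
        labels[p] = cluster
        queue = list(neighbors)
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point: joins but is not expanded
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = eps_neighborhood(data, q, eps)
            if len(q_neighbors) >= min_pts:  # q is a core point: keep expanding
                queue.extend(q_neighbors)
    return labels


def ahc(data, k):
    """Merge the two closest clusters until k clusters remain (k >= 1)."""
    clusters = [[i] for i in range(len(data))]  # every point starts alone
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Assumption: single linkage (distance of the closest pair).
                dist = min(
                    euclidean(data[i], data[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]  # Cnew = Ca ∪ Cb
        # Remove Ca and Cb from the cluster list, then add Cnew.
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters
```

`dbscan` returns a label per point, so the number of noise points is the count of `NOISE` labels, while `ahc` returns the clusters as lists of point indices, from which cluster sizes follow directly.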

Table 2. Summary of the Work Unit data

Table 3. K-Means Cluster Size

Table 5. AHC Cluster Size

Table 6

Table 7. Cluster Attribute Means