A New Clustering Algorithm and Its Application in Assessing the Quality of Underground Water

Cluster analysis, which partitions a dataset into groups so that similar elements are assigned to the same group and dissimilar elements to different ones, has been widely studied and applied in various fields. The two challenging tasks in clustering are determining a suitable number of clusters and generating clusters of arbitrary shapes. This paper proposes a new concept of “epsilon radius neighbors,” which plays an essential role in the cluster-forming process and thereby determines both the number and the shape of clusters automatically. Based on “epsilon radius neighbors,” a new clustering algorithm is proposed in which the epsilon radius value is adapted to the characteristics of each cluster in the current partition. Recently, clustering has been widely applied in environmental applications, including underground water quality monitoring. However, the existing studies have simply applied conventional clustering techniques, in which the abovementioned two challenging tasks have not yet been solved. Therefore, in this paper, the proposed clustering algorithm is applied in assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. The experimental results on benchmark datasets demonstrate the effectiveness of the proposed algorithm. For the quality of underground water, the new algorithm results in four clusters with different characteristics. Through this application, we found that the new algorithm might provide valuable reference information for underground water management.


Introduction
Cluster analysis aims to discover the underlying structure of a dataset by partitioning the data into groups so that similar elements are assigned to the same group and dissimilar elements are assigned to different ones [1][2][3][4][5]. Recently, along with the development of big data, cluster analysis has been extensively studied and widely applied in various fields, such as physics, biology, economics, engineering, sociology, and data mining [6]. To solve the clustering problem, several approaches have been proposed in the literature, including nonhierarchical clustering (k-means, k-means++, and other variants [7,8]), hierarchical clustering [9], clustering for probability functions [1], and fuzzy clustering [10]. Among the abovementioned approaches, k-means clustering is the most well known and widely applied. However, the k-means algorithm and its extensions usually require a user-defined number of clusters, which is often unknown in practice (i). Furthermore, the k-means algorithm constructs spherical clusters, which is unsuitable for arbitrary-shaped clusters (ii). The above two problems have been the major drawbacks of clustering so far and lead to many difficulties and challenges in solving this problem [6].
For (i), to determine the suitable number of clusters, the most commonly used approach is to run the clustering algorithm several times with a different number of clusters each time and to evaluate the results using internal validity measures, such as the S-index, F-index, Dunn index, and Xie-Beni index [11][12][13][14]. This approach can identify a suitable number of clusters, but it repeats the clustering process many times, thereby increasing the time and space required, according to [6]. Moreover, the abovementioned evaluation indices are distance-based measures; therefore, they can only evaluate the quality of spherical clusters and cannot be used for arbitrary-shaped clusters. In [15], Mavridis et al. proposed the algorithm PFClust (Parameter Free Clustering). The term "parameter free" means that the algorithm can automatically determine the number of clusters without requiring any user-defined parameters. For this purpose, PFClust performs an agglomerative algorithm on many subdatasets that are randomly sampled several times. Given an internal validity measure and a set of thresholds corresponding to the number of clusters, the suitable threshold is then chosen based on the distribution of the given internal measure over all possible clustering results. In comparison to other conventional clustering algorithms, PFClust can achieve slightly better performance; however, it repeats the process of sampling and evaluating internal measures of the given thresholds several times. Consequently, PFClust tends to be more time-consuming and expensive than other clustering methods. References [16][17][18] found the optimal partition by combining a metaheuristic optimization method with clustering. These studies used the abovementioned internal validity measures as objective functions to be optimized in order to find the best clustering solution. It is well known that metaheuristic optimization methods, e.g., the genetic algorithm, incur an extreme computational cost, which reduces the efficiency of the algorithm. Furthermore, although it outputs the number of clusters and the partition automatically, a metaheuristic optimization method requires several user-defined parameters of its own that affect the optimal solution. As a result, avoiding the challenge of specifying the number of clusters k leads to the challenge of specifying many other parameters. In [19], automatic clustering was performed using a force function that controls the movements of the objects. The farther the distance, the weaker the force between two objects. In the end, each object converges to the center of the cluster it belongs to. Since computing the force also requires a user-defined parameter, denoted λ, and the value of λ also affects the number of clusters, this algorithm does not significantly overcome the problem of [16][17][18].
For (ii), DBSCAN [20], a density-based algorithm, is the most well-known method for constructing arbitrary-shaped clusters. The algorithm utilizes two connectivity functions, termed density-reachable and density-connected, and each data instance is labeled as either a core point or a border point. The algorithm expands each core point to form a cluster around it. A drawback of DBSCAN is that when clusters of different densities exist, only particular kinds of noise points are captured [21]. Besides, two user-defined parameters, the minimum size of clusters and the radius, need to be carefully tuned. Other approaches, such as kernel k-means [22] and spectral clustering [23], can construct arbitrary-shaped clusters; these methods, however, also require a predefined number of clusters.
Because of the abovementioned drawbacks, an investigation of a new clustering method which can automatically determine the number of clusters and the clusters' shape is necessary.
This paper proposes a new clustering method based on a new definition called "ε-radius neighbors" of a given point $x_0$. The ε-radius neighbors play a key role in constructing clusters with arbitrary shapes. When no new ε-radius neighbor is found, the algorithm stops processing the current cluster, and thereby the number of clusters is automatically determined. Furthermore, the radius ε can be adapted to the specific cluster density, which is an advantage of the proposed method in comparison with DBSCAN.
The quality of underground water depends on various factors, such as climate, characteristics of aquifers, pH, alkalinity, redox potential of the geological environment, initial sources, contamination due to human activities, and biological processes. The conventional methods of assessing the quality of groundwater are usually based on comparing the parameters representing water quality, which are collected by sensors, with the permitted standards. Clustering can help explain a complex data matrix, analyze the similarities in water quality characteristics, and group them into clusters, thereby showing their general characteristics as well as the causes that affect water quality. Therefore, clustering has been widely applied in environmental applications, including underground water quality monitoring. Some studies, for example, [24][25][26][27][28], have applied clustering in order to classify the water quality in a whole region and to design a future spatial sampling strategy in an optimal manner, which can reduce the number of sampling stations and the associated costs. However, the abovementioned studies simply applied conventional clustering methods, such as hierarchical clustering with Ward distance and k-means clustering. These methods, in general, suffer from the disadvantages mentioned above. Therefore, in this paper, the proposed clustering algorithm is applied in assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. This application is expected to produce more reliable and valuable information so that the administrators can monitor underground water behavior.

The remainder of this paper is organized as follows. Section 2 presents the study area, the data collection, and the proposed method. The results and discussion are presented in Section 3, in which Section 3.1 validates the proposed algorithms on different datasets and Section 3.2 presents the application in assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. Finally, Section 4 is the conclusion.

The Proposed Clustering Method.
Definition 1. Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, be a set of n points, $x_0 \in \mathbb{R}^d$ be a given point, and ε be an arbitrary positive number. A set $S \subseteq X$ is called the ε-radius neighbors of $x_0$ if

$$S = \{x_i \in X : d(x_i, x_0) \le \varepsilon\}, \qquad (1)$$

where $d(x_i, x_0)$ is the Euclidean distance between $x_i$ and $x_0$. Obviously, the ε-radius neighbors of $x_0$ are located in a hypersphere of radius ε around $x_0$. As a result, a cluster can be extended by searching the dataset and adding new objects that lie in any hypersphere of radius ε around the current objects. This process still depends on the value of ε.
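As a minimal sketch of the ε-radius neighbors idea, the search can be written as a filter on Euclidean distance. The function name `epsilon_radius_neighbors` is hypothetical, and treating the hypersphere as closed (distance less than or equal to ε) is an assumption of this sketch.

```python
import math

def epsilon_radius_neighbors(points, x0, eps):
    """Return the points lying within Euclidean distance eps of x0.

    Assumption of this sketch: the boundary of the hypersphere is included.
    """
    return [p for p in points if math.dist(p, x0) <= eps]

# Small illustrative dataset (made up for this example).
points = [(0.0, 0.0), (0.3, 0.4), (1.0, 1.0), (0.1, 0.0)]
neighbors = epsilon_radius_neighbors(points, (0.0, 0.0), 0.6)
print(neighbors)
```

Note that if $x_0$ itself belongs to the dataset, it is returned as its own neighbor; a cluster-growing loop would normally skip points that are already assigned.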
This parameter plays the same role as the parameter ε in the well-known DBSCAN algorithm, and its choice affects the clustering result. A fixed value of ε has low generalization ability because different datasets, and clusters with different densities within one dataset, could require different values of ε. A natural strategy is simply to adapt ε using the current cluster density. For the sake of presentation, the set of pairwise distances in the current cluster is called the set of "historical extending." Based on the set of "historical extending" in the current cluster (sample), we can estimate the maximum "extending" of the entire cluster (population). In this case, two basic principles are as follows:

(1) We know that if data have the normal distribution with mean μ and standard deviation σ, then 95% of the data values belong to the interval μ ± 2σ. If the two abovementioned parameters are unknown and the data are large enough, we can estimate them from the sample data. For example, the mean and the adjusted standard deviation of the sample can be selected as alternatives for μ and σ, respectively. Therefore, to estimate the maximum extending of the cluster (population), we can use the following formula:

$$\varepsilon = \bar{d} + 2 s_d, \qquad (2)$$

where $\bar{d}$ and $s_d$ are the mean and adjusted standard deviation of the "historical extending" in the current-processing cluster (sample). Obviously, about 97.5% of the extending pertaining to the true cluster (population) must be less than the extending estimated by formula (2), and thus, this formula can be used to approximate the maximum extending of the true cluster (population).

(2) Let n be the sample size, that is, the number of objects in the current-processing cluster, and let $\bar{d}$ and $s_d$ be the sample mean and adjusted standard deviation, respectively. Assuming that n is large enough or that d is normally distributed, $\bar{d}$ is approximately normally distributed with mean $\mu(d)$ and variance $s_d^2/n$. Consequently, at a significance level of 0.05, the mean of d belongs to the interval $\bar{d} \pm 1.96\, s_d/\sqrt{n}$. As a result, the maximum of the mean extending can be directly estimated using the following formula:

$$\varepsilon = \bar{d} + \frac{1.96\, s_d}{\sqrt{n}}. \qquad (3)$$

The maximum value of the confidence interval is then used as the representative extending of the cluster.
It can be observed from formulas (2) and (3) that, in the earlier processing stage, when the sample size is too small, the standard deviation and the adaptive extending must be large. Therefore, we can avoid unreasonable extending in the earlier processing stage, when the current sample is not a good representation of the population. Meanwhile, in the later stage, the number of objects in the current-processing cluster, that is, the sample size, is large enough to maintain a stable adaptive extending.
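The two adaptive radius rules can be sketched as follows. This is only an illustration under stated assumptions: formula (2) is taken to be the sample mean of the pairwise distances plus two adjusted standard deviations, and formula (3) the upper end of the 95% confidence interval of the mean, with n the number of objects in the cluster. The names `adaptive_radius` and `rule` are hypothetical.

```python
import math
import statistics
from itertools import combinations

def adaptive_radius(cluster, rule=2):
    """Estimate the radius from the pairwise distances of the current cluster.

    rule=2: mean + 2 * s_d              (assumed form of formula (2))
    rule=3: mean + 1.96 * s_d / sqrt(n) (assumed form of formula (3))
    The cluster needs at least 3 points so that the adjusted standard
    deviation is computed from at least two distances.
    """
    dists = [math.dist(p, q) for p, q in combinations(cluster, 2)]
    d_bar = statistics.mean(dists)
    s_d = statistics.stdev(dists)  # adjusted (sample) standard deviation
    if rule == 2:
        return d_bar + 2 * s_d
    return d_bar + 1.96 * s_d / math.sqrt(len(cluster))

cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
r2 = adaptive_radius(cluster, rule=2)
r3 = adaptive_radius(cluster, rule=3)
print(round(r2, 4), round(r3, 4))
```

For the same cluster, the rule based on the confidence interval of the mean gives the smaller radius, so it extends the cluster more conservatively.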
Based on formulas (2) and (3), we propose a new clustering method, called adaptive radius clustering, for automatically determining the number of clusters and the cluster shapes. Let $X = \{x_1, x_2, \ldots, x_n\}$, $x_i \in \mathbb{R}^d$, be an original dataset of n objects. The new clustering algorithm is presented as the following pseudocode and in Figure 1.
Here, $C_1^{old}$ and $C_1^{new}$ are the current-processing cluster obtained before and after an update, respectively.
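Step 1. Get the first three objects of the cluster using the formulas below:

$$v_1 = \arg\min_{x_i \in X} \sum_{x_j \in X} d(x_i, x_j), \qquad (4)$$

$$v_2 = \arg\min_{x_i \in X \setminus \{v_1\}} d(x_i, v_1), \qquad (5)$$

$$v_3 = \arg\min_{x_i \in X \setminus \{v_1, v_2\}} d(x_i, v_1), \qquad (6)$$

subject to the condition that $d(v_1, v_2)$ and $d(v_1, v_3)$ do not exceed the average of the pairwise distances between points in the current dataset. (7)

Update the current-processing cluster with these three points. In formulas (4), (5), and (6), the argument "arg" of a function is the value that must be provided to obtain the function's result; thus, $v_1$ is the point for which the sum of distances between it and the other points is the minimum. In other words, $v_1$ is the centroid of the current dataset. Similarly, $v_2$ is the nearest point to $v_1$, and $v_3$ is the nearest point to $v_1$ when $v_2$ is excluded. Formula (7) is defined to overcome the problem of bad initialization. For example, if $v_2$ and $v_3$ are the two nearest neighbors of $v_1$, but the corresponding distances are larger than the average of the pairwise distances between points in the current dataset, then $v_1$ will be considered as a single cluster and the current extending process will be stopped.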
In the abovementioned formulas, d is the Euclidean distance between any two d-dimensional points. In some illustrations below, for the sake of visualization, x will be chosen as a 2-dimensional point ($x_1$ and $x_2$) so that we can draw the scatter plot of the data. In fact, $x_1$ and $x_2$ need not be coordinates; they can be other information such as height, weight, Ca²⁺, Mg²⁺, and Na⁺. Furthermore, x can be a d-dimensional vector, in general. We can calculate the Euclidean distance between two d-dimensional points x and y using the following formula:

$$d(x, y) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2}. \qquad (8)$$

In addition, because variables measured at different scales do not contribute equally when calculating the distance, the data are normalized into the [0, 1] interval using the following formula:

$$z_{ij} = \frac{x_{ij} - \min_i (x_{ij})}{\max_i (x_{ij}) - \min_i (x_{ij})}, \qquad (9)$$

where $x_{ij}$ is the value of variable j ($j = 1, \ldots, d$) at point i ($i = 1, \ldots, n$), $z_{ij}$ is the normalized value of variable j at point i, and $\min_i (x_{ij})$ and $\max_i (x_{ij})$ are the minimum and maximum values of variable j, respectively.
Step 2. For each $v_i \in C_1^{new}$, compute the adaptive ε-radius and the corresponding ε-radius neighbors $S_i$ using Definition 1 and either formula (2) or formula (3); then update $C_1^{new}$ and $centroid_{new}$ accordingly. In this step, formulas (2) and (3) are utilized to compute the adaptive ε-radius and the corresponding ε-radius neighbors $S_i$. Note that the two abovementioned formulas are, at this point, options that need to be tested. In the numerical results, both are applied, and the better option is then selected for the application.
Step 3. If $C_1^{new} \setminus C_1^{old} \neq \emptyset$, then $centroid_{new} := centroid_{new} \cup (C_1^{new} \setminus C_1^{old})$ and $C_1^{old} := C_1^{new}$. Repeat Step 2 and Step 3 until $centroid_{new} = \emptyset$; then stop the current-processing cluster.
Step 4. Repeat the three steps above until all objects are assigned to their clusters.

[Figure 1: Flowchart of the proposed algorithm.]

The main idea of the proposed algorithm is that, from a number of points initialized using formulas (4), (5), and (6) subject to (7), the cluster can automatically expand based on formula (2) or (3). When the cluster can no longer be extended, the abovementioned process is repeated over the rest of the data until all points in the data are assigned to a specific cluster. With formula (2) or (3), the ε-radius neighbors can adapt to different cluster densities; hence, the proposed algorithm can determine the number of clusters and find clusters of arbitrary shapes in cases of both balanced and imbalanced cluster densities. This is an advantage of the proposed algorithm in comparison to conventional methods, such as k-means, k-medoids, and DBSCAN.
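The four steps above can be put together in a compact sketch. This is an illustrative reading of the procedure, not the authors' implementation. Several details are assumptions: the radius is recomputed from the current cluster before each extension, the list `frontier` plays the role of the points that have not yet been used for extending, condition (7) is checked against the mean pairwise distance of the remaining data, and fewer than three leftover points are emitted as singleton clusters. The names `arc_sketch` and `adaptive_eps` are hypothetical.

```python
import math
import statistics
from itertools import combinations

def adaptive_eps(pts, rule):
    """Radius from pairwise distances (assumed forms of formulas (2) and (3))."""
    d = [math.dist(p, q) for p, q in combinations(pts, 2)]
    d_bar, s_d = statistics.mean(d), statistics.stdev(d)
    return d_bar + 2 * s_d if rule == 2 else d_bar + 1.96 * s_d / math.sqrt(len(pts))

def arc_sketch(points, rule=2):
    """Return clusters as sorted lists of indices into `points`."""
    remaining = set(range(len(points)))
    clusters = []
    while remaining:
        idx = sorted(remaining)
        if len(idx) < 3:  # assumption: leftovers become singleton clusters
            clusters.extend([i] for i in idx)
            break
        # Step 1: v1 minimises the sum of distances; v2, v3 are its nearest points.
        v1 = min(idx, key=lambda i: sum(math.dist(points[i], points[j]) for j in idx))
        near = sorted((i for i in idx if i != v1),
                      key=lambda i: math.dist(points[i], points[v1]))
        v2, v3 = near[0], near[1]
        mean_d = statistics.mean(
            math.dist(points[i], points[j]) for i, j in combinations(idx, 2))
        if math.dist(points[v1], points[v3]) > mean_d:
            # Bad initialisation: v1 is treated as a single-point cluster.
            clusters.append([v1])
            remaining.discard(v1)
            continue
        cluster, frontier = [v1, v2, v3], [v1, v2, v3]
        remaining -= set(cluster)
        # Steps 2 and 3: extend from each unused point until nothing new is found.
        while frontier:
            v = frontier.pop()
            eps = adaptive_eps([points[i] for i in cluster], rule)
            new = [i for i in sorted(remaining)
                   if math.dist(points[i], points[v]) <= eps]
            cluster.extend(new)
            frontier.extend(new)
            remaining -= set(new)
        clusters.append(sorted(cluster))  # Step 4: continue with the rest
    return clusters

# Two well-separated groups (made-up data for illustration).
data = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0), (0.4, 0.0),
        (5.0, 5.0), (5.1, 5.0), (5.2, 5.0), (5.3, 5.0)]
result = arc_sketch(data, rule=2)
print(result)
```

The sketch recomputes all pairwise distances at every step for clarity; an implementation meant for larger datasets would cache the distance matrix.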

Study Area and Data Used.
The clustering method proposed above will be applied in assessing the underground water quality in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. The study area and data used are described as follows. Phu My Town is the most concentrated industrial area and one of the most developed areas in Ba Ria-Vung Tau Province, Vietnam. To serve economic development, the demand for water in this area is quite high, but the sources of surface water from rivers and lakes do not meet the demand. According to the 2012 survey data of the Department of Natural Resources and Environment of Ba Ria-Vung Tau Province, the total volume of underground water exploitation in this town amounted to 18,608,430 m³/year (mainly from Phu My-My Xuan water station and Toc Tien Water Plant). Groundwater exploitation has been reported to be mainly in the Pleistocene aquifer, which is composed of coarse-grained soil of the Cu Chi Formation, Thu Duc Formation, and Trang Bom Formation, with the main minerals being fluorite-apatite, feldspar, gypsum, tourmaline, montmorillonite, ilmenite, and some other impurities. See Figure 2; the detailed dataset is presented in Table 1.

Data
In this study, all variables contribute equally when calculating distance; that is, the proposed method assigns equal importance to each chemical parameter. If some chemical parameters are more important than others, the proposed method can be performed using the weighted Euclidean distance instead of the standard Euclidean distance. Also, note that, in this application, the wells' locations are not considered as variables; that is, the wells are grouped only by their chemical parameters. The algorithm thereby focuses on chemical properties rather than on location. Naturally, if wells in the same region have the same chemical properties, they will be assigned to the same cluster. As a result, we have wells sorted by locations. In contrast, through the clustering results, we can still identify wells that are in the same region but have different chemical properties, or wells that are in different regions but have similar chemical properties. In such cases, the corresponding explanation will also be provided.
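The weighted alternative mentioned above can be sketched as follows. The text does not give the weighted formula, so the common form, the square root of the weighted sum of squared differences, is assumed here; the function name `weighted_euclidean` and the weights are hypothetical.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (x_k - y_k)^2).

    Assumption of this sketch: weights are non-negative, one per variable.
    With all weights equal to 1 this reduces to the standard distance.
    """
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

equal = weighted_euclidean((0.0, 0.0), (3.0, 4.0), (1.0, 1.0))
first_only = weighted_euclidean((0.0, 0.0), (3.0, 4.0), (4.0, 0.0))
print(equal, first_only)
```

Setting a weight to zero removes the corresponding chemical parameter from the comparison, while a larger weight makes differences in that parameter dominate the grouping.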

Numerical Example.
In this section, a simple dataset is used in order to illustrate the proposed algorithm in detail. The dataset consists of 20 bivariate points presented in Table 2; the normalized data points are presented in Figure 3.
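The normalization of the data into the [0, 1] interval can be sketched with a column-wise min-max rescaling. The function name `min_max_normalize` is hypothetical, and mapping a constant column to 0.0 is an assumption of this sketch. The three rows used below are points 1, 10, and 11 of the example dataset; because the minimum and maximum are taken over these three rows only, the printed values differ from those obtained on the full dataset.

```python
def min_max_normalize(rows):
    """Rescale every column of `rows` to [0, 1] using its own min and max."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((x - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
              for x, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

rows = [(42, 72), (57, 67), (41, 58)]
z = min_max_normalize(rows)
print(z)
```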
Using formulas (4), (5), and (6), we found the three initial points v 1 , v 2 , and v 3 of the first cluster, which are represented by red in Figure 4. It can be seen from Figure 4 that the distance between these three points is really small in comparison with the distance between all points; therefore, condition (7) is satisfied and we can use these three points for extending the cluster. Now, we use the points in the processing cluster to build up the cluster itself. For example, in Figure 5, starting from the green point, v 2 , using formula (3), we calculate the adaptive radius and determine the three new ε-neighbors, based on the circle formed. After that, the processing cluster will be extended by adding these three new points, and the point v 2 will no longer be used to extend the cluster in the next steps. Using another point in the processing cluster, for example, the green point in Figure 6, we also calculate the adaptive radius and determine the new ε-neighbors, based on the circle formed.
Repeat the abovementioned process until the processing cluster cannot be extended further, that is, until all points in the processing cluster have been used for the extending process and we cannot find any new points linked to them, as shown in Figure 7. With the first cluster completely determined in Figure 7, we can repeat the abovementioned process for the remainder of the dataset and obtain the final partition, as shown in Figure 8.

[...] clusters and results in spherical clusters, while DBSCAN is a density-based clustering algorithm that is suitable for clusters of arbitrary shapes. (iv) SU: an automatic clustering algorithm recently presented in [19] for determining the number of clusters automatically.

Experiments in Benchmark
In this paper, the Adjusted Rand Index, ARI [29,30], is employed to evaluate the performance of the five compared methods. ARI is an external measure that compares the partition produced by a clustering algorithm (P) with the actual partition (Q), where the "groundtruth" labeling is known. Particularly, given P and Q, ARI is defined as follows:

$$\mathrm{ARI} = \frac{2(ad - bc)}{(a + b)(b + d) + (a + c)(c + d)}, \qquad (12)$$

where a is the number of pairs of elements in the same cluster in P and Q, b is the number of pairs of elements in the same cluster in P but in different clusters in Q, c is the number of pairs of elements in different clusters in P but in the same cluster in Q, and d is the number of pairs of elements in different clusters in both P and Q. The closer the ARI is to 1, the better the clustering result is (it can be seen from formula (12) that when P and Q are the same, b = c = 0 and ARI = 1).

Table 2: The 20 bivariate points of the numerical example.

| Point | x1 | x2 | Point | x1 | x2 |
|---|---|---|---|---|---|
| 1 | 42 | 72 | 11 | 41 | 58 |
| 2 | 44 | 71 | 12 | 41.5 | 59 |
| 3 | 46 | 73 | 13 | 42.5 | 59 |
| 4 | 47 | 72 | 14 | 43 | 60 |
| 5 | 49 | 71 | 15 | 45 | 61 |
| 6 | 51 | 71 | 16 | 45.5 | 61 |
| 7 | 52 | 70 | 17 | 47 | 61 |
| 8 | 54 | 69 | 18 | 48 | 61 |
| 9 | 55 | 68 | 19 | 49 | 61 |
| 10 | 57 | 67 | 20 | | |

Table 3 intuitively presents the clustering results of the five tested algorithms on the four used datasets. Remarks:

(i) For the nonspherical clusters, the performance of DBSCAN is better than that of the SU and k-means algorithms. This result is reasonable because DBSCAN can easily group the data points into clusters of arbitrary shape, based on density and connection rather than on the distance between them. The ARC2 algorithm, in general, is quite efficient in terms of ARI and outperforms DBSCAN on two of the three datasets. Meanwhile, ARC1 achieves the largest ARI values, which indicates the best performance in terms of clustering accuracy.

(ii) For the spherical or Gaussian clusters, most of the methods perform well, with ARC1, SU, and DBSCAN being the proper methods. The k-means algorithm also provides the best result for k = 3; however, when k is randomly changed and does not satisfy k = 3, this method shows poor performance. Tables 3 and 4 also show that ARC2 performs better than k-means; however, it is not good enough for the Gaussian clusters.

(iii) In summary, it can be claimed that ARC1 is an effective algorithm. Specifically, ARC1 can automatically determine the number of clusters and has notably larger ARI values, that is, notably better clustering results, for any given dataset.
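The pair-counting evaluation can be sketched as follows, assuming that formula (12) is the standard pair-counting form of the Adjusted Rand Index, ARI = 2(ad − bc) / ((a + b)(b + d) + (a + c)(c + d)). The function names are hypothetical, and returning 1.0 when the denominator is zero (for example, when both partitions consist of a single cluster) is an assumption of this sketch.

```python
from itertools import combinations

def pair_counts(p, q):
    """Count pairs by agreement between label lists p (predicted) and q (true)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(p)), 2):
        same_p, same_q = p[i] == p[j], q[i] == q[j]
        if same_p and same_q:
            a += 1      # same cluster in both P and Q
        elif same_p:
            b += 1      # same in P, different in Q
        elif same_q:
            c += 1      # different in P, same in Q
        else:
            d += 1      # different in both
    return a, b, c, d

def adjusted_rand_index(p, q):
    a, b, c, d = pair_counts(p, q)
    denom = (a + b) * (b + d) + (a + c) * (c + d)
    return 1.0 if denom == 0 else 2.0 * (a * d - b * c) / denom

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])      # identical up to renaming
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])   # no pair agrees on "same"
print(same, crossed)
```

The first call illustrates that ARI depends only on which points are grouped together, not on the label values themselves.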

Application for Underground Water Quality Assessment.
In this section, we cluster the samples of groundwater quality parameters provided by the Department of Natural Resources and Environment of Ba Ria-Vung Tau Province. The study area and data used have been presented in Section 2. The clustering results in Figure 9 show that the 17 monitoring wells are classified into 4 groups based on the water quality characteristics:

(i) Cluster 1: NB3A, QT5B, NB4
(ii) Cluster 2: NB3B, NB1B, NB1A, QT11
(iii) Cluster 3: QT7B, NB2C, VT4B, VT6, QT5A, NB2A, VT4A, VT2B, VT2A
(iv) Cluster 4: QT7A

A comparison of some parameters among clusters is shown in Figure 10. We have the following remarks:

(i) Cluster 4 consists of only 1 well, QT7A, with very high parameter values. This result demonstrates that the water quality in this well is really bad compared to the remaining clusters. In addition, it can be seen from Table 1 and Figure 10(a) that QT7A has more salt ions (Mg²⁺, Na⁺, K⁺, Ca²⁺, HCO₃⁻, NH₄⁺, Cl⁻, SO₄²⁻, and nitrite) compared to the remaining clusters. According to the National Technical Regulation on Groundwater Quality of Vietnam, the permitted standard for Cl⁻ is 250 mg/l and for SO₄²⁻ is 400 mg/l. Therefore, the Cl⁻ and SO₄²⁻ values of QT7A exceed the permitted standards 3.78 and 1.3 times, respectively. This indicates that QT7A may be strongly affected by saline intrusion because this well is located near the saline boundary. Additionally, it can be seen in Figure 9 that the two wells QT7A and QT7B are located in the same region, but they belong to different clusters. Actually, they are both contaminated wells, but they have different depths, representing separate aquifers. As a result, QT7A exhibits a higher level of contamination than QT7B.

(ii) For the three remaining clusters, it can be seen from Figures 9 and 10(b) that Cluster 1 consists of three wells with high HCO₃⁻ values. To our knowledge, the two wells NB3A and QT5B are located near My Xuan B1 industrial zone, and the well NB4 is located near Toc Tien landfill. As a result, those wells may be contaminated by the waste discharge process of the abovementioned industrial zone and landfill.

(iii) Cluster 2 consists of four wells with relatively good quality. In this cluster, most of the parameter values are lower than those of other clusters and are within safe ranges. It can be concluded that the wells of Cluster 2 are not affected by agricultural activities or saline intrusion.

(iv) Cluster 3 consists of eight wells with higher values of Mg²⁺, Na⁺, K⁺, Ca²⁺, Cl⁻, and SO₄²⁻ compared to those of Cluster 1 and Cluster 2. In particular, the Cl⁻ value exceeds the permitted standard at 2/8 wells. This indicates that a number of wells in Cluster 3, which are located near the coast as well as the salinity boundaries, may be affected by saline intrusion. In addition, as shown in Figure 10(b), in Cluster 3, the average value of NO₃⁻ is higher than that of Cluster 1 and Cluster 2. This demonstrates [...] may be seriously affected by organic matter from the residual feed; therefore, the NO₃⁻ value reaches 7.77 times higher than the permitted standard.

Conclusion
Based on the definition of epsilon radius neighbors, this paper has proposed a new clustering algorithm that can automatically determine the number of clusters and can find clusters with different sizes, shapes, and densities. The radius, or extending, is adapted to the current-processing cluster and has good generalization ability. The proposed algorithm is tested on benchmark datasets and is then applied to underground water quality assessment in Phu My Town, Ba Ria-Vung Tau Province, Vietnam. In the experiments with many datasets, the ARC1 algorithm exhibits better performance than the other tested algorithms in terms of the Adjusted Rand Index. The ARC2 algorithm performs better than the conventional clustering algorithms in the case of nonspherical clusters but worse in the case of spherical clusters. For the underground water quality assessment in Phu My Town, Ba Ria-Vung Tau Province, Vietnam, the proposed algorithm indicated that there are four clusters of water quality that represent different source contributions.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.