Model-based clustering of hydrochemical data to demarcate natural versus human impacts on bedrock groundwater quality in rural areas, South Korea
Introduction
Groundwater chemistry is inherently controlled by water–rock interactions in the aquifer (Stumm and Morgan, 1996; Langmuir, 1997; Appelo and Postma, 1999). However, the natural quality of groundwater has progressively deteriorated in many countries due to diverse human impacts (Foster and Chilton, 2003; Morris et al., 2003). In South Korea, groundwater in fractured bedrock (gneisses and granitoids, which occupy ca. 70% of the Korean Peninsula) forms the most important source of water supply, especially to rural communities where surface water reservoirs are generally absent. The deterioration in the quality of bedrock groundwater has partly resulted from a lack of effective management schemes, and thus the Korean government has recently made great efforts to sustainably manage bedrock aquifers (e.g., ‘Master Plan for Managing Groundwater Resources’; MOCT and KOWACO, 2002). To properly manage available water resources, there is a need for a quantitative understanding of water quality deterioration from diffuse (non-point) pollution sources, as well as of the background chemistry associated with natural geochemical processes.
Through careful examination of the spatial variations in hydrochemical data, it is possible to estimate the relative contribution of natural versus anthropogenic effects on groundwater quality. Especially for large and/or complicated datasets, multivariate statistics (e.g., clustering) have been widely used for these determinations (Suk and Lee, 1999; Güler et al., 2002; Farnham et al., 2003; Güler and Thyne, 2004; Cloutier et al., 2008; Yidana, 2010). Diverse clustering methods such as fuzzy c-means (FCM) clustering, hierarchical clustering, and k-means clustering have been used to separate groundwater samples into homogeneous groups that reflect the different source contributions to groundwater chemistry.
For successful application of clustering techniques, selection of suitable algorithms for the hydrochemical dataset is important (Templ et al., 2008). Hierarchical clustering and partitioning (or iterative relocation) methods rely on distance measurements (e.g., Euclidean distance) between observations. However, these methods each present their own challenges in determining the number of clusters and the initial cluster centers. As an alternative technique, model-based clustering was recently adopted and can provide a principled statistical assessment of the practical problems that arise in applying clustering methods (McLachlan and Peel, 2000; Fraley and Raftery, 2002). Model-based clustering algorithms are not based on a distance measure but use a probability mixture model to define each cluster as a subpopulation in the dataset, with the aim of optimizing the fit between the model and the dataset. The implicit assumption of this method is that each cluster is represented by a parametric probability density, and that the entire cluster structure can be modeled by a finite mixture.
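For the Gaussian case, the finite-mixture assumption described above can be written in generic notation. The symbols here (K components, mixing proportions π_k, means μ_k, covariance matrices Σ_k, membership probabilities τ_ik) are standard textbook notation chosen for illustration and are not taken from the original text.

```latex
% Density of a K-component Gaussian mixture
f(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \phi(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Posterior probability that observation i belongs to cluster k
\tau_{ik} = \frac{\pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
                 {\sum_{j=1}^{K} \pi_j \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}
```

Each observation is typically assigned to the cluster with the largest τ_ik, while the τ_ik values themselves express how uncertain that assignment is.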
Because model-based clustering relies on a probability model, its results can be strongly affected by the dimension (number of variables) of the input dataset. The clustering structure of interest may be contained in a subset of the available variables, and some variables may be uninformative or may even hinder the detection of a reasonable clustering result (Law et al., 2004; Tadesse et al., 2005; Raftery and Dean, 2006; Joo et al., 2009; Maugis et al., 2009). It is therefore important to identify the relevant variables. Model-based clustering also provides a rigorous framework for assessing the role of each variable in the clustering process (McLachlan and Peel, 2000), and it has been shown to be a powerful tool for classification in many research fields (McLachlan and Basford, 1988; Banfield and Raftery, 1993). However, practical applications have rarely been reported in hydrochemical or geochemical studies (Templ et al., 2008).
This study was based on a hydrochemical investigation of bedrock aquifers in representative rural areas of South Korea and applied model-based clustering with a normal (Gaussian) mixture model to separate the contributions of natural and anthropogenic processes to the observed groundwater quality. To obtain reliable clustering performance, detailed hydrochemical knowledge of the aquifer was incorporated into the clustering process, and the clustering results were validated using statistical criteria. This case study demonstrated the significant influence of the dimensional structure and of the selection of optimal variables on the clustering results, and it also suggested that hydrochemical interpretation plays an important role in predicting the cluster structure in advance of the clustering analysis.
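As a rough illustration of how a Gaussian mixture separates two subpopulations, the sketch below fits a two-component, one-dimensional mixture by expectation-maximization (EM). It is a minimal sketch under stated assumptions, not the software or model form used in the study: the nitrate values are hypothetical, the function names are invented for this example, and a real analysis would use several variables and compare alternative model forms.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM (hypothetical helper).

    Returns (weights, means, variances, responsibilities).
    """
    n = len(values)
    overall_mean = sum(values) / n
    overall_var = sum((v - overall_mean) ** 2 for v in values) / n
    # Crude initialisation: equal weights, means at the data extremes.
    weights = [0.5, 0.5]
    means = [min(values), max(values)]
    variances = [overall_var, overall_var]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior membership probability of each sample.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the memberships.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # guard against a collapsing component
    return weights, means, variances, resp


# Hypothetical nitrate concentrations (mg/L): a low "background" group
# followed by a higher, possibly human-impacted group. Not data from the study.
nitrate = [0.5, 0.8, 1.1, 1.4, 0.9, 1.2, 18.0, 22.5, 25.1, 20.3, 27.4]

weights, means, variances, resp = fit_two_component_gmm(nitrate)
# Hard assignment: pick the component with the larger membership probability.
labels = [0 if r[0] >= r[1] else 1 for r in resp]

print("component means:", [round(m, 2) for m in means])
print("labels:", labels)
```

The membership probabilities in `resp` are what distinguish this approach from distance-based methods: a sample lying between the two groups would receive intermediate probabilities instead of being forced into one cluster without any measure of uncertainty.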
Study areas
Hydrochemical surveys were conducted in selected rural areas on the outskirts of Boeun and Naju cities, South Korea (Fig. 1a) where bedrock groundwater forms an important source of the domestic and agricultural water supply. In both cities, about 70% of total households receive public main water supply, while in suburban areas with active agricultural activities, water use depends on groundwater from bedrock aquifers. The total populations and population densities of Boeun and Naju cities are
General hydrochemical characteristics
The descriptive statistics of 18 physicochemical variables of 102 bedrock groundwater samples are summarized in Table 1. The measured pH conditions were mostly neutral; the mean pH was 7.0 with a small standard deviation (SD = 0.99). Total dissolved solids (TDS) values were generally low to moderate (74.0–648.3 mg/L, with a mean of 236.2 mg/L and SD of 99.2 mg/L). The TDS concentrations showed a linear relationship with the observed electrical conductivity (EC) (slope = 0.7, R² = 0.9), which ranged from
Summary and conclusions
In this study, we examined the use of diverse clustering methods to better evaluate intermixed sources and processes affecting observed groundwater chemistry, which can be a crucial basis for water management. In particular, the distinction of waters influenced by natural water–rock interactions versus anthropogenic impacts was attempted using model-based clustering of a hydrochemical dataset of bedrock groundwater (n = 102) from rural areas of South Korea. Among 28 geometrical forms available in
Acknowledgments
This work was supported by a research grant from the National Institute of Environmental Research (NIER Contract No. 20120539270-00). Partial support was also provided by the National Research Foundation of Korea Grant (University-Institute cooperation program) funded by the Korean Government (Ministry of Science, ICT & Future Planning). The first author (KH Kim) acknowledges financial support from a 2013 Korea University Grant. Prof. Rodney Grapes helped to improve an early version of this
References (66)
- et al. The chemistry of Norwegian groundwaters: I. The distribution of radon, major and minor elements in 1604 crystalline bedrock groundwaters. Sci. Total Environ. (1998)
- et al. Model-based cluster and discriminant analysis with the MIXMOD software. Comput. Stat. Data Anal. (2006)
- et al. Gaussian parsimonious clustering models. Pattern Recog. (1995)
- et al. Hydrogeochemistry of sodium-bicarbonate type bedrock groundwater in the Pocheon spa area, South Korea: water-rock interaction and hydrologic mixing. J. Hydrol. (2006)
- et al. Fluorine geochemistry in bedrock groundwater of South Korea. Sci. Total Environ. (2007)
- et al. Hydrochemical and stable isotopic assessment of nitrate contamination in an alluvial aquifer underneath a riverside agricultural field. Agric. Water Manage. (2009)
- et al. Sources and biogeochemical behavior of nitrate and sulfate in an alluvial aquifer: hydrochemical and stable isotope approaches. Appl. Geochem. (2011)
- et al. Hydrogeochemical interpretation of South Korean groundwater monitoring data using self-organizing maps. J. Geochem. Explor. (2014)
- et al. Multivariate statistical analysis of geochemical data as indicative of the hydrogeochemical evolution of groundwater in a sedimentary rock aquifer system. J. Hydrol. (2008)
- et al. Factor analytical approaches for evaluating groundwater trace element chemistry data. Anal. Chim. Acta (2003)
- Potassium adsorption ratios as an indicator for the fate of agricultural potassium in groundwater. J. Hydrol.
- Hydrochemical and multivariate statistical interpretations of spatial controls of nitrate concentrations in a shallow alluvial aquifer around oxbow lakes (Osong area, central Korea). J. Contam. Hydrol.
- Co-contamination of arsenic and fluoride in the groundwater of unconsolidated aquifers under reducing environments. Chemosphere
- Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal.
- Groundwater chemistry and water–rock interactions at Stripa. Geochim. Cosmochim. Acta
- Selection of threshold values in geochemical data using probability graphs. J. Geochem. Explor.
- Reaction path modeling of hydrogeochemical evolution of groundwater in granitic bedrocks, South Korea. J. Geochem. Explor.
- Cluster analysis applied to regional geochemical data: problems and possibilities. Appl. Geochem.
- Hydrogeochemistry in the Vouga River basin (central Portugal): pollution and chemical weathering. Appl. Geochem.
- Groundwater classification using multivariate statistical methods: Southern Ghana. J. Afr. Earth Sci.
- Standard Methods for the Examination of Water and Wastewater
- Geochemistry, Groundwater and Pollution
- Model based Gaussian and non Gaussian clustering. Biometrics
- Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell.
- Mineralogic controls on the composition of natural waters dominated by silicate hydrolysis. Am. J. Sci.
- Nitrates in Groundwater
- Hydrogeochemistry of seepage water collected within the Youngcheon diversion tunnel, Korea: source and evolution of SO4-rich groundwater in sedimentary terrain. Hydrol. Process.
- Batch dissolution of granite and biotite in water: implication for fluorine geochemistry in groundwater. Geochem. J.
- Petrography and mineral chemistry of the granitic rocks in the Poeun Sogrisan area, Korea. J. Petrol. Soc. Korea
- Statistics and Data Analysis in Geology
- Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. Ser. B (Methodol.)
- Applied Multivariate Data Analysis
Cited by (40)
Hyperspectral retrievals of suspended sediment using cluster-based machine learning regression in shallow waters
2022, Science of the Total Environment. Citation excerpt: Box plots of the physical factors according to cluster type are shown in Fig. 14 (c); all physical factors except for the bottom type slightly differed between the two clusters. Additionally, we statistically investigated this difference in physical factors to cluster types using the Mann–Whitney U test (Helsel, 1987), which is a nonparametric test of a null hypothesis based on a ranked sum, verifying the statistical difference of the significant variables in a clustered dataset (Kim et al., 2014; Rosner and Grove, 1999). The mean, standard deviation, and U test p-values of all physical factors in each cluster are listed in Table 7.
Evaluation of natural background levels of high mountain karst aquifers in complex hydrogeological settings. A Gaussian mixture model approach in the Port del Comte (SE, Pyrenees) case study
2021, Science of the Total Environment. Citation excerpt: The probabilistic GMM framework estimates the optimal number of clusters and provides for every spring the probability of belonging to these clusters (soft assignment). This approach is more interesting than the classical clustering approaches, in which the number of clusters is assumed fixed, and every spring is assigned to one and only one of the previously assumed clusters (hard assignment) (Kim et al., 2014). From a hydrochemical point of view, the soft assignment often provides the more interesting interpretation because the method reveals if one observation is influenced by several factors (Templ et al., 2008).
Streamflow forecasting using extreme gradient boosting model coupled with Gaussian mixture model
2020, Journal of Hydrology. Citation excerpt: GMM assumes that a data set follows a mixture model of probability distributions so that each cluster is represented by a parametric probability density and the entire cluster structure can be modeled by a finite mixture. GMM has been shown to be a powerful tool for clustering in many research fields (Kim et al., 2014; Niknejad et al., 2015; Qiu et al., 2019; Yang et al., 2012). However, the practical applications have rarely been reported in hydrological studies.