Elsevier

Journal of Hydrology

Volume 519, Part A, 27 November 2014, Pages 626-636

Model-based clustering of hydrochemical data to demarcate natural versus human impacts on bedrock groundwater quality in rural areas, South Korea

https://doi.org/10.1016/j.jhydrol.2014.07.055

Highlights

  • This study aims to demarcate natural versus anthropogenic impacts on groundwater.

  • Clustering with a Gaussian mixture model was conducted for hydrochemical data.

  • Better selection of optimal variables and cluster numbers is crucial.

  • Bivariate normal mixture model using nitrate and fluoride was very robust.

  • This method can be used to evaluate contamination susceptibility and background levels.

Summary

Improved evaluation of anthropogenic contamination is required to sustainably manage groundwater resources. In this study, we investigated the hydrochemical measurements of 18 parameters from a total of 102 bedrock groundwater samples from two representative rural areas in South Korea. We used model-based clustering with a normal (Gaussian) mixture model to differentiate the contributions of natural versus anthropogenic processes to the observed groundwater quality. Water samples varied in hydrochemistry from a Ca–Na–HCO3 type to a Ca–HCO3–Cl type. The former type reflected derivation of major ions largely from water–rock interactions, while the latter type recorded varying degrees of anthropogenic contamination. Among the major dissolved ions, fluoride and nitrate were shown to be good indicators of the two types, respectively. The results of model-based clustering showed that the bivariate normal mixture model, which was based on the covariance of nitrate and fluoride, was more robust than multivariate analysis, and provided better discrimination between the anthropogenic and natural groundwater groups. Model-based clustering to measure the degree of cluster membership for each sample also showed a gradual change in groundwater chemistry due to mixing between the two water groups. This study provided an example of the successful application of model-based clustering to evaluate regional groundwater quality and demonstrated that better selection of the dimensional structure (i.e., selection of optimal variables and number of clusters) based on hydrochemistry was crucial in obtaining reasonable clustering results.

Introduction

Groundwater chemistry is inherently controlled by water–rock interactions in the aquifer (Stumm and Morgan, 1996; Langmuir, 1997; Appelo and Postma, 1999). However, the natural quality of groundwater has progressively deteriorated in many countries due to diverse human impacts (Foster and Chilton, 2003; Morris et al., 2003). In South Korea, groundwater in fractured bedrock (gneisses and granitoids, which occupy ca. 70% of the Korean Peninsula) forms the most important source of water supply, especially for rural communities where surface water reservoirs are generally absent. The deterioration in the quality of bedrock groundwater has partly resulted from a lack of effective management schemes, and thus the Korean government has recently made great efforts to sustainably manage bedrock aquifers (e.g., ‘Master Plan for Managing Groundwater Resources’; MOCT and KOWACO, 2002). To properly manage available water resources, there is a need for a quantitative understanding of water quality deterioration from diffuse or non-point pollution sources, as well as the background chemistry associated with natural geochemical processes.

Through careful examination of the spatial variations in hydrochemical data, it is possible to estimate the relative contribution of natural versus anthropogenic effects on groundwater quality. Especially for large and/or complicated datasets, multivariate statistics (e.g., clustering) have been widely used for these determinations (Suk and Lee, 1999; Güler et al., 2002; Farnham et al., 2003; Güler and Thyne, 2004; Cloutier et al., 2008; Yidana, 2010). Diverse clustering methods such as fuzzy c-means (FCM) clustering, hierarchical clustering, and k-means clustering have been used to separate groundwater samples into homogeneous groups that reflect the different source contributions to groundwater chemistry.

For successful application of clustering techniques, selection of suitable algorithms for the hydrochemical dataset is important (Templ et al., 2008). Hierarchical clustering and partitioning (or iterative relocation) methods rely on distance measurements (e.g., Euclidean distance) between observations. However, these methods each present their own challenges in determining the number of clusters and initial cluster centers. As an alternative technique, model-based clustering has recently been adopted; it provides a principled statistical approach to the practical problems that arise in applying clustering methods (McLachlan and Peel, 2000; Fraley and Raftery, 2002). Model-based clustering algorithms are not based on a distance measure, but use a probability mixture model to define each cluster as a subpopulation in the dataset, with the aim of optimizing the fit between the model and the dataset. The implicit assumption of this method is that each cluster is represented by a parametric probability density, and the entire cluster structure can be modeled by a finite mixture.
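The finite-mixture assumption described above can be written compactly. The notation below (G components, mixing proportions π_k, Gaussian density φ with mean vector μ_k and covariance matrix Σ_k, and membership probability z_ik of sample i in cluster k) is the standard textbook form and is an assumption here, since this excerpt does not reproduce the authors' own equations.

```latex
\begin{align}
f(\mathbf{x}_i) &= \sum_{k=1}^{G} \pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1, \\
z_{ik} &= \frac{\pi_k \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
               {\sum_{j=1}^{G} \pi_j \, \phi(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}.
\end{align}
```

The first line is the mixture density. The second gives the posterior probability that a sample belongs to each cluster, which is what allows a graded (soft) assignment rather than a single hard label.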

Because model-based clustering is based on a probability model, its results can be strongly affected by the dimension (number of variables) of the input dataset. This suggests that the clustering structure of interest may be contained in a subset of the available variables and that some variables may be uninformative or may even hinder the detection of a reasonable clustering structure (Law et al., 2004; Tadesse et al., 2005; Raftery and Dean, 2006; Joo et al., 2009; Maugis et al., 2009). Thus, it is very important to assess which variables are relevant. Model-based clustering also provides a rigorous framework to assess the role of each variable in the clustering process (McLachlan and Peel, 2000). Model-based clustering has been shown to be a powerful tool for classification in many research fields (McLachlan and Basford, 1988; Banfield and Raftery, 1993). However, practical applications have rarely been reported in hydrochemical or geochemical studies (Templ et al., 2008).
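To make the point about variable relevance concrete, the following is a minimal sketch, not the authors' procedure and not the variable-selection methods of the cited papers. It fits one-component and two-component univariate Gaussian mixtures to each variable separately with a hand-written EM loop and compares them by BIC (here defined as -2 log L + p ln n, lower is better; some software uses the opposite sign). The data are synthetic and deterministic, and all function and variable names are hypothetical.

```python
import math
from statistics import NormalDist, fmean, pvariance


def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def e_step(data, weights, means, variances):
    """Return (log-likelihood, membership probabilities) for the current parameters."""
    loglik, resp = 0.0, []
    for x in data:
        logs = [math.log(w) + log_normal_pdf(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        top = max(logs)  # log-sum-exp trick for numerical stability
        scaled = [math.exp(v - top) for v in logs]
        total = sum(scaled)
        loglik += top + math.log(total)
        resp.append([s / total for s in scaled])
    return loglik, resp


def fit_one_gaussian(data):
    """Maximised log-likelihood of a single Gaussian."""
    mean, var = fmean(data), pvariance(data)
    return sum(log_normal_pdf(x, mean, var) for x in data)


def fit_two_gaussians(data, n_iter=200):
    """EM for a two-component mixture; returns (log-likelihood, memberships)."""
    xs = sorted(data)
    half = len(xs) // 2
    floor = 1e-3 * pvariance(xs)  # variance floor guards against collapse
    weights = [0.5, 0.5]
    means = [fmean(xs[:half]), fmean(xs[half:])]  # initialise by a median split
    variances = [max(pvariance(xs[:half]), floor),
                 max(pvariance(xs[half:]), floor)]
    for _ in range(n_iter):
        _, resp = e_step(data, weights, means, variances)
        for k in range(2):  # M-step: weighted updates of each component
            nk = max(sum(r[k] for r in resp), 1e-12)
            weights[k] = nk / len(data)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var, floor)
    return e_step(data, weights, means, variances)


def preferred_components(data):
    """Number of components (1 or 2) with the lower BIC."""
    n = len(data)
    bic_one = -2.0 * fit_one_gaussian(data) + 2 * math.log(n)   # mean, variance
    loglik_two, _ = fit_two_gaussians(data)
    bic_two = -2.0 * loglik_two + 5 * math.log(n)  # 2 means, 2 variances, 1 weight
    return 1 if bic_one <= bic_two else 2


def normal_quantiles(n, mean, sd):
    """Deterministic synthetic sample: evenly spaced quantiles of a Gaussian."""
    dist = NormalDist(mean, sd)
    return [dist.inv_cdf((i + 0.5) / n) for i in range(n)]


# Synthetic illustration only: one variable with two groups, one without.
bimodal = normal_quantiles(100, 0.0, 1.0) + normal_quantiles(100, 8.0, 1.0)
unimodal = normal_quantiles(200, 0.0, 1.0)
print("bimodal variable:", preferred_components(bimodal), "component(s)")
print("unimodal variable:", preferred_components(unimodal), "component(s)")
```

On this synthetic data the criterion should favour two components for the variable that carries group structure and one component for the variable that does not, which is the sense in which a variable can be uninformative for clustering. A real analysis would work with multivariate models and covariance structures rather than one variable at a time.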

This study was based on a hydrochemical investigation of bedrock aquifers in representative rural areas of South Korea and applied model-based clustering with a normal (Gaussian) mixture model to separate the contributions of natural and anthropogenic processes to observed groundwater quality. To obtain reliable clustering performance, detailed hydrochemical knowledge of the aquifer was incorporated into the clustering process, and the clustering results were validated using statistical criteria. This case study demonstrated the significant influence of the dimensional structure and selection of optimal variables on the clustering results and also suggested the important role of hydrochemical interpretation in better predicting the cluster structure in advance of clustering analysis.

Section snippets

Study areas

Hydrochemical surveys were conducted in selected rural areas on the outskirts of Boeun and Naju cities, South Korea (Fig. 1a) where bedrock groundwater forms an important source of the domestic and agricultural water supply. In both cities, about 70% of total households receive public main water supply, while in suburban areas with active agricultural activities, water use depends on groundwater from bedrock aquifers. The total populations and population densities of Boeun and Naju cities are

General hydrochemical characteristics

The descriptive statistics of 18 physicochemical variables of 102 bedrock groundwater samples are summarized in Table 1. The measured pH conditions were mostly neutral; the mean pH was 7.0 with a small standard deviation (SD = 0.99). Total dissolved solids (TDS) values were generally low to moderate (74.0–648.3 mg/L, with a mean of 236.2 mg/L and SD of 99.2 mg/L). The TDS concentrations showed a linear relationship with the observed electrical conductivity (EC) (slope = 0.7, R² = 0.9), which ranged from

Summary and conclusions

In this study, we examined the use of diverse clustering methods to better evaluate intermixed sources and processes affecting observed groundwater chemistry, which can be a crucial basis for water management. In particular, the distinction of waters influenced by natural water–rock interactions versus anthropogenic impacts was attempted using model-based clustering of a hydrochemical dataset of bedrock groundwater (n = 102) from rural areas of South Korea. Among 28 geometrical forms available in

Acknowledgments

This work was supported by a research grant from the National Institute of Environmental Research (NIER Contract No. 20120539270-00). Partial support was also provided by the National Research Foundation of Korea Grant (University-Institute cooperation program) funded by the Korean Government (Ministry of Science, ICT & Future Planning). The first author (KH Kim) acknowledges financial support from a 2013 Korea University Grant. Prof. Rodney Grapes helped to improve an early version of this

References (66)

  • J. Griffioen

    Potassium adsorption ratios as an indicator for the fate of agricultural potassium in groundwater

    J. Hydrol.

    (2001)
  • K.H. Kim et al.

    Hydrochemical and multivariate statistical interpretations of spatial controls of nitrate concentrations in a shallow alluvial aquifer around oxbow lakes (Osong area, central Korea)

    J. Contam. Hydrol.

    (2009)
  • S.H. Kim et al.

    Co-contamination of arsenic and fluoride in the groundwater of unconsolidated aquifers under reducing environments

    Chemosphere

    (2012)
  • C. Maugis et al.

    Variable selection in model-based clustering: a general variable role modeling

    Comput. Stat. Data Anal.

    (2009)
  • D.K. Nordstrom et al.

    Groundwater chemistry and water–rock interactions at Stripa

    Geochim. Cosmochim. Acta

    (1989)
  • A.J. Sinclair

    Selection of threshold values in geochemical data using probability graphs

    J. Geochem. Explor.

    (1974)
  • K.Y. Sung et al.

    Reaction path modeling of hydrogeochemical evolution of groundwater in granitic bedrocks, South Korea

    J. Geochem. Explor.

    (2012)
  • M. Templ et al.

    Cluster analysis applied to regional geochemical data: problems and possibilities

    Appl. Geochem.

    (2008)
  • C.H. van der Weijden et al.

    Hydrogeochemistry in the Vouga River basin (central Portugal): pollution and chemical weathering

    Appl. Geochem.

    (2006)
  • S.M. Yidana

    Groundwater classification using multivariate statistical methods: Southern Ghana

    J. Afr. Earth Sci.

    (2010)
  • APHA, AWWA, WEF

    Standard Methods for the Examination of Water and Wastewater

    (2001)
  • C.A.J. Appelo et al.

    Geochemistry, Groundwater and Pollution

    (1999)
  • J.D. Banfield et al.

    Model-based Gaussian and non-Gaussian clustering

    Biometrics

    (1993)
  • C. Biernacki et al.

    Assessing a mixture model for clustering with the integrated completed likelihood

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2000)
  • C.J. Bowser et al.

    Mineralogic controls on the composition of natural waters dominated by silicate hydrolysis

    Am. J. Sci.

    (2002)
  • L.W. Canter

    Nitrates in Groundwater

    (1997)
  • Cao, Y., 2010. Bivariant Kernel Density Estimation (V2.0). Matlab File Exchange....
  • G.T. Chae et al.

    Hydrogeochemistry of seepage water collected within the Youngcheon diversion tunnel, Korea: source and evolution of SO4-rich groundwater in sedimentary terrain

    Hydrol. Process.

    (2001)
  • G.T. Chae et al.

    Batch dissolution of granite and biotite in water: implication for fluorine geochemistry in groundwater

    Geochem. J.

    (2006)
  • W.S. Cho et al.

    Petrography and mineral chemistry of the granitic rocks in the Poeun Sogrisan area, Korea

    J. Petrol. Soc. Korea

    (1994)
  • J.C. Davis

    Statistics and Data Analysis in Geology

    (1986)
  • A.P. Dempster et al.

    Maximum likelihood from incomplete data via the EM algorithm

    J. Roy. Stat. Soc. Ser. B (Methodol.)

    (1977)
  • B. Everitt et al.

    Applied Multivariate Data Analysis

    (2001)