Using ensemble models to identify and apportion heavy metal pollution sources in agricultural soils on a local scale
Introduction
Soil is a large and long-term sink for ubiquitous heavy metals and related compounds. In agricultural soils, the accumulation of heavy metals is a growing public concern because it threatens environmental health; elevated heavy metal uptake by crops may also affect food quality and security (Harmanescu et al., 2011, Wu et al., 2015). An important prerequisite in the control and remediation of heavy metal-contaminated soils is determining the source of contamination (Lin et al., 2010, Zhou et al., 2007b). On a local scale, agricultural soils become contaminated by accumulated heavy metals released from multi-phase and diverse natural and anthropogenic sources (Gellrich and Zimmermann, 2007). Heavy metals in agricultural soils primarily originate from the weathering of parent materials but can also be accumulated from industrial emissions, such as mine tailings, disposal of high metal wastes and sewage sludge, and agricultural sources, such as livestock manure, inorganic fertilizers, lime, agrochemicals, irrigation water, atmospheric deposition and pesticides (Hu and Cheng, 2013, Khan et al., 2008, Mohammed et al., 2011). Every decision regarding the application of soil quality and management measures must be based on reliable information on the extent and sources of heavy metal pollution in the given area (Zovko and Romic, 2011). Therefore, the identification and apportionment of heavy metal pollution sources in agricultural soils on the local scale are crucial. The high spatial heterogeneity of heavy metals in soils, the complexity and diversity of pollution sources and the lack of long-term monitoring data have challenged researchers to assess multi-source and multi-phase heavy metal pollution in agricultural soils on a local scale; exploring suitable methods to address this challenge is imperative. To this end, models can serve as powerful tools for source identification and apportionment.
There are two competing modeling methods: the traditional approach (build one robust model) and the more recent ensemble learning approach (build many models and average the results). Numerous reports have shown that multivariate analysis and GIS are useful tools for the identification of probable pollution sources and the potential risks of heavy metals (Facchinelli et al., 2001). For example, multivariate analyses that have been applied exclusively to predict soil pollution sources include principal component analysis (Micó et al., 2006, Yongming et al., 2006), clustering analysis (Bhuiyan et al., 2010, Soares et al., 1999) and discriminant analysis (Qishlaqi and Moore, 2007). GIS-based models together with multivariate analysis have also been developed for mapping and evaluating the sources and distributions of heavy metal contaminants, such as those in Fragkos et al. (1998), Zhou et al. (2007a) and Facchinelli et al. (2001). Stochastic models, such as the conditional inference tree and the finite mixture distribution model, have been used to differentiate the effects and contributions of natural background and human activities across large-scale regions (Hu and Cheng, 2013, Lin et al., 2010). These modeling approaches are referred to as “traditional approaches”. Conventional multivariate analysis can help identify the pollution sources and distinguish natural versus anthropogenic contributions based on associations. However, such analyses are sensitive to outliers and to the non-normal distributions of geochemical datasets; examining the probability distributions of all variables is essential, and transforming the data consequently changes the original data (Micó et al., 2006). GIS methodologies can help predict the point sources that are responsible for particular areas of contamination. The accuracy of such maps depends fundamentally on the accuracy of the dispersion model, which includes deductive components for assessing the sources of heavy metals that lead to low prediction accuracy and large uncertainty (Fragkos et al., 1998). The common methods combining multivariate analysis, geo-statistics and GIS can qualitatively predict the potential pollution sources of heavy metals, but are unable to quantitatively apportion the contributions from the different sources. Furthermore, models for the identification and apportionment of heavy metal pollution sources have seldom been established at the local scale. The ensemble models presented in this study are superior in their quantitative assessment of the complex sources of multi-phase heavy metal pollution in agricultural soils on a local scale.
Stochastic gradient boosting (SGB; Friedman, 2006) is a recent advance in ensemble methods and has emerged as one of the most powerful methods for predictive data mining in recent years (Hastie et al., 2009). SGB constructs trees iteratively, with each new tree fitted along the direction of steepest descent of the loss function, which produces the greatest increase in model accuracy at each step (Friedman, 2001). Even though SGB models are complex, their predictive performance is superior to that of most traditional models (Friedman, 2006). The application of SGB to the interpretation of complex spatial patterns of ecological and remote sensing data has gained increasing attention in recent years (De'ath, 2007, Lawrence et al., 2004). To our knowledge, there have been no published applications of SGB in environmental soil science. In the present study, SGB was used for the first time to identify and apportion the multi-source and multi-phase pollution from cadmium (Cd) and lead (Pb) in agricultural soils at the local scale. The interaction effects between predictors were also detected to make variable selection more reliable. The ensemble-based random forest (RF) method was adopted as a supplemental tool to assess the diverse sources and their importance. In a random forest, each node is split using the best of a subset of predictors that are randomly chosen at that node. This somewhat counterintuitive strategy performs very well compared to many other data mining techniques, including discriminant analysis, support vector machines and neural networks, and is robust against over-fitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest) (Hothorn et al., 2006). Thus, RF was employed as a robust tool for comparative analysis in this study.
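As an illustration of the boosting procedure described above, the following minimal Python sketch fits a stochastic gradient boosting model with single-split regression trees (stumps) under squared-error loss. It is not the implementation used in this study: the predictor names, the synthetic data and all parameter values (`n_trees`, `learning_rate`, `subsample`) are assumptions chosen for illustration. Each iteration fits a stump to the current residuals (the negative gradient of the squared-error loss) on a random subsample drawn without replacement, and the split gains are accumulated per predictor as a simple share-of-improvement measure.

```python
import random

def fit_stump(X, residuals, rows):
    """Best single-split regression stump for `residuals` on the sampled rows.

    Returns (gain, feature, threshold, left_mean, right_mean), where gain is
    the reduction in squared error achieved by the split.
    """
    r = [residuals[i] for i in rows]
    mean = sum(r) / len(r)
    sse_before = sum((v - mean) ** 2 for v in r)
    best = (0.0, 0, float("inf"), mean, mean)  # fallback: constant fit, no split
    for j in range(len(X[0])):
        values = sorted({X[i][j] for i in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0  # candidate threshold between distinct values
            left = [residuals[i] for i in rows if X[i][j] <= t]
            right = [residuals[i] for i in rows if X[i][j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if sse_before - sse > best[0]:
                best = (sse_before - sse, j, t, lm, rm)
    return best

def sgb_fit(X, y, n_trees=100, learning_rate=0.1, subsample=0.5, seed=0):
    """Fit boosted stumps; return (model, per-predictor gain shares in percent)."""
    rng = random.Random(seed)
    n = len(y)
    f0 = sum(y) / n                      # initial constant prediction
    pred = [f0] * n
    stumps = []
    gains = [0.0] * len(X[0])
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y[i] - pred[i] for i in range(n)]
        # "Stochastic" step: each stump sees a random subsample of the rows.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        gain, j, t, lm, rm = fit_stump(X, residuals, rows)
        gains[j] += gain
        stumps.append((j, t, lm, rm))
        for i in range(n):               # shrunken update of all predictions
            pred[i] += learning_rate * (lm if X[i][j] <= t else rm)
    total = sum(gains) or 1.0
    return (f0, learning_rate, stumps), [100.0 * g / total for g in gains]

def sgb_predict(model, x):
    """Predict one observation from a fitted model."""
    f0, rate, stumps = model
    return f0 + rate * sum(lm if x[j] <= t else rm for j, t, lm, rm in stumps)

# Synthetic demonstration (assumed data, not measurements from the study area).
rng = random.Random(42)
names = ["dist_to_smelter", "fertilizer_rate", "unrelated"]  # hypothetical predictors
X = [[rng.random() for _ in names] for _ in range(120)]
y = [4.0 * x[0] + 1.0 * x[1] + rng.gauss(0.0, 0.1) for x in X]

model, influence = sgb_fit(X, y)
mean_y = sum(y) / len(y)
baseline_mse = sum((v - mean_y) ** 2 for v in y) / len(y)
train_mse = sum((sgb_predict(model, x) - v) ** 2 for x, v in zip(X, y)) / len(y)
for name, share in zip(names, influence):
    print(f"{name}: {share:.1f}%")
print(f"baseline MSE {baseline_mse:.3f}, boosted MSE {train_mse:.3f}")
```

In this toy setting the first predictor is constructed to dominate the response, so its share of the total split gain should be the largest. On real soil data, such shares would depend on the predictors supplied and on the tuning parameters.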
Our case study was located in Dongtang Township in the north of Guangdong Province, China, which contains the largest lead and zinc mining and smelting base in Asia (Wang et al., 2012); children living there reportedly had considerably high blood Pb levels (Van Kerckhove, 2012).
Field sampling and chemical analyses
The study region (Fig. 1) is bounded by the latitudes of 25°1′7″ N and 25°9′8″ N and the longitudes of 113°32′46″ E and 113°43′46″ E in northern Guangdong Province, covering more than 1.92 × 10² km² of land surface. A total of 250 samples of surface soils (0–20 cm deep) with agricultural use were collected along with corresponding samples of surface water (10–15 cm below the water surface) and atmosphere. The heavy metal concentrations (Cd and Pb) in the soils were measured following the
Descriptive statistics and spatial patterns of heavy metal contents
Descriptive statistics of the heavy metal concentrations in the agricultural soils of Dongtang are described in the Supplementary material. The heavy metals in soils originate from several inputs, including the natural background, mining and smelting activities, atmospheric deposition, agrochemicals, water inflow and socio-economic activities. Because we cannot assess the source contributions through concentration measurements alone, spatial distribution maps of Pb and Cd in air, soil and
Model validation and reliability
In this paper, an ensemble-based framework was used to estimate the pollution sources of soil heavy metals on the local scale. This framework, referred to as the stochastic gradient boosting method, is intended to be a powerful alternative to conventional methods, such as clustering analysis and artificial neural networks. By applying a gradient-descent algorithm, SGB analysis allowed the parameters of the models to vary in function space and established considerably stronger relationships with soil heavy
Conclusions
The sources of heavy metal pollution in soil were quantitatively assessed on the local scale using SGB and RF ensemble models. The models were verified using rigorous cross-validation procedures. The ensemble models produced good results for the multi-source and multi-phase heavy metal pollution in agricultural soils at the local scale. The results of SGB and RF consistently showed that anthropogenic sources contributed the most to the concentrations of Pb and Cd in the agricultural soils of
Acknowledgments
The current work was financially supported by the National Natural Science Foundation of China (41330857), the Guangdong Province Foundation (CSJ143356) and the “863” Program (2013AA06A209).
References (50)
- et al. Heavy metal pollution of coal mine-affected agricultural soils in the northern part of Bangladesh. J. Hazard. Mater. (2010)
- et al. Multivariate statistical and GIS-based approach to identify heavy metal sources in soils. Environ. Pollut. (2001)
- et al. Investigating the regional-scale pattern of agricultural land abandonment in the Swiss mountains: a spatial statistical modelling approach. Landsc. Urban Plan. (2007)
- et al. Health risks of heavy metals in contaminated soils and food crops irrigated with wastewater in Beijing, China. Environ. Pollut. (2008)
- et al. Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens. Environ. (2004)
- et al. Combining a finite mixture distribution model with indicator kriging to delineate and map the spatial patterns of soil heavy metal pollution in Chunghua County, central Taiwan. Environ. Pollut. (2010)
- et al. Assessing heavy metal sources in agricultural soils of an European Mediterranean area by multivariate analysis. Chemosphere (2006)
- et al. Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model. (2006)
- et al. Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. (2004)
- et al. Sediments as monitors of heavy metal contamination in the Ave river basin (Portugal): multivariate analysis of data. Environ. Pollut. (1999)