Using ensemble models to identify and apportion heavy metal pollution sources in agricultural soils on a local scale

doi:10.1016/j.envpol.2015.06.040

Environmental Pollution

Volume 206, November 2015, Pages 227-235

Highlights

• Ensemble models including stochastic gradient boosting and random forest are used.
• The models were verified by cross-validation and SGB performed better than RF.
• Heavy metal pollution sources on a local scale are identified and apportioned.
• Models illustrate good suitability in assessing sources in local-scale agricultural soils.
• Anthropogenic sources contributed most to soil Pb and Cd pollution in our case.

Abstract

This study aims to identify and apportion multi-source and multi-phase heavy metal pollution from natural and anthropogenic inputs using ensemble models that include stochastic gradient boosting (SGB) and random forest (RF) in agricultural soils on the local scale. The heavy metal pollution sources were quantitatively assessed, and the results illustrated the suitability of the ensemble models for the assessment of multi-source and multi-phase heavy metal pollution in agricultural soils on the local scale. The results of SGB and RF consistently demonstrated that anthropogenic sources contributed the most to the concentrations of Pb and Cd in agricultural soils in the study region and that SGB performed better than RF.

Introduction

Soil is a large and long-term sink for ubiquitous heavy metals and related compounds. In agricultural soils, the accumulation of heavy metals is a growing public concern because it threatens environmental health; elevated heavy metal uptake by crops may also affect food quality and security (Harmanescu et al., 2011, Wu et al., 2015). An important prerequisite for the control and remediation of heavy metal contaminated soils is determining the source of contamination (Lin et al., 2010, Zhou et al., 2007b). On a local scale, agricultural soils become contaminated by heavy metals accumulated from multi-phase and diverse natural and anthropogenic sources (Gellrich and Zimmermann, 2007). Heavy metals in agricultural soils primarily originate from the weathering of parent materials but can also accumulate from industrial emissions, such as mine tailings, disposal of high-metal wastes and sewage sludge, and from agricultural sources, such as livestock manure, inorganic fertilizers, lime, agrochemicals, irrigation water, atmospheric deposition and pesticides (Hu and Cheng, 2013, Khan et al., 2008, Mohammed et al., 2011). Any decision on soil quality and management measures must be based on reliable information on the extent and sources of heavy metal pollution in the given area (Zovko and Romic, 2011). Therefore, the identification and apportionment of heavy metal pollution sources in agricultural soils on the local scale is crucial. The high spatial heterogeneity of heavy metals in soils, the complexity and diversity of pollution sources and the lack of long-term monitoring data make it difficult to assess multi-source and multi-phase heavy metal pollution in agricultural soils on a local scale; exploring suitable methods to address this challenge is imperative. To this end, models can serve as powerful tools for source identification and apportionment.

There are two competing modeling methods: the traditional approach (build one robust model) and the more recent ensemble learning approach (build many models and average the results). Numerous reports have shown that multivariate analysis and GIS are useful tools for the identification of probable pollution sources and the potential risks of heavy metals (Facchinelli et al., 2001). For example, multivariate analyses that have been applied exclusively to predict soil pollution sources include principal component analysis (Micó et al., 2006, Yongming et al., 2006), clustering analysis (Bhuiyan et al., 2010, Soares et al., 1999) and discriminant analysis (Qishlaqi and Moore, 2007). GIS-based models together with multivariate analysis have also been developed for mapping and evaluating the sources and distributions of heavy metal contaminants, such as those of Fragkos et al. (1998), Zhou et al. (2007a) and Facchinelli et al. (2001). Stochastic models, such as the conditional inference tree and the finite mixture distribution model, have been used to differentiate the effects and contributions of natural background and human activities across large-scale regions (Hu and Cheng, 2013, Lin et al., 2010). These modeling approaches are referred to as “traditional approaches”. Conventional multivariate analysis can help identify the pollution sources and distinguish natural versus anthropogenic contributions based on associations. However, such analyses are sensitive to outliers and to the non-normal distributions of geochemical datasets; the probability distributions of all variables must therefore be examined, and any transformation applied to the data alters the original data (Micó et al., 2006). GIS methodologies can help predict the point sources that are responsible for particular areas of contamination. The accuracy of such maps depends fundamentally on the accuracy of the dispersion model. This model includes deductive components for assessing the sources of heavy metals, which lead to low prediction accuracy and large uncertainty (Fragkos et al., 1998). The common methods combining multivariate analysis, geo-statistics and GIS can qualitatively predict the potential pollution sources of heavy metals, but are unable to quantitatively apportion the contributions from the different sources. Furthermore, models for the identification and apportionment of heavy metal pollution sources have seldom been established at the local scale. The ensemble models presented in this study are superior in their quantitative assessment of the complex sources of multi-phase heavy metal pollution in agricultural soils on a local scale.

Stochastic gradient boosting (SGB; Friedman, 2006) is a recent advance in ensemble methods. This technique has emerged as one of the most powerful methods for predictive data mining in recent years (Hastie et al., 2009). SGB builds trees iteratively, fitting each new tree along the gradient descent of the loss function so that it produces the greatest increase in model accuracy (Friedman, 2001). Even though SGB models are complex, their predictive performance is superior to that of most traditional models (Friedman, 2006). The application of SGB to the interpretation of complex spatial patterns of ecological and remote sensing data has gained increasing attention in recent years (De'ath, 2007, Lawrence et al., 2004). To date, there have been no published applications of SGB in environmental soil science. SGB was used in the present study for the first time to identify and apportion the multi-source and multi-phase pollution from cadmium (Cd) and lead (Pb) in agricultural soils at the local scale. Interaction effects between predictors were also examined to make variable selection more reliable. The ensemble-based random forest (RF) method was adopted as a supplemental tool to assess the diverse sources and their importance. In a random forest, each node is split using the best of a subset of predictors that are randomly chosen at that node. This somewhat counterintuitive strategy performs very well compared to many other data mining techniques, including discriminant analysis, support vector machines and neural networks, and is robust against over-fitting (Breiman, 2001). In addition, it is very user-friendly in the sense that it has only two parameters (the number of variables in the random subset at each node and the number of trees in the forest) (Hothorn et al., 2006). Thus, RF was employed as a robust tool for comparative analysis in this study. Our case study was located in Dongtang Township in the north of Guangdong Province, China, which contains the largest lead and zinc mining and smelting base in Asia (Wang et al., 2012); children living there reportedly had considerably high blood Pb levels (Van Kerckhove, 2012).
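
The boosting mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes squared-error loss, uses single-split regression stumps as base learners, draws a random half of the samples at each iteration (the "stochastic" part), and runs on synthetic data. The predictor names (`dist_to_smelter_km`, `fertilizer_rate`, `soil_ph`), the response function and all parameter values are hypothetical and chosen only for illustration. The relative influence of each predictor is taken here as its share of the total squared-error reduction over all splits, one common way to rank candidate sources.

```python
import random


def fit_stump(X, resid, rows):
    """Find the single split that most reduces squared error on `rows`.

    Returns (feature, threshold, left_mean, right_mean, gain) or None.
    """
    n = len(rows)
    total = sum(resid[i] for i in rows)
    best = None
    for j in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += resid[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # no threshold separates equal values
            right_sum = total - left_sum
            # Squared-error reduction relative to predicting one mean
            gain = left_sum**2 / k + right_sum**2 / (n - k) - total**2 / n
            if best is None or gain > best[4]:
                best = (j, (lo + hi) / 2, left_sum / k, right_sum / (n - k), gain)
    return best


def fit_sgb(X, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Stochastic gradient boosting with stumps (illustrative settings)."""
    rng = random.Random(seed)
    n = len(y)
    f0 = sum(y) / n                      # initial constant prediction
    pred = [f0] * n
    stumps = []
    influence = [0.0] * len(X[0])
    for _ in range(n_trees):
        # For squared-error loss the negative gradient is the residual
        resid = [y[i] - pred[i] for i in range(n)]
        # Random subsample without replacement for this iteration
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        stump = fit_stump(X, resid, rows)
        if stump is None:
            break
        j, thr, left, right, gain = stump
        influence[j] += gain
        stumps.append((j, thr, learning_rate * left, learning_rate * right))
        for i in range(n):
            pred[i] += learning_rate * (left if X[i][j] <= thr else right)
    total = sum(influence)
    rel = [100.0 * v / total for v in influence] if total > 0 else influence
    return f0, stumps, rel


def predict(model, x):
    f0, stumps, _ = model
    return f0 + sum(l if x[j] <= thr else r for j, thr, l, r in stumps)


# Synthetic data only: hypothetical predictors and response, not study data
rng = random.Random(42)
names = ["dist_to_smelter_km", "fertilizer_rate", "soil_ph"]
X = [[rng.uniform(0, 10), rng.uniform(0, 1), rng.uniform(4, 8)] for _ in range(200)]
y = [60.0 / (1.0 + x[0]) + 5.0 * x[1] + rng.gauss(0, 1) for x in X]

model = fit_sgb(X, y)
for name, share in zip(names, model[2]):
    print(f"{name}: {share:.1f}% relative influence")
```

In practice a library implementation with deeper trees would normally be preferred; the stump-based version only makes the mechanics visible. A random forest differs mainly in that its trees are grown independently on bootstrap samples, with a random subset of predictors considered at each node, and are then averaged rather than added sequentially.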

Section snippets

Field sampling and chemical analyses

The study region (Fig. 1) is bounded by latitudes 25°1′7″ N and 25°9′8″ N and longitudes 113°32′46″ E and 113°43′46″ E in northern Guangdong Province, covering more than 1.92 × 10² km² of land surface. A total of 250 surface soil samples (0–20 cm deep) from agricultural land were collected, along with corresponding samples of surface water (10–15 cm below the water surface) and atmosphere. The heavy metal concentrations (Cd and Pb) in the soils were measured following the

Descriptive statistics and spatial patterns of heavy metal contents

Descriptive statistics of the heavy metal concentrations in the agricultural soils of Dongtang are provided in the Supplementary material. The heavy metals in soils originate from several inputs, including the natural background, mining and smelting activities, atmospheric deposition, agrochemicals, water inflow and socio-economic activities. Because we cannot assess the source contributions through concentration measurements alone, spatial distribution maps of Pb and Cd in air, soil and

Model validation and reliability

In this paper, an ensemble-based framework was used to estimate the pollution sources of soil heavy metals on the local scale. The framework is based on stochastic gradient boosting, which is intended as a powerful alternative to conventional methods such as clustering analysis and artificial neural networks. By applying a gradient-descent algorithm, SGB analysis allowed the parameters of the models to vary in function space and established considerably stronger relationships with soil heavy

Conclusions

The sources of heavy metal pollution in soil were quantitatively assessed on the local scale using SGB and RF ensemble models. The models were verified using rigorous cross-validation procedures. The ensemble models produced good results for the multi-source and multi-phase heavy metal pollution in agricultural soils at the local scale. The results of SGB and RF consistently showed that anthropogenic sources contributed the most to the concentrations of Pb and Cd in the agricultural soils of
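
A k-fold cross-validation procedure of the kind mentioned above might look like the following sketch. The number of folds (k = 10), the synthetic data and the two stand-in models (a mean predictor and a one-variable least-squares line) are assumptions made for illustration; the snippet does not state the exact procedure or settings used in the study. Any pair of `fit` and `predict` functions, including SGB or RF models, could be compared in the same way.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_rmse(fit, predict, X, y, k=10, seed=0):
    """Root-mean-square error over held-out predictions from k folds."""
    sq_err = 0.0
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        sq_err += sum((predict(model, X[i]) - y[i]) ** 2 for i in held_out)
    return (sq_err / len(y)) ** 0.5


# Stand-in model 1: always predict the training mean
def fit_mean(X, y):
    return sum(y) / len(y)


def predict_mean(model, x):
    return model


# Stand-in model 2: least-squares line on the first predictor
def fit_line(X, y):
    n = len(y)
    mx = sum(x[0] for x in X) / n
    my = sum(y) / n
    sxx = sum((x[0] - mx) ** 2 for x in X)
    sxy = sum((x[0] - mx) * (t - my) for x, t in zip(X, y))
    slope = sxy / sxx
    return (my - slope * mx, slope)


def predict_line(model, x):
    return model[0] + model[1] * x[0]


# Synthetic data only, not study data
rng = random.Random(1)
X = [[rng.uniform(0, 10)] for _ in range(120)]
y = [3.0 * x[0] + rng.gauss(0, 1) for x in X]

rmse_mean = cross_val_rmse(fit_mean, predict_mean, X, y)
rmse_line = cross_val_rmse(fit_line, predict_line, X, y)
print(f"mean predictor RMSE: {rmse_mean:.2f}")
print(f"linear model RMSE:   {rmse_line:.2f}")
```

Because every sample is predicted only by a model that never saw it during fitting, the resulting error is a fairer basis for comparing two models than their training error.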

Acknowledgments

The current work was financially supported by the National Natural Science Foundation of China (41330857), the Guangdong Province Foundation (CSJ143356) and the “863” Program (2013AA06A209).

References (50)

M.A. Bhuiyan et al. Heavy metal pollution of coal mine-affected agricultural soils in the northern part of Bangladesh. J. Hazard. Mater. (2010)
A. Facchinelli et al. Multivariate statistical and GIS-based approach to identify heavy metal sources in soils. Environ. Pollut. (2001)
M. Gellrich et al. Investigating the regional-scale pattern of agricultural land abandonment in the Swiss mountains: a spatial statistical modelling approach. Landsc. Urban Plan. (2007)
S. Khan et al. Health risks of heavy metals in contaminated soils and food crops irrigated with wastewater in Beijing, China. Environ. Pollut. (2008)
R. Lawrence et al. Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens. Environ. (2004)
Y.P. Lin et al. Combining a finite mixture distribution model with indicator kriging to delineate and map the spatial patterns of soil heavy metal pollution in Chunghua County, central Taiwan. Environ. Pollut. (2010)
C. Micó et al. Assessing heavy metal sources in agricultural soils of an European Mediterranean area by multivariate analysis. Chemosphere (2006)
G.G. Moisen et al. Predicting tree species presence and basal area in Utah: a comparison of stochastic gradient boosting, generalized additive models, and tree-based methods. Ecol. Model. (2006)
Y.S. Shih et al. Variable selection bias in regression trees with constant fits. Comput. Stat. Data Anal. (2004)
H. Soares et al. Sediments as monitors of heavy metal contamination in the Ave river basin (Portugal): multivariate analysis of data. Environ. Pollut. (1999)
S. Wong et al. Heavy metals in agricultural soils of the Pearl River Delta, South China. Environ. Pollut. (2002)
Q. Wu et al. Heavy metal contamination of soil and water in the vicinity of an abandoned e-waste recycling site: implications for dissemination of heavy metals. Sci. Total Environ. (2015)
H. Yongming et al. Multivariate analysis of heavy metal contamination in urban dusts of Xi'an, Central China. Sci. Total Environ. (2006)
F. Zhou et al. Spatial distribution of heavy metals in Hong Kong's marine sediments and their human impacts: a GIS-based chemometric approach. Mar. Pollut. Bull. (2007)
J.-M. Zhou et al. Soil heavy metal pollution around the Dabaoshan mine, Guangdong Province, China. Pedosphere (2007)
L. Breiman. Using Adaptive Bagging to Debias Regressions (1999)
L. Breiman. Random forests. Mach. Learn. (2001)
L. Breiman. Manual on Setting up, Using, and Understanding Random Forests V3.1 (2002)
G. De'ath. Boosted trees for ecological modeling and prediction. Ecology (2007)
R. Diaz-Uriarte et al. Gene selection and classification of microarray data using random forest. BMC Bioinforma. (2006)
C. Donisa et al. Heavy metal pollution by atmospheric transport in natural soils from the northern part of eastern Carpathians. Water Air Soil Pollut. (2000)
A.H. Fielding et al. A review of methods for the assessment of prediction errors in conservation presence/absence models. Environ. Conserv. (1997)
C. Fragkos et al. GIS Techniques for Mapping and Evaluating Sources and Distribution of Heavy Metal Contaminants (1998)
J.H. Friedman. Stochastic gradient boosting. Comput. Stat. Data Anal. (1999)
J.H. Friedman. Greedy function approximation: a gradient boosting machine. Ann. Stat. (2001)
