Chemical Engineering Journal

Volume 223, 1 May 2013, Pages 747-754

Machine learning models for predicting PAHs bioavailability in compost amended soils

https://doi.org/10.1016/j.cej.2013.02.122

Highlights

  • Machine learning (ML) predicts PAH bioavailability in compost-amended soil.

  • ML links soil/compost properties with bioavailability and risk assessment.

  • ML allows quick estimation of the end point of long-term bioremediation activities.

  • Complex interactions between soil and compost highlight the superiority of ML models.

Abstract

Compost addition to polluted soils is a strategy for waste reuse and soil remediation, while bioavailability is a key parameter for environmental assessment. Empirical data from an 8-month microcosm experiment were used to assess the ability and performance of six machine learning (ML) models to predict temporal bioavailability changes of 16 polycyclic aromatic hydrocarbons (PAHs) in contaminated soils amended with compost. The models included multilayer perceptrons (MLPs), radial basis function (RBF), support vector regression (SVR), M5 model tree (M5P), M5 rule (M5R) and linear regression (LR). Overall, the performance of the six models, determined by 10-fold cross-validation, was ranked as follows: RBF > M5P > SVR > MLP > M5R > LR. Results further demonstrated that the ML models successfully identified the relative importance of each variable (i.e. incubation time, organic carbon content, soil moisture content, nutrient levels) on the temporal bioavailability change of individual PAHs. Such models can potentially be useful for predicting the concentration of a wide range of pollutants in soils, which could help reduce on-site chemical monitoring and support decision making on remediation end points and risk assessment.

Introduction

An increasing amount of land in the UK is classified as brownfield, much of which comprises highly disturbed soils with diffuse sources of pollution and highly irregular contaminant distributions. There is continued interest in the possibility of using brownfield and contaminated sites for amenity and non-food crops such as biomass [1]. However, establishment of vegetation on these sites is poor due to nutrient deficiency, phytotoxicity of contaminants and poor soil physical conditions [2]. Addition of mature composts, which is beneficial to soils in terms of physical properties, nutrient availability and microbial activity, has the potential to promote plant development and restore degraded land to productive use [3]. Yet the influence of adding large amounts of compost to contaminated soils, especially on the fate of polycyclic aromatic hydrocarbons (PAHs), remains poorly studied. Adding compost to PAH-contaminated soils can significantly affect the soil matrix–PAH interactions and therefore change the biodegradation and bioavailability of PAHs in the soil [4]. In the UK and increasingly across the world, there is a demand for promoting the development of the bioavailability concept in soil assessment [5]. In addition, remediation end points are shifting from being based on the total concentration of the chemicals of concern to the concentration likely to pose significant risks, i.e. the bioavailable fraction [6].

Over recent decades, a number of bioassay protocols have been developed for estimating the bioavailable concentration of PAHs. However, the recently developed chemical-based approaches are expected to replace time-consuming biological assays [5]. The hydroxypropyl-β-cyclodextrin (HPCD) extraction, one of the most commonly used chemical approaches, has been successfully applied to predict microbial degradation of a range of PAHs under laboratory [7], [8], [9], [10] and field conditions [11]. Although it has been widely recognised that bioavailability is significantly influenced by soil properties [12], our understanding of the interactions between soil properties and organic contaminants and the prediction of their bioavailability using HPCD extraction is still limited. This study focuses on mathematically linking bioavailable concentration and soil properties with low measuring and computing effort. To the best of our knowledge, there is still no model capable of forecasting PAH bioavailability spatially and temporally, owing to the following challenges: (i) lack of substantive bioavailability data for various PAHs in authentic contaminated soils [5]; (ii) conventional models often assume a specific form of mathematical equation, with statistical regression used to determine the unknown parameters in the equation, and to date there is no empirical equation because of the lack of knowledge on the influence of compost amendment on PAH bioavailability [13]; (iii) statistical models rapidly become complex because of the highly non-linear relationships between multiple variables, which makes it difficult to develop a universal form of model that could be employed to capture the features of compost-amended soils.

Alternatively, machine learning (ML) is a data based technique allowing computers to ‘learn’ and ‘recognise’ the patterns of the empirical data [14]. The core of learning is to generalise from its experience via inductive inference, which means to distinguish the given data based on their different patterns, extract from the data something more general and make useful decisions in new cases. The process to achieve pattern recognition is referred to as ‘training’. A training data set is used to train various models by minimising an error function. The performance of each model is then compared by evaluating the error function using an independent validation data set. The model having the smallest error with respect to the validation set is selected. The performance of the selected model is further confirmed by error evaluation on a third independent set of data (test set), after which the model is ready for use to predict the output when presented with new input variables [15].
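The train/validation/test procedure described above can be illustrated with a minimal sketch. It is not the workflow of this study: the data are synthetic (a hypothetical exponential decay of a concentration over 8 months), and the three candidate models (`fit_mean`, `fit_linear`, `fit_knn`) are simple stand-ins chosen so the example runs with the Python standard library only.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_linear(xs, ys):
    """Candidate: one-variable ordinary least squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k=3):
    """Candidate: average the targets of the k nearest training points."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict

def evaluate(model, data):
    """RMSE of a fitted model on a list of (x, y) pairs."""
    xs, ys = zip(*data)
    return rmse(ys, [model(x) for x in xs])

# Synthetic, hypothetical data: concentration decaying over 0-8 months plus noise.
random.seed(0)
data = []
for _ in range(100):
    t = random.uniform(0.0, 8.0)
    data.append((t, 10.0 * math.exp(-0.3 * t) + random.gauss(0.0, 0.3)))
random.shuffle(data)
train, val, test = data[:60], data[60:80], data[80:]

# 1) Train every candidate on the training set only.
candidates = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
train_x, train_y = zip(*train)
models = {name: fit(train_x, train_y) for name, fit in candidates.items()}

# 2) Compare candidates on the independent validation set and keep the best.
val_errors = {name: evaluate(model, val) for name, model in models.items()}
selected = min(val_errors, key=val_errors.get)

# 3) Confirm the selected model once on the untouched test set.
test_error = evaluate(models[selected], test)
print(f"selected={selected}, validation RMSE={val_errors[selected]:.3f}, test RMSE={test_error:.3f}")
```

The test set is touched only once, after selection, so its error is not biased by the choice among candidates.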

ML has found a wide spectrum of applications in machine perception, syntactic pattern recognition, DNA sequence classification and bioinformatics [14]. Recently ML, especially artificial neural network (ANN) models, has been used to forecast environmental chemical issues such as biosorption of Zn, As and Cu [16], [17], [18], formation and emissions of PAHs during combustion processes [19], [20], biochemical oxygen demand in wetlands [21] and wastewater treatment plants [22], and separation of toluene and n-heptane mixtures [23]. However, ANN models have been criticised for their vulnerability to over-fitting (estimating a smaller error than the true error).

In contrast, support vector regression (SVR) is a method with excellent generalisation performance that can predict accurately on new data, with the test set providing an unbiased estimate of the generalisation error [24]. Most recently, it was used to assess chemical biodegradability in the environment with predictive accuracies outperforming the commonly used statistical methods [25]. In addition, M5 model trees and M5 model rules are algorithms that learn efficiently and can tackle high-dimensional tasks, expressing the result as simple rules that humans can understand [26], [27].

One major concern for environmental scientists about these methods is that they have been labelled as ‘black box’ approaches because they provide little explanatory insight into the relative influences of the independent variables in the prediction process. Generally, researchers have reported the high performance of ML techniques but not commented on the underlying relationship between the input variables and the modelled output [28], [29]. In recent studies, approaches were developed to interpret the contribution of variables to MLP [30] and SVR [28], [29] models, which were also used in this study, providing a transparent approach for the environmental community.
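One generic way to look inside a 'black box' model is permutation importance: shuffle one input variable at a time and record how much the prediction error grows. The sketch below illustrates that idea only; it is not the interpretation method of [28], [29] or [30]. The data are synthetic, the variable names (`time`, `org_c`, `noise`) are hypothetical, and a small k-nearest-neighbour regressor stands in for the trained model.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def knn_predict(train_x, train_y, row, k=5):
    """Average the targets of the k training rows closest to `row`."""
    def sq_dist(other):
        return sum((a - b) ** 2 for a, b in zip(other, row))
    nearest = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

def permutation_importance(predict, rows, ys, n_repeats=5, seed=0):
    """Mean increase in RMSE when each input column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = rmse(ys, [predict(r) for r in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between column j and the target
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(rmse(ys, [predict(r) for r in shuffled]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic, hypothetical data: the target depends strongly on "time",
# weakly on "org_c", and not at all on "noise".
random.seed(1)
names = ["time", "org_c", "noise"]
rows = [tuple(random.random() for _ in names) for _ in range(200)]
ys = [5.0 * r[0] + 1.0 * r[1] + random.gauss(0.0, 0.1) for r in rows]
train_x, train_y = rows[:150], ys[:150]
test_x, test_y = rows[150:], ys[150:]

def predict(row):
    return knn_predict(train_x, train_y, row)

scores = dict(zip(names, permutation_importance(predict, test_x, test_y)))
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: RMSE increase {score:.3f}")
```

Because the procedure only needs a prediction function, the same loop could in principle be applied to any of the model types discussed here; the scores are computed on held-out rows so they reflect predictive rather than fitted behaviour.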

The objectives of this study were to (i) link the bioavailable concentration of 16 PAHs with fundamental soil physicochemical properties and (ii) determine whether ML models could assist in predicting PAH bioavailability in compost-amended soils and potentially help decision making when dealing with contaminated land risk assessment and remediation end points.

Experiments

Empirical data from an 8-month microcosm experiment were used to assess the performance of six ML models to predict temporal changes of PAH bioavailability. Details of the measurement of soil and compost properties (Table 1), compost addition and incubation process, and the determination of PAH concentrations are available in our previous study [4].

Briefly, three soils (one diesel spiked soil and two soils genuinely contaminated by coal tar and coal ash, respectively) were amended with two

Overall performance of each model

An overview of the independent variables used for each ML model is presented in Table 1. The initial total PAH concentration was characterised as the input parameter with the greatest coefficient of variation followed by the proportion of silt in the soil texture, time, soil organic carbon content (OrgC) and nutrient levels. Using the optimal parameters (Table 2), the estimated RMSE for each model is presented in Table 3. Results indicated that the performance of the ML models was between 28%

Conclusions

As ML is an assumption-free, data-based approach, it distinguishes the data pattern without relying on prior knowledge of bioavailability, which makes its predictions more realistic than those of conventional statistical methods. Under our experimental conditions, the overall performance ranking of the ML models was as follows: RBF > M5P > SVR > MLP > M5R > LR. However, the M5P model was recommended for future application due to its simple and understandable rules. The superiority of ML models to capture the highly

Acknowledgements

This research was supported by National Hi-Technology Research & Development Program of China (2009AA063102), Program for Changjiang Scholars and Innovative Research Team in University (IRT0936), and Municipal Natural Science Foundation of Tianjin (No. 12JCQNJC05300).

References

  • A. Duran et al.

    Simulation of atmospheric PAH emissions from diesel engines

    Chemosphere

    (2001)
  • F. Inal

    Artificial neural network predictions of polycyclic aromatic hydrocarbon formation in premixed n-heptane flames

    Fuel Process. Technol.

    (2006)
  • C.S. Akratos et al.

    An artificial neural network model and design equations for BOD and COD removal prediction in horizontal subsurface flow constructed wetlands

    Chem. Eng. J.

    (2008)
  • F.S. Mjalli et al.

    Use of artificial neural network black-box modeling for the prediction of wastewater treatment plants performance

    J. Environ. Manage.

    (2007)
  • F. Farshad et al.

    Separation of toluene/n-heptane mixtures experimental, modeling and optimization

    Chem. Eng. J.

    (2011)
  • S. Wu et al.

    Support vector regression for warranty claim forecasting

    Eur. J. Oper. Res.

    (2011)
  • B. Bhattacharya et al.

    Neural networks and M5 model trees in modelling water level–discharge relationship

    Neurocomputing

    (2005)
  • A. Etemad-Shahidi et al.

    Comparison between M5′ model tree and neural networks for prediction of significant wave height in Lake Superior

    Ocean. Eng.

    (2009)
  • B. Ustun et al.

    Visualisation and interpretation of support vector regression models

    Anal. Chim. Acta

    (2007)
  • G. Postma et al.

    Opening the kernel of kernel partial least squares and support vector machines

    Anal. Chim. Acta

    (2011)
  • J.D. Olden et al.

    Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks

    Ecol. Model.

    (2002)
  • D. Fletcher et al.

    Forecasting with neural networks. An application using bankruptcy data

    Inform. Manage-amster

    (1993)
  • N.R. Couling et al.

    Biodegradation of PAHs in soil: influence of chemical structure, concentration and multiple amendment

    Environ. Pollut.

    (2010)
  • N. Amellal et al.

    Effect of soil structure on the bioavailability of polycyclic aromatic hydrocarbons within aggregates of a contaminated soil

    Appl. Geochem.

    (2001)