Full length article
A machine learning approach for characterizing soil contamination in the presence of physical site discontinuities and aggregated samples
Introduction
Rehabilitation of contaminated soils in urban areas is in high demand because land values appreciate with increasing urbanization. A common technique for rehabilitating a contaminated site is to remove the contaminated soil and either treat it or bury it at designated sites. Because this activity carries significant costs, it is essential to characterize the spatial distribution of contaminant concentration in order to classify soil as either contaminated or non-contaminated according to the applicable legislation. Any cubic meter unnecessarily removed (i.e. a false positive) or any cubic meter wrongly left in place (i.e. a false negative) will increase the overall rehabilitation costs. Thus, there are financial incentives to minimize soil characterization uncertainties.
In the field of geostatistics, several researchers such as Boudreault et al. [1] and Goovaerts [2], [3] have employed Kriging theory to characterize the spatial distribution of contaminant concentration. Historically, Kriging was proposed by Krige and later formalized by Matheron [4]. More recently, the research community has turned toward Machine Learning methods [5]. Most researchers in this field have employed Artificial Neural Networks (ANN) [6], [7], [8], [9]. ANN is a powerful tool; however, it requires large amounts of data (up to millions of data points) to perform well [10]. This condition is seldom met in practice. In the field of Machine Learning, other techniques analogous to Kriging have recently been the object of numerous publications under the name of Gaussian Process Regression (GPR) [11]. Authors such as MacKay [12] and Rasmussen & Williams [13] have presented modern techniques to calibrate parameters efficiently, process small and large datasets, and provide enhanced formulations that increase robustness against numerical instabilities. These latest developments are implemented in several open-source packages such as GPML (Gaussian Process Machine Learning) [14] and GPStuff [15], both running on the Matlab/Octave language. The motivation for this paper is that current ANN and GPR formulations cannot handle two particular situations that are common during site characterization: (1) experimental conditions where contaminant concentration is quantified from aggregated soil samples and (2) the effect of physical site discontinuities. Note that even though geostatistical methods can handle aggregated soil samples using Block Kriging [16], they cannot handle the effect of physical site discontinuities.
This paper proposes a new unified formulation based on the GPR method to address the two limitations identified above. The paper is organized as follows: Section 2 introduces the standard mathematical formulation of Gaussian Process Regression along with specificities associated with soil characterization applications. Section 3 presents the two extensions to the standard GPR formulation that are proposed in this paper. The first extension accounts for aggregated soil samples by creating virtual points that are employed to model the average contaminant concentration. The second extension proposes a new covariance function that can employ discrete attributes corresponding to physical site discontinuities. The justification for these two new probabilistic formulations comes from a case study where both features are present. In Section 4, an empirical analysis compares the performance of these new extensions with the baseline GPR model.
Section snippets
Gaussian process regression for contamination concentration characterization
This section summarizes the theory behind Gaussian Process Regression [11], [13]. Subsection 2.1 presents aspects related to the model definition, Subsection 2.2 presents the formulation for estimating the conditional probability of a Gaussian process given observations, and Subsection 2.3 presents the procedure for calibrating hyper-parameters. All subsections are presented in the context of soil contamination concentration characterization.
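As a rough illustration of the two computations named above (conditioning a Gaussian process on observations, and the log marginal likelihood that hyper-parameter calibration typically maximizes), the following is a minimal NumPy sketch. It assumes a zero-mean process with a squared-exponential covariance; the function names, the kernel choice and the default parameter values are illustrative assumptions and are not taken from the paper or from GPML.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=1.0, signal_std=1.0):
    """Squared-exponential covariance between two sets of points (one per row)."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return signal_std**2 * np.exp(-0.5 * d2 / length_scale**2)

def gpr_posterior(x_obs, y_obs, x_new, noise_std=0.1, **kernel_args):
    """Posterior mean and variance of the latent (noise-free) process at x_new."""
    k_oo = sq_exp_kernel(x_obs, x_obs, **kernel_args)
    k_oo = k_oo + noise_std**2 * np.eye(len(x_obs))  # observation noise
    k_no = sq_exp_kernel(x_new, x_obs, **kernel_args)
    k_nn = sq_exp_kernel(x_new, x_new, **kernel_args)
    # Cholesky factorization is used instead of an explicit inverse for stability.
    chol = np.linalg.cholesky(k_oo)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    v = np.linalg.solve(chol, k_no.T)
    mean = k_no @ alpha
    var = np.diag(k_nn) - np.sum(v**2, axis=0)
    return mean, var

def log_marginal_likelihood(x_obs, y_obs, noise_std=0.1, **kernel_args):
    """Log-likelihood of the observations; a common calibration objective."""
    k_oo = sq_exp_kernel(x_obs, x_obs, **kernel_args)
    k_oo = k_oo + noise_std**2 * np.eye(len(x_obs))
    chol = np.linalg.cholesky(k_oo)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_obs))
    n = len(y_obs)
    return (-0.5 * y_obs @ alpha
            - np.sum(np.log(np.diag(chol)))  # equals -0.5 * log|K|
            - 0.5 * n * np.log(2.0 * np.pi))
```

In a soil application, each row of `x_obs` would hold the coordinates of a sample and `y_obs` the measured (possibly transformed) concentration; hyper-parameters such as `length_scale` could then be chosen by maximizing `log_marginal_likelihood` with a generic optimizer.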
New extensions to Gaussian process regression
This section describes two new extensions to GPR. The first extension, presented in Subsection 3.1, allows modeling observations that are obtained from aggregated soil samples. The second extension, presented in Subsection 3.2, allows modeling sites featuring discrete physical discontinuities.
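One way to picture the two extensions is sketched below in NumPy. This is not the paper's formulation, which is not reproduced in this excerpt. It rests on two assumptions: an aggregated observation is modeled as the average of the latent concentration at a set of virtual points, and a discontinuity is modeled by damping the covariance between points lying in different sub-regions by a hypothetical factor `rho` in [0, 1]. All function and parameter names are illustrative.

```python
import numpy as np

def sq_exp_kernel(xa, xb, length_scale=1.0):
    """Squared-exponential covariance between two sets of points (one per row)."""
    d2 = np.sum((xa[:, None, :] - xb[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def region_kernel(xa, ra, xb, rb, length_scale=1.0, rho=0.2):
    """Assumed discontinuity-aware covariance.

    ra and rb hold an integer sub-region label per point. Pairs in the same
    sub-region keep the full covariance; pairs in different sub-regions are
    damped by rho (0 = independent regions, 1 = no discontinuity).
    """
    same_region = ra[:, None] == rb[None, :]
    return sq_exp_kernel(xa, xb, length_scale) * np.where(same_region, 1.0, rho)

def averaging_matrix(group_ids, n_groups):
    """Row i averages the virtual points assigned to aggregated sample i.

    Every sample needs at least one virtual point, otherwise a row sums to zero.
    """
    a = np.zeros((n_groups, len(group_ids)))
    a[group_ids, np.arange(len(group_ids))] = 1.0
    return a / a.sum(axis=1, keepdims=True)

def predict_from_aggregates(x_virt, r_virt, group_ids, y_agg,
                            x_new, r_new, noise_std=0.05, **kernel_args):
    """Posterior mean at x_new given averages observed over virtual points."""
    a = averaging_matrix(group_ids, len(y_agg))
    k_vv = region_kernel(x_virt, r_virt, x_virt, r_virt, **kernel_args)
    # Covariance of the averages: A K A^T plus observation noise.
    k_yy = a @ k_vv @ a.T + noise_std**2 * np.eye(len(y_agg))
    # Covariance between prediction points and the averages: K_* A^T.
    k_ny = region_kernel(x_new, r_new, x_virt, r_virt, **kernel_args) @ a.T
    return k_ny @ np.linalg.solve(k_yy, y_agg)
```

Because averaging is a linear operation, the aggregated observations remain jointly Gaussian with the latent process, so the usual GPR conditioning formulas still apply once the covariances are projected through the averaging matrix.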
Empirical validation
This section compares the performance of the standard GPR model with that of the two proposed improvements. Subsection 4.1 presents the site and dataset employed for the performance comparison, Subsection 4.2 presents the method employed for quantifying the predictive capacity of each model configuration, and Subsection 4.3 presents the comparison procedure along with the results. All analyses presented in this section have been performed using the GPML package [14] in which we implemented the new
Discussion
The results obtained in Section 4.3 demonstrate that the two new extensions to GPR outperform the standard GPR in the presence of aggregated soil samples and site discontinuities. This conclusion is based on the LOO-CV results presented in Table 3. Figs. 6 and 7 illustrate the effects of the proposed formulation on the results. Nevertheless, these contaminant iso-surfaces cannot be employed to compare the performance of model formulations since the true contaminant
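For readers unfamiliar with leave-one-out cross-validation (LOO-CV) in a GPR setting, the sketch below computes leave-one-out predictions in closed form from the inverse of the observation covariance matrix, following the standard identities given by Rasmussen & Williams [13]. The excerpt does not state which score is reported in Table 3; the summed log predictive density used here is an assumption, and the function name is illustrative.

```python
import numpy as np

def loo_cv(k_yy, y):
    """Closed-form leave-one-out predictions for a zero-mean GP.

    k_yy is the covariance matrix of the observations, noise included.
    Returns the leave-one-out predictive means, the predictive variances,
    and the summed log predictive density of the held-out observations.
    """
    k_inv = np.linalg.inv(k_yy)
    diag = np.diag(k_inv)
    mu = y - (k_inv @ y) / diag   # prediction of y_i from all other observations
    var = 1.0 / diag              # predictive variance of y_i
    log_density = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y - mu) ** 2 / var
    return mu, var, log_density.sum()
```

A higher summed log predictive density indicates that held-out observations are better explained, in both mean and uncertainty, by the remaining data, which makes it one possible basis for ranking model formulations when the true concentration field is unknown.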
Conclusion
Two improvements to the standard Gaussian Process Regression are proposed in this paper. The first one can take into account the observation method employing aggregated samples. The second method allows considering physical discontinuities between sub-regions within a site. The two new probabilistic formulations proposed outperformed the standard GPR model. Although the gain in the prediction performance is small, the fundamental hypotheses employed to model site discontinuities and soil
Acknowledgements
The authors would like to thank Yvon Courchesne from WSP group for his help in the project and Denis Marcotte for his comments on the preliminary version of the manuscript. The project was funded by the Fonds de recherche du Quebec - Nature et technologies (FRQNT, Project #2017-NC-197235).
References (23)
- et al. Quantification and minimization of uncertainty by geostatistical simulations during the characterization of contaminated sites: 3-D approach to a multi-element contamination. Geoderma (2016)
- Geostatistics in soil science: state-of-the-art and perspectives. Geoderma (1999)
- Geostatistical modelling of uncertainty in soil science. Geoderma (2001)
- Air quality prediction in Milan: feed-forward neural networks, pruned neural networks and lazy learning. Ecol. Modell. (2005)
- Multivariable geostatistics in S: the gstat package. Comput. Geosci. (2004)
- Le krigeage universel, Tech. rep. (1969)
- et al. Machine learning for spatial environmental data. Theory Appl. Softw. (2009)
- Spatial predictions of soil contamination using general regression neural networks. Syst. Res. Inform. Sci. (1999)
- et al. Predicting particulate matter (PM 2.5) concentrations in the air of Shahr-e Ray city, Iran, by using an artificial neural network. Environ. Quality Manage. (2016)
- et al. Comparison of four machine learning methods for predicting PM10 concentrations in Helsinki, Finland. Water Air Soil Pollut.: Focus (2002)