Prediction of soil cadmium distribution across a typical area of Chengdu Plain, China

A suitable method and appropriate environmental variables are important for accurately predicting heavy metal distribution in soils. However, the classical methods (e.g., ordinary kriging (OK)) have a smoothing effect that results in a tendency to neglect local variability, and the commonly used environmental variables (e.g., terrain factors) are ineffective for improving predictions across plains. Here, variables were derived from the obvious factors affecting soil cadmium (Cd), such as road traffic, and were used as auxiliary variables for a combined method (HASM_RBFNN) that was developed using high accuracy surface modelling (HASM) and radial basis function neural network (RBFNN) model. This combined method was then used to predict soil Cd distribution in a typical area of Chengdu Plain in China, considering the spatial non-stationarity of the relationships between soil Cd and the derived variables based on 339 surface soil samples. The results showed that HASM_RBFNN had lower prediction errors than OK, regression kriging (RK) and HASM_RBFNNs, which didn’t consider the spatial non-stationarity of the soil Cd-derived variables relationships. Furthermore, HASM_RBFNN provided improved detail on local variations. The better performance suggested that the derived environmental variables were effective and HASM_RBFNN was appropriate for improving the prediction of soil Cd distribution across plains.

geogenic sources, can have substantial impacts on soil heavy metal content and can lead to high local variability in soil heavy metal distributions 9,11 . Due to the limitations of the classical methods, a more effective spatial distribution modelling method is needed for predicting soil heavy metal distributions across plains.
In recent studies, a methodological framework that predicts the spatial distribution of soil properties based on both the environmental correlation between a soil variable and environmental parameters and the spatial autocorrelation in the residuals of the soil variable has been proven to be effective for obtaining more accurate spatial information on soil properties and has received increasing attention 5,6,[12][13][14][15][16][17][18] . With this framework, a combined method is developed based on the premise that the deterministic component of the targeted soil variable caused by correlated environmental factors can be explained by a regression model while the spatially varying but dependent component can be described by the prediction residuals of the linear regression model and captured by the classical methods such as ordinary kriging (OK) 6,13,[15][16][17][18][19] . For instance, regression kriging (RK), which has been widely employed in many studies 6,15,16,19 , is the typical combined method that uses multiple linear regression model (MLR) to capture the relationships between soil and the environmental factors and uses OK to interpolate regression residuals to prediction grids.
However, the commonly used factors, such as terrain factors, land use and soil type, cannot effectively reproduce the spatial variability of the soil heavy metals in plains because of the gently undulating terrain, the relatively homogeneous parent material and soil type, and other anthropogenic factors such as traffic road 9,11 and different rotation systems of farmland 11,20,21 that have great impacts on soil heavy metals. Therefore, new environmental covariates rather than the above commonly used environmental factors should be employed as auxiliary variables in predicting soil heavy metal distributions across plains. Moreover, the relationships between soils and environmental factors are often nonlinear and spatial non-stationary, suggesting that the nonlinear relationships vary across space 6,12 ; thus, a single linear regression model is unlikely to effectively capture such complex relationships for all subareas in a regional study 14 . In addition, although OK is the most commonly used classical method in soil science and can provide the best linear unbiased estimates, this method is based on an assumption (intrinsic stationarity) that may not be met in practice 22 . Recent studies have found that the radial basis function neural network (RBFNN) approach can perform better than MLR due to its capacity to capture the complex relationships between soils and the environmental factors 12,15 , and a new approach, called high accuracy surface modelling (HASM), developed on the basis of a fundamental theorem of surfaces by Yue et al. [23][24][25][26] , can outperform the three classical methods for predicting soil properties 16,22 . Both approaches provide new tools for predicting soil heavy metal distributions across plains based on the methodological framework of the combined methods.
Cadmium (Cd) is an extremely important pollutant among the various heavy metal elements because of its high transfer rate from soil to plants and strong bio-toxicity 27,28 . Soil Cd has become an important environmental pollutant around the world 2, 4, 28-30 and was also found to be a serious pollutant on the Chengdu Plain of China 11,31,32 . Previous studies have shown that geological origin, road distribution and crop rotation systems had great influences on soil Cd in the farmland of the Chengdu Plain 11,32 . This study aimed to develop a method to predict soil Cd distribution in a central area of the Chengdu Plain. The specific objectives were (1) to derive new environmental variables from the factors noted above; (2) to develop a combined method (HASM_RBFNN) using HASM and RBFNN to predict the spatial distribution of soil Cd that considers the nonlinearity and the spatial non-stationarity of the relationships between soil Cd and the derived environmental covariates; and (3) to evaluate its performance compared with that of the OK, RK and HASM_RBFNN s which did not consider the spatial non-stationarity of the relationships.

Results
Correlation between soil Cd and the environmental covariates. The relationships between soil Cd and the environmental factors are shown in Fig. 1a-f. Soil Cd content was negatively correlated with the distance to the Minjiang River; soil Cd content declined significantly with increasing distance to the Minjiang River within 10 km of the Minjiang River (Fig. 1a). Moderate-Resolution Imaging Spectroradiometer (MODIS) normalized difference vegetation index (NDVI) also showed a negative correlation with soil Cd (Fig. 1b). Primary and secondary roads had impacts on soil Cd up to approximately 1.2 km and 0.2 km from the roads, respectively, and the impacts were more significant within 0.5 km and 0.1 km of these types of roads ( Fig. 1c and d), which led to positive correlations between soil Cd and the densities of the two grades of roads within a 500 × 500 m and a 100 × 100 m grid, respectively ( Fig. 1e and f).
According to regression analysis (Table 1), the four derived factors all had significant impacts on soil Cd (p < 0.05 or p < 0.01). Distance to the Minjiang River, MODIS NDVI and the densities of primary and secondary roads contributed 22.0%, 12.8%, 1.4% and 5.7% of soil Cd variability, respectively. However, the correlations changed in different subareas. Within 10 km of the Minjiang River, soil Cd showed significant negative correlations with MODIS NDVI and the distance to the Minjiang River and positive correlations with the densities of both grades of roads, while soil Cd only showed significant correlations with MODIS NDVI and the density of primary roads beyond 10 km of the Minjiang River ( Table 2).
Comparison of the prediction accuracies of different methods. The prediction errors, including the mean absolute error (MAE), root mean square error (RMSE) and mean relative error (MRE), of different methods for the independent validation points are listed in Table 3. The results indicated that HASM_RBFNN could achieve the smallest prediction errors, followed by HASM_RBFNNs, RK and OK, indicating that HASM_RBFNN was the most accurate method and the derived factors and the selected approached for establishing the combined method could contribute to the improved prediction accuracy of soil Cd distribution across the study area.
Comparison of the prediction maps created by different methods. The spatial distribution maps of soil Cd predicted by the four methods are illustrated in Fig. 2 (a-d). The prediction maps of soil Cd distribution obtained from the four methods exhibited similar spatial patterns, which showed that soil Cd was relatively higher in the subarea that is closest to the Minjiang River. However, differences among the prediction results of the four methods were obvious (Fig. 2). HASM_RBFNN obtained the largest prediction ranges that were closest to the observed values, followed by HASM_RBFNNs and RK, while OK had the narrowest prediction ranges among the methods. The prediction map produced by OK showed rather gradual transitions with limited detail and could not accurately reproduce the local variability (Fig. 2d). Conversely, HASM_RBFNN, HASM_RBFNNs and RK performed better, with more detail in the prediction results ( Fig. 2a-c), indicating that the methods utilizing the derived environmental covariates as auxiliary variables could significantly improve the performance of local variability reproduction. Moreover, OK produced soil Cd maps with much larger areas of high value (>0.32 mg·kg −1 ), indicating that the points with high values of soil Cd had great impacts on the prediction results   Table 3. Prediction errors of the different methods for the independent validation points. MAE, mean absolute error; RMSE, root mean square error; MRE, mean relative error; OK, ordinary kriging; RK, regression kriging; HASM_RBFNN, the combined method (HASM_RBFNN) developed using high-accuracy surface modelling (HASM) and radial basis function neural network (RBFNN) modelling, taking into account the spatial nonstationarity of the relationships between soil Cd and the auxiliary variables; HASM_ RBFNNs, the combined method (HASM_RBFNN), without taking into account the spatial non-stationarity of the relationships. surrounding these points, which overestimated the soil Cd contents of these areas. Prediction maps by HASM_ RBFNNs and RK showed some areas with soil Cd exceeding 0.32 mg·kg −1 along roads in the eastern region of the study area, which was inconsistent with the measured data ( Fig. 2e), while the prediction map by HASM_RBFNN was more consistent with the actual soil Cd distribution (Fig. 2a).

Discussion
Effects of environmental factors on soil Cd. According to the semivariogram analysis (Table 4), the ratio of nugget to sill was 0.437, suggesting that soil Cd in this study area was determined by the combined effects of natural and anthropogenic sources. Regression analysis ( Table 1) further indicated that a natural factor (distance to the Minjiang River) was a more important factor than the anthropogenic sources including roads and MODIS NDVI that could reflect the differences between rice-wheat and rice-rapeseed rotation systems based on our calculation method. This result was in agreement with previous studies on the Chengdu Plain 11, 32, 33 .
The soil parent material of the study area mainly arrives via river transportation from the Longmen Mountains, located to the northwest, where the geological background value of soil Cd is 0.376 mg.kg −1 ; in fact, the Cd contents of Carboniferous, Devonian and Sinian outcrops in this mountain range can be up to 0.659 mg. kg −1 11, 34 . A previous study also showed that Cd contents in the first terrace and stream sediment of the Minjiang River were 0.27 and 0.53 mg.kg −1 , respectively 35 . The distance from the Minjiang River reflects the differences of the sedimentation process of the parent material and the formation time of the soil. The shorter the distance, the more recent the soil deposition and the more similar the soil to the parent material, which may account for the fact that soil Cd content negatively correlated with the distance to the Minjiang River (Fig. 1a).
The differences between the two rotation systems were related to the different management measures pertaining to fertilizers, pesticides and the aboveground straws. For example, wheat straw was often returned to the field, while rapeseed straw was always removed in this area. Furthermore, previous studies have indicated that the Cd content in rapeseed is larger than in wheat 21,22 . The different management measures for aboveground straw, different Cd content and the different biomass of wheat and rapeseed may partially result in higher soil Cd content in the rice-wheat rotation systems regardless of any differences in fertilizer management 11 . In this study, the average MODIS NDVI, calculated from the MODIS NDVI of February, July and December of 2006 to 2012, could reflect the differences in the vegetation cover between the two rotation systems as well as soil conditions, where high NDVI values correspond to rice-rapeseed rotation systems and low NDVI values may correspond to rice-wheat rotation systems. This condition led to the negative correlation between soil Cd and MODIS NDVI (Fig. 1b).
High values of soil Cd were found along roads ( Fig. 1c and d), which is consistent with other studies 9, 36 . For instance, Zhang reported that Cd is a priority concern as it has the highest contamination factor along the Qinghai-Tibet highway 36 . Khan found that soil Cd in the roadside soils is related to the road grades; the soil Cd is highest in the roadside soils of primary roads, followed by those of secondary and tertiary roads 9 . In the present study, primary roads were found to have a more far-reaching impact on soil Cd ( Fig. 1c and d), mainly because of the heavier traffic flows. However, the density of secondary roads was much higher than that of primary roads in the study area (Fig. 3c), which resulted in the fact that the density of secondary roads could explain more of the soil Cd variability than primary roads ( Table 1).
The effectiveness of the environmental covariates for improving the prediction. Soil is the product of complex interactions between environmental factors, such as terrain factors, land use and parent material 6,12 . The spatial distribution of soil properties may vary significantly within a short horizontal distance due to various environmental factors 13,16 . It is difficult to obtain accurate predictions in the absence of the environmental auxiliary variables.
Many researchers have shown that the use of the auxiliary information could improve the accuracy of the predictions 5, 12-17 . However, the commonly used factors are not the most effective auxiliary variables for the prediction methods in this plain area due to the gently undulating terrain, homogeneous parent material and soil type in our study area. The results of a correlation analysis suggested that the geological origin might determine the overall spatial trends of the soil Cd distribution, while crop rotation systems and traffic contributed further local variability across the study area (Tables 1 and 4). These obviously influential factors cannot be ignored in an effort to produce a more accurate spatial distribution map of soil Cd. In this study, the environmental covariates were derived from these obviously influential factors and used as auxiliary variables for the prediction methods. The results showed that the methods that employed the derived environmental factors as auxiliary variables (including RK, HASM_RBFNN and HASM_RBFNNs) obtained a higher degree of accuracy and greater detail than the prediction results from OK, which only predicted soil Cd from the neighbouring sampling points (Table 3 and Fig. 2), a finding that was consistent with previous studies 5, [12][13][14][15][16][17] . This result suggested that the derived environmental variables were effective for improving the prediction results and it was feasible for our approach to derive the auxiliary variables from the obviously influential factors.  The performance of HASM_RBFNN for reducing predictive error. HASM_RBFNN showed the best performance among the four prediction methods (Table 3 and Fig. 2), which could be attributed to the following factors. First, the RBFNN model has been proven to be more effective than MLR in capturing the relationships between soil and the environmental factors 12,15 . Other researchers have also found that MLR was not appropriate because the MLR model with inclusion of auxiliary information may deteriorate the spatial structure of the target soil property 37 . In the study area, the relationships between soil Cd and the environmental variables were complex and included a curvilinear relationship between soil Cd and the distance to the Minjiang River (Table 1 and Fig. 1), which suggested that the artificial neural network model was more appropriate. Second, the output of HASM satisfies the iteration stopping criterion, which is determined by the application requirement for accuracy 5 . Typically, soil heavy metal contents of samples from high pollution risk areas are local spatial outliers 38 . This phenomenon was also found in the much higher contents of soil Cd along the roads (Fig. 1c and d) in our study. OK has a smoothing effect and predicts soil Cd content based on the neighbouring soil samples, which underestimates the local high values and overestimates the values around the samples with higher values 1 . In contrast, HASM is a new technique based on a fundamental theorem of surfaces that can generate less error in areas with high local variability through its algorithm and a large enough iteration number 16,22 , which leads to both HASM_RBFNN and HASM_RBFNN s having larger prediction ranges than those of RK and OK and smaller areas with high values (>0.32 mg·kg −1 ) in the prediction maps (Fig. 2). Finally, the spatial non-stationarity of the relationships between soil Cd and the derived environmental covariates was considered in HASM_RBFNN. The superiority of this consideration could be easily demonstrated from the prediction results along the roads in our study. Although the density of primary roads had significant impacts on soil Cd (Table 2), the cardinal values of soil Cd content were largely different in the two subareas due to the different geological background values (Figs 1a and 3e). This condition resulted in overestimation along primary roads beyond 10 km of the Minjiang River and underestimation along primary roads within 10 km of the Minjiang River when only one model was used to capture the relationships between soil Cd and the environmental covariates across the entire area. This underestimation finally led to an inaccurate estimation of soil Cd along the roads and narrower prediction ranges of HASM_RBFNN s and RK than that of HASM_RBFNN.

Limitations.
Although the selected environmental factors had significant impacts on soil Cd distribution, the relatively low correlation coefficients between soil Cd and the four factors suggested the complexity of the relationships between soil Cd and the environmental factors, and other influential factors, such as fertilization management for specific locations and the distribution of chemical factories that could lead to high local variations of soil Cd on the Chengdu Plain 33 , were not included due to a lack of data. Employing more relevant environmental factors as auxiliary variables could further improve the prediction accuracy. Moreover, the resolution of environmental covariates and the best size of the grid that was used to calculate the road density should be further determined based on the prediction results. These limitations should be considered in the future studies.

Methods
HASM_RBFNN and HASM_RBFNN s . The observation of soil Cd at the soil sampling point is divided into two components, which can be expressed as where Z(x i , y j ) is the measured soil Cd content at sampling point (x i , y j ); (x i , y j ) are the coordinates; f(x i , y j ) is the predicted soil Cd content based on thevarious environmental factors, while r(x i , y j ) is the residual that is the spatially variable but dependent component; the residual is computed by subtracting f(x i , y j ) from the original value of soil Cd content. The two components are assumed to be mutually independent and can be predicted by RBFNN and HASM, respectively. The RBFNN model was used to predict f(x i , y j ) from the environmental covariates as follows: where X 1 , X 2 , X 3 and X 4 represent the distance to the Minjiang River, the densities of primary and secondary roads, and NDVI at sampling point (x i , y j ), respectively. The RBFNN model includes three different layers 12 : a layer of input neurons providing input variables to the network, a hidden layer of RBF neurons that are directly connected to the output layer, and a layer of output neurons. The Gaussian function is the most commonly used RBF as the activation function of the hidden layer and can be expressed as follows: where ψ i (x) is the radial basis activation function of the hidden layer, x is the input vector, u i is the central vector of the ith hidden node, and σ is the width of the basis function (spread constant). The number of hidden layer neurons and the width are the two key parameters that must be configured for specific studies. The output layer neuron is a weighted linear combination of RBFs in the hidden layer and can be calculated as follows: where y j (x) denotes output values of the jth node in the output layer, n is the number of hidden nodes, w ji is the connecting weight between the jth hidden node and ith output node, and b j is the bias parameter of the jth output node.
HASM was used to predict the spatial distribution of r(x i , y j ), which was developed in terms of a fundamental theorem of surfaces 5, 16, 24-26 where a surface (z = (x, y, u(x, y))) can be uniquely defined by the first and second fundamental coefficients, which are formulated as follows: The basic theoretical equations of HASM could be formulated as  + u x h y ( , ) and − u x h y ( , ) could be formulated as the following Taylor expansio'n in series, Formulation (7) minus formulation (8) gives that,  For sufficiently small h, the finite difference approximation of u x y ( , ) x and u x y ( , ) y could be expressed as, y Formulation (11) plus formulation (12) gives that, Therefore, For sufficiently small h, the finite difference approximation of u x y ( , ) xx and u x y ( , ) yy could be expressed as, where ≥ n 0, < < + i 0 SCIenTIfIC REPORTS | 7: 7115 | DOI:10.1038/s41598-017-07690-y j  n  i  j  n  i  j  n  i j  n  i j  n  i j  n  i j  n  i  j  n  i  j  n   i j  n  i j  i j  n  11  2  ,   ,  1 ,  1 ,  ,  , 1  , 1  ,  1 ,  1 , , , , 2 , which means that the sampled value is u i j , at the k th sampling point x y ( , ) i j . To solve the algorithm (21) which is the least squares problem, a positive weight λ is introduced on the basis of the well known method of weights. The parameter λ is the weight of the sampling points and determines the contribution of the sampling points to the simulated surface. For sufficiently large λ, the algorithm (21) can be transferred into unconstrained least squares approximation, In terms of the Gauss-Codazii equation set, the iteration stopping criterion of HASM is formulated as x ; e t is the iteration stopping criterion of HASM determined by the application requirement for accuracy.
The application of HASM_RBFNN includes four steps: based on the modelling points, two specific RBFNN models were first trained using a different number of hidden layer nodes and spread constants for the two subareas including within and beyond 10 km of the Minjiang River. The best combinations of the two parameters were tested and determined for the RBFNN configurations, which presented a minimum value of RMSE for the validation points. Second, the two trained RBFNN models were used to predict f(x i , y j ) for the two subareas with the layers of environmental covariates and to calculate the prediction residuals of RBFNN for the modelling points. Third, HASM was then used to predict the spatial distribution of the prediction residuals of RBFNN. Finally, the RBFNN prediction was summed to the result of HASM as the final prediction of HASM_RBFNN.
To evaluate the performance of the method without considering the spatial non-stationarity of the relationships between soil Cd and the derived environmental covariates, HASM_RBFNN s was established, which only trained one specific RBFNN model for the entire study area.
OK. OK is the commonly used and classical method in soil science that is based on observations of a target soil variable and of corresponding spatial positions. In this study, the experimental semivariogram was fitted using authorized theoretical models, including linear, Gaussian, spherical and exponential models. The model with the smallest residual sum of squares (RSS) was chosen to provide the key parameters for spatial prediction by the Kriging procedure in the Geostatistical Analyst extension in ArcGIS. The semivariogram parameters of the best model are listed in Table 4. For the number of the closest samples of OK, we chose the best one from 5 to 30 with a 5 step interval. RK. RK is a commonly used method that can introduce auxiliary environmental variables using a regression model into the kriging system 14,19 . The implementation of RK involves three steps: establishing a multiple linear regression between the target variable and auxiliary variables, computing the regression residuals by semivariogram and OK, and summing the regression prediction and the OK prediction of the residuals. The process of RK in this study can be summarized as follows: where Z RK is the predicted values of soil Cd content using RK, Z R is the predicted values of soil Cd content by a special multiple linear regression that used the four derived environmental covariates as independent variables, and ε OK is the kriging values of the regression residuals by OK with the semivariogram model parameters computed from the residuals ( Table 4).
Assessment of the performance. The prediction performance of each method was evaluated by the difference between the observations and predictions at validation sites using common indices, including the MAE, RMSE and MRE, which were defined as follows:  (Fig. 3a). The study area encompasses an area of 480.3 km 2 , and the elevation ranges from 532 to 673 m, with higher elevations in the northwest and lower elevations in the southeast (Fig. 3b). Q4 grey alluvium is the main parent material (more than 98%), and paddy soil (Fe-leachi-Stagnic Anthrosols) is the only soil type in this area according to the National Soil Census data. Farmland and built-up areas are the two main land use types, which account for 71.1% and 25.8% of the entire area, respectively. Rice-wheat rotation and rice-rapeseed rotation are the two main cropping systems in the farmland. Due to the high soil fertility, farmers may plant other crops after the rice harvest and before planting wheat or rapeseed. This area is characterized by developed transport due to the developed economy and the low relief. Road types include expressways, provincial roads, county roads, town roads and country roads, which have different traffic flow levels.
Soil samples and analysis. A total of 339 sampling sites were determined on the basis of a 1.5 × 1.5 km 2 grid from January to March in 2013 and were also away from built-up areas, woodland and water areas. Information regarding each site's geographic coordinates and road conditions were carefully recorded (Fig. 3b). At each site, a topsoil sample (0-20 cm) was collected with three replicates around the sampling site. Each sample was air dried and passed through a 0.149 mm sieve. The soil Cd content of each sample was determined using graphite oven atomic absorption spectrometry after the soil sample had been digested using a four-acid mixture containing HCl, HNO 3 , HF, and HClO 4 . National standard reference materials, blank value assays and parallel determinations were used to verify the accuracy and precision of the measurements. To evaluate the performance of the prediction methods, 80% of the samples (273 samples) were randomly selected as modelling points using the create subset function of the Geostatistical Analyst in ESRI ArcGIS and the other 20% (66 samples) were used as validation points (Fig. 3b).
Derivation of environmental covariates. Based on previous studies 11,32 , the geological origin, road distribution and crop rotation systems were selected as the environmental variables in this study. As the Chengdu Plain is an alluvial plain, distance to the main river can reflect the differences of sedimentation processes of the parent materials and the differences of the development levels of the soils 11 . In this study, the distance to the Minjiang River from each grid of the study area was calculated by buffer analysis in ArcGIS and was used as an auxiliary environmental variable. Road distribution data (.shp format) were obtained from the transportation department of Sichuan Province; these data contain information on the name, grade, date of construction, and other characteristics for each road. According to traffic flow, roads in this area were classified into two grades. The first grade includes expressways, provincial roads and county roads, while the second grade includes town and country roads. Correlation analysis showed that the primary and secondary roads had significant impacts on soil Cd content up to 500 m and 100 m away from the roads, respectively ( Fig. 1c and d). The density of primary roads within 500 × 500 m grids and the density of secondary roads within 100 × 100 m grids were calculated based on the road distribution data and were used as two other auxiliary variables. The sixteen-day NDVI from MODIS bands was used to represent the different crop rotation systems. Considering the difference of vegetation cover for the two rotation systems, MODIS NDVIs of February (when there is a large difference in vegetation cover between wheat and rapeseed crops), July (when rice has the largest vegetation cover) and December (probable crop between rice harvest and wheat or rapeseed planting) from 2006 to 2012 were selected to calculate the average NDVI value, which was then used as an auxiliary variable. All datasets of the environmental variables were resampled to a 10-m resolution in consideration of the computation time of the prediction methods.