Spatiotemporal modeling and prediction of soil heavy metals based on spatiotemporal cokriging

Soil heavy metals exhibit significant spatiotemporal variability and are strongly correlated with other soil heavy metals. Thus, other heavy metals can be used to improve the accuracy of predictions when performing spatiotemporal predictions of soil heavy metals within a given area. In this study, we propose the spatiotemporal cokriging (STCK) method to enable the use of historical sampling points and co-variables in the spatial prediction of soil heavy metals. Moreover, experimental spatiotemporal (ST) semivariogram and ST cross-semivariogram computational methods, a fitting strategy to the ST semivariogram and ST cross-semivariogram models based on the Bilonick model, and the STCK interpolation algorithm are introduced; these methods are based on spatiotemporal kriging (STK) and cokriging (CK). The data used in this study consist of measurements of soil heavy metals from 2010 to 2014 in Wuhan City, China. The results show that the behavior of predictions of the concentrations of heavy metals in soils is physically more realistic, and the prediction uncertainties are slightly smaller, when STCK is used with greater numbers of co-variables and neighboring points.

Soil plays a very important role in the food chain and hence is a key pathway through which humans come into contact with most pollutants 1 . This statement is especially true for heavy metals, which have also been identified as co-factors in many diseases 2,3 . Therefore, there is considerable interest in the best way to monitor soil quality to ensure that soil is managed sustainably 4 . In recent years, an increasing amount of concern has been directed toward the spatial distribution of soil contamination [5][6][7] . Previous studies have carried out multivariate analyses, analyses of various pollutant indices, and geostatistical analyses to evaluate the degree of soil pollution by heavy metals using sampling data collected during individual periods. Additionally, some researchers have begun to address the concern over the spatiotemporal (ST) variability in soil heavy metals 4,8 and have performed statistical analyses of data collected during field surveys conducted in different years that were performed to characterize the spatial and temporal changes in the concentrations of heavy metals in soils. However, the sampling and analysis procedures used when the status of soil heavy metals within a given area must be continuously monitored are expensive and time-consuming. Therefore, spatiotemporal interpolation is necessary because it enables the use of previous soil sampling points to predict present-day spatial distributions with fewer soil samples.
Spatiotemporal kriging (STK) is a tool that is used to analyze and map ST phenomena using point observations 910 . The technique is currently used in many research problems and fields, such as the interpolation of soil water and salinity content [11][12][13] , climatology 14 , and air quality monitoring 15 . Most studies use only measurements of the variable of interest. However, after soil samples have been obtained, various heavy metals in the soil can be measured simultaneously. In addition, many studies indicate that correlations exist among the various heavy metals found in soils [16][17][18] . The use of such relationships in interpolation via cokriging (CK) may decrease prediction uncertainties 19 .
Based on the above discussion, we believe that it is possible to combine historical sampling points and co-variables to perform predictions of heavy metals. Thus, we propose a spatiotemporal cokriging (STCK) method that is based on STK and CK. The main objective of this study is to predict the ST distribution of soil heavy metals within a study area using STCK. To achieve this objective, the following steps are followed. Sample collection and analysis. An extensive investigation of the soil within the study area was carried out in October 2010. In total, 124 topsoil samples were collected at depths of 0-20 cm within the study area. We found that the soil pollution in this area was serious. To monitor the degree of soil contamination, we collected topsoil samples from the study area in October from 2011 to 2014. Forty-five, 48, 55, and 48 soil samples were collected in 2011, 2012, 2013 and 2014, respectively. The spatial distribution of soil sampling points are shown in Fig. 1. At each sampling point, 5 sub-samples were collected at random and mixed to obtain a composite soil sample. Any foreign debris present in the soil samples was manually removed during sample collection. The coordinates of the sample locations were recorded with a GPS. All of the soil samples were air-dried at room temperature and passed through a 100-mesh nylon sieve, which included 100 holes within an area of 1 square inch. The prepared soil samples were then stored in polyethylene bottles for analysis.
The concentrations of heavy metals, including copper (Cu), cadmium (Cd), lead (Pb), and zinc (Zn), were measured in the soil samples. Approximately 0.5 g of each prepared soil sample was digested using a mixture of nitric acid (HNO 3 ) and perchloric acid (HClO 4 ) in a Teflon beaker on a hot plate. The total concentrations of Cd, Cu, Cr, Pb and Zn in the digested solutions were measured using inductively coupled plasma mass spectrometry (ICP-MS; TMO, USA). The accuracy and precision of the measurements were tested using standard reference materials (GGS-3) obtained from the National Center for Standard Reference Materials of China. All of the soil

Methods
Spatiotemporal Cokriging. To begin, we consider two ST variables, Z u (x) and Z v (x), which we denote u and v, respectively; both of these variables obey the intrinsic hypothesis. To distinguish between space and time, let Z(x) = {Z(s, t)|s ∈ S, t ∈ T} be a variable that is defined on a geographical domain S ∈ R 2 and a time interval T ∈ R. The aim of this study is to predict the attribute u at a spatiotemporal point (s 0 , t 0 ) where u was not measured. The prediction is based on measurements of u and v at n ST points (s i , t i ), i = 1… n. Note that not all u and v are observed at the same ST points; however, some ST points where u and v can be measured are required.
Under appropriate stationarity assumptions, an estimate of the ST semivariogram may be obtained from the measurements by computing the experimental semivariogram ˆh h ( , ) where h S and h T are the S and T lags, respectively, and N u (h S , h T ), N v (h S , h T ), and N uv (h S , h T ) are the numbers of pairs in the ST lag for u, v and uv, respectively. Fitting the models to the ST experimental semivariogram and cross-semivariogram has some additional problems over conventional semivariogram and cross-semivariogram modeling; these problems arise due to the distinct differences between the variations in S and T 10 . In this study, we use an extension of the separate-sum models proposed by Bilonick 19 , in which geometric and zonal anisotropy are applied to the problems arising from the differences in S and T variability. In the Bilonick model, the semivariogram is divided into three parts: an S part, a T part and an ST part that includes only geometric anisotropy and neglects zonal anisotropy. Assuming that these three parts are mutually independent, the semivariogram and cross-semivariogram are the sum of three components: The ST lag h ST is obtained by introducing a geometric anisotropy ratio α: The advantage of the Bilonick model is that it has S, T and ST components that can be interpreted fairly easily in a physical sense. The disadvantage is that the estimation of the model parameters is challenging. Prior studies estimate the ratio α along with other parameters of the semivariogram 10,11 . In this study, if the ratio αin every semivariogram or cross-semivariogram is estimated, then each αvalue cannot possibly be the same. This outcome may not be in accordance with the physical significance, given that the spatiotemporal ratios for every variable and every pair of co-variables are not the same. In addition, the ratio αis a very important parameter because it determines how to obtain the spatiotemporal distance between two points in space and time; in particular, it determines which observed points in space and time are used as neighboring points when performing ST predictions. If the ratio αdiffers among the semivariograms or cross-semivariograms, different neighboring points will be determined in different variables or pairs of co-variables. Therefore, the ratio α in all of the semivariograms and cross-semivariograms will be considered to be a single parameter.
When models for the semivariogram and cross-semivariogram are obtained, ST cokriging can be performed. The aim is typically to estimate just one variable, which we may regard as the principal or target variable, at a spatiotemporal point x 0 (s 0 , t 0 ) using data that describe that variable and one or more other variables, which we regard as subsidiary variables. The equations used to perform cokriging in the ST domain are exactly the same as those used in standard S cokriging. The equations can be represented in matrix form. For simplicity, we consider only two variables, u and v. However, the matrices are easily extended to greater numbers of variables. Let Γ uv denote a matrix of semivariances (including cross-semivariances, in which u # v) between sampling points in a neighborhood. Let there be n u places where variable u has been measured and n v places where v has been measured. The order of the matrix is n u × n v : uv uv uv uv n uv uv uv n uv n u v n uv n n We denote by b uu and b uv the vectors of semivariances for variable u and the cross-semivariances: The matrix equation is then: The estimated value of variable u at the spatiotemporal point x 0 (s 0 , t 0 ) is then the linear sum: The estimation variance is obtained from: where λ is the vector of weights and Lagrange multipliers, and b is the right-hand side vector of the matrix equation (9).
Validation and comparison criteria. The results obtained through the use of STCK with different combinations of co-variables and different numbers of neighborhood points are compared. Soil heavy metals are predicted at each of the sites for which measurements are available using the leave-one-out method, which successively deletes the value of each location where the prediction was utilized. This procedure yields pairs of estimated and observed soil heavy metal concentrations. The root mean squared error (RMSE) is then computed from the pairs of estimated and observed soil heavy metals. The RMSE is defined as:

Results and Discussion
Descriptive statistics of soil heavy metals. Descriptive statistics of Cd, Pb, Cu and Zn in the soil samples for each year are presented in Table 1. For Cd, Cu, and Zn, the soil concentrations show steady increases from 2010 to 2014. However, the concentrations of Pb show increases from 2010 to 2013, followed by a small decrease in 2014. Thus, the concentrations of soil heavy metals in the study generally increase over the investigated period.
The normality of Cd, Pb, Cu and Zn at all of the sampling points is tested using the Kolmogorov-Smirnov (K-S) method. The K-S test is a nonparametric test of the equality of continuous, one-dimensional probability distributions that can be used to compare a sample with a reference probability distribution (in this case, a normal distribution). This test examines whether two independent distributions are similar or different by generating cumulative probability distributions for the two distributions. The maximum distance or maximum difference is then entered into the K-S probability function to calculate the probability value. Lower probability values (<0.05) means that it is less likely that the two distributions are similar. Conversely, the higher or closer to 1 the value is, the more similar the two distributions are. The results show that the K-S values are 0.001, 0.000, 0.000, and 0.211 for Cd, Pb, Cu and Zn, respectively. Therefore, the Cd, Pb and Cu data were transformed using base 2 logarithms to achieve normal distributions. Correlation coefficient analysis. The correlation coefficients were used to determine the relationships among different heavy metals in the soil samples and then to determine the co-variables for each heavy metal while performing the ST interpolation. The Pearson correlation coefficient is a statistical measure of the linear correlation between two variables. This metric takes values between +1 and −1, where 1 indicates total positive linear correlation, 0 indicates no linear correlation, and −1 indicates total negative linear correlation. In this study, Pearson correlation coefficients of the heavy metals in the soil samples are summarized in Table 2. Table 2 shows significantly positive correlations among the four heavy metals. Therefore, the three other types of heavy metals can be treated as co-variables when ST interpolation of one heavy metal is performed using STCK. Figure 2(a-d) shows the experimental semivariograms and fitting models for LogCu, LogCd, LogPb, and Zn. The semivariance displays similar behavior in the space and time directions. In the S direction, the semivariance increases continuously with increasing distance to 5000 to 6000 m and then decreases to approximately 8000 m. All of the semivariograms in the T direction show continuous and slow increases in semivariance for lags of 0 to 4 years. Figure 2(e-j) shows the experimental cross-semivariogram and fitting models for LogCd × LogCu, LogCd × LogPb, LogCd × Zn, LogCu × LogPb, LogCu× Zn, and LogPb × Zn. For LogCd × LogCu (Fig. 2e), LogCd × Zn (Fig. 2g), LogCu × LogPb (Fig. 2h), and LogCu × Zn (Fig. 2i), the cross-semivariance increases continuously with increasing distance to 5000 to 6000 m and is then steady in the S direction. For LogCd × LogPb (Fig. 2f) and LogPb × Zn (Fig. 2j), in the S direction, the cross-semivariance increases with increasing distance to approximately 2000 m and is then steady. In the T direction, all of the cross-semivariograms display continuous and slow increases for lags of 0 to 4 years.     The form of the ST cross-semivariance for LogCu × LogPb is defined as: The form of the ST cross-semivariance for LogCu × Zn is defined as:

Model fitting for ST semivariograms and cross-semivariograms.
Here, C 0 represents the nugget value of the model; C T represents the slope of T; C S and C ST represents the sill for S and ST; a S and a ST represent the range parameters of S and ST; and Furthermore, the ratio α should be identical in all of the models. The value of C T is easily calculated using the experimental data. Fitting these 10 models to the experimental data is difficult because 51 parameters must be estimated. We thus use a genetic algorithm to simultaneously estimate these parameters by minimizing a fitness function:∑ The method of fitting models using a genetic algorithm was introduced and is described in detail in Yang et al. 20 . The parameter values resulting from the model fitting procedure are shown in Table 3.

ST interpolation and accuracy evaluation.
Based on the methods introduced in section 3.1 and the models of the ST semivariograms and the ST cross-semivariograms (Table 3), STCK is performed for Cd, Cu, Pb and Zn. For example, in the most complex case, Cd is predicted, and all of the other heavy metals, including Cu, Pb, and Zn, are employed as co-variables. The matrix equation (9) is thus extended as follows: To determine the influence created by the number of neighboring points, we predict the unmeasured ST points using the 4 to 20 nearest ST sampling points around the predicted ST site. The ST distance is determined using formula (16). Because α = 2085 and the spatial distances between the sampling points visited in 2014 are between 400 and 1500 m, manyistorical sampling points are incorporated into the group with the nearest ST sampling points. Consequently, the results of STCK are simultaneously influenced by the historical pollution situation and the correlation factors. We also examine the behavior of the STCK prediction using a different combination of co-variables. For example, considering LogCd, the variations in the prediction variance obtained using different numbers of neighboring points and different combinations of co-variables are shown in Fig. 3.
Based on Fig. 3, we conclude that the use of greater numbers of co-variables and neighboring points results in reductions in the variance of predictions. The RMSE cross-validation criterion for Cd in 2014 is provided in Fig. 4. The results of comparing the RMSE are generally consistent with the results obtained for prediction variance. The use of additional co-variables results in reduced RMSE values. However, the use of more neighboring points does not always produce reduced RMSE values. In addition, as shown in  Table 3. Parameters of the Bilonick models. Fig. 4, the average RMSE of LogCd + LogCu < the average RMSE of LogCd + Zn < the average RMSE of LogCd + LogPb, indicating that the use of co-variables with relatively high correlation coefficients with the major variable results in greater prediction accuracies than the use of co-variables with relatively low correlation coefficients with the major variable. Figure 5 shows the results of STCK interpolation using three co-variables and 20 neighboring sampling points from 2010 to 2014. The variables, including Cd, Pb, and Cu, are back-transformed to their original scales. Our results reveal a general tendency for elevated concentrations of Cd, Cu and Zn to spread from the southwestern part of the study area to the entire area over time, whereas Pb contamination tends to be concentrated mostly in the northern and western parts. Thus, the ST distributions of heavy metals reveal trends in their ST evolution that can assist in identifying sources of pollution and the directions in which the heavy metals diffuse. For example, based on Fig. 5, we conclude that the sources of the Cd, Cu and Zn pollution are located within the southwestern   portion of the study area, i.e., the heavy industrial area of Wuhan City. In addition, the sources of Pb pollution are located within the northern and western parts of the study area.

Conclusions
In this paper, we present a procedure for carrying out ST predictions of heavy metals based on the STCK method. Soil heavy metals, including Cd, Cu, Pb and Zn, measured in the Qingshan district of Wuhan City in China from 2010 to 2014 are employed as experimental data. The Bilonick model is used to fit ST auto-and cross-variograms, and a genetic algorithm is used to estimate the relevant parameters. The logical ST auto-and cross-variogram models shown in Fig. 2 indicate that the Bilonick model adequately describes the ST variability.
The results of STCK show that the use of additional co-variables improves the ST prediction accuracy; the average RMSE decreases as more co-variables are employed. In addition, the use of co-variables with relatively high correlation coefficients with the major variable results in greater prediction accuracies than the use of co-variables with relatively low correlation coefficients with the major variable. Thus, the use of additional co-variables with relatively high correlation coefficients with the major variable significantly improves the prediction accuracy. In addition, the number of neighboring points affects the prediction accuracy significantly. The use of additional neighboring points results in reduced prediction variance and higher general prediction accuracy.
The results of ST predictions of heavy metals can illustrate trends in ST evolution and can help environmental scientists to infer the locations of pollution sources and the directions in which the heavy metals are diffusing. Suitable environmental governance measures must be proposed.