Estimation of Pb and Cd Content in Soil Using Sentinel-2A Multispectral Images Based on Ensemble Learning

: With the increasing economic growth in developing nations, soil heavy metal pollution has become a growing concern. Monitoring the heavy metal concentration in soil through remote sensing is crucial for safeguarding the ecological environment. However, the current indoor spectral measurement method has limitations, such as the discrete soil sampling space and weak spectral characteristics of soil heavy metals, leading to a poor robustness of remote sensing inversion models. This study presents a novel approach to address these challenges by incorporating a spatial feature of pollution sources and sinks to evaluate the spatial factors affecting pollutant diffusion and concentration. An integrated learning model, combining spatial and spectral information, is developed to estimate heavy metal content in soil using Sentinel-2A satellite data. A total of 235 soil samples were collected in Jiyuan, China, and the effective spectral transformation characteristics of Sentinel-2A data were screened. The impact of spectral characteristics, topographic characteristics, and spatial characteristics on retrieving soil heavy metal lead (Pb) and cadmium (Cd) content were analyzed. The optimal inversion method was determined through various integrated learning models, and the spatial distribution of heavy metals Pb and Cd was mapped. The results indicate that the accuracy of the inversion model was signiﬁcantly improved by incorporating terrain features and spatial features of pollution sources. The Blending integrated learning method showed a 65.9% and 73.2% reduction in the RMSE of Pb and Cd, respectively, compared to other regression models. With R 2 values of 0.9486 and 0.9489 for Pb and Cd, respectively, and a MAPE less than 0.2, the Blending model demonstrated high prediction accuracy.


Introduction
Human activities such as mining, metallurgy, and agriculture often result in the release of inorganic pollutants into the soil through atmospheric deposition and wastewater discharge, leading to persistent contamination by heavy metals. These pollutants are known to have low mobility, long persistence, and high resistance to degradation by microorganisms, posing significant threats to the environment [1]. In China, 19.4% of agricultural soils have been found to exceed recommended levels for heavy metal concentrations, making it vital to monitor and understand the distribution of soil heavy metal content [2]. Therefore, efficient and accurate measurement of soil heavy metal levels is crucial to prevent further contamination, safeguard agricultural production, and ensure ecological security.
The content of heavy metals in soil is an important indicator for evaluating soil quality. Traditional methods for investigating heavy metal pollution in soil, such as combining point sampling and chemical analysis, have a high accuracy in obtaining the content of heavy metals around sampling points [3]. However, this approach is costly in terms of time and resources, and it is not suitable for fine monitoring of heavy metal pollution over a large area [4]. The advent of remote sensing technology provides a new avenue for rapid and efficient monitoring of soil heavy metal pollution. Kooistra et al. [5] found that the spectral reflectance characteristics of heavy metals in remote sensing images can be used to estimate their content in soil. While hyperspectral-based predictions have shown promising results, the high cost of hyperspectral satellites and airborne systems make them a challenging option [6]. In comparison, multispectral remote sensing images offer wider spatial coverage, fast access, low cost, and rich information [7]. Researchers have utilized multispectral remote sensing data, such as Landsat, to investigate soil heavy metal pollution in critical regions [8]. However, traditional regression models used for heavy metal prediction often struggle to achieve a high inversion accuracy due to the weak spectral characteristics of soil heavy metals, the limited number of bands, and the low spectral resolution of Landsat images [9]. The Sentinel-2A multispectral remote sensing image offers more wavebands, which can be easily accessed [10,11]. These images have been widely used in various applications, including crop classification [12], land cover monitoring [13], vegetation extraction [14], and post-disaster assessment [15]. Sentinel-2A image has been successfully used to predict the content of arsenic and mercury in soil in the Taihu Lake area of China [16]. As such, Sentinel-2A images were utilized in this study to estimate soil Pb and Cd concentrations.
Existing research has attempted to monitor soil heavy metal pollution with hyperspectral remote sensing technology, focusing on spectral information enhancement transformations, feature selection, and inversion algorithms [17]. Mathematical transformation of spectra, including differential, inverse logarithmic, continuum removal, and multiple scattering correction, has been found to improve the correlation between spectra and heavy metals [18]. Xie et al. [19] demonstrated that differential spectroscopy is suitable for obtaining information on heavy metals in soils, and combining differences or ratios between bands can significantly improve the correlation. Choe et al. [20] found that the reflectance ratio (610 nm/500 nm), the area of the 2200 nm absorption peak, and the symmetry of the 2200 nm absorption peak were significantly related to the content of lead (Pb), zinc (Zn), and arsenic (As), respectively. These studies indicate that the choice of the best spectral data processing method depends on factors such as the soil type, the number of samples, and instrument parameters. In this study, we used three methods (inverse, logarithmic, and band ratio) to mathematically transform the spectral reflectance to enhance the correlation between spectral reflectance and heavy metal content.
Numerous studies have explored prediction models for soil heavy metal content, including linear models such as principal component regression (PCR), stepwise multiple linear regression (SMLR), and partial least squares regression (PLSR), and nonlinear models such as artificial neural network (ANN), support vector machine (SVM), random forest (RF), decision tree (DT), gradient-boosting decision tree (GBDT), fuzzy logic network (FNN), and genetic algorithm. Kemper et al. [21] demonstrated the feasibility of using multiple linear regression (MLR) and artificial neural network (ANN) models to establish heavy metal content prediction models based on reflectance spectra. Goodarzi et al. [22] estimated the content of Pb element in a mining area using MLR, PLSR, and ANN regression models. Although machine learning can be used to predict soil heavy metal content, the model may be less robust and may overfit. Additionally, the relationship between soil heavy metal content and factors such as soil physicochemical properties, topographic factors, and human activities is complex and closely intertwined. Thus, a comprehensive analysis of the influence of multiple factors is required to deepen the understanding of the causes and distribution patterns of soil heavy metal pollution.
In our study, we used 235 field samples collected in Jiyuan, Henan Province, and fused spatial and spectral information using Sentinel-2A multispectral remote sensing images. Through the Blending integration method, we developed an inversion model to estimate soil heavy metal content. Our study investigates the application of multispectral satellite remote sensing technology to estimate heavy metal content in soil and analyzes the spatial distribution of Pb and Cd in the study area. We improved the construction of the inversion model through the following methods.

1.
Introduction of topographic features. When using remote sensing technology to monitor heavy metal concentrations in soil, the weak characteristic response of heavy metals in remote sensing images, as well as the possibility of spectral overlap with other soil components, can complicate detection due to the typically low content of heavy metals in soil. However, research has shown that topographic features are related to soil heavy metal concentrations, providing an indirect means of estimating their distribution in target soil [23].

2.
Proposal to establish the spatial characteristics of pollution sources. The primary sources of heavy metal pollutants in the study area are atmospheric sediments, sewage discharge, and irrigation resulting from mining activities, metallurgy, and industrial production [24]. As a result, pollution sources are typically identifiable. In this study, we propose to develop spatial characteristics of pollution sources to quantify the impact factors of pollutant diffusion and concentration, based on the relationship between soil heavy metal pollution sources and the accumulation of heavy metal pollutants.

3.
Application of ensemble learning. To improve the accuracy of estimating soil heavy metal content by remote sensing, we introduce an ensemble learning method, which has shown superior performance in modeling research across various fields. The proposed hybrid integrated model incorporates both spectral and spatial information, enhancing the stability and generalization ability of the model. By combining multiple individual models, the ensemble method improves the accuracy of soil heavy metal content estimation.

Study Area
Jiyuan, located in the northwest of Henan Province, has a land area of 1899 km 2 with a geographic location of 34 • 0 ~35 • 15 N and 112 • 25 ~112 • 45 E. The region has a temperate monsoon climate with an annual temperature of 14.0 • C and an average annual rainfall of 567.9 mm. The landscape is characterized by higher elevation in the northwest and lower elevation in the southeast. Due to heavy metal smelters in the region relying mainly on comprehensive recycling of lead-acid batteries and mineral resources from other areas as raw materials, long-term metal smelting activities have resulted in the abnormal distribution of various heavy metal elements in the cultivated soil surrounding Jiyuan City, with lead and cadmium as the primary pollutants. The center of abnormal concentrations of lead and zinc in surface soil is situated in close proximity to the lead and zinc smelters. The average content of Pb in the high-value area is 611 mg/kg, which exceeds the third-level pollution standard by 1.22 times. The average content of Cd in the high-value area is 2.53 mg/kg, which exceeds the third-level pollution standard by 2.53 times.

Sampling and Laboratory Analysis
Due to the existence of various obstacles in the study area, such as buildings, rivers, and roads, it is planned to use a network model based on radial and circular complementarity to arrange sampling points (  In the laboratory, impurities such as grass roots and crop stalks are removed from soil samples collected in the field, and they are then packaged in lead cans and dried in an electric blast dryer for dehydration. The dried soil is ground and passed through a 100-mesh sieve. The content of lead and cadmium in soil samples was measured using an XSERIES-2 inductively coupled plasma mass spectrometer at the Rock and Mineral Testing Center of Henan Province. The measured quality was analyzed in accordance with national standards GB/T 14506.30-2010, DZ/T 0279.2-2016, and GB/T 14506.1-2010.

Satellite Data Acquisition and Processing
The Sentinel-2A satellite launched by ESA provides high-resolution multispectral imaging data in 13 bands in the visible, near infrared, and shortwave infrared regions. The ground resolution of these data is 10, 20, and 60 m, respectively, and the image width is 290 km [25,26]. Download a high-quality Sentinel-2 remote sensing image of the study area from the US Geological Survey Earth Explorer, with an image date of 30 October 2019. The obtained image data is at L1C, and after radiometric calibration and atmospheric correction, L2A atmospheric bottom reflectance data is obtained [27]. Resample the spatial resolution of 13 bands of Sentinel-2A to 10 m using the nearest neighbor method, and crop the image to the extent of the study area; this resulted in a high-quality multispectral remote sensing image suitable for our analysis.
This study focuses on the nine original spectral reflectance bands of Sentinel-2A, as four bands with specific application areas were excluded. The B1 band is primarily used for aerosol monitoring [28], while the B8 and B9 bands are typically employed in water vapor monitoring [29]. Additionally, the B10 band is primarily used for ocean current monitoring [30]. Remote sensing imaging may introduce errors in reflectance, which can be minimized through mathematical transformations [31]. This study applies inverse (1/B i ), logarithmic (ln B i ), and band ratio B i /B j transformations to the original spectral bands in order to reduce the impact of such errors.

Quantification of Spatial Features of Potential Pollution Sources
Inorganic pollutants generated by mining and metal smelting primarily constitute point source pollution, emanating from the plant and spreading outward. However, dispersion is also influenced by natural factors including wind direction, wind speed, topography, precipitation, and more. These spatial factors play a crucial role in determining the distribution of soil pollution concentrations. To accurately model pollution dispersion, this study considers incorporating relevant spatial features characterizing pollution sources and sinks.
The study area encompasses five major smelters responsible for lead, zinc, steel, and other metal production, the distribution of which is visually represented in Figure 1. These smelters are identified as potential pollution sources that contribute to heavy metal contamination in the soil across the study area. As pollutants disperse from these sources in all directions, the proximity and orientation of sample locations to these sources are key factors influencing the accumulation of pollutants. To accurately model pollutant dispersion and its potential impacts on the soil, spatial distance and azimuth between sources and sampling points were identified as the primary spatial characteristics. These characteristics were separately calculated for each of the five pollution sources.
(a) Spatial distance between pollution source and sampling point Spatial distance plays a significant role in the accumulation of pollutants. In order to better understand the impact of distance on the dispersion of pollutants from a source to its surroundings, a spatial distance feature is introduced. This feature is calculated based on the Euclidean distance formula [32], taking into account the planar coordinates of the sampling point and the source. The closer the distance to the pollution source, the more pollutants are accumulated, leading to an increase in the spatial distance feature value. However, the farther away from the pollution source, the less the impact, leading to a decrease in the spatial distance feature value, and this spatial relationship is quantified as distance and used for modeling. This approach helps to quantify the effect of distance on the spread of pollutants, providing valuable insights into pollution management strategies.
(b) Azimuth of the line connecting the pollution source and the sampling point Pollutant diffusion is significantly influenced by wind direction and speed in different directions deviating from the pollution source, resulting in significant changes [33]. If a sampling point is located downwind of the dominant wind direction of the pollution source, the cumulative concentration of pollutants at that point will be higher. Therefore, the orientation information of sampling points and pollution sources can reflect the differences in atmospheric diffusion conditions in different directions of pollution sources. Azimuth is a spatial feature that describes the direction of the sampling point relative to the spatial pollution source. It is calculated as the included angle between the ray pointing to sampling point A, starting from pollution source O, and the true north direction.
For the test area, a total of 5 potential pollution sources were identified, and for each of these sources, two spatial features, namely, distance inverse factor and azimuth factor, were calculated. This resulted in 10 spatial features of pollution sources for each sample. The distance inverse factor (SB_D, WY_D, YC_D, JY_D, JL_D) was computed by applying the Euclidean distance formula using the planar coordinates of the sampling point and the pollution source, and then calculating the reciprocal of the resulting distance value. The azimuth factor (SB_A, WY_A, YC_D, JY_D, JL_D) was calculated by determining the included angle between a ray pointing to the sampling point and starting from the pollution source and the true north direction. By incorporating these spatial features, the proposed model aims to quantify the impact factors of pollutant diffusion and concentration in the study area and ultimately improve the accuracy of soil heavy metal content estimation.

Topographic Influence Factor
Topography and terrain play a significant role in determining water flow direction and velocity, as well as controlling local airflow wind direction and velocity, which ultimately influence the diffusion and pooling of heavy metal pollutants. To analyze the influence of topographic factors on soil heavy metal contaminant concentrations, several features were introduced, including Elevation, Slope, Aspect [34], LS factor (LSF) [35], Morphometric features (MF), Generalized surface index (GSI), Wind exposure (WE), and Topographic wetness index (TWI) [36]. These eight topographic features were calculated based on DEM data with a grid size of 30 m. Interpolation was used to extract the corresponding sample point location features. These topographic features are crucial for understanding the local flow dynamics and their influence on soil heavy metal contaminant concentrations.

Ensemble Learning Methods
In the field of inversion research, the ensemble learning method has been proven to effectively enhance the prediction and generalization ability of models. The ensemble learning paradigm involves combining multiple models to solve the same problem and obtain better results. The first step in building an ensemble learning approach is to select the underlying model to be aggregated [37]. The resulting ensemble model that employs a single combination of basic learning algorithms is referred to as a homogeneous ensemble model, and it includes self-help aggregation bagging, boosting, and other models [38]. On the other hand, an ensemble model that combines different weak learners as basic models is called a heterogeneous ensemble model. The representative models of this approach are blend model and stack model [39].
Blending is a technique for combining multiple sub-learners to form a model ensemble. The first layer of the Blending model involves partitioning the training set into a training set and a test set. Multiple models, homogeneous or heterogeneous, can be selected in the first layer. These models are trained using the training sets, and their performance is evaluated on a validation set to obtain the prediction features. These prediction features are then used as the training set for the second layer, where the meta-learner (M0) is directly based on the prediction results of the first level ( Figure A1). In this study, the training set is further divided into a first-level training set and a first-level validation set, and sub-learner models such as CatBoost, XGBoost, LightGBM, GB, and KNN are trained on the first-level training set. The trained models are used to predict the validation and test sets. The prediction results of the validation set are used as new features to train the M0 model, while the prediction results of the first test set are used as the test set for the second input to obtain the final prediction results. The remaining 25% of the data is used for testing, while the original data is split 3:1 for training and testing.
Typically, targeting multiple disadvantaged learners (g 1 , g 2 , g 3 · · · g t ), uniform fusion is suitable when individual learners have similar performance.
When the performance of individual learners varies considerably, a weighted average is used to give each g i a different weight to achieve better results.
Use the mean square error to determine α t .
The implementation process of the stacking model ( Figure A2) is similar to simple averaging and weighted averaging, in which multiple models are aggregated using a multi-layer structure. This multi-layer stacking architecture is trained layer by layer greedily, first training shallow learner parameters and then gradually training deep learner parameters [40]. The characteristic of the stacking architecture is that it does not directly train all training datasets, but instead it uses k-fold cross validation method to divide the entire training dataset into k equal subsets. In one training session, one of the subsets is used as the test set, while the remaining K-1 subsets are used to train the model. After K cross validation sessions, the output of the Level 1 model on the entire training dataset is obtained [41].

Model Accuracy Measure
The metrics used to evaluate the model fit and generalization ability are the coefficient of determination (R 2 ), root mean squared error (RMSE), and mean absolute percentage error (MAPE).
The determination coefficient is used to express the degree of correlation between the predicted value of the model and the measured value. The value is between 0 and 1. The closer the value is to 1, the higher the degree of correlation. On the contrary, the closer the value is to 0, the lower the degree of correlation.
The root mean square error is a measure of the deviation of the model from the observed value to the true value. The smaller the RMSE value, the better the result.
MAPE avoids the situation in which positive and negative errors cancel each other out, and it can reflect the difference between the predicted and measured values more realistically. Usually, MAPE ≤ 0.2 indicates good model robustness; 0.2 < MAPE ≤ 0.5 indicates average model robustness.

Feature Importance Measure
Assessing the degree of influence of features on prediction results can be achieved using feature importance. A higher importance value for a feature indicates a greater significance in the prediction result. The SHAP method is an effective way of explaining models by measuring the impact of features on model output results through the calculation of importance value for each feature in the sample. It is a versatile method that can be applied to various models [42].
The calculation process of this method is as follows: firstly, the feature subset S is obtained by arranging and combining from the feature set to determine the decision path of the samples. Then, calculate the internal node coverage value according to the sample number of leaf nodes, and then combine the decision path and the internal node coverage value to derive the edge weight. If the feature in the current node is in S, directly follow the path to the next node, and the edge weight is set to 1. If the current node feature is not in S, it needs to traverse its left and right nodes at the same time, and the weight is set to the number of samples of corresponding child nodes or the number of parent nodes. Finally, we return from the leaf node to the root node, and the final value taken at the root node is used as the estimate of f x (S) [43]. The SHAP value Φ i of feature i is expressed by the formula: where N is the set of all features in the training set with dimension M, and S is the subset drawn from N with dimension |S|.

Kriging
After predicting the heavy metal content of the sample points, the Kriging interpolation method was used to draw a spatial distribution map of Pb and Cd in the studied soil, with the weight being the average of the observed values based on the degree of spatial correlation. The Kriging method is a spatially localized interpolation method based on the theory of variational functions and structural analysis [44]. Unbiased optimal estimation of regional variables in a limited area is applicable to the spatial correlation of regionalized variables. The realization principle is to use the original data of regionalized variables and the structural characteristics of variogram to conduct linear unbiased optimal estimation of unknown samples [45]. That is, the attribute value of a point is related to the attribute value of its surrounding points, and it can be inferred from the attribute value of its surrounding points. The estimated value z 0 of an unknown issue (x 0 , y 0 ) is a weighted sum of some known points.
where n is the number of points that can determine the attribute value of the point, z i is the attribute value of the i known point, and w i is the weight of the i known point.

Statistical Characteristics of Heavy Metal Content of Sample Soil
Descriptive statistical data for heavy metals in soil are as follows (Table 1), which includes the mean value, standard deviation, minimum value, maximum value, and coefficient of variation. The correlation coefficient between Pb and Cd is high, with a value of 0.866, indicating that they have a high degree of homology. The coefficient of variation (CV) of Pb is higher compared to Cd, suggesting that Pb concentrations are more variable across the sampled locations. The soil background values in Henan Province from the "Background Values of Elements in Chinese Soils" are taken as the reference values.

Spectral Feature Extraction
The correlation between the converted spectral bands and the measured values of Pb and Cd contents in soil heavy metals was analyzed using Pearson correlation. The absolute values of these results were used to evaluate the correlation between spectral bands and soil heavy metals (Figure 2).
When the significance level p is greater than 0.01, spectral bands highly correlated with heavy metal content are selected as the spectral characteristics of the model. In the original spectral segment of Sentinel-2A remote sensing image, the B6 to B8A bands are highly correlated with heavy metal content, with the highest correlation coefficient between the original spectral segment B7 and heavy metal content. After logarithmic transformation of spectral reflectance, the correlation coefficient has a slight improvement, and the spectral bands ln6~lnB8A have the highest correlation coefficient with heavy metal content. The overall correlation between the spectral factors and the contents of heavy metals Pb and Cd is negative after the reciprocal operation. However, neither of these conversion methods achieves additional spectral information for the reflection peak and absorption valley. The correlation coefficient band after the band ratio conversion fluctuates greatly between positive and negative values. After ratio calculation, there are 18 spectral factors with a significant level of Pb element greater than 0.01, and there are 13 spectral factors with a significant level of Cd element greater than 0.01.

Model Estimation and Comparisons
In this study, spectral parameters were selected as characteristic variables using correlation analysis. Spatial and topographical features were also introduced to develop a soil heavy metal content retrieval model based on ensemble learning. A total of 235 sampling points were measured, of which 176 were randomly selected for training, and the remaining 59 were used for prediction, with a ratio of 3:1. In addition to evaluating the proposed model, the results are also compared with traditional regression models such as SVM and RF. The heterogeneous integration learning model establishes an inversion model after adjusting the parameters of the training learner. First, adjust the parameters of the Level 0 learner, determine the parameters, and then adjust the Level 1 learner parameters. Finally, input the prediction dataset into the model to evaluate the model accuracy. Additionally, separately count the best performance of each sub learner.
The model evaluation results are summarized ( Table 2). In the inversion model established using spectral characteristics, the R 2 value of the Cd element SVM model is relatively low, which means that the model cannot quantitatively estimate the Cd concentration in soil. In addition, the MAPE values of lead and cadmium in all models are greater than 0.5, indicating that the inversion model constructed using Sentinel multispectral bands has low robustness and poor prediction accuracy. With the introduction of spatial and topographic features, various accuracy evaluation indicators have greatly improved. The R 2 , RMSE, and MAPE of the mixed model that performed best for Pb were 0.9486, 28.5708, and 0.1723, respectively. The R 2 , RMSE, and MAPE of the mixed model that performed best on Cd elements were 0.9489, 0.3169, and 0.1911, respectively. The MAPE values of the target elements are all less than 0.2, so the hybrid prediction model based on spectral, spatial, and topographic features has high accuracy and stability, and it can be well applied to the study area.

Feature Importance Analysis
SHAP values are used to evaluate the contribution of spectral, topographic, and spatial characteristics to the construction of heavy metal content inversion models. The article provides the SHAP values of 15 features (Figure 3), which are the features with the highest importance among the three modeling features. The spatial distance feature (SB_D) between the pollution source Shibin smelter and the sampling point is the feature with the greatest contribution in heavy metal Pb and Cd modeling. In addition, the overall SHAP value of spatial features is significantly higher than spectral and topographic features, and topographic features also have a certain positive impact on the model, but their overall impact is weaker compared to spatial and spectral features.

Spatial Distribution of Heavy Metal Contamination
In addition to evaluating the accuracy of the model, it is essential to analyze the spatial trend of heavy metal concentration for a more comprehensive understanding of the distribution of heavy metals Pb and Cd. To achieve this, spatial interpolation of soil heavy metal distribution was performed in the study area. This method provided a visual representation of the heavy metal concentration and its spatial distribution in the study area. To obtain accurate data, the features of the image of the study area were classified using support vector machine (SVM) and artificial vision interpretation [46]. This classification process involved dividing the image into several categories such as vegetation, water, buildings, roads, and farmland ( Figure 4). Through this process, the range of farmland in the image was determined, allowing us to conduct a more precise analysis of the distribution of heavy metals in farmland areas. The present study utilized a comprehensive approach to analyze the spatial distribution of heavy metal content in the farmland of the study area. The extraction of spectral, spatial, and topographic features of farmland pixels, along with the robust blending ensemble learning model, proved to be effective in predicting heavy metal content in the farmland.
To further evaluate the model performance, Kriging interpolation of measured heavy metal content was performed to obtain the spatial distribution of heavy metal content in the entire study area [47]. The spatial distribution of the predicted content of Pb and Cd based on Sentinel-2A (Figures 5a and 6a) was in good agreement with the actual measured value (Figures 5b and 6b), as verified by the comparison of the spatial distribution of measured and predicted heavy metal content. These findings suggest the feasibility of using Sentinel-2A remote sensing images to accurately predict heavy metal content in soils. Overall, the study provides an innovative and reliable approach for analyzing the spatial distribution of heavy metals in farmland using remote sensing images and machine learning models.  The correlation analysis results show that Pb and Cd have a high degree of homology, so the interpretation of Pb prediction results overlaps with the understanding of Cd prediction results. Based on the comparison between the measured heavy metal content and the spatial distribution of heavy metal content predicted by Sentinel-2A, it can be found that the soil heavy metal content in the northwest of the study area is relatively high, and the soil heavy metal content shows a downward trend from northwest to southeast.

Soil Heavy Metal Contents
In this study, we aimed to map the spatial distribution of heavy metal concentrations in Jiyuan and its surrounding areas using Sentinel-2A data. Our results showed that the coefficients of variation for heavy metals Pb and Cd were 109 and 96%, respectively, indicating that the distribution of heavy metal concentration is heavily influenced by human activities. These findings are consistent with those of previous studies. For example, Wang et al. [48] investigated the spatial distribution and accumulation of heavy metals in Chinese soil, and they found that the content of cadmium in soil is generally lower than the average concentration of 2.155 mg/kg observed in our study. Similarly, the research results of Pan et al. [49] on heavy metals in cities showed that the median concentration of Pb was slightly lower than the minimum concentration range of Pb measured in our study. Our statistical analysis revealed a high correlation coefficient of 0.866 between Cd and Pb content, indicating their strong homology and potential for synergistic effects, which is consistent with existing research on heavy metal correlations [50]. By comparing the spatial distribution of Pb content derived from Sentinel-2A data with that of Cd content derived from Sentinel-2A data, we found that the distribution of Cd content is similar to that of Pb content.

Comparison Prediction Performance of Models
Comparing the performance of traditional regression models and integrated models in predicting heavy metal concentrations in soil, it was found that the integrated model outperforms RF and SVM in predicting heavy metal content. This is because there is a non-linear relationship between modeling factors and heavy metals, while SVM is a linear regression algorithm that has a slightly weaker ability to quantitatively estimate the content of heavy metals in soil [51]. On the other hand, the RF and gradient boosting (GB) models are both regression models based on decision trees and perform well in predicting heavy metal content [52]. Studies have shown that models based on heterogeneous integration can overcome various problems caused by small sample sets and unbalanced data [53]. The Stacking model may use global statistical features when training on part of the data, leading to a distorted model performance. The Blending model only needs to train different base models on non-overlapping data, and the output values are weighted averages, avoiding information leakage. Model prediction results show that the Blending model based on heterogeneous ensembles performs best among all models, with R 2 values of 0.9486 and 0.9489 for the two heavy metals Pb and Cd, respectively.
A regression analysis was conducted to evaluate the performance of the optimal model in predicting heavy metal Pb and Cd concentrations. A 1:1 relationship plot was generated by plotting the predicted values against the measured values. As demonstrated in Section 3.3, the hybrid models that integrate spectral, spatial, and topographic features achieved the best accuracy evaluation indicators for heavy metal inversion. The x-axis of the plot indicates the measured values of heavy metals at the sample point, while the y-axis represents the predicted values of the hybrid model (Figure 7). In the heavy metal inversion scatter diagram, the predicted and measured values of the Blending model in the training and test sets are both almost a 1:1 straight line, indicating that the fitting effect is good [54], and the predicted values of soil Pb and Cd content obtained from inversion are relatively reasonable, which can be used to monitor the soil Pb and Cd content values in the study area and provide a basis for soil environmental monitoring.

Impact of Topography and Spatial Factors
After extracting spectral reflectance from sample points in Sentinel-2A multispectral remote sensing images, a soil heavy metal Pb and Cd content prediction model was developed by incorporating spatial and topographic characteristics of pollution sources and sinks. The model's performance was significantly improved compared to using Sentinel spectral single features, which supports our hypothesis. Inorganic pollutants generated during mining and metal smelting are predominantly point source pollution that spreads due to natural factors such as wind direction, wind speed, terrain, and precipitation [55]. Previous studies have investigated the contribution of source landscapes using atmospheric sedimentation and surface runoff as diffusion paths to the accumulation of heavy metals in soil [56]. The spatial distribution of heavy metals typically follows the terrain [57], but the study area's overall terrain is relatively flat, with no significant elevation fluctuations, and thus, topographic characteristics have a weak impact on the spatial heterogeneity of heavy metals. The spatial distribution of heavy metals in soil is influenced by a combination of environmental factors [58]. Soil samples were collected from cultivated land around towns in the study area, and the five metal smelters distributed in the area are the primary reasons for the high-value areas of heavy metals in the soil. Correlation analysis between heavy metal content and spatial and topographic features (Tables A1 and A2) reveals that ten spatial features are negatively correlated with heavy metal content. As the distance from the pollution source increases, areas with high heavy metal content gradually disappear.

Heavy Metal Concentration Distribution
The accumulation of heavy metals in soil can result from various sources, including fertilization, agricultural chemicals, atmospheric sedimentation, sewage irrigation, mining, metal ore, sludge application, smelting operations of industrial waste, fossil fuel refining and combustion, and remodification [59]. China's rapid industrialization and urbanization over the past three decades have been identified as the primary factor contributing to the increased presence of heavy metals in soil [60]. In particular, industrial emissions from heavy metal smelting are a significant source of heavy metal pollution in soil, as several metal smelters are distributed throughout the study area [61]. The spatial distribution of heavy metals in soil is characterized by diffusion around the center of metal smelters, likely resulting from atmospheric diffusion deposition and rainfall [62]. The target areas with high heavy metal content are located in the south of Kejing Town and the north of Chengliu Town, both of which are near metal smelters. During the smelting process of heavy metals, lead in wastewater, waste residue, and atmospheric dust gradually migrates to cultivated land and accumulates continuously, resulting in lead pollution of cultivated land soil [63]. According to a study by Zhang et al. [64], the pollution rate of heavy metals in farmland soils is the highest for Cd, followed by Hg, Cu, Ni, Zn, and Pb, while Cr has the lowest pollution rate. The total pollution rate of farmland soil is primarily attributed to Cd, Hg, Cu, and Ni. In the study area, sludge is being used for irrigation, gradually accumulating heavy metals such as Pb and Cd in the farmland. The use of pesticides, fertilizers, and plastic mulch can also cause the accumulation of heavy metals in soil [65,66]. Agricultural plastic films, for instance, contain heat stabilizers with heavy metals such as Cd and Pb, which can result in heavy metal pollution during their use in plastic greenhouses and mulching [67]. The widespread use of electronic equipment generates a substantial amount of electronic waste that contains heavy metals (Cd, Co, Ni, and Pb) [68]. In the main urban area of Jiyuan, human activities, domestic garbage, and industrial waste are the primary sources of heavy metal pollution. Pb and Cd in vehicle exhaust are deposited into road soil through the atmosphere, resulting in a zonal distribution of Pb content in soil on both sides of the road [69]. In the study area, low-value areas of heavy metal content are mainly distributed in the south of Chengliu Town, Yuquan Street, and Heji Town in the northeast of Kejing Town, which are far away from metal smelters and main urban areas and thus have high soil environmental quality.

Conclusions
This article takes the surrounding area of Jiyuan City, Henan Province, China as the research area, collects 235 soil sample data on the spot, conducts chemical analysis, obtains the heavy metal content of the samples, and constructs an inversion model using Sentinel-2A multi spectral images, spatial features of pollution source-sinks and topographic factors. Compared with many traditional models, the prediction accuracy of blending ensemble learning model is evaluated.
The experiments show that the model constructed by the blending method has obvious advantages, among which the test set R 2 of Pb's Blending model is 0.9486, the Blending model R 2 of Cd is 0.9489, and the MAPE value is less than 0.2. The high estimation accuracy of the Blending soil heavy metal estimation model shows that this method is highly feasible for the inversion of soil heavy metal content. The paper proposes to construct the spatial features of potential pollution sources to quantify the spatial influence factors of pollutant diffusion, and the method is applicable to point source and line source pollution with clear sources of pollutants. Generally, soil heavy metal pollutants such as atmospheric dust and sewage discharge originate from mining activities, metallurgy, and industrial production, and the sources of pollution are clear. Therefore, the proposed method has high popularization and application potential. Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.
Appendix A Table A1. Correlation coefficient between Pb content and spatial and topographic characteristics.