Estimation of Soil Arsenic Content with Hyperspectral Remote Sensing

With the continuous application of arsenic-containing chemicals, arsenic pollution in soil has become a serious problem worldwide. The detection of arsenic pollution in soil is of great significance to the protection and restoration of soil. Hyperspectral remote sensing is able to effectively monitor heavy metal pollution in soil. However, due to the possible complex nonlinear relationship between soil arsenic (As) content and the spectrum and data redundancy, an estimation model with high efficiency and accuracy is urgently needed. In response to this situation, 62 samples and 27 samples were collected in Daye and Honghu, Hubei Province, respectively. Spectral measurement and physical and chemical analysis were performed in the laboratory to obtain the As content and spectral reflectance. After the continuum removal (CR) was performed, the stable competitive adaptive reweighting sampling algorithm coupled the successive projections algorithm (sCARS-SPA) was used for characteristic band selection, which effectively solves the problem of data redundancy and collinearity. Partial least squares regression (PLSR), radial basis function neural network (RBFNN), and shuffled frog leaping algorithm optimization of the RBFNN (SFLA-RBFNN) were established in the characteristic wavelengths to predict soil As content. These results show that the sCARS-SPA-SFLA-RBFNN model has the best universality and high prediction accuracy in different land-use types, which is a scientific and effective method for estimating the soil As content.


Introduction
The pollution of arsenic (As) to the environment has caused widespread concern all over the world [1], and harm to the body of local residents has been confirmed in recent years. Once it enters the human body, no matter through what way, only a small amount will be eliminated, most will accumulate in the body. Arsenic accumulated in the human body will cause abnormal cell metabolism after destroying the redox ability of cells, resulting in serious diseases and even death. It is recognized as a top pollutant by the US Toxic Commission [2]. In arsenic-contaminated areas, the toxicity and biological activity of arsenic depend on the form of arsenic. Among all forms, As (V) accounts for the vast majority, and has the highest biological effectiveness [3]. Therefore, the determination of total arsenic in the soil can reflect the pollution of arsenic in most areas, which is of great significance.
Considering the importance of soil safety to human health and agricultural development, it is very important to identify areas contaminated by heavy metals in the soil [4]. For this reason, a low-cost

Study Area
The study area is located in Honghu and Daye, Hubei Province, China. Daye (114 • 31 ~115 • 20 E, 29 • 40 ~30 • 15 N) has a total area of 1566.3 km 2 , of which cultivated land area is about 480.1 km 2 . In a subtropical continental monsoon climate, Daye has sufficient sunlight and precipitation, which is conducive for agricultural production. Daye has rich mineral resources [18] and more than 2000 years of mining and smelting history. The mining and smelting, which is mainly based on copper and iron ore mining, have promoted the rapid economic development of the region. At the same time, the excessive exploitation of resources has also caused serious pollution of the region's soil, posing a threat to the development of the agricultural economy. In Daye, the land-use type of the soil sample collection area belongs to farmland. Honghu (113 • 07 ~114 • 05 E,29 • 39 ~30 • 12 N) has a total area of 2519 km 2 . In a subtropical humid monsoon climate, winter and summer in Honghu are longer than spring and autumn, and the precipitation is also abundant. In recent years, due to industrial production and transportation, the soil heavy metal pollution is more serious than before, which has led to the deterioration of the ecological environment quality. The land-use type of the soil sample collection area belongs to industrial and mining land. The location of the study area is shown in Figure 1.

Study Area
The study area is located in Honghu and Daye, Hubei Province, China. Daye (114°31′~115°20′E, 29°40′~30°15′N) has a total area of 1566.3 km 2 , of which cultivated land area is about 480.1 km 2 . In a subtropical continental monsoon climate, Daye has sufficient sunlight and precipitation, which is conducive for agricultural production. Daye has rich mineral resources [18] and more than 2000 years of mining and smelting history. The mining and smelting, which is mainly based on copper and iron ore mining, have promoted the rapid economic development of the region. At the same time, the excessive exploitation of resources has also caused serious pollution of the region's soil, posing a threat to the development of the agricultural economy. In Daye, the land-use type of the soil sample collection area belongs to farmland. Honghu (113°07'~114°05'E,29°39'~30°12'N) has a total area of 2519 km 2 . In a subtropical humid monsoon climate, winter and summer in Honghu are longer than spring and autumn, and the precipitation is also abundant. In recent years, due to industrial production and transportation, the soil heavy metal pollution is more serious than before, which has led to the deterioration of the ecological environment quality. The land-use type of the soil sample collection area belongs to industrial and mining land. The location of the study area is shown in Figure 1.

The sCARS-SPA Characteristic Wavelength Selection Algorithm
Founded on Monte Carlo sampling (MCS) and partial least squares regression coefficients, CARS is considered to be a popular and effective characteristic wavelength selection algorithm [19]. However, the variable regression coefficient will change with each random selection of modeling samples, and the importance of the wavelength cannot be well reflected by the absolute value of the regression coefficient. Zheng et al. [12] proposed the sCARS algorithm, which takes the stability of

The sCARS-SPA Characteristic Wavelength Selection Algorithm
Founded on Monte Carlo sampling (MCS) and partial least squares regression coefficients, CARS is considered to be a popular and effective characteristic wavelength selection algorithm [19]. However, the variable regression coefficient will change with each random selection of modeling samples, and the importance of the wavelength cannot be well reflected by the absolute value of the regression coefficient. Zheng et al. [12] proposed the sCARS algorithm, which takes the stability of variables as the indicator to measure the importance of variables, and continues the variable screening process Sensors 2020, 20, 4056 4 of 16 of CARS. In addition, there is collinearity among spectral characteristic wavelengths. SPA can select the wavelength group containing the most relevant information to eliminate collinearity in wavelengths as much as possible [20]. In this study, SPA was used to perform second characteristic wavelength selection to eliminate the collinear effect among many wavelength variables. The specific process of the sCARS-SPA is as follows: Step 1: Calculate the stability index of each wavelength Based on the stability index, the importance of wavelengths is directly reflected. The spectral data X p×q contains q spectral responses of p samples. y q×1 is the arsenic content of q samples. The formula about X p×q and y q×1 can be expressed as: where α is a regression coefficient vector, the number of coefficients is q, and F is an error vector.
In the process of MCS, q s (q s < q) samples are randomly selected from q samples, and the regression coefficient α is calculated. After R times of sampling, a matrix A([α 1 , α 2 , . . . α R ] t will finally be determined, which contains R corresponding regression coefficient vectors. The number of rows and columns of the R matrix are q and R, respectively. Stability is defined here as [21,22]: where for the ith wavelength, e i , d i , and s(d i ) are the stability index, the mean value, and the standard deviation, respectively, after R sampling runs. To ensure that the stability is positive when compared, the symbol of absolute value is applied to the formula.
Step 2: Select the group of wavelengths based on sCARS The important wavelengths were obtained by mandatory wavelength selection and adaptive reweighted sampling (ARS) depending on the wavelength stability. For obtaining the best subset of variables at this stage, cross-validation is performed and the wavelength group with the minimum root mean square error of cross-validation (RMSECV) is the best wavelength group selected by sCARS.
Step 3: SPA is used to perform second feature selection on the variable subset obtained in step 2, eliminating the collinearity effect among many wavelength variables.
Specific steps are as follows: (a) Initialization: z = 1 (first iteration). In the wavelength group obtained in step 2, a wavelength x j is chosen by random selection. x j is expressed as x r(0) , that is, x r(0) = j; (b) The set C is defined as: where R is the number of wavelengths. That is, the wavelength selected in the initialization has not been selected into the wavelength chain. The projection vector of x j to the vector in C is calculated.
(c) The sequence number of the maximum projection is recorded (d) The projection vector of the next round is the maximum projection of the previous round Sensors 2020, 20, 4056 5 of 16 (e) z = z + 1, if z < R, go back to (b) to continue projection.
For each pair of x r(0) and R, the RMSECV of the calibration set is obtained on multiple linear regression analysis (MLR). The x r(0) and R corresponding to the smallest RMSECV are the final subsets of wavelengths.

Partial Least Squares Regression
As a commonly used multivariate statistical algorithm, PLSR considers not only the extraction of principal components from dependent and independent variables, but also the maximization of the correlation between principal components extracted from independent and dependent variables [23,24]. Therefore, PLSR is a general method of modeling using hyperspectral data.

SFLA-RBFNN Method for Estimation of the Soil As Content
RBFNN can handle the difficulty in analyzing regularity and approximate any non-linear function, and is considered to be excellent in generalization and convergence rate. As shown in Figure 2. RBFNN is a feedforward neural network with a three-layer structure, including input layer, hidden layer, and output layer.

Partial Least Squares Regression
As a commonly used multivariate statistical algorithm, PLSR considers not only the extraction of principal components from dependent and independent variables, but also the maximization of the correlation between principal components extracted from independent and dependent variables [23,24]. Therefore, PLSR is a general method of modeling using hyperspectral data.

SFLA-RBFNN Method for Estimation of the Soil As Content
RBFNN can handle the difficulty in analyzing regularity and approximate any non-linear function, and is considered to be excellent in generalization and convergence rate. As shown in Error! Reference source not found., RBFNN is a feedforward neural network with a three-layer structure, including input layer, hidden layer, and output layer.
The nonlinear function h(x, c ) is used as a radial basis function at each node of the hidden layer. The commonly used radial basis transfer function is a Gaussian kernel function, and its formula is: where r is the spread constant of the radial basis function, c is the value of the center symmetry point, and is the number of neurons. After the radial basis function is determined, the output value of the RBF neural network can be obtained by linearly summing the function results. SFLA is a heuristic group optimization algorithm [25]. Combining the advantages of memetic algorithm and particle swarm optimization algorithm, it has excellent global optimization capabilities and computational efficiency. Therefore, in this study, the parameter of the RBFNN will be optimized by SFLA for improving the prediction accuracy and the robustness. The structure of SFLA-RBFNN is shown in the Figure 2.  The nonlinear function h(x, c i ) is used as a radial basis function at each node of the hidden layer. The commonly used radial basis transfer function is a Gaussian kernel function, and its formula is: where r i is the spread constant of the radial basis function, c i is the value of the center symmetry point, and i is the number of neurons. After the radial basis function is determined, the output value of the RBF neural network can be obtained by linearly summing the function results. SFLA is a heuristic group optimization algorithm [25]. Combining the advantages of memetic algorithm and particle swarm optimization algorithm, it has excellent global optimization capabilities and computational efficiency. Therefore, in this study, the parameter of the RBFNN will be optimized Sensors 2020, 20, 4056 6 of 16 by SFLA for improving the prediction accuracy and the robustness. The structure of SFLA-RBFNN is shown in the Figure 2.
To avoid the appearance of overfitting in the optimization process, cross-validation is performed during the establishment of the model. The individual fitness function is set to the mean value of the root mean square error (RMSE) of the validation set. The optimization objective function is: where K is the fold of cross-validation (here it is 4), y i,k is the measured value of the verification set in the kth cross-validation,ŷ i,k is the predicted value of validation set in the kth cross-validation, and n is the number of samples in the validation set.

Flowchart
The flowchart for prediction of soil As content is shown in Figure 3 and is mainly divided into the following four parts: (1) the acquisition and processing of soil hyperspectral data and the As content in soil; (2) the characteristic band is selected by sCARS and sCARS-SPA, respectively; (3) through the joint x − y distance (SPXY) algorithm, train set, validation set, and test set are obtained from the sample set; (4) PLSR, RBFNN, and SFLA-RBFNN are established in the characteristic wavelengths to predict soil heavy metal As content, and the model with the best accuracy is obtained.
Sensors 2020, 20, x FOR PEER REVIEW 6 of 16 To avoid the appearance of overfitting in the optimization process, cross-validation is performed during the establishment of the model. The individual fitness function is set to the mean value of the root mean square error (RMSE) of the validation set. The optimization objective function is: where K is the fold of cross-validation (here it is 4), i, is the measured value of the verification set in the th cross-validation, i, is the predicted value of validation set in the th cross-validation, and is the number of samples in the validation set.

Flowchart
The flowchart for prediction of soil As content is shown in Figure 3 and is mainly divided into the following four parts: (1) the acquisition and processing of soil hyperspectral data and the As content in soil; (2) the characteristic band is selected by sCARS and sCARS-SPA, respectively; (3) through the joint x − y distance (SPXY) algorithm, train set, validation set, and test set are obtained from the sample set; (4) PLSR, RBFNN, and SFLA-RBFNN are established in the characteristic wavelengths to predict soil heavy metal As content, and the model with the best accuracy is obtained.

Accuracy Evaluation
The performance of estimation models was evaluated by R , RMSE, and mean absolute error (MAE). The calculation formulae of R , RMSE, and MAE are as follows:

Accuracy Evaluation
The performance of estimation models was evaluated by R 2 , RMSE, and mean absolute error (MAE). The calculation formulae of R 2 , RMSE, and MAE are as follows: Sensors 2020, 20, 4056 where y i is the measured As content in the soil,ŷ i is the estimated soil As content, y is the mean of the measured soil As content, and n means the number of soil samples.

Soil Sample Collection and Processing
In Honghu and Daye, 27 and 62 soil samples were collected using a checkerboard sampling method at a depth of 0-20 cm. The soil samples were yellow-brown, rich in organic matter, and weakly acidic. According to the World References Based Soil Resources (IUSS WG WRB, 2015) [26], they belong to Fluvisols. After the collection was completed, each sample was split into two parts. The two parts were regarded as the targets of spectrum measurement and soil As content measurement, respectively. The measurement of As content was mainly based on laboratory analyses. First, impurities such as stones and plant roots in soil samples were removed as much as possible. After that, a sieve with a pore size of 2 mm was used to screen the crushed soil. Next, the sieved sample was compacted and screened again with a sieve with a pore size of 0.15 mm [27]. Finally, nitric acid, hydrochloric acid, and perchloric acid were used to decompose the arsenic in various forms in the soil sample, which converted them into soluble arsenic ions in the solution and avoided the interference of sulfur and phosphorus on the measurement. Then, potassium borohydride/silver nitrate spectrophotometry was used to measure the arsenic content based on the absorbance (it is assumed that the possible interference from other ions is acceptable) [3]. The determined value of each soil sample was calculated based on arithmetic average after three measurements.

Soil Spectrum Collection and Processing
For the samples from Honghu, the spectral reflectance was measured by SVC HR-1024 field spectrometer. The number of wavelengths in the spectrum was 990. The resolutions of 350-1000 nm, 1000-1900 nm, and 1900-2500 nm were 1.5 nm, 3.8 nm, and 2.5 nm, respectively. The spectral reflectance of the soil sample from Daye was measured by ASD FieldSpec3 field spectrometer with spectral resolution of 1 nm and wavelength range of 350-2500 nm. The number of wavelengths is 2151. The soil sample was placed in a darkroom, where a 1000-watt halogen lamp was used as a light source. Perpendicular to the soil surface, the irradiation angle of the light probe was 45 • [28].
During the measurement process of the spectrometer, although the spectrometer has a high measurement accuracy, there is usually a certain error between the measured content and the actual content due to the influence of factors such as the sample and the lighting conditions. Spectral transformation is a method to effectively emphasize the spectral characteristics and solve the problem of background noise. In this study, after performing the removal of the noise edge wavelengths of 350-399 nm and 2400-2500 nm [29], CR [30] was performed to highlight the absorption and reflection features of spectral curves, facilitating the extraction of feature wavelengths. The final results are used for modeling and analysis. After the processing is completed, the spectral reflectance curve of the soil sample from Honghu and Daye areas is shown in Figure 4.

Calibration Set, Validation Set, and Test Set
In order to consider both the soil As content vector and the spectral vector, the soil samples were split into a model set and test set for modeling and testing by SPXY algorithm [31], respectively. In the process of parameter optimization, for the purpose of avoiding the poor generalization of the model due to overfitting, the original training set is split into a calibration set and a verification set in the same proportion for cross-validation. Multiple groups of different training and validation of the model can also solve the problem of too one-sided and insufficient training data due to the results of individual testing. The validation set is used to evaluate the model for parameter optimization. The test set is used to evaluate the performance of the final model. For Daye, the number of training sets, verification sets, and test sets is 35, 12, and 15, respectively. For Honghu, the number of training sets, verification sets, and test sets is 15, 5, and 7, respectively. The algorithm selects the model with minimum generalization error as the final model and trains the model again on the whole training set to obtain the final model.
As shown in Table 1, according to the Soil Environmental Quality Standards GB15618-1995 in China, the mean of soil As content in the sampling area of Honghu is 33.05 ug/g, which exceeds the pollution standard. The mean of soil As content in the sampling area of Daye is 9.28 ug/g, which is lower than the pollution standard. Therefore, the sampling area of Honghu belongs to the contaminated area, and the sampling area of Daye belongs to the uncontaminated area. Two areas with different heavy metal As pollution levels are used as research areas, which can enhance the credibility and practicality of the experimental results.

sCARS-SPA Characteristic Wavelength Selection Algorithm
The sCARS-SPA algorithm can gradually eliminate redundant variables and collinear variables in soil hyperspectral data. The number of MCS samples is set to 50 and the fold of cross-validation is five. For the Daye area, after the five-fold cross-validation, the RMSECV trends of the CR of the reflectance spectra are shown in Figure 4. As shown in Figure 4a, when the sampling reached the 28th time (the corresponding number of characteristic wavelengths is 59), the RMSECV curve is at its

Calibration Set, Validation Set, and Test Set
In order to consider both the soil As content vector and the spectral vector, the soil samples were split into a model set and test set for modeling and testing by SPXY algorithm [31], respectively. In the process of parameter optimization, for the purpose of avoiding the poor generalization of the model due to overfitting, the original training set is split into a calibration set and a verification set in the same proportion for cross-validation. Multiple groups of different training and validation of the model can also solve the problem of too one-sided and insufficient training data due to the results of individual testing. The validation set is used to evaluate the model for parameter optimization. The test set is used to evaluate the performance of the final model. For Daye, the number of training sets, verification sets, and test sets is 35, 12 and 15, respectively. For Honghu, the number of training sets, verification sets, and test sets is 15, 5 and 7, respectively. The algorithm selects the model with minimum generalization error as the final model and trains the model again on the whole training set to obtain the final model.
As shown in Table 1, according to the Soil Environmental Quality Standards GB15618-1995 in China, the mean of soil As content in the sampling area of Honghu is 33.05 ug/g, which exceeds the pollution standard. The mean of soil As content in the sampling area of Daye is 9.28 ug/g, which is lower than the pollution standard. Therefore, the sampling area of Honghu belongs to the contaminated area, and the sampling area of Daye belongs to the uncontaminated area. Two areas with different heavy metal As pollution levels are used as research areas, which can enhance the credibility and practicality of the experimental results.

sCARS-SPA Characteristic Wavelength Selection Algorithm
The sCARS-SPA algorithm can gradually eliminate redundant variables and collinear variables in soil hyperspectral data. The number of MCS samples is set to 50 and the fold of cross-validation is five. For the Daye area, after the five-fold cross-validation, the RMSECV trends of the CR of the reflectance spectra are shown in Figure 5. As shown in Figure 5a, when the sampling reached the 28th time (the corresponding number of characteristic wavelengths is 59), the RMSECV curve is at its lowest point, indicating that the selected group of spectral wavelengths is the best at this stage. Then, for the reflectance spectral curve, the training set divided by SPXY is used in SPA to perform second characteristic wavelength selection, and the selected characteristic wavelength is used for final modeling. The characteristic wavelengths selected are shown in Figure 5b. The characteristic wavelengths are located in the area with a significant change, which shows that the SPA can be effective. Eleven wavelengths were finally selected and their wavelengths were 826 nm, 976 nm, 985 nm, 1213 nm, 1216 nm, 1221 nm, 1826 nm, 2341 nm, 2357 nm, 2380 nm, and 2382 nm.
Sensors 2020, 20, x FOR PEER REVIEW 9 of 16 lowest point, indicating that the selected group of spectral wavelengths is the best at this stage. Then, for the reflectance spectral curve, the training set divided by SPXY is used in SPA to perform second characteristic wavelength selection, and the selected characteristic wavelength is used for final modeling. The characteristic wavelengths selected are shown in Figure 4b. The characteristic wavelengths are located in the area with a significant change, which shows that the SPA can be effective. Eleven wavelengths were finally selected and their wavelengths were 826 nm, 976 nm, 985 nm, 1213 nm, 1216 nm, 1221 nm, 1826 nm, 2341 nm, 2357 nm, 2380 nm, and 2382 nm.
(a) (b) For the Honghu area, after the five-fold cross-validation, the RMSECV trends of the CR of the reflectance spectra are shown in Figure 5. As can be seen from Figure 5, when the sampling reached the 33rd time (the corresponding number of characteristic wavelengths is 17), the RMSECV curve is at its lowest point, indicating that the selected subset of spectral wavelengths is the best subset. In the same way as the Daye area, the characteristic wavelengths selected by sCARS were used by SPA to eliminate collinearity among wavelengths. Thirteen wavelengths were finally selected and the wavelengths were 437.

Characteristic Wavelengths
Since the characteristic wavelengths selected by sCARS may still be redundant and have low correlation, the model may have low efficiency and accuracy. The sCARS-SPA can select the For the Honghu area, after the five-fold cross-validation, the RMSECV trends of the CR of the reflectance spectra are shown in Figure 6. As can be seen from Figure 6, when the sampling reached the 33rd time (the corresponding number of characteristic wavelengths is 17), the RMSECV curve is at its lowest point, indicating that the selected subset of spectral wavelengths is the best subset. In the same way as the Daye area, the characteristic wavelengths selected by sCARS were used by SPA to eliminate collinearity among wavelengths. Thirteen wavelengths were finally selected and the wavelengths were 437. 3  lowest point, indicating that the selected group of spectral wavelengths is the best at this stage. Then, for the reflectance spectral curve, the training set divided by SPXY is used in SPA to perform second characteristic wavelength selection, and the selected characteristic wavelength is used for final modeling. The characteristic wavelengths selected are shown in Figure 4b. The characteristic wavelengths are located in the area with a significant change, which shows that the SPA can be effective. Eleven wavelengths were finally selected and their wavelengths were 826 nm, 976 nm, 985 nm, 1213 nm, 1216 nm, 1221 nm, 1826 nm, 2341 nm, 2357 nm, 2380 nm, and 2382 nm.  For the Honghu area, after the five-fold cross-validation, the RMSECV trends of the CR of the reflectance spectra are shown in Figure 5. As can be seen from Figure 5, when the sampling reached the 33rd time (the corresponding number of characteristic wavelengths is 17), the RMSECV curve is at its lowest point, indicating that the selected subset of spectral wavelengths is the best subset. In the same way as the Daye area, the characteristic wavelengths selected by sCARS were used by SPA to eliminate collinearity among wavelengths. Thirteen wavelengths were finally selected and the wavelengths were 437.

Characteristic Wavelengths
Since the characteristic wavelengths selected by sCARS may still be redundant and have low correlation, the model may have low efficiency and accuracy. The sCARS-SPA can select the

Characteristic Wavelengths
Since the characteristic wavelengths selected by sCARS may still be redundant and have low correlation, the model may have low efficiency and accuracy. The sCARS-SPA can select the wavelength with the least redundant information as much as possible and reduce the number of characteristic wavelengths to solve the problem of information redundancy and collinearity, which improves the estimation accuracy and execution efficiency of the model. As shown in Figure 7, the characteristic wavelength of Honghu was reduced from 17 to 13, and the characteristic wavelength of Daye was reduced from 59 to 11. The sensitive wavelengths of soil As content were concentrated in 437.3-441.9 nm, 826-999.4 nm, 1006.5-1099.3 nm, 1213-1221 nm, 1826 nm, and 2303-2382 nm, which indicate that the wavelengths in these six parts were closely related to the soil As content. wavelength with the least redundant information as much as possible and reduce the number of characteristic wavelengths to solve the problem of information redundancy and collinearity, which improves the estimation accuracy and execution efficiency of the model. As shown in Figure 7, the characteristic wavelength of Honghu was reduced from 17 to 13, and the characteristic wavelength of Daye was reduced from 59 to 11. The sensitive wavelengths of soil As content were concentrated in 437.3-441.9 nm, 826-999.4 nm, 1006.5-1099.3 nm, 1213-1221 nm, 1826 nm, and 2303-2382 nm, which indicate that the wavelengths in these six parts were closely related to the soil As content.

The Result of Spectral Transformation
As a general spectral transformation, CR can effectively highlight spectral characteristics, and enhance the effect of characteristic wavelength selection algorithm. From Table 2, it can be seen that based on the CR of the reflectance spectra, the estimation accuracy of the model established by the selected characteristic wavelength based on sCARS-SPA has been significantly improved. For the Honghu area, R increased from less than 0 to 0.8237 and 0.8644. For the Daye area, R increased from −0.1668 and 0.2597 to 0.7002 and 0.8781, respectively. It shows that CR can better enhance the effect of sCARS-SPA and improve prediction accuracy. After the CR was performed respectively, the effects of sCARS and sCARS-SPA in Honghu and Daye are compared, based on the three modeling methods of PLSR, RBFNN, and SFLA-RBFNN. Figure 8 shows the comparison of R . It is obvious that the sCARS-SPA model is superior to the sCARS model in terms of test accuracy. It is indicated that that the obvious collinearity in the characteristic wavelengths selected by the sCARS method is significantly eliminated after the SPA was performed.

The Result of Spectral Transformation
As a general spectral transformation, CR can effectively highlight spectral characteristics, and enhance the effect of characteristic wavelength selection algorithm. From Table 2, it can be seen that based on the CR of the reflectance spectra, the estimation accuracy of the model established by the selected characteristic wavelength based on sCARS-SPA has been significantly improved. For the Honghu area, R 2 p increased from less than 0 to 0.8237 and 0.8644. For the Daye area, R 2 p increased from −0.1668 and 0.2597 to 0.7002 and 0.8781, respectively. It shows that CR can better enhance the effect of sCARS-SPA and improve prediction accuracy. Table 2. Comparison of prediction accuracy before and after spectral pretreatment based on partial least squares regression (PLSR) and RFBNN model.

The Performance of sCARS-SPA Characteristic Wavelength Selection Algorithm
After the CR was performed respectively, the effects of sCARS and sCARS-SPA in Honghu and Daye are compared, based on the three modeling methods of PLSR, RBFNN, and SFLA-RBFNN. Figure 8 shows the comparison of R 2 p . It is obvious that the sCARS-SPA model is superior to the sCARS model in terms of test accuracy. It is indicated that that the obvious collinearity in the characteristic wavelengths selected by the sCARS method is significantly eliminated after the SPA was performed.
Sensors 2020, 20, x FOR PEER REVIEW 11 of 16 Figure 7. Comparison of the sCARS and sCARS-SPA in .
As shown in Tables 3 and 4, the model that obtains the highest fitting accuracy and prediction accuracy is SFLA-RBFNN. For validation set, compared to sCARS method, in the Daye area, R increased from 0.8403 to 0.9230, RMSE decreased from 0.3055 to 0.2161, and MAE decreased from 0.8403 to 0.1700. In the Honghu area, R increased from 0.8514 to 0.8862, RMSE decreased from 0.9616 to 0.8501 and MAE decreased from 0.7810 to 0.7197. For calibration set, in the Honghu are and the Daye area, R is above 0.9. And the difference between R and R is further smaller, avoiding the occurrence of over-fitting and under-fitting. It is obvious that sCARS-SPA improves the prediction accuracy of PLSR, RBFNN and SFLA-RBFNN, which allows these models to better predict the As content in the soil, especially the sCARS-SPA-SFLA-RBFNN model.  As shown in Tables 3 and 4, the model that obtains the highest fitting accuracy and prediction accuracy is SFLA-RBFNN. For validation set, compared to sCARS method, in the Daye area, R 2 p increased from 0.8403 to 0.9230, RMSE p decreased from 0.3055 to 0.2161, and MAE p decreased from 0.8403 to 0.1700. In the Honghu area, R 2 p increased from 0.8514 to 0.8862, RMSE p decreased from 0.9616 to 0.8501 and MAE p decreased from 0.7810 to 0.7197. For calibration set, in the Honghu are and the Daye area, R 2 c is above 0.9. And the difference between R 2 c and R 2 p is further smaller, avoiding the occurrence of over-fitting and under-fitting. It is obvious that sCARS-SPA improves the prediction accuracy of PLSR, RBFNN and SFLA-RBFNN, which allows these models to better predict the As content in the soil, especially the sCARS-SPA-SFLA-RBFNN model. In this study, the PLSR model, RBFNN model, and SFLA-RBFNN model were established and compared in different research areas. For RBFNN, the input layer is composed of the reflectivity of the characteristic wavelength selected by sCARS-SPA, and the output layer is the soil arsenic content. For SFLA, the population size is set to 100, and the maximum evolutionary generation is set to 300. As shown in Figure 9, compared with the PLSR and the RBFNN, the points of SFLA-RBF are closer to the 1:1 line and have higher prediction accuracy. In Daye, R 2 p is 0.9320, RMSE p is 0.2161, and MAE p is 0.1700. R 2 p increased by 0.0539, RMSE p decreased by 0.0556, and MAE p decreased by 0.0733. In addition, in areas with low As content in the soil, the prediction accuracy was more significantly improved by the SFLA method.
Sensors 2020, 20, x FOR PEER REVIEW 12 of 16 In this study, the PLSR model, RBFNN model, and SFLA-RBFNN model were established and compared in different research areas. For RBFNN, the input layer is composed of the reflectivity of the characteristic wavelength selected by sCARS-SPA, and the output layer is the soil arsenic content. For SFLA, the population size is set to 100, and the maximum evolutionary generation is set to 300. As shown in Figure 9, compared with the PLSR and the RBFNN, the points of SFLA-RBF are closer to the 1:1 line and have higher prediction accuracy. In Daye, R is 0.9320, RMSE is 0.2161, and MAE is 0.1700. R increased by 0.0539, RMSE decreased by 0.0556, and MAE decreased by 0.0733. In addition, in areas with low As content in the soil, the prediction accuracy was more significantly improved by the SFLA method. As shown in Figure 10, in the Honghu area, the SFLA-RBF model also has the highest prediction accuracy. R is 0.8862, RMSE is 0.8501, and MAE is 0.7197. Compared with RBFNN, R increased by 0.0218, RMSE decreased by 0.0772, and MAE decreased by 0.078. It can be seen that compared with other models, the sCARS-SPA-SFLA-RBFNN method has better robustness and accuracy and is the best model for predicting soil As content.

Discussion
Due to the presence of contaminants such as heavy metals in the soil, the spectral characteristics of the contaminated areas are slightly different from other uncontaminated areas. Hyperspectral data contain more detailed information owing to the high spectral resolution and has the good ability to identify the small differences. At the same time, a large number of bands causes the model to be complicated, reducing the prediction accuracy and efficiency. How to select characteristic bands from a large number of bands and establish a suitable and accurate model are the problems that need to be solved.
The CR-sCARS-SPA method can improve the generalization ability and accuracy of the estimation model of heavy metal content in soil while ensuring the extraction of effective feature As shown in Figure 10, in the Honghu area, the SFLA-RBF model also has the highest prediction accuracy. R 2 p is 0.8862, RMSE p is 0.8501, and MAE p is 0.7197. Compared with RBFNN, R 2 p increased by 0.0218, RMSE p decreased by 0.0772, and MAE p decreased by 0.078. It can be seen that compared with other models, the sCARS-SPA-SFLA-RBFNN method has better robustness and accuracy and is the best model for predicting soil As content.
Sensors 2020, 20, x FOR PEER REVIEW 12 of 16 In this study, the PLSR model, RBFNN model, and SFLA-RBFNN model were established and compared in different research areas. For RBFNN, the input layer is composed of the reflectivity of the characteristic wavelength selected by sCARS-SPA, and the output layer is the soil arsenic content. For SFLA, the population size is set to 100, and the maximum evolutionary generation is set to 300. As shown in Figure 9, compared with the PLSR and the RBFNN, the points of SFLA-RBF are closer to the 1:1 line and have higher prediction accuracy. In Daye, R is 0.9320, RMSE is 0.2161, and MAE is 0.1700. R increased by 0.0539, RMSE decreased by 0.0556, and MAE decreased by 0.0733. In addition, in areas with low As content in the soil, the prediction accuracy was more significantly improved by the SFLA method. As shown in Figure 10, in the Honghu area, the SFLA-RBF model also has the highest prediction accuracy. R is 0.8862, RMSE is 0.8501, and MAE is 0.7197. Compared with RBFNN, R increased by 0.0218, RMSE decreased by 0.0772, and MAE decreased by 0.078. It can be seen that compared with other models, the sCARS-SPA-SFLA-RBFNN method has better robustness and accuracy and is the best model for predicting soil As content.

Discussion
Due to the presence of contaminants such as heavy metals in the soil, the spectral characteristics of the contaminated areas are slightly different from other uncontaminated areas. Hyperspectral data contain more detailed information owing to the high spectral resolution and has the good ability to identify the small differences. At the same time, a large number of bands causes the model to be complicated, reducing the prediction accuracy and efficiency. How to select characteristic bands from a large number of bands and establish a suitable and accurate model are the problems that need to be solved.
The CR-sCARS-SPA method can improve the generalization ability and accuracy of the estimation model of heavy metal content in soil while ensuring the extraction of effective feature

Discussion
Due to the presence of contaminants such as heavy metals in the soil, the spectral characteristics of the contaminated areas are slightly different from other uncontaminated areas. Hyperspectral data contain more detailed information owing to the high spectral resolution and has the good ability to identify the small differences. At the same time, a large number of bands causes the model to be complicated, reducing the prediction accuracy and efficiency. How to select characteristic bands from a large number of bands and establish a suitable and accurate model are the problems that need to be solved.
The CR-sCARS-SPA method can improve the generalization ability and accuracy of the estimation model of heavy metal content in soil while ensuring the extraction of effective feature bands. Commonly used methods based on the principle of "survival of the fittest" (such as GA [32,33], CARS [34], etc.) can select characteristics bands with strong adaptability and remove incoherent bands, but do not consider the increase in model complexity caused by the collinearity problem, which affected the prediction accuracy [35]. Before the feature selection, CR can effectively improve the accuracy of subsequent feature band selection [36,37]. It can be seen from this research that combining the SPA method with sCARS can solve the problem of data redundancy and simplify the model, thereby improving the generalization ability and prediction accuracy of the model.
Compared with other regression models, the SFLA-RBFNN model has more advantages. Soil arsenic content and spectral reflectance may often have a complex nonlinear relationship [38], so the PLSR method has poor performance in some cases. The neural network algorithm has excellent performance and efficiency, and has excellent performance in solving complex nonlinear problems [39][40][41][42], but it is prone to lack of generalization ability. Therefore, it is combined with SFLA's excellent global search ability to optimize the initial parameters of RBFNN, so as to obtain a model with better fitting ability and prediction accuracy.
In addition, the research object of this paper is the soil samples at the sampling site, and the research scope may have certain limitations. In future research, our study will make full use of satellite and unmanned aerial vehicle remote sensing data to expand our research scope to obtain regional inversion results. Compared with the traditional spatial interpolation method based on many sampling points [43,44], the proposed algorithm combined with hyperspectral images to detect soil pollution is more economical and convenient. Moreover, based on the algorithm, other substances similar to arsenic (such as phosphorus) can also be included in the study. In addition, different forms of arsenic in soil can be estimated separately, combined with other more sophisticated detection techniques [45], which can provide the basis for comprehensive detection and protection of the soil.

Conclusions
In this study, the soil samples of Daye and Honghu were taken as the study objects. Using the two characteristic selection methods of sCARS and sCARS-SPA and the three modeling methods of PLSR, RBFNN, and SFLA-RBFNN, the soil As content was estimated based on hyperspectral data.
The research conclusions are as follows: 1.
The continuous removal of the spectral reflectance of different land types can effectively enhance the effect of selecting characteristic wavelengths related to soil As content.

2.
By comparing the performance of a model based on sCARS and sCARS-SPA, it is found that the model established in the wavelengths selected by sCARS-SPA has higher prediction accuracy.

3.
In the two research areas, PLSR, RBFNN, and SFLA-RBFNN were used to estimate the soil As content. It was found that the SFLA-RBFNN model has highest prediction accuracy and generalization.

4.
The results of experiment show that sCARS-SPA-SFLA-RBFNN model is feasible for spectral analysis of soil As content. The model not only reduces the redundancy of spectral information and solves the problem of collinearity, but also has good prediction accuracy. It provides a suitable method for the prediction of large-scale and high-precision soil As content.