Improving the Estimation Accuracy of Soil Organic Matter Content Based on the Spectral Reflectance from Soils with Different Grain Sizes

Abstract: Accurate and rapid estimation of soil organic matter (SOM) content is of great significance for advancing precision agriculture. Compared with traditional chemical methods, hyperspectral estimation is superior in rapidly estimating SOM content. Soil grain size affects soil spectral reflectance, thereby affecting the accuracy of hyperspectral estimation. However, the appropriate soil grain size for hyperspectral analysis remains largely unknown. This study proposes an optimal hyperspectral estimation method for determining the SOM content of farmland soil in the Ibinur Lake Irrigation Area (ILIA) of the northwest arid zones of China. The original spectral reflectance of the 20-mesh (0.85 mm) and 60-mesh (0.25 mm) sieved soils was obtained, and the feature wavebands were selected using five types of spectral transformations. Then, hyperspectral estimation models were constructed based on partial least squares regression (PLSR), support vector machine (SVM), random forest (RF), and extreme gradient boosting (XGBoost). The results show that SOM content had a relatively higher correlation coefficient with the spectral reflectance of the 0.85 mm sieved soil than with that of the 0.25 mm sieved soil. Transformation of the original soil spectral reflectance effectively enhanced the spectral characteristics related to SOM content. Soil grain size clearly affected spectral reflectance and the accuracy of the hyperspectral estimation models. The overall stability and estimation accuracy of the RF model were significantly higher than those of the PLSR, SVM, and XGBoost models. Finally, the RF model combined with the root mean first-order differentiation (RMSFD) of the spectral reflectance of the 0.85 mm sieved soil (R² = 0.82, RMSE = 2.37, RPD = 2.27) was identified as the best method for estimating the SOM content of farmland soil in the ILIA.


Introduction
Soil organic matter (SOM) is an indicator for evaluating soil quality and is crucial for soil health [1]. SOM plays a pivotal role in soil fertility and agricultural productivity [2]. The significant role of SOM also extends to influencing the global carbon budget, mitigating environmental pollution, and influencing regional climate change [3,4]. Accurately assessing SOM content is essential for identifying areas requiring fertilization within sustainable soil management practices and sustainable agriculture [5]. In addition, the rapid estimation of SOM content is crucial for understanding the spatial distribution of soil fertility [6]. However, the dynamics of SOM content, influenced by various temporal and spatial factors, can result in variability and spatial heterogeneity of SOM, particularly in agricultural soils [7,8]. Differences in natural environment and soil types, together with the complex heterogeneity of soils in different areas, further limit the estimation of SOM content [9]. Therefore, developing a rapid and effective monitoring technique for SOM content is challenging because of the limitations associated with differences in regional soil environments.
Land 2024, 13, 1111
Hyperspectral remote sensing technology is known for its feasibility in accurately and rapidly monitoring soil properties. It has been used for estimating soil moisture [10], soil salt content [11], soil total nitrogen [12], heavy metals [13,14], and SOM content [15]. Hyperspectral estimation of SOM content using soil spectral reflectance is of great significance for advancing modern precision agriculture [1]. Monitoring soil spectral signatures and then estimating SOM content from them supports wider environmental and agricultural endeavors [15]. However, the accuracy of hyperspectral estimation of SOM content depends on the quality of data processing and model construction, and its overall accuracy may be slightly lower than that of traditional methods. Nonetheless, hyperspectral estimation of SOM offers advantages in terms of higher temporal and spatial resolution, which are critical for effectively estimating SOM content [16].
Soil spectral reflectance is a comprehensive indicator of the spectral behavior of the physical and chemical properties of soil [17]. Soil grain size leads to obvious differences in the physical properties of soil, as well as characteristic changes in soil spectral reflectance [18]. It has been proven that soil grain size significantly affects soil properties, including pore structure, fungal hyphae, and SOM [2]. In general, soil grain size also affects the spectral reflectance data of soil. The average spectral reflectance of soil varies with grain size across all wavelength bands, highlighting the influence of soil texture on spectral reflectance data [19]. Even for the same type of soil, different grain sizes affect the spectral characteristics [20,21]. In addition, the estimation accuracy of hyperspectral models based on spectral reflectance from soils with different grain sizes differs [18]. In summary, soil grain size can affect the accuracy and stability of hyperspectral estimation models [22,23].
There is no consensus on the optimal soil grain size for SOM estimation, and the effects of soil grain size on the hyperspectral estimation accuracy of SOM content are nearly unknown, especially for farmlands in arid zones. The main objectives of this research are to (a) obtain the feature spectral wavebands for the SOM content of farmland soils in the northwest arid zones of China; (b) detect the effects of soil grain size on the spectral reflectance related to SOM; (c) clarify the effects of soil grain size on the hyperspectral estimation accuracy of SOM content; and (d) identify the best hyperspectral method for rapidly estimating the SOM content of farmland soil by means of the partial least squares regression (PLSR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost) models. The results of this study offer a technical reference for selecting the optimum soil grain size during the hyperspectral estimation of SOM content.

Data Acquisition
This study was conducted in the Ibinur Lake Irrigation Area (ILIA) of the NW arid zones of China, with an experimental area range of 80°50′-83°00′ E and 44°20′-45°10′ N (Figure 1), which covers an area of 3300 km². The experimental area has a temperate continental dry climate, with an annual average precipitation of 105.27 mm, 80% of which is concentrated from June to September. The annual average evaporation reaches 2221.3 mm, and the annual average temperature is 7.8 °C. The main soil types in the ILIA are irrigation desert soil, sandy soil, saline soil, and calcareous soil [24].
The field investigation, soil sampling, chemical analysis, and spectral measurement of this study were conducted in May 2023. A total of 106 surface soil specimens from the top 20 cm (0-20 cm) layer were collected from farmlands (mainly cultivated land including cotton, corn, beets, and wheat) in accordance with the soil sampling standard detailed in "NY/T 395-2000" [25]. The locations of the sample sites are also shown in Figure 1.
Five sub-samples (approximately 400 g each) were collected at each sample site (100 m × 100 m area) and mixed into one typical soil sample (about 2 kg). After three days of air-drying in the laboratory, non-soil materials such as plant roots and stones were removed from the collected samples. The collected soil samples were divided into two groups. One group was ground and then passed through a 60-mesh (grain size of ≤0.25 mm) sieve for determining the SOM content, and the other group was ground and passed through a 20-mesh (grain size of ≤0.85 mm) and a 60-mesh sieve, respectively, for measuring the soil spectral reflectance.
The SOM content was determined in accordance with the National Standard of China detailed in NY/T 1121.6-2006 [26]. Soil spectral reflectance was measured with a FieldSpec® 3 portable object spectrometer (Analytical Spectral Devices, Boulder, CO, USA) with a spectral resolution of 1 nm. The spectrometer was switched on for 30 min before the spectral measurement, and it was calibrated using a black and white board. The spectral reflectance data of 350-1750 nm (including the visible light band (350-1000 nm) and the near-infrared band (900-1700 nm)) were obtained for the soils of the two grain sizes. Changes in soil moisture may affect the predictive performance of the models, especially in scenarios where it is necessary to capture long-term trends or seasonal variations. Uncertainty in soil moisture may increase the uncertainty of model predictions [10]. Therefore, when developing and using these models, the influence of soil moisture dynamics needs to be considered, and the water absorption band should be removed to improve the accuracy and robustness of the models. The spectral data within 350-399 nm and 1301-1430 nm were excluded to reduce abnormal spectral values [27].
Each soil sample was scanned 10 times to obtain 10 spectral reflectance curves, and their average was taken as the final spectral reflectance data. Subsequently, the Savitzky-Golay (S-G) algorithm was used to smooth the final spectral reflectance data to improve the signal-to-noise ratio [28]. Finally, the spectral reflectance curves of the 0.85 mm and 0.25 mm sieved soils after the above spectral pretreatment were obtained.
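The pretreatment described above (averaging repeated scans, S-G smoothing, and excluding the 350-399 nm and 1301-1430 nm ranges) can be sketched in Python with NumPy as follows. This is a minimal illustration on synthetic data, not the authors' code: the function names, the window length of 11, the polynomial order of 2, and the choice to smooth before dropping bands are all assumptions, since the source does not report them. In practice a library routine such as `scipy.signal.savgol_filter` would usually replace the hand-written smoother.

```python
import numpy as np

def average_scans(scans):
    """Average repeated scans of one sample: (n_scans, n_bands) -> (n_bands,)."""
    return np.asarray(scans, dtype=float).mean(axis=0)

def savgol_smooth(spectrum, window=11, polyorder=2):
    """Savitzky-Golay smoothing via a local least-squares polynomial fit.

    window and polyorder are illustrative values (not given in the source).
    """
    if window % 2 == 0 or window <= polyorder:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row 0 of the pseudo-inverse holds the weights of the fitted centre value.
    weights = np.linalg.pinv(design)[0]
    # Repeat the edge values so the output keeps the input length.
    padded = np.pad(np.asarray(spectrum, dtype=float), half, mode="edge")
    return np.convolve(padded, weights[::-1], mode="valid")

def drop_bands(wavelengths, spectrum, excluded=((350, 399), (1301, 1430))):
    """Remove the wavelength ranges (inclusive, in nm) excluded in the text."""
    wavelengths = np.asarray(wavelengths)
    keep = np.ones(wavelengths.shape, dtype=bool)
    for low, high in excluded:
        keep &= (wavelengths < low) | (wavelengths > high)
    return wavelengths[keep], np.asarray(spectrum)[keep]

# Synthetic stand-in for ten scans of one sample at 1 nm steps (350-1750 nm).
rng = np.random.default_rng(0)
wavelengths = np.arange(350, 1751)
true_curve = 0.1 + 0.4 * (wavelengths - 350) / 1400
scans = true_curve + rng.normal(scale=0.01, size=(10, wavelengths.size))

mean_spectrum = average_scans(scans)
smoothed = savgol_smooth(mean_spectrum)  # smooth first, so no window spans the gap
kept_wavelengths, kept_spectrum = drop_bands(wavelengths, smoothed)
```

Smoothing before removing the excluded ranges avoids fitting a polynomial window across the discontinuity left by the removed water absorption band; the reverse order would require smoothing each retained segment separately.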

Spectral Feature Extraction
The original soil spectral reflectance data of the two grain sizes were mathematically transformed into the first-order differentiation (FD), logarithmic FD (LTFD), root mean FD (RMSFD), reciprocal logarithmic FD (ATFD), and logarithmic reciprocal FD (RLFD) to enhance the spectral information related to SOM content and to reduce unpredictable interference from the environmental background [9]. To select the feature wavebands, correlation analysis was performed between SOM content and the original and transformed spectral reflectance data of soils with the two grain sizes. Pearson's correlation coefficient (r) between SOM content and spectral reflectance was calculated, and its significance was tested at the p < 0.01 level (two-tailed), for which the threshold for r was ±0.248 [29]. Wavebands with absolute correlation coefficients greater than 0.248 were then selected as the feature wavebands and used for the subsequent model construction. The correlation coefficient (r) was determined following the specialized literature (Zhong et al.). The r value ranges from −1 to +1, and based on its absolute value, the strength of the correlation is classified as: maximum correlation (0.80 ≤ |r| ≤ 1.0), strong correlation (0.60 ≤ |r| < 0.80), moderate correlation (0.40 ≤ |r| < 0.60), weak correlation (0.20 ≤ |r| < 0.40), and weakest correlation (0 ≤ |r| < 0.20) [30].
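The waveband selection described above can be illustrated with a short NumPy sketch: transform the spectra, compute Pearson's r between each waveband and SOM content, and keep the wavebands with |r| > 0.248. The source does not give explicit formulas for the transformations, so the three forms below (derivative of the reflectance, of its natural logarithm, and of its square root) are assumptions used only for illustration, as are all function names.

```python
import numpy as np

# Assumed forms of three of the transformations named in the text; the
# source gives no explicit formulas, so these are illustrative only.
TRANSFORMS = {
    "FD": lambda r: r,   # first derivative of the reflectance itself
    "LTFD": np.log,      # first derivative of ln(reflectance)
    "RMSFD": np.sqrt,    # first derivative of sqrt(reflectance)
}

def transform_spectra(reflectance, wavelengths, kind):
    """Apply the base transform, then differentiate along the wavelength axis."""
    base = TRANSFORMS[kind](np.asarray(reflectance, dtype=float))
    return np.gradient(base, np.asarray(wavelengths, dtype=float), axis=-1)

def pearson_per_band(spectra, som):
    """Pearson's r between SOM content and every waveband (column)."""
    spectra = np.asarray(spectra, dtype=float)
    som = np.asarray(som, dtype=float)
    centred_x = spectra - spectra.mean(axis=0)
    centred_y = som - som.mean()
    numerator = centred_x.T @ centred_y
    denominator = np.sqrt((centred_x ** 2).sum(axis=0) * (centred_y ** 2).sum())
    with np.errstate(invalid="ignore", divide="ignore"):
        return numerator / denominator  # NaN for constant wavebands

def select_feature_bands(spectra, som, threshold=0.248):
    """Return indices of wavebands with |r| above the threshold, plus all r."""
    r = pearson_per_band(spectra, som)
    return np.flatnonzero(np.abs(r) > threshold), r
```

With a matrix of transformed spectra of shape (samples, wavebands) and a vector of SOM contents, `select_feature_bands` returns the column indices to pass on to model construction; constant wavebands yield NaN and are never selected.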

Model Construction
The samples used for spectral extraction were divided into a calibration set (81 samples) and a validation set (25 samples) according to the Kennard-Stone algorithm [1]. The calibration set was used to construct and train the models, whereas the validation set was used to test and evaluate model performance. Among various algorithms, partial least squares regression (PLSR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost) were adopted to establish the hyperspectral estimation models of SOM. PLSR is a statistical method designed to model the relationship between a set of independent variables (x) and a dependent variable (y); it is especially useful when the independent variables are multicollinear or when the number of predictors exceeds the number of observations [13]. SVM is a supervised learning algorithm which, for regression, identifies a hyperplane that optimally represents the relationship between input and output variables while permitting a certain degree of deviation or error. SVM is effective for non-linear relationships and is applied in diverse research fields [31]. RF is an ensemble learning algorithm for regression analysis that enhances prediction accuracy and stability by combining multiple decision trees [32]. XGBoost is an advanced machine learning algorithm frequently used for regression and classification tasks and is known for its efficiency in handling non-linear relationships [33]. Data from different soil types may have different distribution patterns and characteristics, which requires a model to have strong generalization ability to adapt to unseen soil types. The generalization ability of these models needs to be ensured through appropriate parameter adjustment and model validation. The PLSR, SVM, and RF models were constructed to predict SOM content for the two sieve sizes in this study. The models were implemented in Python, and the "random-state" of the three models was set to 69. Because the RF model is inherently random, the values of its parameters ("n-estimators" and another "random-state") affect the predictive performance of the model. Considering model performance, model running time, sample number, and other factors, these parameters ("n-estimators" and another "random-state") of the RF model were set in the range from 1 to 99.
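The Kennard-Stone division into calibration and validation sets can be sketched as follows. This is a generic NumPy implementation of the algorithm rather than the authors' code; the function name is hypothetical, and which variables the distances were computed on (raw spectra, transformed spectra, or selected wavebands) is not stated in the source, so the input matrix here is an assumption.

```python
import numpy as np

def kennard_stone_split(features, n_calibration):
    """Split sample indices into calibration and validation sets.

    Starts from the two mutually farthest samples, then repeatedly adds the
    sample whose distance to its nearest already-selected sample is largest.
    """
    features = np.asarray(features, dtype=float)
    n_samples = features.shape[0]
    if not 2 <= n_calibration <= n_samples:
        raise ValueError("n_calibration must be between 2 and the sample count")

    # Squared Euclidean distances (ordering is the same as for distances).
    squared_norms = (features ** 2).sum(axis=1)
    dist = squared_norms[:, None] + squared_norms[None, :] - 2.0 * features @ features.T
    dist = np.maximum(dist, 0.0)  # guard against tiny negative rounding errors

    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [k for k in range(n_samples) if k not in selected]

    while len(selected) < n_calibration:
        # Distance from each remaining sample to its nearest selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)

    return np.array(selected, dtype=int), np.array(remaining, dtype=int)
```

For the 106 samples in this study, calling `kennard_stone_split(features, 81)` would return 81 calibration indices and 25 validation indices; the calibration set then spans the feature space as evenly as possible.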

Model Evaluation Indices
To compare the accuracy and reliability of the constructed hyperspectral estimation models, the R² (coefficient of determination), RMSE (root mean square error), and RPD (residual predictive deviation) of the validation set were used. These three indices were calculated according to Equation (1), where y_m and y_e represent the ground-measured and hyperspectrally estimated values of SOM content of sample i, respectively, y_ave represents the average value of the ground-measured SOM contents, and n is the total number of collected soil samples.
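Written with the symbols defined above, RMSE and RPD take the following standard forms. This is a sketch rather than a reproduction of the original equation: the per-sample index i attached to y_m and y_e and the symbol S.D (the standard deviation of the ground-measured values) are notational assumptions. The worked value further assumes that the S.D is the validation-set figure of 5.38 g/kg and that the RMSE is the 2.37 reported for the RF model with RMSFD on the 0.85 mm sieved soil.

```latex
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{m,i} - y_{e,i}\right)^{2}}, \\
\mathrm{RPD}  &= \frac{\mathrm{S.D}}{\mathrm{RMSE}}, \\
\mathrm{RPD}  &= \frac{5.38}{2.37} \approx 2.27 .
\end{aligned}
```

Under these assumptions the computed RPD agrees with the reported value of 2.27.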
In general, a robust hyperspectral estimation model has higher R² and RPD values but a lower RMSE [30,34]. R² is used to assess the stability and estimation accuracy in reflectance spectroscopy studies [35]. The stability and prediction accuracy indicated by R² are classified into five categories: R² > 0.90 indicates an "excellent prediction", whereas 0.82 ≤ R² < 0.90 indicates a "good prediction", 0.66 ≤ R² < 0.82 indicates an "approximate quantitative prediction", 0.50 ≤ R² < 0.66 indicates a "poor prediction", and R² < 0.50 denotes an "unsuccessful prediction" [35,36]. The RMSE is used to evaluate the estimation quality of the model. A lower RMSE indicates a higher estimation quality of the model.
The RPD is defined as the ratio of the standard deviation (S.D) of the ground-measured data to the RMSE of the cross-validation. It is used to evaluate the estimation ability of the hyperspectral model [37]. The estimation ability indicated by RPD is divided into five categories: RPD < 1.40 indicates a "poor model and/or estimation", whereas 1.40 ≤ RPD < 1.80 indicates a "fair model and/or estimation", 1.80 ≤ RPD < 2.00 indicates a "good model and/or estimation", 2.00 ≤ RPD < 2.50 indicates a "very good quantitative model and/or estimation", and RPD > 2.50 indicates an "excellent model and/or estimation" [37,38].
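The two classification schemes above amount to simple threshold lookups, sketched here in Python. The function names are hypothetical, and the handling of the boundary values that the stated intervals leave unassigned (R² exactly 0.90 and RPD exactly 2.50) is an assumption: they are placed in the lower category.

```python
def classify_r2(r2: float) -> str:
    """Map an R² value to the prediction category listed in the text."""
    if r2 > 0.90:
        return "excellent prediction"
    if r2 >= 0.82:  # also catches R² == 0.90, which the text leaves unassigned
        return "good prediction"
    if r2 >= 0.66:
        return "approximate quantitative prediction"
    if r2 >= 0.50:
        return "poor prediction"
    return "unsuccessful prediction"


def classify_rpd(rpd: float) -> str:
    """Map an RPD value to the model category listed in the text."""
    if rpd > 2.50:
        return "excellent model and/or estimation"
    if rpd >= 2.00:  # also catches RPD == 2.50, which the text leaves unassigned
        return "very good quantitative model and/or estimation"
    if rpd >= 1.80:
        return "good model and/or estimation"
    if rpd >= 1.40:
        return "fair model and/or estimation"
    return "poor model and/or estimation"
```

For example, the values reported in the abstract for the RF model with RMSFD on the 0.85 mm sieved soil (R² = 0.82, RPD = 2.27) map to a "good prediction" and a "very good quantitative model and/or estimation", respectively.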

Descriptive Analysis of SOM
Table 1 details the basic statistical outcomes for the calibration, validation, and total sets of SOM content of farmland soil in the ILIA. The pH, salt content, and EC values of the collected soil samples are also given. The S.D and CV (coefficient of variation) values of SOM content were used to quantify data variability. The SOM content of the total set ranged from 6.04 to 31.60 g/kg, with an average value of 17.18 g/kg. The average pH value of the collected samples was 8.85, indicating an alkaline soil condition, whereas the average salt content was 0.40 g/kg and the average EC was 1432.30 µS/cm. Specifically, the average SOM content (16.94 g/kg for the calibration set vs. 17.96 g/kg for the validation set), S.D (4.99 g/kg for the calibration set vs. 5.38 g/kg for the validation set), and CV (29.48% for the calibration set vs. 29.94% for the validation set) were closely consistent between the calibration and validation sets. This similarity indicates that the division of the dataset in this study was appropriate for the subsequent model construction [9].
As shown in Figure 2, the ranges of the original spectral reflectance of the 0.85 mm and 0.25 mm sieved soil samples were 0.074-0.598 and 0.122-0.647, respectively. The average reflectance values of the 0.85 mm and 0.25 mm sieved soil samples were 0.334 and 0.423, respectively. The spectral reflectance curves of the two sieved soils increased rapidly in the 400-800 nm wavelength range, while remaining relatively stable in the 800-1300 nm and 1430-1750 nm wavelength ranges. The steeper slope observed in the 400-600 nm range may be attributed to the presence of iron in the soil. However, the spectral reflectance curves decreased in the 1300-1430 nm wavelength range. Besides, the spectral reflectance curves of the two sieved soils start to change significantly at around 600 nm, and the spectral curves of the investigated soil samples were consistent in shape, trend, and the positions of the main peaks and valleys. Generally, the spectrum of the 0.25 mm sieved soil was slightly higher than that of the 0.85 mm sieved soil (Figure 2). This result indicates the obvious effects of soil grain size on soil spectral reflectance.

Correlations between Soil Spectral Reflectance and SOM Content
The correlation between the soil spectral reflectance (including the original and mathematically transformed spectral reflectance) and the SOM content of the collected soil samples was analyzed (Figure 3). The SOM content exhibited a relatively weak association with the original spectral reflectance (R) of both the 0.25 mm sieved soil (r = −0.227, at the weak correlation level) and the 0.85 mm sieved soil (r = −0.415, at the moderate correlation level) (Figure 3a). For the original spectral reflectance, the 0.85 mm sieved soil exhibited a better correlation with SOM content than the 0.25 mm sieved soil. The original spectral reflectance curve of the 0.25 mm sieved soil does not meet the correlation test threshold of ±0.248. The original spectral reflectance of soil, especially of the 0.25 mm sieved soil, correlated poorly with the SOM content of farmland soil in the ILIA.
The correlations between SOM content and the five types of transformed spectral reflectance data, including FD (Figure 3b), LTFD (Figure 3c), RMSFD (Figure 3d), ATFD (Figure 3e), and RLFD (Figure 3f), were significantly improved. Specifically, for the 0.25 mm sieved soil, the absolute values of the maximum correlation coefficients between SOM content and the FD, LTFD, RMSFD, ATFD, and RLFD transformed soil spectral reflectance data were 0.544 (at 885 nm), 0.533 (at 885 nm), 0.541 (at 885 nm), 0.533 (at 885 nm), and 0.532 (at 885 nm), respectively, at a moderate correlation level (0.40 ≤ |r| < 0.60). Meanwhile, for the 0.85 mm sieved soil, the maximum absolute correlation coefficients between SOM content and the FD, RMSFD, and RLFD transformed spectral data were 0.658 (at 515 nm), 0.641 (at 441 nm), and 0.651 (at 515 nm), respectively, at a strong correlation level (0.60 ≤ |r| < 0.80), whereas the maximum absolute correlation coefficients between SOM content and the LTFD and ATFD transformed spectral data were 0.543 (at 407 nm) and 0.543 (at 407 nm), respectively, at a moderate correlation level.
Among the five types of mathematically transformed spectral reflectance data, the FD, RMSFD, and RLFD fall into the strong correlation level (0.60 ≤ |r| < 0.80), whereas LTFD and ATFD fall into the moderate correlation level (0.40 ≤ |r| < 0.60). The results indicate that the FD, RMSFD, and RLFD transformations can effectively minimize environmental interference or eliminate baseline drift during spectral data collection, thereby enhancing the spectral features of soil and facilitating the identification of effective wavebands. It can be concluded that mathematical transformation of the original soil spectral reflectance can effectively enhance the correlation between SOM content and soil spectral reflectance, which is consistent with the results of a related study [39]. Thus, applying appropriate mathematical transformations to the original spectrum constitutes an effective strategy for enhancing the accuracy of hyperspectral estimation models. Notably, the FD, RMSFD, and RLFD transformations of the original spectral reflectance of the 0.85 mm sieved soil exhibited a more significant (|r| > 0.6) impact on the spectral characteristics of the soil.
In addition, the feature wavebands were primarily located within the 407-885 nm range of visible light, with a maximum correlation coefficient of −0.658, at the strong correlation level. In the near-infrared waveband, there was no notable correlation between SOM content and spectral reflectance. The near-infrared band is typically highly sensitive to SOM content. Therefore, SOM displays relatively lower reflectivity in the near-infrared waveband due to its absorption of most near-infrared light [40]. Consequently, spectral reflectance in the near-infrared band exhibited a negative correlation with SOM content.

Model Construction and Evaluation
Based on the correlation coefficient between the soil spectral reflectance data and SOM content, wavebands with an absolute correlation coefficient greater than 0.248 were taken as the feature wavebands. Then, taking the selected feature wavebands as the independent variables (x) and the SOM content as the dependent variable (y), the PLSR, SVM, RF, and XGBoost algorithms were employed to construct hyperspectral estimation models of the SOM content of farmland soil in the ILIA. Three evaluation indices, namely R², RMSE, and RPD, were obtained for the constructed hyperspectral estimation models to compare their performance (Table 2).

PLSR Model
Table 2 shows that, for the 0.85 mm sieved soil, the R² values of the constructed PLSR model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.56, 0.61, 0.59, 0.54, and 0.62, respectively, at the "poor prediction" level based on the classification criteria of R². Meanwhile, for the 0.25 mm sieved soil, the R² values of the PLSR model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.41, 0.44, 0.40, 0.44, and 0.48, respectively, falling into the "unsuccessful prediction" category. According to the R² values of the constructed PLSR models for the two soil grain sizes, the stability and estimation accuracy of the PLSR model using the 0.85 mm sieved soil were relatively higher than those using the 0.25 mm sieved soil.
The ranges of RMSE values of the constructed PLSR model across the five types of spectral transformations were 3.40-3.74 for the 0.85 mm sieved soil and 3.90-4.16 for the 0.25 mm sieved soil. The RMSE of the 0.85 mm sieved soil was lower than that of the 0.25 mm sieved soil. This indicates that the estimation quality of the PLSR model using the 0.85 mm grain size soil was higher than that using the 0.25 mm grain size soil. Moreover, the ranges of RPD of the PLSR model across the five types of spectral transformations were 1.44-1.58 for the 0.85 mm sieved soil and 1.29-1.38 for the 0.25 mm sieved soil. Based on the classification criteria of RPD, the estimation ability of the constructed PLSR model for the 0.85 mm sieved soil fell into the "fair model and/or estimation" category, whereas the PLSR model for the 0.25 mm sieved soil belonged to the "poor model and/or estimation" category. The above analysis indicates that the stability, estimation accuracy, estimation quality, and estimation ability of the constructed PLSR model were poor based on the three model evaluation indices. However, the spectral reflectance data of the 0.85 mm sieved soil were better suited to constructing the PLSR model than those of the 0.25 mm sieved soil. Therefore, the RLFD transformed spectral reflectance of the 0.85 mm sieved soil is superior when constructing a hyperspectral estimation model of SOM content using the PLSR model.
The scatter plot of the ground-measured SOM content against that predicted by the selected PLSR method (with the highest R² and RPD, and the lowest RMSE) is shown in Figure 4. In Figure 4, the linear equation was chosen on the basis of model simplicity, statistical foundations, performance metrics, data characteristics, and predictive accuracy. Despite the differences between the predicted and actual values, previous related research indicates that the linear model is still regarded as a practical and effective predictive tool [11,39]. The results of the 0.85 mm and 0.25 mm sieved soils were compared. The RLFD transformed spectral reflectance of the 0.85 mm sieved soil had a relatively higher performance. Therefore, the 0.85 mm-RLFD-PLSR (R² = 0.62, RMSE = 3.40, RPD = 1.58) can be identified as the better PLSR method for estimating the SOM content of farmland soil in the ILIA. However, based on the evaluation indices of the PLSR models, the overall performance of all the constructed PLSR models was not reliable.

SVM Model
As shown in Table 2, for the 0.85 mm sieved soil, the R² values of the constructed SVM model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.70, 0.64, 0.74, 0.63, and 0.69, respectively, falling into the "approximate quantitative prediction" category for FD, RMSFD, and RLFD, and the "poor prediction" category for the other two spectral transformations. Meanwhile, for the 0.25 mm sieved soil, the R² values of the SVM model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.44, 0.73, 0.62, 0.73, and 0.38, respectively, with an "approximate quantitative prediction" for LTFD and ATFD, a "poor prediction" for RMSFD, and an "unsuccessful prediction" for the other two spectral transformations. The ranges of RMSE values of the constructed SVM model across the five types of spectral transformations were 4.17-4.58 for the 0.85 mm sieved soil and 4.83-4.96 for the 0.25 mm sieved soil (Table 2). The differences in the R² and RMSE values between the two soil grain sizes were relatively small.
Moreover, the RPD values of the SVM model across the five types of spectral transformations were all less than 1.40 for both the 0.85 mm and 0.25 mm sieved soils. Based on the classification criteria of RPD, the estimation ability of the constructed SVM model was at the "poor model and/or estimation" level. Relatively speaking, the spectral reflectance data of the 0.85 mm sieved soil were better suited to constructing the SVM model than those of the 0.25 mm sieved soil.
The scatter plot of the ground-measured SOM content against that predicted by the selected SVM method (with the highest R² and RPD, and the lowest RMSE) is shown in Figure 5. The results of the 0.85 mm and 0.25 mm sieved soils were compared. The RMSFD transformed spectral reflectance of the 0.85 mm sieved soil had a relatively higher performance in the estimation of SOM content. Therefore, the 0.85 mm-RMSFD-SVM (R² = 0.74, RMSE = 4.29, RPD = 1.25) can be identified as the better SVM method for estimating the SOM content of farmland soil in the ILIA. Overall, based on the model evaluation indices, the performance of all the constructed SVM models was also not reliable.

RF Model
For the 0.85 mm sieved soil, the R² values of the constructed RF model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.74, 0.64, 0.82, 0.59, and 0.75, respectively, corresponding to a "good prediction" for RMSFD, an "approximate quantitative prediction" for FD and RLFD, and a "poor prediction" for the other two spectral transformations (Table 2). For the 0.25 mm sieved soil, the R² values of the RF model combined with FD, LTFD, RMSFD, ATFD, and RLFD were 0.58, 0.55, 0.60, 0.44, and 0.72, respectively, corresponding to an "approximate quantitative prediction" for RLFD, an "unsuccessful prediction" for ATFD, and a "poor prediction" for the other three spectral transformations. According to the R² values of the RF models for these two grain sizes, the stability and estimation accuracy of the RF model were clearly higher for the 0.85 mm sieved soil than for the 0.25 mm sieved soil.
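The R² categories quoted above can be reproduced with a short sketch. The cut-offs used here (0.50, 0.66, 0.82, and 0.90) are an assumption: they follow a commonly used grading scheme and agree with every label reported in this section, but the thresholds themselves are not restated here. The definition of R² as 1 − SS_res/SS_tot is likewise assumed, and the function names `r_squared` and `classify_r2` are hypothetical.

```python
def r_squared(measured, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(measured) / len(measured)
    ss_res = sum((y - p) ** 2 for y, p in zip(measured, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in measured)
    return 1.0 - ss_res / ss_tot


def classify_r2(r2):
    """Map an R² value to a prediction category (thresholds are assumptions)."""
    if r2 < 0.50:
        return "unsuccessful prediction"
    if r2 < 0.66:
        return "poor prediction"
    if r2 < 0.82:
        return "approximate quantitative prediction"
    if r2 <= 0.90:
        return "good prediction"
    return "excellent prediction"


# R² values reported for the RF model with the 0.85 mm sieved soil (Table 2).
rf_085 = {"FD": 0.74, "LTFD": 0.64, "RMSFD": 0.82, "ATFD": 0.59, "RLFD": 0.75}
for name, value in rf_085.items():
    print(f"{name}: R2 = {value:.2f} -> {classify_r2(value)}")
```

Under these assumed thresholds, the sketch assigns the same categories as the text to the reported values for both grain sizes.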
The RMSE values of the constructed RF model across the five types of spectral transformations ranged from 2.37 to 3.00 for the 0.85 mm sieved soil and from 3.05 to 4.07 for the 0.25 mm sieved soil. The RMSE for the 0.85 mm sieved soil was clearly lower than that for the 0.25 mm sieved soil. This indicates that the estimation quality of the RF model was higher for the 0.85 mm grain size soil than for the 0.25 mm grain size soil.
The RPD values of the RF model combined with FD, LTFD, RMSFD, ATFD, and RLFD for the 0.85 mm sieved soil were 1.81, 1.49, 2.27, 1.58, and 1.79, respectively (Table 2). Based on the classification criteria of RPD, the estimation ability of the RF model corresponded to a "very good quantitative model and/or estimation" for RMSFD, a "good model and/or estimation" for FD, and a "fair model and/or estimation" for the other three spectral transformations. The RPD values of the RF model combined with FD, LTFD, RMSFD, ATFD, and RLFD for the 0.25 mm sieved soil were 1.52, 1.45, 1.49, 1.32, and 1.76, respectively, corresponding to a "fair model and/or estimation" for the FD, LTFD, RMSFD, and RLFD transformations and a "poor model and/or estimation" for the ATFD transformation.
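As a minimal sketch of how RMSE and RPD relate, the snippet below computes both from measured and predicted values and maps RPD to the categories used above. Two points are assumptions rather than statements from this section: RPD is taken as the sample standard deviation of the measured values divided by RMSE, and the category boundaries (1.0, 1.4, 1.8, 2.0, and 2.5) follow a widely used scheme that agrees with all labels reported here. The function names are hypothetical.

```python
import math
import statistics


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    n = len(measured)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(measured, predicted)) / n)


def rpd(measured, predicted):
    """Ratio of the standard deviation of measured values to RMSE (assumed definition)."""
    return statistics.stdev(measured) / rmse(measured, predicted)


def classify_rpd(value):
    """Map an RPD value to a model category (thresholds are assumptions)."""
    if value < 1.0:
        return "very poor model and/or estimation"
    if value < 1.4:
        return "poor model and/or estimation"
    if value < 1.8:
        return "fair model and/or estimation"
    if value < 2.0:
        return "good model and/or estimation"
    if value < 2.5:
        return "very good quantitative model and/or estimation"
    return "excellent model and/or estimation"


# RPD values reported for the RF model with the 0.85 mm sieved soil (Table 2).
rf_085_rpd = {"FD": 1.81, "LTFD": 1.49, "RMSFD": 2.27, "ATFD": 1.58, "RLFD": 1.79}
for name, value in rf_085_rpd.items():
    print(f"{name}: RPD = {value:.2f} -> {classify_rpd(value)}")
```

Because RPD scales inversely with RMSE for a fixed set of measured values, a model with lower RMSE on the same samples necessarily has a higher RPD under this definition.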
Based on the three model evaluation indices, the stability, estimation accuracy, estimation quality, and estimation ability of the RF model were superior to those of PLSR and SVM. As analyzed here, the spectral reflectance data of the 0.85 mm sieved soil were better suited to constructing the RF model than those of the 0.25 mm sieved soil. Therefore, the RMSFD-transformed spectral reflectance of the 0.85 mm sieved soil is the preferred input when constructing a hyperspectral estimation model of SOM content with the RF model.
The scatter plot of the ground-measured SOM content against the SOM content predicted by the selected RF method (with the highest R² and RPD, and lowest RMSE) is shown in Figure 6. Results for the 0.85 mm and 0.25 mm sieved soils were compared. The RMSFD-transformed spectral reflectance of the 0.85 mm sieved soil performed better in estimating SOM content. Overall, the 0.85 mm-RMSFD-RF (R² = 0.82, RMSE = 2.37, RPD = 2.27) can be identified as the best RF method for estimating SOM content of farmland soil in the ILIA, and the RF model is well suited to estimating SOM content.

XGBoost Model
As given in Table 2, the R² values of the XGBoost model combined with the five types of spectral transformation did not exceed 0.39 for either the 0.85 mm or the 0.25 mm sieved soil, indicating an "unsuccessful prediction". The R² values for the 0.85 mm grain size soil were higher than those for the 0.25 mm grain size soil. In addition, the RMSE values of the XGBoost model across the five types of spectral transformations ranged from 3.52 to 4.16 for the 0.85 mm sieved soil and from 3.78 to 5.21 for the 0.25 mm sieved soil.
The RMSE for the 0.85 mm sieved soil was lower than that for the 0.25 mm sieved soil. Thus, the estimation quality of the XGBoost model was relatively better for the 0.85 mm grain size soil than for the 0.25 mm grain size soil. The RPD values of the XGBoost model across the five types of spectral transformations were lower than those of PLSR, SVM, and RF, corresponding to a "poor (or fair) model and/or estimation". This indicates that the stability, estimation accuracy, estimation quality, and estimation ability of the XGBoost model were very poor.
The scatter plot of the ground-measured SOM content against the SOM content predicted by the selected XGBoost method (with the highest R² and RPD, and lowest RMSE) is shown in Figure 7. Results for the 0.85 mm and 0.25 mm sieved soils were also compared. Figure 7 illustrates that the FD-transformed spectral reflectance of the 0.85 mm sieved soil performed relatively better in estimating SOM content. Therefore, the 0.85 mm-FD-XGBoost (R² = 0.39, RMSE = 3.76, RPD = 1.43) can be selected as the better XGBoost method for estimating SOM content of farmland soil in the ILIA. Overall, based on the model evaluation indices, the performance of all the constructed XGBoost models was very poor and not reliable.

Discussion
In this work, the overall performance of the constructed hyperspectral estimation models can be ranked as: RF > SVM > PLSR > XGBoost. The RF model had significantly higher R² and RPD values and relatively lower RMSE values than the PLSR, SVM, and XGBoost models. Therefore, RF was selected as the best model for predicting SOM content of farmland in the ILIA. These results are inconsistent with the findings of some previous studies. For example, Zheng et al. reported that PLSR had the best estimation accuracy for SOM content of coastal soil [41]. Wei et al. constructed a hyperspectral inversion model for SOM content of farmland soils and suggested that the AdaBoost algorithm had the best accuracy compared with Ridge Regression (RR), Kernel RR (KRR), and Bayesian RR (BRR) [17]. Zhang et al. also constructed a SOM estimation model, and their results showed that the estimation accuracy of SVM surpassed that of the back propagation neural network (BPNN) [42]. Recently, Li et al. suggested that the convolutional neural network (CNN) had high accuracy in predicting SOM content [5]. Bai et al. indicated that the PLSR model based on outer-product analysis (OPA) achieved the best estimation accuracy of SOM content [1].
It is worth noting that the RF model had better accuracy than the PLSR model for SOM content in the Ogan-Kuqa River Oasis in the arid zone of northwest China [43]. This result is consistent with our findings. Similarly, the best accuracy for hyperspectral estimation of heavy metals in farmland soils was obtained by using the RF model [44]. However, due to the effects of the regional geographical environment and the physicochemical features of the various soil types in different areas, the optimal hyperspectral model for estimating SOM content varies considerably [9].
From the perspective of inversion accuracy, most preprocessed spectra yield higher modeling accuracy than the original spectra. This is because external interference can introduce noise during spectral acquisition, which hinders the accurate reflection of the spectral characteristics of features. Spectral preprocessing techniques can reduce this noise and highlight spectral feature information (Figure 2). The original spectral reflectance of the 0.25 mm sieved soil was slightly higher than that of the 0.85 mm sieved soil. This result indicates that the smaller the soil grain size, the higher the soil spectral reflectance. The reason is that the voids among smaller soil grains are smaller than those among larger grains, which enhances the spectral reflectance of soil [2]. Moreover, the lower spectral reflectance of soil with the 0.85 mm grain size may be attributed to light scattering and changes in optical path length [20]. Soil sieved to 0.25 mm has a lower porosity, which reduces the absorption and scattering of light and thereby increases the reflectance. In contrast, soil sieved to 0.85 mm has a higher porosity, and the path of light within the soil is longer, which increases the absorption and scattering of light and leads to a decrease in reflectance. Therefore, when analyzing the spectral reflectance of soil, it is very important to consider the influence of soil grain size. Studying the spectral characteristics of soils with different grain sizes can provide a scientific basis for soil classification, quality assessment, and land resource management.
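To illustrate the kind of preprocessing discussed here, the following is a minimal sketch of a first-order differentiation (FD) of a reflectance spectrum. The central-difference formula is an assumption, since this section does not state the exact formula used in the study, and the function name and sample values are hypothetical.

```python
def first_derivative(wavelengths, reflectance):
    """Central-difference first derivative at the interior wavebands.

    Assumed formula: FD(i) = (R[i+1] - R[i-1]) / (wl[i+1] - wl[i-1]).
    Returns one value per interior waveband (the two end points are dropped).
    """
    if len(wavelengths) != len(reflectance):
        raise ValueError("wavelengths and reflectance must have the same length")
    return [
        (reflectance[i + 1] - reflectance[i - 1])
        / (wavelengths[i + 1] - wavelengths[i - 1])
        for i in range(1, len(wavelengths) - 1)
    ]


# Hypothetical reflectance values at four consecutive wavebands (nm).
wl = [350.0, 351.0, 352.0, 353.0]
refl = [0.10, 0.12, 0.16, 0.22]
fd = first_derivative(wl, refl)
print(fd)
```

Differentiation of this kind emphasizes changes in slope along the spectrum, which is one reason derivative-based transformations can highlight features that are weak in the original reflectance curve; it also amplifies noise, which is why smoothing is usually applied first.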
At present, it is difficult to fully extract the effective feature wavebands from unsieved soil samples, which limits the estimation ability of hyperspectral models. However, by selecting feature wavebands obtained from soils with an appropriate grain size, deeper feature waveband extraction can be achieved and the constructed hyperspectral estimation model generalizes better, which is consistent with our findings [20]. Based on the R², RPD, and RMSE values of the hyperspectral estimation models constructed from soils with different grain sizes, significantly higher R² and RPD values and lower RMSE values were observed for the 0.85 mm sieved soil. The correlations between soil spectral reflectance and SOM content were effectively improved by using the spectrum of the 0.85 mm sieved soil samples, and the use of the feature wavebands of the 0.85 mm sieved soil significantly improved the stability and prediction ability of the hyperspectral estimation model constructed in this study. These results verify that soil grain size affects the stability, estimation accuracy, estimation quality, and estimation ability of hyperspectral estimation of SOM content, and that the 0.85 mm sieved soil is more suitable for spectral measurement and subsequent model construction. Where a larger grain size results in a higher R² value, the relationship with the physical and chemical properties of the soil may be as follows: (1) Pore structure: Larger grain sizes may change the size and distribution of soil pores. A more uniform or suitable pore structure may make the related physical processes more regular, thereby improving the fit of the model to the data, i.e., a higher R² value. (2) Particle arrangement: When grain sizes are larger, the arrangement of particles may be more orderly, which affects the soil's permeability, water retention, and other physical properties. This can make the relationship between these properties and other factors clearer, thus increasing the R² value. (3) Nutrient adsorption and release: Larger grains may affect the soil's ability to adsorb and release nutrients. More stable or regular nutrient dynamics may allow the model to better explain the data, leading to an increase in the R² value. However, to accurately determine the relationship between grain size and R², as well as the specific correlation with the physical and chemical properties of the soil, further experimental research and detailed data analysis are required [20,22,23,45].
Finally, the RF model based on the RMSFD-transformed spectral reflectance of the 0.85 mm sieved soil (0.85 mm-RMSFD-RF) can realize the effective fusion of spectral features, which compensates for the limitation of single data features and further improves the stability and estimation ability of the constructed model. Therefore, the 0.85 mm-RMSFD-RF method (R² = 0.82, RMSE = 2.37, RPD = 2.27) is the best hyperspectral estimation method for SOM content of farmland soil in the ILIA.
Based on the measured and predicted SOM content, the actual distribution and the estimated distribution patterns of SOM content from the selected PLSR, SVM, RF, and XGBoost methods were mapped using the Ordinary Kriging (OK) interpolation method and geostatistical analysis (Figure 8).
The spatial distribution pattern of SOM content estimated via the 0.85 mm-RMSFD-RF method (Figure 8d) was most similar to the actual distribution of SOM content (Figure 8a), with higher SOM content in the eastern and northern parts and lower SOM content in the central part of the ILIA. However, the spatial distribution patterns of SOM content estimated by the 0.85 mm-RLFD-PLSR (Figure 8b), the 0.85 mm-RMSFD-SVM (Figure 8c), and the 0.85 mm-FD-XGBoost (Figure 8e) methods differed significantly from the actual distribution. This result further supports the 0.85 mm-RMSFD-RF as the best hyperspectral estimation method for SOM content of farmland soil in the ILIA. However, this work is a regional study, and the applicability of hyperspectral estimation models varies across geographic regions due to differences in soil types and in the physical and chemical properties of soil [46]. Therefore, future studies are needed to explore whether hyperspectral estimation models based on the spectrum of the 0.85 mm sieved soil also achieve the best overall estimation accuracy and stability in other regions.

Conclusions
This study investigated the effects of soil grain size on the accuracy of hyperspectral estimation of SOM content of farmland soil in arid zones. The following conclusions were drawn:
(1) The smaller the soil grain size, the higher the spectral reflectance. In the original spectral reflectance curve, the wavebands sensitive to SOM were primarily found within the 350-550 nm range (|r| > 0.5, p < 0.01), exhibiting a negative correlation with SOM content.
(2) The spectral reflectance of the 0.85 mm sieved soil demonstrated relatively higher correlation coefficients with SOM content than that of the 0.25 mm sieved soil.
(3) The mathematical transformation of the original spectral reflectance of soil can effectively enhance the spectral characteristics related to SOM content, and soil grain size clearly affects the accuracy of the hyperspectral estimation model of SOM content.
(4) The overall estimation accuracy and stability of the hyperspectral estimation models constructed in this study can be ranked as: RF > SVM > PLSR > XGBoost. The RF model had significantly higher R² and RPD values and relatively lower RMSE values than the PLSR, SVM, and XGBoost models. The 0.85 mm-RMSFD-RF method (R² = 0.82, RMSE = 2.37, RPD = 2.27) was selected as the best model for estimating SOM content of farmland soil in the ILIA.
Findings of this work offer a technical reference for the hyperspectral estimation of the SOM content of farmland soil in arid zones.However, further investigation should be considered in future studies.

Figure 1. Map of the study area. (a) Location of the ILIA; (b) Satellite image of the ILIA; (c) Sample sites.

Figure 2. The original soil spectral reflectance curves processed through S-G smoothing. Each line represents the spectrum of one soil sample (n = 106). (a) 0.85 mm sieved soil; (b) 0.25 mm sieved soil.

Figure 8. Spatial distribution map of SOM content based on the measured and predicted values.

Table 1. Descriptive statistics of the soil properties.

Table 2. Model evaluation indices of the hyperspectral estimation models.