Predicting soil carbon in granitic soils using Fourier-transform mid-infrared (FT-MIR) spectroscopy: the value of database disaggregation

Soil carbon (C) is an important component in quality assessments and efficient models are required to estimate C rapidly. Accurate C assessments are valuable in monitoring land-use changes. Fourier-transform mid-infrared (FT-MIR) spectroscopy has proved to be a powerful tool for assessing C. The potential of FT-MIR spectroscopy to estimate C was evaluated using the following techniques: (1) three algorithms [partial least squares (PLS)], principal component regression (PCR), and classical least squares (CLS); and (2) disaggregating the dataset into subgroups based on soil depth and texture. The C contents of samples collected in the Johannesburg Granite Dome were determined by dry combustion for comparison. The soils ranged considerably in C (0.123–2.650%), clay (2.80–41.20%), and silt content (8.56–23.75%). Using standard normal variant (SNV), Savitzky-Golay smoothing and PLS, the best-performing model was the horizons subgroup which provided values of root mean square error for prediction (RMSEP) between 0.079 and 0.095%, root mean square error for calibration (RMSEC) = 0.041 − 0.092, r 2 pre = 0.6174–0.8459, r 2 cal = 0.6599–0.9778, residual prediction variation (RPD) = 2.404–2.753, and ratio of performance to interquartile range (RPIQ) = 2.667–3.454. The pronounced accuracy of FT-MIR spectroscopy coupled with PLS, pre-processing techniques, and textural subgroups confirms the potential of infrared spectroscopy as an efficient tool for estimating C content. Future studies should investigate the combined effects of FT-MIR spectroscopy and subgroups grouped according to soil types and land-uses when predicting C.


Introduction
Soil is a crucial constituent in the carbon (C) biogeochemical cycle (Smith et al. 2015). The measurement of soil C concentrations is significant because soil organic carbon (SOC) plays an essential role in maintaining soil quality, the long-term sustainability of agricultural lands, and mitigating climate change (Fageria 2012;Navarro-Pedreño et al. 2021;Shen et al. 2021).
Conventional soil assessments, including field evaluations and laboratory analysis, are often time-consuming, expensive and require different procedures for measuring each soil property (Kwiatkowska-Malina 2018; Masi et al. 2020). Efficient and cost-effective methodologies are needed in soil surveys to meet the demand for quantitative data in digital soil mapping and the identification of effective management practices (Jia et al. 2017;Masi et al. 2020).
Diffuse reflectance spectroscopy has been highly investigated during the last two decades as an alternative to conventional methods of soil analysis (Viscarra Rossel et al. 2006;Waruru et al. 2014). Studies argue that infrared spectroscopy is less expensive, rapid in the case of large sample sizes, non-destructive, and requires a single spectrum for the simultaneous estimation of multiple physical and bio-chemical soil properties (Waruru et al. 2014;Pei et al. 2019). The spectral characteristics detected in the Fourier-transform mid-infrared (FT-MIR) (400-4 000 cm −1 ) region are used to describe soil properties such as total nitrogen (N) and soil C fractions (Xie et al. 2011;dos Santos et al. 2020) as well as cation exchange capacity (CEC) (Seybold et al. 2019). Gates (2018) found that the ability of the FT-MIR to estimate SOC, N, and CEC was reasonably accurate, with the coefficients of determination (r 2 ) measuring greater than 0.8. Xie et al. (2011) estimated SOC at various depths using the FT-MIR and found that more accurate predictions were achieved in the 5-30 cm depth range than predictions between 0 and 5 cm or deeper than 30 cm. Additionally, research has found the FT-MIR to provide higher-performance models than the more common visible near-infrared (vis-NIR) spectroscopy (Bellon-Maurel and McBratney 2011;Viscarra Rossel and Webster 2012;Gates 2018). Viscarra Rossel and Webster (2012) found that C:N ratio, available phosphorus, and exchangeable acidity were poorly estimated using vis-NIR spectroscopy. This result is due to essential molecular vibrations of soil constituents detected in the FT-MIR region. In contrast, only their overtones and combinations are found in the vis-NIR region (Fang et al. 2018).
Numerous multivariate methods, such as partial least squares (PLS) and principal component regression (PCR), are widely used in soil surveys using FT-MIR spectroscopy. In contrast, the use of classical least squares (CLS) is less common. Thermo Scientific Inc. recommends using CLS when each constituent that must be quantified produces a peak or combination of peaks in the spectrum that overlap significantly. Partial least square regression is the most common method used in quantitative estimations and has resulted in accurate predictions (Janik et al. 2009;Kawamura et al. 2017). Mouazen et al. (2010) compared the performance of the PLS, PCR, and backpropagation neural network (BPNN) methods to accurately predict several soil properties. They concluded that both BPNN and PLS methods provided better predictions than PCR. This outcome may be due to the PLS methods taking the correlation between dependent and independent variables into consideration, whereas the PCR method does not (Kabir et al. 2017).
Pre-processing techniques such as smoothing have been used to reduce instrumental noise and extract helpful information for calibration (Heil and Schmidhalter 2021). An example is standard normal variate (SNV), which has been proposed for scatter correction (Barnes et al. 1989). However, there are contradicting reports on whether preprocessing techniques improve or worsen estimation accuracies, or have no effect at all. Some studies reported improved predictions using smoothing (Vašát et al. 2017;Xu et al. 2021), while others reported that no single preprocessing technique outperformed another (Viscarra Rossel and Behrens 2010; Knox et al. 2015). Gates (2018) reported that the combination of SNV and smoothing did not improve the model performance when predicting SOC. These conflicting results indicate that no single pre-processing technique works well for soils of all regions.
There is limited research on the use of FT-MIR spectroscopy to evaluate the soils of South Africa. The present study aimed to assess the potential of FT-MIR, combined with the three chemometric methods of PLS, PCR, and, although rarely used in soil assessments, CLS, in predicting soil C contents in granitic soils. The proportion of clay in soils plays an important role in determining how much C soils can store (Sausen et al. 2014;Zhong et al. 2018;Churchman et al. 2020). For this reason, the soil samples were divided into six calibration subgroups determined according to texture to assess if this would improve the predictive ability of the techniques. In this study, we hypothesise that the FT-MIR, coupled with the PLS method, can estimate soil C contents accurately and that the determination of textural subgroups in the calibration dataset further improves the usefulness of FT-MIR to enable accurate soil C estimation. This information will assist future studies in improving calibration datasets and, ultimately, digital mapping data.

Study area
The study area falls within the Johannesburg Granite Dome area, partly located in the Upper Crocodile catchment in the Gauteng Province of South Africa. The site covers approximately 768 km 2 and is bound by latitudes 25°49′ 48″-26°12′24″ S and longitudes 27°45′36"-28°14′24″ E (Figure 1). The catchment has a semi-arid climate with an average annual rainfall of 682 mm and a mean annual evaporation of approximately 1 700 mm (Zondi 2017). The soil groups in the area were classified according to the World Reference Base (WRB 2015). They were identified as Plinthosols, Acrisol, Arenosols, Leptosol, Gleysols, Planosols, Luvisols, Lixisols, and Stagnosols. The primary land-uses are cultivated land, urban landscapes, wetlands, grasslands, and open land.
A total of nine hillslopes were randomly designated within the study area as part of a previous research assignment to determine the hydropedological behaviour of the area (Tinnefeld et al. 2017). Soil samples were collected with the use of an auger. The diagnostic soil horizons (A, B, and C) were sampled at each site, irrespective of soil depth. The soil samples were placed into tightly sealed plastic containers and taken to the laboratory for chemical analyses and spectral measurement.

Chemical soil analysis
All samples were taken to the laboratory, air-dried, ground, and passed through a 2 mm sieve for further analysis. Soil samples were stored in plastic containers, allowing minimal soil reactions to occur.

Soil carbon analysis
A total of 215 soil samples were collected from the diagnostic horizons to analyse soil C content by dry combustion adapted from Nelson and Sommers (1996) with a TruSpec Leco CN analyser.

Particle size analysis
The hydrometer method, which calculates the proportions of soil particles based on their settling rates in an aqueous solution, was used to determine particle size distribution in the soil samples (Gavlak et al. 2005). The three identified three size classes were clay (< 0.002 mm), silt (0.002-0.05 mm), and sand (0.05-2 mm). Equations 1-3 were used to calculate the different soil fractions as follows: Clay (%) = (corrected hydrometer reading at 6 hours 52 minutes ) soil mass × 100 (1) Silt (%) = (corrected hydrometer reading at 40 seconds) soil mass × 100 (2) Fourier-transform mid-infrared analysis Mid-infrared spectra were collected from ball-milled soil samples (< 0.3 mm) using a Thermo Scientific Nicolet iS50 FT-IR spectrometer fitted with a Smart iTX accessory with Diamond Crystal. The crystal was cleaned with methanol following each spectrum collection. The instrument measured absorbance over the range of 525-4 000 cm −1 with a sampling interval of 2 cm −1 and a spectral resolution of 4 cm −1 . Each spectrum was the average result of 16 scans (no replicates). A new background spectrum was collected every 60 minutes. The spectral data were calibrated using the TQ-Analyst 9 software package (Thermo Fisher Scientific Inc., Madison, WI, USA, 2017). The selected quantitative analysis methods were PLS, PCR, and CLS. The software recommends a ratio of 3:1 for calibration and validation data points to eliminate any bias; however, the program automatically determined the calibration and validation data points. The split between the calibration and validation datasets did not always give a 3:1 ratio, but it was considered to be close enough as shown in Tables 2 and 4. The dataset was log-transformed and pre-processed using SNV and the Savitzky Golay smoothing filter algorithm with a third-order polynomial. The six subgroups (all 215 samples; the three horizons (A, B, and C); clay content; clay and silt composition; a combination of the horizons and clay content; and a combination of the horizons, silt and clay content) were determined after all the soil samples had been scanned.

Statistical analysis
Outlier detection Prior to using pre-processing techniques, outliers (actual C content) were detected by means of Grubbs' two-tailed Test using the XLSTAT version 2021.1 software package (Addinsoft Inc. Paris, France), a statistical add-in for Microsoft Excel, and excluded from the dataset. Outliers increase dataset variation and significantly decrease the quality and performance of calibration models. The International Organization for Standardization (ISO 1994) recommended the Grubbs' Test for detecting outliers. The Grubbs' test statistic for a two-tailed test is calculated using Equation 4: Where Y = the sample mean, and s = sample standard deviation. Grubbs' Test tabulates the G critical value; however, Equation 5 has been formulated to calculate this and the level of significance.
Where n = number of observations; t α/ (2n), n−2 = the upper critical value of a t-distribution; and n−2 = degrees of freedom. The H 0 hypothesis is rejected when the G test > G critical , and the H a hypothesis is excluded when the G test < G critical .
TQ Analyst statistics (a) Coefficient of determination. The coefficient of determination (r 2 ) indicates how many data points are within the line generated by the regression equation. The higher the coefficient, the higher the number of plotted points that pass through the regression line and, therefore, it is an indicator of a better fit for observations. Ideal prediction models have an r 2 greater than 0.75 (Agussabti Whereŷ i = the predicted value; y i = the average of the observed values; y i = the observed values; and n = the number of samples. (b) Root mean square error. Root Mean Square Error (RMSE) measures the error of a model in predicting quantitative data and compares the prediction errors of different models (Neill and Hashemi 2018). In FT-MIR spectroscopy studies, RMSE can be divided into root mean square error for calibration (RMSEC) and root mean square error for prediction (RMSEP) (Zhao et al. 2015). The lower the RMSEC and RMSEP values, the higher the model's accuracy and predictive ability. Ideal prediction models have an RMSE lower than their standard deviation (Agussabti et al. 2020). The following equations are used to calculate RMSEC (Equation 7) and RMSEP (Equation 8). The equations are similar, with the only difference being the number of observations (n) in either the calibration or validation set.
Where n = number of observations in the calibration set; Σ =summation; f = the predicted values; o = the observed values; and (Z fi -Z oi ) 2 = the squared differences.
Where n = the number of observations in the validation set.
(c) Residual Prediction Deviation. Residual prediction deviation (RPD) is a measure of the differences in the standard deviation of the observed data against the predicted data. It measures how a linear regression model fits the data. Ideal prediction models have an RPD index greater than 2.0 (Agussabti et al. 2020). Additionally, models with an RPD index between 1.4 and 2.0 are considered reasonably accurate, and those lower than 1.4 are poor-performing. However, the criteria used to define the good-and poor-performing models using RPD is subjective as there is no statistical basis for determining the classification thresholds (Minasny and McBratney 2013).
The following Equation 9 is used to calculate RPD: Where σ = standard deviation. Standard deviation is a measure of the variation in a dataset and is calculated using Equation 10: Where ∑ = sum of X = observed value μ = dataset mean n = number of values in the dataset.
(d) Ratio of Performance to Interquartile Range. The ratio of performance to interquartile range (RPIQ) considers both the variability of measured data and the prediction error (RMSEP). It is calculated as in Equation 11: Where IQR = interquartile range (Q3-Q1)

Results
Six calibration subgroups were determined for all analyses according to varying soil depth (horizons A, B, and C), the composition of clay and silt in the soils, and their interaction. The six subgroups included: (1) all 215 samples; (2) the three horizons (A, B, and C); (3) clay content; (4) clay and silt composition; (5) a combination of the horizons and clay content; and lastly (6) a combination of the horizons, silt, and clay content. Although the soil textural analysis led to a division of three grades (clay, silt, and sand), sand was not included in the subgroups because it cannot store large amounts of soil C.

Soil particle size and carbon analysis
The results of the soil particle size analysis are summarised in Table 1. The predominant soil texture was sand, with an average of 72.467 ± 8.809%. The mean compositions of clay and silt were similar in the "all samples" subgroup, measuring 13.252 ± 6.425% and 14.282 ± 6.038%, respectively. For this reason, a third column was added to assess if their interaction would affect the prediction of soil C. Most of the samples (102) had clay contents ranging from 10 to 20% and a silt and clay composition ranging from 40 to 60% (89). The soil C content of all subgroups was also determined. Average C contents ranged from 0.245 ± 0.112% in the Horizon C Silt and Clay (> 30%) subgroup to 1.109 ± 0.552% for the Horizon A Clay (> 10%) subgroup. In addition, these subgroups recorded relatively high clay contents of 19.948 ± 5.634% and 13.901 ± 3.442%, respectively. Table 2 details the calibration results using PLS, PCR, and CLS with the six subgroups to accurately predict soil C. The results show that the best-calibrated models (horizon subgroup) were achieved using the PLS method with RMSEC = 0.041-0.092, r 2 cal = 0.6599-0.9778, and RPD = 2.404-2.753. However, the PCR method produced the best-performing models in the Silt and Clay and Horizons and Clay subgroups, with the highest mean r 2 values of 0.7421 and 0.8743% and RMSEC values of 0.136 and 0.064%, respectively. These results indicate the importance of silt and clay in the calibration process. The PLS and PCR methods outperformed the CLS method (r 2 cal = 0.2489-0.6418 and RMSEC = 0.094-1.194) in all subgroups. The results showed a great improvement with the introduction of subgroups. Although the number of observations in each subgroup did not necessarily appear to influence the results, more of the better-performing models were produced from subgroups with n < 50.

FT-IR analysis
The PLS method outperformed the PCR and CLS methods in all but two subgroups (Silt and Clay and Horizons and Clay) as shown in Table 3, where the calibration dataset was assessed. However, the PCR method produced the second-best model (Horizons and Clay subgroup). The lowest mean RMSEC values for all three methods were derived with the Horizons and Clay subgroup (0.064%), while the highest value was found when all the samples were analysed together i.e., 'All samples' (0.207%). This further highlights the importance of sub-dividing the samples into smaller subgroups with similar properties instead of analysing all the samples together.
The predictive ability of the PLS, PCR, and CLS methods with the six subgroups to accurately estimate soil C is detailed in Table 4. The number of samples used for validation is less than those used for calibration to eliminate the possibility of high RMSE values ( Table 2). The results show that the best-performing models (horizon subgroup) were achieved using the PLS method with RMSEP ranging between 0.079 and 0.095%, r 2 pre = 0.6174-0.8459, and RPD = 2.404-2.753.
The PLS method outperformed the PCR and CLS methods in all but two subgroups (Silt and Clay and Horizons and Clay). The model performance index (RPD) for the PLS, PCR and CLS methods ranged from 0.771-3.210, 0.737-3.561, and 0.690-2.262, respectively. Performance indices measure how well the model can predict specific properties. Models with RPD values greater than 2.0 are considered the best performing, whereas those with RPD values between 1.4 and 2.0 are considered reasonably accurate. The RPD is indirectly proportional to the RMSEP value. The lower the RMSEP, the higher the RPD, and the more precise the model. The subgroup with the best-performing model produced by PLS was Horizon C Clay (> 10%). In contrast, PCR and CLS produced the best-performing models for Horizon B (> 10%) and Horizon A, respectively. However, the PCR method was the best method of the three, recording an RPD of 3.561. Additionally, the r 2 of the PCR method was significantly higher at 0.8361%, whereas the PLS and CLS methods showed lower r 2 values of 0.7215 and 0.5803, respectively.
Although the CLS method performed the worst of the three models, it produced fairly accurate models in the following subgroups: Clay (> 20%), Silt and Clay (0-20%), Horizon A clay (> 10%), and Horizon B Silt and Clay (> 20%), with PRD values of 1.440, 1.455, 1.643, and 1.556, respectively. Additionally, the CLS methods produced a model which was better than the PLS and PCR methods in the horizon B clay (< 10%) subgroup, although the model has poor performance (RPD = 1.194). The PLS and PCR methods had RPD values of 0.918 and 0.980, respectively. The RPD index showed a great improvement as the dataset was divided according to particle size analysis. The Horizon subgroup was the best-performing subgroup coupled with the PLS algorithm (Table 4). These results highlight the importance of dividing databases according to particle sizes and including silt if the amounts present are significantly higher than those of clay (i.e. Silt and Clay (0-20%)). The PLS method outperformed both the PCR and CLS methods in four subgroups. Table 5 also highlights the collaborative importance of r 2 , RMSEP, RPD, and RPIQ values in the overall performance of the model. The Clay and Horizons subgroups have similar r 2 values (0.7397 and 0.7669%). However, the latter subgroup had a lower RMSEP and higher P.I., RPD, and RPIQ indices due to differences in standard deviation and interquartile range values. Table 6 summarises the soil C contents obtained using the CN Leco analyser and the predicted values obtained with the three quantitative methods. The table shows how the CLS method tends to over-predict soil C contents, whereas the PLS and PCR methods slightly under-predict soil C contents. The PCR method performed well in the Clay subgroups (Horizons and Clay and Silt and Clay), followed by the PLS method. However, overall, the PLS method was able to predict soil C contents most accurately.

Soil type and environment conditions
Greater soil variability is known to lead to higher prediction errors (RMSEP) (Reeves and Smith 2009). The predictive models in this study were developed from a field-scale database (n = 215) where the climate, soil texture, and vegetation were similar, thus limiting soil variability. Kuang and Mouazen (2012) investigated the influence of the   19); however, they also achieved higher r 2 (0.99) and RPD values (9.8). Differences in land use and management practices can lead to significant differences in the range of soil C contents measured (outliers). Database outliers can negatively impact the performance of calibration models (Wada 2020) and should ideally be eliminated from datasets. This was proved in the present study, where the model performance indices (r 2 , RMSE, RPD, and RPIQ) of the calibration models improved after detecting and removing outliers. Previous research has indicated that the presence of inorganic C interferes with predicting SOC (Reeves et al. 2005). Future studies should incorporate dividing datasets further into similar land management practices.

Calibration models Soil depth and clay content
The dataset produced better-performing models with an increase in soil depth from horizons A to B. There was a decrease in model performance from horizons B to C. The C horizon has depths ranging from 150 mm and 1 800 mm. This supports Xie et al. (2011), who found less accurate SOC estimations in soils deeper than 300 mm and concluded that the best predictions were made with samples collected between 50 and 300 mm. They also found less precise predictions in the topsoil (less than 50 mm). In this study, samples collected between 50 and 300 mm were mainly incorporated in the A and B horizons, with both horizons outperforming models calibrated using Horizon C soil samples.
Similarly, an increase in clay content from 0-10% to 10-20% led to improved estimates of C contents, while samples with a clay content greater than 20% led to a decline in model performance (RPD and RPIQ). However, with an increase in depth (B and C horizons), soils with a clay content greater than 10% provided more accurate soil C estimations than those with less than 10%. Similarly, Ramirez-Lopez et al. (2019) found that soils with a higher clay content produced models with the lowest RMSE and a reasonably high r 2 (0.87). When clay is coupled with depth and silt content in the A horizon, soils with a clay and silt content greater than 20% generate better-performing models. However, in both the B and C horizons, model accuracy decreased with an increase in clay and silt content.

Chemometric methods
The predictive ability of the FT-MIR spectrometry combined with the PLS method was higher than that of the FT-MIR coupled with either the PCR or CLS methods. This is consistent with current literature (Seybold et al. 2019;dos Santos 2020) but contradictory to a study by Morellos et al. (2016), which concluded that the PCR method performs slightly better than the PLS method for the prediction of SOC. The calibration group must be greater than the validation group to minimise complexities generated by wide soil C content ranges. This will lower the risk of high RMSE values for the models (Seybold et al. 2019). The PLS method is known to perform better than the PCR method (Ramírez et al. 2021). This is because PLS regression considers the variability of the dependent variables, whereas PCR does not (Kabir et al. 2017). Limited information is available on the use of the CLS method in soil surveys (Zhou and Cao 2013).
Although the CLS method seems to over-predict soil C contents, it did predict soil C reasonably well in the subgroups with clay. Furthermore, it is widely accepted that pre-processing efforts can improve the predictive ability of different models by minimising any spectral scatter effects and extract helpful information for calibration (Du et al. 2011;Heil and Schmidhalter 2021). Studies have reported how the Savitzky-Golay smoothing filter increases the precision of calibration models (Vašát et al. 2017;Xu et al. 2021), which was evident in this study. Standard normal variate (SNV) is a popular pre-processing technique. This study saw an improvement in model performance with SNV and Savitzky-Golay smoothing in contrast to Gates (2018), who found that SNV did not improve model performance in predicting SOC. Seybold et al. (2019) found lower RMSEP values for SOC than for total C because the SOC dataset had a smaller sample size and standard deviations. They further found that the PLS method over-predicted SOC contents at higher SOC values. Our study found that all three methods over-predicted soil C at higher values (> 2%), with the PLS method providing the highest overpredictions. It is suggested that further research be conducted in the City of Johannesburg area using a larger sample size to improve the reliability of the models.

Conclusion
This study aimed to assess the ability of FT-MIR spectroscopy to accurately predict C contents in granitic soils collected in the Johannesburg Granite Dome area. The study further evaluated the impact of disaggregating datasets into textural subgroups, and which chemometric method would improve predictions of soil C between the PLS, PCR, and CLS methods. The findings demonstrated that soil C contents estimated by FT-MIR spectroscopy were relatively well correlated with those determined by dry combustion (CN Leco Analyser). Furthermore, the PLS method provided better-calibrated models; however, the PCR method outperformed both the PLS and CLS methods in predicting soil C in the Horizons and Clay and Silt and Clay samples. The detection of outliers resulted in lower standard deviations and, together with SNV and the Savitzky-Golay smoothing filter, improved the overall performance of the calibration models. The database showed improved calibration models as there was a decrease in soil depth. The importance of including silt contents which are higher than the clay contents in the subgroups was also highlighted. Therefore, FT-MIR spectroscopy coupled with the PLS method and soil textural subgroups may be an alternative to conventional methodologies because it is cost-effective, rapid, and produces reliable results. Future studies can further improve this study by disaggregating datasets into similar land management practices.

Data availability
The data are not publicly available as they are still being analysed and interpreted for several other reports and publications.