Combining ancillary soil data with VisNIR spectra to improve predictions of organic and inorganic carbon content of soils

Method details
In soils that do not contain soil inorganic carbon (SIC), traditional laboratory methods of soil organic carbon (SOC) content determination are generally simple, and usually analogous to those for soil total carbon (STC) content determination. However, when carbonates are present in soil samples, such as in arid and semi-arid regions, the determination of SOC content becomes more challenging. As the sum of SOC and SIC content is equal to the STC content of a sample, at least two of the three components must be determined. However, due to the costly nature of measuring SOC and SIC content by traditional laboratory methods, this is often unfeasible.
The use of spectroscopic tools, such as visible near-infrared (VisNIR) spectroscopy, to predict the different components of soil carbon (SOC and SIC) has been extensively studied in recent years due to the cost and labour savings that come with these approaches [1]. Both SIC and SOC content are generally well predicted by these approaches [2], as there are specific VisNIR wavelengths that are heavily associated with these different components of soil carbon [3]. Most studies solely use VisNIR spectra as input variables for predicting SOC and SIC content [2]; however, there are opportunities to combine VisNIR spectra with other useful and readily available soil information to further improve predictions. For example, soil pH data is commonly available as it can be easily determined by traditional laboratory methods, and including this information as a predictor variable could be advantageous due to the relationship that pH has with both SOC and SIC content. In this study, SOC and SIC content of samples in a semi-arid region in Australia are predicted using Cubist models, and the impact of combining VisNIR spectral data with soil pH and STC data as predictor variables is analysed.

Study area and soil dataset
This study uses soil data collected from a semi-arid area surrounding the township of Hillston, in south-west NSW. The study area is ~2500 km² in size, and primarily consists of largely flat alluvial floodplains, with some rocky outcrops at higher elevation. The soils on the floodplain are mainly grey, brown and red Vertisols (IUSS Working Group WRB 2014), with sandier soils of largely aeolian origin at the higher points. Rainfall at Hillston is low, with a mean annual rainfall of 372 mm. The study area is subject to hot summers and cool winters, with mean minimum temperatures of 17.7 °C in summer and 4.5 °C in winter, and mean maximum temperatures of 31.2 °C in summer and 15.9 °C in winter [4].
Soil samples from 80 locations from a soil survey conducted in 2002 are used, as well as 140 soil cores from a soil survey from 2015 [5]. Samples were extracted from soils under a variety of land uses, including irrigated and dryland cropping, irrigated perennial horticulture, and rangeland grazing. Many of the same sites were sampled in both surveys (n = 70), as the locations were georeferenced. The subsampling intervals in the surveys differed, with the 2002 survey sampled at 0-0.2 and 0.3-0.4 m, and the 2015 survey sampled at 0-0.1, 0.1-0.3 and 0.3-0.5 m. In total, the soil dataset consists of 399 soil samples.

Traditional laboratory methods
All of the soil samples were air-dried and then ground to pass through a 2 mm sieve. Prior to laboratory analysis, all samples were tested for the presence of soil inorganic carbon (SIC). A ~1 g subsample of ground soil was placed on a ceramic plate and a few drops of 1 M hydrochloric acid (HCl) were placed directly onto the sample. Any sample that showed an effervescence reaction was considered to contain calcium carbonate, the most prominent form of SIC in these soils. An additional subsample (~10 g) was then taken and finely ground (<53 µm) using a Fritsch Mortar Grinder Pulverisette 2 (Fritsch, Germany) for 4 min at 50-60 Hz frequency. Soil total carbon (STC) content was determined by the combustion method with the Leco® CHN analyser for 2002 samples, and the Elementar vario MAX CNS for 2015 samples. The two analysers are very similar in their analytical approach, and both use the combustion technique. Soil organic carbon content for 2002 samples was determined by treating samples with 2 M HCl to remove inorganic carbon, and then analysing with the Leco® CHN analyser [6]. For 2015 samples, SOC content was determined by the Walkley-Black method, a wet oxidation technique that uses chromic acid [7]. The Walkley-Black method was used as it is one of the more rapid traditional laboratory methods of measuring SOC content. SIC content was estimated as the difference between STC and SOC contents. For 2002 samples, the SOC content was determined immediately; however, STC was determined from archived samples 13 years after sampling. While there may be potential drawbacks of analysing soil samples that have been archived for many years, most studies in the literature that have analysed the impact of archiving on soil carbon levels have found that this is negligible (e.g. [8]).
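The estimation of SIC content by difference can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and reporting slightly negative differences (which can only arise from measurement error) as zero is an assumption, not a step reported in this study.

```python
def sic_by_difference(stc_percent, soc_percent):
    """Estimate SIC content (%) as STC content minus SOC content.

    Assumption: a negative difference can only come from measurement
    error, so it is reported as zero.
    """
    return max(stc_percent - soc_percent, 0.0)


# Hypothetical values, not taken from the study dataset.
print(sic_by_difference(1.20, 0.85))  # sample with carbonate present
print(sic_by_difference(0.60, 0.62))  # difference within error, reported as 0.0
```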

Spectral predictions
VisNIR spectral acquisition and processing
Archived soil samples from both the 2002 and 2015 soil surveys (n = 399), dried and ground, were scanned in the visible near-infrared (VisNIR) range with an Agrispec portable spectrophotometer with a contact probe attachment (Analytical Spectral Devices, Boulder, Colorado). To improve the signal-to-noise ratio of the spectra, three scans of each sample were performed, from which an averaged reflectance spectrum was derived. The instrument was calibrated with a Spectralon white tile and re-calibrated after every 15 scans, or five samples.
Pre-processing of the VisNIR spectra was performed, which included splicing the discontinuities at VisNIR detector junctions (1000 and 1800 nm), and then converting reflectance to absorbance.
Smoothing of the spectra was then performed using a Savitzky-Golay filter (Savitzky and Golay, 1964). Wavelengths outside the 500-2450 nm range were removed, and the remaining wavelengths were resampled at 10 nm intervals to reduce data quantity. A Standard Normal Variate (SNV) baseline correction (Barnes et al. 1989) was then performed on the remaining spectra.
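The pre-processing sequence described above can be sketched in plain Python. The analyses in this study were performed in R, so this is only an illustration under stated assumptions: all function names are hypothetical, the Savitzky-Golay window (5 points) and polynomial order (quadratic) are assumed because they are not reported here, resampling is approximated by keeping every tenth 1 nm band, and the splice correction at the detector junctions is omitted.

```python
import math
import statistics


def reflectance_to_absorbance(reflectance):
    """Convert reflectance (0 < R <= 1) to apparent absorbance, log10(1/R)."""
    return [math.log10(1.0 / r) for r in reflectance]


def savgol_smooth_5pt(values):
    """Savitzky-Golay smoothing with a 5-point window and quadratic fit.

    Assumption: window and order are illustrative. The two points at
    each end of the spectrum are left unsmoothed for simplicity.
    """
    coeffs = (-3, 12, 17, 12, -3)  # standard 5-point quadratic weights (sum 35)
    out = list(values)
    for i in range(2, len(values) - 2):
        window = values[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(coeffs, window)) / 35.0
    return out


def trim_and_resample(wavelengths, values, low=500, high=2450, step=10):
    """Keep integer wavelengths within [low, high] nm at every `step` nm."""
    kept = [(w, v) for w, v in zip(wavelengths, values)
            if low <= w <= high and (w - low) % step == 0]
    return [w for w, _ in kept], [v for _, v in kept]


def snv(values):
    """Standard Normal Variate: centre and scale one spectrum by its own
    mean and standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Synthetic spectrum at 1 nm resolution (hypothetical reflectance values).
wavelengths = list(range(350, 2501))
reflectance = [0.2 + 0.3 * (w - 350) / 2150 for w in wavelengths]

absorbance = reflectance_to_absorbance(reflectance)
smoothed = savgol_smooth_5pt(absorbance)
kept_wl, kept_abs = trim_and_resample(wavelengths, smoothed)
processed = snv(kept_abs)
print(len(kept_wl), kept_wl[0], kept_wl[-1])
```

Each processed spectrum then has a mean of zero and a standard deviation of one across its retained bands, which is the effect of the SNV step.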

Prediction models
Along with the VisNIR spectra, the mid-depth of the sample was included as a predictor variable in the models, to ensure that the depth was taken into account when predicting. In addition to these, soil pH and STC content (measured by the traditional laboratory approaches described above) were included as predictor variables, as this data was available for all samples. For each soil property (SOC and SIC), five variations of model inputs were tested. These included VisNIR and mid-depth (model A); VisNIR, mid-depth and pH (model B); VisNIR, mid-depth and STC content (model C); VisNIR, mid-depth, pH and STC content (model D); and finally mid-depth, pH and STC content without VisNIR spectra (model E).
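The five predictor combinations can be written down compactly as a lookup table. The sketch below is illustrative only: the variable names (`visnir`, `mid_depth`, `ph`, `stc`), the helper function and the sample values are hypothetical.

```python
# Predictor sets for models A-E as described in the text.
# "visnir" stands for the full set of pre-processed spectral variables.
MODEL_INPUTS = {
    "A": ["visnir", "mid_depth"],
    "B": ["visnir", "mid_depth", "ph"],
    "C": ["visnir", "mid_depth", "stc"],
    "D": ["visnir", "mid_depth", "ph", "stc"],
    "E": ["mid_depth", "ph", "stc"],
}


def build_predictors(sample, model):
    """Assemble the predictor vector for one sample (a dict) and one model."""
    row = []
    for name in MODEL_INPUTS[model]:
        value = sample[name]
        # The spectrum is a list of band values; scalars are appended as-is.
        row.extend(value if isinstance(value, list) else [value])
    return row


# Hypothetical sample with a three-band "spectrum" for brevity.
sample = {"visnir": [0.12, -0.40, 0.95], "mid_depth": 0.2, "ph": 8.1, "stc": 1.1}
for model in MODEL_INPUTS:
    print(model, build_predictors(sample, model))
```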
Cubist models were used to predict SOC and SIC content, with each of the five combinations of predictor variables. Cubist is a regression rule technique that essentially functions by creating one or more rules, where each rule is a linear model of the predictor variables [9]. To test prediction quality, 75% of the dataset was used as calibration, and the remaining 25% was used as validation. These datasets were selected by performing a Latin hypercube sampling of the VisNIR spectra, pH, STC, mid-depth, and the response variables to ensure that both the validation and calibration datasets were appropriately represented. The maximum number of rules used in the Cubist models was 10. The bagging, or bootstrap aggregating, method was used to generate different models from varying realisations of the calibration dataset with the aim to enhance the prediction and also estimate the uncertainty of the model [18]. This approach uses repeated random sampling with replacement, where B bootstrap samples are drawn from the calibration dataset of size N. Each bootstrap sample has the same size as the calibration dataset, but does not contain the same set of samples. In total, there were 50 bootstraps, meaning that 50 Cubist models were generated for each soil property and each combination of predictor variables. The mean was then calculated from the 50 soil property predictions for each sample in the calibration dataset. The statistics used to test the model quality included Lin's concordance correlation coefficient (LCCC), root mean square error (RMSE), bias (mean of the residuals), and R², with this being tested on both the calibration and validation datasets. All statistical analyses were performed in the statistical program R [10].
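The bagging procedure can be illustrated with a minimal sketch. Cubist itself is not used here: a one-predictor least-squares line stands in for each Cubist model purely to keep the example self-contained, and the function names and data are hypothetical. Only the resampling logic (N draws with replacement, 50 bootstraps, mean of the predictions) follows the description above.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for one predictor (stand-in for a Cubist model)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def bagged_predict(cal_x, cal_y, new_x, n_boot=50, seed=1):
    """Fit one model per bootstrap sample; return per-sample mean and SD."""
    rng = random.Random(seed)
    n = len(cal_x)
    per_model = []
    for _ in range(n_boot):
        # Draw N indices with replacement from the calibration dataset.
        idx = [rng.randrange(n) for _ in range(n)]
        slope, intercept = fit_line([cal_x[i] for i in idx],
                                    [cal_y[i] for i in idx])
        per_model.append([intercept + slope * x for x in new_x])
    by_sample = list(zip(*per_model))  # regroup predictions by sample
    means = [statistics.fmean(p) for p in by_sample]
    sds = [statistics.stdev(p) for p in by_sample]
    return means, sds


# Hypothetical calibration data: SOC (%) loosely proportional to STC (%).
data_rng = random.Random(0)
cal_x = [0.2 + 0.04 * i for i in range(40)]
cal_y = [0.9 * x + data_rng.gauss(0, 0.05) for x in cal_x]

means, sds = bagged_predict(cal_x, cal_y, new_x=[0.5, 1.0, 1.5])
print([round(m, 2) for m in means])
print([round(s, 3) for s in sds])
```

The spread of the 50 predictions for each sample (here the standard deviation) is one simple way to express the model uncertainty that bagging makes available.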

Summary statistics of laboratory-measured soil properties
Overall, SOC contents of the soil samples are low, with a mean value of 0.58% and values ranging from 0.09 to 1.77% (Table 1). The SIC contents were quite variable, with a minimum of 0, a maximum of 1.60%, and a mean of 0.04%. The STC content of samples ranged from 0.63 to 1.85%, and possessed a mean of 0.63%. The mean pH of all samples was slightly alkaline at 7.57, but ranged considerably from 5.02 to 9.62 (Table 1).

Visible near infrared (VisNIR) spectroscopy predictions
Both SOC and SIC content of samples displayed a mild to strong relationship (Pearson's correlation) with the predictor variables of STC content and soil pH (Table 2). The independently validated statistics also showed that both SOC and SIC content of samples could be predicted with high accuracy using spectroscopic techniques (Figs. 1 and 2; Table 3). Overall, SOC content was predicted with greater accuracy than SIC content, and the different combinations of model inputs had a clear impact on the prediction quality for both SOC and SIC content. The LCCC was primarily used for the assessment of model quality, as it measures how closely the observed and predicted values follow the 1:1 line. It is also unitless, which makes it useful for comparing different models of the same soil property, as well as comparing models for different soil properties.
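The model quality statistics can be computed as in the sketch below. Several details are assumptions because they are not specified in the text: residuals are taken as predicted minus observed, R² is taken as the squared Pearson correlation, and the LCCC uses population (1/n) variances and covariance. The function name and data are hypothetical.

```python
import math
import statistics


def validation_stats(observed, predicted):
    """Return LCCC, RMSE, bias and R2 for paired observed/predicted values."""
    n = len(observed)
    mean_o = statistics.fmean(observed)
    mean_p = statistics.fmean(predicted)
    # Population (1/n) variances and covariance (an assumed convention).
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted)) / n

    # Assumed sign convention: residual = predicted - observed.
    residuals = [p - o for o, p in zip(observed, predicted)]
    return {
        "lccc": 2 * cov / (var_o + var_p + (mean_o - mean_p) ** 2),
        "rmse": math.sqrt(statistics.fmean([r ** 2 for r in residuals])),
        "bias": statistics.fmean(residuals),
        "r2": cov ** 2 / (var_o * var_p),  # assumed: squared Pearson r
    }


# Hypothetical SOC values (%): predictions are perfectly correlated with the
# observations but sit 0.1% above the 1:1 line.
observed = [0.2, 0.5, 0.8, 1.1]
predicted = [0.3, 0.6, 0.9, 1.2]
stats = validation_stats(observed, predicted)
print({name: round(value, 3) for name, value in stats.items()})
```

In this example R² equals 1 while the LCCC is about 0.957, because the LCCC penalises the constant offset from the 1:1 line and R² does not.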
For SOC content, it was clear that the inclusion of additional soil property data improved prediction results considerably, improving the predictions on the validation dataset from an LCCC of 0.81 for the model without ancillary soil data (model A), to an LCCC of 0.94 for the model that included all ancillary soil data (model D) (Table 3; Fig. 1). It can be seen in Fig. 1 that model D predicted SOC to ~100% accuracy for multiple soil samples (where the points lie exactly on the 1:1 line) in both the calibration and validation plots, whereas this did not occur for model A. For SOC content, the inclusion of pH alone with VisNIR and mid-depth (model B) did not improve predictions, but the inclusion of STC content with VisNIR and mid-depth (model C) improved predictions significantly. Organic carbon content had a particularly high positive correlation (r) with total carbon content (0.86) and a weaker negative correlation with pH (−0.36), which explains their relative importance in the prediction of SOC content (Table 2). Overall, the inclusion of both pH and STC content together with VisNIR and mid-depth (model D) resulted in the best model. Interestingly, the second best model for predicting SOC content was the model that contained no VisNIR spectra and only ancillary soil information as predictor variables (model E), with an LCCC of 0.92 on the validation dataset (Table 3). This indicates the high value of the ancillary soil information in predicting SOC content.
While SOC content could be predicted to higher accuracy, the inclusion of STC content and pH data as predictor variables with VisNIR and mid-depth was even more effective in improving SIC content predictions with Cubist models (Table 3; Fig. 2). For SIC content, VisNIR and mid-depth only models (model A) predicted the validation dataset very poorly, with an LCCC of 0.35, whereas the model with the full suite of input variables (model D) predicted the validation dataset to an accuracy of 0.83 LCCC (Table 3).
While the inclusion of pH alone (model B) did not improve SOC content predictions, it made a noteworthy improvement over the simplest model (model A) in SIC content predictions, predicting the validation dataset to an accuracy of 0.52 LCCC (Table 3). The absolute correlation (r) between pH and SIC content (0.39) was slightly higher than that between pH and SOC content (−0.36), although the correlation between STC and SIC contents was much weaker (0.35) than that between STC and SOC contents (0.86) (Table 2). Overall, the Cubist models that included VisNIR spectra, mid-depth, pH and STC content as predictor variables proved to be the most accurate at predicting SIC content. Similarly to SOC content predictions, the model with mid-depth, soil pH and STC data and without VisNIR spectra was the second best model for predicting SIC content, with an LCCC of 0.78 when predicting on the validation dataset (Table 3).
Table 4 shows the five most important predictor variables for each Cubist model, giving the percentage of times where the variable was used in a condition, and the percentage of times it was used in a linear model. It was clear that the ancillary soil data played a significant role in both the conditions and the models (Table 4). For example, in the model that contained the full suite of predictors (model D), pH and STC were the most important variables for SOC content, and in the top three for SIC content (Table 4). In terms of important wavelengths for the different SOC models, 1400, 1900, and 2140 nm were important in both models A and B, and 570 nm was important in models A and C. For the SIC models, 570 nm appeared in both models A and C, and 1480-1490 nm in models B and D. Higher wavelengths at 2220 nm for model A and 2270 nm for model C were also important predictors for SIC content (Table 4).

Soil property predictions and predictor variable importance
Overall, both SOC and SIC contents of samples from the semi-arid region of Hillston could be accurately predicted with spectroscopic techniques. It was clear from the results that combining soil pH and STC content data as predictor variables with VisNIR spectra substantially improved the accuracy of both SOC and SIC content predictions compared to solely using VisNIR spectra.
In particular, SOC content was predicted with very high accuracy by the model that included VisNIR, mid-depth, pH and STC (model D), with an LCCC of 0.94 when predicted on the validation dataset compared to an LCCC of 0.81 for the model that contained only VisNIR and mid-depth (model A). This is logical, as SOC content is highly positively correlated (r) with STC content (0.86), and mildly negatively correlated with pH (−0.36). The importance of these ancillary data in predictions of SOC content was demonstrated in the model that only contained mid-depth, pH and STC (model E), where despite no VisNIR spectra being included in the model, the calibration dataset could still predict the validation dataset to an accuracy of 0.92 LCCC (Table 3). When predicting on both the calibration and validation datasets with model D, it was apparent that SOC content was predicted with ~100% accuracy for several samples, as can be seen in Fig. 1. These very accurate predictions can be logically explained. It is likely that the Cubist model is detecting that there is no inorganic carbon in the sample, and because Cubist models are essentially rule-based decision trees, the model is simply assigning the inputted STC value as the SOC content prediction. There are particular VisNIR wavelengths that are associated with SIC [11], and if these wavelengths of the scanned sample do not possess the appropriate reflectance, the model is likely determining that there is no SIC present in the sample. While the inclusion of soil pH alone with VisNIR and mid-depth did not improve SOC predictions (model B), when this was included in combination with STC (model D), the predictions were slightly better than model C (VisNIR, mid-depth and STC), suggesting that there is an advantageous interaction occurring with STC and pH in these models.
Studies have reported that the accuracy of predicting SOC and SIC content of samples with VisNIR is generally quite similar [12,13], although this depends on a number of factors. In our study, this was not the case, and the best combination of covariates (model D) predicted the validation dataset with an LCCC of 0.94 for SOC content, and 0.83 for SIC content (Table 3). Again the value of ancillary soil data was exemplified, with the model that contained mid-depth, pH and STC, and no VisNIR spectra (model E) showing relatively accurate predictions of SIC content, with an LCCC of 0.78 on the validation dataset. A possible reason for the poorer predictions of SIC content compared to SOC content is the nature and distribution of the SIC dataset. The SIC dataset in our study is zero-inflated (contains many zero values), and consequently there are fewer samples that contain some amount of SIC in the training dataset. In addition, SIC content of samples was not directly measured by laboratory methods, and was determined by the difference between measured SOC and STC content, which includes a greater amount of error. While there were multiple occurrences of SOC content being predicted to ~100% accuracy, this was not the case for SIC content. This is logical, as all soil samples in the study contain some amount of SOC, even if it is a very small amount, but not all samples contain SIC. As a result, the model could not simply assign the SIC value as the STC value.
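The remark that SIC determined by difference carries a greater amount of error can be made explicit with a standard error-propagation sketch. It assumes, and this was not tested in this study, that the measurement errors of STC and SOC are independent, with standard deviations \sigma_{STC} and \sigma_{SOC} (symbols introduced here for illustration):

```latex
\mathrm{SIC} = \mathrm{STC} - \mathrm{SOC}
\quad\Longrightarrow\quad
\sigma_{\mathrm{SIC}}
  = \sqrt{\sigma_{\mathrm{STC}}^{2} + \sigma_{\mathrm{SOC}}^{2}}
  \;\ge\; \max\left(\sigma_{\mathrm{STC}},\, \sigma_{\mathrm{SOC}}\right)
```

Under this assumption the uncertainty of the SIC value is at least as large as that of either measured quantity, and it is proportionally largest for samples where SIC content is small relative to STC content.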
While the inclusion of soil pH as a predictor variable with VisNIR and mid-depth (model B) did not improve the SOC content predictions, it made a significant improvement in SIC content predictions. Soil pH was found to be positively correlated (r) with SIC content (0.39) and it is known that very alkaline pH levels indicate the presence of considerable amounts of carbonate in a soil. Soils that possess a pH of less than 7 (1:5 H₂O) also commonly do not contain SIC (Wang et al. 2015). Although the correlation (r) of SIC with STC content was relatively weak (0.35), including STC content as a predictor improved predictions on the validation dataset from an LCCC of 0.35 for model A to 0.82 for model C. While combining pH with spectra improved SIC content predictions, this positive impact seemed to be masked when both pH and STC content were included as predictor variables together, with the LCCC of predictions on the validation dataset for model D only slightly better at 0.83.
The analysis of variable importance of the different models of SOC and SIC content showed that the ancillary soil data played a significant role in both the conditions and the models (Table 4). As expected, the most important wavelengths of the VisNIR spectra varied between the different models for both SOC and SIC content; however, there were a few wavelengths that were important for several models. In particular, 570 nm was in the top five most important predictors in model A and B for both SOC content and SIC content. Other studies have reported similar results, such as Viscarra Rossel et al. [2], where 570 nm was identified as an important predictor for SOC content, and Ostovari et al. [14], where 571 nm was an important predictor for calcium carbonate (CaCO₃). For SIC content, higher wavelengths at 2220 nm for model A and 2270 nm for model C were identified as important predictors, which is also commonly reported by other similar studies (e.g. [11,14,15]).

Limitations and opportunities
It must be acknowledged that including STC content data with VisNIR spectra to predict the SOC content of a sample is likely impractical and unnecessary for many studies. The prediction of SOC content with spectra alone was of high quality in our study, and this has been the case for many other studies [2]. Our study, however, demonstrates that there is considerable benefit in measuring STC content and including this in model predictions with VisNIR to predict SIC content. Although inorganic carbon is not found in all soils, this approach could be particularly appropriate for areas that typically possess soils with carbonates, such as arid and semi-arid areas. Our results also suggest that there is considerable benefit in including soil pH data with VisNIR spectra to predict SIC content. Soil pH data is also typically more available than STC content data, as it can be rapidly and cheaply measured by traditional laboratory methods. As soil pH is often correlated with different soil attributes such as nutrient availability, this also shows promise for combining soil pH data with spectra to predict other soil properties.
This also opens the discussion as to the possible benefits of combining other cheaply-measured and readily available soil data with VisNIR spectra to predict different soil properties, as there are many soil properties that are highly correlated with each other. For example, soil electrical conductivity (EC) is easily measured by traditional laboratory methods and hence this data is often available. It is known that EC is well correlated with other soil properties that are typically laborious to measure, such as soil particle size, and cation exchange capacity [16]. While studies often use ancillary soil data combined with pedotransfer functions to estimate the value of a soil property [17], there are no studies, to our knowledge, that use ancillary soil data in combination with spectra to predict another soil property. There are some limitations to adopting this approach, as including additional soil property data with spectra in predictive models requires that both the training dataset and prediction dataset possess a value for that soil property. Despite this, when ancillary soil data is available, it could be very useful in improving the quality of soil spectroscopic predictions.

Conclusions and future directions
It was clear from the results in this study that the inclusion of soil pH and STC content as predictor variables substantially improved the prediction of both SOC and SIC content when combined with VisNIR wavelengths of the scanned soil samples. When combined with VisNIR spectra, soil pH data markedly improved the prediction of SIC content, which is a particularly significant finding as SIC content is difficult to measure by traditional laboratory techniques, whereas soil pH information is often readily available. The overall results from this study suggest that there is promise for including other readily available soil data with VisNIR to predict different soil properties, particularly when the soil property used as a predictor is correlated with the soil property to be predicted.