Visible near infra-red (VisNIR) spectroscopy for predicting soil organic carbon in Ethiopia

Over the past few decades, the advantages of visible near infra-red (VisNIR) diffuse reflectance spectroscopy (DRS) have made it a practical method for predicting soil organic carbon (SOC). In this study, SOC was predicted using regression models for samples taken from three sites (Gununo, Maybar and Anjeni) in Ethiopia. SOC was characterized in the laboratory using conventional wet chemistry and VisNIR-DRS methods. Principal component analysis (PCA), principal component regression (PCR) and partial least squares regression (PLS) models were developed using Unscrambler X 10.2. PCA results show that the first two components accounted for a minimum of 96% of the variation, which increased for individual sites and with data treatments. Correlation (r), coefficient of determination (R²) and residual prediction deviation (RPD) were used to rate the four models built. PLS model values (r, R², RPD) were 0.9, 0.9 and 3.6 for Anjeni; 0.6, 0.3 and 1.2 for Gununo; 0.6, 0.3 and 0.9 for Maybar; and 0.7, 0.6 and 1.5 for the three sites combined. PCR model values (r, R², RPD) were 0.9, 0.8 and 2.7 for Anjeni; 0.5, 0.3 and 1 for Gununo; 0.5, 0.1 and 0.7 for Maybar; and 0.7, 0.5 and 1.2 for the three sites combined. Comparison and testing of the models show superior performance of PLS over PCR. Models were rated as very poor (Maybar), poor (Gununo and three sites) and excellent (Anjeni). The most robust model, Anjeni, is recommended for prediction of SOC in Ethiopia.


INTRODUCTION
Concerns about global warming have resulted in an international agreement on reducing the emission of greenhouse gases (Kandel et al., 2011). This concern has renewed interest in determining soil organic carbon (SOC) content (Brunet et al., 2007). SOC represents one of the major pools in the global C cycle. Therefore, small changes in SOC stocks can cause important CO2 fluxes between terrestrial ecosystems and the atmosphere (Stevens et al., 2006). Determination of SOC content is an important part of research examining these fluxes. Current technologies for determining SOC fall into two categories, often described as "intensive" and "non-intensive" (McCarty et al., 2002).
To quantify SOC, "intensive technologies" use several different fractionation and chemical extraction procedures. They include dry combustion for total carbon, the calcimeter method for inorganic carbon and wet oxidation for SOC (Janik et al., 1998; Sankey et al., 2008; Walkley and Black, 1934). "Intensive technologies" are conventional, standard procedures but are time-consuming, laborious and expensive. The existence of several deviations in analytical procedures among the standard methods makes them more complex (McCarty et al., 2010).
In recent years, "non-intensive technology" has been used as an alternative because of its multiple advantages. Attention has been given to one such alternative, visible near infrared (VisNIR) diffuse reflectance spectroscopy (DRS) (Brunet et al., 2007). VisNIR-DRS methods are new, rapid, simple, non-destructive, reproducible, cost-effective and sometimes more accurate than conventional analytical methods (Chang et al., 2001; Brown et al., 2005; Gomez et al., 2008; Cecillon et al., 2009; McCarty et al., 2010).
It is a well-known fact that infrared-predicted data can never be better than the original laboratory values. The VisNIR-DRS method is less accurate than conventional laboratory methods such as wet oxidation and dry combustion (Stevens et al., 2006). If the sources of laboratory error can be identified, however, the VisNIR method may in fact be a better tool for interpretation than the 'appropriate' chemical analysis (Janik et al., 1998). A comprehensive review of the advantages and disadvantages of VisNIR spectrometers is given in Blanco and Villarroya (2002). VisNIR spectrometer methods also have limitations associated with instrumentation, data transferability and variation in study scale (Mouazen et al., 2010). In spite of these limitations, progress has shown the potential of visible near infra-red reflectance (VisNIR) for soil analysis (Janik et al., 1998).
In predicting SOC, various types of spectrometers (DRS) are used (Blanco and Villarroya, 2002). The most common types are described as diffuse reflectance (DR), mid infrared (MIR) and near infrared (VisNIR). In this study, a VisNIR spectrometer was used with a wavelength range from 700 to 2,500 nm (Viscarra Rossel et al., 2006; Viscarra Rossel and McBratney, 2008). DRS has been used in soil science research since the 1950s (Viscarra Rossel and McBratney, 2008); however, characterizing soil using VisNIR-DRS dates back to the 1960s (Brown et al., 2005). Over the past 40 years, VisNIR-DRS methods have been developed as a tool to predict SOC (Kang, 2006). Today, the wide application of VisNIR-DRS methods has resulted in a modern technique for landscape modeling (Brown et al., 2005), precision agriculture (He and Song, 2006; Brown et al., 2005), digital soil mapping (Viscarra Rossel and McBratney, 2008) and soil C monitoring (Brown et al., 2005; Ge et al., 2011) for use in carbon sequestration studies and carbon finance.

Shiferaw and Hergarten 127
VisNIR-DRS includes finding suitable data treatment and calibration strategies (Chang et al., 2001). As soil organic matter is complex, spectra are not directly informative (Brunet et al., 2007). The spectra are complex, with overlapping bands associated with the soil organic matter component (Kang, 2006; Sankey et al., 2008). The VisNIR spectra for SOC have not been well described so far, perhaps due to the complexity of the material (Brown et al., 2005). Moreover, soil contains various constituents other than organic matter, which interact in a complex way to produce a given spectrum. Direct quantitative prediction of soil characteristics is therefore impossible (Cecillon et al., 2009; Chang et al., 2001). It is worth noting that soils are more diverse in composition than traditional VisNIR products such as grains or forages (Ge et al., 2011). It is nevertheless possible to calibrate a model to predict soil organic carbon. Simple equations involving pedo-transfer functions are used for predicting soil properties (Janik et al., 1998). Likewise, over the past decades, both physical and chemical properties of soils have been predicted from soil spectral data using multivariate equations (Kang, 2006; Cecillon et al., 2009). The prediction is successful for soil organic carbon. Multivariate analysis is used to construct models capable of accurately predicting properties of unknown samples. Multivariate calibration methods such as multiple linear regression (MLR), principal component regression (PCR), boosted regression trees (BRT), artificial neural networks (ANN), locally weighted regression (LWR) and partial least squares regression (PLSR) have been applied to all spectroscopic studies (quantitative analysis) with variable degrees of success (Kang, 2006; Chang et al., 2001; Genot et al., 2011). PLS, PCR and MLR work well where the relationship is linear, while ANN and others can be used where it is not (Blanco and Villarroya, 2002). None of the above models is universally accepted, and various calibration techniques have been proposed (Chang et al., 2001; Genot et al., 2011).
Regression techniques relate the soil spectral data measured using VisNIR-DRS to laboratory-measured soil properties (Ge et al., 2011). In this study, spectral data were related to SOC determined by the analytical (Walkley and Black) method using multivariate regression models. The models built were tested using the full prediction method and checked for accuracy using statistical parameters (Chang et al., 2001; Kandel et al., 2011).
This study makes use of three models: PCA, PLS and PCR. These models were selected for three reasons. First, they are full-spectrum data compression techniques (Viscarra Rossel and McBratney, 2008; Naes et al., 2002). Second, the models can handle collinearity. Third, they are the most widely used and successful in SOC prediction (Blanco and Villarroya, 2002; Ge et al., 2011). As reviewed by Stevens et al. (2006), PLS and PCR are more frequently used than other models. The MLR model was not used in this study because of its limitations in leverage correction and in handling collinearity (Stevens et al., 2006; CAMO, 2012).
As reviewed by Brown et al. (2005), soil properties have been predicted using VisNIR spectrometers at a wide range of scales representing soil variability, from local and regional to global libraries. Regional libraries cover a greater geographic extent than local libraries, while global libraries are based on major soil taxa from multiple continents (Sankey et al., 2008; Brown et al., 2005). A comparison of results by Sankey et al. (2008) and reviews by Chang et al. (2001) and Stevens et al. (2006) show that local libraries have better calibration accuracy than regional and global libraries. This study attempts to build four models (one for each of the three sites and one for all three sites combined) and recommends the most robust model for prediction of SOC in Ethiopia. Until recently, VisNIR-DRS had not been used as a tool to predict soil properties in Ethiopia. The paper specifically attempts to show the effect of data treatment on models, model testing and selection.

Methods
An equivalent mass depth soil sampling method was used, as suggested for soil carbon studies by Stolbovoy et al. (2002). Soil samples were taken from 64 soil profiles at three sites. Although the study sites are small, they contain different soil types (Table 1), which resulted in intensive sampling. Depending on profile depth, samples were taken from 0-10, 10-30, 30-50 and 50-100 cm depths. Although SOC decreases with soil depth, its concentration is still evident down to 1 meter (Allen et al., 2010). Thus, a deep sampling protocol is suggested for SOC studies (Baker et al., 2007). In total, 96 soil samples were taken from Gununo, 98 from Anjeni and 81 from Maybar. As recommended by Brunet et al. (2007) and Knadel et al. (2011), soil samples were ground and sieved through 0.2 mm for better carbon prediction.
A field spectrometer (VisNIR-DRS) from Analytical Spectral Devices (ASD) Inc. was used to measure the 275 samples taken from the three sites. SOC was measured in the laboratory using the standard wet oxidation procedure described in Walkley and Black (1934). Scanning procedures are as described in Brown et al. (2005), with detailed protocols as indicated in Viscarra Rossel (2009). Reflectance spectra were measured on petri dishes, twice for each sample, using a mug light. Spectral wavelengths range from 350 to 2500 nm. Data reduction methods are needed in VisNIR spectrometer studies (Blanco and Villarroya, 2002). After the spectral data were transposed for pre-processing, they were reduced by averaging the replicate spectra of each sample. Then every 10th wavelength was selected.
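The data reduction described above (averaging the two replicate scans of each sample, then keeping every 10th wavelength) can be sketched as follows. This is a minimal illustration using NumPy rather than Unscrambler; the function name `reduce_spectra`, the 1 nm band spacing, the choice to start thinning at the first band, and the synthetic data are all assumptions, not details given in the study.

```python
import numpy as np

def reduce_spectra(scans, sample_ids, wavelengths, step=10):
    """Average replicate scans per sample, then keep every `step`-th wavelength."""
    scans = np.asarray(scans, dtype=float)
    sample_ids = np.asarray(sample_ids)
    wavelengths = np.asarray(wavelengths)
    unique_ids = np.unique(sample_ids)  # sorted sample labels
    # One averaged spectrum per sample (mean over its replicate scans).
    averaged = np.vstack(
        [scans[sample_ids == sid].mean(axis=0) for sid in unique_ids]
    )
    # Thin the wavelength axis; starting at the first band is an assumption.
    return unique_ids, wavelengths[::step], averaged[:, ::step]

# Synthetic stand-in data, not the study's measurements.
wavelengths = np.arange(350, 2501)          # 350-2500 nm, 1 nm spacing assumed
rng = np.random.default_rng(0)
scans = rng.random((6, wavelengths.size))   # 3 samples x 2 replicate scans
sample_ids = np.array(["s1", "s1", "s2", "s2", "s3", "s3"])

ids, kept_wavelengths, spectra = reduce_spectra(scans, sample_ids, wavelengths)
print(spectra.shape, kept_wavelengths[:3])
```

Under these assumptions the 2151 original bands are reduced to 216, which shortens later model fitting while keeping the overall shape of each spectrum.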
There also seems to be a lack of clarity on how to pre-process spectral data to optimize them (Brunet et al., 2007). Proper data pretreatment helps develop an accurate calibration (Reeves et al., 2006; Blanco and Villarroya, 2002). After various data pretreatment procedures were tested, multiplicative scatter correction (MSC) and detrending (DT) were selected to obtain the best calibration and validation results. The steps used in developing multivariate models are as described in Blanco and Villarroya (2002) and CAMO (2012).
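For readers unfamiliar with the two selected pretreatments, the sketch below shows one common formulation of each. MSC regresses every spectrum on a reference (here the mean spectrum) and removes the fitted offset and slope; detrending subtracts a low-order polynomial fitted against wavelength. The function names, the second-order polynomial and the order in which the two steps are chained are assumptions for illustration. The study does not report these settings, and Unscrambler's implementation may differ in detail.

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    spectra = np.asarray(spectra, dtype=float)
    ref = spectra.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        # Fit row = intercept + slope * ref, then undo offset and scaling.
        slope, intercept = np.polyfit(ref, row, deg=1)
        corrected[i] = (row - intercept) / slope
    return corrected

def detrend(spectra, wavelengths, order=2):
    """Subtract a polynomial baseline (order is an assumed setting)."""
    spectra = np.asarray(spectra, dtype=float)
    x = np.asarray(wavelengths, dtype=float)
    x = (x - x.mean()) / x.std()  # rescale for a well-conditioned fit
    detrended = np.empty_like(spectra)
    for i, row in enumerate(spectra):
        coeffs = np.polyfit(x, row, deg=order)
        detrended[i] = row - np.polyval(coeffs, x)
    return detrended

# Synthetic spectra: one shape with different offsets and scalings.
wavelengths = np.linspace(350, 2500, 216)
base = np.exp(-((wavelengths - 1400) / 600) ** 2)
raw = np.vstack([0.05 + 0.9 * base, 0.10 + 1.2 * base, 0.02 + 1.1 * base])

treated = detrend(msc(raw), wavelengths)
print(treated.shape)
```

In this synthetic case the three spectra differ only by offset and scale, so MSC maps them onto the same curve; real soil spectra would retain the chemical differences that the regression models use.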
Unscrambler X 10.2 (CAMO Software, Analytical Spectral Device {ASD}, Oslo, Norway) (CAMO, 2012) was used for data pretreatment, model calibration, validation and testing. Using the test set validation method, principal component analysis (PCA) was used to examine the hidden structure of the data and to visualize relationships (similarities and differences) between soil samples and spectral wavelengths (variables). PCA was used mainly to describe sample effects on the models. PCA was used as a descriptive tool, while PCR and PLS were used as predictive tools. SOC content was regressed against soil spectra using PLS and PCR.
All model calibrations involved selecting 10 components (factors) and testing regression coefficients at the P < 0.05 significance level with test set validation. A total of four models were built: one for each of the three sites independently and one for all three sites together. To develop the model for the three sites, the data (n = 275) were divided into a validation set (30%, n = 82) and a calibration set (70%, n = 193). For the individual site models, the validation and calibration samples numbered 28 and 68 for Gununo, 29 and 69 for Anjeni, and 24 and 57 for Maybar, respectively.
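The reported set sizes are consistent with taking 30% of each data set, rounded down, for validation and using the remainder for calibration. The short check below illustrates that arithmetic. The rounding rule and the helper name `split_sizes` are assumptions, and the sketch says nothing about how individual samples were assigned to each set.

```python
def split_sizes(n_samples, validation_percent=30):
    """Return (validation, calibration) counts, rounding validation down."""
    n_validation = n_samples * validation_percent // 100
    return n_validation, n_samples - n_validation

# Sample counts per site as reported in the Methods.
site_counts = {"Gununo": 96, "Anjeni": 98, "Maybar": 81}
site_counts["All three sites"] = sum(site_counts.values())

for site, n in site_counts.items():
    n_val, n_cal = split_sizes(n)
    print(f"{site}: n={n}, validation={n_val}, calibration={n_cal}")
```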
The regression models were compared to examine accuracy and predictive ability using the correlation coefficient (r), slope, coefficient of determination (R²), root mean error of calibration (RMEC) and root mean error of prediction (RMEP). Ratings of the models in this study were based on combining two parameters. The first was the R² rating suggested by Viscarra Rossel and McBratney (2008). The second was the RPD rating suggested by Mouazen et al. (2010). The accuracy of the developed models was tested using full prediction, by examining the predicted versus reference plot, which shows the difference between measured and predicted values.
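The statistics named above can be computed from paired reference and predicted SOC values as in the sketch below. The definitions used are common conventions and are assumptions here: the error term is taken to be a root mean square error, R² is computed as one minus the ratio of residual to total sum of squares, and RPD is the standard deviation of the reference values divided by the prediction error. The study does not state the exact formulas applied by Unscrambler, and the numbers in the example are made up.

```python
import numpy as np

def prediction_metrics(y_reference, y_predicted):
    """Return r, slope, R2, RMSE and RPD for one set of predictions."""
    y_ref = np.asarray(y_reference, dtype=float)
    y_pred = np.asarray(y_predicted, dtype=float)
    residuals = y_ref - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    r = np.corrcoef(y_ref, y_pred)[0, 1]
    # Slope of the predicted-versus-reference line.
    slope = np.polyfit(y_ref, y_pred, deg=1)[0]
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y_ref - y_ref.mean()) ** 2)
    # RPD: spread of the reference data relative to the prediction error.
    rpd = np.std(y_ref, ddof=1) / rmse
    return {"r": r, "slope": slope, "r2": r2, "rmse": rmse, "rpd": rpd}

# Illustrative values only, not data from the study.
reference = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
metrics = prediction_metrics(reference, predicted)
print({name: round(float(value), 3) for name, value in metrics.items()})
```

Applied to the calibration set the error term corresponds to the calibration error, and applied to the validation set it corresponds to the prediction error.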

Soil organic carbon (SOC) analytic result
The soils of the study sites were described and classified by the Ethio-Swiss Soil Conservation Program (SCRP) (Kejela, 1995; Weigel, 1986a; Weigel, 1986b). The altitude of the study area varies from 1982 to 2858 meters above sea level (m.a.s.l.). The traditional agro-ecology of the sites varies from Moist WeynaDega to Wet WeynaDega. SOC samples of the three sites (n = 275) have a mode of 2.5 and a median of 1.9 (g/Kg). The SOC data are positively skewed (0.8, standard error of skewness = 0.14), with first quartile (Q1) = 1.0 and third quartile (Q3) = 2.6.
to 3.9% mainly because the survey area was smaller compared with Kejela (1995). Weigel (1986a) indicated that a high percentage of OC is found in Gununo in some soil units of Humic Acrisols and Nitisols. Organic matter (OM) variation shows that some layers of Humic Acrisols have a maximum of 6.2% while Eutric Nitosols have a minimum of 1.2% (% OM = % OC × 1.72). Weigel (1986b) characterized the SOC variation of Maybar, with maximum values of 5.9% OM at some depths of Phaeozem soil profiles and a minimum value of 1.5% OM at some depths. Comparison of the variation of SOC (g/Kg) across the sites shows that the minimum values were recorded in Anjeni and higher values in Maybar (Table 2).
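Because the survey values quoted above are given as organic matter while this study reports organic carbon, the stated relation % OM = % OC × 1.72 can be used to put them on the same basis. The helper names below are hypothetical, and the factor is simply the one given in the text.

```python
OM_PER_OC = 1.72  # conversion factor as stated in the text

def oc_to_om(oc_percent):
    """Convert organic carbon (%) to organic matter (%)."""
    return oc_percent * OM_PER_OC

def om_to_oc(om_percent):
    """Convert organic matter (%) back to organic carbon (%)."""
    return om_percent / OM_PER_OC

# OM values quoted from the earlier soil surveys.
for om in (6.2, 1.2, 5.9, 1.5):
    print(f"{om:.1f}% OM corresponds to about {om_to_oc(om):.2f}% OC")
```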

Principal component analysis (PCA)
PCA shows that the first two principal components accounted for a minimum of 96% of the variance (raw spectra for all three sites). The percent variance increased for specific sites (Table 3) and with data treatment. For example, for the three sites, with detrending the first two components account for 99% of the variance.
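The share of variance carried by the first components can be obtained from a singular value decomposition of the mean-centered spectra, as in the minimal sketch below. It uses NumPy on synthetic spectra built from two underlying shapes plus noise; the function name and the data are illustrative assumptions, and the study's own figures came from Unscrambler.

```python
import numpy as np

def pca_explained_variance(spectra, n_components=2):
    """Return PCA scores and the variance fraction of the leading components."""
    X = np.asarray(spectra, dtype=float)
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)           # fraction per component
    scores = U[:, :n_components] * s[:n_components]
    return scores, explained[:n_components]

# Synthetic spectra: two underlying shapes plus a little noise.
rng = np.random.default_rng(2)
wavelengths = np.linspace(350, 2500, 216)
shapes = np.vstack([np.sin(wavelengths / 300.0), np.cos(wavelengths / 500.0)])
spectra = rng.normal(size=(40, 2)) @ shapes + 0.01 * rng.normal(size=(40, 216))

scores, explained = pca_explained_variance(spectra)
print(np.round(100 * explained, 1), "percent of variance")
```

Smooth, highly correlated bands are the reason a few components can carry almost all of the variance in VisNIR spectra.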
PCA is used to identify outliers in a data set (Tobler, 2011). Maybar samples have 4% potential outliers (Figure 2). Under normal conditions, 5% of the samples may lie outside the ellipse (CAMO, 2012). Samples far from the center have high leverage (potentially influential) (Naes et al., 2002; CAMO, 2012). If leverage values for samples are above 0.4, this is "bothering" (CAMO, 2012). The Maybar samples have the highest proportion (9%) of the worst absolute leverage values, together with 4% potential outliers, which reduced model quality.
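Leverage expresses how far a sample lies from the model center in score space. The sketch below uses one common definition for mean-centered models, h_i = 1/n + Σ_a t_ia² / (t_aᵀ t_a), and flags samples above the 0.4 threshold quoted above. Whether Unscrambler uses exactly this formula is an assumption, and the scores are synthetic, with one deliberately extreme sample.

```python
import numpy as np

def sample_leverage(scores):
    """Leverage per sample from centered PCA scores (samples x components)."""
    T = np.asarray(scores, dtype=float)
    n_samples = T.shape[0]
    # Each component contributes t_ia^2 divided by that component's sum of squares.
    return 1.0 / n_samples + np.sum(T ** 2 / np.sum(T ** 2, axis=0), axis=1)

# Synthetic two-component scores with one extreme sample at index 0.
rng = np.random.default_rng(1)
scores = rng.normal(size=(50, 2))
scores[0] = [15.0, 0.0]
scores -= scores.mean(axis=0)   # scores of centered data have zero mean

leverage = sample_leverage(scores)
flagged = np.flatnonzero(leverage > 0.4)   # threshold quoted from CAMO (2012)
print("samples above 0.4:", flagged)
```

With this definition the leverages of all samples sum to one plus the number of components, so a single sample above 0.4 accounts for a large share of the model.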
The result explains why the Maybar model has the least predictive ability, as reflected in the values of correlation (r), coefficient of determination (R²) and residual prediction deviation (RPD) in both the PLS and PCR models (Figures 3 and 4). Samples that appear as potential outliers were not removed in this study because they contain real soil information measured under laboratory conditions. Comparison of variances showed the closeness of the calibration and validation curves, which indicates that the models were truly representative and that outliers posed no threat. Further data treatment with multiplicative scatter correction (MSC) and detrending (DT) also produced a better PCA with fewer components.

Principal component regression (PCR)
PCR is a multivariate regression technique that is used to predict SOC from VisNIR-DRS data. PCR and PLS provide similar results, though PLS usually converges in fewer factors than PCR. There seems to be confusion about how to pre-process data to optimize spectral features for SOC prediction, and Chang et al. (2001) point out that finding a suitable data treatment is the main challenge in VisNIR-DRS studies.
Some authors prefer derivatives (Brunet et al., 2007), but in this study results using first and second order derivatives were even worse than those from the raw spectral data. Various data treatment methods (moving average, baseline, standard normal variate (SNV)) were tested before selecting MSC and detrending (DT). These other data treatment procedures (baseline, moving average) improved the models only a little compared with the raw spectral data.
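To make the PCR procedure concrete, the sketch below fits a principal component regression with NumPy: the centered calibration spectra are decomposed by SVD, SOC is regressed on the leading scores, and the result is folded back into one coefficient per wavelength. The function names and the synthetic calibration and validation data are assumptions; the 10 components and the 193/82 split mirror the settings reported in the Methods. It is not a reproduction of the Unscrambler models.

```python
import numpy as np

def fit_pcr(X_cal, y_cal, n_components):
    """Fit PCR and return (x_mean, y_mean, coefficients per variable)."""
    X_cal = np.asarray(X_cal, dtype=float)
    y_cal = np.asarray(y_cal, dtype=float)
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    U, s, Vt = np.linalg.svd(X_cal - x_mean, full_matrices=False)
    loadings = Vt[:n_components].T                 # variables x components
    scores = (X_cal - x_mean) @ loadings           # samples x components
    # Scores are orthogonal, so each regression weight is a simple ratio.
    weights = (scores.T @ (y_cal - y_mean)) / s[:n_components] ** 2
    return x_mean, y_mean, loadings @ weights

def predict_pcr(model, X_new):
    """Predict the response for new spectra with a fitted PCR model."""
    x_mean, y_mean, coefficients = model
    return y_mean + (np.asarray(X_new, dtype=float) - x_mean) @ coefficients

# Synthetic calibration and validation sets (193 and 82 samples, 216 bands).
rng = np.random.default_rng(4)
true_coefficients = rng.normal(size=216) / 216
X_cal, X_val = rng.normal(size=(193, 216)), rng.normal(size=(82, 216))
y_cal = 2.0 + X_cal @ true_coefficients + 0.01 * rng.normal(size=193)

model = fit_pcr(X_cal, y_cal, n_components=10)
print(predict_pcr(model, X_val)[:3])
```

PLS differs in that the response is also used when the components are extracted, which is consistent with the observation above that it usually needs fewer factors than PCR.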

Partial least square regression (PLS)
Review shows that the most frequently used regression

Figure 1. Location of study sites in Ethiopia.

Table 1. Description of soils of the study sites.
*Based on SCRP

Table 3. SOC % variation accounted for by the first components with raw spectra.

Table 4. PCR model calibration and validation results.

Table 5. PLS model calibration and validation results.