Evaluating the utility of mid-infrared spectral subspaces for predicting soil properties

We propose four methods for finding local subspaces in large spectral libraries. The proposed four methods include (a) cosine angle spectral matching; (b) hit quality index spectral matching; (c) self-organizing maps and (d) archetypal analysis methods. Then evaluate prediction accuracies for global and subspaces calibration models. These methods were tested on a mid-infrared spectral library containing 1907 soil samples collected from 19 different countries under the Africa Soil Information Service project. Calibration models for pH, Mehlich-3 Ca, Mehlich-3 Al, total carbon and clay soil properties were developed for the whole library and for the subspace. Root mean square error of prediction was used to evaluate predictive performance of subspace and global models. The root mean square error of prediction was computed using a one-third-holdout validation set. Effect of pretreating spectra with different methods was tested for 1st and 2nd derivative Savitzky–Golay algorithm, multiplicative scatter correction, standard normal variate and standard normal variate followed by detrending methods. In summary, the results show that global models outperformed the subspace models. We, therefore, conclude that global models are more accurate than the local models except in few cases. For instance, sand and clay root mean square error values from local models from archetypal analysis method were 50% poorer than the global models except for subspace models obtained using multiplicative scatter corrected spectra with which were 12% better. However, the subspace approach provides novel methods for discovering data pattern that may exist in large spectral libraries.


Introduction
Infrared spectroscopy is providing soil scientists with a new tool for assessing soil quality rapidly and cheaply and is opening up new possibilities for monitoring soil quality in landscapes [33,41] and for digital soil mapping. Other applications of soil spectroscopy have also emerged, for instance, using infrared spectroscopy method as a tool for inferring past climatic changes from lake sentiments determination of organic matter fractions, assessment and monitoring soil quality [9]. Although near-infrared (NIR) and mid-infrared (MIR) spectroscopy are the commonly used techniques for soil measurements Attenuated Total Reflectance (ATR) and Raman spectroscopy approaches have been shown to be useful. Jahn et al. [17] used ATR methods to assay nitrates and ammonium ions from soil. Raman has been used for soil classification studies and recently for assessing the structural role of copper in the soil active glasses [35]. Among the different spectroscopy techniques NIR and MIR are the low-cost and easy to use and have successfully been used to measure carbon (C) content [4]. The main difference between the two ranges is that absorption in mid-infrared spectroscopy corresponds to fundamental bands of molecular vibrations, whereas near-infrared absorptions are due to overtones and combinations of overtones according to several articles cited by Bellon-Maurel and McBratney [4]. Although NIR requires less sample preparation than MIR makes it best suited for in-field analysis but with advancing technology new portable MIR instruments are emerging which can be used in the field giving better specificity and reproducibility of spectra. Because of the dominant intensive vibrations found in MIR spectra, it is generally believed to be more powerful than NIR [20]. Ludwig's view is supported by Pirie et al. [26] who reported a better performance for MIR correlation coefficients validation sets: 0.79 ≤ r ≤ 0.92 against those of NIR 0.53 ≤ r ≤ 0.87 for pH, organic C, clay, sand Mehlich-3 Ca and Mg in Australian top and sub soils.
As spectroscopy instruments continue to improve and scientists' confidence in the usefulness of spectroscopy methods increase as evidenced by scaling up from characterizing few soil samples collected in one site to regionally based land assessments and monitoring studies [37] the need for robust soil spectral prediction models has risen. A result of the increased sampling and subsequent collection of spectra Chemometrics and Intelligent Laboratory Systems 153 (2016) 92-105 has given rise to the development of large spectral libraries. Generating spectral libraries with wide spectral diversity has been recommended for building calibration models that can reliably be used to predict spectra recorded from new samples [32].
Following an earlier work as discussed in Brown et al. [7] showed that about 5.2 × 10 9 carefully selected calibration samples will be required to span the global soil compositional space. However, using the large amounts stored in the spectral libraries can decrease the accuracy of the models fitted to predict multiple soil attributes [27]. This is due to a large spectral diversity of samples in terms of geographical origin, environmental conditions, parent material, mineralogy, etc. For instance, Viscarra Rossel et al. [40] stated that NIR spectra and soil properties can vary under different soil mineralogy and their content in soil organic matter. Ramirez-Lopez et al. [27]) observed that modeling soil attributes using large and diverse soil infrared spectral libraries remains a challenging task. To utilize the growing spectral libraries several strategies have been proposed to partition the complexities found in global spectral spaces into local spaces using either geographical or spectral partitioning. For instance, Wetterlind and Stenberg [42] used models calibrated with a national visible-NIR library, and models calibrated only with local samples grouped according to fields sampled. They observed that the local models outperformed the national calibration models. Stevens et al. [34] observed after partitioning soil dataset into different soil texture classes and agro-pedological regions that local NIR models of soil organic carbon perform better than global models. Spectral space partitioning has been done using memory-based learning methods, which search, through a spectral library to find similar spectra. A recent work done by Dahlbacka et al. [10] presented a proof of concept study for using an iterative algorithm to find local quantitative PLS. In their study, they compared different methods for calculating similarity measures for refining the models by removing a specified number of calibration spectra that represent constituent values from the predicted value, then created a new PLS model on the reduced calibration set to make a new prediction. However these approaches are computationally intensive and the criteria for searching through a spectral library and identify points in a local neighborhood have not been satisfactorily developed.
In this study, we proposed and developed simple methods for partitioning global spectral library space into subspaces from which local calibration models will be developed and assessed against global models. We are proposing four different methods for identifying the spectral subspaces: 1. spectral matching using absolute value algorithm to calculate hit quality index value for a spectral library, 2. spectral matching using spectral correlation algorithms to compute pairwise cosine angles, 3. use of self-organizing maps to group spectra into subspaces equal to the number of subspaces obtained using pure mineral matching and 4. archetype analysis of spectra. 1.1. Local spectral spaces methods

Cosine spectral similarity angle
Cosine of the angle between two vectors can be used to express similarity between two spectra and has been used extensively in NIR mathematical treatment for expressing sample similarity [6] and can be derived using the Euclidean dot product formula as follows: With two vectors holding measurements for two samples, A and B, the cosine similarity, cosθ, is represented using a dot product and magnitude as: The resulting similarity metric measure ranges from −1 meaning exactly opposite, to 1 meaning similar vectors with 0 indicating orthogonality (dissimilarity) and values in between indicating intermediate similarity between the vectors. In its application the correlation thresholds obtained using this method are used to determine whether two spectra are a match, and which the correlation is an angle and not a probability. Thus, a threshold of 0.78 in no way means 78% likelihood or 78% confidence.

Hit quality index (HQI)
Spectral library matching is a widely used interpretation aid [47] in spectroscopy applications. The idea behind spectral matching is to mathematically compare unknown (or a new sample) spectrum against a collection of known spectra. The result of this comparison is a number called the "hit quality index", (HQI) which is a direct measure of how similar two spectra are to each other. To increase the odds of obtaining an accurate search, it is advisable to use the full spectrum. A typical spectrum contains thousands of data points. Different search algorithms are available for comparing two spectra to each other depending on the software being used. In the nomenclature of spectral library searching, the similarity of two spectra can be defined to as a normalized measure of spectral covariance: Where known denotes the spectrum of a reference material whose identity of either chemical or physical composition is known, unknown denotes the spectrum of the material under investigation or the sample being compared with the known spectra, and the dot symbolizes the dot product of two spectral vectors.
Another simple search algorithm for computing hit quality index is what is referred to as absolute value algorithm. First, a known or reference spectrum is subtracted from the sample spectrum. The result of this calculation is called a residual. The size of residual is directly related to how similar two spectra are to each other. For example, two identical spectra will produce a residual of zero almost a straight line. We have implemented this method in our study due to its simplicity.

Self-organizing maps (SOMs)
Self-organizing map (SOM) belongs to a category of Artificial Neural Networks (ANN) called competitive learning networks. The first author of SOM Teuvo Kohonen [21], simply defines the methods as map reflecting topological ordering. SOM uses a lattice L of n neurons. The arrangement and weights of the neurons are determined by the input set Χ ⊆ ℝ d and an updating/training algorithm. The design of the algorithm is such that it positions the neurons within the neuron space in a way to preserve both distribution and topology. During the training process, a weight vector w i ∈ ℝ d is assigned to each neuron i ∈ L. The weights are also referred to as "prototypes" or "codebook" vectors. Further, the vector w i represents the position of the i th neuron in the feature space. Each datum is mapped onto a neuron associated with the nearest weight vector, e.g. the one with the smallest Euclidean distance from the data pattern, but any other similarity metric can be used. Finding the best-matching unit (BMU) is considered the most computational and important tasks associated with a SOM algorithm. The SOM organizes itself during a competitive and unsupervised learning process. Each pattern is shown to the SOM (randomly or sequentially) and the closest node becomes the "winner". The learning process yields a map G L ¼ ðΦ L→X ; Φ X→L Þ, which, defines two mappings, and functions from an input vector x ∈ Χ to a neuron i ∈ L with a particular weight vector w i ∈ Χ. The two mappings are defined as follows [1]: Where d(x) corresponds to the neuron, which is closest to the i th datum. A typical SOM algorithm can be summarized as:

SOM algorithm
This technique has been widely applied in disciplines dealing with high-dimensional in the area of machine vision and image analysis, optical character and script reading, speech analysis and recognition, health, signal processing and radar measurements, industrial and other real-world measurements, process control, mathematical problems and artificial intelligence problems [1]. Most of the past evaluation of SOMs' performance focused on comparisons with other techniques, such as principal component analysis and k-means clustering [23] and while in another work for developing a SOM toolbox [39] involved performance test where computational requirements of the algorithms, i.e., computing time for different training methods, not the quality of the mapping or the reliability of the classes mapped. In this study, the SOM algorithm output will be determined by assessing the type of spectral signatures grouped together into the local subspace.

Archetypes
Archetype analysis has the aim to represent observations in a multivariate data set as convex combinations of extreme points [13]. Consider n × p matrix X representing a multivariate data set with n observations and p variables. The goal of archetypal analysis is to find k × p matrix Z that characterizes the archetypal patterns in the data, such that data can be represented as mixtures of those archetypes. Precisely, the archetypal analysis aims at obtaining the two n×k coefficient matrices α and β, which minimize the residual sum of squares: The elements are required to be greater or equal to zero and their sum must be 1, i.e., ∑ k j¼1 α ij ¼ 1 with α ij ≥ 0 and i = 1 . n . ‖.‖ 2 denotes an appropriate matrix norm.
The archetypes are convex combinations of the data points: Where, β is the second set of coefficients of the data set, n ×k is a matrix whose elements are required to be greater or equal to zero and their sum must be 1, i.e., ∑ n i¼1 β ij ¼ 1 with β ji ≥ 0 and j =1… k. These two fundamentals also define the basic principles of the estimation algorithm: it alternates between finding the best α for given archetypes Z and finding the best archetypes Z for given α; at each step several convex least squares problems are solved, the overall RSS is reduced successively [14].

Multiplicative scatter correction
Multiplicative scatter correction (MSC) was proposed by Isaksson and Naes [16] to correct for light scattering or change in path length for each sample estimated relative to that of an ideal sample. In principle this estimation should be done on a part of the spectrum that does not contain chemical information, i.e. influenced only by the light scattering. However, the areas in the spectrum that hold no chemical information often contain the spectral background where the signal to noise (SNR) may be poor. In practice, the whole spectrum is sometimes used. This can be done provided that the chemical differences between the samples appear to have the same scatter level as the ideal. As an estimate of the ideal sample, we can use for instance the average of the calibration set. MSC performs best if an offset correction is carried first. For each sample: Where X i is the NIR spectrum of the sample, and X j symbolizes the spectrum of the ideal sample (the mean spectrum of the calibration set). For each sample, a and b are estimated by ordinary leastsquares regression of spectrum X i versusX j spectrum over the available wavelengths. Each value X ij of the corrected spectrum X i (MSC) is calculated:

Standard normal variate
Standard normal variate (SNV) has been proposed for removing the multiplicative interference of scatter and particle size [3]. The SNV transformation centers each spectrum and then scales it by its own standard deviation: Where X ij is the absorbance value of spectrum i measured at wavelength j, X i is the absorbance mean of the uncorrected in the spectrum and SD is the standard deviation of the p absorbance values, Spectra treated in this manner have always zero mean and variance and variance equal to one, and are thus independent of original absorbance values.

De-trending
De-trending of spectra accounts for the variation in baseline shift and curvilinearity of powdered or densely packed samples by using a second-degree polynomial to correct the data [3]. De-trending operates on individual spectra. The global absorbance of NIR spectra is generally increasing linearly with respect to the wavelength, but it increases curvilinearity for the spectra of densely packed samples. A second-degree polynomial can be used to standardize the variation in curvilinearity: Where X i symbolizes the individual IR spectrum and λ * the wavelength. For each sample, a , b and c are estimated by ordinary least squares regression of spectrum X i (DTR) is calculated by: Normally de-trending is used after SNV transformation. It has been demonstrated that MSC and SNV transformed spectra are closely related and that the difference in prediction ability between these methods seems to be fairly small [11,15].

Saviztky-Golay derivatives
Noise within spectral data can be removed by Savitzky-Golay smoothing [31]. In this method, a polynomial least-squares fit is performed on a spectral window around spectral point j of i th sample. The corrected spectral point (x ij new ) is estimated using this calculated polynomial model. Subsequently, the window is shifted to a spectral point (j + 1), and the procedure is repeated until the entire spectral range is smoothed. Savitzky-Golay smoothing is also used in combination with 1st and 2nd derivatives from the spectral data [36].

Spectral library
To test our approach for determining spectral subspaces, we used MIR spectral libraries from Africa Soil Information Service (AfSIS) project covering sub-Saharan Africa region.

Field sampling
Sampling for AfSIS library was carefully executed to obtain representative soil samples covering approximately 18.1 million km 2 of the nondesert, including Madagascar [46]. To achieve this 60, 10 × 10 km sized "Sentinel Sites", stratified by the major Koppen-Geiger climate zones of Africa [25], excluding some of the African countries which were no-go zones due to security reasons were used. Each sentinel site was subdivided into 16 sampling units (clusters), each cluster was further split into 10 smaller sampling units (plots). The sampling plot was designed to sample approximately 1000 m 2 (0.1 ha or 30 * 30 m) area and saved into a Geographical Positioning System (GPS) unit. Field crewmembers easily navigated the geo-referenced plots with the help of the GPS unit but when a point led to a difficult point to sample an alternative plot was established nearby and the new coordinates are saved into the GPS unit. Within a plot, four subplots were identified.
To determine the subplot layouts [43], one field crewmember stood at the center marked as subplot 1 as shown in Fig. 2.
In the general direction of downslope, subplot 2 was marked at 12.2 m. To mark the upslope sub-plots 3 and 4 (wings of the Y-frame in Fig. 2), the field crewmember standing at subplot 1 broadcast his outstretched hands backwards facing the downslope subplot 2 with the measuring tape at the end of the hand, pulled back the tape to the center of his chest and marked the position of the lefthand side subplot 4 at 12.2 m. The stretching approximated 120°the angle between the subplots. Similarly, the crewmember pulled back the tape to the center of his chest and marked the position of the right handside subplot 3 at the same length of 12.2 m. Four pegs each about 1 m lengths were prepared and labeled 1, 2, 3, and 4. These pegs were used for marking the center points of the subplots. Using a soil auger samples were collected at 0-20 cm and 20-50 cm from the four subplots then composited to give a representative plot sample for each depth.

Laboratory analysis
First all soil samples were air-dried and then large clods were crushed to pass through a 2 mm sieve. All samples received in the laboratory were analyzed for MIR spectra and 10% of the samples were subjected to reference analysis using wet chemistry for a wide range of soil properties but for this study we focus on pH, Mehlich-3 Aluminum (m3.Al), Mehlich-3 Calcium (m3.Ca), total carbon, clay and sand.

Soil sample analysis using wet chemistry methods
The selected samples for reference analysis were thoroughly mixed before scooping. This was to ensure that a homogenous subsample was selected and a similar one was left in the bag, which was to be used for MIR analysis. Soil property analysis by wet chemistry methods was done according to the methods described by Awiti et al. and Brown et al. [2,7].

MIR spectral measurements and pretreatments
The soil samples were air-dried and then finely ground to powder (approximately b100 μm) using a sample mill. The pure minerals were also finely ground. The fine samples were then loaded into aluminum microtiter plates (A752-96, Bruker Optics, Karlsruhe) using a micro spatula to fill the 6-mm diameter wells and leveled. Samples were loaded into four replicate wells, each sample was scanned 32 times in MIR reflectance mode using a Fourier-transform MIR spectrometer (FT-IR; Tensor 27, Bruker Optics, Karlsruhe, Germany) with a high throughput screening extension arm using a liquid Nitrogen cooled HgCdTe detector. A single spectrum obtained for each sample was later transferred to a desktop computer where it was converted and combined into a single flat data table. Halloysite and quartz were obtained from James Hutton institute mineral collection while Illite and Montmorillonite were ordered from the Clay Mineralogical Society.

Spectral subspaces and calibration models
First, the two spectral libraries were split into a training set (70%) and a testing set (30%) of the total number in each library using conditioned Latin hypercube sampling algorithm as implemented in 'clhs' R package [28]. This selection was carefully done to ensure that samples from the same sampling point are kept together i.e., topsoil and subsoil from the same sampling point were either in the training or testing set. The combined spectra were preprocessed using (1) Savitzky-Golay 1st and 2nd derivatives with a smoothing interval of 21 points [37]; (2) standard normal variate (SNV); (3) SNV + Detrending; and (4) multiplicative scatter correction (MSC). Infrared data often contain systematic variation like an additive or multiplicative offset, which may be caused by scatter effects due to differences in particle sizes, chemical interferences, or instrument drift. The preprocessing eliminates or reduces the impact of the non-relevant spectral information and often leads to simpler and more robust calibration models. These variations may complicate data analysis and interpretation.
Each set of the preprocessed spectra was used to generate spectral subspaces using the four methods aforementioned.

Spectral cosine angle correlation spectral subspaces (CACSS)
Using the preprocessed spectra, each sample spectrum was projected to one pure mineral spectrum at a time to determine the cosine angle between the two spectra vectors. The pure mineral giving the smallest angle was used to label the sample spectra to belong to the same subspace with that pure mineral. From trigonometry two similar vectors will have an angle of zero degrees between them and taking their cosine gives one. Similarly, the angle between two vectors will widen depending on how the two vary from one another.

Hit quality index spectral subspaces (HQISS)
Here is how we implemented a simple search algorithm, the absolute value algorithm or the hit quality index spectral subspaces (HQISS), to obtain sample spectral library subspaces matching with the spectra for the 11 pure minerals: 1. Pick one actual soil spectrum from the spectral library then subtract from each of the 11 pure minerals' spectra. 2. The result of the subtraction gives a residual, where the size of the residual is directly related to how similar two spectra are to each other. Identical spectra will have a residual equal to zero (a straight line). Dissimilar spectra give residuals less than or greater than zero. 3. Calculate the size of the residual by taking the absolute value of each data point in the residual, take their sum and then divide by the number of data points to get the HQI. 4. Rank the 11 HQIs' to get the lowest value for which corresponds to the pure spectra most similar to the sample. 5. Repeat 1 to 4 for each sample in the library. 6. Identify the subspaces where each spectrum in the library is mapped.

Self-organizing maps spectral subspaces
Excluding the pure minerals' spectra, the samples' spectra table was subjected to a self-organizing map (SOM) analysis to determine the subspaces formed by spectral features according to their similarities. The number of maps fit was decided based on the results obtained from spectral matching using CACSS and HQISS methods.

Archetype spectral subspaces
The hardest part in archetypal analysis is picking on the optimal or best number of archetypes. If prior information is available to the analyst to know the relevant archetypes contained in a particular dataset the known number is used otherwise elbow criterion of a residual function [14] which is the value corresponding to a minimum residual sum of squares (RSS) is used. We fitted four archetypes based on the results obtained from HQISS and CACSS, which also gave a reasonable value corresponding to the minimum RSS.

Model development
Random forest regression was used to calibrate spectra with pH, m3.Al, m3.Ca, total carbon, clay and sand. Global and local models were developed using spectra processed with the five methods explained above. The choice of RF method among the commonly used machine learning methods like partial least regression (PLSR) and principal component regression methods was based on its excellent ability to pick non-linearity relationship between predictors and response variables. Also, it has been reported to be simple in theory, fast speed when handling large data, fine-tuning mechanism to control overfitting and that it contains an automatic compensation mechanism on biased sample numbers of groups during the training process. Each sample in the testing set was predicted using a calibration model from a corresponding training set i.e. spectra from the same subspace and preprocessed using the same method. There are a number of methods used to assess model performance using test data. The commonly used methods include bias, root mean square error (RMSE) and the ratio of performance (RPD). Because the three statistics will often give similar information leading to the same conclusion, in this study we used RMSE values. Eq. (10) gives the formulae for calculating RMSE where y denotes the measured value and ŷthe predicted value, n is the number of samples and SD is the standard deviation of the laboratory-measured value for the soil property being predicted.
3. Results and discussion

Soil characteristics of global and local models
Descriptive statistics of the soil properties conventionally analyzed in the laboratory for different spectral subspaces are shown in Tables 1-4 for both calibration and validation data. Calibration soil samples are very diverse with soils ranging from very acidic to alkaline soils with pH values ranging from 3.61 to 9.86 but soil samples in each spectral subspace obtained gave narrower ranges. Soil samples in subspace 3 obtained with spectral archetype analysis had a pH range of 7.71-8.86 but a broad range of total carbon of 1.12-11.29%. In terms of soil texture, the samples vary from very sandy to very clay soils with equally high mean values for Al (821 ppm) and Ca (1842 ppm). Variation of the soil properties in different spectral subspaces is varying depending on how well the subspaces classified similar soil samples. For instance, CACSS and archetypes based subspaces are similar in terms of variations in clay content with most of their subspaces giving the highest standard deviation N20%. Coefficient of variation (%CV) values show m3.Ca had the largest variability of N 100 for archetype4 subspace in the calibration data with a similar %CV obtained from archetype1 of validation m3.Ca. pH %CV for both calibration and validation subspaces was the lowest (2.4-12.5).
Mean distributions of soil properties across the other subspaces were different. In Table 2, we see HQISS put samples with lowest mean carbon (0.4) and highest sand content (78.3) to subspace related with quartz. Samples associated with Illite and Montmorillonite subspaces gave the highest carbon (1.5) and clay (64) respectively. There were only 13 samples associated with Halloysite pure mineral in the calibration of HQISS and none in the validation set. Illite subspace was the most dominant while Halloysite is the least dominant with 13 samples, which were all from the calibration set, and none from the validation. Similar to the archetypal subspaces, m3.Ca within HQISS subspaces gave the highest %CV values N 180.
In Table 3 archetype subspaces seem to have been created based on carbon and soil texture variations. Samples put in archetype3 have the highest carbon (5.8) and the lowest sand (20.3) while archetype2 has the lowest carbon (0.6) but highest sand (58.6). Although archetypes 1 and 4 are rich in clay N50, they contain varying levels of pH (7.6, 5.7) and Al (649.7, 1201.8) respectively. It is likely that the high Al content in archetype4 contributes to the slight acidity of the samples in this subspace.
Similarly, SOM subspaces appear to have been created on the basis of soil texture and carbon variations with SOM4 subspace giving the highest carbon (2.1) and highest clay content (59.2) while SOM1 has the lowest carbon (0.7) and clay content (19.2). The most dominant spectral subspaces SOM2 and SOM3 consist of samples with highest Al (827, 1224) and equal low pH (6). Due to the high Al concentration and the low (pH ≤ 5.5), the soil samples falling into these SOM subspaces are likely to be acidic. Reyes-Díaz et al. [29] stated that toxic Al +3 results in a reduction of crop root growth and eventually overall plant toxicity leading to reduced crop yields. The overall variability for SOM subspaces was lowest among the four subspaces considered for m3.Ca %CV values of 82.6-108.2.
As expected, we found that the independent validation set had a similar distribution to the calibration set but with narrower ranges for the six-soil properties. This is a good indicator that the selected validation points fall within the boundary of the calibration space hence increasing the chance of being reliably predicted because they share similar features.

Distribution of MIR spectra within local spaces
Distribution of the samples within their local spaces is shown in Fig. 3 using score plot for the first two principle components (PCs) for all the 1906 samples used in this study. The first two PCs explain up to 74.4% of the original mid-infrared spectral variation, which comprises both physical and chemical soil information. Using different colors and labeling sample points according to their local subspaces, we showed how well some of the subspace methods discovered hidden structure in the global spectral library. For instance, the SOMSS gave well-separated clusters, labeled as SOM1, SOM2, SOM3 and SOM4. When the points were projected into a PC score plot and read side by side with the subspaces from HQISS it was easy to relate SOM1 samples with soil samples identified as close to the sample with pure quartz. Samples associated with SOM2 can be said to belong to sample class associated with pure Montmorillonite mineral. SOM3 gave mixed samples associated with Halloysite, Montmorillonite and Illite pure minerals as identified in the HQISS. SOM4 was also a mixed bag when related to samples identified in both the ArchetypeSS and HQISS. In the ArchetypeSS it is seen to be dominated by arche-type1 interspersed with the few samples assigned to archetype-3 and a mixture of samples associated with Montmorillonite and Illite. Using Tukey's test we found that mean total carbon between subspaces obtained using SOMSS and ArchetypeSS differed significantly in each subspace. We used HQISS to understand the common spectral features within a subspace. We averaged all spectra in each subspace and obtained a representative spectrum with different shapes and intensities Fig. 4. Some of the clay minerals found in soil include kaolinite, Halloysite, quartz, carbonate, gibbsite, Illite, and smectite in widely varying proportions [24]. Illite minerals are characterized by a broad and poorly defined hydroxyl stretching band near 3620 and 3630 cm −1 [24]. Illite rich soils are also referred to as desert loam soils and from spectral subspaces obtained they are dominant with about 80% of the samples grouped to be similar to Illite.
Montmorillonite is a subclass of the smectite clay mineral with a prominent absorption band centered at~1639 cm −1 according to  Table 4 Summary of soil properties for both calibration and independent validation set for self-organizing maps spectral subspaces. Yitagesu et al. [45], it is a typical water bearing clay mineral and it is associated with the bending vibrations of structural water molecules. For the averaged spectrum for all the samples associated with Halloysite shown in Fig. 4 (subplot c) has hydroxyl stretching vibrations at 3698, 3672, 3655 and 3622 cm −1 in the 3800-3000 cm − 1 regions. The characteristic bands between 1750 and 600 cm − 1 , which includes smaller sharp peaks at 1020 and 920 cm −1 can be said to be due to the alumino-silicate lattice vibrations and Al-OH deformation vibrations [45] respectively like in the case of kaolinite minerals which exhibit similar spectral characteristics to Halloysites. Two more bands observed at 1650 and 1530 cm −1 can be assigned to water bending modes and C-H in-plane bending vibration. Finally, the averaged spectrum representing the soils found to be spectrally close to quartz pure mineral spectrum as shown in Fig. 5 (subplot d) shows intense peaks in the regions 2000-1650 and 1080-700 cm −1 [24]. The fundamental O-Si-O stretching and bending frequencies at 1080, 800-780 and 700 cm −1 were found to be the most dominant bands in the infrared spectra of quartz-rich soils. In our study, we observed other two prominent peaks outside these regions at 1350 and 1220 cm −1 , which are dominated by C-H bending vibrations from organic materials. Fig. 6 gives scatter plots for the global calibration models showing predicted values against the actual measurement values. Similar scatter plots were found for archetype subspaces but with lower r 2 and higher RMSE values. We have not shown the scatter plot for the combined archetype models. Our results showed that the best RF model combinations for the Savitzky-Golay 1st derivative processed spectra are to be 500 trees but different numbers of random variables were tried at each split in the six calibration models (pH = 182; m3.Al = 388; m3.Ca = 40; total carbon = 40; clay = 19 and sand = 86). A similar number of trees was reported by McDowell et al. [22] for soil total carbon analysis using MIR data for 307 Hawaiian soil samples. But, their model used up to 396 random variables, which are about 10 times the number of variables, used in this study for total carbon. Soil pH was well calibrated (r 2 = 0.87 and RMSEC = 0.01). The result was as good as obtained by Terhoeven-Urselmans et al. [37] for the prediction of soil properties from a globally distributed soil MIR spectral library of 971 soil samples (r 2 = 0.81, RMSEC = 0.63). Similar results were also reported by Shepherd and Walsh [32] for the characterization of soil properties from a spectral library with 758 soils from eastern and southern Africa (r 2 = 0.83, RMSEC = 0.34). But, in terms of RMSEC, our results are much better from those previously reported. However, our model seems to overestimate alkaline soil samples, which can be attributed to fewer samples in this range. There were 182 wavebands found to be the most significant in predicting soil pH. These wavebands are 3683-3639; 2580-2306-; 2137-2098; 1709-1689; 1556-1400 cm − 1 (Fig. 7). These bands are associated with hydroxyl stretching vibrations, alumino-silicate lattice vibrations and Al-OH deformation vibrations [45] and very similar to the ones found by Terhoeven-Urselmans et al. [37] using a PLSR model. Both m3.Al and m3.Ca were satisfactorily calibrated with the MIR spectra (r 2 = 0.89 and RMSEC = 182.14; r 2 = 0.91 and RMSEC = 692.56;) respectively. The relatively high cross-validated RMSEC for m3.Ca can be attributed to the few points with high m3.Ca values which were under-predicted by MIR. A total of 388 important variables were reported for a m3.Al which occurred almost across the full MIR spectra range, from 3950 to 3664; 3554-3209; 2858-2173; 1957, 1871-1344; 1205; 962-632. These bands were mainly concentrated in the parts of the spectrum associated with Si-O-H vibration of clays, kaolinite and Fe oxides at 3719-3685 cm -1 , O-H stretching of Gibbsite at the bands 3525-3460 cm −1 and a small peak at 920 cm -1 associated with Al-OH deformation of kaolinite [24,38].

Random forests ensemble tree regression models
Soil total carbon was predicted well for the calibrations set (r 2 = 0.93; RMSEC = 0.06). Terhoeven-Urselmans et al. [37] reported a lower accuracy (r 2 = 0.77) for similar diverse calibrations samples while McDowell et al. [22] reported higher accuracy (r 2 = 0.96) but with a large RMSEC probably due to a wide range of total carbon in the calibration set (0.24 to 55.29%) compared to (0.11 to 11.3%) of total carbon used in this study. Important wavebands for total carbon were 40, from 2121 to 2114; 1794-1736; 1537-1500; 1375-1360;  1022-1018 cm − 1 Fig. 7. These are the ranges associated with C_O stretching [38] at 1775-1711 cm −1 and 1350-1550 regions which contain absorption mainly resulted from soil calcium carbonate, and a stronger absorption meant a higher calcium carbonate content and a higher soil pH [12].
Predictions for particle size were good, for clay (r 2 = 0.79; RMSEC = 1.56) and for sand (r 2 = 0.78; RMSEC = 3.4). However, the sand random forest regression tended to under-predict sand content for samples with actual measurement of sand N 50% samples while samples with clay b 50% were over predicted Fig. 6. Our results were broadly similar to those of previous researchers [26,37] in terms of r 2 values but with higher RMSEC values than those obtained in this study. Important wavebands for clay were 2731-2700; 1228-1205; 1084 cm −1 , which mainly correspond to quartz and other clay minerals [18]  which also overlapped with important variables found in sand prediction. Additional wavebands in the regions 2285-2025, 1751-174 and 1423 cm −1 were found to be important for sand prediction, which corresponds to alumino-silicate lattice vibrations and Al-OH deformation vibrations [45].

Model predictive performance
Validation statistics calculated from both global and subspace models show that 1st and 2nd derivative processed spectra gave the best models with highest Q 2 values. Q 2 is the cross-validated r 2 [44] for the independent validation set. Mehlich-3 Ca model gave poor predictions Q 2 b 0.6 except for the HQISS 1st derivative processed spectra and CACSS 2nd derivative processed spectra subspace models, Table 5. Few points with high m3.Ca could have caused the poor calcium models. Although, MSC preprocessed spectra calibrated well with most of the soil properties they gave low Q 2 indicating that the predictions were so poor, and do not predict better than chance. Clay and sand models gave stable predictions for the 1st derivative preprocessed spectra. Combining SNV and detrending did not give better predictions than SNV only.
Additional results showed RMSE values obtained from the independent validation set using subspace and global models are given in Table 6. In general, we found predictions from the global models outperformed subspace models in many instances except in a few of them. Sand and clay RMSE values from ArchetypeSS are N50% higher than all the sand and clay global models except for the MSC preprocessed spectra which were lower b 12%.
SOMSS models predicted sand content much better with lower RMSE values than the global one except for the 1st derivative preprocessed spectra. The second best-predicted soil property using the SOMSS is m3.Ca, which had lower RMSE value except for the 1st and 2nd derivative preprocessed spectra. Local models for m3.Al and total carbon mostly gave high RMSE values compared to the global models. Although the CACSS local models were poor compared to the global ones with RMSE values in the range of 2-30%, pH model for the 1st derivative preprocessed spectra gave RMSE value equal to the RMSE for the global model. However, MSC processed spectra in the CACSS gave lower RMSE value of about 8% lower than the global one. 1st derivative preprocessed spectra gave total carbon local models with RMSE values in the range, 0.41-0.43 which is almost equal to the global model with RMSE of 0.42. RMSE values from the MSC preprocessed spectra were the highest among the five spectral preprocessing methods in both the global and local models. This seems to agree well with previous work done for modeling soil carbon fractions using visible near-infrared and mid-infrared spectroscopy [19] but contradicts previous work [8] who used partial least squares (PLS) regression to predict soil organic carbon using near-infrared spectra. In summary, total carbon, clay and sand gave stable modes while pH, m3.Al and m3.Ca gave models with poor predictive performance. Based on these results it is possible that the type of analytical method for acquiring soil properties measurements data influences model predictive performance. Because it is beyond the scope of this current study we suggest that methods for minimizing or controlling analytical measurement errors should be investigated.

Conclusion
We did not find evidence in these results to support the main hypothesis of this study. We, therefore, conclude that global models are more accurate than the local ones. Although our findings are at variance with other reported work [42]. However, Ramirez-Lopez et al. and Sankey et al. [28,30] got similar results to ours and concluded that global models predicted the validation dataset better than the local ones. Spectral data processing using Savitzky and Golay algorithm outperformed the other methods with the 2nd derivative giving the best models for pH, m3.Ca, total carbon and clay while the 1st derivative method gave the best models for m3.Al and sand. On the other hand, MSC preprocessed spectra gave predictions with largest RMSEP values relative to all the other methods. This means that MSC preprocessed spectra may have a larger signal to noise ratio either caused by the removal of valuable information or the method was unable to filter out all the irrelevant information. We therefore suggest that future studies need not to use MSC as the only spectral preprocessing method because it may lead to models with low predictive accuracy. The ability of the HQISS to group soil MIR spectra according to how they are similar to the four pure mineral spectra confirms MIR spectral signatures are due to vibrations of molecular groups within minerals and organic molecular groups [18]. Since the CACSS did not form well-separated clusters within the local models we suggest future research to consider modifying the method and include only the most informative regions known to contain mineral figure print. Also we recommend further testing of our proposed method to search for local subspaces in large spectral libraries. Other different model fitting methods like support vector machine neural networks and boosted regression trees may be worthy to be tested in a similar setup like for this study.

Conflict of interest
None.