Digital mapping of soil attributes using machine learning

- Mapping the chemical attributes of the soil on a large scale can result in gains when planning the use and occupation of the land. There are different techniques available for this purpose, whose performance should be tested for different types of landscapes. The aim of this study was to spatialize chemical attributes of the soil, comparing eight methods of prediction. Forty morphometric attributes, generated from a digital elevation model, were used as independent variables, in addition to geophysical data, images from the Landsat 8 satellite and the NDVI. All possible combinations between the satellite bands were calculated, generating 28 new variables. Combinations between the Th, U and K bands obtained from the geophysical data were also calculated, generating a further three variables. The final variables to be calculated were the distances between the four points of the edges of the basin (d1, d2, d3 and d4). The dependent variables for the model were Al, Ca, Fe, K, Mg, Na, Si, Ti, Cr, Cu, Mn, Ni, P, Pb, V, Zn, Zr, S and Cl. A total of 200 soil samples were used, which were collected from 100 points at two depths (0-10 and 10-30 cm); the total elements were determined using an X-ray fluorescence analyzer. The Random Forest algorithm proved to be superior to the others in predicting the chemical attributes of the soil at both depths, and is suitable for predicting soil attributes in the study region. Spatial variables are essential, and should be considered when modelling chemical elements in the soil. Using the methods under test, it is possible to predict elements with R² values ranging from 0.32 to 0.62.


INTRODUCTION
New computational techniques have been presented as an alternative tool for mapping soil classes and attributes, providing greater speed, repeatability and error recognition, which are seen as the greatest advantages over conventional methods . Due to the increasing need for more-detailed information on soils for various applications, digital methods can contribute significantly when planning land use and occupation.
Digital mapping is based on the SCORPAN model, in which a given soil class is a function of five factors of the ClORPT model (climate -Cl, organisms -O, relief -R, parent material -P and time -T), added to the factors of soil (s), and spatial or geographic position (n) (McBRATNEY et al., 2003). This new model not only allows the classes of soil to be mapped, but also the soil attributes.
Since the end of the 1960s, there has been an emphasis on what might be termed "geographical" or "purely spatial" approaches, i.e. soil attributes would be predicted from spatial position, largely by interpolation between the observation sites (McBRATNEY et al., 2003).
As such, some studies began to test the use of spatiality to predict the attributes, not only of soils, but also other elements of the physical environment (DAVIES; GAMM, 1969;KISS et al., 1988), with the appearance of such interpolation methods as kriging, among others. However, the use of spatial tools alone has led to the development of models where the trend surfaces are fairly simplified and "artificial" representations, and where the more complex spatial patterns generally need to be modeled (McBRATNEY et al., 2003). This led to the adoption of hybrid models that combine different methods, such as the use of classifying algorithms like Random Forest and kriging. These are combined with the aim of improving both techniques, always including the spatial component, which is a potentially valuable and economical source of environmental information, and should never be disregarded (ARRUDA; DEMATTE; CHAGAS, 2013;McBRATNEY et al., 2003).
Various factors derived from the digital elevation model (DEM) have been tested as predictor variables OLIVEIRA et al., 2012;PINHEIRO et al., 2012;YANG et al., 2016) in different methods for generating digital maps. Several techniques have been evaluated for mapping attributes, among them neural networks, decision trees and multiple linear regression. However, no studies were found in the literature that mapped a high number of chemical elements in the soil, using, in addition to environmental covariates, spatial variables as predictors.
Under the hypothesis that there may be algorithms which are more efficient in predicting a large number of soil attributes with good R² values, where spatial variables may prove to be essential to the model, this work aimed to map the chemical attributes of the soil (the levels of Al, Ca, Fe, K, Mg, Na, Si, Ti, Cr, Cu, Mn, Ni, P, Pb, V, Zn, Zr, S, Cl) in a drainage basin located in the district of Iconha, Espírito Santo (ES), at two depths, using environmental and spatial covariates as predictors and comparing predictions of the Random Forest, Ridge Regression, Cubist and Partial Least Squares methods; principal component regression (PCR); Adaptive Forward-Backward Greedy Algorithm (FOBA); Generalized Boosted Regression Models (GBM) and Gradient Boosting with Componentwise Linear Models (GLMBOOST).

MATERIAL AND METHODS
The work was carried out in the drainage basin of the Ribeirão Inhaúma; an area of 2,403.9 ha, centered on 21°10'58.82" S and 41°00'08.87" W, and located in the district of Iconha, in the south of the state of Espírito Santo.
In order to predict the chemical attributes of the soil, it was necessary to determine the independent variables that would be tested. The digital elevation model (DEM), obtained with data from the ALOS Satellite (Advanced Land Observing Satellite) at a spatial resolution of 12.5 m, was used to generate 37 morphometric covariates (Table  1), using a script developed in the R software (RSaga) for applying the terrain-analysis tools included in the free SAGA software (BÖHNER; MCCLOY;STROBL, 2006) together with the land use and occupation map.
The Landsat 8 (LS8) satellite bands were also used as independent variables, the NDVI (Normalized Difference Vegetation Index) was calculated using the LS8 bands in the equation proposed by Rouse et al. (1973), where it is expressed as the ratio between the difference in reflectance measured in the near red (ρ -IV) and infrared (ρ -V) channels, and the sum of these channels, so: (1) Aerogeophysical data, such as gamma spectrometry and magnetometry, obtained from the Mineral Resources Research Company (Companhia de Pesquisa de Recursos Minerais -CPRM), were also used ( Table 1).
All possible combinations between the bands were calculated, generating 28 new spectral covariates. These were obtained, by means of the relation: Digital mapping of soil attributes using learning machine (2) where: B xy is the result of the band x to band y ratio, and b x and b y represent the Landsat 8 satellite bands.
For the variables derived from the gamma spectrometry, the same ratio was used as in equation 2, where combinations were made between the estimated values of thorium (Th), uranium (U) and potassium (K), generating three new variables. The distances between the four points at the edges of the basin (d1, d2, d3 and d4) were calculated as per the relation: (3) Uranium where: d AB is the distance between two points A and B, x 2 and y 2 are the coordinates of point A and x 1 and y 1 are the coordinates of point B.
The x and y coordinates were also considered as variables, i.e. latitude and longitude. All the maps were standardized with the same size of cells, lines and columns, and a resolution of 30 m. The maps from the MDE were re-sampled using the ArcGis 10.1 software.
The levels of Al, Ca, Fe Al, Ca, Fe, K, Mg, Na, Si, Ti, Cr, Cu, Mn, Ni, P, Pb, V, Zn, Zr and S were used as the dependent variables for the model. In order to obtain values for the attributes, 200 soil samples were collected in the field at a depth of 0-10 cm and 10-30 cm. The conditioned Latin hypercube method was used to set up the sampling grid, due to difficulty in accessing the area caused by the extremely mountainous relief. The coordinates of each sampling point were recorded with the Leica GS08 Plus GNSS receiver. The data were processed using the Leica Geoffice 8.0 software, employing the fixed station of the Brazilian Institute of Geography and Statistics (IBGE) in Vitória, ES, for transportation from the base to the area of the basin.
The soil samples were air dried, the clumps were removed, and the soil sieved through a 2-mm aperture mesh. In the laboratory, the samples were macerated in an agate mortar, sieved through a 1-mm aperture mesh, and then placed in a standard mold and manually compressed to form tablets, 15 mm in diameter and 2 mm thick, that were used to take readings in the X-ray fluorescence analyzer of the Soil Laboratory (ALVES et al., 2015) of the Federal University of Viçosa -UFV, to obtain the total content of the elements in each sample. For this, the Shimadzu Micro-EDX-1300 analyzer was used.
Results that presented outliers were identified and their data replaced with values estimated by means of regression imputation, where the new value is calculated through regression of the other values of the variable, thereby avoiding a reduction in the number of analytical data.
The response variables at both sampling depths were tested for normality by means of the Kolmogorov-Sminorv test (K-S) (p<0.05).
Attribute maps were generated using the R v3 software. To select the most important independent variables, those with a correlation index greater than 0.99 were eliminated. Subsequently, each model selected the most significant variables to predict each of the attributes, these being used to generate the final maps. Simplified models were sought, i.e. models that used the smallest possible number of variables with satisfactory R² values, where a loss of up to 5% in the R² values was allowed in favor of a more economical model.
Point values for each of the independent variables were extracted using the ArcGis software, employing the 'extract value to points' command. In order to have greater area representativity, the mean of the central point with its neighbors was considered as a typical value for the sampled site.
The performance of the models was evaluated using the cross-validation procedure with 10 repetitions, in which comparisons were made between the observed and predicted values of each dependent variable. These values were expressed by the coefficient of determination R². The index was calculated according to the equation: (4) where: Pi e Oi are the predicted and observed values at location i respectively, and n is the number of samples.
The RMSE value (root mean square error) was also calculated.
( 5) where: RMSE is the root mean square error, and Ɩ is the number of points intended for validation.

RESULTS AND DISCUSSION
In relation to the predictions of the models under test, only those elements whose R² value was greater than 0.30 for at least one of the methods are presented. As a result, the applied methods were not efficient in predicting Ca, Mg, Na, Si, Cr, Cu, Ni and Zr at the depth of 0-10 cm, or Mg, Na, Si, Cu, Mn, Ni and Zn at 10-30 cm. Table 2 shows the descriptive statistics for the modelled elements at both depths, i.e. Al, Fe, K, Mn, P, Pb, S, Ti, V and Zn at 0-10 cm, and Al, Ca, Cr, Fe, K , P, Pb, S, Ti, V and Zr at 10-30 cm.
With the set of data from the first depth, a negative asymmetric distribution was seen for Al, K and Zn, and a positive asymmetric distribution for the remainder, while at a depth of 10-30 cm, negative asymmetric distribution occurred only for Al and Fe. With positive asymmetry, mean values are generally higher than the median, indicating a high frequency of values below the mean (LIMA et al., 2010). Such behavior was seen for P, S and Zn at 0-10 cm and for all the elements at 10-30 cm, except Fe.
Negative asymmetry indicates concentration of the data (tail elongation) to the left of the mean, and positive asymmetry shows the data concentrated to the right of the mean. Values closer to zero indicate greater symmetry and, therefore, normal data distribution (GROENEVELD; MEEDEN, 1984).
Kurtosis indicates the degree of data flattening, where the distribution is classified as leptokurtic (when the kurtosis value is <0.263), mesocurtic (=0.263) or platicurtic (>0.263). At both depths, the data have a leptokurtic distribution, characterized by a more funneled curve, with a higher peak than normal (mesocurtic), except for P and S (0-10 cm) and Ca (10-30 cm), which have a platicurtic distribution, this curve being flatter than either the mesocurtic or leptokurtic.
Digital mapping of soil attributes using learning machine At the depth of 0-10 cm, the R² values (Table 3), calculated based on the regression between the observed values and those predicted by the models, show that RF was superior for three of the variables (Fe, Mn and V), with RIDGE superior for two (S and Zn) and the remainder (PLS) for one. PLS and PCR were equally efficient for predicting P (R² = 0.42), while RIDGE, cubist, PLS, PCR and FOBA for Ti (R² = 0.50). R² values at this layer ranged from 0.23 to 0.54.
Regarding the lowest R² values, FOBA stood out for not presenting any of the lower correlation coefficients. For Al, Mn, P and Pb, more than one method presented the lower values. In general, PLS, PCR, GBM and GLMBOOST presented lower predicted R² values.
In relation to the elements, the highest R 2 values were found for Ti (0.50) and V (0.54). Ti minerals are very weather resistant; apparently under reducing conditions Fe 2+ ions are adsorbed on the surfaces of Ti minerals, Ti being able enter the structure of some silicates and probably adsorbed on the surface of Fe-Mn concretions. As it is one of the most stable elements, Ti is present in small amounts in the soil solution, about 0.03 mg/L, (KABATA-PENDIAS, 2010).
V is distributed fairly uniformly in soil profiles, and variations in soil content are inherited from the parent materials (KABATA-PENDIAS, 2010). As such, the highest concentrations of V (up to 500 mg/kg) are reported for Cambisols (KABATA-PENDIAS, 2010), levels which probably facilitate their mapping, since this type of soil is widely distributed over the study area. The V in soils seems to be mainly associated with hydrated Fe oxides and with organic matter. In some soils, clay minerals can also control the mobility of this element (KABATA-PENDIAS, 2010).
As expected, the RMSE values (Table 4) tended to be smaller the higher the R² value. In predicting P, where R² was the same for both PLS and PCR, it can be inferred that PCR was superior due to the lower RMSE value. However, the same was not possible for S or Ti, whose RMSE values remained the same for the best predictors.
In the 10-30 cm layer, there was greater disparity as to the best predictor in relation to the R² values (  Table 3 -R 2 values generated by the three methods for predicting chemical attributes of the soil at a depth of 0-10 cm can generate differences in the prediction ability of each model. RF was highlighted in six variables (Al, Ca, Fe, K, P and Pb) and was equal to FOBA for the variable Zr, with R² = 0.34. FOBA was superior for Cr, GBM for Ti and V, and PLS and PCR obtained the same R² value for S (0.37). The R 2 values ranged from 0.24 to 0.62.
As in the 0-10 cm layer, V and Ti were the elements with the best prediction, presenting an R² of 0.62 and 0.56 respectively. RIDGE, cubist and GLMBOOST proved not to be efficient predictors in the 0-30 cm layer, obtaining none of the higher R² values, and having some of the lowest values found for the correlation coefficient. FOBA, Table 4 -RMSE values found for Al, Fe, Mn, P, Pb, S, Ti, V and Zn in the 0-10 cm layer, for the different methods under test *Al, Fe, K, S, Ti in dag/kg, and Mn, P, Pb, V Zn in ppm. Random Forest-RF, Ridge Regression-RIDGE, Partial Least Squares-PLS, Principal Component Regression-PCR, Adaptive Forward-Backward Greedy Algorithm-FOBA, Generalized Boosted Regression Models-GBM and Gradient Boosting with Component-wise Linear Models -GLMBOOST Digital mapping of soil attributes using learning machine just as in the 0-10 cm layer, obtained none of the lower R² values, the same as RF in this layer.
RMSE behavior in the 10-30 cm layer was similar to the first layer, i.e. the lowest values were generally found for the highest estimates of R² (Table 6). The only exception was for P, where the lowest RMSE value was obtained with RIDGE. With lower RMSE values, it can be inferred that RF was superior to FOBA in predicting Zr, whose R² had been similar. S presented its highest R² values with the PLS and PCR methods, however with similar RMSE values in all methods.
For the principal variables used in the prediction models, in the 0-10 cm layer half of the mapped elements (Al, Fe, K, Pb and S) considered the geophysical data to be important variables. The following should be mentioned: the ratio of uranium to potassium (Al, Pb), the ratio of thorium to potassium (Al, S), uranium (pb), potassium (Al, S), thorium (K) and the analytic signal of the total magnetic field (Al, Fe, Pb).
For the 10-30 cm layer, with the exception of Fe, all the elements considered the geophysical data to be important. Including the ratio of uranium to potassium (Al, Ca), the ratio of thorium to potassium (Ca, S), the ratio of uranium to thorium (K), uranium (Zr), potassium (Al, Cr, S), thorium (K, Ti, V), the total magnetic field (Cr), the analytic signal of the total magnetic field (Cr), and the total magnetic field exposure rate (P).
In relation to the variables created by the distance relationships, bands and geophysical data, these were broadly selected by all the elements as important covariates in prediction. The ratios between the bands were widely used (K, Mn, V and Zn at 0-10 cm, and Al, Cr, K, Ti and V at 10-30 cm), as well as the ratios between K, U and Th (Al , K, Pb and S at 0-10 cm, and Al, K and S at 10-30 cm).
Except for Al and K in the 0-10 cm layer, and for K and Pb in the 10-30 cm layer, the distance relationships (x, y, d1, d2, d3 and d4) were selected as the principal relationships, and for Ti and P, were the only relationships selected in the first layer.
These purely spatial approaches are almost entirely based on geostatistical methods, such as kriging and co-kriging, with one of their problems being the artificial boundaries they establish on the map (MCBRATNEY et al., 2003). Such artificial boundaries were seen in this work, especially for those elements whose spatial variables were the most relevant, appearing either exclusively or in the top positions, such as Ti and P in the 0-10 cm layer and S in the 10-30 cm layer (Figure 1 and 2).
When other non-spatial predictor variables are inserted, these limits are softened, as in the case of Fe and V at 0-10 cm ( Figure 1) and Ti at 10-30 cm ( Figure  2). The satisfactory R² values, ranging from 0.38 to 0.54 (0-10 cm) and from 0.32 to 0.62 (10-30 cm), show the relevance of considering these relationships in predictions and associating them with the terrain attributes.
In this study, the terrain attributes considered as most relevant predictor variables were: Terrain Surface Convexity (Al at 0-10 cm); Valley Depth (Al, Mn and Zn at 0-10 cm, and Ca and P at 10-30 cm); Real Surface Area (Al at 0-10 cm, and Ca, Fe and Pb at 10-30 cm); Terrain Surface Texture (K, Mn and V at 0-10 cm); Diurnal Anisotropic (K and S at 0-10 cm); Convergence Index (K at 0-10 cm, and Fe, Cr and K at 10-30 cm); Standardized