SoilGrids 2.0: producing quality-assessed soil information for the globe

SoilGrids produces maps of soil properties for the entire globe at medium spatial resolution (250 m grid cell size) using state-of-the-art machine learning methods to generate the necessary models. It takes as inputs soil observations from about 240 000 locations worldwide and over 400 global environmental covariates describing vegetation, terrain morphology, climate, geology and hydrology. The aim of this work was the production of quality-assessed global maps of soil properties, with cross-validation, hyper-parameter selection and quantification of spatially explicit uncertainty.


Materials and Methods
This study uses Quantile Regression Forests (Meinshausen, 2006), a method with few parameters to tune that has proven an effective compromise between accuracy and feasibility for large datasets. Selected primary soil properties, as defined and described in the GlobalSoilMap specifications (Arrouays et al., 2014), were modelled. The following sections describe each step of the workflow (Figure 1) in detail: 1. input soil data preparation; 2. covariates selection; 3. model tuning and cross-validation; 4. final model fitting for prediction; 5. predictions with uncertainty estimation.
As a result, not all data standardised in WoSIS are freely available to the international community. Hence, this study considers two 'sources' of point data.

Soil properties
For the purposes of SoilGrids, "soil" is the unconsolidated material at the Earth's epidermis, up to 2 m thick and in direct contact with the atmosphere; thus subaqueous and tidally-exposed soils are not considered here, nor are materials deeper than 2 m. This decision has consequences for computations of total stocks, in particular "soil" organic carbon.
Table 1 describes the soil properties that are considered in this version of SoilGrids: organic carbon content, total nitrogen content, soil pH (measured in water), cation exchange capacity, soil texture fractions, and proportion of coarse fragments.
These properties were modelled for the six standard depth intervals defined in the GlobalSoilMap specifications (Arrouays et al., 2014): 0-5, 5-15, 15-30, 30-60, 60-100 and 100-200 cm. 'Litter layers' on top of mineral soils were excluded from further modelling using the following assumptions. Consistency in layer depth (e.g., sequential increase in the upper and lower depth reported for each layer down the profile) in WoSIS was checked using automated procedures. In accord with current internationally accepted conventions, such depth increments are given as "measured from the surface, including organic layers and mineral covers" (FAO, 2006; Schoeneberger et al., 2012).
Prior to 1993, however, the start (zero depth) of the profile was set at the top of the mineral surface (the solum proper), except when "thick" organic layers as defined for peat soils (FAO-ISRIC, 1986) were present at the surface; then the top of the peat layer was taken as the soil surface. Organic horizons were recorded as above, and mineral horizons as below, the mineral surface (Schoeneberger et al., 2012, pp. 2-6). Insofar as possible, "superficial litter" on top of mineral layers was flagged as an auxiliary (Boolean) variable, also with reference to the original soil horizon designation when provided, so that it can be filtered out during auxiliary computations of soil properties.

Transformation of texture data
A transformation was applied to the texture fractions, as follows. The relative percentages of sand, silt and clay can be treated as compositional variables, as the sum of the components always equals 100%. Therefore, these components were transformed using the Additive Log-Ratio (ALR) transformation (Aitchison, 1986). ALR has previously been applied to soil texture data (Lark and Bishop, 2007; Akpa et al., 2014; Ballabio et al., 2016; Poggio and Gimona, 2017a), and it has been shown (Lark and Bishop, 2007) that ALR-transformed variables preserve information on the spatial correlation and maintain the compositional integrity of the original components. In this study, clay was used as the denominator variable. Therefore the two ALR components that were interpolated can be defined as alr1 = ln(sand/clay) and alr2 = ln(silt/clay).
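The ALR transformation with clay as the denominator, and its back-transformation to fractions summing to 100 %, can be sketched in Python as follows. This is an illustrative snippet only (the study itself used R); handling of zero fractions, which make the log-ratio undefined, is not covered here.

```python
import numpy as np

def alr_transform(sand, silt, clay):
    """Additive log-ratio transform of texture fractions (%), with clay as
    the denominator component. Fractions must be strictly positive."""
    return np.log(sand / clay), np.log(silt / clay)

def alr_back_transform(alr1, alr2):
    """Invert the ALR transform, recovering fractions that sum to 100 %."""
    e1, e2 = np.exp(alr1), np.exp(alr2)
    clay = 100.0 / (1.0 + e1 + e2)
    return e1 * clay, e2 * clay, clay  # sand, silt, clay

# Round-trip example with a hypothetical sample: 40 % sand, 40 % silt, 20 % clay
a1, a2 = alr_transform(40.0, 40.0, 20.0)
sand, silt, clay = alr_back_transform(a1, a2)
```

The back-transformation guarantees that predicted fractions are positive and sum to 100 %, which direct interpolation of the raw percentages does not.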

Spatial stratification of observations
Random splitting of profile observations into n cross-validation folds is not suitable in this context: considering the high spatial variation in observation density, it would provide biased results (Brus, 2014). For regions like Europe and North America there are over 4 profiles per 10 km², whereas for large countries in Asia, such as Kazakhstan, India or Mongolia, the number of available profiles is still quite limited (< 1 profile per 100 km²) (see Batjes et al. (2020) for further details).
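One practical way to build density-preserving folds at profile level can be sketched as follows. This Python snippet is illustrative only: the block size, seed and round-robin assignment are assumptions, not the exact SoilGrids procedure; what it preserves is that all observations of a profile stay in one fold and each fold mirrors the global spatial density.

```python
import numpy as np

def spatially_balanced_folds(lon, lat, profile_ids, n_folds=10, block_deg=5.0, seed=42):
    """Assign soil profiles (not individual observations) to cross-validation
    folds: profiles are grouped into coarse lon/lat blocks and dealt
    round-robin to folds within every block, so dense and sparse regions
    contribute to each fold in proportion to their density."""
    rng = np.random.default_rng(seed)
    blocks = {}
    for pid, x, y in zip(profile_ids, lon, lat):
        blocks.setdefault((int(x // block_deg), int(y // block_deg)), []).append(pid)
    fold_of = {}
    for members in blocks.values():
        rng.shuffle(members)
        start = rng.integers(n_folds)  # random offset so small blocks do not all feed fold 0
        for i, pid in enumerate(members):
            fold_of[pid] = int((start + i) % n_folds)
    return fold_of

# Synthetic example: a dense cluster (900 profiles) plus a sparse region (30 profiles)
rng = np.random.default_rng(0)
lon = np.concatenate([rng.uniform(0, 5, 900), rng.uniform(60, 100, 30)])
lat = np.concatenate([rng.uniform(45, 50, 900), rng.uniform(40, 55, 30)])
pids = list(range(930))
folds = spatially_balanced_folds(lon, lat, pids)
```

In this toy case the 900 clustered profiles are spread evenly across the 10 folds (90 each), so every fold sees the same observation density.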

Environmental covariates
Over four hundred geographic layers were available as environmental covariates for this work. These were chosen for their presumed relation to the major soil forming factors, including long-term soil conditions, i.e., the "time" factor. Appendix A provides a list of the products used as covariates and their sources. The layers considered can be grouped as follows: -Climate: temperature, precipitation, snowfall, cloud cover, solar radiation, wind speed.
-Geology: soil and sedimentary thickness, rock types.
-Land Use/Cover: from sources such as the European Space Agency (ESA) and U.S. Geological Survey (USGS).
-Elevation and terrain morphology: including numerous morphology indexes and landform classes.
-Vegetation Indexes: such as the Normalized Difference Vegetation Index (NDVI), Enhanced Vegetation Index (EVI) and Net Primary Production (NPP).
-Raw bands from Landsat and MODIS products.
-Hydrography: global water table, inundation and glaciers extent, and surface water change.
All covariates were projected to a common coordinate reference system (CRS), i.e., Goode's Homolosine projection for land masses applied to the WGS84 datum. This projection was selected because, among the equal-area projections supported by open source software, it is the most effective at minimising distortions over land (de Sousa et al., 2019). The projected covariates were imported to GRASS GIS in a normalised raster structure with cells of 250 m by 250 m. Covariates, and hence mapped areas, were restricted to land areas without built-up, water and glacier areas using a mask created from the ESA Land Cover layer for 2015 (Buchhorn et al., 2020). Thus properties of urban and subaqueous soils are not considered.

Covariates selection
Considering the large number of available environmental layers, a standardised and reproducible procedure to select the covariates used for modelling was implemented to i) reduce redundancy between covariates, ii) obtain a more parsimonious and computationally-efficient model, iii) decrease the risk of over-fitting (Gomes et al., 2019) and iv) avoid a biased assessment of variable importance (Strobl et al., 2008). The covariates selection procedure consisted of two steps: de-correlation and Recursive Feature Elimination.

De-correlation analysis
De-correlation analysis was carried out as an initial step to reduce the redundancy of information in the more than 400 environmental layers. Only covariate layers that had a pairwise correlation coefficient <= 0.85 with all other covariates were included in the subsequent analyses. For each pair of covariates correlated above this threshold, only the first one in alphabetical order was selected for inclusion in the modelling phase. This step reduced the number of initial covariates to approximately 150 layers.
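The selection rule above (keep a covariate only if its pairwise correlation with every covariate already retained is at most 0.85, preferring the alphabetically first member of a correlated pair) can be sketched as a greedy scan. This is an illustrative Python version; the exact implementation used in the study is not given in the text.

```python
import numpy as np
import pandas as pd

def decorrelate(df, threshold=0.85):
    """Greedy de-correlation: scan covariates in alphabetical order and keep
    a covariate only if its absolute pairwise correlation with every
    already-kept covariate does not exceed the threshold."""
    corr = df.corr().abs()
    kept = []
    for name in sorted(df.columns):
        if all(corr.loc[name, k] <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy example: "b" is a near-copy of "a", "c" is independent noise
rng = np.random.default_rng(1)
a = rng.normal(size=500)
df = pd.DataFrame({"a": a,
                   "b": a + 0.01 * rng.normal(size=500),
                   "c": rng.normal(size=500)})
selected = decorrelate(df)
```

Here "b" is discarded (correlation with "a" near 1), while "a" and "c" survive, matching the rule's preference for the alphabetically first covariate of a correlated pair.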

Recursive Feature Elimination
Recursive feature elimination (RFE) (Guyon et al., 2002) is a methodology that has proven effective for selecting an optimal set of covariates for regression tree models (Gomes et al., 2019; Hounkpatin et al., 2018). In this study, the RFE procedure implemented in the caret package for the R language (Kuhn, 2015) was used, as it offers a good compromise between accuracy and computation time. The algorithm starts by fitting a model using all covariates, assessing its performance and ranking covariate importance. The least important covariates are then removed from the pool, and the model is again fitted, assessed, and the least important covariates removed; the procedure repeats with progressively smaller pools of covariates. It is based on Out-Of-Bag (OOB) cross-validation and does not test all covariate combinations, but it is considered one of the most robust covariate selection approaches for models like Random Forests (Nussbaum et al., 2018).
Running the RFE procedure on the full set of observations and covariates would be computationally prohibitive, so additional steps were developed to improve feasibility for large datasets. Four sets of observations were used for RFE, each obtained using three cross-validation folds (see Section 2.1.3 for further details): set 1 contained folds 1 to 3, set 2 folds 4 to 6, set 3 folds 7 to 9, and set 4 contained fold 10 and two other randomly selected folds. In a first step, the RFE procedure from caret was run independently on each set with the default model hyper-parameters for the Random Forests algorithm as implemented in the ranger package (i.e., ntree = 500 and mtry = the rounded square root of the number of variables). In each set the optimal number and combination of covariates was automatically selected when the model performance stopped improving, i.e., when the loss function reached its minimum. In this study, the loss function was the OOB RMSE.
In the second step, the RFE procedure was applied with all observations and all covariates selected in at least one of the four sets used in the previous step.The final covariate set was the set minimising the loss function.
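The core of the procedure, backward elimination driven by OOB RMSE, can be sketched in Python as follows. The study used caret with ranger in R; the drop fraction, forest size and synthetic data below are assumptions for illustration, not the published settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfe_oob(X, y, names, drop_frac=0.2, seed=0):
    """Backward feature elimination using out-of-bag RMSE as the loss
    function: fit a Random Forest, record OOB RMSE, drop the least
    important fraction of covariates, repeat; return the covariate set
    that minimised the loss."""
    pool = list(names)
    best_rmse, best_set = np.inf, list(pool)
    while len(pool) >= 2:
        idx = [names.index(n) for n in pool]
        rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                                   random_state=seed, n_jobs=-1)
        rf.fit(X[:, idx], y)
        rmse = float(np.sqrt(np.mean((rf.oob_prediction_ - y) ** 2)))
        if rmse < best_rmse:
            best_rmse, best_set = rmse, list(pool)
        # rank importances and remove the weakest covariates from the pool
        order = np.argsort(rf.feature_importances_)
        n_drop = max(1, int(drop_frac * len(pool)))
        dropped = set(order[:n_drop])
        pool = [pool[i] for i in range(len(pool)) if i not in dropped]
    return best_set, best_rmse

# Synthetic example: y depends only on the first 3 of 10 covariates
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = 2 * X[:, 0] + X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=400)
names = [f"cov{i}" for i in range(10)]
selected, best_rmse = rfe_oob(X, y, names)
```

Because the loss is computed on out-of-bag samples, no separate validation set is needed at this stage, mirroring the OOB-based selection described above.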

Hyper-parameter selection and cross-validation
Figure 2 summarises the approach used for the selection of the model hyper-parameters and the cross-validation. Further details are provided in the following sections.
Each of the resulting combinations of the ntree and mtry parameters was used to train a different model with observations from nine folds. Predictions were then assessed on the remaining fold with classical performance measures, i.e., root mean squared error (RMSE) and model efficiency coefficient (MEC; Janssen and Heuberger, 1995). The MEC is the fraction of the variance explained based on the 1:1 line of predicted versus observed values, defined as 1 minus the ratio between the residual sum of squares and the total sum of squares. The final hyper-parameter selection was based on an optimisation of model performance and computational constraints, in this case memory consumption. For example, an increase of the ntree parameter above 200 provided a minor increment in the metrics (usually less than 0.1%, not reported here) while requiring considerably more memory and computation time.
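The two performance measures follow directly from their definitions; a small Python sketch with a hypothetical observed/predicted pair:

```python
import numpy as np

def rmse(obs, pred):
    """Root mean squared error."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((pred - obs) ** 2)))

def mec(obs, pred):
    """Model efficiency coefficient (Janssen and Heuberger, 1995):
    1 - SSR/SST around the 1:1 line, i.e. the fraction of variance
    explained by the predictions."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ssr = np.sum((obs - pred) ** 2)
    sst = np.sum((obs - np.mean(obs)) ** 2)
    return float(1.0 - ssr / sst)

# Hypothetical cross-validation fold: observed vs predicted pH values
obs = np.array([5.0, 6.0, 7.0, 8.0])
pred = np.array([5.2, 5.8, 7.1, 8.1])
```

Unlike the squared correlation coefficient, the MEC penalises systematic bias: a model whose predictions are perfectly correlated with, but offset from, the observations still scores below 1.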

Model fit
The final model for each soil property was fitted with all available observations, the covariates and the hyper-parameters selected in the previous steps. Observation depth, computed as the mid-point of the sampled layer or horizon, was included in the model as a covariate.
Models were obtained with the ranger package (Wright and Ziegler, 2017), with the quantreg option to build Quantile Regression Forests (QRF; Meinshausen, 2006). With this option the prediction is not a single value, e.g., the average of the predictions from the group of decision trees in the random forest, but rather a cumulative probability distribution of the soil property at each location and depth.
In order to compute the prediction uncertainty for soil texture, the back-transformation was applied at the level of individual tree predictions, and the quantiles of the tree prediction distributions obtained from the resulting values.
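Both ideas, predicting a distribution rather than a single value and working at the level of individual tree predictions, can be sketched in Python as follows. This uses scikit-learn rather than ranger, and approximates QRF by taking quantiles over per-tree predictions, a simpler scheme than Meinshausen's weighted-observation method used by ranger's quantreg option.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with heteroscedastic-looking structure
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(1000, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.3, size=1000)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Collect the prediction of every individual tree at new locations,
# then summarise the resulting distribution by its quantiles and mean.
X_new = np.array([[0.0], [1.5]])
per_tree = np.stack([t.predict(X_new) for t in rf.estimators_])  # (n_trees, n_points)
q05, q50, q95 = np.quantile(per_tree, [0.05, 0.50, 0.95], axis=0)
mean = per_tree.mean(axis=0)
```

Operating on per-tree predictions is also what makes the texture back-transformation above possible: each tree's ALR prediction can be back-transformed before the quantiles are taken, so the prediction interval is expressed in the original fraction units.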

Uncertainty
The percentage of cross-validation observations contained in the 0.9 prediction interval was calculated (Prediction Interval Coverage Probability, PICP) (Shrestha and Solomatine, 2006). Ideally the PICP is close to 0.9, indicating that the uncertainty was correctly assessed. A PICP substantially greater than 0.9 suggests that the uncertainty was overestimated, while a substantially smaller PICP indicates that it was underestimated.
Furthermore, to visualise the uncertainty as a map, the following indicators were calculated: 1. the 90% prediction interval (PI90); 2. the ratio of the inter-quantile range over the median (Prediction Interval Ratio, PIR): PIR = (q0.95 - q0.05) / q0.50 (3)
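Both indicators are straightforward to compute; a short Python sketch with hypothetical values:

```python
import numpy as np

def picp(obs, lower, upper):
    """Prediction Interval Coverage Probability: fraction of observations
    falling inside their prediction interval (expected ~0.90 for a 90%
    prediction interval)."""
    obs = np.asarray(obs, float)
    return float(np.mean((obs >= lower) & (obs <= upper)))

def pir(q05, q50, q95):
    """Prediction Interval Ratio (Eq. 3): inter-quantile range over the
    median. Undefined when the median is zero, which is why the indicator
    is problematic for properties with values at or near zero."""
    return (q95 - q05) / q50

# Hypothetical cross-validation observations and their 90% interval bounds
obs   = np.array([4.8, 6.1, 7.5, 5.0, 9.0])
lower = np.array([4.0, 5.5, 7.0, 5.2, 8.0])
upper = np.array([6.0, 7.0, 8.0, 6.0, 8.5])
```

In this toy case three of the five observations fall inside their intervals, giving a PICP of 0.6, i.e. a strongly underestimated uncertainty.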

Qualitative evaluation of spatial patterns
Expert judgement was used to evaluate the reasonableness of the maps, by comparing well-known spatial patterns at global, regional and local scales with the SoilGrids predictions (see subsection 3.4). Obviously these are not definitive evaluations, only indicative.

Software and computational framework
SoilGrids requires an intensive computational workflow with numerous steps integrating different software. SoilGrids is entirely based on open source software, in particular: SLURM (Yoo et al., 2003) for job management, GRASS GIS (GRASS Development Team, 2020) for data and tile management, and the R statistical software (R Core Team, 2020) for model fitting and statistical analysis.
Predictions were computed in a high-performance computing cluster. A dynamic geographic tiling system was developed with GRASS GIS to maximise the use of memory for each job. Technical details on this parallelisation scheme are given in de Sousa et al. (2020).
The predictions were multiplied by a conversion factor of 10 or 100 to maintain the required precision while using an integer data type in the GeoTIFF files, reducing the space occupied on disk. Application of the conversion factor resulted in mapped layers with units differing from those of the input observations (see Table 1).
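The scaling scheme can be illustrated as follows. The factor of 10 matches the pH example in Table 1; the nodata value of -32768 used here is an assumption for illustration, not a documented SoilGrids convention.

```python
import numpy as np

# Hypothetical pH predictions for four cells, one of them masked (NaN)
ph = np.array([[5.47, 6.82],
               [np.nan, 7.08]])
factor = 10          # conversion factor from Table 1 for pH
nodata = -32768      # assumed nodata sentinel for the int16 layer

# Scale, round and store as 16-bit integers, mapping NaN to the nodata value
scaled = np.where(np.isnan(ph), nodata, np.rint(ph * factor)).astype(np.int16)

# A user reading the layer divides by the factor to recover pH units
restored = np.where(scaled == nodata, np.nan, scaled / factor)
```

A 16-bit integer band occupies a quarter of the space of a 64-bit float band while retaining one decimal of pH precision, which is the trade-off described above.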
The total computation time with the selected covariates and hyper-parameters differed per property. On average, the complete computation of the 24 maps (mean and three quantiles for each of six standard depths) for a single property, including (i) RFE, (ii) model training and (iii) prediction, took approximately 1 500 CPU-hours. The prediction accounted for about two thirds of the total time.
Results and discussion

Figures 3 and 4 show examples of the observation density of the soil calibration data for two soil properties with a large difference in density: pH water and proportion of coarse fragments.
In future studies, it will be relevant to identify beforehand areas of the world with a low observation density that are not yet represented by densely observed areas with similar soil-forming factors. A set of synthetic profiles could then be generated to describe these areas, by consulting soil scientists knowledgeable on the soils and soil properties of these areas.

Model tuning and hyper-parameters selection
Model hyper-parameters selected for each property are presented in Table 3.
The number of covariates selected using the two-step approach for covariate selection was fairly small in comparison with the full set (Table 3), resulting in more parsimonious models. Figure 6 shows two examples of the RFE loss function for two soil properties with different numbers and distributions of input observations. In both cases there is a clear improvement in performance when using 15 to 20 covariates. The curve reaches the minimum of the loss function and then stays on a plateau, with a slight decline after the identified minimum.
All final models were trained with a maximum of 200 decision trees, a number beyond which performance did not noticeably increase.
The mtry parameter mainly depended on the number of covariates and was always between 1.5 and 2 times the square root of the number of covariates (the square root being the default in common random forest packages such as ranger; Wright and Ziegler, 2017). This confirms the need to determine optimum model hyper-parameters, especially when dealing with large numbers of input data (Nussbaum et al., 2018), as is the case here. Silt was less well modelled than the other two particle-size classes. This may be an effect of the chosen ALR transformation (Lark and Bishop, 2007). Metrics of the mean were always better than or equal to those for the median for all properties.
Overall, these metrics are in line with continental or large-region DSM studies (Keskin and Grunwald, 2018). However, they are slightly lower than those presented by Hengl et al. (2017b). The difference can be explained by the more prudent cross-validation approach now taken, with spatially balanced folds and all observations belonging to the same profile kept in the same fold. This prevents using data from the same profile both for calibration and numerical evaluation.
Table 5 shows the MEC for mean predictions by depth interval. Performance decreased with depth, in line with many other DSM studies (Keskin and Grunwald, 2018). This pattern can be explained in part by the reduced number of observations for deeper layers, but mainly by weakened relationships between environmental layers and soil properties of the deeper layers.
Table 6 summarises the PICPs, globally and by predicted depth interval. Most of the values are between 0.88 and 0.92, indicating that the prediction intervals obtained with QRF are a realistic representation of the prediction uncertainty, as the expected value for a 90% prediction interval is 0.90. Exceptions are the models for coarse fragments, with higher values around 0.95, indicating an overestimation of the prediction uncertainty. The texture components have values with a larger spread, around 0.78 to 0.80 for sand and closer to 0.96 for silt and clay. These indicate a potential under-estimation of the prediction intervals for sand and over-estimation for silt and clay. These results may be related to the range of these properties in the input observations. The transformation method used to derive the prediction intervals for the texture components could also be a contributing factor. Further exploration of the causes is worthwhile.
However, this is not feasible when only legacy data are available. In this case, a so-called "cross-validation" approach is often used. This needs to be tuned to avoid over- or under-estimation of the numerical evaluation metrics, especially in the case of large differences in observation density, i.e., clustered spatial observations. This is especially important at the global scale, as the distribution of the soil observations is not uniform across the globe. It cannot be guaranteed that the numerical evaluation metrics derived from cross-validation are unbiased estimates of the true numerical evaluation metrics, i.e., those that would have been computed on a probability sample of the whole population. It is also not possible to quantify how close the cross-validation metric estimates are to the true evaluation metrics, as it is not possible to obtain confidence intervals (Brus et al., 2011). When using cross-validation it is important to prevent over- or under-optimistic estimates. For example, it is likely that prediction errors are smaller in areas where the sampling density is higher. Because of their high sampling density, such areas will be over-represented in the sample, as the percentage of cross-validation points in clustered areas will be higher than the percentage of the total land area covered by those areas. Results of standard cross-validation will thus be strongly influenced by the performance in clustered areas. Spatial cross-validation as suggested by Meyer et al. (2018), where it is ensured that calibration data are never too close to a cross-validation point, could on the other hand produce over-pessimistic results. To address some of these concerns, this study adopted a practical solution in which the folds were created to guarantee a spatially balanced distribution between cross-validation folds, maintaining the same densities of the input data in each fold so that they represent approximately the same population.
Although the numerical evaluation procedure used in this work takes into account the spatial distribution of the observations and their density, further improvement is possible in both model training and evaluation. For example, the weights assigned to observations in heavily sampled areas could be reduced. The USA and large regions of Europe and Australia have very high numbers of observations that could be down-weighted to strengthen the spatial robustness of the evaluation procedure. Declustering or debiasing techniques (Goovaerts, 1997; Deutsch and Journel, 1998) have been applied with success in other geostatistical exercises and could be adapted to the particular case of global soil mapping. The creation of the folds could also be modified to take into account the density of the observations.

Qualitative evaluation of spatial patterns
At global scale well-known patterns are reproduced, and typical properties associated with many World Reference Base for Soil Resources (WRB) (IUSS Working Group WRB, 2015) Reference Soil Groups can be recognised.
Many regional patterns are also clear, for example the pH transition from the Sahara through the Sahel to the West African coast, the PSC transitions from the Des Moines glacial lobe to the pro-glacial loess deposits in Iowa (USA), and the PSC transition from the clayey marine sediments along the North Sea and Baltic Sea coasts through the sandy plains to the central German loess belt. The CEC map identifies contrasting areas of Vertisols (e.g., the "black belts" in Alabama/Mississippi and Texas, USA). The coarse fragments map shows the detailed pattern of the basin-and-range region of the western USA and the ridge-and-valley region of Appalachia.
However, at the local scale, a preliminary assessment of SoilGrids in the USA, compared with gSSURGO (NRCS National Soil Survey Center, 2016), a gridded version of the national soil geographic database based on detailed field survey, reveals that SoilGrids may fail to account for local parent material transitions, e.g., sedimentary facies of coastal plain marine sediments, as well as glacial features such as proglacial lacustrine sediments and relic beach lines. As a result, the local PSC pattern is not accurate, sometimes by amounts on the order of 20-30% of a particle-size class.
Important local differences are clear: the low sand concentrations of the clayey soils in the limestone valleys trending SW-NE and the high concentrations in soils from glacial till developed on sandstones in the north, as well as the residual soils on the resistant sandstone ridges of the Ridge and Valley province of Appalachia in the south.These do generally agree with the detailed soil survey.

Prediction uncertainty
In general, the least sampled areas present the highest prediction uncertainties. Figures 8 and 9 show examples for two properties and depths (maps for all properties and depths can be accessed at https://data.isric.org).
Figure 10 shows an example representing the quantiles of pH water for the 60-100 cm layer. The north of Russia and the centre and north-west of Canada are large regions for which few soil observations are available; therefore the prediction distributions are wider than in more densely sampled areas. However, these patterns differ between properties. For example, arid areas actually have the narrowest prediction ranges for pH water. The uncertainty range is often wide for properties and regions with a wider range of the property being modelled. This can be explained by the modelling approach performing more accurately within a limited range of values. Such regions also have larger local spatial variation, which makes prediction more difficult.
The communication of uncertainty is an open challenge (Arrouays et al., 2020). Uncertainty should provide information for policy makers and other stakeholders, and not only for scientists and modellers. The maps computed with Eq. 3 are a first step in this direction, but their limitations must be understood. For properties that have values at or near zero, e.g., coarse fragments, they do not provide an entirely accurate uncertainty estimate. The use of uncertainty classes could be a further step to help domain stakeholders. This product has some limitations, which will be considered in further work.

Figure 1. Workflow of the methodological approach.

Figure 2. Detailed workflow for the hyper-parameter selection and cross-validation.

Figure 4. Number of observations per grid cell (70 000 km²) for coarse fragments.

Figure 8. Mean soil organic carbon content (dg/kg) prediction and range between the 5% and 95% quantiles for the 5-15 cm depth interval.

Figure 9. Median total nitrogen prediction (cg/kg) and associated uncertainty for the 15-30 cm depth interval.

Table 1. Soil properties description and units.
Table 2 breaks down the distribution of the legacy soil observations for each soil property by depth interval. Table B1, in Appendix B, shows the number of observations by bio-climatic region.