Countrywide mapping and assessment of organic carbon saturation in the topsoil using machine learning-based pedotransfer function with uncertainty propagation

Stakeholders and policymakers have been becoming more and more interested not just in the potential organic carbon (SOC) saturation level of soils but also in spatially explicit information on the degree of SOC deficit, which can support future policy and sustainable management strategies


Introduction
Soil organic carbon (SOC) is widely known as a key property regarding fertility, soil health, and various ecosystem services.The SOC content may rapidly change in time and space, thus it varies in a wide range in soils due to environmental and anthropogenic conditions.However, the degree of possibly stabilized SOC is limited, which is often called saturation.There is an increasing demand on the determination, assessment and mapping of the potential saturation level of various soils as enrichment in SOC improves soil quality, provides an effective and economical solution for greenhouse gas control of the atmosphere (Lal, 2004a(Lal, , 2004b) ) and contributes to achieving land degradation neutrality (Keesstra et al., 2018;Stavi and Lal, 2015).Beyond the complex molecular interactions and processes between the mineral phase and organic matter content of the soil, which are mostly unknown (Inagaki et al., 2020), the term SOC saturation is suitable and applicable also for policymakers and stakeholders (O'Rourke et al., 2015).It, therefore, provides useful indicator for sustainable management.Accordingly, an adequate and proper map based database of potential SOC saturation is a necessity to take the four per mille (Minasny et al., 2017) objectives into practice.However, the theoretical saturation value of a certain soil is still not easy to determine, no universal prediction method is accepted (Sanderman et al., 2010).The difficulty is that many synergic and independent properties -such as climatic, geological, topographic, and land use conditions of the field and the mineralogical and grain size distribution of the soil -affect SOC storage and saturation (Briedis et al., 2018;Cotrufo et al., 2019;Fekete et al., 2021).
The actual SOC content reflects the dynamic equilibrium between the organic matter input and mineralization by the microbiome, highlighting the role of environmental properties rather than organic matter composition (Schmidt et al., 2011).If the organic matter production differs by orders from mineralization, the SOC content changes rapidly.The inhibited mineralization by water saturation, for example, causes organic matter accumulation such as histic horizons even under deficient biomass production (Aide et al., 2020).As under the above circumstances, SOC cannot be saturated, thus saturation concept should be limited only to mineral soils.Nevertheless, many mineral soils are also oversaturated based on different saturation models (Wiesmeier et al., 2014).
The stabilization of organic matter is thus related to its protection against microbial availability.This protection is reported as binding directly to the mineral surface (Chenu et al., 2019;Kögel-knabner et al., 2008) or by the shelter effect of soil aggregates (Six et al., 2002).In early studies, the ratio of the fine fraction (particles < 20 or 50 µm) was considered as the primary driving force of potential SOC saturation (Hassink, 1997).Besides, measurements prove that there are differences among land uses and soil depth in the fine fraction related SOC content (Chen et al., 2019(Chen et al., , 2018)).On average, only half of the SOC was associated with the fine fraction (Wiesmeier et al., 2014), thus Barré et al. (2017) stated that the SOC holding capacity of the fine fraction is irrelevant for the SOC saturation value of the bulk soil.The increasing number of investigations provided a wider dataset for identifying various indicators.In many studies, the mineral composition of the clay fraction was found to be the most important property determining the SOC concentration and, therefore, the saturation level.Especially the role of the sesquioxides or the extractable amount of aluminum was determined important (Rasmussen et al., 2018).In other studies, the specific surface area, the amount of exchangeable cations or the base saturation determined the SOC capacity mostly (Juhos et al., 2021).
Nonetheless, these indicators do not take particulate organic matter (POM) into account.According to the traditional approach, POM is of raw plant residues without relevant mineral protection, thus does not contribute to the saturation.In sandy soils, however, POM is the main fraction of soil organic matter (SOM) (Hanegraaf et al., 2009) and may also persist in the long run (Witzgall et al., 2021).Moreover, the SOC stock increase due to forestation or tillage shift is mainly triggered by the increasing POM content (Cardinael et al., 2015;Chimento et al., 2016).Thus the theoretical concept of saturation can only hardly be interpreted directly into practice.However, some recent papers still keep the traditional approach considering only the mineral phase associated OM as sequestered and the indicator of saturation degree (Chen et al., 2018).For instance, Chen et al. (2019) estimated SOC storage potential for the bulk soil and limited the term "SOC saturation" only to the mineral phase associated OM.In the present study, SOC saturation is meant as the highest available SOC content of the bulk soil independently of pools (Barré et al., 2017;Chenu et al., 2019;Six et al., 2002).
Nowadays, Barré et al. (2017) recommended two approaches for SOC saturation estimation: (i) predicting storage kinetics and estimate the final capacity value for a given land use scenario, and (ii) reference establishment predicting the highest SOC concentration.Both approaches are applicable only by using empirical observations of SOC content and storage and model-driven approaches.Thus, the application of the highest measured SOC content as the possible saturation limit is still an axiom.Barré et al. (2017) suggested the highest 10% of SOC contents for a given agroclimatic region under forest to be considered as the potential saturation value.However, the pedological and topographical heterogeneity for such an agroclimatic zone triggers low spatial representability of the highest 10% (Suleymanov et al., 2021).On the other hand, the created SOC saturation models are suitable only for those conditions which were considered during the model construction, thus the results cannot be extrapolated.Feng et al. (2013) suspected the linear regression SOC saturation models for large datasets being dubious due to the presence of unsaturated points and resulting in oversaturated predictions.Thus, both the measured data and modeldriven approaches should be considered parallel to achieve the best SOC saturation prediction in a countrywide scale investigation.
Since the late 1980s, pedotransfer functions (PTFs) have been developed to estimate soil properties, functions and services from other more easily and/or cheaply measurable soil properties to satisfy specific or complex demands on information relating to soils (Bouma, 1989;Van Looy et al., 2017).In the last few years, a number of novel approaches, techniques and thoughts have been adapted, for example, the application of machine learning (ML) techniques in developing PTFs (Makó et al., 2017;Shiri et al., 2017;Xiangsheng et al., 2016).Data-driven predictive models given by ML algorithms has proved to be highly efficient, however, we should note that resulting data-driven models are hardly interpretable (i.e., not really transparent for users).A further novelty is to use not just soil properties as predictors in developing PTFs but also further environmental covariates (Gupta et al., 2021;Szabó et al., 2019) characterizing the soil forming factors.Besides, a number of papers also pointed out that if there is spatially exhaustive information on the predictors available, then PTFs can also be used for predictive mapping (Gupta et al., 2021;Tóth et al., 2017;Zhang et al., 2018).However, no map is error free (Heuvelink, 2018) and thus the quantification and assessment of spatial uncertainty has become an essential part of every predictive mapping procedure in soil science (Heuvelink and Webster, 2022;Szatmári and Pásztor, 2019).This is of great importance, as it can help end-users to make appropriate and responsible decisions based on the resulting maps to achive the desired goals.
The objective of our study is to develop a machine learning-based PTF for predicting the saturated SOC content of the topsoils (0-30 cm) in Hungary with a resolution of 100 m, and then quantify and assess its uncertainty using error propagation analysis.The resulting map is then compared with the actual SOC content map of Hungary (Szatmári et al., 2019b;Szatmári and Pásztor, 2019) in order to assess the degree of SOC deficit in the country.

Concept for determining and mapping SOC saturation
The approach we used in this study is based on the principle that topsoils under permanent forest land cover due to the unlimited plant residuum input can be practically considered as saturated in SOC.About 30% of the territory of Hungary is covered by forest on a wide range of soil types, including Luvisols, Podzols, Arenosols, Cambisols, and even Chernozems, among others.Theoretically, some permanent grassland sites may be saturated as well, their involvement, however, is neglected owing to the less reliable land-use history or the role of occasional hydric conditions that might happen on these lowland sites or the effect of grazing.
The quasi SOC saturated forest topsoils have high SOC concentration variance, and thus they are not suitable for a general reference group.Besides, various landscape positions or changes in topographical properties must also affect the way and threshold of SOC saturation for the local scale.Accordingly, only topsoils with similar physicochemical properties (e.g., texture, pH) and environmental conditions (e.g., topography, climate) are comparable directly in terms of SOC saturation.Our concept, therefore, focuses on the measured soil properties and environmental conditions and aims to find relationship between them (i.e., fit a PTF), which can be used to predict saturated SOC content at unvisited sites.Using the developed PTF, the theoretical (potential) SOC saturation value will be predicted across Hungary.Comparing the saturated SOC content with the actual SOC content, the saturation deficit can be determined and assessed.

Reference soil profiles and harmonization
In this study, we used the soil profiles of the Hungarian Soil Information and Monitoring System (SIMS) (Fig. 1), which is a countrywide soil monitoring system providing geographically referenced biological, physical and chemical information on the status and temporal change of Hungarian soils.It consists of 4,859 soil horizons belonging to 1,236 soil profiles.The following three types of soil profiles are distinguished in SIMS (Fig. 1): (i) "I" points (n = 865), which have been placed on agricultural lands (i.e., arable lands, pastures and orchards), (ii) "E" points (n = 183), which have been placed in forests, (iii) "S" points (n = 188), which have been placed in so-called "hot spot" regions representing different types of environmental hazards (e.g., mining sites, landfills or industrial areas) Soil profiles placed in forests (i.e., "E" points) have been specifically designed for providing spatio-temporal information on soils in the context of forest ecosystems and therefore they are appropriate for the objective of this study.This means 183 soil profiles (Fig. 1) with 618 soil horizons in total.According to the concept on determining SOC saturation presented in Section 2, we could assume that the topsoils represented by these profiles are practically saturated in SOC and therefore they can be used as reference soil profiles for developing an empirical, data-driven predictive model between SOC saturation, and further soil properties and environmental covariates.
In this study, we used the following chemical and physical soil data associated with the soil horizons of "E" soil profiles: soil organic carbon (SOC) content [unit: %], pH [unit: -], calcium carbonate content [unit: %], as well as sand (>50 µm), silt (50-2 µm), and clay (<2 µm) content [unit: %].In the following, we will consider data on SOC in these soil horizons as saturated.The role of the other soil properties listed previously in SOC saturation is frequently discussed in the literature.For instance, a number of papers demonstrated the specific role of the fraction of fine particles (i.e., silt and clay content) in SOC saturation (Chen et al., 2018;Hassink, 1997).Others highlighted the role of specific surface area of the soil (Beare et al., 2014).The presence of cations such as calcium or sodium also has an effect on SOC saturation by creating complexes with SOM molecules that make soil organic matter stable and more resistant to mineralization (Juhos et al., 2021) or more mobile for leaching or decomposition, respectively (Wong et al., 2010).Furthermore, pH provides indispensable information on the acidity/ alkalinity of the soil, which basically determines the quality and the degree of polymerization of SOM molecules in soil (Rasmussen et al., 2018).
Since the depth and thickness of soil horizons varies from soil profile to soil profile, there was a need to harmonize all soil data for the topsoil (0-30 cm).We used the mass-preserving spline technique (Bishop et al., 1999;Malone et al., 2011) for modelling the vertical distribution of each of the soil properties listed above at each reference soil profile.The fitted splines were used to derive the values of soil properties for the topsoil at each reference soil profiles.The splined, so-called harmonized values were used in further modelling.

Spatially exhaustive information on soil and environmental covariates
In Table 1, we summarized the spatially exhaustive information on soil and environmental covariates used in this study.Under the aegis of DOSoReMI.hu(https://www.dosoremi.hu),a number of soil related properties, functions and services at various depths (e.g., soil depths specified by the GlobalSoilMap initiative (Arrouays et al., 2014)) have been mapped for Hungary using advanced digital soil mapping techniques (Pásztor et al., 2020(Pásztor et al., , 2018(Pásztor et al., , 2017(Pásztor et al., , 2015)).In this study, we used the maps of soil properties listed in Section 3.1 as spatially exhaustive G. Szatmári et al. information on soil.We should note that all these maps have been directly compiled for the topsoil (0-30 cm) with a spatial resolution of 100 m and quantified prediction uncertainty.
In addition to soil information, we also used spatially exhaustive environmental covariates, which may affect SOC saturation.Topography and its attributes were characterized by a countrywide digital elevation model (DEM) (source: Bashfield and Keim, 2011) and a number of geomorphometric parameters derived from that DEM, respectively.The derived geomorphometric parameters are also summarized in Table 1.Their role in SOC saturation is also decisive.For example, LS factor integrates the length and steepness of a slope and therefore provides quantified information on soil erosion susceptibility.Topographic wetness index is in relation with soil moisture associated with topographic position and therefore affects microbiological activity and SOM decomposition.Vertical and horizontal distance to channel network also provide information on the potential water-affectedness of soils.
We characterized climatic conditions by the data layers of long-term mean annual precipitation, temperature, evapotranspiration and evaporation provided by the Hungarian Meteorological Service (Szentimrey and Bihari, 2007).The role of climate in point of SOC saturation is also important because climatic conditions quantitatively determine mineralization of organic carbon in soils.Due to the various data sources, we resampled each of the environmental covariates into a common geographic reference system with a resolution of 100 m.

Developing pedotransfer function
A regression matrix was created, which contained the location of the reference soil profiles in its rows, as well as the harmonized values of soil properties and the extracted values of environmental covariates associated with these soil profiles in its columns.This regression matrix and the vector of the harmonized values of saturated SOC content were the basis for developing a machine learning-based PTF between saturated SOC content (as response variable), as well as the other soil properties and environmental covariates (as predictors).
Our aim was to apply a machine learning (ML) technique that not just performs outstandingly but also provides fairly easily interpretable predictive model.The cubist algorithm suits for these prerequisites since it is a frequently applied ML algorithm with high performance (Adhikari et al., 2014(Adhikari et al., , 2013;;Minasny and McBratney, 2008;Viscarra Rossel et al., 2015) and provides transparent model structure (Malone et al., 2017;Miller et al., 2015;Viscarra Rossel and Webster, 2012).In brief, cubist, which is an amalgamation of several methodologies developed mostly in the 1990s by Quinlan (1993Quinlan ( , 1992Quinlan ( , 1987)), is a machine learning technique for generating rule-based predictive models (Kuhn and Johnson, 2013).Using the vector of the response variable's data and the regression matrix of the predictors, cubist partitions the response data into subsets within which their characteristics are similar with respect to the predictors used.A series of rules defines these subsets and these rules are arranged in a hierarchy.Each rule has a specific multivariate linear regression model and takes the form of "IF-THEN-ELSE", where.
(i) "IF" defines the conditions for the subset, (ii) "THEN" contains a specific multivariate linear regression model fitted to the subset, (iii) "ELSE" stands for the next rule in the hierarchy.
When prediction of the response variable is targeted, it is checked whether the new values of predictors meet the rule's conditions.If they meet the conditions, then the rule's specific multivariate linear regression model is appropriate for predicting the value of the response variable.If they do not meet the conditions, then the next rule in the hierarchy is visited and its conditions are used for checking the new values.It is repeated till the new values meet one of the rules' conditions in the hierarchy and thus its specific multivariate linear regression model is appropriate for predicting the value of the response variable.

Quantification and propagation of uncertainty
In Hungary, we have made great efforts in recent years to provide prediction uncertainty to each digital soil mapping product in order to meet the end-users' demand for reliable spatial or spatio-temporal soil information (Pásztor et al., 2020;Szatmári et al., 2021Szatmári et al., , 2020Szatmári et al., , 2019b;;Szatmári and Pásztor, 2019).Therefore, the spatial uncertainty of the digital soil property maps presented in Table 1 had been quantified and validated earlier (Pásztor et al., 2020) using geostatistical and/or machine learning approaches (Szatmári and Pásztor, 2019).Additionally, the covariance between the interpolation errors of these maps had also been quantified using multivariate geostatistics, as soil properties often show spatial interdependency with each other (Webster and Oliver, 2007), which affects the reliability of uncertainty quantifications especially when uncertainty propagation is also targeted (Heuvelink, 1998).
Here we should note that we had no quantified information on the uncertainty of the environmental covariates used in this study.Hence, we were not able to directly account for their uncertainty.
We used a first-order Taylor expansion to propagate uncertainty through the fitted cubist-based PTF, as several studies demonstrated its applicability with PTFs (e.g., Román Dobarco et al., 2019bDobarco et al., , 2019a;;Styc et al., 2021;Styc and Lagacherie, 2021).Uncertainty propagation analysis can be formulated mathematically as follows (Heuvelink, 1998): *Actual SOC content was only used for assessing SOC deficit.

Y = g(Z)
where g is the fitted cubist-based PTF, Y is the saturated soil organic carbon content as the output of the PTF and Z = [Z 1 , Z 2 , ⋯, Z n ] T is the vector of the PTF's input variables.The objective is to determine the uncertainty in Y, given the PTF g and the inputs Z and their associated uncertainty (Heuvelink, 2018).The uncertainty in Y is identified by the variance of Y, which is given by (Heuvelink, 1998): Eq. 2.
where n is the number of the PTF's input variables, ρ ij is the correlation coefficient between the uncertainties in z i and z j , σ i and σ j are the uncertainties in z i and z j , respectively, δg δz i (μ) and δg δz j (μ) are the first derivatives of g with respect to z i and z j , respectively, and μ = [μ 1 , μ 2 , ⋯, μ n ] T is the vector of the mean values of the n input variables.
We should note that if there is no correlation between the input variables (i.e., ρ ij = 0), then σ 2 Y is simply a summation of parts, each attributed to one of the inputs.This partitioning property allows to analyse how much each input contributes to the final uncertainty (Heuvelink, 2018(Heuvelink, , 1998)).This is of great importance when deciding which of the inputs' uncertainty should be diminished to reduce the final uncertainty.

Cubist-based PTF
Table 2 summarizes the descriptive statistics of the harmonized data on saturated SOC content and further soil properties used as predictors in this study.The development of the cubist-based PTF was complemented with 5-times repeated 10-fold cross-validation.According to the results of cross-validation, an R-squared value of 0.56 was obtained, i.e., the developed PTF explained almost 60% of the total variation of the saturated SOC data.Table 3 presents the model structure of the PTF, which not just summarizes the conditions and the specific multivariate linear regression models fitted to the different subsets of saturated SOC content data partitioned in model fitting but also shows the hierarchy via the number of rules (Table 3, first column).More detailed information on the fitted PTF function can be found in the Supplementary Material.Fig. 2 presents the spatially exhaustive soil and environmental covariates, which were found to be informative at the significance level of 0.05 based on the fitted cubist-based PTF (Table 3).
Fig. 3 (left map) presents spatial information on which of the rules and the associated multivariate linear regression model is appropriate for predicting the saturated SOC content on a particular area in Hungary.We should note that the rules identified notable geographical regions: (i) Rule 1 covers the sandy areas of Hungary with low SOC content, (ii) Rule 2 refers to the hilly and mountainous regions of Hungary with acidic soils, (iii) Rule 3 relates to non-sandy lowlands with neutral or alkali soils, (iv) Rule 4 is for the most elevated areas of Hungary with mostly acidic soils.
Increasing rule numbers predict an increasing SOC saturation value along the rules (Table 3).Rule 1 covers the sandy areas of Hungary (Fig. 3), which have the lowest SOC storage capacity due to the lack of the fine fraction.The clay content and the pH are directly connected, whereas the sand content and the mean annual temperature are inverse relationships with the predicted SOC saturation value.Rule 2 relates the acidic areas with lower sand content and slope steepness < 14% (Fig. 3).In this group, higher clay content, altitude, pH, slope angle, and length indicate higher predicted SOC saturation, while the higher evaporation, sand content and mean annual temperature triggers lower SOC saturation values.Rule 3 predicts the SOC saturation values for non-sandy sites with neutral or alkali conditions (Fig. 3).This is the only group where the sand content is not a predictor variable.In contrast, evapotranspiration, pH and slope steepness all decrease the SOC saturation values.Rule 4 covers the acidic sites with high slope steepness (Fig. 3), where both sand content and topographic position index increase the saturated SOC content.

Applicability of the developed PTF
We should note that the fitted cubist-based PTF was elaborated on sampling points located in permanent forest in Hungary and therefore it is worth investigating whether there are areas where the applicability of the PTF is not recommended.As it was presented in Table 3, the fitted PTF is a set of specific multivariate linear regression models, each having a range within which the given regression model is reasonable to use for predicting the saturated SOC content.However, out of this range the applicability of the given regression model is definitely not recommended due to extrapolation.Therefore, we used the specific multivariate linear regression models (Table 3) to identify areas throughout the entire area of Hungary where extrapolation of the fitted PTF should be expected (Fig. 3, right map).It was found that for 7,922.7 km 2 the fitted cubist-based PTF cannot be used because of extrapolation.Considering the total area of Hungary (93,030 km 2 ) this means about 8%.Although the areas, where the application of the fitted PTF is not recommended, show scattered pattern throughout the country, notable regions of Hungary can be identified.Almost 90% of the areas

Table 2
Summary statistics of saturated soil organic carbon content (SOC sat ) and further soil properties harmonized for the topsoil (0-30 cm) at the reference soil profiles (n = 183).Abbreviation: SD: standard deviation.Table 3 The model structure of the cubist-based pedotransfer function used for predicting the saturated soil organic carbon (SOC sat ) content in the topsoil (0-30 cm).Note that a significance level of 0.05 was used to select the informative covariates.Annotation: See the Supplementary Material for more detailed information on the fitted pedotransfer function.characterized by salt-affected soils (e.g., Danube-Tisza Interfluve, Hortobágy) are proved to be not suitable for the application of the fitted PTF, which can be attributed not just to their extreme sodicity, alkalinity or salinity but also to the fact that their formation is highly influenced by shallow groundwater, permanent or temporary waterlogging.Additionally, areas characterized by highly or moderately water-affected soils are also highlighted along the main rivers (e.g.Danube, Tisza, Dráva) and in the eastern part of the country, which is completely acceptable as the concept introduced in Section 2 is referred to mineral soils and not to water-affected soils, where the balance between SOM accumulation and mineralization is strongly affected by surplus water.This also true for the peatlands, which can be easily identified on the map too (e.g.southestern part of Lake Balaton).

Spatial prediction of saturated SOC content with uncertainty propagation
Fig. 4.A presents the spatial distribution of saturated SOC content over Hungary as a result of the application of the fitted cubist-based PTF (Table 3) on spatially exhaustive soil and environmental covariates (Fig. 2).Next to this map we also presented its variance (Fig. 4.B), as one of the main results of the uncertainty propagation analysis performed.This map can be interpreted as the uncertainty in the saturated SOC content spatial predictions.We found that higher uncertainty is associated with areas where extrapolation of the fitted PTF is expected (Fig. 3, right map).We should also highlight that the final uncertainty associated with the saturated SOC content predictions takes into consideration not just the uncertainty associated with the fitted PTF but also the prediction uncertainty of the digital soil mapping products listed in Table 1.As was mentioned in Section 3.4, the partitioning property allows to analyze how much each input contributes to the final uncertainty.Thus, we determined the contributions and relative contributions of the input soil maps (Fig. 4.C and E) and the cubist-based PTF (Fig. 4.D and F) to the uncertainty in the saturated SOC content spatial predictions (Fig. 4.B).As we had no information about the uncertainty of the environmental covariates (Fig. 2, second and third row), we could not determine their contributions to the final uncertainty.However, their uncertainty contributes to the uncertainty in the fitted PTF (Fig. 4. D and F).This is because the cubist-based PTF was elaborated on their uncertain data and not on measured ones as in the case of soil inputs, where we used actual measurements from SIMS.Hence, uncertainty in the environmental covariates (Fig. 2, second and third row) is also   3. Note that settlements and open water bodies were left blank.represented in the PTF uncertainty.By comparing the relative contributions (Fig. 4.E and F), it was found that the contribution of PTF uncertainty to the final uncertainty is much larger than the contribution of the soil maps uncertainty.

SOC deficit in Hungary
We compared the saturated SOC content map with the actual SOC map of Hungary (Fig. 5.A) (Szatmári and Pásztor, 2019).In brief, the actual SOC map also refers to the topsoil (0-30 cm) and has been created by using a combination of geostatistics and advanced ML techniques.A detailed description on the soil data, environmental covariates and digital soil mapping methodology used for predicting the spatial distribution of the actual SOC content can be found in Szatmári and Pásztor (2019).By subtracting the saturated SOC map from the actual SOC map, we compiled a difference map presenting the SOC deficit for the topsoil in Hungary (Fig. 5.C).On the map, negative values (i.e., red colors) present areas with SOC deficit, whereas positive values (i.e., blue colors) show areas where soils are saturated or oversaturated in SOC.According to the compiled difference map, not just large parts of the country (ca.80%) can be characterized by SOC deficit but it also shows high spatial variability across Hungary.Note that the uncertainty in the spatial predictions of saturated SOC content (Fig. 4.B) is an order of magnitude larger than the uncertainty in the spatial predictions of actual SOC (Fig. 5.B).Because of this the uncertainty in the SOC deficit map (Fig. 5. C) must be also large, which does not allow to delimit areas with statistically significant SOC deficit in the country.

Interpretation and discussion of the results
The sand content was the most important predictor, which was included in three out of the four regression models.In two cases, it was inversely proportional with the SOC saturation value, even though in Rule 4, on the steepest parts sand content was proportional with SOC saturation (Table 3).This may result from the increased infiltration due to the coarser texture resulting in a lower amount of runoff and soil/ carbon loss (Nagy et al., 2020).The other general predictor is pH which regulates the predicted SOC saturation value in the Rules 1-3 (Table 3).It has a positive effect on sandy soils and under acidic and neutral conditions.In contrast, in the alkali range, pH has a very strong inverse relationship with SOC saturation.This may indicate the mitigated SOC saturation potential under sodic conditions (Rasmussen et al., 2018;Wong et al., 2010), even though in the current case, the spatial prediction is unreliable for sodic soils due to extrapolation (Fig. 3, right map).Regarding the climatic effects, both evaporation and evapotranspiration decrease SOC saturation on non-sandy and non-hilly fields.Mean annual temperature mitigates SOC saturation as was reported by several studies (Wiesmeier et al., 2014;Zhang et al., 2015).Topography provided ambivalent results as slope steepness decreased SOC saturation values on the flat areas, but topographic position index and the LS factor were in direct linkage with the saturation.Taking the spatial resolution of the present study into account, the erosion and soil redistribution processes (Ghosh et al., 2021) may result in a more detailed spatial pattern (Centeri et al., 2014).
We should highlight that the compiled map of saturated SOC content (Fig. 4.A) should be interpreted together with the PTF's applicability map (Fig. 3, right map), as it was found that for 8% of the area of Hungary extrapolation of the fitted PTF should be expected.Therefore it is better to disregard those areas from the assessment, where the PTF's applicability is not recommended.As a general trend in Hungarian topsoils, the highest SOC deficits (Fig. 5.C) occur in areas with medium to high actual SOC content (Fig. 5.A).Base saturated Chernozems and Kastanozems storing more than two percent of SOC have the capacity to sequester an additional one percent of SOC mainly on the plains of Hungary (Fig. 5.C).The mountainous parts of the country as well as the acidic fields have lower capacity, however, it still means 0.1-0.5% additional SOC storage potential.

Limitations and further research
The estimated SOC saturation deficit value is rather theoretical with practical applicability.The saturated forest sites used as references have no tillage but practically unlimited organic matter input.On cropfields, however, a considerable amount of organic matter is gathered as yield and straw thus, the OM input is limited.Moreover, tillage increases SOM mineralization.Consequently, the theoretically calculated saturation value is hardly available for the practice without land use change (Stewart et al., 2007).Ensuring food security, most cropfields must be kept as arable land in the long run, accordingly loading the total amount of calculated SOC deficit is unlikely.Nevertheless, the application of conservation agriculture technics is believed to relevantly increase the SOC content under cropfields (Gelybó et al., 2022;Jakab et al., 2022;Madarász et al., 2021).Thus the amount of SOC saturation deficit may be an efficient indicator to identify hot spots where cultivation shift would benefit most, independently of the theoretical amount of SOC saturation.
On the other hand, we also should note that considerable parts of the country have been indicated as saturated or oversaturated (Fig. 5.C).The highest values of oversaturation are mostly related to peatlands (e. g., southwestern part of Lake Balaton) and other water affected areas.As it has been already mentioned in the Introduction, on these sites, the temporal or former water coverage inhibited the mineralization of soil organic matter, which process cannot be captured by the developed PTF (Fig. 3, right map) since the approach used in this study was based solely on forest sites.Further reason of (over-)saturation is related to the areas covered by Arenosols, especially in the Danube-Tisza Interfluve (Fig. 5. C).These soils can be characterized by basically low SOC concentration, where low changes may trigger oversaturation.Under the above conditions, the role of particulate organic matter or fresh plant debris may cause increased variability in Arenosols resulting in oversaturation.
In this study, we used the monitoring points of SIMS located in permanent forests (i.e., the "E" points) as reference soil profiles, which is a coherent dataset because the description of soil profiles, field sampling and laboratory analyses have been carried out according to a predefined protocol based on the Hungarian Standards.We should note that only a limited number of soil profiles located in permanent forests are available (n = 183), therefore more effort should be made in the future to collect additional soil samples from permanent forests across Hungary.The age of these permanent forests may also affect the current saturation degree if they are afforested in the recent past thus also, the age of forests would be useful to include in further studies.
The uncertainty associated with the saturated SOC content spatial G. Szatmári et al. predictions was larger than the uncertainty of the actual SOC content by an order of magnitude (Fig. 4.B and 5.B), which made it impossible to delimit areas with statistically significant SOC deficit in Hungary.Note that this is not unique in the literature, a number of papers demonstrated that large prediction uncertainty can put restraint to digital soil assessment (Heuvelink et al., 2020;Szatmári et al., 2021Szatmári et al., , 2019b)).However, the methodology presented in this study and the compiled maps provided useful preliminary results on identifying and delimiting areas with SOC deficit in Hungary, which should be fine-tuned in the future with probably more additional observations on saturated SOC content.As it was pointed out by the results of uncertainty propagation analysis, this large amount of uncertainty can be mostly attributed to the fitted PTF (Fig. 4.D and F).This is of great importance, as it basically gives the direction on how to design the additional sampling campaign in order to successfully improve the cubist-based PTF and at the same time reduce its uncertainty.Besides, the cubist algorithm also proved to be an efficient ML technique not just for developing PTF but also for making the resulting PTF more transparent for users, an issue frequently addressed in recent studies (Szatmári et al., 2020;Wadoux et al., 2020).The transparency of the model structure can help us not just to better understand soil and environmental conditions affecting SOC saturation but also to optimize the number and location of additional sampling points making the additional soil survey more cost-effective.In the last few years, a number of studies have addressed the issue of optimizing sampling design with known model structure and demonstrated that uncertainty can be reduced (Ließ, 2015;Szatmári et al., 2019aSzatmári et al., , 2015;;Wadoux et al., 2019).

Conclusions
Our objective was to develop a cubist-based PTF, which can be used for predicting and mapping the saturated SOC content of the topsoils in Hungary and then compare the resulting map with the actual SOC map in order to assess the degree of SOC deficit in the country.The fitted cubist-based PTF proved to be efficient in predicting SOC saturation across Hungary, furthermore, the cubist algorithm gave a fairly transparent model structure, which helped to better understand the factors and driving forces of SOC saturation.We have highlighted that not just the physicochemical properties of soils, but also environmental conditions, such as topography and climate, characterizing landscape are important factors in predicting the level of potential SOC saturation.Our results also pointed out that there is SOC deficit on large part of the country (ca.80%), which shows high spatial variability.Besides, the most considerable potential for additional SOC sequestration was found related to soils with medium to high actual SOC content.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1 .
Fig. 1.Location of the soil profiles of the Hungarian Soil Information and Monitoring System (SIMS) (n = 1,236).

Fig. 2 .
Fig. 2. Spatially exhaustive soil (first row) and environmental covariates (second row: climatic variables, and third row: topographical variables) found to be informative for predicting saturated soil organic carbon content in Hungary.Annotation: Settlements and open water bodies were left blank.

Fig. 3 .
Fig. 3. Spatial distribution of the rules defined by the cubist-based pedotransfer function (left map) and the applicability of the pedotransfer function over Hungary (right map).Annotation: Conditions and multivariate linear regression models associated with the rules can be found in Table 3.Note that settlements and open water bodies were left blank.

Fig. 4 .
Fig. 4. Saturated soil organic carbon content for the topsoil (0-30 cm) (A) and its variance as a result of uncertainty propagation analysis (B).Contributions and relative contributions of the input soil maps (C and E) and the cubist-based pedotransfer function (D and F) to the uncertainty in saturated soil organic carbon content spatial predictions (B).Abbreviations: SOC: soil organic carbon, and PTF: pedotransfer function.Annotation: Settlements and open water bodies were left blank.

Fig. 5 .
Fig. 5. Spatial distribution of the actual soil organic carbon (SOC) content (A) and the associated prediction uncertainty (B), and the the map of SOC deficit (C).Annotation: Settlements and open water bodies were left blank.

Table 1
Summary of the spatially exhaustive information on soil and environmental covariates used in this study.Abbreviation: DEM: digital elevation model, LS factor: slope length-gradient factor, MRVBF: multiresolution valley bottom flatness, MRRTF: multiresolution ridge top flatness, and SOC: soil organic carbon.