EstSoil-EH v1.0: An eco-hydrological modelling parameters dataset derived from the Soil Map of Estonia

.

The manuscript improved from the initial version.The dataset is highly valuable and deserves publication.However, the authors provide a plethora of information, which is very difficult to connect and trace through the manuscript.Especially when "expert knowledge" converts to numerics it remains challenging to understand the steps and the data.For each step the authors apply a set of concepts.But they do not really clarify on these concepts and about the respective implications for the usage of their data product.
I have full confidence that the authors can and will improve these shortcomings and hope that the following comments can assist their revisions.
We appreciate the valuable comments and hope to have addressed the points raised.

General Comments
As a data paper there appears to be a structural challenge with the manuscript.On the one hand it is important to report the resulting data, including transparency about the sources and validity of the derived values.On the other hand the methods and challenges to derive this data are taking a lot of room, which obscure clarity of the presentation of the methods and results.I suggest to rigorously revise the structure and content of the manuscript under the view of "how to provide the readers the means to understand and work with the dataset".
We have restructured the document with attention on the workflow steps as depicted in the main figure 1.We have reduced the technicalities a lot and put them in an additional technical supplement.In the results section we have added additional error metrics and characteristics for the reader to evaluate various use case, e.g. in arable lands, grassland or forested areas.
I am sure the authors aim to do so from the start.Figure 1 is capable to provide a solid basis for this.Hence, the introduction could work exactly towards the four major steps/aspects plus a brief clarification of the motivation of the study.In the present state, my capabilities to distill information from the introduction is massively exceeded.One possible avenue to follow could be: Analysing effects of land management decisions is an important issue.SWAT-type models are used and heavily rely on soil data at high resolution and level of detail.Although a global issue, it is found at National level as well.This study is about Estonia where -as in many other places -mapped soil data is very difficult to convert to numerical data for the intended model application.Moreover, the required aspects/variables are lacking in the original soil maps.This study presents a soil data base, which includes layering, texture and pedo-physical variables as well as site characteristics… We have restructured the manuscript added additional background for several paragraphs where the expert inputs came in, which now also underpinned with references like an EU project report where several decisions have been made.
Further, I am under the impression that some of the debate (i.e. about SOC) might rather suit the discussion section?
We have reorganised some sections in the introduction and moved a paragraph also to the discussion section.
The methods-section gives a lot of information.Although I find it penetrable after some intense study, I feel somewhat overwhelmed and left alone at the same time.I think the structure is already laid out well in Figure 1 and requires to be strictly followed.Hence I suggest to offensively start with Figure 1 and motivate the different aspects as classical soil data, topographic data, specific soil variables (SOC, BD), pedo-hydrological parameters.This methodological introduction could include the general idea behind each step.The remainder of the section can host the respective methodological steps.I would suggest to give each subsection a similar pattern: Available data, required data, how to come from a to b, how can we evaluate the method, where is the code.Through this, I expect that it becomes much easier to follow, the subsections will become shorter and less technical.Moreover, I sincerely urge the authors to point out the conceptual ideas instead of listing the parsing grammar and error handling.I agree that this is important, but at the current state I feel massively distracted by reading strings which parse and links to external files.
We have reduced the technicalities a lot and put them in an additional technical supplement.
The tables 1 and 2 might suffice to document the remapping with a little explanation?Yes, that is right and was our intention from the get-go.We added also a technical report link to back up choices of "expert knowledge".
Your second step distills topography-related site properties (Sec.2.3).I am sure that not only an estimate of SOC concerns with such information.E.g.HAND and similar indices (Gharari et al. 2011, Loritz et al. 2018) are pointing out the use of such information.Maybe a simple table with the applied indices can be listed and that is enough?
We appreciate the provided information.The terrain related indices we only used for SOC prediction under the hypothesis that slope affects deposition and thus erosion and mineralisation are indirectly correlating to the level of SOC.However, clay and sand content, soil group and land use were by far the strongest indicators and only then came additional terrain indices with very low influence.A new and more complex model could likely be a paper on its own as reviewer 2 implies, too.
The derivation of SOC appears maybe to be the most vague point since it employs few observations and many assumptions.Hence, I would see this step and the subsequent application of Rosetta and calculation of USLE_K to be maybe somewhat a different level of "data" and should be treated as such?I consider it out of the scope of your study to really prove the effect different ways to reproject the soil types and to derive topographic indices on the PTF and RF.In the same pattern as before, I think it can be handled rather straight forward to present the foundation data and the PTF / RF application.Again, I think a table/figure what inputs convert to which outputs with what assumptions would likely elucidate on the step much concisely?
Based on reviewer 2 comments we dropped USLE_K as immediate variable due to weak points in additional variable derivation for USLE K equation.However, hydraulic conductivity is only based on texture properties sand, silt and clay.We explicitly omit the variation where we could use BD as additional parameter for Rosetta due to the high uncertainty.For the calculation of BD it is expressly stated that it depends only on the SOC step during the same section in the manuscript.
I have seen some passages about possible evaluation strategies of the derived data.Since this is a data publication, I would suggest to give this a much stronger focus.Again, most appears to be there and might require just a little more clarification.It could follow the same pattern in each aspect.
For the validation of SOC I was wondering if the random split sampling results in subsets which still cover most of the sample patches.As such, the validation would maybe not give validation for spatial extrapolation.Could you check to cross validate between different regions?As the data set appears bimodal, I am also not quite convinced about the validity of a global R2 as "uncertainty" estimate.For the cluster of moderate values, there are deviations of >400%!I think this application is a valid approach, however there should be clear notice of the relatively large uncertaintyespecially since SWAT modelling is intended as application where people might relate findings to the reported SOC values.
We put a lot of effort into additional error descriptions.We added RMSE and the normalised median absolute deviation error metric to SOC, subsets of samples where observed data for BD is available, as well as sub samples for sand, silt and clay.We refer to the metrics and there limited explainability over the whole dataset, however, we also break down SOC prediction error for different landforms with additional descriptive statistics, such as error quantiles.This will help the user of the data to make choices that are more informed.
The results section should present the derived data and maps in a brief manner.I can imagine that an essence of Fig. 4-9 could be shown here in max. 4 panels.I would expect a description of the dataset here.Referring to Table 6 the different "sections" could be related back to the applied methods.Instead of the technical definition of the data_type I would favour a clarification about the data reliability, which could come in the remainder of the section as evaluation of each processing step.Again, I think most of the material is there and the authors can be very specific about each evaluation finding.In the present form, the results highly underrate the valuable outcome of the study and give little guidance about how to use it.Rather technical details could be given as appendix (e.g.texture remapping) and make room to really point to the reliability of the compiled data.
As explained above, we added additional error metrics, broken down for main land uses in Estonia (arable lands, open grassland, forest and wetlands).We the panels suggestion sounds great, but fear for the lack of quality/visibility if we reduce the image sizes.
As stated before, I would see many paragraphs in the introduction and methods section actually suiting the discussion section.I think the authors could remain more specific to their estimates of step C and D instead of this rather general discussion.This would also be a perfect place to link to the topic of the special issue and highlighting where soil mapping and landscape functioning are challenging/easy to connect.
We also added several sentences to link better to the special issue.
As I suggest to rework the manuscript rather strongly, I omit to give specific comments at this stage.However, I would remove "End" in Fig. 1.Moreover, I would remove most of the file names and links from the text and either put them to the Bibliography or as a file description in the Appendix.
We reworked the manuscript a lot.We also removed many of the links to scripts and supplemental materials.

General comments:
Authors have answered most of the questions raised about the first version of the manuscript.They have improved the manuscript based on the suggestions, but clarity and information on uncertainty needs further input.It is understandable that quantifying the uncertainty of the variables is quite challenging for the authors.It would need significant amount of time to perform it, especially to find the correct method for each mapped variable/soil property.It might be an interesting analysis for a separate paper.The presentation of the present state of the dataset is already valuable.In this manuscript the main strength is the description on how the enormous and very valuable soil data collected in line with the Soviet system could be translated into a dataset which can be used for quantitative analyses.The followings should be further described and highlighted in case of all mapped variables under discussion: -how the variables were derived, e.g.: based on expert rules/ characteristic mean values are assigned/ computed with PTFs based on characteristic particle size distribution, etc.; -what are the limitations of their use.
We appreciate the valuable comments and have addressed the points raised.We have added error metrics where observed data was available, even if only for partial subsets of the data.We thus also addressed the limits and preferable use cases, especially for landforms where errors for predictions are higher or lower.
Still terminology used in international literature should be adapted in the manuscript.It is needed not only for the texture, but for all expressions related to soil science in the entire text, tables, e.g.: in WRB the word "soil type" is not used, there are reference soil groups (http://www.fao.org/3/i3794en/I3794en.pdf).Please provide information in the text about how WRB reference groups and qualifiers were derived and mention possible limitation of providing this information.
We have changed the wording across the manuscript where appropriate.After all, the Estonian terminology only calls it soil type.But we provide guidance, from where we transition the user towards the wording of soil reference group.Unfortunately, one important soil expert who provided the intial insights has passed away during the work on the manuscript (Dr.Arno Kanal).In fact, he was one of the few experts who was familiar with the Soviet system.Notwithstanding his tragic loss, we could support and extend his work with collaboration with the Ministry for Environment in Estonia, where we gained access to additional observed data to calculate error statistics for fine earth fractions and bulk density.We also link to a report where Estonia has been collaborating in a large EU project (BioSoils), where WRB soil reference groups where designated for various Estonian soil types.May be the derived particle size distribution (SOL_SAND, SOL_SILT, SOL_CLAY) could be referred in the manuscript as clay, silt and sand content characteristic for the texture class of the unit and you could add reference of the conversion method.
In the flow text we now only refer to sand, silt and clay content or fractions.Only when referring to the variable name in the dataset we left the SOL_ prefix.The conversion is based on the tables 1 and 2 as explained in the text.We have put more emphasis on this, to be clearer to the reader.The values for sand, silt and clay content have also gained some error metrics, but only for forest area was observed data available.
Further checking the grammar and terminology would be important, e.g.among others P15 L13, P17 L9, L12-13, P18 L15.Some suggestions are provided under specific comments.
We updated Table 6 (now table 5) to focus only on the described parameters in the paper.There are additional variables in the dataset which where temporarily used during the machine-learning process.On request from users we let them in.
Just randomly checking the database for some cases it was not logical that nlayers=1 and clay, silt, sand and rock content is given for layer 2 as well, e.g.: orig_fig=194041, 191116, or nlayer is 2 and clay, silt, sand and rock content is given for layer 1 and 3, e.g.: orig_fig=372656.It might be useful to not use zero (0) for those rows where there is no data.Please check and correct these.
We agree, that instead of zero (0) we should employ NULL or similar measures in order differentiate between a value of 0 and no value.However, the use of the dataset implies to only use the variables which have defined layers.We have uploaded an updated version of the dataset.
Please clarify in the text the followings, which were mentioned in the first round of the review as well: -How were Estonian soil types translated into WRB reference soil groups and qualifiers?Is there a reference document for it?Yes, there is a reference report now added.Initially we had only received "expert knowledge" input, but now it is underpinned with a documented EU project (BioSoils).
-Predictors used in the random forest method could be listed under materials and methods section.

Specific comments:
We have significantly restructured and revised the manuscript.We hope to have addressed all specific points.P2 L9-12: please finish the sentence Edited.P2 L23-24: … sand, silt and clay content, amount of coarse fragments, organic carbon content and carbon stocks at seven soil depths … please use the word "content" in the entire text in case of the above mentioned soil properties.
We have edited the manuscript accordingly.P2 L25-26: please consider that also SoilGrids provide harmonized soil database for Europe, please rephrase the sentence.The following can be deleted: ", and also covers Estonia", it is logical.
Edited.P2 L30-31: it is not clear why HYPRES dataset is mentioned.In this case LUCAS or EU-HYDI datasets could be mentioned as well.Please consider the message of the text and revise the sentence accordingly.
We have added EU-HYDI.P3 L23: … related to water and carbon cycle … is it correct?AWC is missing from the listing.
Yes, we added cycle in several occasions.We added AWC.P3 L27: … There is no countrywide spatial dataset of soil organic carbon content and bulk density for Estonia ... is it correct this way?Please finish this thread, e.g.: it was needed to derive predictions for both soil properties which made it possible to map them.
We have added the clarification.P3 L27-30: could be moved after L18.P4 L18-19: instead of soil profiles would it be appropriate to write soil layering?Soil profile related also to the sampling process as combined wording for layers and taking of physical sample.P4 L20: … related to water and carbon cycle… is it correct?Yes, added.P4 L21: … from the historical soil maps of Estonia -surveyed between 1949 and 1991 -to support modelling … is it correct?Indirect.We added clarification that the original soil surveys mainly were done for land evaluation and planning and assessment of agricultural use.The organoleptic determination resulted only in a categorical label.The translation into a numerical value is based on Table 2 and "expert knowledge".We have now also several numerical error metrics (at least for forest soils).P10 L12: … We compared the derived sand, silt and clay content values with two different datasets.… is it correct?Correct, soil grids and a county level size manual based extract.P10 L22, L26: on P13 L20 you mention that Ksat was calculated with Rosetta PTFs, thus it was not derived from EU-SoilHydroGrids.Please revise it.
That is correct, we have not stated that K sat is derived from EU-SoilHydroGrids, only AWC.P10 L27-29: Please describe more detailed the differences between EstSoil-Eh and SoilGrids.
We introduce SoilGrids already in the introduction, main difference is possibly that EstSoil is vector based and formed from categorical data and SoilGrids is raster based and formed from nurmerical samples.This is also stated in the discussion.P12 L15: … as predictor variables for the calculation of SOC and BD … Edited.
P12 L21: please note that number of randomly selected variables -during each split -and number of trees in the forest are usually optimized. Ok.
P12 L30: the following can be deleted: "for machine learning".
The technical code supplement "04_soilmap_SOC-RF_BD.ipynb"describes the modelling process indepth.But the manuscript also states that the models uses the fine earth fractions (sand silt clay content), coarse fragments, topographic variables as in section 2.3 , and drainage where used as predictors.
P13 L5: based on your answer and Equation 4 texture was not used to calculate BD, please delete the following: "texture values and".
Texture values related to sand, silt and clay values, which were used to predict SOC, and SOC was then used as PTF for BD.Deleted.P13 L6: please shortly describe why you choose that PTF to compute BD, why not other PTF was used, e.g.: applicability/ training set used to derive the PTF was similar.
Added a short comment, that is was already used successfully in Estonia before.P13 L9: there is no information in the brackets, please check it.P14 L10: please check the reference, humic or peaty topsoils do not have blocky, platy or massive structure.
We removed the USLE K based on your recommendation.P14 L11-12: Please provide more information with reference about how soil structural class was derived.Based on solely texture and amount of course material structure cannot be given.
We removed the USLE K based on your recommendation P14 L1-27: based on the present information, deriving structural class is a weak point.Therefore, I would suggest to not include the USLE K erodibility factor in EstSoil-EH dataset and in the manuscript.
We removed the USLE K based on your recommendation P16 L10-11: It is the repetition of the information given in materials and methods, therefore could be deleted.deleted P16 L25: Is it correct that accuracy of BD could not be analysed because there is no Estonian dataset with measured values?If that is true, please mention it.If there is a dataset where accuracy can be calculated, please perform the analysis.
Partially correct, we finally obtained some data.P16 L27: please describe map of BD as well.
Figure 6 P16 L29-30: mentioned in the materials and methods, please delete the sentence.

Deleted.
P17 L1-2: if structure class cannot be derived with a more robust method, I would suggest to delete information on USLE K from the entire manuscript and the database.
We removed the USLE K based on your recommendation P18 L14-15: may be the following could be considered: … with a reproducible workflow, which is unique in the case of Estonian soil datasets … added P17 L24-25: "are informed to some extent by previous reports" it is not clear please rephrase it.
Rephrased, aiming at SoilGrids that had likely also sample information available from Estonia.P17 L30-31: sentence starting with "A direct" is not clear, please rephrase it.It would be better to move the sentence starting with "From the point" under results section.
Edited.P17 L31-32: … based on the layering of the original texture code per mapped soil units… Edited.
P18 L8-12: It is a very good idea to use an additional class for peat, but the following sentence might be confusing therefore should be revised or deleted: "From that perspective peat soil units are currently modelled with assumptions to have a similar behaviour to clay hydrologically."Several studies have shown that the shape of the shrinkage characteristics of peat soils were significantly different from those of clay soils (Van den Akker and Hendriks, 1997;Oleszczuk et al., 2003;Hendriks, 2004.)Much appreciated hint.We will look into it.Currently, seem to have some success with using it in SWAT like this for runoff behaviour.

Introduction
Eco-hydrological numerical models like the Soil and Water Assessment Tool (SWAT; https://swat.tamu.edu/)or the Regional Hydro-Ecologic Simulation System (RHESSys) have been developed and applied during the past 30 years to evaluate the effects of alternative management decisions on water resources and non-point-source pollution in river basins through the simulation of physical processes (Arnold et al., 1998;Douglas-Mankin et al., 2010).SWAT is widely used internationally and is increasingly applied in Northern European and Baltic watersheds to better assess the hydrological state of the environment based on modelling of the most relevant physical processes (Piniewski et al., 2018;Tamm et al., 2016Tamm et al., , 2018)).However, a main input factor for many of these models is detailed soil data, which does not exist for many countries on national scale or which exists with insufficiently fine spatial resolution.In addition, it is complicated to derive the values of the model parameters.
The objective of the present study was to develop a numerical soil database for modelling and for predicting processes in Estonia.In this study, we derived numerical values for the key characteristics for the whole Soil Map of Estonia at a 1:10 000 scale for soil profiles (e.g., layers, depths), textures (clay, silt, and sand contents), coarse fragments and rock content, and physical variables related to the water and carbon cycle.There is no countrywide spatial dataset of BD, SOC, hydraulic conductivity, available water capacity) for Estonia.Thus, it was needed to derive predictions for these soil properties in order to map them.
Existing national-scale soil datasets that have been developed to be used by SWAT currently only exist for the United States.Cordeiro et al. ( 2018) developed an official soil dataset for SWAT for Canada.Apart from these efforts, no consistent methodology has been used to develop soil datasets at national, continental, or global scales so that the data is immediately applicable for the use in SWAT (Batjes, 1997;Dobos et al., 2005).
At the global level, two main soil databases are available.The first was made available by the United Nations Food and Agricultural Organisation (FAO) through its Soils Portal: the Harmonized World Soil Database (HWSD) v1.2 (Fischer et al., 2008; http://www.fao.org/soils-portal/soil-survey/soil-maps-and-databases/harmonized-world-soil-database-v12/en/).The dataset resulted from a collaboration between FAO and Austria's International Institute for Applied Systems Analysis (http://www.iiasa.ac.at/),ISRIC-World Soil Information (https://www.isric.org/), the Institute of Soil Science of the Chinese Academy of Sciences (http://english.issas.cas.cn/), and the Joint Research Centre of the European Commission (https://ec.europa.eu/info/departments/joint-research-centre_en).HWSD is a 30-arc-second raster database with more than 15000 different soil mapping units.It combines existing regional and national updates of soil information from around the world, including key soil and terrain properties of the Soil and Terrain database (SOTER, https://www.isric.org/explore/soter), the Ecological Site Description system of the U.S. Department for Agriculture ESD (ESD, https://esis.sc.egov.usda.gov/Welcome/pgReportLocation.aspx?type=ESD), the Soil Map of China (https://esdac.jrc.ec.europa.eu/content/soil-map-china),and homogenized sets of soil property estimates in the World Inventory of Soil Emission Potentials (WISE, https://www.isric.org/explore/wise-databases).It also contains information from the 1:5 000 000-scale FAO-UNESCO Soil Map of the World (FAO, 1990; http://www.fao.org/soils-portal/soil-survey/soilmaps-and-databases/faounesco-soil-map-of-the-world/en/).
The other global-level soil dataset is SoilGrids250m, which provides global gridded soil information derived with machine learning methodsbased on machine learning (Hengl et al., 2017) and is made accessible via an interactive Web interface (https://soilgrids.org/) with sophisticated standards-based data access via the OGC Web Coverage Services (https://www.opengeospatial.org/standards/wcs).SoilGrids250m provides values for sand, silt, clay, and rock fractions, and organic carbon and carbon stocks at several depths, which can be used as inputs for SWAT.SoilGrids also provides a harmonized soil database for Europe.At a the regional level, the European Soil Database v2.0 (Panagos et al., 2012; https://esdac.jrc.ec.europa.eu/content/european-soil-database-v20-vector-and-attribute-data) is athe only harmonized soil database for Europe, and also covers Estonia.It contains the soil geographical database SGDBE (vector data), which includes a number of essential soil attributes, and an associated database (PTRDB), with attribute values that have been derived through pedotransfer rules.The European database also includes the Soil Profile Analytical Database, which contains measured and predicted soil profiles for Europe as well as soil organic carbon (SOC) projections for Europe that include 26 European countries at a resolution of 1 km.Further European databases that collate soil-hydraulic data and pedotransfer functions (PTFs) include the European Hydropedological Data Inventory (EU-HYDI, https://esdac.jrc.ec.europa.eu/content/europeanhydropedological-data-inventory-eu-hydi ) Wösten et al. (1999) developed thea database of HYdraulic PRoperties of European Soils (HYPRES Wösten et al. (1999)).
Existing national-scale soil datasets that have been developed to be used by SWAT currently only exist for the United States.Cordeiro et al. ( 2018) developed an official soil dataset for SWAT for Canada.Apart from these efforts, no consistent methodology has been used to develop soil datasets at national, continental, or global scales so that the data is optimised for use in SWAT (Batjes, 1997;Dobos et al., 2005).
In Estonia, systematic large-scale soil mapping was launched in 1949, with agronomy students assisting (Estonian Landboard, 2017; "mullakaardi_seletuskiri.pdf" ).Starting in 1954, a special survey was carried out under the supervision of the Ministry of Agriculture.Aerial photographs were used as the basis for this activity.By 1992, Estonia's soil cover had been mapped by the Soil Survey Department of the former Institute of Estonian Agroprojects at a scale of 1:10 000.In addition to inspecting arable land, forests, and other land types between 1989 and 1991, the remaining former Soviet military areas were also mapped.Between 1997 and 2001, the soil map was digitized and attribute data was inserted into the database, resulting in the official National Soil Map of Estonia as a vector dataset that mapped 750 000 soil units at a scale of 1:10 000 (Estonian Landboard, 2017; https://geoportaal.maaamet.ee/est/Andmed-ja-kaardid/Mullastiku-kaart-p33.html).It is a very detailed and information-rich dataset for soils in Estonia.For each soil unit, it describes the soil type, quality, texture, and layer information using a series of complex text codes.However, to use it as an input for process-based models such as SWAT, these text codes must be translated into numbers.Processing a class-based soil dataset into the required numerical variables is a time-consuming process because not all data are readily available (Bossa et al., 2012;Rahman et al., 2012).Various datasets have been created that generalise values for agricultural soils in Estonia to produce less detailed versions at scales of 1:100 000 and 1:200 000 (Kõlli et al., 2009;Tamm et al., 2018).However, there is no national scale dataset of measurements or predictions of SOC or bulk density (BD) for Estonia and no large-scale high-resolution soil database is currently available with numericalwith numerical data for a range of typical eco-hydrological process-based modelsparameters.
The objective of the present study was to develop a reproducible method for deriving numerical values as inputs for modelling and for predicting hydrological processes with SWAT in Estonia.By developing this method, we aimed to develop a fully numeric soil database for Estonia.In this study, we derive numerical values for the key characteristics for the whole Soil Map of Estonia at a 1:10 000 scale for soil profiles (e.g., layers, depths), textures (clay, silt, and sand contents), coarse fragment and rock content, and physical variables related to water and carbon (bulk density, hydraulic conductivity, SOC) for all of the mapped soil units.
There is no national scale dataset of measurements or predictions of SOC or BD for Estonia.Prévost (2004)  Nevertheless, they constrained their finding by noting that their estimates were calculated based on the mean SOC stock for each soil type and the corresponding area in which the soil type was distributed.Putku (2016) used the large-scale Soil Map of Estonia at the polygon level for SOC stock modelling for mineral soils in arable land of Tartu county.Ramcharan et al.
(2018) assimilate more than 200 datasets with SSURGO in order to predict various soil properties, SOC and BD, at 100-m spatial resolution for the conterminous United States using statistical machine learning.Also, SOC and soil-hydraulic predictions for Estonia need to consider that Estonia is located relatively far north and hosts large areas of peatlands.
In summary, other datasets are available for use in Estonia-based modelling contexts.However, vector-based European or even global soil datasets are very coarse and excessively generalise large parts of the diverse Estonian landscape.
High-detail datasets such as Soilgrids250m are predicted data on a grid 1km/250m, based on much less data points for Estonia.
In this study, we derived numerical values for the following data in all of the mapped soil units in the 1:10 000 soil map: soil type (i.e.soil reference group), texture class, soil profiles (e.g., layers, depths), texture (clay, silt, sand components, and coarse fragments), rock content, and physical variables related to the water and carbon cycle (organic carbon content, bulk density, hydraulic conductivity, available water capacity and erodibility factor).We present also describe the development of a reproducible method for deriving numerical values from a the Soil Map of Estonia to support modelling and prediction of eco-hydrological processes with the popular Soil and Water Assessment Tool and we create provide an extended ready-to-use dataset containing the additional parameters.

Pre-processing and screening of the initial soil database
The source datasetthe original Soil Map of Estonia -as described in the article, is not based on modelled but on fully observed data (e.g.texture, soil profile depth, rockiness, presence of organic layer etc).Systematic mapping of Estonian soils to produce soil map in scale 1:5 000 and 1:10 000 was started in 1954 (Reintam et al., 2005), with most intensive field studies in period 1965-1969, for the main purposes of land evaluation and assessing potential for agricultural use.Generally field mapping was carried out in scale 1:10 000 but in hilly or undulating areas with higher soil diversity in scale 1:5000.In 1982-1988 older mapping data was updated and new areas were included with full-area soil quality (primarily fertility, rockiness, water regime, texture, erodability) assessment.In 1988-1990 soil field studies were performed in non-arable land and new mapping of ameliorated land.Forest soils were mapped in period 1976-1989.During large-scale field mapping of soils, the texture was determined in situ based on organoleptic methods (feel methods) and for reference profiles laboratory analyses were performed.This enabled calibration between texture defined by organoleptic method by each researcher participating in field survey and texture determined in laboratory (Estonian Land Board, Explanation to the soil map, https://geoportaal.maaamet.ee/docs/muld/mullakaardi_seletuskiri.pdf?t=20091211092214).As a result of large-scale soil mapping, 119 soil varieties in Estonian national classification system have been distinguished and more than 500 combinations of textural status have been described.About 10,000 profiles (1 profile per 330ha) have been sampled and analysed for characterisation of mineral soils (Reintam et al., 2003(Reintam et al., , 2005)).Thus, the texture codes and soil types assigned to the ca.750000 mapped soil units (polygons) are based on many decades of in-situ land surveying practices.
The original Soil Map of Estonia is available vector layer for geographical information system software that can be downloaded from the Republic of Estonia Land Board Web site (https://www.maaamet.ee/en) in several formats under a permissive open data license.A copy with the original shapefile dataset, the related required documentation and checksums has been archived for reference (Estonian Landboard, 2017; https://datadoi.ut.ee/handle/33/103).The soil map contains the following used attribute fields: -Soil type: a designation of the soil name, the Estonian analogue to the WRB soil reference groups -Texture: texture classes defined for fine and coarse fragments, and to which depth the same texture and coarse fragments are observed (layer) These attributes are encoded as "string" values, which include both letters and numbers.The important fields soil type and texture, are not just stored as standardised class values, but are instead a coded description based on abbreviations that are then combined with numbers for example depths and indictors for level of erosion, and are grouped together for different depths within the same attribute field.These description-based attribute values make it difficult to derive the foundational numerical values for sand, silt, clay and coarse fragments from the codes and to make them more consistent and usable in calculations and statistical analyses.In addition, our data screening revealed that the attribute values sometimes contradict the official legend for the Soil Map of Estonia.For example, the soil type reference sheet provided with the soil map lists ca.120 different soil types in Estonia (Estonian Landboard, 2017; "muldade_tabel.pdf")and the soil legend document describes 9 main texture classes and 12 soil skeleton types, i.e., coarse fragments and rock morphology (Estonian Landboard, 2017; "mullalegend.pdf").
However, the database's attribute table contains 7067 unique variations for soil types, which resulted from the use of many specific local derivatives and transcription errors.Similarly, the texture column actually contains 87240 unique values instead of 9, 21 (9+12) or 108 (9x12).Considering the possible permutations of these soil types and textures, it would be prohibitively difficult to develop any kind of reasonable standardisation for the soil parameters before cleaning and unifying the dataset.
Therefore, we performed extensive database standardisation on the original Soil Map of Estonia as the working basis and derive all further variables based from the standardised dataset sequentially.Figure 1 illustrates the four major working packages to derive the desired eco-hydrological parameters.The subsequent four sections are structured accordingly: Section 2.2 represents step A (Textural properties); step B (Additional topographic variables as predictor variables) in section 2.3, section 2.4 describes step C (SOC RF model and BD); and step D (Hydrological parameters) is described in section 2.5.

Deriving sand, silt, and clay fractions and rock content
The Soil Map of Estonia's "texture" field encodes the texture and general soil layer structure for each mapped soil unit in a structured, rule-based format (based on old Soviet-era paper maps).To derive meaningful numerical values for texture and other soil variables from the soil map, we developed a computer program that encodes these rules into a computer-readable grammar.In addition, we provided a lookup table for wrongly encoded texture codes and historical data-entry errors.The program provides a complete internal data structure that represents the analysed grammatical representation, which can be evaluated and used to generate numerical values for a variety of variables.The complete parser Python module including an extended technical description is provided as supplemental material (Kmoch et al., 2019b; "soil_lib/LoimisGrammarV2.py").
The main implementation of this program is based on the Python library "Arpeggio" (Dejanović et al., 2016; https://pypi.org/project/Arpeggio/),which is a recursive-descent parser based on parsing expression grammars (also known as the Packrat parser; http://bford.info/packrat/).This let us express rules and symbols (i.e., the grammar) in such a way that our software could parse arbitrary text and find the various definitions of the texture in the same way as the rules are described in the map legend handout.
Listing 1 provides an example of a parsing grammar.At the start of the program, the basic elements are defined, starting with the 9 main fine-textured soil types: "plsl, pl, tsl, tls, dk, sl, ls, s, l".The parser honours the order of their definition.
Without these ordered rules, the parser will never find the more complex expression "plsl", because it would stop as soon as it encounters the "pl" part of the name.We also defined the coarse fragments types, and peat land soils.
The function "def fine_textured(): return Optional(kPlus), fine-textured_list, Optional(amplifiers), Optional(depth_range)" demonstrates the flexibility of how a parsing expression grammar parser can be configured.The function can find even optional (0 or 1) elements such as prefixes or suffixes within an arbitrary text stream.
Subsequently, several separators and special indicators must be defined that can precede or be appended in combinations to the abovementioned soil type elements.These were often formatted as subscripts or using special characters.
This proved to be a major source of data-entry errors, encoding mistakes, and ambiguities, which had to be handled via additional error-checking code, e.g., lookup tables which are provided in the supplemental materials (Kmoch et al., 2019b; "soil_lib/LoimisLookups.py").
The mapped soil units also encode variations in the soil profile within a given soil unit.Thus, we must differentiate between a vertical separator for the observed soil layers, and a horizontal separator.However, we only considered the vertical component (soil horizons).In addition, these discrete vertical layers are only based the description in the original texture code.
To capture and fully evaluate the possible texture codes, it was necessary to capture the meaning of any additional (rare) horizontal separators.
Because there are various data-entry errors and other ambiguities in the actual codes in the soil map dataset, we manually analysed all codes that could not be successfully evaluated by the grammar.Manual inspection was particularly required for codes that did not conform to the general rules described in the original soil legend handout.A full list of nonlogical expressions, data-entry errors, and other grammar expressions that could not be easily or usefully standardised is provided as a supplemental Excel spreadsheet (Kmoch et al., 2019a;"texture_error_lookup.xlsx").
The parser for the defined grammar builds a data structure that can be evaluated for physical numerical parameters such as layers, depth, and the sand, silt, clay, and rock contents.This data structure is a Python dictionary object, i.e. a lookup table with nested key-value pairs that hold the parameters and the found values.In the example in Listing 2, it becomes apparent that there is a "/" vertical layer separator (at the bottom, the "code" parameter shows the original texture code for this soil unit), and that depths and fine fractions are accessible separately from the data structure.If a coarse fraction were defined in the texture code, then additional to the fine earth information, an additional "constituent" (the coarse fraction type) would be part of the respective layer (i.e. the "soilparts" object).
There are detailed studies on reference soil profiles in Estonia, Latvia and Lithuania that relate original soil texture, so called Katchinsky texture system (Kachinsky, 1965) to USDA soil system (Calhoun et al., 1998) and erosion modelling case studies where based on laboratory analyses transfer functions from Katchinsky to USDA texture classes were developed (Laas and Kull, 2003).The relationship between the Katchinsky and Atterberg systems were provided by R. Kask (2001).
The USDA soil taxonomy and World Reference Base soil classification systems use 12 textural classes, which are defined based on the sand, silt, and clay fractions (Ditzler et al., 2017).However, the USDA system defines fine particles as having a diameter ≤ 2 mm, whereas the Soviet-era maps use a diameter of ≤1 mm.The Soviet soil classification also mostly ignores the silt fractions, and focuses on the clay fraction (Ø ≤ 0.001 mm).Based on the available analysis data and its structure, we derived meaningful numerical texture values using a lookup table (Table 2) that represents our best efforts to account for the size difference between the USDA and Soviet systems and lack of silt data in the Soviet system.The original observations were classified into the Estonian texture code system based on Katchinsky (1965) soil particle size standards at the time of observation (not by us).We translated the texture codes back into numbers.The foundational numerical values for fine earth and coarse fragments fractions of soil are solely derived from the extracted processed and translatedEstonian texture classes as demonstrated in Table 2.
We defined the variables data for the extracted numerical texture input parameters for each layer as follows: -SOL_CLAY# (layer 1-4): clay content (% soil weight) -SOL_SILT# (layer 1-4): silt content (% soil weight) -SOL_SAND# (layer 1-4): sand content (% soil weight) -SOL_ROCK# (layer 1-4): coarse fragments content (% volumetric) Regarding uncertainties related to that process -as we take these as observed data -achieving 5% accuracy in organoleptic determination of clay content for lower value classes while possible error increased in case of heavy texture classes.There are many finely scaled texture classes in the Estonian system.We assigned USDA texture classes based on the now defined numerical values for the sand, silt and clay fractions.Table 2 shows examples of the rules.For each texture code, the table provides the combination of sand, silt, and clay contents.These texture transfer rules to select USDA particle size distributions from the Estonian texture classes were composed by the authors according to Estonian guideline "Field Soil Survey -Muldade väliuurimine" (http://pk.emu.ee/yldinfo/uudised/uudis/2013/02/20/muldade-valiuurimine)where matches of Estonian/Soviet and USDA/FAO classes for field survey is provided.It is not possible to retrospectively redefine minor differences in boundaries between different classes between texture systems, but we consider natural variation of texture within the soil mapping unit in scale 1:10 000 more significant than that of different texture systems.In additional we introduced two more classes beyond the well-known USDA textures classes, i.e. "PEAT" and "GRAVELS".The former states that this soil unit is a peatland, where the peat layer thickness is at least 30 cm.For hydrological modelling reasons we decided to still assign sand, silt and clay fractions to these units in order to provide a continuous hydrological soil surface.To soil units with the class "PEAT" a high clay content was assigned in order to represent the low vertical conductivity at the bottom theses peat bogs.
However, for applications that critically evaluate clay content for soil units, the additional "PEAT" texture class (in the LX_TYPE1-4 variable) can be used to apply additional rules to mask these soil units accordingly.The latter class "GRAVELS" is intended to demark soil units or discrete layers therein, where only a coarse fragment type but no fine textures have been coded in the original texture codes.In these cases, depending on the type of the coarse fragment the layer can consist of gravels, large rocks or massive rock.We stored the extracted Estonian fine earth type (texture class) per layer in the variable EST_TXT# (layer 1-4).For each Estonian texture class we assigned the related USDA texture class to the parameter LX_TYPE# (layer 1-4).
Similar to the Estonian texture classes there exist Estonian stoniness classes that describe a certain type of coarse fragments within the soil profile.An additional number in connection with this rock type identifier indicates the amount/volume of these rocks in 1kg of soil.We used this indicator number to designate numerical values for the coarse fragments.Table 3 shows how we derived the rock content from the coarse fragments indicator that we obtained from the soil map encoding.
This first and fundamental step concluded with a set of variables for each mapped soil unit that include now separate standardised Estonian and English/USDA texture classes per soil layer, number and depths of layers of the mapped soil unit and numerical values for fine earth and coarse fragments fractions per layer.In addition to the evaluation of layer and depth values we assign the extracted Estonian fine earth type and the related USDA texture class per layer in the variable EST_TXT# (layer 1-4, Estonian texture class) and LX_TYPE# (layer 1-4, USDA texture class).The complete workflow is coded in the supplemental materials (Kmoch et al., 2019b; "01_soilmap_soiltypes_textures_layers.ipynb").

Standardising soil types and assigning WRB soil reference groups
We used the main soil types from the soil type reference sheet that accompanies the Soil Map of Estonia to standardise derive the soil type fields in the spatial dataset and added several widely-used soil types that were not in the original reference list.We developed a short algorithm in Python to find the best match from the soil type reference list (Kmoch et al., 2019b; "01_soilmap_soiltypes_textures_layers.ipynb").The algorithm progressively shortens the name from the right andthat compares the results Estonian with the "Soil type" field and tries to find the name in a in the database (Kmoch et al., 2019a; "soil_types_error_rules_lookup.xlsx") that captures more than 300 entries that provide a soil type substitute code from the extended 135 soil types from the soil reference list.The soil types and the Estonian soil names were then related to the FAO World Reference Base (WRB) soil reference groups (FAO, 2015) after the data have been corrected and standardised for each map unit in the extended soil dataset based on expert input (BioSoils, EUR 24729 EN, https://www.icpforests.org/pdf/manual/BioSoil/Soil_sampling_BioSoil.pdf).An exemplary except is demonstrated in Table 1.The finalised table of the standardised 135 most used main soil types is provided as supplemental spreadsheet (Kmoch et al., 2019a;"soil_types_legend.csv").

Deriving depths and layers
The mapped soil units also encode variations in the soil profile within a given soil unit.In Subsection 2.2.2, we processed the textual code descriptions to compile the exact Estonian texture types in a standardised and readily available data structure.A base assumption is that most soils in Estonia were sampled to a depth of 1 m, as this is the case for a default soil profile.If larger or smaller depth information was encoded in the original soil texture code, then this would be used for the overall depth of that soil sample.For each of the layers, we can also read the analysed depth from the soil surface to the bottom of each layer.We defined parameters SOL_Z# (layer 1-4) for each layer.We defined NLAYERS (the number described of layers in the profile) and depth per layer (SOL_Z#), and we derived the maximum soil depth (SOL_ZMX), which represents the maximum depth of the soil profile (mm).We eliminated additional soil parts from the dataset if their resulting layer thickness would be zero.An additional pragmatic decision was made to exclude cumulative vertical soil parts if their depth could not be reliably inferred.For example, "sand/loamy sand" indicates two layers, separated by "/".The base assumption is that the profiles have been sampled to the depth of one metre when no additional depth information is available.Therefore, for the given example, no depth information is available for the second layer ("loamy sand").In these cases, we decide to drop the second layer and assign the full depth of 1m for the first layer "sand".Another example is, where the first layer depth would be indicated as a range of 70-110 cm.In this case to derive a single number, the average will be taken resulting in 90 cm for the first layer.The remaining 10 cm filling up to 1m can be assigned to second layer.The Soil Map of Estonia holds depths in centimetres, and for widely used conventions we convert the depths from cm to mm in this step.

Evaluation and validation of extracted texture values and soil reference groups
We used two other sources of cross-validation to confirm the accuracy of the derived values.First, we used a manually "decoded" part of the Estonian Soil Map for Tartu county.Tartu County covers about 10% of Estonia and offers a representative subset of the data, as it includes many different soil types, peat bogs, forest, and arable land.It contains 83 364 records.Several members of our research group cleaned and standardised the data on soil types, textures, and depth ranges over the course of several months and collated the results in a spreadsheet.We then compared the software's results with the manual classification results.Each soil unit in question was interpreted by at least two experts, and when their classifications differed, they discussed the difference until they achieved consensus about the correct classification.
Second, we used the SoilGrids250m.For that, we loaded and averaged the raster layers for the provided seven depths from the SoilGrids250m data for the sand, silt, and clay, coarse fragments, bulk density and soil organic carbon contents and saturated hydraulic conductivity from EU-HydroSoilGrids into the layers of the EstSoil-EH datasets..For each parameter, we calculated descriptive statistics and plotted value distribution and an overview of the spatial pattern of the EstSoil-EH parameters against the SoilGrids250m/EU-HydroSoilGrids values to assess the differences (Table 5 and Figure 9).We observed strong similarity in the general patterns.However, the variances ranged from 10 to 30%.One main cause for this high variation is the definition of discrete values for the Estonian texture classes in our data and the more continuous distribution of raster values in the SoilGrids250m dataset.

Adding topographic variables as predictor variables
For the subsequent step of SOC prediction via the Random Forest machine-learning model, we calculated mean, median and standard deviation of several topographic and environmental variables as additional predictor variables.
Topographic variables slope, Topographic Wetness Index (TWI), Terrain Ruggedness Index (TRI), and LS-factor were all calculated by using SAGA-GIS software based on a digital elevation model (Conrad et al., 2015).The LiDAR-based Digital Elevation Model with resolution 1 m was obtained from Estonian Land Board.

Topographic Wetness Index (TWI)
The TWI is a topo-hydrological factor proposed by Beven and Kirkby (Beven and Kirkby, 1979) and is often used to quantify topographic control on hydrological processes (Michielsen et al., 2016;Uuemaa et al., 2018) which also are relevant in the soil evolution.TWI controls the spatial pattern of saturated areas which directly affect hydrological processes at the watershed scale.Manual mapping of soil moisture patterns is often labor-intensive, costly, and not feasible at large scales.
TWI provides an alternative for understanding the spatial pattern of wetness of the soil (Mokarram et al., 2015).It is a function of both the slope and the upstream contributing area: where a is the specific upslope area draining through a certain point per unit contour length (m 2 m -1 ), and b is the slope gradient (in degrees).

Terrain Ruggedness Index (TRI)
TRI reflects the soil erosion processes and surface storage capacity which again is relevant from a soil evolution perspective.The TRI expresses the amount of elevation difference between neighbouring cells, where the differences between the focal cell and eight neighbouring cells are calculated: where xij is the elevation of each neighbour cell to cell (0,0).Flat areas have a value of zero, while mountain areas with steep ridges have positive values.

LS-factor
The potential erosion in catchments can be evaluated using LS factor as used by the Universal Soil Loss Equation (USLE).The LS factor is length-slope factor that accounts for the effects of topography on erosion and is based on slope and specific catchment area (as substitute for slope length).In SAGA-GIS the calculation is based on (Moore et al., 1991): where n=0.4 and m=1.3.

Drainage area per mapped soil unit
In addition, we calculated the area per mapped soil unit in m 2 (area_drain) and in percent of area, which is under drainage (drain_pct).The drainage regimen considered both underground tile drainage and ditch based drainage systems.The drainage information used for this was compiled based on the Estonian Topographic Data Set (ETAK) and the official register of drainage systems (https://portaal.agri.ee/avalik/#/maaparandus/msr/systeemi-otsing)managed by the Agricultural Board of Ministry of Rural Affairs of Estonia.All the variables were calculated using the GIS software packages QGIS and SAGA.

Predicting Soil Organic Carbon (SOC) and Bulk Density (BD)
The main information described in the Soil Map of Estonia is the soil type and the soil texture.However, soil hydraulic properties and SOC data are needed for many different applications in soil hydrology.Pedotransfer functions (PTFs) have proven to be useful to indirectly estimate these parameters from more easily obtainable soil data (Van Looy et al., 2017).
Therefore, several soil parameters like soil organic carbon, bulk density and saturated hydraulic conductivity must be derived via PTFs and other data assimilation methods.To apply PTFs and other data-assimilation methods, third-party datasets can be used as secondary sources.In the previous steps we have prepared a wide set of input variables, including the numerical fractions for the textural properties, standardised classes for soil type and soil textures, and additional topographic variables, which we can apply as predictor variables to model the value distribution for SOC and BD.We develop these two extended soil physical input parameters as organic carbon content in % soil weight (SOL_CBN# layer 1-4), and dry bulk density in Mg/m³ or g/cm³ (SOL_BD# layer 1-4).
In order to map the spatial distribution of SOC in Estonia a machine learning model random forest (RF) was used to predict SOC based on parameters derived from the soil map.RF was preferred to more advanced ML algorithms (e.g., neural networks) because it has shown to be relatively resilient towards data noise and not require preliminary hyperparameter tuning (Breiman, 2001;Caruana and Niculescu-Mizil, 2006).In addition, feature importances can be extracted from the model to determine the most influential predictor variables.
For training, we used measurements of soil organic matter (SOM) or soil organic carbon (SOC) from forest areas (samples sizes: n=100), 4 datasets of samples from Estonian open and overgrown alvars and grasslands (n: 94, 137, 146, 69), peatlands (n=175) and from arable soils transects (n=8964) resulting in 3373 distinct point locations (Kriiska et al., 2019;Noreika et al., 2019;Suuster et al., 2011).Where necessary, the SOM values were translated into SOC via: SOC = SOM / 1.724.Many samples from peatlands and arable fields were often sampled within the same mapped soil unit.For these soil units (polygons) the respective soil measurement data was averaged and joined to the respective soil units to reduce the bias of the prediction.After joining the sample size reduced to the 397 distinct training samples for machine learning (Figure 2).
This data was then randomly split into training (60%) and test (40%) sets and the model was evaluated by predicting SOC based on the predictor variables of the test set.Finally, the model was applied to soil map polygons without available SOC measurements to predict SOC content in Estonian soils.
Subsequently, we calculated soil bulk density based on texture values and predicted soil organic carbon for each layer in each mapped soil unit polygon, with following PTF (Adams, 1973;Kauer et al., 2019), which has been successfully applied in Estonia: where: SOM = SOC × 1.724 The conversion factor 1.72 is a widely used universal value.However, we acknowledge that the real value varies slightly between soils.

Assimilation of additional hydrological variables
In order for this dataset to be more useful in eco-hydrological modelling we developed and added two additional hydrological variables.Saturated hydraulic conductivity (Ksat) relates soil texture to a hydraulic gradient and is quantitative measure of water movement through a saturated soil.In addition to the ability of transmitting water along a hydraulic gradient we also add available water capacity (AWC) as a variable.AWC describes the soil's ability to hold water and quantifies how much of that water is available for plants to grow.We develop two variables saturated hydraulic conductivity (mm/hr, SOL_K# layer 1-4), and available water capacity of the soil layer (mm H₂O/mm soil, SOL_AWC# layer 1-4).We calculated Ksat using the improved Rosetta3 software, which implements a pedotransfer model with improved estimates of hydraulic parameter distributions (Zhang and Schaap, 2017).It is based on an artificial neural network (ANN) for the estimation of water retention parameters, saturated hydraulic conductivity, and their uncertainties.For each standardised texture class from Table 2, we used the numerical fine earth fractions for sand, silt and clay as inputs for the Rosetta3 software and calculated Ksat for each layer in each mapped soil unit polygon.We provide a copy of the Rosetta3 program in the supplemental materials.(Kmoch et al., 2019b; "Rosetta-3.0").
In order to calculate available water capacity, we summarized the field capacity (FC, at −330 cm matric potential −0.03 MPa) and wilting point (WP, at −15,848 cm matric potential −1.5 MPa) variables of the 7 soil depths of the EU-SoilHydroGrids 250m resolution raster datasets (Tóth et al., 2017) for each mapped soil unit for the provided depths of 0, 5, 15, 30, 60, 100, and 200 cm.The available water capacity is then calculated for each of the 7 depths by a raster calculation: AWC = FC -WP (Dipak and Abhijit, 2005).The resulting 7 AWC raster layers are then averaged into the respective depth ranges for each of the discrete layers of the Estonian mapped soil units.The Python code of the process for the extraction of FC and WP from the EU-SoilHydroGrids is provided with the supplemental materials (Kmoch et al., 2019b;"05_hydrogrids_extents_and_awc_extract.ipynb").

Results
In this study, we developed a Python module that is capable of analysing, extracting, and standardising the soil type and soil texture data from the official Soil Map of Estonia into a reusable, reproducible soil dataset that uses World Reference Base and FAO soil classes and texture descriptions.Figure 4 shows a map of the classified topsoil texture classes derived from the original Estonian texture codes.In addition, it shows the peat soils that cover up to 20% of Estonia, and are an important soil type in such northern countries.
To make such soil information usable in an eco-hydrological modelling context, we derived numerical values for each of the soil units.These values include the number of discretized soil layers (NLAYERS) -a maximum of 4 separate vertical distinct soil layers where described in the original texture codes -the depth of each layer (SOL_Z1-4), and the maximum depth of the sampled profile for each mapped soil unit (SOL_ZMX).Based on the layer information and the extract texture classes we could define the percent fractions per volume of sand (SOL_SAND1-4), silt (SOL_SILT1-4), clay (SOL_CLAY1-4), and coarse fragments (SOL_ROCK1-4) per layer.Figure 5 shows the percent fractions for sand, silt, clay and coarse fragments for the top layer.Table 6 contains the full list of variable and parameters per mapped soil unit contained in the EstSoil-EH dataset.

Validation of soil type and texture classes extraction and standardisation
For the main soil types, we achieved 97.7% agreement between the software's result and the manual classification.
The manual verification of the validation revealed several re-labelling issues from the error lookup table.A visual assessment by two soil sciences senior research staff asserted that the level of similarity of the soil types that were selected by the automated process were closely related.However, the mismatches (1943 records, equivalent to 2.3% of the total records) indicated that the soil experts tended to interpret "errors" based on personal knowledge that may not be reproducible in a strictly automated fashion.For example, some landforms (e.g.eroded material filling low slopes or collapsed cliffs) were originally classified as exceptions to the general classification rule based on the local knowledge of the landscape.When standardising these expert interpretations with the same more general soil type, we reduced the number of mismatched soil type identifiers to 0.
Furthermore, it should be emphasised that humans tend to make mistakes when performing repetitive procedures.Therefore, we consider the high accuracy (97.7%) to be a very good result.
For the validation of textures, we used several steps.First, given the high agreement between the software-generated codes and the human-generated codes, we accepted the software's texture codes for use in our subsequent evaluations.Next, we compared the extracted main texture for each layer with the manually coded value:  77 870 of 83 364 records (93.4%) showed identical parsing of the full texture code  71 635 of the records (85.9%) showed identical interpretation of the first layer's texture type (10 312 records were differently coded, and 1417 produced "no value" errors, in which either the source or validation dataset contained no value, preventing a comparison with the other dataset's value)  65 000 of the records (78.0%) showed identical interpretation of the second layer's texture (with 2325 differently coded textures, and 16 038 "no value" errors, of which 15 461 occurred in the automated processed new dataset, and only 577 occurred in the validation dataset)  82 507 of the records (99.0%) showed identical interpretation of the third layer's texture (with most errors caused by a non-existent third layer, 334 differently coded, and 523 with a "no value" error) For sand, silt and clay fractions we could obtain laboratory analysis only for forest soil samples.We calculated the root mean squared error (RMSE) and chose the Normalized Median Absolute Deviation (nMAD) as an additional measure of dispersion of error for non-gaussian distributed data:  RMSE for sand: 13.1 %  nMAD for sand: 9.68  RMSE for silt: 10.7 %  nMAD for silt: 7.0  RMSE for clay: 6.5 %  nMAD for clay: 3.9 Our manual assessment of the mismatches indicated the same problem that occurred with the soil types.The expert assessments aimed to keep as much information as possible available in their decoded classification, and this did not always agree with the automated processing rules.Furthermore, the complexity of the Estonian texture rules and the reliance on human judgement creates high uncertainty in some cases, even for human interpretation.In addition, to derive the grammar rules, we added a few simplifying elements, such as omitting some rarely used additional information in the soil texture descriptions.For example, the Estonian rules allow specification of several soil parts, but as a horizontal distribution within the same mapped soil unit rather than as vertical layers.This is understandably complex, making it difficult to classify this variable soil as a single soil unit.Consequently, it is inevitable that some of these descriptions will not agree with the software's classification.

SOC prediction and validation of Random Forest model
We also calculated several extended soil properties, i. Figure 6 shows the predicted values of SOC and BD for the top layer.On visual inspection the spatial distribution for the SOC content matches comparatively well with known agricultural areas, where low carbon content prevails, as well as with the peat land areas, which have a very high carbon content.
For further description and guidance on errors in the predictions for SOC and BD we calculated the RMSE and nMAD as an additional measure of dispersion of error for non-gaussian distributed data.BD observed data was only available for arable lands and forest soil samples, and should be treated accordingly.
 RMSE for SOC predictions: 2.95 %  nMAD for SOC: 1.44  RMSE for the subsequent BD predictions with PTF: 0.33 g/cm³  Normalized Median Absolute Deviation (nMAD) for BD: 0.15 However, due to the small number and distribution of input samples over four distinct landscapes, namely arable lands, wetlands, forests and open/grass lands, we broke down the error distribution for these for land forms in Table 6.The prediction error characteristics differ, with the smallest errors for arable lands, then wetlands and the largest errors for open grasslands and forest.

Extended Hydrological variables results
Based on the variables derived in previous steps, we could also calculate soil hydraulic parameters, such saturated hydraulic conductivity (Ksat) based on the sand, silt and clay content.and aAvailable water capacity (AWC) was calculated solely by aggregating EU-SoilHydroGrids data of field capacity and wilting point.Figure 7 shows the spatial distribution of Ksat andAWC for the first layer.Rosetta reports the standard deviation for its internal prediction process, which draws many samples for the same input of sand, silt and clay content and then provides the mean as the predicted value for K.The summary of Tthe predicted Ksat values and the standard deviation is summarized in Table 7. fFor peat areas and wetlands the predicted values also corresponds with ranges reported in the literature for the sand-silt-clay ratios provided (Gafni et al., 2011).The USLE K erodibility factor for each soil unit was also calculated (Figure 8).We compiled all parameters into a dataset that can now be easily used with SWAT or other eco-hydrological and land-use-change models.As we are not changing the general geometry or underlying spatial data model of the original soil map, all parameters are only added to the existing mapped soil units and thus, all original soil polygons remain discernible.The dataset (Kmoch et al., 2019a; doi:10.5281/ZENODO.3473290)and the software codes (Kmoch et al., 2019b; https://zenodo.org/record/3473210)have been deposited online.

Discussion and Future Work
The Soil Map of Estonian is a valuable resource for hydrological, ecological and agricultural studies.It is widely used in Estonia.But before our analysis, a large amount of the dataset's information was not readily usable beyond the field or farmscale because of the need to manually interpret the specialised soil types and the complexity of the rules that describe the texture or other characteristics of the soil units.The presenteddeveloped dataset is of very high spatial detail based on the original Estonian national soil map, which was created from directly surveying all of Estonia.Thus, our presented dataset holds to potential to further improve our understanding of eco-hydrological processes in the landscape through the use of advanced numerical and process-based models.The derived information is much more spatially related to the landform/landuse observed there than any other dataset covering soil information for Estonia.Furthermore, the textures and SOC/BD values are directly derived from reliable observed data samples from Estonia, with a reproducible workflow which is unique in the case of Estonian soil datasets, whereas this is not true for many other reported soil datasets that cover the area of Estonia.
Furthermore, the open access availability and transparency of measurement data can provide a reliable building block for advancing studying soil and hydrological processes in Estonia into temporal aspects.tEspecially, properties such as SOC and BD will vary extensively depending on the land use and land cover.In combination which developments that capture also the dynamics of land use change and adaptations under climate change, the evolution of the soils in Estonia could more readily be investigated.
The method created to translate original hard copy soil map (with traditional textual codes) to a digitally readable numerical GIS-based map can be used by several other countries (e.g.Latvia, Lithuania, Ukraine etc) which enables spatially more explicit modelling of ecosystems.We used a multi-step approach to derive a generalised and standardised numerical dataset that will have many potential applications for users of Estonian soil information, including support of economic, agricultural, or environmental management and input for decision-making and to support more reproducible scientific research based on Estonian soil information.Until the present analysis, numerical and process-based eco-hydrological modelling with tools and models like SWAT or the Regional Hydro-Ecologic Simulation System (RHESSys) were greatly hampered by the need to manually derive useful values from the Soil Map of Estonia to support the modelling.
One challenge in terms of validation is that the datasets we used for validation, e.g.SoilGrids and EU-SoilHydroGrids are informed to some extent by previous reportssamples about of Estonian soil characteristics that are not necessarily more accurate than the results of our classification.Although we accounted for this problem by providing additional comparisons, the scale mismatch between continuous raster datasets and polygon-based data inevitably introduced errors and trade-offs into the comparison.One solution to these problems would be to perform supplemental field sampling to ground-truth the source data and confirm the accuracy of our model's classification based on the field data.
A direct interpretation on the derived discrete layer information as soil horizons needs should not be generalized but checked on per case basis.From the point of end-user, the first layer is not a default 30 cm deep top soil layer.A direct interpretation of the derived discrete layer information as soil horizons should not be generalized but checked on per case basis.
All physical, chemical and hydraulic properties are based on the analysis of the original texture code per mapped soil units and the resulting discrete layers per unit.This is an important usage constraint, for example in sense of biological activity, as 30 cm soil layer is most active, but for each soil unit it needs to be checked which layers extend into which actual depths.Also the SOC content and BD are not modelled in a vertical continuum but per discrete values per unit and layer.However, fertile soils, like Luvisols contain a lot of SOC also in deeper layers.But such additional expert knowledge is not encoded in original Soilmap of Estonia, nor in the processing algorithms that derived the extended parameters for this newly generated dataset.
However, such additional knowledge, as well as more appropriate models for peatland areas, could be included as additional rules in a subsequent improvement of this dataset.Kõlli et al. (2009) published estimates of the SOC stocks for forests, arable lands, and grasslands and for all of Estonia.
Nevertheless, they constrained their finding by noting that their estimates were calculated based on the mean SOC stock for each soil type and the corresponding area in which the soil type was distributed.Putku (2016) used the large-scale Soil Map of Estonia at the polygon level for SOC stock modelling for mineral soils in arable land of Tartu county.Carbon content calculations in Estonia have historically been predominantly made for soils in agricultural areas.Exisiting literature and our results in summary are in line with SOC distribution per soil type in mineral soils in arable lands (E.Suuster et al. 2012).
The original purpose of this dataset was to derive values for hydrological modelling purposes and at the same time to stay as close to the original data as possible.From that perspective peat soil units are currently modelled with assumptions to have a similar behaviour to clay hydrologically.Therefore, the spatial distribution of clay percentage in particular, but also the concurrent physical fractions of sand and silt do not make scientific sense for these areas where peat is prevalent.In order to make the dataset as useful as possible and to identify peatland areas, we introduced the additional class "PEAT" into the USDA classification.While sand, silt, clay and rock content are directly derived values from the original texture codes, SOC and Ksat are modelled via statistical machine-learning algorithms, which include additional uncertainty.This should be considered when evaluating BD and USLE K, which are calculated using SOC as an input variable.
The only variable which we did not model based in dependence of already modelled parameters is AWC.Here we summarised the EU-SoilHydroGrids 250m (Tóth et al., 2017) raster datasets for FC and WP as inputs an external data integration.This is not ideal and can be considered a trade-off between introducing too much uncertainty and an external unrelated data source.
In the future, we foresee step-wise improvement of our software by developing better PTFs to estimate parameters and to better integrate the presence of peat soils and other specific landscapes and environments in Estonia.Furthermore, statistical machine-learning or neural network and deep learning methods could be tested in order to improve soil classifications and express more complex relationships between soil types and textures.Currently, one specificity of the newly created EstSoil-EH dataset is its discrete nature, as we are only adding derived numerical variables to the existing mapped soil units (polygons).We do not predict a continuous surface in this study, thus, comparisons with continuous surface parameters predicitons such as in SoilGrids (Hengl et al., 2017) or EU-SoilHydroGrids (Tóth et al., 2017), are not directly possible.
However, the workflow could potentially be extended also for creating continuous surface.With appropriate modification (e.g., to use the soil characteristic codes more consistently for a different country), our methodology could also be applied in other countries such as Lithuania or Latvia that share similar historical land-and soil surveying practices.
P2 L14: please reduce repetition of soil and terrain Removed.P2 L15: Please add meaning of ESD.Added.P2 L20-21: … soil information derived with machine learning methods … Edited.
P5 L1: … based on organoleptic field judgement (feel methods) and … Edited.P5 L6: … combinations have been described considering the texture of soil layers … is it correct?That was not clear to us.The sentence does not exist anymore.P6 L2: the following can be deleted: "instead of 9, 21 (9+12) or 108 (9x12)".It is not clear why 87240 unique values are recorded in the previous dataset.Does it come from the combination of soil type + texture + layering + level of erosion + slope position -similarly to the explanation you provided under previous answers?These technical descriptions have been moved into a technical supplement.But your assumption is correct.P6 L6: … to derive … Edited.P6 L19: please provide reference literature for translating Estonian soil types to WRB reference soil groups and qualifiers.Added as mentioned above already.P8 L27: … to USDA texture classes … is it correct?Yes.P9 L13: How did you calculate the accuracy of organoleptic determination of clay content?Through the organoleptic determination didn't you determine the soil texture class?Or do you determine directly the clay content?Please clarify it.


e. SOC content and BD.The RF regression model was implemented with the RandomForestRegressor function from the Scikit-learn Python library.The model was evaluated by predicting SOC based on the predictor variables of the test set for the 60:40 split.Figure 3 illustrates the cross-validation scatterplots of observed vs. predicted SOC values for the test/validation sample splits.Following characteristics are reported for the chosen RF model:  coefficient of determination (R 2 ) score: 0.69  score of the training dataset with out-of-bag estimate (oob score): 0.58 Pearsons r correlation coefficient, training: 0.90, validation: 0.83 RF feature importances, top 6:  Clay content (SOL_CLAY1): 0.65  Terrain Roughness Index, standard deviation (tri_stdev): 0.04 Sand content (SOL_SAND1): 0.03  LS-factor, median (ls_median): 0.028 Area under drainage in percent (drain_prct): 0.027 Coarse fragments rock content (SOL_ROCK1): 0.024

Figure 1 :
Figure 1: Flowchart for executed processing steps grouped into the four major work packages

Figure 4 :
Figure 4: USDA topsoil textures derived from the original Estonian texture codes by the software developed in the present study, including additional classes "PEAT" and "GRAVELS".

Figure 5 :
Figure 5: Physical soil properties: assigned soil texture fractions of a) sand, b) silt, c) clay, and d) coarse fragments in the first soil layer based on texture classes.

Figure 6 :
Figure 6: Extended physical soil parameters: a) predicted soil organic carbon (SOC) and b) bulk density (BD) of the first layer.

Figure 7 :
Figure 7: Soil hydraulic parameters: a) saturated hydraulic conductivity (K_sat) and b) available water capacity (AWC) in the first layer.

Figure 8 :
Figure 8: Calculated USLE K erodibility factor based on the numerical parameters derived from the Soil Map of Estonia dataset.

Figure 9 :
Figure 9: Comparative overview plots between EstSoil-EH and SoilGrids-based variables, value frequency and spatial distribution (darker is higher).Full resolution plots (histograms and spatial) for all variables are included in the data supplement.
Van Looy et al. (2017)of soil properties from the SOC content, and found that SOC was closely related to soil bulk density (BD) and porosity.Suuster et al. (2011)emphasized the importance of BD as an indicator of soil quality, site productivity, and soil compaction and proposed a PTF for the organic horizon in arable soils.Van Looy et al. (2017)reviewed existing PTFs and documented the new generation of PTFs that have been developed by different disciplines of Earth system science.They emphasized that PTF development must go hand in hand with suitable extrapolation and upscaling techniques to ensure that the PTFs correctly represent the spatial heterogeneity of the soils.Abdelbaki (2018) evaluated the predictive accuracy of 48 published PTFs for predicting BD using State Soil Geographic (STATSGO) and Soil Survey Geographic (SSURGO) soil databases from the United States.They also proposed and validated a new PTF for predicting BD using SOC inputs.
Kõlli et al. (2009), 2016)s for SOC have been difficult to obtain due to a lack of global data on the SOC content of each soil type(Eswaran et al., 1993).Very few SOC datasets are available for countries or regions.For example, the Northern Circumpolar Soil Carbon Database(Tarnocai et al., 2009; https://bolin.su.se/data/ncscd/) was developed to describe the SOC pools in soils of the northern circumpolar permafrost region.SOC stocks were also predicted under future climate and land cover change scenarios using a geostatistical model for predicting current and future SOC in Europe(Yigini and Panagos, 2016).Kõlli et al. (2009)published estimates of the SOC stocks for forests, arable lands, and grasslands and for all of Estonia.