Mapping Surﬁcial Soil Particle Size Fractions in Alpine Permafrost Regions of the Qinghai–Tibet Plateau

: Spatial information of particle size fractions (PSFs) is primary for understanding the thermal state of permafrost in the Qinghai-Tibet Plateau (QTP) in response to climate change. However, the limitation of ﬁeld observations and the tremendous spatial heterogeneity hamper the digital mapping of PSF. This study integrated log-ratio transformation approaches, variable searching methods, and machine learning techniques to map the surﬁcial soil PSF distribution of two typical permafrost regions. Results showed that the Boruta technique identiﬁed different covariates but retained those covariates of vegetation and land surface temperature in both regions. Variable selection techniques effectively decreased the data redundancy and improved model performance. In addition, the spatial distribution of soil PSFs generated by four log-ratio models presented similar patterns. Isometric log-ratio random forest (ILR-RF) outperformed the other models in both regions (i.e., R 2 ranged between 0.36 to 0.56, RMSE ranged between 0.02 and 0.10). Compared with three legacy datasets, our prediction better captured the spatial pattern of PSFs with higher accuracy. Although this study largely improved the accuracy of spatial distribution of soil PSFs, further endeavors should also be made to improve model accuracy and interpretability for a better understanding of the interaction and processes between environmental predictors and soil PSFs at permafrost regions.


Introduction
Soil particle size fractions (PSFs), including sand, silt, and clay contents, are essential physical characteristics affecting many physical and chemical properties. They have been commonly used as primary inputs to determine hydrothermal properties in the land surface models to simulate hydrological, ecological, and environmental processes [1][2][3]. In recent decades, permafrost on the Qinghai-Tibet Plateau (QTP) has been undergoing degradation due to recent climate change [4,5]. The permafrost degradation has exerted impacts on hydrology and energy balance, carbon cycle, and engineering infrastructure [5][6][7]. Therefore, many concerns have been raised for understanding, assessing, and predicting

The Environmental Covariates
In this study, we used a total of 160 covariates to represent environmental conditions ( Table 1). The climatic covariates include precipitation, air temperature, and land surface temperature (LST) over 2003-2011. The precipitation and air temperature were obtained from the national Tibetan Plateau data center [42], and LST was obtained from the MOD11A1 and MYD11A1 datasets of moderate resolution imaging spectrometer (MODIS) observations. The daytime and nighttime LST were filtered according to the quality indicators of MOD11A1 and MYD11A1 [43]. Hence, the LST difference was generated as well. The maximum, minimum, and mean of seasonal and annual daytime and nighttime LST, and its difference and mean over nine years, were calculated, as was  [36]. The permafrost zones transit from the large extent of continuous warm and ice-rich permafrost to the alpine and mountainous isolated permafrost in the east [36,37]. It covers an area of about 31,270 km 2 . Its altitude varies from 3770 m to 5190 m. The annual mean air temperature is −4.4 • C, and the mean annual precipitation is 435 mm, with less precipitation from the east to the west. The climatic conditions change from the semi-arid cold in the west to the semi-humid cold regions in the east. Consequently, the primary vegetation types vary from the alpine meadow, alpine steppe to alpine desert. BQ embodies all the soil types of the permafrost region in the eastern Qinghai-Tibet Plateau, and mainly developed on the alluvial and fluvial deposit and slope deposit [36,38]. The sampling sites of the two study areas were selected with full consideration of parent material, terrain, vegetation, human activities, and accessibility. Referring to the Standard of Soil Profile Description (created by the Institute of Soil Science, Chinese Academy of Sciences) and the Field Book for Describing and Sampling Soils [39], all the soil profiles have been well documented [36]. A total of 73 sites in the WQ region were visited for sampling from September to October in 2009 and 55 sites in the BQ region from July to August in 2011. Samples were air-dried and then passed through a 2 mm sieve. After averaging the samples that share the same covariate value, 59 samples and 49 samples were left for the WQ and BQ region, respectively. Since environmental covariates have more intense impact on the topsoil than the deep soil [40], this study focused on the soil PSFs of surficial horizon (0-30 cm). The sand (particles 2000-50 µm), silt (particles 50-2 µm), and clay (particles < 2 µm) contents were determined using the pipette method [41] for samples from the WQ region and samples from BQ were analyzed by laser diffraction (LS13320, Beckman Coulter Inc., Brea, CA, USA).

The Environmental Covariates
In this study, we used a total of 160 covariates to represent environmental conditions ( Table 1). The climatic covariates include precipitation, air temperature, and land surface temperature (LST) over 2003-2011. The precipitation and air temperature were obtained from the national Tibetan Plateau data center [42], and LST was obtained from the MOD11A1 and MYD11A1 datasets of moderate resolution imaging spectrometer (MODIS) observations. The daytime and nighttime LST were filtered according to the quality indicators of MOD11A1 and MYD11A1 [43]. Hence, the LST difference was generated as well. The maximum, minimum, and mean of seasonal and annual daytime and nighttime LST, and its difference and mean over nine years, were calculated, as was the maximum, minimum, and mean of seasonal and annual air temperature and precipitation. The digital elevation model (DEM) data were derived from the WorldClim database [44], and the DEM derivatives were computed using SAGA GIS v2. Vegetation conditions were represented by net primary productivity (NDVI), enhanced vegetation index (EVI), and gross primary productivity (GPP), which were derived from the MOD13A3, MYD13A3, MOD17A2H, and MYD17A2H datasets over nine years [43]. The annual NDVI, EVI, and GPP were processed by the maximum value composite approach. The vegetation and climate covariates in the sampling year and the multiyear average were extracted to the sampling sites. All the covariate layers were projected to the Albers conic equal area projection and resampled to 1 km by the bilinear method. Besides, a land cover classification dataset, namely the dataset of land cover in Northwest China from 1990 to 2010 [45], was used to mask out water bodies, residential areas, pavement, bare rock, snow cover, and glaciers.

Existing Datasets of Soil PSFs
Three existing PSF datasets were applied in comparison with our prediction. A China soil characteristic dataset (CSCD) [46] was developed by the polygon linkage method based on 8595 soil profiles and the soil map of China. The content of clay and sand was captured by two layers with a spatial resolution of 30 arc seconds (~1 km). In this study, the clay and sand content in the top horizon (0-30 cm) was used. A China dataset of soil properties for land surface modeling (CDSL) [10] was developed for the application in land surface modeling by the polygon linkage method as well. The vertical variation of each soil property was captured by eight layers, and the soil fractions of the top four vertical layers (i.e., 0-0.045, 0.045-0.091, 0.091-0.166, 0.166-0.289 m) were used in this study. The SoilGrids250m dataset (SG250) was developed by ensemble machine learning approaches with 150,000 sites spread over all continents [8]. It generated soil attributes at seven standard depths with a spatial resolution of 250 m, and the uppermost layers (0, 5, 15, and 30 cm depths) were generalized using Equation (1) in this study.

Conversion of PSFs in the Surface Layer
The original soil PSFs were recorded along genetic horizons in the United States Department of Agriculture (USDA) soil taxonomy. The PSFs from 0 to 30 cm depths were generalized using the following: where f (x) is a soil PSF (i.e., clay, silt, or sand) at the depth of 0-30 cm, x i is the soil PSF of the ith horizon, and d i is the depth of the ith horizon. The PSFs conversion of the CDSL and SoilGrids250m datasets followed the same rule.

Transformation of Compositional Data
Former studies have evaluated the performance of four commonly used log-ratio transformations. The additive log-ratio (ALR) transformation [12] was the most classic and widely used, but arbitrarily choosing the divisor in ALR transformation may result in problems in compositional data analysis. To solve the problems of ALR transformation, isometric log-ratio (ILR) transformation [15] was proposed, and ILR has been evaluated with satisfactory performance in soil PSF mapping [1,15]. Therefore, in this study, ALR and ILR transformations were employed. The ALR transformation is defined as: The ALR back-transformed equation is defined as: The ILR transformation [15] is defined as: and its back-transformed equation [47] is defined as: where ILR(x 0 ) = ILR(x D ) = 0 and i = 1, 2, . . . , D, which is then back-transformed to the original soil PSFs via the following equation: where x i represents the content of the soil PSFs. ALR(x i ) and ILR(x i ) represent the transformed values at the sampling site i. ALR(x i ) and ILR(x i ) represent the back-transformed PSFs, and D represents the dimension. The ALR and ILR functions were implemented with compositions package [48] in R 3.6.3 (R Development Core Team, 2020). The predictive results were back-transformed to three soil PSFs (i.e., clay, silt, and sand) using alrInv and ilrInv functions.

Covariate Selection Techniques
The Boruta method was applied to sort out all-relevant covariates for soil PSFs prediction. It is a wrapper method, developed on the basis of random forest (RF) algorithm. Therefore, similarly to RF, it can handle nonlinear as well as linear relationships [21,29]. To identify the irrelevant variables, the shadow attributes are created, and their importance is used to identify irrelevant variables [29,49]. Minimal-optimal features were selected to develop parsimonious models for PSF prediction. Instead of exhaustively searching all possible subsets of all relevant sets, which are computationally intensive and impractical, strategic subset selection techniques suggested by Xiong et al. [21] were employed. In this study, greedy forward (GF), greedy backward (GB), hill climbing (HC), and simulated annealing (SA) were implemented to compose the minimal-optimal sets [50]. The greedy algorithms, including GF, GB, and HC, search for local optima. The GF algorithm starts evaluating subsets with only one predictor variable. Then, by adding predictor variables one by one, GF finds its best subset. On the contrary, the GB algorithm begins with the whole set and drops predicting variables in turn until it finds its best subset. HC algorithm starts with a subset composed of arbitrarily selected variables. It finds a better subset by incrementally including or eliminating one variable only when a subset has higher predictive power. SA is a heuristic algorithm searching for global optima. It resembles the annealing procedure in metallurgy and starts from a certain higher "temperature" (i.e., initial temperature) to find the optima [51,52]. As the temperature decreases at a certain rate, the optima in the algorithm gradually stabilizes, but this may be a local optimum rather than a global one. To find the global optima, SA introduces a random update to a subset, making it possible to accept a worse prediction to escape local optima and reach the global optima [50]. These four covariate selection algorithms were implemented with Boruta [29], rpart [53], caret [54], and Fselector [55] packages in the R statistical Programming Language.

Development and Assessment of Predictive Models
On the basis of exhaustive and selected covariate sets, two log-ratio transformations (i.e., ALR and ILR) were combined with random forest (RF) and Cubist, consisting of four models (i.e., ALR-RF, ILR-RF, ALR-Cubist, and ILR-Cubist) for soil PSFs prediction ( Figure 2). RF is currently the most widely used machine learning approach in Digital Soil Mapping (DSM) [19,24]. It is an ensemble learning algorithm that improves its accuracy and robustness by combining the predictions of a random population of regression trees [17,21,56]. At each split, it randomly selects predictor variables to grow trees, while on the tree level, it randomly selects samples from the training set [56]. Cubist has recently increased in popularity in DSM [18]. Cubist is a rule-based regression technique that builds multivariate linear regression models at the terminus of a tree [57]. At the terminal node of a given tree, the final model shows a collection of machine learning regression models for calculating predicted values [24]. RF and Cubist were implemented with randomForest [58] and Cubist [59] packages in the R environment. The prediction of transformed and back-transformed soil PSFs was assessed by the 10-fold cross-validation and evaluated using the coefficient of determination (R 2 ), Root Mean Square Error (RMSE), and bias. Bias was calculated as the observed mean minus the predicted mean. The others were calculated as follows: where x i represents the observed PSF, x i is the simulated PSF, x i is the mean of x i , and n represents the sample size. R 2 and RMSE are used to measure the data dispersion and reflect the difference between the predictions and the observations. Bias can be used to detect the over fitting of the predictive models.
(RMSE), and bias. Bias was calculated as the observed mean m The others were calculated as follows: where xi represents the observed PSF, xi' is the simulated PSF, represents the sample size. R 2 and RMSE are used to measure reflect the difference between the predictions and the observat detect the over fitting of the predictive models.

Descriptive Statistics of Observations
Surface soil textures across the BQ region are mainly sandy l with a few samples of loamy sand ( Figure 3). While across th surface soil textures are loamy sand and sandy loam, and a few sa clay loam are less dominant (Figure 3). Table 2 provides desc regions; the average clay content (8.8%) at the BQ region is much sand (49.2%), and sand is the most variable fraction (SD = 14.9% region, sand content (76.5%) is much higher than that of clay (14 the most significant variation (SD = 8.7%). The sand content of W the soil texture is coarser than that of BQ.

Descriptive Statistics of Observations
Surface soil textures across the BQ region are mainly sandy loam, loam, and silt loam, with a few samples of loamy sand (Figure 3). While across the WQ region dominant surface soil textures are loamy sand and sandy loam, and a few samples of sand and sandy clay loam are less dominant (Figure 3). Table 2 provides descriptive statistics for both regions; the average clay content (8.8%) at the BQ region is much less than silt (42.0%) and sand (49.2%), and sand is the most variable fraction (SD = 14.9%). Meanwhile, in the WQ region, sand content (76.5%) is much higher than that of clay (14.0%) and silt (9.5%), with the most significant variation (SD = 8.7%). The sand content of WQ was much higher, and the soil texture is coarser than that of BQ.

All-Relevant Variable Set
In the BQ region, the Boruta technique retained a seri surface temperature), and precipitation covariates, composin clayalr ( Figure S1). The s2_dn_e_1 (the diurnal LST in summe the most relevant covariate to explain the clayalr variation composed of a series of vegetation, terrain, LST, and precipita NDVI_9 was the most relevant covariate. In terms of the ILR AR set for clayilr prediction composed of vegetation, terra  In the BQ region, the Boruta technique retained a series of vegetation, LST (land surface temperature), and precipitation covariates, composing the all-relevant (AR) set of clay alr ( Figure S1). The s2_dn_e_1 (the diurnal LST in summer of the sampling year) was the most relevant covariate to explain the clay alr variation. The AR set of silt alr was composed of a series of vegetation, terrain, LST, and precipitation covariates, in which the NDVI_9 was the most relevant covariate. In terms of the ILR-transformed fractions, the AR set for clay ilr prediction composed of vegetation, terrain, LST, and precipitation covariates ( Figure S1). Slope (Slp) was the most relevant covariate. Whereas for silt ilr , the s2_dn_e_1 was the most relevant covariate. None of the topographic covariates was relevant to ILR-transformed fractions. Moreover, the air temperature was not relevant to neither of the transformed PSFs in this region.
In the WQ region, vegetation and LST covariates were retained in AR sets to infer the variance of ALR-and ILR-transformed PSFs ( Figure S2). The GPP_1 and s4_n_i_9 showed the most robust relevance to clay alr and silt alr , respectively. The s4_df_e_1 and GPP_1 were the most relevant covariates of clay ilr and silt ilr , respectively ( Figure S2).
In this study, the common covariate categories selected in both study areas were vegetation and LST. The AR sets contained terrain and precipitation covariates in the BQ region but not in the WQ region. The difference of all-relevant covariate sets probably indicated the difference in the formation of soil PSFs in two study areas.

Minimal-Optimal Variable Set
In both study areas, the GB (greedy backward) sets were composed of similar covariates as AR sets to predict ALR-and ILR-transformed fractions. For the clay alr in the BQ region, HC excluded annual and winter LST on the basis of the AR set, SA (simulated annealing) further excluded vegetation covariates, and GF (greedy forward) merely retained three summer LST covariates. In terms of silt alr , HC (hill climbing) excluded terrain covariates, SA further excluded spring LST, and GF merely retained a vegetation covariate, two spring LSTs, and a precipitation covariate. For the prediction of clay ilr , HC excluded annual LST, SA excluded summer LST, winter LST, and precipitation, and GF merely retained a topographical covariate and a winter LST. Meanwhile, for silt ilr modeling, compared with the AR set, the SA technique excluded vegetation covariates, HC further excluded annual LST, and GF merely retained a vegetation covariate and a summer covariate (Table S1).
In the WQ region, based on the AR set of clay alr , HC excluded annual LST, SA excluded vegetation covariates, and autumn LST, while GF retained a winter LST covariate and an annual LST covariate. For the silt alr prediction, SA excluded autumn LST, HC further excluded vegetation covariates and annual LST, while GF retained an annual LST and two winter covariates. In terms of ILR-transformed fractions, HC excluded spring and annual LST for clay ilr prediction, SA further excluded summer LST, and GF retained two biotic covariates, and one spring LST covariate. For the silt ilr , HC excluded spring, autumn, and annual LST covariates, SA excluded spring, summer, annual LST covariates, and GF excluded vegetation and spring LST covariates (Table S1).

Assessment of Model Performance
According to the 10-fold cross-validation of the two ALR-transformed PSFs in the BQ region (Table S2), the RF (random forest) predictions based on AR (all-relevant variable) sets were better than the AV (exhaustive covariate) sets, and the best prediction of the two ALR-transformed PSFs were both based on the HC (hill climbing) sets. Meanwhile, for the predictions by Cubist, the AR sets performed better than AV sets for the prediction of silt ilr and worse for the prediction of clay ilr . As for the predictions with minimal-optimal variable sets, SA (simulated annealing) set had the best prediction on clay alr , and GB (greedy backward) set had the best prediction on silt alr . Both of them were better than the prediction based on AV and AR sets. As for the back-transformed PSFs, the ALR-RF model performed better than ALR-Cubist. According to the predictions by ILR-RF, the AR sets performed better than the AV sets. Among the minimal-optimal variable sets, SA and HC sets performed best on the prediction of clay ilr and silt ilr , respectively. As for the predictions by Cubist, AR set performed better than AV set on clay ilr and worse on silt ilr . Among the minimal-optimal variable sets, SA sets had the best predictions of both components. The ILR-RF model exhibited superior prediction compared to all the other predictions in the BQ region.
In the WQ region, AR sets outperformed AV sets by ALR-RF. Among the minimaloptimal variable sets, SA and HC sets had the best prediction on clay alr and silt alr , respectively, which were better than AR sets. As for the predictions by ALR-Cubist, AV sets outperformed AR sets. GB and HC sets had the best prediction on clay alr and silt alr , respectively, which were better than AV sets. According to the back-transformed PSFs, ALR-Cubist outperformed ALR-RF. According to the prediction of transformed PSFs by ILR-RF, AR sets performed better than AV sets. Among the minimal-optimal variable sets, GB and GF sets had the best prediction on clay ilr and silt ilr , respectively, better than AR sets. When using ILR-Cubist, AR sets performed better than AV set. Among the minimaloptimal sets, GF and SA sets had the best prediction on clay ilr and silt ilr , respectively, better than the AR sets, and outperformed all the other predicted soil PSFs in the WQ region (Table S2).
The model assessment on back-transformed PSFs was presented in Table 3. In terms of RMSE, RF (random forest) performed better than Cubist, and ILR transformation performed better than ALR in general. The prediction by ILR-RF was best in both study areas. For the prediction of clay ilr at the BQ region, SA retained vegetation (i.e., evi_1, evi_9, ndvi_1, and gpp_9), terrain (i.e., SWI and VRM), and LST (i.e., yr_n_i_9) covariates. In terms of the silt ilr prediction, HC retained terrain (i.e., s1_d_a_1, s2_df_a_9, s2_df_a_1 and s3_d_a_9) and climatic (i.e., pre_s1_9) factors. The SA and HC set covered all categories of covariates contained in AR sets of ILR-transformed PSFs. In the WQ region, GB retained vegetation (i.e., evi_1, evi_9, ndvi_1 and gpp_9) and LST (i.e., s1_d_i_9, s1_n_i_9, s2_d_e_1, s2_df_e_1 and s4_df_e_9) covariates for clay ilr prediction. Furthermore, GF only retained LST covariates (i.e., s2_df_a_1, s3_df_i_9, s4_d_e_9, s4_n_e_1, s4_n_i_9 and yr_n_i_9) for silt ilr modeling. The GB and GF sets contained all types of all-relevant covariates as well. Table 3. Best performance by the combination of log-ratio transformation, covariate selection, and machine learning approach in predicting soil particle size fractions. Predictions based on five covariate sets were different, even using the same prediction model. In general, all-relevant models had comparable performance to exhaustive models across all four prediction methods, indicating that the AR sets contained almost equivalent predictive power to exhaustive variables. Including the irrelevant variables in the exhaustive models resulted in essentially similar model accuracy but dramatically increased model complexity. Furthermore, compared with predictions on the AV or AR set, minimal optimization techniques helped improve the model performance (Table S2), even though the specific subset searching strategies were different in modeling methods and study areas. The improved predictive accuracy on the basis of minimal-optimal sets indicates that minimal optimization techniques can further decrease data redundancy on the basis of AR set. Furthermore, redundant variable information may interfere with the learning process, thereby reducing the prediction accuracy. Hence, covariate selection should be prior to model fitting in soil PSFs prediction.

Spatial Distribution of the Predicted Soil PSFs
Even though ALR and ILR models were developed on the basis of different covariate sets, they included the same categories of covariates, and the log-ratio models generated similar spatial patterns of soil PSFs (Figures S3 and S4). Higher clay and silt content were mainly distributed in the eastern part in the BQ region, while higher sand content was mainly distributed in the northwestern part. Soil PSFs showed a similar distribution trend at the WQ region, where higher clay and silt and lower sand content distributed in the southeastern part.

Covariates Most Relevant to Soil PSF Mapping
The covariate collection relies on a priori knowledge so that the explicit covariate exhaustive sets are often unlike [21,26]. Their selection determines the soil mapping performance to a great extent, especially when the number of soil samples is limited, but soil spatial heterogeneity is high [25]. The strategy of this study was to collect environmental covariates as comprehensively as possible, and the selection of key covariates was achieved by covariates reduction techniques.
AR (all-relevant covariate) sets were used to reveal the underlying process of soilenvironment systems of interest [21]. Although the specific covariates retained were differential among AR sets in this study, the Boruta technique retained vegetation and LST (land surface temperature) covariates in both BQ and WQ regions. This indicated that vegetation and LST covariates are the key covariates to identify the spatial variance of soil PSFs of the surface layer. LST is the crucial indicator for the distribution of permafrost [37], and vegetation has an interactive influence on the active layer [60]. Permafrost degradation on the QTP is in company with the shift in vegetation types and species composition [61,62], decrease in vegetation biomass, productivity, and species abundance [60,63,64]. Under a warming climate, degrading permafrost has profoundly and extensively affected alpine ecology [60]. With the degradation of alpine ecosystem, the soil nutrient declined and surface soil texture became rough [30,64].
With different search techniques, minimal-optimal sets were composed of different covariates. It is because the covariates selected by local optimization methods (i.e., GB, GF, and HC) depend on the initial set that the algorithm starts searching with. In general, each data reduction technique has its advantages. For instance, the GF models are the simplest with the least variable redundancy, while GB models retain the most variable information, reflecting more processes [21]. However, they all have their disadvantages as well. The GF models only reflect the major processes and often fell short in model performance. The SA algorithm escapes the local optima by accepting some models that have been degraded due to including or excluding additional sets of variables [52,65]. In this study, the selected covariates in minimal-optimal sets were different for modeling ALR-and ILR-transformed PSFs. However, they all involved all categories of relevant covariates. For example, in the BQ region, topographic factors were relevant for ALR-and ILR-transformed PSFs. None was retained to predict clay alr , while one topographic factor was retained to predict siltl alr . Topographic factors were also selected for clay ilr prediction while excluded for silt ilr prediction.
Former studies also emphasized the covariate selection in digital soil mapping, indicating covariate selection can help improve model performance [25,26]. However, we cannot explicitly assign a universal covariate reduction method for soil PSFs prediction because the predictions based on selected covariates can differ in PSF transformations, modeling techniques, and study areas. It is also the reason why we did not inverse the prediction of transformed PSFs on the basis of the same selection strategy. We prefer to apply and compare multiple covariate selection strategies for each transformed soil fraction based on our results.

Prediction Models
Considering the compositional nature of soil PSFs, we applied ALR (additive log-ratio) and ILR (isometric log-ratio) transformations before model fitting in this study. Some studies also reported application of the log-ratio transformations coupled with machine learning approaches. For example, Wang et al. [17] implemented the ALR and ILR transformation with the boosted regression tree (BRT), RF, and regression kriging for a river basin. The results indicated that machine learning approaches improved the mapping performance rather than regression kriging, and the ILR-RF model outperformed the ILR-BRT model. However, Amirian-Chakan et al. [1] cast doubts on the necessity of using log-ratio transformations in soil PSFs prediction. They predicted transformed and untransformed PSFs by RF and applied the predictions to estimate available soil water capacity (AWC) and the total amount of irrigation water (TIW). The results indicated no apparent difference in the prediction of PSFs, AWC, and TIW. Moreover, they suggested that the log-transformations may lead to biased estimation of PSFs, and the bias can further propagate to pedo-transfer functions. Therefore, they believed the untransformed PSFs were still valid for predicting AWC (or TIW). Some other studies may also support this concept, for they applied Digital Soil Mapping (DSM) without log-ratio transformations in soil PSFs mapping but gained fairly good accuracy [8,22]. More studies are still needed to testify to the rationality and the feasibility of log-ratio transformation in DSM at different scales.
Former studies used the variable importance of the RF algorithm to interpret the relation between covariates and response variables. For example, Ran et al. [5] found seven variables, including the freezing degree-days, thawing degree-days, leaf area index, snow cover days, elevation, soil moisture, and soil bulk density, are selected to estimate the mean annual ground temperature. The relative importance results in Wang et al. [17] showed that soil organic carbon, NDVI, elevation, precipitation, and temperature were the main predictors to explain the variability of transformed PSFs in the Heihe River basin (area = 14.67 × 10 5 km 2 ). According to the study [1] at a region in southwestern Iran (area = 4600 ha), the relative importance showed that MRVBF, NDVI, elevation, slope, and band 3 of Landsat 8 were the significant covariates in predicting clay, silt, and sand when using terrain attributes and optical images. While on a larger spatial scale, Akpa et al. [66] indicated that climatic elements (precipitation and temperature), Landsat8 bands (band 2 and 3), elevation, and geology were the most influential predictors for the distribution of clay and sand in Nigeria. Liu et al. [22] found that climatic elements (e.g., solar radiation, wind speed, temperature seasonality, daytime LST), altitude, and regolith thickness were the essential factors in predicting soil PSFs in China. Hengl et al. [8] illustrated that depth, climatic elements (precipitation, precipitable Water Vapor images, daytime, and nighttime LST), and terrain factors were the most influential covariates. For soil PSFs prediction, vegetation condition is more influential on a regional scale than a national or global scale. Climatic elements are more critical on a large spatial scale, such as national and global scales.
In this study, variable importance by the parsimonious ILR-RF models at both the BQ and WQ regions were given in Figures S5 and S6. The variable importance in RF algorithm was given with two aspects. One is the mean decrease in accuracy and the other is the mean decrease in Gini index when a particular predictor variable is removed [58]. The VRM (Vector Ruggedness Measure) and s2_df_a_9 (multiyear average of maximum diurnal LST difference in summer) were the most important covariates for clay ilr and silt ilr prediction in the BQ region, respectively ( Figure S5). In the WQ region, s4_df_e_9 (multiyear average of mean diurnal LST difference in winter) and yr_n_i_9 (multiyear average of annual minimum night LST) were the most influential covariates. Compared with the BQ region, the winter LST seemed to have a more substantial impact on soil PSFs prediction at the WQ region since a total of four winter LST covariates were used in the ILR-RF models. The vegetation (i.e., EVI, NDVI, and GPP) and LST (i.e., yr_n_i_9 and s2_df_a_1/9) covariates are the common predictors used in ILR-RF models at both regions, indicating vegetation and LST were crucial in predicting soil PSFs of the surface layer at both study areas.
Many studies found that using machine learning methods can generate soil maps with higher accuracy and smoother transitions compared with polygon linkage approach [22,63]. Nevertheless, Wadoux et al. [19] indicated that most studies emphasized the prediction and accuracy of the predicted maps for applications, while few studies accounted for existing soil knowledge in the modeling process or quantified the uncertainty of the predicted maps. Current research still falls short in revealing the underlying mechanism and the internal relationships between the soil properties and potential covariates [19,26]. Some studies found that even irrelevant predictors, i.e., pseudo predictors, can generate reasonably accurate evaluation statistics of target properties [67,68]. Behrens et al. [69,70] attributed the reason to scale and information content and proposed the "information horizon" for the interpretability of spatial environmental predictors. Besides, pedogenetic knowledge is still crucial as the key step in DSM [25,26]. However, it is hard for non-expert users to assess the feasibility of a large number of potential covariates [26]. With this respect, Qin et al. [71] and Peng et al. [26] suggested to acquire reference from the formalized covariate selection knowledge in existing applications and applied those covariates into studies of their own. For future developments, machine learning could incorporate three core elements: Plausibility, interpretability, and explainability, which will trigger soil scientists to couple model predictions with pedological explanation and understanding of the underlying soil processes [19]. Expert knowledge can be incorporated into DSM, but DSM can also lead to the discovery of knowledge [26,72]. The combination of data-driven and knowledge-based methods can promote even more significant interactions between pedology and DSM [72].

Spatial Distribution
The predicted soil PSFs showed a similar distribution trend in both study areas. The finer soil texture was distributed in the southeast or east, while the coarse soil texture was distributed in the northwest. We compared the spatial distribution of soil fractions (take sand content for example) and some of the important predictors for both regions ( Figures S7 and S8) to test if the predicted PSF distribution followed the same spatial pattern as the corresponding important predictors. Since the predicted silt and clay had a similar distribution pattern, though sand was opposite, we took sand fraction as a representative. We found that in the BQ region, even though VRM had the highest variable importance in predicting clay ilr , its spatial distribution was not that similar to sand. The most important predictor of silt ilr , s2_df_a_9, had a very similar spatial distribution as the sand fraction in the northwest part. However, it did not depict the spatial variance of the sand fraction in the southeastern part. In contrast, some other predictors, such as EVI_1, s1_d_a_1, and s3_d_a_9, showed more a similar spatial distribution as the sand fraction. According to the land use type ( Figure S7), coarse soil texture was usually distributed in the alpine desert grassland, where the elevation was relatively low, and the daytime LST (i.e., s3_d_a_9) and the LST difference (i.e., s2_df_a_9) were relatively high. The finer soil texture was usually distributed in the alpine meadow and alpine steppe. While in the WQ region, the s4_df_e_9 and yr_n_i_9 showed a similar spatial distribution pattern as the sand fraction, especially in the northwest part. The coarse soil texture was mainly distributed in the northwest WQ ( Figure S8), dominated by alpine steppe and alpine desert grassland. In the northwest WQ, the minimum LST was lower, and the LST difference was larger. It indicated that the predicted distributions of soil PSFs were determined by the compound impact of multiple predictors rather than solely impacted by the most influential predictor.

Comparison with Existing Maps
Former studies have offered soil texture information at multiple spatial scales [8,22,46,[73][74][75]. However, due to the lack of field observations, the availability of these datasets in permafrost areas was difficult to evaluate. The PSFs were extracted from three available datasets (Section 2.2.3), and their descriptive statistics indicated that the silt and sand content at BQ and WQ were comparable (Table 4). While the descriptive statistics of observations indicated the silt content at BQ and WQ was quite different (Table 2). Hence, the three legacy datasets failed in depicting the spatial variance of soil PSFs, especially the higher content of sand in the WQ region.
The scatter plots (Figure 4) show that the SoilGrids250m, CSCD, and CDSL overestimated the clay content and cannot capture the silt and sand variance in BQ. Besides, all three compared datasets overestimated silt fraction and underestimated sand fraction in WQ ( Figure 5). They could not capture the variance of clay either. Remote Sens. 2021, 13, x FOR PEER REVIEW 16 of 22       According to the spatial distribution of soil PSFs in this study, the clay and silt content increased from northwest to the southeast in both BQ and WQ (Figures 6 and 7). We notice that the CSCD and CDSL also showed a similar trend as ILR-RF prediction at BQ, while SoilGrids250m did not ( Figure 6). The spatial patterns were not that consistent with ILR-RF at WQ (Figure 7). The assessment of the three legacy datasets showed low prediction accuracy on soil PSFs (Table 5).
There are some reasons to explain the low predictive power of the three legacy datasets. Firstly, observations from permafrost regions are quite deficient. The CDSL and CSCD datasets were developed on the Chinese soil profile database, which barely included observations in the permafrost distributed region. The SoilGrids250m used this database as well. Even though some vacant areas were filled with pseudo-points by expert knowledge [8], the inadequate soil samples result in insufficient learning process in model fitting. Secondly, the methodology implemented in these soil datasets were different. The CDSL and CSCD datasets used the polygon linkage approach while SoilGrids250m and our study employed the machine learning approaches. From the perspective of the earth system model, Dai et al. [76] have reviewed global soil properties maps and compared the maps generated by polygon-linkage methods with those generated by DSM. They found that the soil datasets produced by these two methods are quite different. These datasets may not be comparable since the linkage method results in an abrupt change between the boundaries of soil polygons, while the DSM simulated the soil properties with a continuous spatial change. In that way, the DSM-derived datasets can provide a more realistic estimation of soils than those derived by linkage methods. Even so, Arrouays et al. [77,78] argued that spatial information systems for polygons and grids complement various applications of soil maps. Therefore, both are needed from local to global scales. Thirdly, the initial spatial scales and spatial resolution of the datasets are different. This study was conducted at 1 km spatial resolution in two local areas. The CDSL and CSCD were developed in mainland China with a spatial resolution of 1 km, while SoilGrids250m was developed for the global scale with a spatial resolution of 250 m. When used at a local scale, they may not be able to depict the detailed information precisely. Liu et al. [22] found that the SoilGrids250m and polygon-linkage-based maps showed lower accuracy than their predictive maps of soil PSFs in China.
OR PEER REVIEW 18 of 22 Figure 6. Spatial distribution of soil particle size fractions (PSFs) at the BQ region using isometric log-ratio (ILR)-random forest (RF) and extracted from three soil datasets. Figure 6. Spatial distribution of soil particle size fractions (PSFs) at the BQ region using isometric log-ratio (ILR)-random forest (RF) and extracted from three soil datasets. Figure 6. Spatial distribution of soil particle size fractions (PSFs) at the BQ region using isometric log-ratio (ILR)-random forest (RF) and extracted from three soil datasets. Figure 7. Spatial distribution of soil particle size fractions (PSFs) at the WQ region using ILR-RF and extracted from three soil datasets. Figure 7. Spatial distribution of soil particle size fractions (PSFs) at the WQ region using ILR-RF and extracted from three soil datasets.

Conclusions
Due to the lack of field observations, the datasets focused on the spatial distribution of soil PSFs in permafrost regions of the Qinghai-Tibet Plateau are still scarce. We carried out the mapping of soil PSFs at two typical permafrost distributed regions by integrating two log-ratio transformation methods, five variable selection techniques, and two learning approaches. The Boruta all-relevant technique retained vegetation and LST covariates, and excluded air temperature in both regions. The difference in variable selection in the two regions is that precipitation and topographic covariates were retained at BQ but excluded at WQ. In general, the AR models had comparable accuracy with the exhaustive models, and the parsimonious models exhibited superior prediction compared to the exhaustive covariates. It confirms the necessity of variable selection prior to model fitting. However, we did not find a universal method for covariate selection. In addition, we prefer to compare multiple variable selection techniques than recommend a specific one. The ILR-RF model outperformed all the other models in both study areas. These results suggest that the combination of log-ratio transformation, covariate selection, and machine learning approaches are feasible in mapping soil PSFs in both study areas.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/rs13071392/s1, Figures S1 and S2: Importance of the all-relevant covariates of ALR-and ILR-transformed PSFs in the BQ and WQ regions. Figures S3 and S4: Spatial distribution of soil particle size fractions in the BQ and WQ regions predicted by four log-ratio models. Figures S5 and  S6: Fitted variable importance plots for clay ilr and silt ilr by the ILR-RF model in the BQ and WQ regions. Figures S7 and S8: Comparison between the distribution of sand and the predictors in the BQ and WQ regions. Table S1: Selected covariates identified by Boruta, GF, GB, HC and SA techniques in the two study areas. Table S2: Assessment on the predictions of two transformed fractions by the exhaustive, all-relevant and parsimonious models.