Estimation and mapping of pasture biomass in Mongolia using machine learning methods

Abstract
Mongolian pasture plays an essential role in the national economy. Reliable pasture biomass estimation is indispensable for supporting the agricultural sector and sustainable livelihoods in the country. The aim of this study is to determine an appropriate method to estimate and map pasture biomass in a forest-steppe area of Mongolia. For this purpose, machine learning methods such as random forest (RF), support vector machine (SVM), and partial least squares regression (PLSR) were compared. As data sources, spectral indices derived from a Sentinel-2B image of 2019 and field-measured biomass sample datasets were used. To determine the optimal spectral predictor variables, 20 spectral indices were initially evaluated using the PLSR. Of these, five indices (i.e. ATSAVI2, EVI, GRVI, IPVI and MSR) with the highest correlation coefficients (r ≥ 0.94) were considered for further analysis. These indices were also examined and validated by a variable importance analysis. Then, the RF, SVM, and PLSR models were applied to predict and map pasture biomass using the selected five indices. The PLSR method demonstrated the highest accuracy, with a coefficient of determination (R2) of 0.899 and a root mean square error (RMSE) of 10.560 g/m2. The SVM technique showed the second highest accuracy, with R2 = 0.837 and RMSE = 12.881 g/m2. The RF model gave the lowest accuracy, with R2 = 0.823 and RMSE = 13.430 g/m2. Our research showed that different machine learning models might be applied (because in all cases R2 > 0.82) for pasture biomass estimation and mapping in the selected test site, but the best result could be achieved with the PLSR.


Introduction
Pastureland plays an important role in Mongolian animal husbandry, because it is the grazing home of over 60 million livestock (NSO 2021). It makes up more than 70% of the total land area of the country and represents the largest remaining contiguous area of common pastureland in the world. In recent years, the Mongolian pastureland has seriously deteriorated and pasture yields have decreased. Severe droughts, unregulated land use, and the growing number of livestock have been the main factors in pastureland degradation in most parts of the country (Amarsaikhan 2017). In addition, pasture condition and yields depend strongly on the duration of the growing season. Generally, Mongolia's pasture has a very short growing season, influenced by highly variable temperatures and varying precipitation. Due to all of these influencing factors, pasture growth usually begins in mid-May and discontinues after mid-August (Environmental impact assessment 2012).
CONTACT Nyamjargal Erdenebaatar nyamjargale@mas.ac.mn
In recent years, the status of pasture condition in Mongolia has been widely debated, and the discussions have covered various advanced methods for monitoring and evaluating pasture carrying capacity and some other indicators (Special Report 2017). The carrying capacity is a measure of how much forage a piece of ground can produce in an average year. It is expressed as the maximum number of livestock that can be grazed for a specific time period without compromising the future production capacity (Meehan et al. 2018). The carrying capacity is an important factor that influences the human environment and sustainable development in pastoral areas (Zhang et al. 2019). One of the determinants of the pasture carrying capacity is the above ground biomass (AGB). The accurate and timely quantification of pasture biomass has a potentially significant role in helping herders and planners achieve effective grazing management practice (Chen et al. 2021).
The estimation of biomass is a challenging task, especially in areas with complex landscapes and varying environmental conditions, and it requires accurate and consistent measurement methods (Kumar et al. 2015). The AGB can be evaluated using ground-based conventional methods and modern remote sensing technology (Chen et al. 2018; Chiarito et al. 2021). Traditional field methods include visual estimation, cut-dry-weigh, rising plate meter (Hakl et al. 2012) and field spectrometry (Psomas et al. 2011). There are also some commercially available vehicle-mounted techniques based on height detection (King et al. 2010). However, these methods can be labor-intensive, time-consuming and unsuitable for studies of large areas in comparison with current remote sensing (RS) technology (Chen et al. 2021). The RS-based biomass evaluation techniques can provide effective operational tools for assessing the state and changes of AGB in a target area (Lourenco 2021). They can also be used for prompt mapping and cost-effective monitoring of pasture biomass. These capabilities of RS have increased, especially with the launch of the Sentinel-1 and Sentinel-2 satellites (Kumar and Mutanga 2017; Wang et al. 2019). The images acquired by Sentinel-1 and Sentinel-2 are free of charge, and have higher spatial and temporal resolutions compared to many traditional satellites (e.g. the Landsat series) (Song et al. 2021).
Generally, RS-based biomass estimates use four types of remotely sensed datasets, namely optical, microwave, hyperspectral, and lidar images with different spatial resolutions. Of these, optical images have wide applications for biomass estimation, and the most commonly used datasets include low-resolution AVHRR and MODIS (Li et al. 2018), medium-resolution Landsat and SPOT (Gasparri et al. 2010; Zhu and Liu 2015), and high-resolution IKONOS, Quickbird and Worldview images (Takahashi et al. 2010). The low-resolution images have been found to be more effective for biomass estimation at the national and global scales, but they have not been used much, because of their limited accuracy and the difficulty of linking these datasets with field measurements (Chen et al. 2021). The high-resolution images can be useful; however, they are expensive to acquire and are affected by terrain shadows, resulting in errors in the biomass estimation (Vaglio et al. 2014). In contrast, medium-resolution datasets have been widely used in combination with sample plot data for AGB estimation at a regional scale, because of their easy access, availability and low cost (Otgonbayar et al. 2018). Recently, unmanned aerial vehicles (UAVs) have started to receive more attention in biomass estimation, as they provide very high-resolution images with accuracies ranging from decimeters to centimeters. Studies have shown that very high-resolution multispectral imagery obtained from UAVs can be very useful not only in assessing AGB but also in accurately measuring other important biophysical parameters in pasturelands (Matese et al. 2015; Zhang et al. 2018; Lussem et al. 2019; Bazzo et al. 2023).
Over the years, different biomass estimation techniques such as RF, SVM, artificial neural network (ANN), cubist, K-nearest neighbour (KNN), Bayesian network, vegetation indices (VIs), and linear or nonlinear regression models have been developed and applied to optical datasets (Lu 2005; Mills 2011; Cui et al. 2012; Wang et al. 2017; Filho et al. 2020; Zeng et al. 2021). Many authors have used either one or a combination of these methods and reached different conclusions. For example, Xie et al. (2009) evaluated grassland aboveground dry biomass in the Xilingol River Basin, Inner Mongolia, China using Landsat ETM+ data. They applied ANN and multiple linear regression (MLR) models, and concluded that the ANN method provided more accurate estimation than the MLR. Edirisinghe et al. (2011) evaluated the biomass of grazing pasture in Australia using RS-based NDVI and a regression model and judged that the model produced good outputs. Grant et al. (2013) estimated biomass in two grassland ecoregions of Alberta, Canada, using 8 VIs derived from SPOT imagery and several transformation models, and underlined that renormalized NDVI and transformed VI provided satisfactory results. Rasanen et al. (2017) applied site-specific and cross-site empirical regressions to map the vegetation AGB using high resolution images and highlighted that the results were acceptable. Zeng et al. (2021) evaluated the grassland AGB in the Three-River Headwater Region, China using RS and other ancillary datasets. They applied RF, cubist, ANN and SVM models and emphasized that the RF method had the best performance.
In Mongolia, different RS-based pasture biomass studies have been conducted since the mid-1980s (Amarsaikhan 2019). One of the first satellite-based studies was conducted by Adyasuren (1989), who explored the relationships between ground-based biomass measurements and NOAA AVHRR reflectance values. Although a great number of investigations related to the Mongolian pasture, its condition, carrying capacity and dynamics have been carried out since then (Purevdorj et al. 1998; Erdenetuya 2004; Javzandulam et al. 2005; Karnieli et al. 2006; Bat-Oyun et al. 2010; Narangarav and Lin 2011; Hilker et al. 2014; Lamchin et al. 2015; Bayaraa et al. 2021), the country still lacks advanced methods to predict and map pasture biomass for its planning and management (Otgonbayar et al. 2018).
Over the last two decades, a great number of satellite data-based vegetation spectral indices have been developed and investigated by different authors for various test sites (Anderson et al. 1993; Todd et al. 1998; Clevers et al. 2007; Zhang et al. 2007; Edirisinghe et al. 2011; Fang et al. 2011; Alexandridis et al. 2014; Dusseux et al. 2015; Otgonbayar et al. 2018; Wang et al. 2020; Zeng et al. 2020). Nonetheless, there are very few investigations related to the Mongolian case. Kogan et al. (2004) estimated pasture biomass in Mongolia using vegetation health indices derived from NOAA AVHRR data (biomass anomaly R2 = 0.658). Javzandulam et al. (2005) revealed that EVI and biomass values were better correlated than the NDVI for a mountain steppe zone. Sternberg et al. (2011) explored the relationships between field-survey data and NDVI to examine desertification processes in a dryland region. Lamchin et al. (2016) selected NDVI, TGSI (topsoil grain size index), and land surface albedo (as indicators representing land surface conditions in terms of vegetation biomass, landscape pattern, and micrometeorology) in order to conduct land cover change and desertification assessments in a steppe area. Bayaraa et al. (2021) compared linear regression models between above-ground pasture biomass and seven indices, namely NDVI, RDVI, GDVI, SAVI (soil-adjusted vegetation index), OSAVI (optimized soil adjusted vegetation index), ARVI (atmospherically resistant vegetation index), and EVI. As seen, different VIs and individual spectral bands, along with other advanced methods, are widely used for improved biomass estimation. However, the research conducted by Otgonbayar et al. (2018) showed that the VIs perform better than the individual bands. Therefore, this study aims (a) to investigate the applicability of different RS-based VIs to biomass estimation in Mongolia and determine the optimal VIs based on the regression algorithm, and (b) to estimate and map pasture biomass using machine learning methods and determine the appropriate method through comparison of the results.

Study area
The test site was selected in Bornuur soum of Tuv aimag, situated about 100 km north-west of Ulaanbaatar, the capital city of Mongolia (Figure 1). Although the soum covers an area of more than 1140 sq. km, the study area chosen for the present study extends about 17.3 km from west to east and about 23.8 km from north to south. It occupies the western part of the Khentii Mountain range, which is one of the three major mountain ranges in the country. In terms of physical geography, the soum area belongs to a forest-steppe zone and land cover mainly includes such classes as perennial grasses, coniferous forest, perennial forb species (i.e. Stipa krylovii Roshev., Carex duriuscula C.A. Mey., Leymus chinensis Trin., Artemisia frigida Willd.), soil, and agricultural field (Hugjliin 2010). The altitude ranges between 1000 m and 1500 m, with an average altitude of 1100 m above sea level. The soil in this area is fertile and suitable for crop production. The mean annual precipitation is between 200 and 250 mm. The area has a cool summer and a harsh winter. The mean temperature in July is +20 °C, while it is −30 °C in January (Bornuur soum 2020).

Satellite data preprocessing
In the present study, we selected an orthorectified Sentinel-2B image acquired on July 09, 2019 to estimate the pasture AGB (Figure 2). Sentinel-2B images have 13 spectral bands and different processing levels (Sentinel-2 User Handbook 2015). For our research, data with processing level-2A were downloaded from the ESA's Copernicus Open Access Hub (https://scihub.copernicus.eu/dhus/#/home) and used. Using the metadata associated with the satellite imagery, radiometric correction was initially applied to improve the quality of the image. Dark object subtraction was then applied for atmospheric correction, in order to remove the atmospheric effects. Both radiometric and atmospheric corrections were performed using the Semi-Automatic Classification Plugin of QGIS. It was not necessary to thoroughly georeference the selected Sentinel-2B data, because the image was already in the correct WGS84/UTM zone 48 N system.

Field measurement
At the beginning of July 2019, fieldwork was conducted and different sample plots with various biomass conditions were selected. Usually, pasture vegetation in this area reaches its maximum production at the beginning of August (Tserendash 2006). Our field survey revealed that pasture vegetation had not yet reached its maximum height and production. Therefore, the selection of the sample plots was mainly based on the distribution of pasture vegetation with similar conditions. The sampling locations were recorded using a handheld global positioning system (GPS) receiver, and biomass from each plot was manually collected by cutting grass to the ground surface in a 70 cm x 70 cm area. The samples were air-dried in the field and oven-dried for 24 h at 80 °C before weighing. Then, the weights were converted to g/m2. The values of biomass ranged from 39.23 g/m2 to 176.73 g/m2, with a mean biomass value of 77.08 g/m2, and covered three levels: low, medium, and high biomass. The frequency distribution of the field-measured biomass for the 40 selected sample plots and the locations of sampling sites positioned on the Sentinel-2B image are shown in Figure 2a, b.
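The conversion of a plot's dry weight to g/m2 follows directly from the 70 cm x 70 cm plot size given above (an area of 0.49 m2). The following is a minimal sketch of that arithmetic; the function name and the example weight are hypothetical and are not taken from the field dataset.

```python
PLOT_SIDE_M = 0.70  # side of the square clipping plot (70 cm), as stated in the text


def biomass_g_per_m2(dry_weight_g: float, plot_side_m: float = PLOT_SIDE_M) -> float:
    """Convert the oven-dried weight of one clipped plot to g/m2."""
    plot_area_m2 = plot_side_m ** 2  # 0.49 m2 for a 70 cm x 70 cm plot
    return dry_weight_g / plot_area_m2


# Hypothetical dry weight of one sample, used only for illustration.
example_weight_g = 49.0
print(f"{biomass_g_per_m2(example_weight_g):.2f} g/m2")
```

Dividing by the plot area rather than multiplying by a rounded factor keeps the conversion exact for any plot size that might be used in a different survey.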

Variable determination
Generally, vegetation and soil indices have been found useful in measuring biophysical properties of different vegetation communities. The underlying reason for using VIs is that they are important spectral indicators defined to enhance spectral features sensitive to a vegetation property, while reducing different undesired effects (Dorigo et al. 2007). With the advent of modern RS technology, spectral bands can be combined in numerous ways to create various VIs, and most of them are successfully used as predictors for vegetation variables along with a variety of statistical models.
In the current study, one of the main research requirements was to determine the appropriate independent variables. We therefore selected 20 indices commonly used for different vegetation analyses (Table 1). For the calculation of these VIs, the blue, green, red, near infrared (NIR), and short-wave infrared (SWIR) ranges of the electromagnetic spectrum were used. In order to consider only pasture, non-pasture areas including the agricultural fields and forest areas were masked and removed from the original Sentinel-2B data before determining the variables. To determine the masks, the boundaries of these classes were screen digitized in the ArcGIS system after applying a maximum likelihood decision rule.
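As an illustration of how such indices are computed from band reflectances, the sketch below implements four of the indices named in this study using their commonly published definitions. These formulas are an assumption here, because Table 1 is not reproduced in this text and the authors' exact parameterisation may differ. GRVI is taken as the NIR/green ratio, in line with the statement later in the paper that it is built from the green and NIR bands. The reflectance values are hypothetical.

```python
import math


def ipvi(nir: float, red: float) -> float:
    """Infrared percentage vegetation index: NIR / (NIR + Red)."""
    return nir / (nir + red)


def msr(nir: float, red: float) -> float:
    """Modified simple ratio: (NIR/Red - 1) / (sqrt(NIR/Red) + 1)."""
    ratio = nir / red
    return (ratio - 1.0) / (math.sqrt(ratio) + 1.0)


def grvi(nir: float, green: float) -> float:
    """Green ratio vegetation index, assumed here to be NIR / Green."""
    return nir / green


def evi(nir: float, red: float, blue: float) -> float:
    """Enhanced vegetation index with the usual coefficients (2.5, 6, 7.5, 1)."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


# Hypothetical surface reflectances for one pasture pixel
# (Sentinel-2 bands B2 = blue, B3 = green, B4 = red, B8 = NIR).
blue, green, red, nir = 0.04, 0.07, 0.06, 0.30
for name, value in [
    ("IPVI", ipvi(nir, red)),
    ("MSR", msr(nir, red)),
    ("GRVI", grvi(nir, green)),
    ("EVI", evi(nir, red, blue)),
]:
    print(f"{name}: {value:.3f}")
```

The same functions can be applied pixel by pixel to the masked image, so that each index becomes one candidate predictor layer.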

Machine learning algorithms
In the current study, we used three machine learning methods, namely PLSR, RF and SVM, to estimate the pasture biomass in the selected test area.
The PLSR is a quick and efficient regression method based on covariance. The principle of PLSR is first to decompose the explanatory variables into a few non-correlated latent variables or components using information contained in the response variable, and then to regress the response variable on the new components (Lu et al. 2016). In our study, we applied the XLSTAT software, and the correlation coefficients (r), RMSE and coefficients of determination (R2) between predicted and measured biomass were the criteria used to select the best model with the optimal number of components.
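The two-step principle described above (extract latent components guided by the response, then regress on them) can be sketched for a single response variable as follows. This is a generic NumPy implementation of the standard PLS1 algorithm under our own naming, not the XLSTAT procedure used in the study, and the data are synthetic.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit single-response PLS regression (PLS1 with deflation).

    Returns (coef, intercept) expressed for the original, uncentred X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X_res, y_res = X - x_mean, y - y_mean       # centred working copies
    weights, loadings, y_loadings = [], [], []
    for _ in range(n_components):
        w = X_res.T @ y_res                     # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = X_res @ w                           # latent component (scores)
        tt = t @ t
        p = X_res.T @ t / tt                    # X loadings for this component
        q = (y_res @ t) / tt                    # regression of y on the component
        X_res = X_res - np.outer(t, p)          # deflate X
        y_res = y_res - q * t                   # deflate y
        weights.append(w)
        loadings.append(p)
        y_loadings.append(q)
    W = np.column_stack(weights)
    P = np.column_stack(loadings)
    coef = W @ np.linalg.solve(P.T @ W, np.array(y_loadings))
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def pls1_predict(X, coef, intercept):
    """Predict the response for new predictor rows."""
    return np.asarray(X, dtype=float) @ coef + intercept


# Synthetic demonstration data (NOT the study's measurements):
# 40 plots, 5 predictor indices, biomass-like response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 5))
y_demo = X_demo @ np.array([3.0, -1.0, 2.0, 0.5, 1.5]) + 80.0 + rng.normal(scale=0.5, size=40)
coef_demo, intercept_demo = pls1_fit(X_demo, y_demo, n_components=2)
print(pls1_predict(X_demo[:3], coef_demo, intercept_demo))
```

In practice the number of components would be chosen by comparing r, RMSE and R2 across candidate models, as described above; when all components are retained, PLS1 reproduces ordinary least squares.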
The RF is a family of tree-based models; in the first one, data are stratified into homogeneous subsets by decreasing the within-class entropy, whereas in the second one, a large number of regression trees are constructed by selecting random bootstrap samples from the discrete or continuous datasets (Pflugmacher et al. 2014). The advantage of RF is that it runs effectively on large data sets, is relatively robust to outliers, and handles complex training data efficiently (Hastie et al. 2009; Rodriguez-Galiano et al. 2012). For each tree,
The SVM algorithm transforms the original space and constructs an optimal hyperplane in a multi-dimensional feature space, which divides the data into different classes with the largest possible margin of separation. In the field of digital image processing, it is an important statistical learning algorithm with the ability to work well on noisy data, develop the necessary support vectors and use relatively small training samples to produce comparatively higher estimation accuracy than other classification approaches (Sluiter and Pebesma 2010; Lu et al. 2016). In the present study, the biomass estimation was performed by applying this technique to the field-measured biomass and the spectral indices derived from the Sentinel-2B image.
The selected machine learning methods were used to remotely estimate the pasture biomass in the selected test area. In each model, the spectral indices with the highest r values were the independent variables and the field-measured biomass values were the dependent variable. The models were implemented in the R software. The flowchart of the applied method is shown in Figure 3.

Results and discussion
In Bornuur soum, many factors such as soil types, precipitation, temperature, land use, human activities, terrain aspect and slope can influence vegetation growth, carrying capacity, and species composition. In particular, terrain aspect and slope can affect sun illumination and moisture distribution, which directly influence plant photosynthesis and biomass accumulation as well as the measurement of biophysical properties. Also, different soil types have different physical structures and nutrient components, directly influencing vegetation growth. Climate conditions, particularly precipitation distribution and temperature, are important factors that affect vegetation vigor. For example, the accumulated precipitation in Bornuur soum in July 2019 was 30.8 mm, while the temperature was +22 °C (Bornuur soum 2019). Moreover, different human activities also contribute to soil degradation and to the disturbance and destruction of vegetated areas. Thus, various natural, geographical and human-induced factors influence the biomass condition in the selected study site.
In the present study, the correlations between the field-measured biomass and the spectral indices shown in Table 1 were initially estimated using the PLSR model. The resulting correlation coefficient matrix is shown in Table 2. As seen from Table 2, there are 17 indices (i.e. ATSAVI2, CVI, DVI, EVI, GARI, GDVI, GNDVI, GRVI, GVMI, IPVI, MNDVI, MSAVI, MSR, RDVI, TDVI, TVI, and VARI) with r values greater than 0.90. It is also seen that there are 5 indices (i.e. ATSAVI2, EVI, GRVI, IPVI, and MSR) with r ≥ 0.94, and all of them were calculated using visible and NIR bands. For the calculation of the ATSAVI2, IPVI, and MSR, the red and NIR bands were used, while for the GRVI, the green and NIR bands were applied. For the EVI, the blue, red and NIR bands were selected. This explains the extensive use of these bands for the creation of spectral indices in different forms. Among the highly correlated VIs, the MSR (r = 0.947) can be considered the best index to explain the ground biomass. Compared to the other highly correlated indices, the EVI demonstrated the second highest result in describing the AGB.
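The screening step, in which each index is correlated with the field biomass and only those with r ≥ 0.94 are retained, can be sketched as follows. Plain Pearson correlation is used here as a stand-in, since the coefficients in the study were obtained from the PLSR output in XLSTAT; the index names and values in the example are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_indices(index_values, biomass, threshold=0.94):
    """Return {index name: r} for indices whose r with biomass meets the threshold."""
    r_by_index = {name: pearson_r(vals, biomass) for name, vals in index_values.items()}
    return {name: r for name, r in r_by_index.items() if r >= threshold}


# Invented toy data: five plots, two hypothetical indices.
biomass = [40.0, 60.0, 80.0, 120.0, 170.0]
index_values = {
    "index_a": [0.20, 0.30, 0.40, 0.60, 0.85],  # tracks biomass closely
    "index_b": [0.50, 0.10, 0.60, 0.20, 0.40],  # essentially unrelated
}
print(select_indices(index_values, biomass))
```

The threshold is a parameter so that the same function reproduces both cut-offs mentioned above (0.90 and 0.94).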
As is known, the traditional VIs mainly used the visible and NIR portions of the optical range, because NIR-based indices have strong relationships with vegetation biomass, with R2 ranging from 0.60 to 0.94 (Poley and McDermid 2020). Since the launch of satellites with SWIR bands, the use of the SWIR region has intensified among the RS community. For instance, Hill et al. (2016) retrieved the fractional cover of photosynthetic vegetation, non-photosynthetic vegetation and bare soil for tropical savannah based on linear unmixing of the two-dimensional response envelope of NDVI and ratio (SWIR32) indices derived from MODIS data. Castro and Garbulsky (2018) developed different spectral normalized indices to accurately predict some characteristics of forage species using the visible, near and middle infrared ranges. In order to accurately model the biomass of an arid steppe area in Algeria, Benseghir and Bachari (2021) used different VIs combining visible and NIR OLI bands, visible and SWIR OLI bands, and also NIR and SWIR OLI bands.
In our study, we used the SWIR bands of Sentinel-2B to calculate 4 different indices. As seen from Table 2, among the VIs determined using the SWIR portions of the electromagnetic spectrum, the GVMI and MNDVI have the highest correlations (r = 0.916), whereas the AFRI has the lowest correlation (r = 0.78). For the GVMI and MNDVI, the NIR and SWIR2 bands were selected, while for the AFRI, NIR and SWIR1 were used. This indicates that a combination of NIR and SWIR2 bands can be used to develop proper spectral indices for evaluating AGB in pasture areas, whereas a combined use of NIR and SWIR1 is less suitable for assessing pasture biomass values. Nevertheless, such indices can still be applied along with other highly correlated indices for pasture biomass estimation and mapping.
To support the analysis, a comparative evaluation was carried out by graphically illustrating the relationship between the field-measured biomass and the determined VIs. In practical applications of RS image analysis, such comparisons are important for accurate determination of the appropriate variables. The initial spectral indices and biomass values were expressed in different units. As a result, it was difficult to represent them in the same space. Therefore, to overcome this problem, the variable values were standardized using the following formula:
$$Z = \frac{X - \mathrm{mean}}{\mathrm{SD}}$$
where Z is the standardized value, X is the observed variable, and mean and SD are the mean and standard deviation of that variable.
After the standardization, all values ranged from −1.5 to 3.5. We divided the new values into 11 groups with a 0.5 interval to present them in a graph relating to the number of measurements. The comparison of the standardized biomass and VI values is shown in Figure 4. As seen from Figure 4, the GRVI and MSR indices have very high overlaps with the measured biomass, while the ATSAVI2, CVI, EVI, IPVI, and RDVI indices have the second highest relationship with the dependent variable. Thus, the relationship between the biomass and VI values of all five spectral indices having the highest r values (0.94 or more) can be graphically illustrated and validated.
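The standardization and grouping described here can be sketched as follows: each variable is converted to z-scores, (X − mean) / SD, and the z-scores are then counted in bins of width 0.5. The use of the sample standard deviation (n − 1 in the denominator) and the function names are assumptions of this sketch, and the biomass values are illustrative only.

```python
import math
from collections import Counter


def standardize(values):
    """Return z-scores (value - mean) / SD, using the sample SD (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def bin_counts(z_values, width=0.5):
    """Count z-scores per bin; each key is the lower edge of a bin of the given width."""
    counts = Counter(math.floor(z / width) * width for z in z_values)
    return dict(sorted(counts.items()))


# Illustrative biomass values in g/m2 (not the field dataset).
biomass = [39.2, 55.0, 61.3, 70.8, 77.1, 84.5, 96.0, 132.4, 176.7]
z_biomass = standardize(biomass)
for lower_edge, count in bin_counts(z_biomass).items():
    print(f"[{lower_edge:+.1f}, {lower_edge + 0.5:+.1f}): {count}")
```

Applying the same two functions to each VI gives frequency curves on a common, unit-free axis, which is what allows biomass and indices to be overlaid in one graph.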
After investigating the relationships among the dependent and independent variables, the selected RF, SVM and PLSR methods were applied to the five selected spectral indices (i.e. ATSAVI2, EVI, GRVI, IPVI, and MSR) with the highest r values. The resulting outputs (i.e. biomass maps) are shown in Figure 5. Furthermore, to evaluate the quality of the applied models, two broadly used statistical measures, the coefficient of determination (R2) and the root mean square error (RMSE) (Richter et al. 2012), were used. The most common interpretation of R2 is how well the model fits the observed data, and a higher coefficient indicates a better fit. The RMSE describes the performance of a model through the actual difference between the estimated and observed values, and the smaller the value, the better the performance of the model.
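For reference, the two accuracy measures can be computed as in the sketch below. R2 is written here as 1 − SS_res / SS_tot, which is one common definition; the text does not state which variant was used in the study, so this choice is an assumption, and the numbers are illustrative only.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Illustrative observed and predicted biomass values in g/m2.
observed = [40.0, 62.0, 75.0, 90.0, 130.0, 170.0]
predicted = [48.0, 58.0, 80.0, 85.0, 121.0, 160.0]
print(f"R2 = {r_squared(observed, predicted):.3f}, RMSE = {rmse(observed, predicted):.3f} g/m2")
```

Because RMSE carries the units of the response, it can only be compared across studies after the biomass values are expressed in the same units.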
As seen from Table 3, among the selected models, the PLSR method demonstrates the highest accuracy, with R2 = 0.899 and RMSE = 10.560 g/m2. The SVM technique shows the second highest accuracy, with R2 = 0.837 and RMSE = 12.881 g/m2. In contrast, the RF model results in the lowest accuracy. Here, its R2 (0.823) is lower and its RMSE (13.430 g/m2) is higher than those of the other two methods.
For the entire Mongolian pastureland, Otgonbayar et al. (2018) applied PLSR and RF models along with selected spectral indices to estimate pasture biomass. In their study, the PLSR result showed a satisfactory correlation between field-measured and estimated biomass, with R2 = 0.750 and RMSE = 101.10 kg ha−1. The RF regression gave slightly better results, with R2 = 0.764 and RMSE = 98.00 kg ha−1. As seen, the R2 values of both models are lower and the RMSE values are highly scattered compared to the results of our estimations.
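Since the RMSE values of Otgonbayar et al. (2018) are reported in kg ha−1 while those of the present study are in g/m2, a unit conversion is needed before the figures can be read side by side. A minimal sketch, using only the definitions 1 kg = 1000 g and 1 ha = 10,000 m2 (the function name is ours):

```python
G_PER_KG = 1000.0
M2_PER_HA = 10000.0


def kg_per_ha_to_g_per_m2(value_kg_per_ha: float) -> float:
    """Convert a biomass density from kg/ha to g/m2 (1 kg/ha = 0.1 g/m2)."""
    return value_kg_per_ha * G_PER_KG / M2_PER_HA


# RMSE values quoted in the text from Otgonbayar et al. (2018).
for label, rmse_kg_ha in [("PLSR", 101.10), ("RF", 98.00)]:
    print(f"{label}: {rmse_kg_ha:.2f} kg/ha = {kg_per_ha_to_g_per_m2(rmse_kg_ha):.2f} g/m2")
```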
Thus, in the case of the selected test site (i.e. the forest-steppe zone), the result of the PLSR method is superior to those of the other techniques. In recent years, the number of livestock has increased substantially in many regions of Mongolia. Therefore, the final biomass map derived from this study could be used for pasture and land use planning and management in Bornuur soum. Nevertheless, the other two models can also be used in other cases, because of their high accuracies and low RMSEs.

Conclusion
In this research, an appropriate model for predicting pasture biomass in Bornuur soum of Tuv aimag, Mongolia was defined by comparing the PLSR, RF, and SVM methods. To determine the optimal spectral predictor variables, 20 spectral indices were initially calculated using the visible, NIR, SWIR1, and SWIR2 bands of a Sentinel-2B image of 2019. These indices were evaluated using the PLSR method. For the training, reference biomass datasets collected from 40 sample plots were used. The PLSR revealed seventeen indices with r > 0.90, and five indices with r ≥ 0.94. Then, a comparative evaluation was carried out by graphically illustrating the relationships between the field-measured biomass and the spectral indices. The comparison confirmed the relationships between the biomass and VI values, including the indices with the highest r values. The five most important spectral indices were ATSAVI2, EVI, GRVI, IPVI, and MSR, and they were based on the visible and NIR bands. By combining these indices in the selected models, their specific contribution to the high-quality outputs could be increased. The RF, SVM and PLSR methods were applied to the five VIs. When the resulting biomass predictions and maps were compared, the PLSR technique showed the highest accuracy, with an R2 of 0.899 and an RMSE of 10.560 g/m2. The SVM technique gave the second-best result, with an R2 of 0.837 and an RMSE of 12.881 g/m2. The RF model performed less well than the other two methods and demonstrated the lowest accuracy (R2 = 0.823 and RMSE = 13.430 g/m2). In practice, R2 values greater than 0.8 could still be considered good enough to evaluate the selected models. Therefore, it was concluded that the PLSR should be used for pasture biomass estimation and mapping in a forest-steppe zone of Mongolia, although other methods such as RF and SVM could also be considered.

Disclosure statement
The authors have no potential conflicts of interest or competing interests to declare.

Figure 1. Location of study area.

Figure 2. Frequency distribution of the field-measured biomass for the 40 selected sample plots (a), and locations of sampling sites on the Sentinel-2B image (b).

Figure 3. The flowchart of the applied method.

Figure 4. Relationship between the field measured biomass and VI values.

Table 1. Vegetation indices used for the study.

Table 3. Summary statistics (R2 and RMSE) for the selected biomass prediction methods.