Simulation of potential endangered species distribution in drylands with small sample size based on semi-supervised models

Identifying suitable habitats for endangered species is critical to promoting their recovery. However, conventional species distribution models (SDMs) need large amounts of labeled sample data to learn the relationship between species and environmental conditions, and they struggle to disentangle the role of the environment in the distribution of endangered species, which are sparsely distributed and occupy heterogeneous environments. This study used a semi-supervised model to simulate the suitable habitats of endangered species from a small sample size. The model was applied to the endangered species Populus euphratica (P. euphratica) in the lower Tarim River basin (TRB), Northwest China, and its performance was compared with three conventional SDMs, namely Maxent, the generalized linear model, and a support vector machine. The results showed that the semi-supervised model outperformed the conventional SDMs, reaching an accuracy of 85% with only 443 P. euphratica samples. All models performed worse in predicting habitat suitability when trained on smaller sample sizes, but the semi-supervised model remained the most accurate. The suitable habitat for P. euphratica lies mainly near the river channel of the lower TRB and accounts for 13.49% of the study area. The lower Tarim River therefore still has considerable land potential for the restoration of endangered P. euphratica. The model developed here can be used to evaluate suitable habitat for endangered species with only a small sample size and provides a basis for the conservation of endangered species.


Introduction
Drylands, which have an average precipitation between 25 and 500 mm per year, are inhabited by 28% of endangered species [1]. In addition, land degradation has expanded rapidly in recent decades due to climate change and human activities, threatening the habitats of critically endangered species such as Alcimandra cathcartii, Cycas pectinata, Eleutharrhena macrocarpa, and Parashorea chinensis [2,3]. The protection of endangered species and the maintenance of biodiversity have attracted widespread attention in several countries. For example, China has enacted the Wildlife Protection Law of the People's Republic of China, and the US has a similar Endangered Species Act to facilitate the protection of endangered species [4]. The conservation and restoration of endangered species in drylands therefore remain a primary focus today [4,5].
The species distribution model (SDM), which is typically developed to predict species distributions from observations of species occurrence and environmental conditions, can be used to estimate the spatial distribution of endangered species [6]. Many models are currently used to build SDMs, e.g. Maxent, the support vector machine (SVM), and the generalized linear model (GLM) [7]. In general, these SDMs need large amounts of labeled sample data to learn the relationship between species and environmental conditions [8,9]. Nevertheless, several scholars have modeled endangered species with small sample sizes and obtained excellent results [10,11]. For example, Lomba et al used a new integrated and stratified approach to simulate the habitats of endangered species and achieved an AUC value of 0.97 with only a few species occurrence records [10]. McCune applied the Maxent model to simulate eight rare woodland plants separately with 5-295 samples, and AUC values varied from 0.6737 to 0.9861 [12]. Notably, Wisz et al found that the AUC of the model on independent data dropped to 0.4172-0.8908, so the generalization ability of the model was mediocre at best. This may be related to inadequate sample collection and the failure of the model to accurately characterize the species' habitat requirements. In fact, predictions based on small samples are generally inappropriate for conservation planning and other complex applications because it is not known which models are properly trained and produce robust predictions [13].
Unlike traditional supervised learning, semi-supervised learning works by clustering similar data using unsupervised learning and then using the existing labeled data to label the remaining unlabeled data [14,15]. It is mainly applied in cases of label scarcity, where unlabeled data are used to help train the classifier, thus improving its performance and compensating for the lack of classification labels. For example, Berthelot et al compared semi-supervised and supervised models and found that a semi-supervised model trained with only 500 labeled samples performed almost identically to a supervised model trained with 73 257 labeled samples [16]. However, semi-supervised learning has not yet been used to simulate potential habitats for endangered species.
The Tarim River basin (TRB) is the most arid region in China. It contains around 54% of the world's Euphrates poplar, Populus euphratica (P. euphratica), forests [17]. P. euphratica, an ancient and endangered species, is a distinctive and constructive species of desert riparian forests [18]. P. euphratica was assessed for The IUCN Red List of Threatened Species in 2017. As a 'Green Corridor', the P. euphratica forests contribute significantly to preventing the neighboring Taklamakan and Kuruktagh sandy deserts from merging [19]. In addition to their socioeconomic and tourist value, P. euphratica forests have ecological functions such as conserving biodiversity, regulating the climate and hydrological conditions of the oasis, fertilizing the soil, and maintaining the balance of the regional ecosystem [20]. However, the construction of the Daxihaizi reservoir in 1972 intercepted water from upstream and completely dried up the lower reaches of the TRB. Groundwater levels declined dramatically, which caused wholesale destruction of the P. euphratica forests along the river corridor [2,21]. The area of P. euphratica in the lower TRB declined from 5.4 × 10⁴ hectares in 1958 to 0.67 × 10⁴ hectares in the early 1990s, a loss of almost 87.6% of the P. euphratica forest area [22].
To revive the endangered P. euphratica and restore the degraded ecosystems, the Chinese government has implemented the 'Ecological Water Conveyance Project' (EWCP) since 2000, investing 10.7 billion CNY to deliver freshwater from the upper reaches and Bosten Lake to the lower TRB and restore its ecological environment [23,24]. The area of desert riparian forest vegetation increased from 492 km² before the water transfer to 1423 km² in 2020, with the areas of low, medium, and high cover vegetation increasing by 20.80%, 448.00%, and 190.00%, respectively [25]. Previous studies have mainly focused on the physiological characterization of P. euphratica under drought stress [26], the response of P. euphratica to changes in groundwater levels [27], and the eco-physiological response of P. euphratica to ecological water delivery [28]. However, no SDM has yet been used to estimate the spatial distribution of endangered P. euphratica in the lower TRB.
Therefore, this study aims to develop a semi-supervised model to predict suitable habitats for endangered species based on a limited number of field surveys. The model combines the benefits of supervised and unsupervised learning by training on samples of known species occurrence together with unlabeled samples, so that it can accurately identify locations that match the labeled data. Taking the endangered P. euphratica in the lower TRB as a case study, we validate this approach and compare the predictive ability of the semi-supervised learning approach with that of several commonly used SDMs (i.e. Maxent, GLM, SVM). The novel aspect of this study is the proposal of a semi-supervised learning algorithm for a limited quantity of species labels, which overcomes the difficulty of modeling rare species with limited occurrence data. The model developed here and the results derived from it may assist endangered species conservation.

Study site and data source
The TRB, the basin of the largest inland river in China, is an extremely arid region with an average annual precipitation of less than 40 mm and a potential annual evaporation of more than 2590 mm [29]. Hydrologically, the TRB is a unique freshwater ecosystem in that its main catchment is located at the confluence of the Hotan, Yarkant, and Aksu rivers and consists of a closed catchment connected to several tributaries [30]. The vegetation in the TRB is dominated by trees (Populus euphratica Oliv., Elaeagnus angustifolia L., etc.), shrubs (Tamarix spp., Lycium ruthenicum, Halimodendron halodendron, etc.), and herbs (Phragmites communis, Alhagi sparsifolia, Glycyrrhiza inflata, etc.) [22,31]. Due to differences in topography and landform, P. euphratica is mainly found along the entire mainstream of the TRB [32]. The study area is located in the lower TRB (figure 1). The streams of the lower TRB dried up completely in 1972 due to the construction of the Daxihaizi reservoir, leading to serious ecosystem degradation of the lower reaches [33] and a massive decline of P. euphratica [30]. According to the survey, the area of P. euphratica in the lower TRB was reduced by 70% before the 21st century due to the drying out. To revive the endangered P. euphratica and restore the degraded ecosystems, the Chinese government has implemented the EWCP since 2000 to convey fresh water to the lower TRB [18]. The EWCP intermittently releases freshwater from the dam (i.e. the Daxihaizi Reservoir) to the Qiwenkur river channel and the Old Tarim River channel until it finally reaches Taitema Lake (figure 1). Because of the project, groundwater levels have risen dramatically and degraded P. euphratica has effectively been restored in the lower TRB [34,35].

Data source
The elevation data (from a Digital Elevation Model) were taken from the ASTER GDEM 30 m resolution series of the Geospatial Data Cloud, Computer Network Information Center, Chinese Academy of Sciences (www.gscloud.cn/sources) and resampled to 10 m resolution. Soil properties were obtained from the World Soil Database of the Qinghai-Tibet Plateau National Data Centre at a spatial resolution of 250 m and were then resampled to 10 m. The monthly groundwater depth in the study site was obtained from a machine learning model built on 70 groundwater monitoring wells in the lower TRB (model details are given in Liu et al [36]). The distance to the river channel was calculated by spatial analysis in ArcGIS.

Variables selection
The SDM assumes that the prediction dataset contains all biologically relevant variables [37,38]. To guide the selection of predictor variables, mutual information was used to quantify the influence of groundwater levels, soil variables, and climatic variables on P. euphratica (figure 2(a)). The mutual information between P. euphratica and the groundwater table is higher than that between P. euphratica and the soil or climate variables. This may be because P. euphratica can thrive on almost all soil types and has few specific soil requirements [39]. Compared with climate and soil characteristics, the growth of P. euphratica is more directly related to the groundwater table, as precipitation here is too low to contribute to vegetation growth [40]. Furthermore, since the grey correlation analysis (figure 2(b)) shows high correlations between groundwater level and the soil and climatic variables, we consider that groundwater already carries the information on climate and soil characteristics. Therefore, groundwater tables are used as predictor variables here. Moreover, because topographic variables are among the most important factors for P. euphratica [40] and are positively associated with SDM performance [41], elevation and the distance from the river were also selected as input variables. Since including latitude and longitude alongside environmental variables allows more accurate modeling of species distribution and suitability [42], latitude and longitude were also considered. In total, 12 months of groundwater, elevation, distance from the river, and latitude and longitude were used as inputs to the model in this study (table 1).
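As an illustration of the screening step described above, the following is a minimal sketch of a mutual information ranking, assuming scikit-learn's mutual_info_classif estimator. The variable names, value ranges, and the rule that generates presence are synthetic assumptions made for the example and are not the study's data.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (assumptions): presence depends on groundwater depth only.
groundwater_depth = rng.uniform(1.0, 12.0, n)   # metres below surface
soil_ph = rng.uniform(7.0, 9.0, n)              # unrelated to presence here
presence = (groundwater_depth < 5.0).astype(int)

X = np.column_stack([groundwater_depth, soil_ph])

# Estimate mutual information between each predictor and the presence label.
mi = mutual_info_classif(X, presence, random_state=0)
mi_by_variable = dict(zip(["groundwater_depth", "soil_ph"], mi))

# Rank predictors from most to least informative.
ranking = sorted(mi_by_variable, key=mi_by_variable.get, reverse=True)
print(ranking)
```

In this toy setting the groundwater variable ranks first by construction; with real data the ranking would depend on the measured relationships.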

Traditional SDMs
Maxent is a statistical method for inferring unknown distributions from limited known information. It follows the maximum entropy principle: among all distributions consistent with the known constraints, it selects the one that is as close to uniform as possible [43].
GLM is an extension of the linear model that relates the mathematical expectation of the response variable to a linear combination of predictor variables through a link function. It places fewer restrictions on the probability distribution of the data and is therefore better suited to the non-normal error structure of most ecological data [44].
SVM is a supervised machine learning algorithm that works by projecting data into a high-dimensional space and using the maximum-margin method to find a separating hyperplane that maximizes the gap between classes. New unknown samples are mapped into the same space and assigned to a category based on which side of the gap they fall [45]. Figure 3(a) describes the working principle of supervised learning (i.e. Maxent, GLM, and SVM). The classifier is trained on a large labeled dataset, and the trained classifier is then used to make predictions.

Semi-supervised self-training
Semi-supervised self-training uses a small amount of labeled data to train a base classifier and then uses the trained base classifier to predict a large number of unlabeled data [15]. Pseudo-labeled samples with high confidence are selected from the predicted unlabeled data and used to expand the labeled dataset. The base classifier is retrained on the combination of the originally labeled data and the pseudo-labeled data, and this process is iterated until the stopping condition is met. Self-training relies on two assumptions, the smoothness assumption and the cluster assumption. The smoothness assumption means that two samples x and x′ that are close together in the input space should have the same labels y and y′, and the cluster assumption means that data points belonging to the same cluster belong to the same class [46]. A significant advantage of this method is that any supervised classifier can serve as its base classifier [46]. Figure 3(b) describes the simulation of SDM based on a semi-supervised self-training model.
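The self-training loop described above can be sketched with scikit-learn's SelfTrainingClassifier, where unlabeled samples carry the label -1. This is a minimal sketch on synthetic two-cluster data: the logistic regression base classifier, the 0.9 confidence threshold, and the cluster positions are assumptions for illustration, since the text does not state the base classifier or threshold used in the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(0)

# Two well-separated synthetic clusters standing in for suitable / unsuitable sites.
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(200, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(200, 2))
X = np.vstack([X_pos, X_neg])
y_true = np.array([1] * 200 + [0] * 200)

# Only 20 samples keep their labels; the rest are marked unlabeled with -1.
y_train = np.full(len(y_true), -1)
labeled_idx = np.concatenate([np.arange(0, 10), np.arange(200, 210)])
y_train[labeled_idx] = y_true[labeled_idx]

# Base classifier (an assumption): any classifier exposing predict_proba can be used.
base = LogisticRegression()
model = SelfTrainingClassifier(base, threshold=0.9, max_iter=10)
model.fit(X, y_train)

# transduction_ holds the labels used for the final fit (-1 where still unlabeled).
n_pseudo = int((model.transduction_ != -1).sum()) - len(labeled_idx)
accuracy = float((model.predict(X) == y_true).mean())
print(n_pseudo, accuracy)
```

Each iteration adds only those unlabeled samples whose predicted class probability exceeds the threshold, which mirrors the high-confidence pseudo-labeling step in the text.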

Model evaluation
The area under the receiver operating characteristic curve (AUC) is used to assess model accuracy and is widely used and robust. In general, the value of AUC lies between 0.5 and 1, and the closer the value of AUC is to 1, the better the model performance.
Recall is a widely used metric in information retrieval and statistical classification that indicates how many of the positive cases in the sample were correctly predicted. Recall takes a value between 0 and 1 and is calculated as
$$R = \frac{TP}{TP + FN}$$
where R is recall, TP is the number of true positives, and FN is the number of false negatives.
The difference between the AUC on the training data and the AUC on the test data (AUC diff) is a measure of model overfitting. In general, the closer the value of AUC diff is to 0, the lower the risk of overfitting [47].
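The three evaluation measures can be computed as in the following minimal sketch, which uses scikit-learn's roc_auc_score and recall_score on small hand-made score vectors. The scores, the 0.5 decision threshold, and the variable names are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Hand-made labels and predicted suitability scores (illustrative only).
y_train = np.array([1, 1, 1, 0, 0, 0])
p_train = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
y_test = np.array([1, 1, 1, 0, 0, 0])
p_test = np.array([0.9, 0.6, 0.35, 0.4, 0.2, 0.1])

# AUC on each split and their difference as an overfitting indicator.
auc_train = roc_auc_score(y_train, p_train)
auc_test = roc_auc_score(y_test, p_test)
auc_diff = auc_train - auc_test

# Recall at an assumed 0.5 threshold, computed from TP and FN directly.
pred_test = (p_test >= 0.5).astype(int)
tp = int(((pred_test == 1) & (y_test == 1)).sum())
fn = int(((pred_test == 0) & (y_test == 1)).sum())
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_test, pred_test)

print(auc_train, auc_test, auc_diff, recall_manual)
```

Here one positive test sample scores below one negative sample, so eight of the nine positive-negative pairs are ranked correctly and the test AUC is 8/9, while the training AUC is 1.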

Selection of P. euphratica labels
In this study, the location data for P. euphratica populations were derived from the research team's field surveys, which were stratified along environmental gradients to establish P. euphratica labels (figure 4). Firstly, the lower TRB study area was rasterized. Secondly, the environmental variables of the raster cells, such as temperature and elevation, were clustered by the K-Means algorithm, which iteratively assigns each point to the nearest cluster centroid and updates each centroid to the mean of its assigned points. The K-Means clustering results were divided into environmental gradient units, and the number of P. euphratica labels for each unit was determined from the number of grid cells it contains. Finally, we filtered the specific locations of P. euphratica in the different gradient units through field surveys, literature review, manual judgment, and historical image review. As a result, 443 P. euphratica labels and their specific locations were selected for this study. The environment at these locations was more suitable for P. euphratica growth, and P. euphratica grew well there (figure 5). Meanwhile, 7691 samples randomly selected from the large unlabeled dataset were used as the unlabeled training dataset for the semi-supervised model. For an unlabeled sample, only the environmental variables at the point, such as groundwater, are available, and the true label is not known.
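The stratification step can be illustrated with the following minimal sketch, which clusters synthetic raster cells with scikit-learn's KMeans and then allocates the 443 labels across clusters in proportion to cluster size. The number of clusters, the two environmental variables, and the proportional allocation rule are assumptions for illustration, as the text does not give these details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_cells = 2000
n_clusters = 5      # assumed number of environmental gradient units
n_labels = 443      # total number of labels, as reported in the text

# Synthetic raster cells: groundwater depth (m) and elevation (m), both assumptions.
env = np.column_stack([
    rng.uniform(1.0, 12.0, n_cells),
    rng.uniform(800.0, 900.0, n_cells),
])

# Standardize so both variables contribute comparably to the distance metric.
env_scaled = StandardScaler().fit_transform(env)
strata = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(env_scaled)

# Proportional allocation: each stratum gets labels in proportion to its cell count.
counts = np.bincount(strata, minlength=n_clusters)
exact = n_labels * counts / counts.sum()
quota = np.floor(exact).astype(int)

# Hand out the labels lost to rounding to the strata with the largest remainders.
leftover = n_labels - quota.sum()
quota[np.argsort(-(exact - quota))[:leftover]] += 1

print(dict(enumerate(quota.tolist())))
```

The resulting quotas indicate how many occurrence points would be sought in each environmental gradient unit before the field-based filtering described above.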

Semi-supervised versus supervised models
Using highly correlated variables in model construction can lead to a decline in model performance [48]. This study therefore extracted the required factors by principal component analysis. The top five factors together explained 87.39% of the sample variance, which means that they alone capture the vast majority of the information in the 12 months of groundwater data. The labeled and unlabeled training datasets together constitute the semi-supervised self-training dataset, which was resampled and modeled 100 times to quantify the uncertainty in the predictions. According to figure 6, all four SDMs can predict the distribution of P. euphratica with above 60% accuracy. The semi-supervised model performed the best, with a maximum accuracy of 85% and a recall of up to 90%. The traditional SDMs, which require a large number of samples, performed less well, and Maxent performed the worst, with a model accuracy of only around 62%. SVM performed slightly better than both the Maxent and GLM models, with mean values above 0.7 for all three evaluation criteria (figure 6). All four models have an AUC diff around 0, indicating a low risk of overfitting. The semi-supervised model has a slightly higher AUC diff than the other three models (figure 6). A possible reason is that semi-supervised learning trains on a labeled dataset and then iterates over the unlabeled data to create pseudo-labels, thus continuously expanding the training set.
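The dimensionality reduction of the 12 monthly groundwater variables can be sketched as follows with scikit-learn's PCA. The synthetic groundwater series, its seasonal shape, and the noise level are assumptions, so the cumulative explained variance printed here will not match the 87.39% reported for the study data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_sites = 1000
months = np.arange(12)

# Synthetic monthly groundwater depth: site-level mean + seasonal cycle + noise.
site_mean = rng.uniform(2.0, 10.0, (n_sites, 1))
seasonal_amplitude = rng.uniform(0.0, 1.0, (n_sites, 1))
groundwater = (
    site_mean
    + seasonal_amplitude * np.sin(2 * np.pi * months / 12)
    + rng.normal(0.0, 0.2, (n_sites, 12))
)

# Standardize the 12 monthly columns, then keep the first five components.
pca = PCA(n_components=5)
scores = pca.fit_transform(StandardScaler().fit_transform(groundwater))

# Cumulative share of variance explained by the first 1..5 components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(scores.shape, cumulative.round(4))
```

The component scores replace the 12 correlated monthly variables as model inputs, which is the purpose of the step described in the text.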
The performance of the models in this study varies with the number of labeled samples (figure 7). In general, the accuracy of the models gradually decreased as the sample size decreased. Traditional SDMs, which require a large number of samples, performed the worst and showed the sharpest decrease in model accuracy, while the semi-supervised model consistently outperformed the traditional SDMs with a small number of samples. For example, at a sample size of 230, the accuracy of the SVM, GLM, and Maxent models was only 72.09%, 71.51%, and 55.23%, respectively, while the accuracy of the semi-supervised model was the highest at 79.06%. Spatial autocorrelation is a common challenge in modeling species distribution and abundance. Strong spatial autocorrelation of residuals might be due to the absence of key environmental information in the model, the choice of the wrong model structure, or geographic factors influencing species distribution [37]. From figure 8, Moran's I of the model residuals is 0.221, and the p-value of the Z-test under the null hypothesis of a random spatial distribution is 0.028. At the 99% confidence level, the null hypothesis of spatial randomness therefore cannot be rejected, that is, the residuals show no significant spatial autocorrelation at this level. Thus, the semi-supervised model that we have developed is reliable.
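A residual autocorrelation check of this kind can be sketched with NumPy as below. This is a minimal illustration of global Moran's I with binary distance-band weights on a synthetic 10 × 10 grid. Note that it uses a permutation test rather than the Z-test reported above, and the grid, bandwidth, and helper function names are assumptions for the example.

```python
import numpy as np

def binary_distance_weights(coords, bandwidth):
    """Weight 1 for distinct points within the bandwidth, 0 otherwise."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return ((dist > 0) & (dist <= bandwidth)).astype(float)

def morans_i(values, weights):
    """Global Moran's I: (n / S0) * (z' W z) / (z' z), with z the centered values."""
    z = values - values.mean()
    return len(values) / weights.sum() * (z @ weights @ z) / (z @ z)

def permutation_p_value(values, weights, n_perm=999, seed=0):
    """One-sided pseudo p-value: share of shuffled maps with I at least as large."""
    rng = np.random.default_rng(seed)
    observed = morans_i(values, weights)
    simulated = np.array(
        [morans_i(rng.permutation(values), weights) for _ in range(n_perm)]
    )
    return (1 + np.sum(simulated >= observed)) / (n_perm + 1)

# Synthetic 10 x 10 grid of sample locations with rook-style neighbors.
xs, ys = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
weights = binary_distance_weights(coords, bandwidth=1.0)

# Residuals with a spatial trend versus spatially random residuals.
trend_residuals = coords[:, 0]
noise_residuals = np.random.default_rng(1).normal(size=len(coords))

i_trend = morans_i(trend_residuals, weights)
p_trend = permutation_p_value(trend_residuals, weights)
i_noise = morans_i(noise_residuals, weights)
print(i_trend, p_trend, i_noise)
```

Residuals that follow a spatial trend give a Moran's I close to 1 with a small p-value, whereas spatially random residuals give a value near 0, which is the pattern one hopes to see for a well-specified model.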
The relationship between accuracy and robustness in the modeling process varies considerably among SDMs [49]. This study evaluates the results of commonly used SDMs against a semi-supervised model for modeling suitable habitats for P. euphratica in the lower TRB. The predictive performance of the semi-supervised model is considerably higher than that of the traditional models. The reason probably relates to the environmental heterogeneity of the area, where traditional SDMs fail to accurately identify the relationship between environment and species from a small sample size. Moreover, inadequate sampling leads to differences between the prior distribution of the samples and that of the entire population, which reduces the accuracy and generalization performance of the SDM [50]. Semi-supervised models effectively avoid such problems by combining the strengths of supervised and unsupervised learning: they train the model on a small sample, repeatedly predict labels for large volumes of unsampled data, and incorporate new samples with high confidence into the training set, thus gradually improving the accuracy and generalization performance of the model.

Suitable areas for P. euphratica in the lower Tarim River
P. euphratica is a forest resource unique to arid zones, and it is essential for stabilizing the ecological balance of desert river zones, providing windbreaks, fixing sand, regulating the climate of oases, and creating fertile forest soils [51]. Research has indicated that the optimum groundwater depth for the growth of P. euphratica is approximately 3-4 m, and that groundwater depths exceeding 9 m cause significant degradation of P. euphratica [22]. The horizontal root system of P. euphratica is relatively well developed, dense, and wide-ranging, which facilitates the uptake of sufficient water in drought conditions. In contrast, young P. euphratica saplings initially grow long primary roots to track the decline in groundwater and are more sensitive to drought stress. Most saplings were observed within 20-50 m of the river, and their number decreased sharply as the distance from the river increased [52].
The potential habitat suitability areas for endangered P. euphratica in the lower reaches of the TRB were estimated with the semi-supervised self-training model, since it performs best (figure 9). From figure 9, the suitable areas for P. euphratica are mainly near the river course of the lower TRB, especially at the junction of the Daxihaizi Reservoir and at the confluence of the two main rivers downstream, and they account for 13.49% of the entire study area. These areas are suitable for P. euphratica growth because of their relatively shallow water table, since precipitation here is scarce and groundwater is the only water resource for P. euphratica growth [53]. The findings of this study also indicate that the suitable locations for P. euphratica growth vary at small regional scales. Conservation of endangered species needs to take full account of the potential habitat of the species so that planting takes place in the right locations. By analyzing the potential habitat of P. euphratica in the TRB, we found that there is still significant land potential for restoration. However, the EWCP has a limited impact: areas away from the river channel are still unsuitable for P. euphratica growth.
As a benchmark for the restoration of endangered species at small regional scales, SDMs can help guide species planting, cultivation, and conservation. Several scholars have delineated protected areas for rare species at small scales with high-resolution variables, which requires larger sample sizes. For example, Lannuzel et al modeled the distribution of 13 rare plant species in New Caledonia with 1610 samples at 10 m spatial resolution [11]. Ahmadi et al simulated four dominant tree species in the Hyrcanian forests of northern Iran with 5600 samples at 33.3 m spatial resolution using four traditional SDMs [54]. For endangered species in the arid zone, traditional SDMs do not have such large sample sizes to help them learn the relationship between species and their environment, so they cannot accurately predict habitat or guide the implementation of restoration measures. Conversely, a huge amount of data on environmental variables is easily accessible and usable. The semi-supervised method can use a large number of environmental variables without class labels to improve the predictive ability of the model, thus reducing its uncertainty. One critical point for the accuracy of SDMs is the quality of the data [6]. In this study, the quality of the labeled data was further improved by field visits to determine where P. euphratica was located.
The 'rare species modeling paradox' refers to the fact that rare species are most in need of predictive distribution modeling, yet their limited occurrence data make them the most difficult to model [10]. To overcome this paradox, Lomba et al modeled a rare plant with only 37 samples by fitting several bivariate models and averaging all models using a weighted integration method [10]. Likewise, Breiner et al used ensembles of three traditional SDMs to model 107 rare plant species with 20-140 samples [55]. These researchers improved the modeling results by integrating multiple models. In contrast, semi-supervised models work directly on the data, selecting unlabeled data with high confidence through successive iterations, pseudo-labeling them, and adding them to the training set. Such a mechanism can effectively address the lack of occurrence data for rare species.
The model established in this study is useful for rare species for which only a small amount of occurrence data is available, or for species where sampling is difficult and a large number of labeled samples cannot be obtained. To improve model performance, additional variables derived from the Digital Elevation Model data (e.g. aspect, slope, topographic wetness index) can be considered in the SDM when analyzing suitable habitats for endangered species. Semi-supervised deep learning can also be considered in future work, as it is good at handling sparse samples and high-dimensional data [56].

Conclusions
Simulation of suitable habitats for endangered species is one of the essential tools for protecting endangered species from extinction. In this study, a semi-supervised model was applied to identify suitable habitats for endangered species based on limited field surveys, and its predictive ability was compared with that of several commonly used SDMs. Taking P. euphratica, an endangered species in the lower TRB, as a reference case, the semi-supervised model was employed to identify suitable habitats for P. euphratica in the region. Overall, the accuracy of the models gradually decreased as the sample size decreased, and the semi-supervised model consistently outperformed the traditional SDMs with a small sample size. The results also show that the suitable habitat for P. euphratica in the lower TRB is mainly near the river channel, accounting for 13.49% of the entire study area. This may be because precipitation here is scarce, and groundwater, which is mainly recharged by the water conveyance project, is the only water resource of P. euphratica. Such potentially suitable habitat is therefore a priority area for future P. euphratica restoration. This study establishes a semi-supervised model for assessing suitable habitat for endangered species with only a few samples, which can serve as a guide for the global habitat conservation of endangered species.

Data availability statement
The model used in this study is available in the scikit-learn (sklearn) library for Python; see scikit-learn/_self_training.py on the main branch of the scikit-learn/scikit-learn GitHub repository. The Digital Elevation Model data are from the Data Center for Resource and Environmental Sciences, Chinese Academy of Sciences (www.gscloud.cn/sources). Monthly groundwater data are from the lower Tarim River groundwater monitoring wells. Soil property information was obtained from the 2020 soil dataset of the World Soil Database (HWSD) of the National Tibetan Plateau Scientific Data Center, via https://soilgrids.org.
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/cimengtao/Datasets-open.