Habitat Suitability Estimation Using a Two-Stage Ensemble Approach

Biodiversity conservation is important for the protection of ecosystems. One key task for sustainable biodiversity conservation is to effectively preserve species’ habitats. However, for various reasons, many of these habitats have been reduced or destroyed in recent decades. To deal with this problem, it is necessary to effectively identify potential habitats based on habitat suitability analysis and preserve them. Various techniques for habitat suitability estimation have been proposed to date, but they have had limited success due to limitations in the data and models used. In this paper, we propose a novel scheme for assessing habitat suitability based on a two-stage ensemble approach. In the first stage, we construct a deep neural network (DNN) model to predict habitat suitability based on observations and environmental data. In the second stage, we develop an ensemble model using various habitat suitability estimation methods based on observations, environmental data, and the results of the DNN from the first stage. For reliable estimation of habitat suitability, we utilize various crowdsourced databases. Using observational and environmental data for four amphibian species and seven bird species in South Korea, we demonstrate that our scheme provides a more accurate estimation of habitat suitability compared to previous other approaches. For instance, our scheme achieves a true skill statistic (TSS) score of 0.886, which is higher than other approaches (TSS = 0.725 ± 0.010).


Introduction
For decades, the importance of biodiversity conservation has been emphasized globally because high biodiversity offers a variety of natural services that support sustainable human living [1]. Despite this importance, ecosystem services have rapidly declined for a variety of reasons, such as indiscriminate resource development, rapid urban expansion, and global climate change. The loss of biodiversity can have adverse consequences on the ecosystem because of the complex interactions that exist among species [2,3]. To maintain biodiversity levels, ecologists have devised and applied various methods to protect habitats by analyzing the characteristics of target species and their habitats [4,5].
Habitat suitability models, also known as species distribution models (SDMs), environmental niche models (ENMs), and predictive habitat distribution models, have been used to predict the habitat of target species based on various environmental factors, such as temperature, precipitation, seasonality, and terrain [2,3,6]. Habitat suitability models can be used to assess not only the relationship among various environmental factors, such as global climatic conditions, landscape information, and species habitats, but also landscape management and the conservation of endangered species [6][7][8][9]. With the development of remote sensing technology, the performance of habitat suitability models has improved significantly. Until a few decades ago, the prediction of habitat suitability for particular Table 1. Overview of habitat suitability modeling techniques.

Models
Descriptions References Category

SRE
Profiling technique that uses the environmental conditions of locations of occurrence data to profile the environments where a species can be found.
Hastie et al. [14] Statistical regression GLM Parametric regression technique based on a random component, a systematic component, and a link function describing a relation between the former the random and systematic component.
Elith et al. [10], Zuur et al. [12] Statistical regression MARS Non-parametric regression technique that builds multiple linear regression models across the range of predictor values.

Statistical regression GBM
Machine learning technique based on the combinations of decision tree algorithms and boosting methods.
Thomaes et al. [20], De'Ath et al. [21] Machine learning CTA Machine learning technique that is a supervised non-parametric statistical classification approach based on binary recursive partitioning techniques Breiman et al. [22], De'ath G et al. [23] Machine learning SNN Machine learning technique based on non-linear mapping structures inspired on the biological system of the brain.
D'heygere et al. [24], Fukuda et al. [25] Machine learning RF Machine learning technique using bootstrap aggregation to create a set of decision trees.
Phillips et al. [17], Breiman L et al. [26], Cutler DR et al. [27] Machine learning MAXENT Machine learning technique using the principle of maximum entropy to make a prediction from incomplete knowledge.
Phillips et al. [15], Reiss et al. [16] Machine learning Remote Sens. 2020, 12, 1475 4 of 18 More recently, habitat modeling based on deep neural networks (DNNs) has been investigated in ecological research [28][29][30][31]. In general, DNNs provide more accurate predictions in terms of identifying potential habitats for target species than conventional models such as GLM, GAM, MARS, and BRT. This is because DNNs can automatically extract features and learn complex non-linearities from extracted features [32]. For instance, Rademaker et al. [28] determined the niches of wild and domesticated ungulate species using modeling schemes based on a DNN. They focused on the applicability of the DNN and employed it in habitat estimation modeling. In their experiment, they showed that DNN could effectively identify potential habitats using sufficient observational data. Botella et al. [29] proposed a deep-learning approach for SDM. They applied a convolution neural network (CNN) and DNN to overcome the shortcomings of the traditional SDM. To evaluate its performance, they used part of the GBIF dataset and 46 environmental layers, including climate, digital elevation, and land cover. They subsequently found that both models performed better than traditional models such as the GAM and MAXENT.
However, despite the versatility of DNNs [30][31][32][33][34][35], a DNN-based habitat model trained on a small observational dataset for a species has been shown to produce inferior estimations to traditional machine-learning models, such as MAXENT and RF. Thus, constructing an accurate habitat suitability model is challenging because obtaining observational data is difficult [28]. Collecting sufficient observational data is particularly crucial when constructing habitat suitability models for endangered species. To overcome these problems, in this paper, we propose a novel two-stage based ensemble model called TSEM for the development of an effective habitat suitability model using an ensemble of various habitat suitability estimation techniques and DNN. Our TSEM was trained and tested on the crowdsourced datasets composed of volunteers' observation data. Strictly speaking, the observation data may indicate where the species actually lives or where the observations were made. As a result, the estimation results of our model could have similar characteristics. To improve the performance of habitat suitability models, we focus on three major issues, which are the main contributions of this paper.

•
Using crowdsourcing databases [36][37][38][39] and a diverse range of environmental data, we employ data pre-processing to generate framed data, which consist of observation data for the target species and related environmental data.

•
We propose a two-stage modeling scheme. In the first stage, we construct a DNN model using framed data. Then, we build an ensemble model using a diverse range of habitat suitability estimation methods and the results of the DNN model in the first stage to improve estimation performance in the second stage.

•
We compare our ensemble model with other estimation models based on a variety of evaluation metrics and statistical analysis and verify the superiority of our model.
The rest of this paper is organized as follows. We first introduce the steps required to construct the TSEM in Section 2. Then, we present several experiments conducted to evaluate the performance of our proposed model and visualize the results using a map-overlay function in Section 3. Finally, we summarize the major findings and provide directions in Section 4.

Overview of Two-Stage Habitat Suitability Estimation Model
We describe in detail our two-stage habitat suitability estimation model with the overall structure ( Figure 1). Observational, global climatic [40], and Korean land cover data [41] were initially collected to configure the independent variables for our model. In the first stage of the model, we constructed a DNN model as a sub-model. Next, a stacking ensemble-based estimation model was constructed using the results of the DNN sub-model as input to improve estimation performance. Finally, the habitat suitability results and other widely used evaluation metrics, including area under the curve (AUC), Remote Sens. 2020, 12, 1475 5 of 18 sensitivity, specificity, the kappa statistic, and the TSS were visualized for a performance comparison between the TSEM and previously reported models [42][43][44][45][46][47].

Dataset Construction
In general, the performance of a DNN-based estimation model depends on the quality and size of the dataset used for training. To construct our dataset, we first reviewed several crowdsourced databases that contain global observations of various species, selected 11 target species that were primarily found in South Korea, and then collected observational data for these species. These species are all considered conservation targets in Korean wildlife conservation projects. We listed the target species and their number of observations, as reported in the global biodiversity information facility (GBIF), VertNet, biodiversity information serving our nation (BISON), and Naturing databases ( Table  2). Species habitats are closely related to the climate and land conditions [42][43][44][45][46][47]. Therefore, to construct a valid habitat model, we collected various layers of environment information from Worldclim Bioclimatic [40] and a land cover dataset for South Korea [41], which are both widely used for ecological modeling. The land cover dataset for South Korea was generated using Korea multipurpose satellite No. 2, also known as KOMPSAT No. 2 or Arirang 2, and satellite pour l'observation de la terre 5 (SPOT 5) remote sensing images from 2009. KOMPSAT No. 2, which is equipped with a 1-m high-resolution multi-spectral camera (MSC), has orbited Earth approximately 46,800 times in nine years, capturing approximately 75,400 high-quality satellite images of Korea, while SPOT 5 can capture satellite images with a coverage of 60 km × 60 km and a resolution of 5 m. Compared to the GlobCover, the land cover dataset for South Korea provides more detailed information due to its higher resolution. Because this land cover dataset consists of categorical variables, we converted them into proximity distance layers, which have continuous values. Proximity distance for land cover layers is regarded as crucial for modeling because the unique survival traits of species and their habitat characteristics are closely related. Indeed, several studies have improved the performance of habitat suitability estimation models by considering the distance between the environmental layers and species observations. For this reason, we also employed proximity distance as an input variable. In total, we used 41 environmental layers as input variables for our habitat suitability modeling (Table 3).

Dataset Construction
In general, the performance of a DNN-based estimation model depends on the quality and size of the dataset used for training. To construct our dataset, we first reviewed several crowdsourced databases that contain global observations of various species, selected 11 target species that were primarily found in South Korea, and then collected observational data for these species. These species are all considered conservation targets in Korean wildlife conservation projects. We listed the target species and their number of observations, as reported in the global biodiversity information facility (GBIF), VertNet, biodiversity information serving our nation (BISON), and Naturing databases ( Table 2). Species habitats are closely related to the climate and land conditions [42][43][44][45][46][47]. Therefore, to construct a valid habitat model, we collected various layers of environment information from Worldclim Bioclimatic [40] and a land cover dataset for South Korea [41], which are both widely used for ecological modeling. The land cover dataset for South Korea was generated using Korea multi-purpose satellite No. 2, also known as KOMPSAT No. 2 or Arirang 2, and satellite pour l'observation de la terre 5 (SPOT 5) remote sensing images from 2009. KOMPSAT No. 2, which is equipped with a 1-m high-resolution multi-spectral camera (MSC), has orbited Earth approximately 46,800 times in nine years, capturing approximately 75,400 high-quality satellite images of Korea, while SPOT 5 can capture satellite images with a coverage of 60 km × 60 km and a resolution of 5 m. Compared to the GlobCover, the land cover dataset for South Korea provides more detailed information due to its higher resolution. Because this land cover dataset consists of categorical variables, we converted them into proximity distance layers, which have continuous values. Proximity distance for land cover layers is regarded as crucial for modeling because the unique survival traits of species and their habitat characteristics are closely related. Indeed, several studies have improved the performance of habitat suitability estimation models by considering the distance between the environmental layers and species Remote Sens. 2020, 12, 1475 6 of 18 observations. For this reason, we also employed proximity distance as an input variable. In total, we used 41 environmental layers as input variables for our habitat suitability modeling (Table 3).             Proximity distance from each cell to a residential area (detached residential and common residential areas) Continuous Distance_2 Proximity distance from each cell to an industrial area Continuous Distance_3 Proximity distance from each cell to a commercial area (commercial/business and mixed residential/business areas) Continuous Distance_4 Proximity distance from each cell to a leisure facility area Continuous Distance_5 Proximity distance from each cell to a transportation area (airport, harbor, railway, road, and other transportation and communication facilities) Continuous Distance_6 Proximity distance from each cell to a public facility area (basic environmental, education/administrative, and other public facilities) Continuous Distance_7 Proximity distance from each cell to paddy fields (land consolidation success and undergoing land consolidation in paddy fields) Continuous Distance_8 Proximity distance from each cell to dry fields (land consolidation success and undergoing land consolidation in dry fields) Continuous Distance_9 Proximity distance from each cell to a greenhouse Continuous Distance_10 Proximity distance from each cell to an orchard Continuous Distance_11 Proximity distance from each cell to other plantations (pastureland and other plantations) Continuous Distance_12 Proximity distance from each cell to broadleaf forest Continuous Distance_13 Proximity distance from each cell to coniferous forest Continuous Distance_14 Proximity distance from each cell to mixed forest Continuous Distance_15 Proximity distance from each cell to natural pasture Continuous Distance_16 Proximity distance from each cell to artificial pasture (golf course, cemetery, and other pastures) Continuous Distance_17 Proximity distance from each cell to inland wetland Continuous Distance_18 Proximity distance from each cell to coastal wetland (tidal mudflat and saltern) Continuous Distance_19 Proximity distance from each cell to naturally barren areas (beaches, riverbanks, and rocks) Continuous Distance_20 Proximity distance from each cell to artificially barren areas (mining area, playground, and other barrens) Continuous Distance_21 Proximity distance from each cell to inland water (rivers and lakes) Continuous Distance_22 Proximity distance from each cell to ocean water Continuous Remote Sens. 2020, 12, 1475 8 of 18

Data Preprocessing
To conduct habitat suitability estimation, preprocessing of the collected observation data and environmental variables was required, consisting of a number of steps. We carried out this data preprocessing using Quantum Geographic Information System (QGIS) 3.8.1. We present the steps used to prepare the training and testing datasets in Figure 2. First, we set the resolution to 3000 × 3000 pixels and cropped the collected layers based on the study area, which corresponded to a rectangle on the map. For this, we used the World Geodetic System 1984 (WGS84), which is an Earth-centered, Earth-fixed terrestrial reference system and a geodetic datum. In this system, the entire South Korea region is represented by the latitude and longitude coordinates (125.000, 38.083), (129.583, 38.083), (125.000, 33.166), and (129.583, 33.166). Because the bioclimatic and land cover layers (i.e., the classified land cover in South Korea) are all in a shape (.shp) file format, we converted them into a gridded data (.grd) file format, which consists of scalar values on a regular rectangular grid, either in longitude or latitude space.  Proximity distance from each cell to artificially barren areas (mining area, playground, and other barrens) Continuous

Distance_21
Proximity distance from each cell to inland water (rivers and lakes) Continuous Distance_22 Proximity distance from each cell to ocean water Continuous

Data Preprocessing
To conduct habitat suitability estimation, preprocessing of the collected observation data and environmental variables was required, consisting of a number of steps. We carried out this data preprocessing using Quantum Geographic Information System (QGIS) 3.8.1. We present the steps used to prepare the training and testing datasets in Figure 2. First, we set the resolution to 3000 × 3000 pixels and cropped the collected layers based on the study area, which corresponded to a rectangle on the map. For this, we used the World Geodetic System 1984 (WGS84), which is an Earth-centered, Earth-fixed terrestrial reference system and a geodetic datum. In this system, the entire South Korea region is represented by the latitude and longitude coordinates (125.000, 38.083), (129.583, 38.083), (125.000, 33.166), and (129.583, 33.166). Because the bioclimatic and land cover layers (i.e., the classified land cover in South Korea) are all in a shape (.shp) file format, we converted them into a gridded data (.grd) file format, which consists of scalar values on a regular rectangular grid, either in longitude or latitude space. We configured each pixel of the raster to have a coverage of approximately 135 m, which produces 900 M grids if we convert the entire South Korean region into a grid space. We then applied min-max normalization to all variables in the cropped bioclimatic layers. For the preprocessing of the land cover layer, we first divided it into multiple layers according to the land cover labels and conducted rasterization. We then calculated the proximity distance based on the separate single rasters and applied min-max normalization to each distance raster. Finally, we stacked all of the We configured each pixel of the raster to have a coverage of approximately 135 m, which produces 900 M grids if we convert the entire South Korean region into a grid space. We then applied min-max normalization to all variables in the cropped bioclimatic layers. For the preprocessing of the land cover layer, we first divided it into multiple layers according to the land cover labels and conducted rasterization. We then calculated the proximity distance based on the separate single rasters and applied min-max normalization to each distance raster. Finally, we stacked all of the rasters to generate a data frame that included labeled presence and absence, and that matched the values of the environmental rasters in the given presence and absence locations.

Stage 1: Habitat Suitability Estimation Based on a DNN
Recently, several DNN-based habitat suitability models have been proposed and have performed well when compared with previous methods [28][29][30]. Hence, we constructed a DNN model in the first stage to determine the probability of a species' presence or absence, with a higher probability of presence for a species indicating higher habitat suitability. In general, DNNs consist of three types of layers: an input layer, one or more hidden layers, and an output layer. The input layer receives input variables, while the hidden layers are involved in hidden feature processing. The output layer then produces the final prediction. The prediction performance of a DNN model is determined by the configuration of each layer and the model design. For example, it has been shown that the learning rate, the optimizer, regularization, and the activation function significantly affect prediction performance [33]. To obtain the best performance from the DNN model, we carefully determined the optimal hyperparameters using grid search and considering related research [28]. We used the GridSearchCV function of the scikit-learn library [48]. The number of repetitions of grid search was set to infinite, and the number of cross-validations was set to five times. Consequently, when we constructed our DNN model, we used four hidden layers containing 250, 200, 150, and 100 neurons ( Figure 3). We set the batch size to 75 and the number of epochs to 5000 with early stopping to optimize model training. To decide the batch size in the training stage, we carefully considered the results of the grid search. For training optimization, we tested three optimizers, including the stochastic gradient descent (SGD), root mean square propagation (RMSprop), adaptive moment estimation (ADAM), and then selected the ADAM as the best optimizer. In addition, we utilized the he-normal (HE) initialization to sort initial weights for individual inputs in a neuron model. The activation function controls the non-linearity of individual neurons. We tested five popular activation functions: linear, soft-max, rectified linear unit (ReLU), tangent, and sigmoid. Through the grid search, we selected ReLU as the activation function of our training model. The learning rate is a hyper-parameter that controls how much we are adjusting the weights of our network with respect to the loss gradient. If the learning rate is set too small, it might take a long time to converge on the performance goal. On the contrary, if the learning rate is set too large, the average loss will increase. To obtain the optimal learning rate, we performed a grid search with ADAM as the optimizer and ReLU as the activation function. Our training model was able to achieve optimal learning efficiency when the learning rate was 0.001.

Stage 1: Habitat Suitability Estimation Based on a DNN
Recently, several DNN-based habitat suitability models have been proposed and have performed well when compared with previous methods [28][29][30]. Hence, we constructed a DNN model in the first stage to determine the probability of a species' presence or absence, with a higher probability of presence for a species indicating higher habitat suitability. In general, DNNs consist of three types of layers: an input layer, one or more hidden layers, and an output layer. The input layer receives input variables, while the hidden layers are involved in hidden feature processing. The output layer then produces the final prediction. The prediction performance of a DNN model is determined by the configuration of each layer and the model design. For example, it has been shown that the learning rate, the optimizer, regularization, and the activation function significantly affect prediction performance [33]. To obtain the best performance from the DNN model, we carefully determined the optimal hyperparameters using grid search and considering related research [28]. We used the GridSearchCV function of the scikit-learn library [48]. The number of repetitions of grid search was set to infinite, and the number of cross-validations was set to five times. Consequently, when we constructed our DNN model, we used four hidden layers containing 250, 200, 150, and 100 neurons (Figure 3). We set the batch size to 75 and the number of epochs to 5000 with early stopping to optimize model training. To decide the batch size in the training stage, we carefully considered the results of the grid search. For training optimization, we tested three optimizers, including the stochastic gradient descent (SGD), root mean square propagation (RMSprop), adaptive moment estimation (ADAM), and then selected the ADAM as the best optimizer. In addition, we utilized the he-normal (HE) initialization to sort initial weights for individual inputs in a neuron model. The activation function controls the non-linearity of individual neurons. We tested five popular activation functions: linear, soft-max, rectified linear unit (ReLU), tangent, and sigmoid. Through the grid search, we selected ReLU as the activation function of our training model. The learning rate is a hyper-parameter that controls how much we are adjusting the weights of our network with respect to the loss gradient. If the learning rate is set too small, it might take a long time to converge on the performance goal. On the contrary, if the learning rate is set too large, the average loss will increase. To obtain the optimal learning rate, we performed a grid search with ADAM as the optimizer and ReLU as the activation function. Our training model was able to achieve optimal learning efficiency when the learning rate was 0.001. The finished DNN model was to generate the probability of both presence and absence for a species. The more suitable an area was as habitat for a particular species, the closer the probability of that species' presence was to 1, while the probability of absence followed the opposite trend.  The finished DNN model was to generate the probability of both presence and absence for a species. The more suitable an area was as habitat for a particular species, the closer the probability of that species' presence was to 1, while the probability of absence followed the opposite trend.

Stage 2: Ensemble-Based Habitat Suitability Estimation
Ensembles of machine-learning techniques have been widely adopted to solve various prediction problems in past research [43][44][45][46][47]. Compared to one machine-learning model, ensembles can improve prediction performance by combining several models. In the field of ecological modeling, ensemble models are widely known as a useful approach for the construction of potential habitat estimation models [43][44][45][46][47]. Therefore, in the second stage, we developed an ensemble-based habitat suitability estimation model using the BIOMOD2 package [49] for R programming. We present the overall construction process for our ensemble model in Stage 2 in Figure 4.

Stage 2: Ensemble-Based Habitat Suitability Estimation
Ensembles of machine-learning techniques have been widely adopted to solve various prediction problems in past research [43][44][45][46][47]. Compared to one machine-learning model, ensembles can improve prediction performance by combining several models. In the field of ecological modeling, ensemble models are widely known as a useful approach for the construction of potential habitat estimation models [43][44][45][46][47]. Therefore, in the second stage, we developed an ensemble-based habitat suitability estimation model using the BIOMOD2 package [49] for R programming. We present the overall construction process for our ensemble model in Stage 2 in Figure 4. According to the authors of [28,50], a low number of observations (i.e., n < 100) can degrade the estimation performance of a habitat model because using very few observations in model construction leads to overfitting and bias [8][9][10]. To solve this issue, we used 41 environmental layers and the results of habitat suitability from the DNN in Stage 1 as input variables for the ensemble model in Stage 2. This modeling method is known as stacking and can effectively avoid the possibility of overfitting and bias [51]. We built our ensemble model by combining GLM, GBM, CTA, SNN, FDA, MARS, RF, the SRE model, and MAXENT and used a weighted-average algorithm, which returns a weighted value for each model based on selected evaluation scores. Therefore, an accurate estimation model will have a relatively high weighted value when it combines all of the models. Because the TSS has been proven to be a reliable evaluation metric when measuring and assessing the performance of habitat models [47], we used the TSS to calculate the weighted value. Equation (1) was used to calculate the final estimation using the weighted average value for each model, in which and represent the class label for presence and absence and the number of models, respectively, indicates the estimated class label, and is the calculated probability of the th model. In addition, is the weighted value of the th model, which was calculated using Equation (2). We evaluated each model using five-fold cross-validation and obtained the final TSS value as the average of the TSSs generated by the individual models. * (1)

Results and Discussion
We evaluated our proposed model and compared its performance with other approaches to habitat suitability modeling. We first explained the evaluation metrics used to assess the quality of According to the authors of [28,50], a low number of observations (i.e., n < 100) can degrade the estimation performance of a habitat model because using very few observations in model construction leads to overfitting and bias [8][9][10]. To solve this issue, we used 41 environmental layers and the results of habitat suitability from the DNN in Stage 1 as input variables for the ensemble model in Stage 2. This modeling method is known as stacking and can effectively avoid the possibility of overfitting and bias [51]. We built our ensemble model by combining GLM, GBM, CTA, SNN, FDA, MARS, RF, the SRE model, and MAXENT and used a weighted-average algorithm, which returns a weighted value for each model based on selected evaluation scores. Therefore, an accurate estimation model will have a relatively high weighted value when it combines all of the models. Because the TSS has been proven to be a reliable evaluation metric when measuring and assessing the performance of habitat models [47], we used the TSS to calculate the weighted value. Equation (1) was used to calculate the final estimation using the weighted average value for each model, in which i and j represent the class label for presence and absence and the number of models, respectively,ŷ indicates the estimated class label, and p ij is the calculated probability of the jth model. In addition, w j is the weighted value of the jth model, which was calculated using Equation (2). We evaluated each model using five-fold cross-validation and obtained the final TSS value as the average of the TSSs generated by the individual models.

Results and Discussion
We evaluated our proposed model and compared its performance with other approaches to habitat suitability modeling. We first explained the evaluation metrics used to assess the quality of habitat suitability estimation and then evaluated the performance of our proposed model and other commonly used models using these metrics. We also visualized the results for habitat suitability analysis using map overlays.

Evaluation Metrics
To evaluate the performance of our model, we used five metrics: sensitivity, specificity, the kappa statistic, AUC, and TSS. These have all been regularly used to assess habitat modeling performance in ecology [47]. The percentage of correctly predicted sites was excluded as a measure of prediction accuracy for the proposed model because, even though it is simple to calculate, its usefulness is severely limited for rare species [19]. To evaluate the estimation results, we used a confusion matrix in which a, b, c, and d indicate true positive, false positive, false negative, and true negative, respectively (Table 4). For instance, when the ground truth is the presence and the prediction result from the proposed model is also the presence, then we counted it as a true positive. Sensitivity, specificity, the kappa statistic, and TSS were calculated using Equations (3)-(6), respectively, based on this confusion matrix.

Presence Absence
Presence Sensitivity represents the probability that a model will correctly predict the presence of a species, while specificity measures the probability of a model accurately predicting the absence of a species. TSS normalizes overall accuracy [47,50]. AUC is widely used to assess the accuracy of habitat suitability models because it is easy to interpret, thus allowing comparison between models. ROC curves are frequently used as a single threshold-independent measure for model performance. In previous studies designed to predict habitat suitability [52][53][54], models with an AUC greater than 0.8 were considered valid as predictive models. The kappa statistic is also a common evaluation metric used for habitat suitability estimation models, but it has been criticized for being heavily dependent on prevalence. TSS, on the other hand, avoids this problem while offering the advantages of the kappa statistic. In general, most ecological modeling research uses sensitivity, specificity, the kappa statistic, and TSS together to analyze the performance of habitat suitability models [51][52][53][54][55]. Thus, we used these five metrics together to compare their weaknesses, strengths, and commonalities.

Performance Evaluation
We described the comparison results for the habitat suitability models using the five metrics discussed in Section 3.1. We compared as many estimation models as possible, including GLM, GBM, CTA, SNN, FDA, MARS, RF, SRE, DNN, ensemble models not including DNN (EMED), and our proposed approach. As mentioned above, EMED has demonstrated satisfactory performance in the previous studies. We constructed the EMED model in the present study using GLM, GBM, CTA, SNN, FDA, MARS, RF, and SRE. All of these models were trained and tested using the BIOMOD2 package in R and were verified using five-fold cross-validation. We used 80% of the species observation as the training set and 20% as the test set. We present the selected parameters and training strategies for each model in the following (Table 5). We calculated the estimation performance of various models for target species using five metrics and present their averages in Table 6. Detailed experimental results, including sensitivity, specificity, AUC, kappa statistic, and TSS, can be found in the Supplementary Materials (Tables S1-S5). To objectively assess the estimation results, we show the evaluation criteria of AUC, kappa statistic, and TSS in Table 7. We can observe that our proposed model showed the best performance, while the DNN exhibited weak performance because model training was insufficient due to the lack of training data. Likewise, the SRE model showed a poor performance for the prediction (AUC < 0.6). Even though the SRE model is intuitive and fast, it does not fully reflect the interactions between environmental conditions and species distributions in modeling. All other models except the DNN and SRE models performed reasonably well in terms of predicting the presence of a species. In contrast, in terms of specificity, DNN was the best performing model. The AUC has long been regarded as the standard metric for assessing the performance of habitat suitability models. In most cases, TSEM demonstrated the best estimation performance, while EMED also generated high AUC values, with an average of 0.972. This demonstrates that the two-stage based ensemble approach can improve estimation performance. While TSEM performed best for kappa statistic and TSS, SRE was the worst-performing model. For Hyla japonica, EMED yielded a higher TSS (EMED = 0.786 and TSEM = 0.783) than TSEM because the DNN model in the first stage produced very poor estimation results. However, for all other species, our model outperformed the other models. In summary, based on these comparisons, clearly TSEM is more suitable for deriving ecological insights related to habitat suitability estimation.
To confirm whether the estimation results of our model are valid, we selected Rana huanrenensis as a visualization case. The visualizations of the results for other target species can be found in the Supplementary Materials (Figures S1-S10). Rana huanrenensis, also known as the Korean stream brown frog, lives mainly in Korea and Japan, and its habitats are identified as wetland, forest, and grassland ( Table 2). Due to the low number of confirmed populations of this species, it could be listed as vulnerable (VN) under the IUCN Red List criterion, but is listed as least concern (LC) based on the assumption of widespread occurrence, especially in Korea. The Rana huanrenensis lives in valleys in high montane regions, above 500 m in elevation [56]. This species is mainly observed from March to April, which is very closely related to the breeding season of this species. Rana huanrenensis breeds in slow-moving montane streams and rivers, and their eggs are laid in moderately small masses that are attached to submerged rocks [57]. Indeed, the distribution of observation data for Rana huanrenensis fits well with their habitat characteristics. We marked the blue points as a training set and yellow points as a test set. The areas marked in black represent the entire coniferous forest, mixed forest, and broad-leaved forest ( Figure 5). In Figure 6, green indicates suitable species habitats and blue indicates uninhabitable areas. In evaluating the habitat suitability estimation of Rana huanrenensis, TSEM showed a better estimation performance (TSS = 0.949) than other estimation models. Although EMED and RF performed slightly worse than TSEM (TSS of EMED = 0.819 and TSS of RF = 0.822), these models showed excellent estimation performance based on evaluation criteria. The distribution of Rana huanrenensis habitat estimated by TSEM is very similar to the mountainous terrain of South Korea. The SNN, MARS, GBM, EMED, and RF also showed the distribution of habitat estimations similar to mountainous terrain. However, we confirmed that the MARS, GBM, EMED, and RF estimated that the Rana huanrenensis was suitable for habitation in some regions, including residential, industrial, and commercial areas. The TSEM estimated that the areas of mixed forests, coniferous forests, and broad-leaved forests were relatively more suitable as the Rana huanrenensis habitat. Even though the TSEM was trained and tested on crowdsourced datasets of the Rana huanrenensis, the estimation results matched well with their actual habitats because most of the observation data for the Rana huanrenensis were near the main habitats such as wetland, montane streams, and forest. We demonstrated that the habitat suitability results estimated by the TSEM are well-matched when compared with the existing studies of the Rana huanrenensis habitats [56,57].  The highest values are in bold.  Figures S1-S10). Rana huanrenensis, also known as the Korean stream brown frog, lives mainly in Korea and Japan, and its habitats are identified as wetland, forest, and grassland (Table 2). Due to the low number of confirmed populations of this species, it could be listed as vulnerable (VN) under the IUCN Red List criterion, but is listed as least concern (LC) based on the assumption of widespread occurrence, especially in Korea. The Rana huanrenensis lives in valleys in high montane regions, above 500 m in elevation [56]. This species is mainly observed from March to April, which is very closely related to the breeding season of this species. Rana huanrenensis breeds in slow-moving montane streams and rivers, and their eggs are laid in moderately small masses that are attached to submerged rocks [57]. Indeed, the distribution of observation data for Rana huanrenensis fits well with their habitat characteristics. We marked the blue points as a training set and yellow points as a test set. The areas marked in black represent the entire coniferous forest, mixed forest, and broad-leaved forest ( Figure 5). In Figure 6, green indicates suitable species habitats and blue indicates uninhabitable areas. In evaluating the habitat suitability estimation of Rana huanrenensis, TSEM showed a better estimation performance (TSS = 0.949) than other estimation models. Although EMED and RF performed slightly worse than TSEM (TSS of EMED = 0.819 and TSS of RF = 0.822), these models showed excellent estimation performance based

Statistical Evaluation
To demonstrate the superiority of our proposed method, we performed Wilcoxon signed-rank and Friedman tests [58,59]. The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used to compare two related samples [58]. It can be used as an alternative to the t-test when one or more of the samples are not normally distributed. It establishes a null hypothesis to determine whether there is a significant difference between the two samples. If the p-value is less than a certain significance level, the null hypothesis is rejected, and the two samples are assumed to be significantly different. In the statistical hypothesis testing, the p-value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct. The Friedman test is a multiple comparison test that aims to identify significant differences between three or more samples [59]. It first ranks each row (block) together, and then considers the values of the ranks by column. The data are organized into a matrix with B rows (blocks) and T columns (treatments) with a single operation in each cell of the matrix. To verify the results of these tests, we used AUC, the kappa statistic, and TSS for each machine-learning method. The results of the Wilcoxon signed-rank test, for which the significance level was set to 0.05, and the Friedman test are listed in Table 8. Based on these tests, our proposed method exhibited significantly better prediction performance than the other machine-learning methods because the p-value was below the significance level in most cases.

Conclusions
In this study, we focused on a two-stage modeling scheme that can be applied to habitat suitability estimation for various species. First, we investigated and selected 11 species that are present in South Korea and regarded as targets for species conservation research. To obtain a sufficient number of observations for the target species, we extracted observational data for these species from several crowdsourced databases and added them to our database. Since spatial bias is a well-known problem in crowdsourced data, we tried to alleviate this bias by using three global datasets and one domestic dataset. In particular, the domestic dataset, Naturing database, contains data of target species observed quite evenly across South Korea. We also employed 41 environmental layers that included information on the global climate and the land cover of South Korea as input variables. To effectively estimate habitat suitability, we used a DNN model and an ensemble of habitat suitability estimation models in the first and second stages, respectively. To evaluate the effectiveness of the proposed model, we compared it with previously employed models and visualized these results using a suitability map overlay. The experimental results demonstrate that the proposed model has significant potential for use in estimating habitat suitability.
For model training and testing, we used crowdsourced datasets. This implies that there could be some bias in the observation data and the estimation results, as mentioned above. Hence, even though our model showed better performance than other models, estimation results might indicate where the observation was made, in other words, the species can be observed. To the best of our knowledge, it is an inevitable limitation of prediction models based on crowdsourced data. In future work, to ensure the reliability of our habitat suitability model, we plan to develop a method that can alleviate the potential biases of crowdsourcing datasets.