Abundance distributions for tree species in Great Britain: A two‐stage approach to modeling abundance using species distribution modeling and random forest

Abstract High‐quality abundance data are expensive and time‐consuming to collect and often highly limited in availability. Nonetheless, accurate, high‐resolution abundance distributions are essential for many ecological applications ranging from species conservation to epidemiology. Producing models that can predict abundance well, with good resolution over large areas, has therefore been an important aim in ecology, but poses considerable challenges. We present a two‐stage approach to modeling abundance, combining two established techniques. First, we produce ensemble species distribution models (SDMs) of trees in Great Britain at a fine resolution, using much more common presence–absence data and key environmental variables. We then use random forest regression to predict abundance by linking the results of the SDMs to a much smaller amount of abundance data. We show that this method performs well in predicting the abundance of 20 of 25 tested British tree species, a group that is generally considered challenging for modeling distributions due to the strong influence of human activities. Maps of predicted tree abundance for the whole of Great Britain are provided at 1 km2 resolution. Abundance maps have a far wider variety of applications than presence‐only maps, and these maps should allow improvements to aspects of woodland management and conservation including analysis of habitats and ecosystem functioning, epidemiology, and disease management, providing a useful contribution to the protection of British trees. We also provide complete R scripts to facilitate application of the approach to other scenarios.

tion in the relationships between species occurrence, and abundance has been reported, with various studies showing weak relationships (Gaston et al., 2000;Nielsen et al. 2005;Van Couwenberghe et al., 2013). Another study by Pagel et al. (2014) used a hierarchical model to predict abundances in time and space using a combination of plentiful occurrence data and restricted abundance data. Their method produced unbiased results, but with very low precision in predictions, perhaps due to inflexibility in their models or not using environmental covariates. These studies have all suggested promise for a technique combining the use of large amounts of occurrence data with small amounts of abundance data, but none have yet performed well enough to be of use for many real-world applications.
We present a two-stage modeling approach for predicting abundance, where the results of SDMs produced using the R package biomod2 (Thuiller, Georges, Engler, & Breiner, 2016) are regressed against abundance data and additional predictors using random forest regression with the R packages caret and randomForest (Breiman, 2001;Kuhn et al., 2016;Liaw & Wiener, 2002). This approach performs well in almost all cases tested here and is flexible and simple to use. We argue that poor correlations between SDM results and abundance previously reported may be partly due to the use of less powerful or inappropriate modeling techniques in other studies. SDMs are first produced using, in our case, presence-absence data to produce a map of estimated probability of occupancy for the species of interest. Separate abundance data are then used to fit a random forest regression that predicts abundance from probability of occupancy.
Additional predictors, which may be expected to influence abundance but not occupancy, can be included at this stage. We also include the SDM results of co-occurring species as covariates in the random forest regression, allowing biotic effects to be accounted for in the prediction of abundance and producing more realistic species responses.
We have used this approach to produce distribution maps showing the abundance of 20 common tree species in Great Britain. Available tree distribution data for Great Britain were surprisingly poor, presenting a knowledge gap for ecologists working on British woodlands, particularly in light of major threats such as emerging tree diseases (Boyd, Freer-Smith, Gilligan, & Godfray, 2013). Our distributions show total combined area covered by each species within each square kilometer (hectares per square kilometer) across Great Britain and are a significant improvement on previously widely available distribution data. We envisage that such distribution maps could make an important contribution in a number of fields related to British forestry, from conservation planning to epidemiology.

| METHODS
We predicted abundance of tree species using a combination of two established techniques. First, we used the R package biomod2 (Thuiller et al., 2016) to produce ensemble species distribution models (SDMs) of trees in Great Britain at 1 km 2 resolution. Then, we used random forest regression, with caret (Kuhn et al., 2016) and randomForest (Liaw & Wiener, 2002) packages in R, to link the results of these SDMs to a much smaller amount of abundance data, to predict abundance across Great Britain at the same resolution.

| Stage 1: Fitting species distribution models
From the Distribution Database of the Botanical Society of the British Isles (BSBI) (see Data Accessibility), we downloaded all records from Great Britain between 1950 and 2014 for 25 commonly found tree species: Acer campestre L., Acer platanoides L., Acer pseudoplatanus L., Alnus glutinosa (L.) Gaertner, Betula pendula Roth, Betula pubescens Ehrhart, Carpinus betulus L., Castanea sativa Miller, Corylus avellana L., Crataegus monogyna von Jacquin, Fagus sylvatica L., Fraxinus excelsior L., Populus tremula L., Prunus avium L., Prunus padus L., Pseudotsuga menziesii Franco, Quercus petraea Lieblein, Quercus robur L., Salix caprea L., Salix cinerea L., Sorbus aria Crantz, Taxus baccata L., Tilia cordata Miller, Ulmus glabra Hudson, and Ulmus procera Salisbury. We discarded records with location data less precise than tetrad level (2 × 2 km) and simplified data with more precise locations to tetrad level. We chose tetrad resolution as a suitable compromise between having a high number of records to use and a small spatial scale, as using coarse scales can be problematic when modeling species distributions (Dengler, Löbel, & Dolnik, 2009;. We then converted this presence-only data to presenceabsence. We considered tetrads for which botanical surveys had been undertaken at least twice since 1950, and where at least 50 species of plants were recorded in each survey, to be "well-surveyed" (Groom, 2013) (a map is provided in Appendix 1). Any well-surveyed tetrads that did not have records for the species of interest were reclassified as "absence" points for that species (i.e., locations where the species was likely to be either truly absent or at very low abundance and therefore playing little role in defining the dominant ecological characteristics of that tetrad). Accounting for the likelihood that common trees will have a higher detection probability than most species of plants, we kept this threshold low enough to prevent the exclusion of tetrads in species-poor areas, while being high enough to prevent the inclusion of too many poorly surveyed tetrads (Groom, 2013). This produced a total of 18,993 tetrads from across Great Britain that were considered well surveyed and subsequently used as presence or absence points.
Data manipulation was carried out using custom-written scripts in Python (Python Software Foundation: Version 3.3.2).
We downloaded data on a variety of ecological variables across Great Britain from a variety of free sources (Table 1). See Data Accessibility for details. Preprocessing of layers was carried out in ArcGIS (ESRI 2014) to ensure identical extent, cell size, and coordinate system for use in species distribution modeling. All environmental covariates were used at 1 km resolution: vector datasets were rasterized to 1 km resolution.
We then fitted species distribution models (SDMs) to these data.
For reviews of these methods, see Elith and Leathwick (2009) and Pearson and Dawson (2003). SDMs use species records and environmental variables to fit models that describe the relationship of the species' distribution to the environmental variables, which can then be used to predict the occupancy probability or related measures across a wider landscape (Elith & Leathwick, 2009;Thuiller, 2003). SDMs for all species were produced using the package biomod2 in R (R Core Team 2015; Thuiller et al., 2016).
We selected 15 environmental variables as covariates from the original set of 33. We removed one of each pair of variables with a pairwise Pearson's correlation coefficient higher than 0.7, while retaining variables that are known to be important determinants of plant growth Prentice et al., 1992). The final selection was altitude, aspect, slope, direct incoming solar radiation, mean diurnal temperature range, temperature seasonality, annual precipitation, ancient woodland locations, topsoil available water capacity, topsoil minerology, topsoil organic carbon content, topsoil texture class, soil category, National Forest Inventory (NFI) forest type, and land cover type(see Appendix 2 for pairwise Pearson's correlations between selected variables). We ran six algorithms (GLM, GAM, classification tree analysis (CTA), generalized boosting models (GBM), random forest (RF), and maximum entropy (MaxEnt)) 15 times for each species using the 15 environmental covariates, producing a total of (25 species × 6 algorithms × 15 repeats) = 2,250 models.
Each model run was carried out using a randomly chosen 70% of the presence-absence data (Heikkinen, Marmion, & Luoto, 2012;Thuiller, 2003); the remaining 30% were used for cross-validation to assess the performance of each model using two model assessment criteria; area under the receiver operator curve (ROC) and the true skill statistic (TSS; Allouche, Tsoar, & Kadmon, 2006). For each species, we selected the best-performing models (see Table 2) to build an ensemble distribution model (a mean of the raw model results, weighted by the model ROC scores), producing a single distribution map for each species that represents a robust estimate of a species' British distribution at 1 km 2 resolution (Thuiller et al., 2016). The model selection process was as follows. Firstly, we assessed ROC and TSS scores-for both metrics, a higher value indicates better model fit-and if there was a leading group of models whose ROC and TSS scores were a step higher than the remainder, this leading group was chosen. Often this leading group contained just the 15 random forest models. Otherwise, the top 20 models with the highest scores were selected. Secondly, we visually assessed the predicted responses of the species to each environmental covariate for each of these models.
Any models that contained biologically implausible responses were rejected, as were models where the responses or predicted occurrence maps disagreed greatly from the overall consensus, as these can lead to development of inappropriate ensemble models (H. Hannemann, personal communication). See the walkthrough of R code in Supporting Information for an example of how models were chosen and example response curves. After rejection of implausible models, the final number of models used to produce each ensemble ranged between 11 and 20. Ensemble models were therefore robust, biologically plausible, and had high predictive power for the majority of species (see Table 2).
Nonsignificant variables were not removed from the models because of the very large size of our datasets, and because the models were used to make predictions rather than to test hypotheses. Therefore, final models may include terms that were not important to the outcome, but this should not have had a detrimental impact on the model fit. The numbers and types of models selected for each species are displayed in Table 2.

| Stage 2: Modeling abundance using random forest regression
Abundance data for trees, in the form of hectares covered by a species per square kilometer (or percent cover), were obtained from the Countryside Survey and myForest (see Data Accessibility). The Countryside Survey is a large-scale survey in Great Britain measuring many aspects of landscapes and the countryside, including diversity and abundance of plant species. It uses a random stratified sampling procedure to capture a representative sample of all land cover types.
By contrast, myForest is a service set up to help woodland owners map and manage their forests, which currently holds data on over 45,000 ha of woodlands across Great Britain, but does not contain any records outside of woodlands. For all tree species combined, 9,800 randomly selected abundance data points from the Countryside Survey and 9,453 abundance data points from myForest were used, making an average of 770 abundance data points per species (see Appendix 5 for numbers of data points per species).
The two abundance datasets (Countryside Survey and myForest) were rescaled to express them as hectares covered per kilometer squared (percent cover), in order to make them comparable. For the myForest data, which was originally provided in the format percentage cover of each species within a woodland patch, this involved multiplying each percentage cover record by the proportion of woodland cover in the relevant kilometer square. For this, we used a shapefile downloaded from the National Forest Inventory (NFI), containing outlines of all woodlands over 0.5 ha in Great Britain. For the Countryside Survey data, which was collected using a more complex methodology (details available in Barr et al., 1993) where linear features such as hedgerows were sampled separately from the rest of the landscape, more manipulation was required. The data were weighted by the length of linear features in the kilometer squared, to account for the fact that linear features are more likely to contain trees and the lengths of them are not equal across the country. The weighting was done using (linear plot percentage cover × percent of kilometer square covered by linear features) + (nonlinear plot percentage cover × remaining area), with all required information taken from the Countryside Survey.
Tree cover data for England and Wales from Bluesky's National Canopy Map were made available by the Woodland Trust, to be T A B L E 1 Ecological variables downloaded and produced for species distribution modeling. Details of data sources can be found in Data Accessibility used as a modeling covariate. Three layers from this were used: the total tree cover, tree cover derived only from woodlands included in the NFI, and tree cover derived from trees outside woodlands. The National Canopy Map layers were used in England and Wales, while the more basic NFI layers were used in Scotland where complete tree cover data were not available. We also used the NFI dataset to calculate the proportion of each square taken up by broadleaved woodland edge, which was defined as any woodland within 50 m of nonwoodland (Aune, Gunnar, & Moen, 2005). All these layers were used as covariates in the random forest regression (below). We used R version 3.2.3 for all modeling and data processing (R Core Team 2015).
We used random forest regression to model the relationships between abundance, the probability of occupancy predicted by the SDMs, and our tree cover covariates which we expected would be important for modeling tree abundance (Breiman, 2001). A separate random forest regression was implemented for each species. The SDM outputs for all species were included as variables for each species, so that the models would also capture interactions between species (such as competition). Potentially, this could also capture variation in other variables that are not included in that species' SDM but which correlate with the distribution of other species. Models had the form: where P̂ is the predicted probability of occupancy from the relevant SDM, C A is cover from all trees, C W is cover from woodland trees only, C O is cover from trees outside woodland only, and C E is cover from woodland edge.
Models were run using the combined myForest and Countryside Survey data. We chose to use random forest regression because it is insensitive to data distribution and therefore copes well with our data which has a high percentage of zeros. It can also take a large number of potentially collinear variables, and is robust to overfitting, making it extremely useful for prediction (Prasad, Iverson, & Liaw, 2006;Segal, 2004). We used these models to predict abundance of each species across the whole of Great Britain at 1 km 2 resolution. We used rootmean-square error (RMSE) and mean absolute error (MAE), produced by k-fold cross-validation with 10-fold, to evaluate our models. These two commonly used evaluation metrics give interpretations of a model's average error when testing it against independent data, in this case, the 10% that was left out of each run (Chai & Draxler, 2014).
A schematic overview of the whole two-stage method is shown in Figure 1.

| Species distribution modeling
All selected models had useful prediction capability (AUC > 0.7) (Boyce, Vernier, Nielsen, & Schmiegelow, 2002). In general, prediction accuracy of the selected models was good and they successfully predicted a large proportion of known presence or absence points.
The selected models had ROC scores between 0.71 and 0.96 and TSS scores between 0.31 and 0.83 ( or non-native trees whose distribution is largely controlled by human planting (P. menziesii), and as a result, it is unlikely to be possible to generate high-scoring distribution models for these species . For 21 of 25 species, however, SDMs produced high-quality ensemble models.

| Abundance modeling
In general, the random forest models were very successful in predicting the abundance of tree species. Figure 2 shows the predicted against observed abundance for four representative species; graphs for all other species are included in Appendix 3. For the majority of species, the predictions of the models are similar to the observed values.
We produced root-mean-square error (RMSE) and mean absolute error (MAE) scores using 10-fold cross-validation to evaluate our models' performance. These two commonly used model evaluation metrics give interpretations of a model's average error when testing it against independent data (Chai & Draxler, 2014). Table 3 shows RMSE and MAE scores for each species; the error scores are given in the same scale as the response variable, that is, hectares covered per square kilometer (percent cover). All the models have RMSE scores under 10, and most are under 5. All MAE scores are under 5. The average prediction error for most of the models produced is therefore <5%.
For six species, Acer platanoides, Populus tremula, Prunus padus, Sorbus aria, Ulmus glabra, and Ulmus procera, there were too few nonzero abundance data points to use 10-fold cross-validation. We chose 50 positive data points as the cutoff for using 10-fold cross-validation, as this gives an average of five nonzero data points per fold. Acer platanoides had 42 positive abundance data points, so for this species, we used eightfold cross-validation to maintain an average of five nonzero data points per fold. However, for the remaining five species, we felt that there was not enough data available to produce reliable abundance models (see Table 3). These species were omitted, and maps of predicted abundance of the remaining 20 species across Great Britain were produced ( Figure 3 and downloadable from the Sylva Foundation website and Oxford University Research Archive (see Data Accessibility). Where adequate abundance data were available, however, random forest regression was able to improve the prediction of the species for which the SDMs had a poorer fit. We were able to successfully model the abundance of Prunus avium, Pseudotsuga menziesii, and Tilia cordata despite the SDM prediction accuracy for these species being poorer than the other species.
We also calculated R 2 scores for the models, and these are available in Appendix 5. However, we recommend caution when interpreting these scores, as R 2 is not the most appropriate metric to use in this situation. R 2 is affected by the extent of the dependent variable (Gelman & Hill, 2007), and as the maximum abundance varied greatly F I G U R E 1 Schematic showing the outline of the two-stage method for predicting abundance distributions. The first stage uses SDMs to produce maps of predicted probability of occupancy, while the second stage takes these maps as inputs and uses Random Forest regression to produce maps of predicted abundance. covariates, e.g., wood area between species, this confounds comparison between our models.
The high percentage of zeros in our datasets also produces difficulty in the interpretation of R 2 . For instance, for Acer campestre, over 97% of the available abundance data points were zero. The model tended to predict very slightly higher than zero for these points (generally between zero and one percent cover), resulting in a low R 2 (.523). However, the observed vs predicted graph ( Figure 2) and the low RMSE and MAE scores (Table 3) for Acer campestre show that the model generally predicts very close to the true abundance, despite scatter in the data, and this is mirrored for most other species. For applications where the difference between zero and one or two percent cover is unimportant, these models can be used directly for predicting abundance; where it is more important, the predicted against observed graph can be used to select a cutoff below which predicted abundance will be coerced to zero.

| DISCUSSION
The two-stage modeling approach produced good or excellent predictions of abundance for the majority of species across the whole of Great Britain, despite only being trained on a relatively small amount of abundance data. This is in contrast to several previous studies looking for relationships between SDM results and abundance, which have shown little or no relationship (Gaston et al., 2000;Johnson & Seip, 2008;Nielsen, Johnson, Heard, & Boyce, 2005). However, to our knowledge, no previous studies have used random forest regression to model this relationship, and doing so has a number of advantages. Most importantly perhaps is that it does not make any assumptions about the shape of the relationship. Previous studies have attempted to use the negative binomial and other theoretical distributions, but we argue that this is likely to be an oversimplification that may mask true relationships. The shape of such a relationship, which is likely to have several different drivers, may not follow a simple mathematical function, and is known to vary between species (Gaston et al., 2000;Harris, 2015;Nielsen et al. 2005). The use of random forest regression allows for such variation, making it a much more powerful technique for this application (De'ath & Fabricius, 2000;Evans & Cushman, 2009;Prasad et al., 2006).
Our two-stage modeling approach has a number of other advantages.
It can incorporate biotic effects, and include covariates that are expected to influence abundance separately from those expected to influence occupancy. It makes use of the large amount of presence or presence-absence data that are often available, rather than discarding it. It will work with any measure of abundance (number of individuals, percentage cover, biomass, etc.) and has been shown to be effective over large spatial extents. It may be a particularly powerful approach where occurrence and abundance are not influenced by exactly the same factors (see Nielsen et al., 2005). Although not tested here, this method also has the potential to be effective when used with the results of more problematic SDMs, such as those made using presence-only data, which can only predict a relative likelihood of occupancy (Araújo & Peterson, 2012).
We can also make use of the covariate allocation of random forest to gain insights into underlying ecological processes within the F I G U R E 2 Observed abundance against abundance predicted by Random Forest regression, as used to assess model performance, shown for four tree species. The line on each graph is the 1:1 line showing perfect model fit  (Breiman, 2001). This means we can see which other species' SDM results are most strongly associated with the abundance of our species of interest, allowing us to identify possible biotic interactions such as competition. This does not allow us to distinguish causal relationships because of the possibility that hidden covariates could be at play; two species' SDM results could be correlated with each other not because of a biotic interaction, but because they are both influenced by an underlying factor. However, it does provide a qualitative estimate of biotic effects that could be an interesting starting point for further study. The inclusion of biotic effects may have the additional benefit of improving model performance for predicting abundance under new conditions, such as future climate scenarios (Anderson, 2013;Araújo & Guisan, 2006;Elith & Leathwick, 2009;Harris, 2015). Species distributions and abundances are predicted to be strongly influenced in future by both climatic changes and biotic effects, and to our knowledge, this is the first technique for predicting abundance which is able to make some account of these biotic effects.
However, the approach will not be able to incorporate changes to biotic effects with novel species assemblages, or other factors such as dispersal limitation, without further modification.
Not all species were successfully modeled using this technique.
Prunus padus, Populus tremula, Sorbus aria, Ulmus glabra, and Ulmus procera were all unsuccessful, in each case because very little abundance data were available for these species in our datasets. For example, our combined abundance dataset contained only four nonzero data points for Sorbus aria, demonstrating the difficulty in acquiring abundance data even for such a well-studied system. However, various tree species which are generally considered to be difficult to model-such as Pseudotsuga menziesii, a non-native species whose distribution is still largely controlled by planting, and Tilia cordata, which is thought to be both rare and widespread in Britain due to an unusual ecological history (Pigott, 1991)were successfully modeled by random forest regression, despite showing relatively poor SDM performance. Overall, the method performed well for the majority of species and seems to be generally effective across a range of species, provided that sufficient abundance data are available.
British trees exist in highly human-modified landscapes where their distributions have without exception been altered by human land use and preferences (Hopkins & Kirby, 2007;Rackham, 2008 However, despite this, the models performed well for the majority of species. This suggests that the models may be flexible enough to work in a variety of contexts and are likely to perform even better in less human-dominated landscapes. This flexibility is one of the major advantages of using random forest regression, and we expect it to offer broad application in modeling abundance of a wide range of species (Prasad et al., 2006). The next step for evaluating the method will be to compare its performance to other published methods for predicting abundance, which could be done by evaluating the relative performance of this and other methods with a variety of published datasets.
The abundance maps that we have produced are the best quality abundance distributions currently available for these species in Great Britain; previously, the best widely available distribution maps for trees in Great Britain were presence-only maps on 10-or 2-km square scales (see Figure 4). Our maps are modeled, not directly observed, and as is the case for modeling any highly noisy system, will not accurately predict abundance in every 1 km square; however, they are expected to capture overall patterns of distribution well. As more data, particularly abundance data and better quality environmental covariates, become available, our maps can continue to be improved. Abundance maps have a far wider variety of applications than presence-only maps, and these maps will allow significant improvements to these applications. British woods face a range of threats, including invasive diseases such as ash dieback, undermanagement or overmanagement leading to poor woodland quality, pollution, and damage by deer (Rackham, 2008). These improved maps should allow better planning and management of woodlands, analysis of habitats and ecosystem functioning, and epidemiology and disease management, and will be a useful contribution to the protection of British trees.

| CONCLUSION
The two-stage method to predict abundance, using random forest regression to model the relationship between SDM outputs and abundance, is robust and easy to use producing good results for the majority