Unpacking the ‘black box’: Improving ecological interpretation of regression‐based models

Many tree species distribution models use black‐box machine learning techniques that often neglect interpretative aspects and instead focus mainly on maximizing predictive accuracy. In this study, we outline an interpretative modelling framework to gain better ecological insights while mapping abundance patterns of six North American species.


| INTRODUC TI ON
The evolutionary history of temperate North American vegetation has been shaped by various global-scale climatic and biogeographic forces, the accompanying disturbance regimes and their interactions. These forces have resulted in the current distribution of extant species we see today, and sets the stage for species potential changes under contemporary anthropogenic climate and land use changes (Schoonmaker & Foster, 1991). Over deep time, geological forces have shaped the continents and the terrain, resulting in biogeographic constraints to migration pathways and niches of various tree species (Barton et al., 2003). These geological forces created mountain ranges and other features that not only altered migration routes but also greatly influenced regional climates, including glacial events that shaped the vegetation of North America for both Gymnosperms (starting ~350 mya) and Angiosperms (starting ~120 mya). The changing landforms, especially in the last 120 Mya when Angiosperms and Gymnosperms competed amidst climatic swings, and the accompanying large-scale biotic and abiotic disturbances, have shaped the migration pathways and niches where tree species have subsequently colonized (Graham, 1999;Williams et al., 2004). The climate-terrain interactions of present-day abundance patterns, therefore, are the predominant outcome of this rich eco-evolutionary legacy across their entire range (Silander, 2001), In this study, we explore and analyse the macroscale distribution of six North American species derived from their current abundance patterns, and their interaction with climate and terrain throughout their range. Our focus is on the interpretative aspects of modern machine learning techniques rather than the maximization of predictive accuracy.

| Modelling rationale
Modern species distribution modelling began roughly in the last decades of the 20th century, with the development of reliable climatic interpolation methods (Booth, 1990(Booth, , 2016. A shift away from traditional parametric methods to non-parametric machine learning techniques (especially those based on decision trees) was also evident during this period (Iverson & Prasad, 1998;Verbyla, 1987). In addition to decision trees, semi-parametric data-driven functional models like generalized additive models (GAM) that account for nonlinear relationships between response and explanatory variables became popular (Guisan et al., 2002). The attractiveness of decision tree (DT) models stems from their superior ability to handle multivariate non-linearity, including all possible interactions, in a hierarchical tree, with the important macroscale splits (spatially 'broad') forming earlier in the tree and the more regional effects in the lower branches (Breiman et al., 1984;Iverson & Prasad, 1998). Another important aspect is the interpretability of the decision trees, which enables one to understand complex relationships intuitively. While the DT approach shines in intelligibility, interpretability and the handling of interactions, they generally do not provide robust predictions (Dwyer & Holte, 2007) because they suffer from higher tree variance even with small data perturbations, and a tendency to overfit. To improve DT predictions, ensemble methods are used -for example, bagging, Random Forests and boosted trees (Prasad, 2018;Prasad et al., 2006). But these ensemble methods are typically blackbox techniques designed to maximize prediction accuracy (Kuhn & Johnson, 2013). The explainable features of these black-box models are mainly limited to variable importance scores and approximate partial dependency plots. However, sacrificing interpretability for prediction with black-box models can be costly (Chakraborty et al., 2021;Rudin, 2019), and efforts are underway to find a good compromise between interpretability and prediction (Linardatos et al., 2020;Lucas, 2020;Ryo et al., 2021).

| Our approach
Our earlier efforts in modelling suitable habitats and colonization potentials used multi-model ensemble and hybrid techniques optimized for predictive accuracy which were mostly black-box techniques Prasad et al., 2020). In this paper, we focus on 'ecological interpretation', rather than predictive performance, as our main goal by examining more closely the variance inherent in ecological phenomena instead of trying to minimize it for achieving better prediction. By focusing on ecological rules governing continuous abundance patterns across the entire range of the species, we shed light on the non-linear interactions that are typical of vegetation-environment interactions. We construct a procedure we term 'optimized regression tree bagging for interpretation and mapping' (ORTBIM) using specialized machine learning techniques for extracting dominant patterns and rules geared towards interpretation and mapping. By formulating a procedure based on examining the variance of singular and bagged regression trees (Breiman, 1996), we boost confidence in ecological interpretation. The ORTBIM procedure consists of (a) selecting ecologically meaningful climate and terrain variables; (b) using a bootstrapping technique called recursive feature elimination to pick the most relevant variables from this set; (c) using hyper-parameter tuning to achieve optimally pruned tree; (d) using frequency distribution of species abundance to choose an abundance threshold for selecting dominant rules; (e) using bagging trees (bootstrap resampling) to quantify variances; and (f) to combine and map dominant rules (henceforth referred to as dominant bagged rules) under current conditions, and to examine trends under a future climate scenario.
Using this novel procedure, we explore patterns of tree abundance for six representative tree species in the eastern and western regions of the North American continent (north of Mexico), derived from Forest Inventory and Analysis (FIA) data for the conterminous United States and projected across the entire continental United States and Canada. The goal is to understand and explore dominant tree distribution patterns under current conditions and a future climate at varying resolutions and therefore lay the foundation for improved modelling of regional and local trends. In the process, we seek to answer the following questions: • What is the role of elevation and terrain in the distribution of the species across the east and west?
• What can we understand from interactions between elevation/ terrain and climate, especially between the eastern and western tree species?
• How sensitive are the rules to changes in scale (10 and 20 km 2 )?
• Does the type of response (relative abundance vs. absolute abundance) have a large effect on the results?
• Are there patterns in finer-scale plot-level FIA data with the explanatory variables that shed light on niche occupation patterns when contrasting eastern and western species?
• How do the trends in the mapped dominant bagged rules of current and future abundances inform improved management of tree species in the face of changing climate? By examining the above questions for the six representative species in a nuanced interpretative framework, we set the stage for building a better prediction modelling framework in North America.

| The abundance responses
We use the USDA Forest Service Forest Inventory and Analysis (FIA; https://apps.fs.usda.gov/fia/datam art/datam art.html) plot-level data to derive relative importance value (RIV, also relative abundance) and absolute importance value (AIV, also absolute abundance) for three eastern species -eastern white pine (Pinus strobus), sugar maple (Acer saccharum) and chestnut oak (Quercus montana); and three western species -Engelmann spruce (Picea engelmannii), ponderosa pine (Pinus ponderosa) and Douglas fir (Pseudotsuga menziesii). The RIV index is calculated as: where x is the species in the plot, BA is the basal area and NS is the number of stems (summed for overstorey and understorey trees). In monotypic stands, the RIV would reach a maximum of 100%.
The absolute abundance (AIV) is calculated as: where x is the species in the plot, SumDia is the sum of diameter of the species in the plot and NS is the number of stems (summed for overstorey and understorey trees). AIV is normalized by dividing it by the maximum value of AIV for the species in the plot and multiplying it by 100 to get values between 0 and 100. AIV is an absolute measure of both the abundance and dominance of the species within the plot without standardizing them to the other trees in the plot. Because the FIA plot locations are unequally sampled across the United States, we aggregated relative importance values averaged over grid cells at (a) 20 × 20 km (20 k) and (b) 10 × 10 km (10 k). We excluded plots that indicated artificial regeneration, ensuring that only natural stands were aggregated for reasons that will be explained in Section 4.
The range and spread of the six selected species capture the spatial trends of many of the other species in the conterminous United States, except those that are quite rare or widely scattered in distribution. We used the ORTBIM procedure (explained in detail below) to model the responses for RIV and AIV for these six species at the two spatial resolutions. The abundances were calculated from the accurate and consistent annual FIA data confined to the conterminous United States, but the explanatory variables spanned the continental United States and Canada.

Because our domain of inquiry included continental United
States and Canada, the explanatory variables (also referred to as 'features') were confined to climate, elevation and terrain variables ( Table 1). We did not use soil variables for two reasons: (a) the soil data quality and metrics are inconsistent at continental scales and (b) they have been shown to be generally less relevant at these scales, becoming more important only at regional and local scales (Dyer & Hutchinson, 2019;Mathys et al., 2014).
We grouped the variables into six categories according to climate and terrain characteristics to assist in summarizing the results (Table 1).

| Elevation and terrain
The elevation and terrain data ( Table 1) were derived originally from the USGS GMTED2010 DEM data at 500 m resolution for the North American continent (Global Multi-resolution Terrain Elevation Data, 2010). The terrain variables were derived from the DEM data using R (using terrain in terra package; Hijmans, 2022; R Core Team, 2022) and the topographic wetness index was derived using SAGA GIS (RSAGA package in R; Brenning et al., 2018). Because our domain included the whole of the continental United States and Canada, computational bottlenecks were avoided by choosing the relatively coarse 500 m DEM data for this macroscale analysis. The elevation and terrain variables were aggregated (using averaging) to 10-and 20-km grids.

| Climate data and the ORTBIM procedure
We obtained climate data from the AdaptWest database (AdaptWest Project, 2021;Mahony et al., 2022;Wang et al., 2016) for climate variables for the current (1991-2020) and future (2071-2100) time periods. We used data for future climate scenario (CMIP6) associated with shared socioeconomic pathways (SSPs) to examine the degree to which habitat suitability differed under these alternative pathways. SSPs are socioeconomic pathways that use quantitative and qualitative elements pertaining to society, energy, economy and environment (O'Neill et al., 2016). We chose the worst-case SSP5-8.5 (roughly corresponding to CMIP5's RCP-8.5) to highlight the extreme trends in our interpretative analysis. The SSP5-8.5 projection is based on an ensemble of 13 general circulation models -ACCESS-  Project, 2021). For further details about these 13 models, refer to https://wcrp-cmip.github.io/CMIP6_CVs/docs/CMIP6_ source_id.html. The original data, at 1 km resolution, were aggregated to 10-and 20-km grids to match the abundances at these scales.
To stabilize the variance of regression trees to map dominant rules based on the abundance distribution of tree species, we used the ORTBIM procedure (as outlined previously). The ORTBIM procedure is most suited to exploring spatial distributions of continuous responses although it can be modified to adapt to different types of responses. A complete explanation of this procedure is listed below.
1. From the 33 climatic variables (AdaptWest Project, 2021), we selected a subset of 16 variables ( Table 1) that we found to be most ecologically relevant based on our experience from previous modelling efforts (Iverson et al., 2008Iverson & Prasad, 1998). The five elevation and terrain variables, along with the 16 climatic variables (we favoured seasonal temperature/precipitation and heat-moisture ratio variables over annual ones - Table 1) were used in the first stage of the modelling procedure.
2. Recursive feature elimination (explained in the next section) was used to select the most relevant of the 21 variables (based on bagging trees), using the rfe and caret packages in R (Kuhn, 2008(Kuhn, , 2022. We did not let the variables chosen by rfe to exceed 12 to sharpen interpretation and minimize noise in the dominant rules. 3. The pruning parameter for the regression tree (rpart package in R; Therneau & Atkinson, 2022) was optimized via the caret package using bootstrapping and cross-validation and using the variables selected by rfe for that species. Variable selection via rfe and optimized pruning via rpart (using 'anova' method to split since the abundance response is continuous) together resulted in an optimally pruned regression model specific to the tree species.
4. The abundance threshold for defining dominance was based on the frequency distribution of abundance. We chose the values of abundance above the first quartile as dominant for the species in order to exclude 'noise' in the lower values of abundance (below the first quartile). This allowed the rules to vary depending on the species' abundance distribution. 5. After screening for the best algorithm to obtain rules, we decided on using a singular regression tree (see more in Section 4) chosen by the optimal pruning parameter and the abundance threshold to examine the main rules (limited to a maximum of six rules for practicality) and key ecological variables. TA B L E 1 Climate, elevation and terrain (21) variables selected as ecologically meaningful (and their six groupings) and used in the ORTBIM procedure. 6. We performed bagging with 30 trees (using the ipred package in R; Peters & Hothorn, 2021) with the optimal regression tree chosen in step 3, and the abundance threshold chosen in step 4, to (a) examine rule stability via deviances explained (which would vary based on the species and the scale of analysis) and to (b) map each of the individual bagged rules.

AHM
7. We averaged the 30 individual bagged maps that represented the dominant rules (in step 6) for current conditions and the SSP5-8.5 scenario. By combining bagged maps via averaging, we incorporate variability among the correlated bagged trees, which is essential for capturing dominant trends.
By using species-specific rules determined by quartiles of the abundance distribution, and further averaging the 30 bagged trees, we focus on the main ecological trends in the distribution while minimizing noisy signals in the other branches. To explore dominant abundance patterns, we examine the rules derived from key ecological drivers based on a singular regression tree (step 5). The mapped products, therefore, incorporate variability among correlated bagged trees, and the main rules from singular regression tree help explain the key drivers. It should be noted that the bagged rules derived from current climate were assumed to be applicable in the future timeframe (2071-2100) because it is generally agreed that there will be an adaptation lag and most species are unlikely to adapt quickly to rapid climate change (Etterson et al., 2020;Fréjaville et al., 2020;Gray & Hamann, 2013). We emphasize that the ORTBIM procedure is an interpretative modelling framework developed to gain better insights into response-covariate interactions and provides an interpretative framework to assist in building better predictive models -it is not optimized to provide the best prediction. option in rfe because we had already preselected ecologically meaningful variables and randomizing the predictors would result in decorrelated DTs, which are optimized for prediction, not interpretation.

| Analysis
Based on the ORTBIM procedure, we extracted the deviances explained for the six species at the two resolutions (20 and 10 k) for both RIV and AIV responses. We also compiled the deviances explained for the bagged statistics (minimum, maximum, median, mean and standard deviation) for the 30 trees -to compare performance ( Table 2). The dominant rules chosen by the ORTBIM procedure (singular tree, step 5) at 10 and 20 k resolutions for the six species were extracted for RIV, and the deviances explained by the explanatory variables were summed at the tree nodes and summarized ( Table 3). The combined bagged rules (step 7 in the ORTBIM procedure) were mapped for the current and future climate (SSP5-8.5) at 20 k (Figures 3-8) and 10 k ( Figures S1 and S2). Because FIAbased abundances ( Figure 2) provide the best plot-based estimate of basal area available, the ORTBIM procedure created higher fidelity rules which were applied to the whole continent.
It is noteworthy that the diversity of tree species in a single FIA plot is low for many western species, elevating the relative importance to higher values. Because of this pattern, we wanted to test whether our choice of response, relative abundance (RIV), was superior to absolute abundance (AIV) which was not affected by this artefact (since AIV does not consider other species in the plot).

| FIA plot-level analysis
We also evaluated the correspondence between our macroscale analysis and FIA plot-level responses to the explanatory variables, and tested whether plot data offered additional insights. We, therefore, calculated FIA plot-level abundances and extracted elevation and climatic features at 1 km resolution (to match the approximate extent of the fuzzed FIA plot) to understand the following responses.

| RIV versus elevation
The response of FIA plot abundances to elevation was plotted using loess smoothing for the six species ( Figure 9). We also extracted the most important explanatory variables based on the modelled results to examine if these responses were accentuated at different elevation range classes. We, therefore, split the elevation into six range classes (−400-500; 500-1000; 1000-1500; 1500-2000; 2000-2500 and 2500-5000 m) and created facet/trellis plots to examine the relationships (see Figures S3-S8). Negative elevations were included in the range classes because there are areas in the continent that are below sea level.

| Niche contour plots
To illustrate the relationship of FIA plot abundances with climatic and topographic drivers, we mapped FIA plot-based RIV against mean annual temperature (MAT) and mean total annual precipitation (MAP) using interpolation (akima package in R; Akima, 1991;Akima & Gebhardt, 2021). To highlight the role of topography in the niche space, elevation contours were draped onto the 3D plots of the six species ( Figure 10).

| RE SULTS
The deviances explained by regression trees (pruned singular and bagged) among the six species for both RIV and AIV, along with summary statistics ( Table 2), show that RIV generally outperformed AIV. This is true for all species except ponderosa pine at 20 k, and all species except Douglas fir at 10 k ( Table 2). Also, for the bagged trees, the standard deviation (SD) is almost always higher with AIV compared to RIV, making RIV less variable and hence a more stable metric for modelling the response. Based on these results, we decided to confine our further analysis to RIV, which also has the added advantage of incorporating a portion of the interspecific interaction by way of including all species in the plot. AIV, however, can be a useful measure for some species when the interspecific component is not needed, and for some species for which the deviance explained is significantly higher.
Focusing only on the RIV response, the quartile distributions (minimum, maximum and median), as well as the mean and standard deviations of the bagged trees, portray the explained deviation statistics displayed by the six species (Table 2). Because both the singular tree and bagged trees use optimally pruned regression trees derived from the ORTBIM procedure, the range (maximum minus minimum) in the explained deviance is due to bootstrapping in the bagged trees. Using singular tree with the full training data is essential to unravel the main features influencing the distribution.
The deviance explained varied among the species, with Engelmann spruce, followed by Douglas fir having the highest values ( Table 2). The lower deviance explained by eastern white pine (with only 31% even at 20 k) may reflect that it is a highly adaptable species that has naturalized in many parts of the eastern United States, and hence more problematic to model.
The deviances explained by the explanatory variables were summed along with the variable defining the topmost split in the regression tree ( Table 3). The climate/terrain grouping categories (Table 1) were used to summarize the deviances explained by these variables ( Table 3) and plot their total importance for the western and the eastern species at 20 and 10 k resolutions (Figure 1). At both 20 and 10 k, for the western species, elevation is the most dominant variable explaining the bulk of the deviance -it is noteworthy that for the eastern species, elevation does not figure at 20 k and is marginal at 10 k. For Engelmann spruce, elevation is the most important variable at both scales, explaining 44.5% and 39.1% at 20 and 10 k, respectively (Table 3), while dwarfing the other variables.
For ponderosa pine, elevation explains 9.6% of the deviance at 20 k and 6.4% at 10 k; for Douglas fir, elevation explains 12% at 20 k and was not important at 10 k. For both ponderosa pine and Douglas fir, heat-moisture variables (AHM, CMI and CMD -see Table 3) become quite important at both scales. However, at 20 k, the influence of seasonal precipitation and seasonal temperature was quite important for eastern species, followed by growing season variables. Heatmoisture indices are more important in the west and the role of TA B L E 2 The deviances explained and the bagged statistics for 20 km (20 k) and 10 km (10 k) resolutions for relative abundance (RIV) and absolute abundance (AIV seasonal temperature is greater for the eastern species compared to western species (Figure 1). Because the eastern species are at much lower elevations compared to the west, they are more responsive to terrain features (especially at 10 k) compared to the western species; especially chestnut oak -ttri (topographic roughness index) at 20 k (12.4%) and, tslope (slope angle) and ttri (12.5%) at 10 k ( Table 3).
For sugar maple, spring precipitation (PPT_sp -14.4%) and summer temperature (Tave_sm -7%) are prominent at 20 k and annual heatmoisture index (AHM -6.8%) and autumn temperature (Tave_at -5.5%) are highest at 10 k. The pattern at 10 k is similar to 20 k, but in the east, the role of seasonal precipitation decreases, and seasonal temperature increases at 10 k compared to 20 k. The current distribution maps (Figure 2) (Figures 3-8). Because 20 k performance was higher (in terms of deviance explained - Table 2), the interpretation of the macroscale patterns is more robust compared to 10 k ( Figures S1-S6). However, interesting and revealing contrasts with the 10 k distributions are discussed.

| Engelmann spruce
By comparing the current distribution of Engelmann spruce ( Figure 2a) to dominant bagged rules at 20 k (Figure 3a), we see the latter captures most of the current distribution in the conterminous United States and also some outlined by Little's range in Canada (Little, 1971). The deviances explained by the rules as discussed in the previous section and  trends that influence its current distribution.

| Ponderosa pine
The

| Douglas fir
There is good match of the dominant bagged model (Figure 5a

| Eastern white pine
Eastern white pine captures the current distribution of white pine ( Figure 2d) fairly well, however, only 31% of the deviance is explained even at 20 k ( Table 2). The dominant trends under SSP5-8.5 (Figure 6b) show that the main shift is towards the northeast although there is a substantial presence in Alberta, Canada and to the northwest into Alaska (Figure 6b). Most of the dominant suitable habitats tend to be lost in the United States, except in the northeastern states. At 10 k, which only explains 18% of the deviance F I G U R E 5 The bagged dominant rules mapped at 20 k for Douglas fir (Pseudotsuga menziesii) for current (a) and future SSP5-8.5 (b). The legend pertains to per cent relative abundance derived from the quantile distribution of the species.

F I G U R E 6
The bagged dominant rules mapped at 20 k for eastern white pine (Pinus strobus) for current (a) and future climate SSP5-8.5 (b). The legend pertains to per cent relative abundance derived from the quantile distribution of the species.
( Table 2), the trend towards the west is apparent under SSP5-8.5 ( Figure S2A.a,A.b), with lower-quality habitats. However, it should be noted that there is a substantial drop in the deviance explained for sugar maple from 20 k (41%) to 10 k (29%) ( Table 2).

| Chestnut oak
The dominant abundance pattern of chestnut oak (Figure 2f) is captured quite well (Figure 8a), although the model shows some lowerquality habitats further west even under current conditions. Under SSP5-8.5, there is a strong north-eastern extension. The suitable

F I G U R E 7
The bagged dominant rules mapped at 20 k for sugar maple (Acer saccharum) for current (a) and future climate SSP5-8.5 (b). The legend pertains to per cent relative abundance derived from the quantile distribution of the species.

F I G U R E 8
The bagged dominant rules mapped at 20 k for chestnut oak (Quercus montana) for current (a) and future climate SSP5-8.5 (b). The legend pertains to per cent relative abundance derived from the quantile distribution of the species.

F I G U R E 9
The overall response of elevation (x-axis) versus FIA plot-based relative abundance (y-axis) for the six species -(a) Engelmann spruce; (b) Ponderosa pine; (c) Douglas fir; (d) Eastern white pine; (e) Sugar maple and (f) Chestnut oak.
habitats in the west are also more prominent under SSP5-8.5 ( Figure 8b). Compared to the other eastern species (eastern white pine and sugar maple), the quality of the future habitats is not impacted much. At 10 k ( Figure S2C.a,C.b), the pattern is somewhat similar, although there is substantial thinning and degradation of habitats under SSP5-8.5. It should be noted that the drop in the deviance explained is quite steep with only 36% explained at 10 k compared to 53% at 20 k ( Table 1).

| Dominant rules from singular tree
Table S1a-f shows the dominant rules for the six species derived from the optimally pruned singular tree for 20 and 10 k resolutions (Step 5, ORTBIM procedure). While rules from singular tree exhibit variation and should not be used for prediction, they are highly valuable for interpreting when due consideration is given to the per cent deviance explained. They show which explanatory variables, along F I G U R E 1 0 The niche plots depicting interpolated relative abundance plotted against mean annual temperature (C) and total annual precipitation (mm). The contours depict elevation.
with their interactions, are influencing the dominant distribution under current and future (since the rules would not vary much due to adaptational lag) conditions. These rules should be interpreted in relation to maps of bagged dominant rules (Figures 3-8). The variations exhibited by these rules between 20 and 10 k can be scrutinized for greater insight into the nuances that drive the distributions under these two scales, especially when examining the climateterrain niche space of the species, greatly aiding interpretation, which we discuss below.

| Plot-level responses
The FIA plots can highlight finer-scale responses compared to the aggregated response models at 20 and 10 k. The plot-level relative abundance response to elevation (Figure 9) and the most important explanatory variable ( Table 3,  For the eastern species, the curves are much flatter -eastern white pine's (PPT_sm) pattern is visible under both -400-500 m and 500-1000 m classes ( Figure S6). For sugar maple (PPT_sp), the overall response is most apparent at −400-500 m and 500-1000 m classes ( Figure S7). The abundance response of chestnut oak to spring precipitation (PPT_sp) decreases gradually, reflected both at −400-500 m class as well as at 500-1000 m ( Figure S8).

| Climatic niche-elevation contour plots
The species' niche in climate space casts light on the climatic range the species occupies. To plot this, we mapped interpolated RIV (from FIA plot data) against averaged total annual precipitation, mm (MAP - x-axis) and mean annual temperature, C (MAT -y-axis) and draped elevation contours (Figure 10). These niche-elevation contour plots are a succinct way of summarizing the differences between the eastern and western tree species. For example, we can see that, in general, the range of precipitation (x-axis) is larger in the western species compared to the eastern species. The temperature range (y-axis) is somewhat higher for the eastern species, especially when compared to Engelmann spruce which prefers much cooler temperatures. Bands of higher relative abundances tend to be narrower for Engelmann spruce which prefers lower temperatures, and mostly 1200-2000 mm of rainfall. Ponderosa pine has a narrower range of precipitation where it performs well, but a much broader temperature range. The eastern species are generally climatically more tolerant (with the caveat that they occupy a smaller range of precipitation compared to the western species) and occupy broader niches compared to their western counterparts -although Douglas fir in the west fares well across its entire climatic range. When one examines the elevation contours, the contrast between the eastern and western species gets sharper. Engelmann spruce, for example, shows large abundance at higher elevations. The spacing of the contours reflects terrain heterogeneity -with Engelmann spruce preferring higher elevations over rugged terrain, whereas terrain heterogeneity becomes more important for Ponderosa pine which prefers relatively lower elevations, but dissected terrain. For Douglas fir, the pattern is more difficult to discern as it does well across its entire climatic niche. For the eastern species, chestnut oak, which mainly spans the Appalachian ranges and their foothills, the importance of terrain heterogeneity is more obvious, preferring more rugged terrain compared to eastern white pine and sugar maple. These interactions between climate and terrain are captured well in the dominant rules (Table S1a-f), as well as the optimally pruned bagged maps (Figures 3-8).

| Background
In this study, we have emphasized the exploratory and interpretive aspects of macroscale analysis by focusing on ecologically dominant patterns that are influencing the abundance distribution of six major species in the United States and Canada -and comparing important features of these distributions in the eastern and western regions.
This analysis helps lay an interpretative framework for conducting better predictive analysis by capturing the main ecological trends in species abundance distribution. Within this context, some variance in the rules can be tolerated because many ecological questions typically involve inherently variable biotic and abiotic feedbacks in multivariate space, and the target species themselves are adaptable and influence their niches (Laland et al., 2016).
Initially, when formulating the ORTBIM procedure, we explored Rulefit, Sirrus, Cubist and interpretML (Bénard et al., 2021;Chang et al., 2021;Friedman & Popescu, 2008;Kuhn & Quinlan, 2023) as alternatives to a singular decision tree -but these algorithms did not yield results that were interpretable and mappable using our spatial datasets. The singular regression tree derived from an optimal pruning parameter proved to be the best option. Singular trees can overfit and exhibit variation for prediction, but can still be valuable for interpretation, especially when combined with bagging to examine variability. In addition, the rules derived from a singular tree are useful for mapping the predictors to understand which predictors and their interactions are driving the distribution geographically (Iverson & Prasad, 1998;Prasad, 2015). It should be emphasized that, in spite of some variability in the rules due to the singular regression tree, it is the dominant pattern in the distribution of habitat suitability and not the prediction of abundance values that is the focus of this analysis. However, the ORTBIM procedure stabilizes the variability in the rules to a minimal set of ecologically relevant features, while still incorporating correlated variability via bagging. This interpretative exploration is urgently needed because the entire current species range and beyond could be the domain of disturbance under climate change, especially with the northern latitudes warming faster than the southern ones (Walsh, 2014). By specifically exploring the dominant patterns of abundance by focusing on quartiles greater than the first, and comparing these macroscale patterns with plot-level inventory data, we reveal dynamic aspects of species distribution that are often hidden by black-box predictive methods.
An assumption and potential limitation inherent in our approach is that the realized niche reflected by FIA-based abundance data is sufficient for the ORTBIM procedure. Several studies have pointed out that many tree species can colonize and survive in areas well beyond their current climatic profile (e.g. see Booth, 2017).
The assumption of using the realized niche to extract rules was a compromise that was needed in our interpretative approach for regression-based modelling of relative abundance. Gathering binary presence/absence data from other sources may have captured a somewhat broader species range to assess climatic requirements beyond the realized niche (Booth, 2016;McKenney et al., 2007;Zhang et al., 2017). But this would not have yielded the insights that only continuous abundance data using the rigorous sampling procedure of FIA can provide. Based on our experience with modelling continuous data, we are confident that our approach is robust enough to extract the principal rules from the core of the distribution, which are well represented by the realized niche. To explore this further, we created climatic envelopes of species (using climatic variables like mean annual temperature, extreme minimum temperature, annual heat moisture index, etc.) based on their realized niche. The results showed that the envelopes were far greater than the actual species distribution, lending support to our assumption that the rules derived from the realized niche will not limit the climatic potential of the species. Also, when the bagged rules are mapped under SSP5-8.5 (Figures 3-8), we see that several eastern species extend their potential range into northern Canada and Alaska, providing further confidence about the climatic plasticity inherent in the rules derived from the realized niche.
Another potential limitation in our approach is that we use only 'natural' forested plots which do not capture the slightly wider niche provided by including plantations. We did this for two reasons: (a) only commercial species have plantations -which would cause data asymmetry among species; and (b) the relative abundance of nonnatural stands would be misleadingly high due to the elimination of competition for forest management objectives -which will adversely affect model performance. However, we include all forested occurrences in FIA plots when aggregating to 20 and 10 k by round- ing RIV values of abundance greater than 0, however tiny, as 1.
A further limitation of our approach is that the current distribution is derived only from the conterminous US FIA data, which constrains the models to be conservative when mapping bagged rules across Alaska and Canada. We refrained from using Canada's photo-plot data because they stem from a different protocol and their combination with FIA data would have complicated the interpretative aspects of our analysis. However, this omission is justifiable for the species chosen because the FIA-based distribution (which is based on rigorous, systematic sampling) represents the species' core climate-terrain range sufficiently for the bagged rules to map reliably across the entire domain of inquiry for interpretative purposes.

| East versus West
Our earlier attempts to model and project tree species' suitable habitats (Iverson et al., 2008 have been confined to the eastern United States, although we have modelled a subset of the eastern species across most of their range (Prasad et al., 2020).
We have so far not attempted to model the species in the western regions of United States and Canada using our habitat suitability and colonization likelihood models -an active area of biotic and abiotic change (Stanke et al., 2021). While the six species we have studied here are a tiny sample of the many found in the east and west, they represent wide-ranging species characteristic of these two regions. Importantly, we can be confident that the interpretations using our ORTBIM procedure can extract features that are relevant to modelling multiple species across their entire range using a more predictive ensemble modelling framework. Crucially, the unexplained deviance opens avenues to probe other historical, biogeographic and disturbance factors that have influenced current distributions, and will continue to shape future responses to rapid climate change.
From our results, the predominant difference between the western and eastern species, as expected, is the role of elevation, which has a major influence on our examined species in the west at both 20 and 10 k ( Figure 1, Table 3). Correspondingly, the seasonal precipitation variables are important in the east at 20 k and less so at 10 k.
Terrain-related features appear to be more important for chestnut oak which spans the Appalachian in the east (Table 3), most likely because its abundance response varies with topographic heterogeneity (partly captured by ttri variable), and not primarily high elevation ranges more common in the west.

| Scale issues and residual deviance
One of our goals was to understand the role of scale, and the type of response on the outcome. It is clear that relative abundance, as compared to absolute abundance, produces a stronger signal even in the west (except for Ponderosa pine), in spite of lesser diversity at the plot level. Relative abundance has been the response of choice all along in the east, as presented in the Climate Change Tree Atlas (Iverson et al., 2008;Iverson & Prasad, 1998;Peters et al., 2019). The domination of plots by single species, much more predominant in the west compared to the east, still yields good interpretive models and is most likely to be true for predictive models also -boosting our confidence in this response metric as we extend our models to the west.
Ecological questions should be framed based on the scale of the analysis -a macroscale perspective allows us to understand emergent patterns across the full range of the species, and subsequently, help probe more regional and local phenomena based on this broad understanding. Within this macroscale perspective, the deviances explained in our study vary with resolution -higher at 20 k compared to 10 k for all the species (Table 2). This systematic pattern is likely because aggregation at coarser resolutions brings out the emergent properties of the response (in this case relative abundance) to match more closely with the scale of climatic and terrain variables. This relationship breaks down with finer resolutions because there are sharper variances between the abundance and climate-related patterns compared to the coarser resolutions (e.g. a 10 k aggregated pixel is more likely to have fewer FIA plots), which cannot be explained by the explanatory variables. Therefore, the trade-off for modelling at finer resolutions is to sacrifice interpretability and predictability and add significant computational costs -which is not attractive for macroscale investigation of patterns across entire ranges for multiple species.
While climate and terrain remain the main macroscale factors influencing species range (Pederson et al., 2015;Toledo et al., 2011), other ecologically relevant factors like disturbances and soils can become important at finer scales. The upper branches of the regression tree (especially the top split) are based on more regional features defining the distribution, with the lower branch splits based more on local features (Iverson & Prasad, 1998

| Management issues
As noted previously, the deviance not explained by the models also becomes a rich source of inquiry into the historical biogeography of past and current disturbances. The species' legacies and acquired evolutionary traits can give us vital information about how they are likely to respond to disturbance currently and in the future. These can be systematically tabulated for each species as modifying factors (Matthews et al., 2011), and many of the disturbance layers (land-use change, fire impacts, contemporary climate change, etc.) can be made spatially explicit to provide additional insights into the management of the species as post-model filters at specific locations. In conjunction with model interpretations and predictions, they can also provide insights into landscape features that could act as refugia that can facilitate or deter species migrations (Roberts & Hamann, 2015). For example, we know from our analysis that elevation is the most influential variable for some of the western species ( Figure 1, Table 3). We also know that elevation is interacting with many seasonal and heat-moisture variables in the west (rules in  6-8). Similarly, some western species appeared in suitable spots in the east, for example, Ponderosa pine (along central and southern Appalachians under SSP5-8.5 - Figure 4) and Douglas fir in Ontario and Maine ( Figure 5). The eastern species are influenced by a mix of factors ( Figure 1) with seasonal precipitation patterns having a larger influence compared to others. However, these are likely to interact with other features to influence abundances which are captured by mapping the dominant bagged rules (Figures 3-8). We see strong tendencies for eastern species to shift their habitats northeastward to Canada, and also to the western parts all the way to Alaska. Comparatively, there is less tendency for the western species to move eastward, the dominant pattern being north-westward into Canada. The fact that we do not currently find the species in these suitable habitats is related to climatic lags, climatic disequilibrium and the ability of the species to migrate to suitable habitats amidst other interspecific and disturbance interactions (Davis, 1989;Hoek, 2001;Svenning & Sandel, 2013). The potential of changing habitat suitability, however, has major implications for management in both the United States and Canada with increasing need for greater research and management collaboration as the climate warms faster in the northern latitudes compared to the south (Walsh, 2014). Many of the species show both habitat degradation and thinning in the southern regions under future climates; this tendency is more pronounced for eastern species (Figures 6-8). The patterns at 10 k ( Figures S1 and S2) generally mirror that of 20 k, although there are some exaggerated patterns (especially for the eastern species) that could be related to lesser deviance explained compared to 20 k.
It is noteworthy that tree species have always been opportunistic, migrating in time and space based on niche availability (with strong interactions with climate and topography) and their own acquired evolutionary traits. Therefore, the near future will be no different with niches appearing in unexpected places but with human influence acting both as a facilitator and a barrier to the occupation of these niches.

| Plot-level insights
While the bagged rules provide information on the important variables that are influencing the distribution via dominant rules, these relationships operate at coarse scales. The FIA plot-level analysis of trends, however, can provide a finer-scale perspective when overlaid on the dominant trends. Understanding the correspondence between coarser-scale trends and plot-level trends can help management because they provide important insights based on matching and contrasting patterns at these two scales. While the relative importance of elevation is reinforced for the western species (Figure 9), by examining the abundance response of other important variables with elevation classes, we can also uncover at what range(s) of elevation the overall pattern is likely to be manifesting ( Figures S3-S8). For example, for ponderosa pine, the overall pattern is most evident at elevation range of 1500-2000 m ( Figure S4), but for Douglas fir, the overall pattern is partly reflected in several classes ( Figure S5). This strong correspondence between elevation and abundance breaks down in the eastern species ( Figures S6-S8).
A succinct way of summarizing climatic and elevational differences between the eastern and western species is via the nicheelevation contour plots ( Figure 10). In addition to the elevational suitability within the climatic range, it illustrates the suitability of terrain heterogeneity -for example, the Engelmann spruce prefers higher elevations compared to the higher preference for more dissected terrain for ponderosa pine.
Both these types of plots ( Figures S3-S8 and Figure 10) reveal finer plot-level non-linear relationships that are harder to decipher from the maps (Figures 3-8) and the rules (Table S1a-f). These, together with the macroscale analysis, reveal important insights into where to plant and at what ranges the response is likely to be most pronounced -helping with assisted migration plans of selected species. This interpretative framework can also be extended to explore how variables other than climate-terrain can influence species-specific rules and distributionfor example, to investigate the role of trait-fitness relationships, etc.
However, the procedure has to be modified by carefully considering how these variables scale with respect to the response variable.
In conclusion, our ORTBIM procedure provides a robust framework to address ecologically meaningful interpretation of the dominant rules influencing the abundance patterns of tree species and provides insights into how climatic-terrain habitats can rearrange under future climates. In addition, it offers a meaningful template to examine the role of scale in interpreting ecological patterns -and differentiates the important variables influencing the distribution in the eastern and western parts of the continent. Unexplained deviance can be used as a feature to examine historical biogeographical and disturbance components of current distributions, offering guidelines on how they may change under future climates. Our procedure, which is appropriate for examining macroscale distribution of species across their entire range (together with finer-scale FIA plot-based analysis of non-linear relationships), can be applied wherever there is sufficient inventory-based abundance data for tree species. Where only presence/absence data are available, a modification of the procedure is needed to yield reliable results. We believe our procedure is a useful addition to the mostly predictive black-box modelling frameworks that are predominant today.

ACK N O WLE D G E M ENTS
We are grateful to the field crews of the Forest Inventory Analysis programme in the United States for gathering data that made this paper possible. We would like to thank Andy Gougherty, Brice Hanberry and Erik Lilleskov for their helpful reviews of the manuscript. And to the three anonymous reviewers whose comments greatly improved the manuscript. Disclaimer: The use of trade, firm or corporation names in this publication is for the information and convenience of the reader. Such use does not constitute an official endorsement or approval by the U.S. Department of Agriculture or the Forest Service of any product or service.

CO N FLI C T O F I NTE R E S T S TATE M E NT
We have no competing interests.

PEER R E V I E W
The peer review history for this article is available at https:// www.webof scien ce.com/api/gatew ay/wos/peer-revie w/10.1111/ ddi.13707.