A large database supports the use of simple models of post-fire tree mortality for thick-barked conifers, with less support for other species

Predictive models of post-fire tree and stem mortality are vital for management planning and understanding fire effects. Post-fire tree and stem mortality have been traditionally modeled as a simple empirical function of tree defenses (e.g., bark thickness) and fire injury (e.g., crown scorch). We used the Fire and Tree Mortality database (FTM)—which includes observations of tree mortality in obligate seeders and stem mortality in basal resprouting species from across the USA—to evaluate the accuracy of post-fire mortality models used in the First Order Fire Effects Model (FOFEM) software system. The basic model in FOFEM, the Ryan and Amman (R-A) model, uses bark thickness and percentage of crown volume scorched to predict post-fire mortality and can be applied to any species for which bark thickness can be calculated (184 species-level coefficients are included in the program). FOFEM (v6.7) also includes 38 species-specific tree mortality models (26 for gymnosperms, 12 for angiosperms), with unique predictors and coefficients. We assessed accuracy of the R-A model for 44 tree species and accuracy of 24 species-specific models for 13 species, using data from 93 438 tree-level observations and 351 fires that occurred from 1981 to 2016. For each model, we calculated performance statistics and provided an assessment of the representativeness of the evaluation data. We identified probability thresholds for which the model performed best, and the best thresholds with either ≥80% sensitivity or specificity. Of the 68 models evaluated, 43 had Area Under the Receiver Operating Characteristic Curve (AUC) values ≥0.80, indicating excellent performance, and 14 had AUCs <0.7, indicating poor performance. The R-A model often over-predicted mortality for angiosperms; 5 of 11 angiosperms had AUCs <0.7. For conifers, R-A over-predicted mortality for thin-barked species and for small diameter trees. 
The species-specific models had significantly higher AUCs than the R-A models for 10 of the 22 models, and five additional species-specific models had more balanced errors than R-A models, even though their AUCs were not significantly different or were significantly lower. Approximately 75% of models tested had acceptable, excellent, or outstanding predictive ability. The models that performed poorly were primarily models predicting stem mortality of angiosperms or tree mortality of thin-barked conifers. This suggests that different approaches—such as different model forms, better estimates of bark thickness, and additional predictors—may be warranted for these taxa. Future data collection and research should target the geographical and taxonomic data gaps and poorly performing models identified in this study. Our evaluation of post-fire tree mortality models is the most comprehensive effort to date and allows users to have a clear understanding of the expected accuracy in predicting tree death from fire for 44 species.



Background
Wildland fires burn millions of forested hectares annually, affecting biodiversity, carbon storage, hydrologic processes, and ecosystem services largely through post-fire tree mortality and stem mortality (e.g., top-kill) of resprouting species (Bond-Lamberty et al. 2007; Dantas et al. 2016; He et al. 2019). Mortality is one of the primary processes through which fire reorders plant communities, and post-fire mortality has the potential to dramatically reassemble plant community structure and species composition (Cocking et al. 2014). Because of the widespread importance of mortality processes, numerous models exist to predict tree mortality from fire; however, few have been evaluated for accuracy.
Post-fire tree and stem mortality has been traditionally modeled as a simple empirical function of tree defenses (bark thickness) and fire injury (crown scorch and stem char) (Ryan and Amman 1996; Woolley et al. 2012). Empirically derived statistical models predicting mortality (e.g., regression equations; hereafter, empirical models) are commonly used in fire management decision support software to predict fire effects (Reinhardt et al. 1997), inform post-fire silvicultural treatments, identify and project changes in wildlife habitat quality and availability, project future vegetation composition and structure changes, estimate carbon fluxes, and model future impacts of climate change (Hood et al. 2018).
At the individual-tree scale, models predicting mortality are used to understand ecological relationships. For example, logistic regression analyses have demonstrated that interactions between abiotic drought stress and fire injury can elevate mortality (van Mantgem et al. 2013). Likewise, structural equation modeling has been used to elucidate complex interactions between bark beetles, fire injury, season of burn, and stand structure in post-fire mortality (Menges and Deyrup 2001). Differences in tree survival after fire can also help elucidate trade-offs between different plant traits and associated evolutionary strategies in relation to vegetation recovery after fire (Catry et al. 2013). Tree-scale models are also used for applied decision-making, such as developing salvage logging silvicultural prescriptions and hazard tree guidelines (Hood et al. 2010; Hood and Lutes 2017).
At the stand scale, empirical models can be used to predict population-level survival and mortality, and are used to inform simulations of subsequent vegetation, structural development, and carbon estimates. Computer simulation programs that model vegetation change at a variety of scales include fine-scale software tools for fire management planning that often predict percentage mortality of a class (Rebain 2010; Hood and Lutes 2017). For example, a 70% predicted probability of mortality can be interpreted to mean that a fire of a given intensity would likely kill 70% of the trees in the modeled class.
Process-based succession models (Keane et al. 2011) and global models of the terrestrial carbon cycle (Hantson et al. 2016) may scale up or coarsen these projections for integration into other simulation modeling steps. Fire behavior and effects software packages, such as the First Order Fire Effects Model (FOFEM; Reinhardt et al. 1997; Lutes et al. 2012), BehavePlus (Andrews 2014), and the Fire and Fuels Extension to the Forest Vegetation Simulator (FFE-FVS; Rebain 2010), model post-fire tree and stand-level mortality.
It is possible for logistic regression models to fit the data accurately but make poor predictions (Hosmer and Lemeshow 2000; Woolley et al. 2012; Ganio et al. 2015; Shearman et al. 2019), due to overfitting of the model, high dispersion of data, or relationships with variables not included in the model. If the goal of the model is not just to identify relevant explanatory variables, but to make accurate predictions of individual- or stand-level mortality (such as for the decision-support applications provided by FOFEM, BehavePlus, and FFE-FVS), then assessing the predictive accuracy of the model with independent data is needed. Until recently, most logistic regression models of post-fire tree mortality have undergone little evaluation. A model evaluation by Kane et al. (2017), using data from across the western US, found that existing models have high sensitivity (they accurately detect trees that are going to die) at the expense of specificity, meaning that they inaccurately predict death of trees that actually live. Also, model evaluations have found better performance by models that include damage to both tree stem and crown, as well as attacks from bark beetles, as predictors (Sieg et al. 2006; Hood and Bentz 2007; Hood et al. 2010; Thies and Westlind 2012; Grayson et al. 2017). Model accuracy can vary across the size classes of trees (Thies and Westlind 2012), as well as across the ranges of fire injury variables and different spatial scales of application (Furniss et al. 2019).
Land managers who use stand- and tree-scale mortality models to support decision-making may be interested in optimizing accuracy in predictions of either mortality or survival, or in minimizing misclassification of either class (Ganio and Progar 2017; Grayson et al. 2017). For example, a fire that burned through a recreation site may require high accuracy in predicting burned trees that are likely to die in the near future so that these trees can be identified and removed for public safety. Conversely, managers planning a prescribed fire in a long-unburned, old-growth stand may want higher certainty that large legacy trees will survive fire. Model evaluations can help support these aims by assessing different types of classification error (Table 1). These model evaluations can be used to help managers better understand the uncertainty in model predictions, determine appropriate uses for specific models, and support the development of new predictive models for species for which the existing models perform poorly.
Empirical models are inherently limited to the underlying data distributions, creating uncertainty in accuracy when extrapolating beyond initial data ranges and for novel conditions (Hood et al. 2018).
The main tree mortality model in fire behavior and effects software packages was developed by Ryan and Reinhardt (1988) and amended by Ryan and Amman (1994) (hereafter, the R-A model). The R-A model was developed from data on mid-sized conifers in western North America, and is therefore limited in terms of species, tree sizes, and life history strategies (Hood et al. 2018). The R-A model uses the inputs of flame length or scorch height and stem diameter to predict post-fire tree and stem mortality for any species for which bark thickness can be calculated. A sub-model calculates the percentage of crown volume scorched from tree height and crown ratio inputs. A second sub-model calculates bark thickness as a function of the stem diameter, with 184 species-level bark thickness coefficients included in the program and additional genus-level coefficients. FOFEM v6.7 also includes 38 species-specific tree mortality models (26 for gymnosperms, 12 for angiosperms), with unique predictors and coefficients. All models in FOFEM were developed from logistic regression and output values between 0 and 1, with 1 predicting a 100% likelihood of tree death within three years post-fire.
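To make the structure of the R-A model concrete, the following is a minimal Python sketch of a logistic function that maps bark thickness and percentage crown volume scorched to a probability of mortality. The function name and the coefficient values are assumptions: the coefficients are the values commonly quoted for the Ryan and Amman (1994) form and are not taken from this text, so they should be checked against Table 5 before any use.

```python
import math

# ASSUMED coefficients for the R-A logistic form; verify against Table 5.
INTERCEPT = -1.941
BARK_COEF = 6.316
CVS_COEF = -0.000535
CM_TO_IN = 0.3937  # assumes the bark term was originally fitted in inches


def ra_mortality_probability(bark_thickness_cm: float, cvs_percent: float) -> float:
    """Return the probability of mortality (0 to 1).

    bark_thickness_cm: single bark thickness in cm.
    cvs_percent: percentage of crown volume scorched (0 to 100).
    """
    if bark_thickness_cm < 0:
        raise ValueError("bark thickness must be non-negative")
    if not 0 <= cvs_percent <= 100:
        raise ValueError("CVS must be between 0 and 100")
    # Bark term saturates toward 1 as bark gets thicker.
    bark_term = 1.0 - math.exp(-CM_TO_IN * bark_thickness_cm)
    exponent = INTERCEPT + BARK_COEF * bark_term + CVS_COEF * cvs_percent ** 2
    return 1.0 / (1.0 + math.exp(exponent))


if __name__ == "__main__":
    for bt, cvs in [(0.5, 20), (0.5, 80), (3.0, 20), (3.0, 80)]:
        p = ra_mortality_probability(bt, cvs)
        print(f"BT={bt} cm, CVS={cvs}% -> P(mortality)={p:.3f}")
```

Under these assumed coefficients, predicted mortality rises with crown scorch and falls with bark thickness, which is the qualitative behavior the text describes.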
The mortality models packaged into FOFEM were incorporated over time as new research became available on the likelihood of tree death from fire. The various models were developed from disparate datasets. While some of the logistic regression models in FOFEM have been tested for prediction accuracy using external data (Ganio and Progar 2017; Grayson et al. 2017; Hood and Lutes 2017; Kane et al. 2017), there has not been a systematic effort to independently evaluate models across a range of tree taxa in the USA, largely due to lack of existing evaluation data. To address the lack of available testing data, we developed the largest and most comprehensive collection of observations of fire-caused tree mortality in the continental US, the Fire and Tree Mortality (FTM) database, which is described in detail in Cansler et al. (2020a) and available in an open-access online archive (Cansler et al. 2020b). We used these data to conduct the largest evaluation to date of the post-fire tree mortality models included in FOFEM.
Our primary research objective was to assess the prediction accuracy of the post-fire tree mortality models in FOFEM. The FTM database allowed us to assess accuracy of 68 models. We evaluated model performance in several ways, including both quantitative and qualitative accuracy assessments on individual models, and the directions of model error in relation to predictor variables and geography. We determined the best probability thresholds to use to assign live or dead status for each model, and assessed whether other potential sources of error influence model performance for field-measured versus derived crown injury, initial fires versus second fires, and geographic variation. We also assessed trends in model error across predictor variables by taxa with associated species traits to support targeted development of new models in the future. Lastly, we identified data gaps in the FTM database that can be targeted in future research.
Table 1 Classification table of model predictions and model performance statistics calculated based on predicted and true conditions, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. Managers may wish to use models or classification thresholds that perform optimally for different scenarios
Specificity: live trees that were predicted to be live divided by total live trees. Example use: post-fire salvage where there is a need to avoid harvesting large trees that may survive (e.g., potential seed trees or large wildlife trees).
Accuracy: correctly classified live and dead trees divided by total trees. Example use: need to optimize multiple objectives.

Database
The FTM database contains over 170 000 tree-level standardized field observations of fire injury and tree mortality for obligate seeders and stem mortality for resprouting species from various years after fire, up to 10 years (Cansler et al. 2020a, b). Some trees were tracked through multiple fires, so the total number of individual trees is less than the number of observations. The database includes trees burned in wildfires and prescribed fires (i.e., fires that were intentionally ignited for resource benefit). Measurements of fire-caused injuries include percentage crown volume scorched (CVS), percentage crown length scorched (CLS), percentage crown volume killed (CVK), bark char height (BCH), and cambium kill rating (CKR) (Hood and Bentz 2007; Hood and Lutes 2017; Table 2). The FTM also contains presence and absence data for bark beetles, which act as agents of mortality, and the data needed to calculate bark thickness of many tree species.
Table 2 Descriptions of defense, injury, and biotic stress variables used in logistic regression models predicting tree mortality in the First Order Fire Effects Model (FOFEM) decision support software system, in this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. These abbreviations are used throughout the paper to refer to these variables
Many models use bark thickness as the primary tree defense variable. We followed methods in FOFEM to calculate bark thickness (BT) for all trees with DBH (diameter at breast height) measurements (Lutes et al. 2012). We calculated BT for all species except Pinus palustris Mill. using the following equation:
$$BT = BT_{coef} \times DBH$$
where BT is single bark thickness (i.e., thickness of bark on one side of the tree; cm), $BT_{coef}$ is a species-specific bark thickness coefficient (Table 2), and DBH is the diameter (cm) 1.37 m above ground. Thus, by definition, we excluded trees that were shorter than 1.37 m from the bark thickness calculations and from any model evaluations that used bark thickness as a predictor. For Pinus palustris, BT was calculated using an equation implemented in version 5.2 of FOFEM, which follows Wang et al. (2007) and expresses single bark thickness (BT; cm) as a function of stem diameter at breast height (DBH; cm).
We used the $BT_{coef}$ values in the FTM database to calculate BT. FOFEM provides a $BT_{coef}$ for 192 tree species. If a species is absent, users can substitute a similar species for modeling, or use one of the 24 $BT_{coef}$ values that are provided at the genus level. Thus, for species lacking a species-level $BT_{coef}$ in FOFEM, the FTM database provides a $BT_{coef}$ from a similar species in the same genus, if a reasonable substitute is available. For this model evaluation analysis, the $BT_{coef}$ was substituted from other species in three cases, and from genera in four cases (see Table 3 footnotes).
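The bark thickness calculation and the species-to-genus fallback described above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the coefficient values in the two lookup tables are placeholders rather than FOFEM values, and the linear form assumes bark thickness is the coefficient multiplied by DBH.

```python
# PLACEHOLDER coefficients for illustration only; not FOFEM values.
SPECIES_BT_COEF = {"Pseudotsuga menziesii": 0.06}
GENUS_BT_COEF = {"Pinus": 0.03}


def lookup_bt_coef(species: str) -> float:
    """Return a species-level coefficient, falling back to the genus level."""
    if species in SPECIES_BT_COEF:
        return SPECIES_BT_COEF[species]
    genus = species.split()[0]
    if genus in GENUS_BT_COEF:
        return GENUS_BT_COEF[genus]
    raise KeyError(f"no bark thickness coefficient for {species}")


def bark_thickness_cm(dbh_cm: float, bt_coef: float) -> float:
    """Single bark thickness (cm), assumed linear in DBH (cm)."""
    if dbh_cm <= 0:
        # Trees shorter than 1.37 m have no DBH and are excluded upstream.
        raise ValueError("DBH must be positive")
    return bt_coef * dbh_cm


if __name__ == "__main__":
    coef = lookup_bt_coef("Pinus edulis")  # resolved at the genus level
    print(bark_thickness_cm(40.0, coef))
```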

Model evaluations for individual species
We assessed accuracy of all models included in FOFEM, at the scale of individual species, for which there were at least 50 observations with measurements of the variables used in the model, and at least 10 live and 10 dead trees in the FTM database. We assessed accuracy of the R-A model for 44 tree species and assessed 24 species-specific models for 13 species, using a subset of data from the FTM database: 93 438 trees, 94 568 tree-level observations (1.1% of trees had records from a second fire), 351 fires, and 35 datasets (Fig. 1). We evaluated the accuracy of species-level model predictions, at the scale of individual trees, and examined group errors in relation to main defense and injury predictor variables.
For this model evaluation analysis, any second observation (e.g., a tree that was burned in a second known fire) was treated as an independent record. This mirrors how managers would use FOFEM for a second-entry fire (but see "Exploring potential sources of error" section below). Second observations were present in 32 of the models evaluated and, when present, made up on average 5% of the sample (range, when present, 0.07% to 37.7%). We excluded FTM data from M. Battaglia, S. Hood, and V. McDaniel that were used to create species-specific models (Battaglia et al. 2009; Hood and Lutes 2017; Keyser et al. 2018) from evaluation of those models, as data used to create models cannot be used for external validation of those models. All scientific nomenclature is listed in Table 3 and follows the PLANTS Database (USDA NRCS 2019).
We assessed the R-A model for 44 species (Table 4). The R-A logistic regression model equation is listed in Table 5. FOFEM also includes 38 species-specific tree mortality models, with unique predictors and coefficients. We assessed 24 species-specific models: 15 models intended to be applied before the fire for prescribed fire planning purposes (hereafter, pre-fire models; Table 5; Lutes et al. 2012) and 9 models intended for use after the fire occurs to inform post-fire management (hereafter, post-fire models; Table 6). Many of the post-fire models include predictor variables that can only be measured after fire, such as CKR or presence of bark beetles. Note that the pre-fire models predict post-fire mortality but are meant to be used before the fire occurs, or when other factors such as basal injury or bark beetle attack are unknown.
Many post-fire species-specific models include presence or absence of any beetle that acts as a primary agent of mortality on that tree species as a predictor. Primary beetles include Dendroctonus ponderosae (Hopkins; mountain pine beetle) on Pinus lambertiana, D. ponderosae or Ips spp. (De Geer; engraver beetles) on Pinus ponderosa, and D. pseudotsugae (Hopkins; Douglas-fir beetle) on Pseudotsuga menziesii. Dendroctonus valens (LeConte; red turpentine beetle), while not considered a primary bark beetle, can indicate tree stress and is included in some species-specific models. The model for Abies concolor included ambrosia beetles (subfamilies Scolytinae and Platypodinae) as a predictor. Beetle presence was used as a binary variable, coded two different ways (Table 2) for use in different models.
We evaluated all models separately for each species; hereafter, we refer to each species-model combination as a "model". For each model, we created a one-page summary (hereafter, model evaluation figure) that displays information on the quality of the data used to evaluate model performance, the performance statistics of the model, and model errors in relation to main injury and defense variables, and that provides a simple qualitative summary of data quality and model performance (Additional file 1).
To assess the data quality used in each model evaluation, we summarized the number of tree observations and mapped the number and locations of fires sampled. We displayed these locations over maps of the species range using the Atlas of United States Trees, which shows species' ranges within the North American continent (Little 1971). We created a bi-plot for each model, which shows where the observations used to evaluate models fall within the species' bioclimatic niche space in terms of temperature and precipitation. We produced these plots by sampling a 30-arc-second (~1 km) digital elevation model (United States Department of the Interior US Geological Survey 2007) at fire locations where the species was present, and at 10 000 randomly chosen points within each species' range (Little 1971). Annual climate data were sampled at fire locations and associated mean elevations using the ClimateNA v5.10 software package (available at http://tinyurl.com/ClimateNA), based on methodology described in Wang et al. (2016). We calculated 30-year normals for 1981 to 2010 using the annual climate data, and used those normals for plotting bioclimatic niche space. We used bi-plots to show the primary defense variable (i.e., DBH, as an interpretable representation of bark thickness) and the injury variables (i.e., CVS, CLS, CVK, and BCH) used in each model, which together show the combined predictor space that is represented in the dataset, with boxplots of the two predictor variables in the margins. We produced plots using DBH instead of bark thickness because they are linearly related and DBH was the actual variable measured.
We calculated model performance statistics for each model using receiver operating characteristic (ROC) curves, which evaluate sensitivity and specificity (see definitions in Table 1) over a range of probability thresholds at which a tree or stem is classified as dead or alive. The area under the ROC curve (AUC) for each model was calculated using the package pROC (Robin et al. 2011) in the statistical program R (R Development Core Team 2017). Confidence intervals around the AUC were produced using 10 000 bootstraps of our sample using the pROC package (Robin et al. 2011). AUC values ≤0.5 suggest that the model does not perform better than random chance, values between 0.7 and 0.8 are acceptable, values between 0.8 and 0.9 are excellent, and values >0.9 are outstanding (Hosmer and Lemeshow 2000).
Table 3 footnotes:
For Pinus edulis and Pinus strobiformis, we used the bark thickness coefficient for the genus Pinus, because no species-specific coefficients were available
f We calculated Pinus palustris bark thickness following the equation in Wang et al. (2007). For results tables and figures in which species are ordered by bark thickness coefficient, we used 0.049, which was the coefficient implemented in a previous version of FOFEM (Wang et al. 2007)
g Populus deltoides ssp. wislizeni is a subspecies of Populus deltoides that occurs in the Rio Grande watershed. This subspecies was formerly considered a subspecies of Populus fremontii, and available maps covering this subspecies' range still include it with Populus fremontii. Genetic studies show that it has a genetic admixture from both species (Cushman et al. 2014). Therefore, we mapped the range and climatic niche space of Populus deltoides ssp. wislizeni using merged range maps for both Populus deltoides and Populus fremontii
h For Populus deltoides ssp. wislizeni, we used the bark thickness coefficient for Populus deltoides
i For Quercus gambelii, we used the bark thickness coefficient for the genus Quercus, because no species-specific coefficient was available
We provided a table of model performance statistics over a range of probability thresholds to aid in the selection of probability thresholds for a given purpose (Table 1). We calculated the specificity, sensitivity, positive predictive value, negative predictive value, and overall accuracy (see definitions in Table 1) for nine thresholds, at 0.1 increments of probability of mortality from 0.1 to 0.9. Typically, a threshold of 0.5 is used (i.e., trees that have a ≥50% probability of mortality are classified as dead). Additionally, we used the pROC package to identify probability thresholds for which the model performed best (optimizing both specificity and sensitivity), and the best thresholds with either ≥80% sensitivity or ≥80% specificity.
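The threshold statistics described above can be illustrated with a short Python sketch. It is not the pROC implementation used in the study: the function names are hypothetical, the "best" threshold is chosen here by maximizing the sum of sensitivity and specificity (an assumption about what optimizing both means), and the search is restricted to the nine tabulated thresholds.

```python
def classification_stats(p_dead, died, threshold):
    """Confusion-matrix statistics, treating 'dead' as the positive class."""
    tp = fp = tn = fn = 0
    for p, d in zip(p_dead, died):
        predicted_dead = p >= threshold
        if predicted_dead and d:
            tp += 1
        elif predicted_dead:
            fp += 1
        elif d:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # dead trees predicted dead
        "specificity": ratio(tn, tn + fp),  # live trees predicted live
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


def threshold_table(p_dead, died):
    """Statistics at thresholds 0.1, 0.2, ..., 0.9."""
    thresholds = [round(0.1 * k, 1) for k in range(1, 10)]
    return {t: classification_stats(p_dead, died, t) for t in thresholds}


def best_threshold(p_dead, died, min_sensitivity=0.0, min_specificity=0.0):
    """Threshold maximizing sensitivity + specificity, subject to optional floors."""
    table = threshold_table(p_dead, died)
    candidates = [
        t for t, s in table.items()
        if s["sensitivity"] >= min_sensitivity and s["specificity"] >= min_specificity
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda t: table[t]["sensitivity"] + table[t]["specificity"])


if __name__ == "__main__":
    probs = [0.05, 0.2, 0.35, 0.6, 0.8, 0.95]
    status = [0, 0, 1, 0, 1, 1]  # 1 = died
    print(classification_stats(probs, status, 0.5))
    print(best_threshold(probs, status, min_sensitivity=0.8))
```

A manager prioritizing hazard tree removal might pass `min_sensitivity=0.8`, whereas one protecting legacy trees might pass `min_specificity=0.8`.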
Fig. 1 Map of fire locations for all data used in this study to evaluate post-fire tree mortality models. Tree injury and post-fire mortality data used are from the USA, for fires occurring from 1981 to 2016 (dots). Dot color represents the number of trees sampled in a fire event. Data are from the Fire and Tree Mortality (FTM) database (Cansler et al. 2020a, b). Orange shading shows large fires that have occurred in the USA from 1984 to 2017 (data accessed from https://www.mtbs.gov)
Table 4 Sample sizes and distributional statistics for assessment of model accuracy in this study of post-fire tree mortality models. Data are from the USA, from fires occurring from 1981 to 2016. R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire; Post-fire models predict mortality after fire based on predictors available after fire. Dead and live tree status is for three years post fire, unless otherwise noted. Damage variables are defined in Table 2 and model formulae are presented in Tables 5 and 6
We assessed species-level error, grouping data for each species in relation to the primary crown injury variable (i.e., CVS, CLS, CVK, and BCH) used in each model, and in relation to the primary defense variable (DBH). For each model, we graphically compared the predicted probability of mortality ($P_m$) and the observed proportion of trees or stems killed within binned observations of the primary injury and defense variables. The predicted number of dead trees or stems was determined by assigning live or dead status based on a 0.5 threshold. For CVS, CLS, and CVK, we tabulated proportional mortality using 10% bins, with additional bins for 0% and 100% injury (e.g., 0, ≥1 and <10, ≥10 and <20, etc.). For BCH, we used 2 m bins, with an additional bin for BCH = 0. For DBH, we used 10 cm DBH bins from 0 to 150 cm, and then 50 cm DBH bins >150 cm. We calculated the species-level error rate for each model as:
$$\text{Error} = \frac{N_{model} - N_{obs}}{N_{bin}}$$
where $N_{model}$ is the number of predicted deaths based on a 0.5 threshold, $N_{obs}$ is the number of observed deaths, and $N_{bin}$ is the number of total observations in each injury variable bin.
For each model, we provided ratings of the quality of the data used to evaluate the model, model performance, and the direction of error in model predictions. The logical decision framework used to determine the qualitative ratings is provided in Table 7. The data quality assessment was meant to help both managers and researchers determine if more data would allow for a better assessment of the model. We based our data quality ratings of poor, fair, acceptable, excellent, or outstanding in part on the total number of trees, the number of live and dead trees, and the number of fires sampled. If the data quality was poor, we did not assess the model. Fair data quality indicated a small sample size, while acceptable data quality indicated relatively larger sample sizes (Table 7). For these rankings, subsequent model evaluations with more data would be beneficial in determining the model's true accuracy. For the data to be ranked excellent or outstanding, the model evaluation had to use observations across the full range of the crown injury variable. To be ranked outstanding, the trees sampled had to cover much of the species DBH range (see Table 3 for large-tree sizes), and the sites sampled had to provide reasonable coverage of the species' temperature-precipitation bioclimatic niche. We focused our reporting of individual model results on models that had acceptable, excellent, or outstanding data quality.
We provided a separate ranking of the performance of the model. We based our model performance standards on the AUC, as well as on the positive predictive value (PPV) and negative predictive value (NPV) (Table 1). Model performance ratings of excellent or outstanding indicated that the model should be used without reservation. Model performance of acceptable meant that the model should be used with caution, or for specific applications that align with the circumstances under which the model produces accurate predictions. Model performance of poor indicated that the model is unreliable and new modeling approaches were needed. Finally, we described how often the model over-predicted or under-predicted mortality by assessing the species-level error rate (Table 7).
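The binned species-level error rate can be sketched as follows for a crown injury variable measured in percent. The function names are hypothetical, and the error is assumed to be the number of predicted deaths minus the number of observed deaths, divided by the bin size, so that positive values indicate over-prediction. Values between 0 and 1% are placed in the lowest non-zero bin, which is also an assumption.

```python
from collections import defaultdict


def crown_injury_bin(percent: float) -> str:
    """Label 10% bins, with separate bins for exactly 0% and 100% injury."""
    if percent == 0:
        return "0"
    if percent == 100:
        return "100"
    lower = max(1, int(percent // 10) * 10)
    upper = 10 if lower < 10 else lower + 10
    return f"{lower}-<{upper}"


def binned_error_rate(injury_percent, p_dead, died, threshold=0.5):
    """Return {bin label: (N_model - N_obs) / N_bin}."""
    counts = defaultdict(lambda: [0, 0, 0])  # N_model, N_obs, N_bin
    for injury, p, d in zip(injury_percent, p_dead, died):
        tally = counts[crown_injury_bin(injury)]
        tally[0] += int(p >= threshold)  # predicted dead at the threshold
        tally[1] += int(bool(d))         # observed dead
        tally[2] += 1
    return {
        label: (n_model - n_obs) / n_bin
        for label, (n_model, n_obs, n_bin) in counts.items()
    }


if __name__ == "__main__":
    cvs = [0, 0, 5, 95, 95, 100]
    probs = [0.1, 0.6, 0.4, 0.7, 0.8, 0.9]
    status = [0, 0, 1, 1, 0, 1]  # 1 = died
    for label, err in binned_error_rate(cvs, probs, status).items():
        print(label, err)
```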

Exploring potential sources of error
In post-hoc analyses, we explored three potential sources of error in our model evaluation data: (1) first versus second fires; (2) field versus calculated versions of CVS; and (3) unidentified spatial variables. We included trees that had burned a second time as independent observations in the model evaluations. This reflects the use of the current models for second-entry (or third, etc.) burns. For models in which there were ≥50 trees in both the live and dead classes, in both the first-fire and second-fire groups, we statistically compared model performance between first fires and second fires. We calculated AUCs for both groups, and tested for statistical differences in AUCs using the method of DeLong et al. (1988) as modified for the pROC package to test unpaired ROC curves (Robin et al. 2011). DeLong et al. (1988) developed a method to compare ROC curves by using the theory developed for generalized U-statistics, and Robin et al. (2011) implemented a bootstrapping method in the statistical programming language R.
Table 5 Models predicting post-fire tree mortality designed for use before the fire occurs, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. The Ryan and Amman (R-A) model can be applied to any species for which bark thickness can be estimated. The other models are species-specific; hereafter, "pre-fire models." These models are used in the First Order Fire Effects Modeling system (FOFEM), which predicts post-fire mortality for species in the USA. We evaluated dead and live tree status at three years post fire, unless otherwise noted. Model sources: R-A model (Ryan and Reinhardt 1988; Ryan and Amman 1994); Abies lasiocarpa (Hood and Lutes 2017); Larix occidentalis (Hood and Lutes 2017); Picea engelmannii (Hood and Lutes 2017); Pinus albicaulis (Hood and Lutes 2017); Pinus contorta (Hood and Lutes 2017); Pinus lambertiana (Hood and Lutes 2017); Pinus palustris (Wang et al. 2007)
Table 6 Species-specific models predicting post-fire tree mortality designed for use after the fire occurs (hereafter, "post-fire models"). Model sources: Larix occidentalis (Hood and Lutes 2017); Pinus contorta (Hood and Lutes 2017)
The FTM database includes CVS values taken directly from field observations, and CVS calculated from other measurements. These other measurements include CLS, which may have been measured in the field or calculated from other measured variables such as tree height, canopy base height, or crown ratio (Cansler et al. 2020a). The equation provided in the FOFEM Help manual (Lutes et al. 2012), a rearranged version of the equation derived by Peterson and Ryan (1986), was used to calculate CVS from CLS in those cases. We expected that field measurements would be more accurate representations of CVS than derived values, but it is unclear from previous research whether these differences are large enough to affect model performance. Therefore, we used the same sample size requirements and statistical methods as described above for comparing first and second fires to compare model performance using field-based measurements of CVS and calculated CVS.
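The conversion from crown length scorched to crown volume scorched can be illustrated with a short sketch. It assumes the commonly cited geometric form attributed to Peterson and Ryan (1986), CVS = 100 × CLS × (2 × CL − CLS) / CL², where CL is pre-fire crown length. The exact rearrangement in the FOFEM Help manual is not reproduced here and should be checked before use. The function and argument names are hypothetical.

```python
def cvs_from_cls(crown_length_scorched: float, crown_length: float) -> float:
    """Estimate percentage crown volume scorched (CVS) from crown length scorched (CLS).

    Assumed form: CVS = 100 * CLS * (2 * CL - CLS) / CL**2, with CLS and CL
    (pre-fire crown length) in the same length units.
    """
    if crown_length <= 0:
        raise ValueError("crown_length must be positive")
    # Clamp CLS to the physically meaningful range [0, CL].
    cls = min(max(crown_length_scorched, 0.0), crown_length)
    return 100.0 * cls * (2.0 * crown_length - cls) / crown_length ** 2


def crown_length_from_heights(tree_height: float, canopy_base_height: float) -> float:
    """Crown length as tree height minus canopy base height (illustrative helper)."""
    return tree_height - canopy_base_height
```

Under this assumed form, scorching the lower half of the crown length corresponds to 75% of crown volume, which is why CLS and CVS are not interchangeable.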
We also explored geographic variation in model performance at the fire scale. This could be caused by observer bias between fires, geographic variation in species' ability to withstand and recover from fire injury, unique burning characteristics for individual fire events (e.g., low soil moistures causing high soil heating and associated root damage; Shearman et al. 2019), and additional geographically and temporally associated stressors such as drought (van Mantgem et al. 2013). Teasing apart these different causes of error is beyond the scope of this paper, but we did qualitatively assess if model error varied regionally, so that model users can consider whether models tend to over-predict or under-predict mortality at their location. For models with data from ≥10 fires that had at least 10 tree observations within them, we visually assessed if there were regional groupings of fire-scale error using mapped differences between the fire-scale mean predicted levels of mortality and fire-scale percentages of observed mortality. We summarized qualitative observations in tabular form.
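The fire-scale error described above can be sketched as the difference between mean predicted mortality and observed mortality within each fire. This is a minimal illustration, not the authors' code: the record layout and function names are hypothetical, and the minimum counts follow the ≥10 fires and ≥10 trees rule stated in the text.

```python
from collections import defaultdict


def fire_scale_errors(records, min_trees=10):
    """Return {fire_id: mean predicted % mortality - observed % mortality}.

    records: iterable of (fire_id, predicted_probability, died) tuples,
    one per tree. Fires with fewer than min_trees observations are skipped.
    """
    by_fire = defaultdict(list)
    for fire_id, prob, died in records:
        by_fire[fire_id].append((prob, bool(died)))

    errors = {}
    for fire_id, rows in by_fire.items():
        if len(rows) < min_trees:
            continue
        mean_predicted = 100.0 * sum(p for p, _ in rows) / len(rows)
        observed = 100.0 * sum(1 for _, d in rows if d) / len(rows)
        # Positive values indicate over-prediction of mortality for that fire.
        errors[fire_id] = mean_predicted - observed
    return errors


def has_enough_fires(errors, min_fires=10):
    """True when enough fires passed the per-fire sample size rule."""
    return len(errors) >= min_fires
```

Mapping these signed differences by fire location is one way to look for the regional groupings of error discussed in the text.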
Do species-specific models perform better than the R-A model?
We tabulated model performance statistics between different models applied to the same species. We expected that the species-specific pre-fire and post-fire models would perform better than the R-A model. As above, we tested for statistical differences in the AUC between models using the method of DeLong et al. (1988) as modified for the pROC package to test unpaired ROC curves (Robin et al. 2011). Paired ROC curves are created from the same sample dataset, but with different predictive models, while unpaired ROC curves are created from different sample datasets (Robin et al. 2011). In the two cases for which our ROC curves were paired (the two Pinus palustris models, and the low-severity and moderate-severity models for Populus tremuloides), we used the paired test from DeLong et al. (1988) instead.
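The logic of an unpaired AUC comparison can be sketched in a few lines. The code below estimates each AUC and its DeLong-style variance from placement values, then compares two independent samples with a normal approximation. It is a simplified illustration with hypothetical function names; the pROC implementation used in the study may differ in detail, for example in the reference distribution used for the test statistic.

```python
import math


def _sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def auc_and_variance(scores, died):
    """AUC (probability a dead tree outranks a live tree) and its DeLong variance.

    Requires at least two dead and two live trees.
    """
    pos = [s for s, d in zip(scores, died) if d]      # dead trees
    neg = [s for s, d in zip(scores, died) if not d]  # live trees

    def psi(a, b):
        return 1.0 if a > b else 0.5 if a == b else 0.0

    # Placement values: one per dead tree (v10) and one per live tree (v01).
    v10 = [sum(psi(p, q) for q in neg) / len(neg) for p in pos]
    v01 = [sum(psi(p, q) for p in pos) / len(pos) for q in neg]
    auc = sum(v10) / len(v10)
    variance = _sample_variance(v10) / len(pos) + _sample_variance(v01) / len(neg)
    return auc, variance


def unpaired_auc_test(scores_a, died_a, scores_b, died_b):
    """Two-sided z-test for a difference in AUC between independent samples."""
    auc_a, var_a = auc_and_variance(scores_a, died_a)
    auc_b, var_b = auc_and_variance(scores_b, died_b)
    se = math.sqrt(var_a + var_b)
    if se == 0.0:
        raise ValueError("zero variance; the test statistic is undefined")
    z = (auc_a - auc_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return auc_a, auc_b, z, p_value
```

A paired comparison would additionally need the covariance between the two AUC estimates, because both models are scored on the same trees.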

R-A model performance across species
We wanted to identify any trends in R-A model performance across the 44 species assessed. We explored whether performance varied between division (gymnosperms versus angiosperms), families, regions, leaf habit (deciduous versus evergreen), and species bark thickness. We qualitatively compared model performance among these groups, and visually assessed whether the direction of model errors was related to species' main defensive trait, bark thickness. In relation to bark thickness, we assessed differences in model performance between subjectively defined groups of species: thin-barked (BT coef ≤ 0.035), moderately thick-barked (0.036 ≤ BT coef < 0.049), and thick-barked (BT coef ≥ 0.049). We plotted the species-level error rate in relation to the crown volume scorched and in relation to DBH for each R-A model, ordered from thin-barked to thick-barked species (Table 3).
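For readers unfamiliar with the model form, the sketch below shows how a bark-thickness coefficient, DBH, and CVS combine into a predicted probability of mortality, and how the three bark-thickness groups above can be assigned. The logistic coefficients are the values commonly reported for the Ryan and Amman (1994) model with bark thickness in inches. They are an assumption here, not taken from this section, and should be checked against Table 5. Function names are hypothetical.

```python
import math


def bark_class(bt_coef: float) -> str:
    """Assign the bark-thickness group used in this study.

    Coefficients are reported to three decimals, so values above 0.035 and
    below 0.049 are treated as moderately thick-barked.
    """
    if bt_coef <= 0.035:
        return "thin"
    if bt_coef < 0.049:
        return "moderately thick"
    return "thick"


def ra_mortality_probability(dbh_cm: float, bt_coef: float, cvs_percent: float) -> float:
    """Predicted probability of mortality from the R-A logistic form.

    Assumed form (bark thickness BT in inches, CVS in percent):
        Pm = 1 / (1 + exp(-1.941 + 6.316 * (1 - exp(-BT)) - 0.000535 * CVS**2))
    Bark thickness is approximated as bt_coef * DBH.
    """
    bark_in = bt_coef * dbh_cm / 2.54  # convert cm of bark to inches
    exponent = -1.941 + 6.316 * (1.0 - math.exp(-bark_in)) - 0.000535 * cvs_percent ** 2
    return 1.0 / (1.0 + math.exp(exponent))
```

With this form, predicted mortality rises with CVS and falls with bark thickness, so small trees of thin-barked species receive high predicted probabilities even at moderate scorch.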

Performance of individual R-A models
We summarized results for the R-A model for individual species, focusing on models that had acceptable (12), excellent (10), or outstanding (4) data quality (Table 8). We provided four illustrative examples of the model evaluation figures. All 68 model evaluation figures are included in Additional file 1. We organized this section by region and taxonomical division, and within those groups, by thin-barked, moderately thick-barked, and thick-barked species. We summarized typical model performance and highlighted species for which the R-A model performed very well or very poorly. We encourage readers to examine the figures in Additional file 1 to better understand the detailed results for species of interest.

Thin-barked gymnosperms
The R-A model consistently over-predicted mortality for thin-barked (BT coef ≤ 0.035) Western conifers. The model evaluation figure for Pinus contorta provides a typical example of results for this group (Fig. 2). We explained the figure in detail in the caption of Fig. 2 to provide information on this specific model, and as a guide for critical interpretation of the additional model figures in Additional file 1.
The model evaluation for many other thin-barked conifers, including Pinus edulis, Juniperus deppeana, Pinus monticola, and Thuja plicata, followed a pattern similar to that of P. contorta (Fig. 3, Additional file 1). Thin-barked species showed a consistent pattern of high sensitivities and low specificities, and low PPVs and high NPVs (Fig. 3). There were a few exceptions to this pattern. The model for Pinus attenuata had a similar pattern of relatively higher sensitivities and lower specificities, but the errors were more balanced across the range of crown scorch. The model for P. contorta, which was evaluated using a large set of observations, only slightly over-predicted mortality across the full range of CVS.
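The pattern of high sensitivity with low specificity, and low PPV with high NPV, follows directly from how these statistics are computed when a model over-predicts mortality. The sketch below treats a dead tree as the positive class, consistent with the figure captions, and classifies a tree as dead when its predicted probability meets or exceeds the threshold. The comparison operator and the function name are assumptions.

```python
def classification_metrics(probs, died, threshold=0.5):
    """Accuracy, sensitivity, specificity, PPV, and NPV with 'dead' as positive."""
    tp = fp = tn = fn = 0
    for prob, dead in zip(probs, died):
        predicted_dead = prob >= threshold
        if predicted_dead and dead:
            tp += 1
        elif predicted_dead and not dead:
            fp += 1
        elif not predicted_dead and not dead:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),  # dead trees correctly predicted dead
        "specificity": ratio(tn, tn + fp),  # live trees correctly predicted live
        "ppv": ratio(tp, tp + fp),          # predicted-dead trees that died
        "npv": ratio(tn, tn + fn),          # predicted-live trees that lived
    }
```

In a toy case where two of five trees die but four are predicted to die, sensitivity and NPV are both 1.0 while specificity is 0.33 and PPV is 0.5, the same direction of imbalance reported for thin-barked conifers.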

Moderately thick-barked gymnosperms
Most gymnosperms with intermediate bark thickness (0.036 ≤ BT coef < 0.049) had AUC values >0.75, but sometimes had unbalanced errors, continuing the pattern of high sensitivities and low specificities (Fig. 3). Abies magnifica and Pinus echinata followed this pattern. Tsuga heterophylla, Abies lasiocarpa, and Abies grandis had relatively balanced errors, and acceptable performance overall. The models for Picea engelmannii performed poorly, and flipped the typical trend with higher specificity than sensitivity, and higher PPV than NPV. The model for P. echinata, an eastern gymnosperm, performed poorly (AUC = 0.55), with high sensitivity and low specificity, low PPV, and low NPV. The modest sample for P. echinata (n = 144), which included some sites on the margins of this species' climate niche (Additional file 1), means that results should be interpreted with caution.

Thick-barked gymnosperms
The R-A model generally performed best for gymnosperms with thick bark (BT coef ≥ 0.049). AUC values were over 0.8 for most Western conifers, including Abies concolor, Calocedrus decurrens, Larix occidentalis, Pinus ponderosa (Fig. 4), Pseudotsuga menziesii, Pinus jeffreyi, and Pinus lambertiana. Included in this group are the only R-A models that met our criteria for outstanding models (AUC values ≥0.90, and both PPV and NPV ≥0.7): C. decurrens, P. jeffreyi, and P. lambertiana (Table 9). Two models for thick-barked species did not perform as well, Chamaecyparis lawsoniana (AUC = 0.633) and Pinus coulteri (AUC = 0.611), but these were species for which the data quality was only fair. The model evaluation figure for C. lawsoniana demonstrates why the amount, quality, and representativeness of the data should be considered when interpreting model results (Fig. 5).
The R-A models did not perform as well for the Eastern thick-barked gymnosperms Pinus palustris and Pinus taeda (Fig. 3). Models for Pinus elliottii and P. palustris were similar in that they had very low specificity, low overall accuracy, and both PPVs and NPVs were relatively low. Interpreting the errors associated with P. palustris is complicated because the direction of error varies over the mid-ranges of CVS. This may in part be driven by limitations of the validation data: there were not many trees of this species with high levels of CVS (Additional file 1). Both P. elliottii and P. palustris are species with large, protected buds, which can result in large differences between CVS and CVK, reflecting a limitation of the R-A model, which uses CVS to predict post-fire tree mortality. The model for P. taeda did not perform well overall (AUC = 0.68 and accuracy at 0.5 = 0.67), but errors were fairly well balanced. Mortality was over-predicted when CVS = 0%, and under-predicted when 0% < CVS < 50%. The high levels of mortality at relatively low levels of crown volume scorched may be indicative of other, unmeasured types of injuries causing mortality, but the magnitude of these errors may also reflect small sample sizes for trees with 0% < CVS < 50% (Additional file 1).

Angiosperms
The R-A models for angiosperms, particularly those with thin and moderately thick bark (e.g., BT coef < 0.049), did not perform as well as the models for gymnosperms (Fig. 6). Much like the models for thin-barked gymnosperms, mortality was consistently over-predicted, particularly for higher levels of CVS. Models predicting mortality of Quercus L. species after fire performed poorly because the main damage variable (CVS) had a weak relationship with observed mortality (Additional file 1). Quercus kelloggii provides an illustrative example (Fig. 7). Q. kelloggii had higher levels of mortality at CVS >75% than other Quercus species, and higher mortality in small-diameter trees, but the model showed high sensitivity and low specificity, consistent with other oak species. Quercus gambelii had a large overall sample size (n = 444) and large sample sizes of live and dead stems (n = 331 and n = 113, respectively), and the other Quercus species that had smaller samples and only fair data quality all exhibited similar patterns of low mortality, and similar directional errors in predictions in relation to the main injury and defense variables (Additional file 1). The consistency of the modeling errors across Quercus species supports the conclusion that oak survival with high levels of crown scorch is a real trend, not an anomaly driven by the small sample sizes of some Quercus species. Compared with other angiosperms, the R-A models for Populus L. species performed relatively well (Additional file 1; Fig. 6). The model for P. tremuloides had an AUC = 0.73, at the 0.5 threshold. Higher levels of sensitivity and specificity could be achieved by adjusting the threshold. P. tremuloides that burned tended to have no CVS or 100% CVS, with a median value of 0% CVS for both live and dead trees, leading to low samples at mid-levels of CVS, and high oscillations in observed mortality (Additional file 1).
Table 8 Count of models for different qualitative ratings of data quality and model performance, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. R-A = Ryan and Amman model, which can be applied to many species; SS = species-specific pre-fire and post-fire models. See Table 7
In contrast to thin-barked angiosperms, the R-A model for Cornus nuttallii had a high AUC (0.947). Nevertheless, it continued the pattern in angiosperms of high sensitivities (1.0) and low specificities (0.06). The model for C. nuttallii almost always over-predicted mortality, although good performance in all model performance statistics (≥0.8) could be achieved by adjusting the threshold (Additional file 1).
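Adjusting the threshold, as described for several of the models above, amounts to a small search over candidate cut-offs. The sketch below scans thresholds from 0.1 to 0.9 and returns the threshold that maximizes sensitivity plus specificity, along with the best thresholds subject to a floor of 0.8 on sensitivity or on specificity. Using the sum of sensitivity and specificity (Youden's J) as the criterion is an assumption about what "best" means here, and the function names are hypothetical.

```python
def sens_spec(probs, died, threshold):
    """Sensitivity and specificity with 'dead' as the positive class.

    Assumes at least one dead and one live tree in the sample.
    """
    tp = sum(1 for p, d in zip(probs, died) if d and p >= threshold)
    fn = sum(1 for p, d in zip(probs, died) if d and p < threshold)
    tn = sum(1 for p, d in zip(probs, died) if not d and p < threshold)
    fp = sum(1 for p, d in zip(probs, died) if not d and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_thresholds(probs, died, floor=0.8):
    """Best overall threshold, and best thresholds meeting a sensitivity or specificity floor."""
    grid = [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    rows = [(t,) + sens_spec(probs, died, t) for t in grid]

    def youden(row):
        return row[1] + row[2] - 1.0

    def pick(candidates):
        return max(candidates, key=youden)[0] if candidates else None

    return {
        "best": pick(rows),
        "best_with_sensitivity_floor": pick([r for r in rows if r[1] >= floor]),
        "best_with_specificity_floor": pick([r for r in rows if r[2] >= floor]),
    }
```

For a model that over-predicts mortality, the selected thresholds tend to sit above 0.5, because raising the cut-off moves surviving trees out of the predicted-dead class.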

Exploring additional potential sources of error
Ten models met our sample size criteria for comparing first and second fires (Table 10). All comparisons demonstrated significantly higher AUC levels for first fires. In some cases, the AUC values for both groups were high, but in other cases (the pre-fire models for Abies concolor, Calocedrus decurrens, Pinus lambertiana, and both the R-A and pre-fire model for Pseudotsuga menziesii) the AUC scores were much lower (<0.70) for second fires.
We also compared models using field-based measurements and the calculated version of CVS for seven models (Table 11). We found that models using field-based measurements had significantly higher AUC scores for four of the models. Field-based measurements had significantly lower AUCs for the Pinus ponderosa pre-fire model. Only in one case, the R-A model for Abies magnifica, was the AUC value for the calculated version of CVS lower than 0.7. Overall, models still performed relatively well using the calculated version of CVS.
Fig. 2 Model results for the Pinus contorta Ryan and Amman (R-A) model, from our study to evaluate post-fire tree mortality models. This figure allows for a thorough assessment of model quality and data quality for the P. contorta R-A model. (A) Map shows locations of fires occurring from 1981 to 2016 within the USA from which data to evaluate the model were sampled. Fire locations are plotted over the species' range (green polygons). P. contorta had excellent data quality, with observations coming from 34 fires, dispersed across the species range within the USA. BTcoef = species-specific bark thickness coefficient. (B) The bi-plot shows where the observations used to evaluate models (orange points) fall within the species' bioclimatic niche space (black points) in terms of temperature (x-axis) and precipitation (y-axis). Fires were located across the temperature niche of the species, but on the lower range of the precipitation niche. (C) Model evaluation summary statistics including the AUC (area under the receiver operator characteristic curve) at 0.5 threshold for determining mortality, and confidence intervals (CI) around the AUC. Model evaluation statistics include accuracy, sensitivity (Sens.), specificity (Spec.), positive predictive values (PPV), and negative predictive values (NPV), summarized over a range of probability thresholds (0.1 to 0.9; rows), with the commonly used threshold of 0.5 shown in bold. Warmer colors indicate greater values. The top three bold rows show model performance metrics for the "best" threshold, which optimizes sensitivity and specificity, the best threshold with sensitivity >0.8, and the best threshold with specificity >0.8. The model accuracy statistics indicate a high AUC (0.803), but at the typically used 0.5 threshold, model sensitivity is very high and model specificity is very low. This means that the model accurately predicts which trees are going to die, but makes inaccurate predictions regarding which trees are going to live, which is reflected in the low positive predictive values (PPV) and high negative predictive values (NPV): many trees predicted to die do not actually die, while most trees predicted to live do live. By adjusting the threshold used to assign trees to either live or dead classes to a high value, either high sensitivity or specificity can be obtained with this model with the evaluation data (top bold rows).
... Table 1 for formulas. Warmer colors indicate higher values. Species are ordered from thin-barked to thick-barked species, based on species' bark-thickness coefficient. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016
We examined the spatial structure of fire-scale errors for 22 species-model combinations when sufficient data, both within and across fires, were present (Additional file 2). In Table 12, we provided qualitative summaries of the direction of fire-scale errors and described any visually obvious regional patterns in error direction and magnitude. The R-A models for Abies concolor, Pinus contorta, Pinus ponderosa, and Pseudotsuga menziesii had fire-scale errors that showed visible patterns of regional variation, as did the pre-fire models for Abies concolor, Pinus contorta, Picea engelmannii, and Pinus ponderosa.

Systematic errors in R-A model performance
Approximately 77% of R-A models tested had either excellent or good predictive ability. The models that performed poorly were primarily for angiosperms or thin-barked conifers. R-A model performance differed between angiosperms and gymnosperms, and across the gradient from thin-barked to thick-barked species. The evaluation of R-A model error across levels of CVS offers additional insights (Additional file 3). For conifers, the R-A model made accurate predictions of mortality across all levels of CVS for very thick-barked species, over-predicted mortality at higher levels of CVS for moderately thick-barked species, and under-predicted mortality at low levels of CVS for many thin-barked species (Table 13). For gymnosperms with intermediate bark thickness (e.g., 0.35 ≤ BT coef ≤ 0.52), the R-A model moderately under-predicted mortality at low levels of CVS (e.g., ≤40%). For gymnosperms with thick bark (e.g., BT coef ≥ 0.55), the R-A model moderately over-predicted mortality at high levels of CVS (e.g., ≥60%).
Errors also showed patterns across the DBH ranges (and thus the bark-thickness ranges) of species (Additional file 4). In gymnosperms, mortality was often over-predicted for smaller trees (DBH <30 cm), particularly for thin-barked species (BT coef < 0.35), and under-predicted at the high end of a species' DBH range. Some of the angiosperms followed this same pattern (e.g., Populus tremuloides, Quercus gambelii, and Notholithocarpus densiflorus), but for many other angiosperms, mortality was over-predicted across their DBH range (e.g., Acer rubrum, Quercus garryana, Oxydendrum arboreum, Quercus alba, and Quercus montana). For some species, like Pinus elliottii and Cornus nuttallii, the DBH range represented in the dataset was fairly narrow, making it difficult to assess trends.
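Errors across levels of CVS or DBH can be examined by binning trees on the variable of interest and comparing predicted with observed mortality in each bin. The sketch below is a minimal illustration that uses the mean predicted probability per bin. The bin edges, the use of the mean, and the function name are assumptions rather than the procedure used for the figures in Additional file 1.

```python
def binned_error(values, probs, died, edges):
    """Compare predicted and observed mortality within bins of a predictor.

    values: CVS (%) or DBH (cm) per tree; edges: increasing bin boundaries.
    Bins are [low, high), except the last bin, which also includes its upper edge.
    Returns a list of (low, high, n, mean_predicted, observed_proportion).
    """
    rows = []
    n_bins = len(edges) - 1
    for i in range(n_bins):
        low, high = edges[i], edges[i + 1]
        is_last = i == n_bins - 1
        idx = [
            k for k, v in enumerate(values)
            if low <= v < high or (is_last and v == high)
        ]
        if not idx:
            continue  # skip empty bins rather than report undefined rates
        mean_predicted = sum(probs[k] for k in idx) / len(idx)
        observed = sum(1 for k in idx if died[k]) / len(idx)
        rows.append((low, high, len(idx), mean_predicted, observed))
    return rows
```

Bins where mean predicted mortality exceeds the observed proportion indicate over-prediction. Bins with few trees should be read cautiously, as the text notes for sparsely sampled ranges.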
Do pre-fire and post-fire models perform better than the R-A model?
We expected that the species-specific models would perform significantly better than the R-A models. However, the species-specific pre-fire models had significantly higher AUCs than the R-A models for only five of the 13 models: Pinus albicaulis, Picea engelmannii, Populus tremuloides (both models), and Pseudotsuga menziesii (Fig. 8). The species-specific post-fire models performed better than the pre-fire models, and had significantly higher AUCs than the R-A models for six of the nine post-fire models, including for Pinus contorta, Picea engelmannii, Pinus ponderosa (both the post-fire crown scorch and crown kill post-fire models), and Pseudotsuga menziesii (Fig. 8).
Counter to our expectations, the species-specific pre-fire models had significantly lower AUCs than the R-A model for Abies concolor, Calocedrus decurrens, Pinus contorta, and both the pre-fire Black Hills and crown scorch Pinus ponderosa models (Table 5); the species-specific post-fire model for Abies concolor also had a significantly lower AUC than the R-A model. In other cases, models were not significantly different. Both the R-A model and the pre-fire model for Pinus palustris performed very poorly, with AUCs ≤0.66. The pre-fire models for Pinus palustris severely under-predicted mortality across the range of CVS for the trees in our dataset (Additional file 1).
Another way to compare models is based on how balanced their errors were: for models to be ranked as excellent or outstanding in our qualitative ratings, they had to have both PPV and NPV >0.7, indicating good predictive power for both dead and live trees. Five models, Larix occidentalis (pre-fire and post-fire models), Pinus ponderosa (both pre-fire models), and Pinus contorta (pre-fire model), performed better than the R-A models when PPV and NPV were considered, even though the AUCs were not significantly different or were significantly lower (Table 9).

Data quality
The FTM database provides an unprecedented opportunity to evaluate the tree and stem mortality models in FOFEM. This is the first model evaluation for many species, such as the Quercus and Juniperus species. Nevertheless, the data quality of these evaluations varies substantially between the models assessed (Tables 8 and 9, Additional file 1). For many angiosperms, data quality is only at the fair level (Table 9), meaning that data are limited in terms of the number of fires, number of live or dead trees, or range of damage variables observed (Table 7). Twenty-three models for Western conifers and the R-A model for Populus tremuloides have data quality that we considered  (Table 9). Few tree species met the outstanding data quality ranking because their ranges extend outside the US, but the FTM database currently only includes data from within the US. The model evaluations for which the data quality is only fair should be considered provisional and used primarily to identify consistent patterns in model performance across taxa. Our results highlight species and regions for which new data collection or new model development could be prioritized. We were unable to assess the R-A models for 148 species for which FOFEM has built-in bark thickness coefficients and for five species-specific post-fire models (Additional file 5). Sixty-six of these species have at least one observation in the FTM database with the relevant predictor variables for a given model, and 40 have observations at three years post fire (Additional file 5). Over 80% of the species in Additional file 5 are angiosperms. In both the eastern and western US, oak woodlands and other angiosperm-dominated ecosystems are primary targets for prescribed fire and restoration (e.g., Stambaugh et al. 2015; Long et al. 2016). Species that we identify as having low-quality data could be prioritized as these efforts move forward. Nevertheless, the FTM database provides a foundation that future data collection can build on to support model evaluation and model development for many species.
The maps of the evaluation data (Fig. 1) and the maps for each species-specific model (Additional file 1) demonstrate substantial geographic gaps in data availability. Our objectives for the FTM database were to collect observations of post-fire tree and stem mortality in the continental US, but many of the species evaluated here and included in FOFEM have ranges extending into Canada and Mexico. Data from the eastern USA are extremely limited. Data were also limited from the Intermountain West (e.g., Nevada and Utah) and the Central Rockies (e.g., Colorado) for many of the widespread Western conifer species we assessed. Although mechanisms of injury from fire should be consistent across species' ranges, the stress that individual trees experience may differ due to contrasting climate, soils, competitors, or other conditions. Therefore, including data from throughout species' ranges may yield additional insights.

Model evaluation
Model performance across studies
Logistic regression is one of the most widely used approaches for predicting post-fire tree and stem mortality. Logistic models have been used to identify relevant variables that help us understand contributors to the mortality process and have been used to predict individual and stand-scale mortality (Woolley et al. 2012). Because logistic regression models have been shown to make effective predictions and can translate to management applications, they have been implemented operationally in models simulating fire effects, either as part of a larger-scale modeling process, or as management decision support systems (Hood et al. 2018).
Our results support the applicability of logistic regression modeling for predicting tree and stem mortality after fire. Of the 68 models evaluated, 43 had AUC values ≥0.80. [...] Ryan and Amman (1994), performed acceptably for 25 of the 44 species tested, and had excellent or outstanding performance for five additional species.
Fig. 4 Model results for the Pinus ponderosa Ryan and Amman (R-A) model. (A) Map shows locations of fires occurring from 1981 to 2016 within the USA from which data to evaluate the model were sampled. Fire locations are plotted over the species' range (green polygons). P. ponderosa was incredibly well sampled across its geographic range in the US, with data used to evaluate the R-A model from 43 140 trees and 226 fires. BTcoef = species-specific bark thickness coefficient. (B) The bi-plot shows where the observations used to evaluate models (orange points) fall within the species' bioclimatic niche space (black points) in terms of temperature (x-axis) and precipitation (y-axis). (C) Model evaluation summary statistics including the AUC (area under the receiver operator characteristic curve) at 0.5 threshold for determining mortality, and confidence intervals (CI) around the AUC. Model evaluation statistics include accuracy, sensitivity (Sens.), specificity (Spec.), positive predictive values (PPV), and negative predictive values (NPV), summarized over a range of probability thresholds (0.1 to 0.9; rows), with the commonly used threshold of 0.5 shown in bold. Warmer colors indicate greater values. The top three bold rows show model performance metrics for the "best" threshold, which optimizes sensitivity and specificity, the best threshold with sensitivity >0.8, and the best threshold with specificity >0.8. This model showed higher sensitivity than specificity, but performed well overall (AUC = 0.887). (D) The distributions of defense (diameter at breast height [DBH], as an interpretable representation of bark thickness) and injury (crown volume scorch) variables used in the model are shown with bi-plots. Box plots in the margins of (D) show median (bar), interquartile range (IQR; box; 25th and 75th percentiles), and whiskers show the minimum and maximum values that do not exceed 1.5 × IQR. The scatter plot shows that trees that survived and died after fire were sampled across the ranges of percentage crown volume scorched (CVS) and diameter at breast height (DBH). (E) and (F) Assessment of species-level error comparing the predicted probability of mortality using a 0.5 threshold (P m ; orange points show values and shading shows range) and the observed proportion of trees or stems killed (gray points) within binned observations of the primary injury variables (E), and the DBH (F). (E) The model over-predicted mortality at middle to high values of CVS, and (F) under-predicted mortality for the small sample of very large trees. Qualitative ratings of data quality, model performance, and direction of error in model predictions are listed at the bottom of the figure. This model had excellent data quality, but did not qualify as having outstanding data quality because samples were only from within the US and did not cover the species' full climatic range (Additional file 5)
Table 9 Qualitative ratings of data quality and model performance, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire; Post-fire models predict mortality after fire based on predictors available after fire. Model formulas are in Tables 5 and 6, and model variables are defined in Table 3. See Table 7
Our verification for this model, and for using bark thickness and crown damage as the primary predictors of post-fire tree mortality for western North American conifers, is supported by other studies that evaluated model performance with datasets that are independent of those used to create the models (Sieg et al. 2006; Thies and Westlind 2012; Ganio and Progar 2017; Grayson et al. 2017; Kane et al. 2017). Although our results allow us to describe the conditions under which each model produces errors, readers should not lose sight of the overall efficacy of these models, especially for Western conifers. It is rare to have a simple model, one using only two variables, so effectively capture an important ecological process. In their review of logistic regression models used to predict post-fire tree mortality in western North American conifers, Woolley et al. (2012) identified 116 logistic regression models, from 33 studies, that made predictions for 19 species. Only 13 of the 116 models had been evaluated with independent datasets (Woolley et al. 2012). Since then, several new studies have independently evaluated post-fire mortality logistic regression models. Generally, we observed better model performance than previous studies. For example, Ganio and Progar (2017) validated 14 models for Pinus ponderosa, and six models for Pseudotsuga menziesii, and found lower AUC scores than we did for the three models common to both their study and this study: Pinus ponderosa post-fire scorch model, Pseudotsuga menziesii R-A model, and Pseudotsuga menziesii post-fire model.
Comparisons with other studies indicate that the direction of error in model predictions likely varies among datasets, perhaps due to variation in how a fire burned, seasonal differences related to phenology, or due to regional and inter-annual differences in stress. In line with our findings, Kane et al. (2017) found high sensitivity and low specificity for the R-A model for Abies concolor, Calocedrus decurrens, Juniperus osteosperma, Pinus contorta, Quercus gambelii, and Quercus kelloggii and, like this study, found the opposite pattern of low sensitivity and high specificity for Pinus lambertiana. In contrast to our results, Kane et al. (2017) found low sensitivity and high specificity for Pinus ponderosa and Populus tremuloides. The patterns of sensitivity and specificity (their "true positive rate" and "true negative rate," respectively) found by Grayson et al. (2017) for the R-A model were the same as those we observed for three conifer species (Chamaecyparis lawsoniana, Pinus contorta, and Pinus lambertiana), but not for the other nine species common to both studies (Abies magnifica, Abies concolor, Abies grandis, Calocedrus decurrens, Larix occidentalis, Picea engelmannii, Pinus monticola, Thuja plicata, and Tsuga heterophylla). The database that Kane et al. (2017) drew on, from the US National Park Service Fire Effects Monitoring Program, and the data from Grayson et al. (2017) and Ganio and Progar (2017) were both included in the FTM database, and thus this study draws on these same data.
Fig. 5 Model results for the Chamaecyparis lawsoniana Ryan and Amman (R-A) model. (A) Map shows locations of fires occurring from 1981 to 2016 within the USA from which data to evaluate the model were sampled. Fire locations are plotted over the species' range (green polygons). Note the small sample size (n = 69), the small number of fires sampled (n = 2), and the small number of dead trees (n = 11; D). The small sample size in part reflects the relatively small natural range of this species. BTcoef = species-specific bark thickness coefficient. (B) The bi-plot shows where the observations used to evaluate models (orange points) fall within the species' bioclimatic niche space (black points) in terms of temperature (x-axis) and precipitation (y-axis). (C) Model evaluation summary statistics including the AUC (area under the receiver operator characteristic curve) at 0.5 threshold for determining mortality, and confidence intervals (CI) around the AUC. Model evaluation statistics include accuracy, sensitivity (Sens.), specificity (Spec.), positive predictive values (PPV), and negative predictive values (NPV), summarized over a range of probability thresholds (0.1 to 0.9; rows), with the commonly used threshold of 0.5 shown in bold. Warmer colors indicate greater values. The top three bold rows show model performance metrics for the "best" threshold, which optimizes sensitivity and specificity, the best threshold with sensitivity >0.8, and the best threshold with specificity >0.8. (D) The distributions of defense (diameter at breast height [DBH], as an interpretable representation of bark thickness) and injury (crown volume scorch) variables used in the model are shown with bi-plots. Box plots in the margins of (D) show median (bar), interquartile range (IQR; box; 25th and 75th percentiles), and whiskers show the minimum and maximum values that do not exceed 1.5 × IQR. (E) and (F) Assessment of species-level error comparing the predicted probability of mortality using a 0.5 threshold (P m ; orange points show values and shading shows range) and the observed proportion of trees or stems killed (gray points) within binned observations of the primary injury variables (E), and the DBH (F). Despite the low values of the model accuracy statistics (C), the predicted mortality follows the expected mortality over both the range of percentage crown volume scorched and diameter at breast height (DBH; E and F). The small sample of dead trees, some of which were large-diameter trees with <25% crown volume scorch (D), likely caused the low observed sensitivity of the model. Qualitative ratings of data quality, model performance, and direction of error in model predictions are listed at the bottom of the figure

Model performance varies across taxa, measurement methods, times burned, and regions
Our results show that models performed well for conifers, which make up the dominant canopy component of many forests in western North America. For example, the only R-A models that met our criteria for outstanding models (AUC values ≥0.9, and both PPV and NPV ≥0.7) were for Calocedrus decurrens, Pinus jeffreyi, and Pinus lambertiana (Table 9). In contrast, the model performed poorly for many Southwestern and Southeastern species, including species for which prescribed fire programs often target a decrease in density (e.g., Quercus gambelii). The R-A model performed relatively well for some angiosperms, including Populus deltoides ssp. wislizeni and Notholithocarpus densiflorus, and moderately well for Populus tremuloides and Cornus nuttallii (Fig. 6). Nevertheless, five of 11 angiosperms had AUCs <0.7. Model sensitivity and NPV were highest, and model specificity and PPV were lowest, for thin-barked gymnosperms (Fig. 3).
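The evaluation statistics referred to above can be illustrated with a minimal Python sketch. It is not the code used in this study: the function names, the example data, the convention that a tree is predicted dead when its probability is at or above the threshold, and the use of the 0.5 threshold when checking PPV and NPV are all assumptions made for illustration. Mortality is treated as the positive class, and AUC is computed as the share of (dead, live) pairs that the model ranks correctly.

```python
import math
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def classification_metrics(probs, died, threshold=0.5):
    """Confusion-matrix statistics with 'died' as the positive class.

    Assumption: a tree is predicted dead when its probability >= threshold.
    """
    tp = fp = tn = fn = 0
    for p, d in zip(probs, died):
        predicted_dead = p >= threshold
        if predicted_dead and d:
            tp += 1
        elif predicted_dead and not d:
            fp += 1
        elif not predicted_dead and d:
            fn += 1
        else:
            tn += 1
    return Metrics(
        accuracy=_ratio(tp + tn, tp + fp + tn + fn),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        ppv=_ratio(tp, tp + fp),
        npv=_ratio(tn, tn + fn),
    )


def auc(probs, died):
    """AUC as the fraction of (dead, live) pairs ranked correctly; ties count 0.5."""
    dead = [p for p, d in zip(probs, died) if d]
    live = [p for p, d in zip(probs, died) if not d]
    if not dead or not live:
        return math.nan
    wins = sum(
        (p_dead > p_live) + 0.5 * (p_dead == p_live)
        for p_dead in dead
        for p_live in live
    )
    return wins / (len(dead) * len(live))


def is_outstanding(probs, died, threshold=0.5):
    """Criteria quoted in the text: AUC >= 0.9 and both PPV and NPV >= 0.7."""
    m = classification_metrics(probs, died, threshold)
    return auc(probs, died) >= 0.9 and m.ppv >= 0.7 and m.npv >= 0.7


# Hypothetical predicted probabilities and observed outcomes (1 = died).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
died = [1, 1, 0, 1, 0, 1, 0, 0]
m = classification_metrics(probs, died)
print(m, auc(probs, died), is_outstanding(probs, died))
```

The pairwise AUC is quadratic in sample size, which is acceptable for a sketch; a rank-based formulation would be preferable for datasets as large as the FTM database.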
The trends in the performance of the R-A model related to taxa and bark thickness indicate that different approaches (such as different model forms, better estimates of bark thickness, and additional predictors) may be warranted for angiosperms and thin-barked conifers. Keyser et al. (2018) recently developed new models for 10 Eastern angiosperms that used maximum bole char height and DBH as predictors, and found good model fit, particularly for thin-barked species. Because many angiosperms are deciduous, assessing crown scorch is complicated by season (i.e., burns occurring during leaf-off that fail to reflect branch injury or burns near leaf-off when the physiological cost of replacement is reduced or absent). For ecosystems in which fire is generally low severity and may not scorch or consume tree crowns, damage to the stem (as measured by bark char height, percentage of bole circumference charred, or cambium kill rating) may be a more meaningful measurement of fire-caused injury and a better predictor of mortality (Catry et al. 2013). Including damage to the stem may also be important for angiosperms because their growth form can be non-vertical, allowing the fire to scorch the crown while not burning the base of the stem. Patchy fire spread and an associated lack of coupling of injuries to the stem and the canopy may be one reason the models for angiosperms strongly over-predict mortality of small trees. Also, many angiosperms are able to produce epicormic shoots from their stems (Meier et al. 2012; Pausas and Keeley 2017). Species that resprout from their stem may be relatively resilient to crown volume scorch. Instead, damage to the cambium on the main stem may be a better predictor (Catry et al. 2010, 2013; Keyser et al. 2018), or a different relationship with CVS may need to be parameterized (Furniss et al. 2019). Data collection and modeling of actual mortality, as opposed to just stem mortality, of resprouting species, as has been conducted in the US (e.g., Keeley et al. 2008) and other global ecosystems (Pinard et al. 1999; Barlow and Peres 2008; Hoffmann et al. 2009), might be a more informative approach for managers and ecosystem models in North America.
Fig. 6 Model evaluation summary statistics and qualitative ratings for the FOFEM 5 model for angiosperms, from our study to evaluate post-fire tree mortality models. AUC = area under the receiver operator characteristic curve; Acc. = accuracy; Sens. = sensitivity; Spec. = specificity; PPV = positive predictive value; NPV = negative predictive value; see Table 1 for formulas. Warmer colors indicate higher values. Species are ordered from thin-barked to thick-barked species, based on species' bark-thickness coefficient. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
In summary, the R-A model consistently over-predicted mortality for angiosperms, resulting in high sensitivity and low specificity. For conifers, R-A slightly over-predicted mortality for thin-barked species. It also under-predicted mortality at low levels of CVS for moderately thick-barked conifers, perhaps indicating that injuries to the stems and roots need to be accounted for when modeling mortality of these species as well. The species-specific models we evaluated typically offered the most accurate model predictions, but it is impractical to parameterize models for every species. Modeling approaches based on species traits or suites of traits may be a good middle ground (Hood et al. 2018). There is considerable promise in using mechanistically linked traits including bark thickness (Hengst and Dawson 1994; Lawes et al. 2011), protected buds (Hood et al. 2010), epicormic resprouting ability (Hoffmann et al. 2009; Catry et al. 2010, 2013; Pausas and Keeley 2017), presence of a primary bark beetle (Hood and Bentz 2007; Davis et al. 2012), and depth of surface roots (Varner et al. 2009) to model mortality probabilities for groups of species. Integrating pre-fire and post-fire stress from drought and competition may also improve models (van Mantgem et al. 2003, 2013; Nesmith et al. 2015), and could potentially be coupled with a species-traits approach since the same physiological processes (impairment of hydraulic integrity and carbon demand) are important for both fire- and drought-driven mortality (Michaletz et al. 2012; West et al. 2016; Bär et al. 2018). Contingent relationships may require a more flexible statistical modeling approach than logistic regression modeling (Menges and Deyrup 2001; Shearman et al. 2019). The FTM database illustrates the need for a robust comparison of different statistical approaches and methods for accounting for different species traits, as well as adding even representation of the full range of predictor variables.
While model refinement is an ongoing process, practitioners are still reliant on existing decision support systems to inform planning and land management decisions. Our approach to identifying thresholds and our model evaluation results could be easily incorporated into FOFEM and associated models. These updates could be coupled with guidance on how to set threshold values to optimize prediction accuracies in a way that mirrors management objectives and quantifies uncertainty of this widely used decision support tool. By adjusting the thresholds, we were able to obtain discriminating sensitivities and specificities above 80% for most models (Additional file 1). However, adjusting the thresholds does not make sense for species for which the suggested adjusted thresholds were unrealistically high or low. For example, for Populus deltoides, a threshold of 1 would be needed to achieve specificity of >0.8, reflecting the low specificity of the model, despite high accuracy, sensitivity, PPV, and NPV. Applying suggested thresholds in cases for which the adjusted threshold is within reasonable bounds (e.g., 0.10 ≤ threshold ≤ 0.90) may be sensible.
Table 10 Statistical comparison of AUCs (area under the receiver operating characteristic [ROC] curve) between samples burned in one fire and those burned by a second fire. We tested for statistical differences in AUCs between first and second fires using the method of DeLong et al. (1988) as modified for the pROC package in the statistical program R to test unpaired ROC curves (Robin et al. 2011). U = generalized U-statistic; R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
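The threshold adjustment discussed above can be sketched in Python as a simple grid search. This is an illustration under stated assumptions rather than the procedure used in this study: the "best" threshold is taken to be the one maximizing sensitivity + specificity − 1 (Youden's J), candidate thresholds are restricted to 0.10 to 0.90 in steps of 0.05, a tree is predicted dead when its probability is at or above the threshold, and minimum sensitivity or specificity constraints are applied with ≥. The function names and data are hypothetical.

```python
def sens_spec(probs, died, threshold):
    """Sensitivity and specificity with 'died' as the positive class.

    Requires at least one dead and one live tree in the sample.
    """
    tp = sum(1 for p, d in zip(probs, died) if d and p >= threshold)
    fn = sum(1 for p, d in zip(probs, died) if d and p < threshold)
    tn = sum(1 for p, d in zip(probs, died) if not d and p < threshold)
    fp = sum(1 for p, d in zip(probs, died) if not d and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(probs, died, min_sens=None, min_spec=None):
    """Pick the threshold in [0.10, 0.90] that maximizes Youden's J.

    Optional constraints keep only thresholds that reach a minimum
    sensitivity and/or specificity. Returns (threshold, sens, spec),
    or None when no candidate satisfies the constraints. Ties are
    resolved in favor of the lowest threshold.
    """
    best = None
    best_score = None
    for percent in range(10, 91, 5):
        threshold = percent / 100
        sens, spec = sens_spec(probs, died, threshold)
        if min_sens is not None and sens < min_sens:
            continue
        if min_spec is not None and spec < min_spec:
            continue
        score = sens + spec - 1
        if best_score is None or score > best_score:
            best, best_score = (threshold, sens, spec), score
    return best


# Hypothetical predicted probabilities and observed outcomes (1 = died).
probs = [0.95, 0.9, 0.85, 0.8, 0.75, 0.6, 0.55, 0.4, 0.3, 0.2]
died = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

print(best_threshold(probs, died))                # unconstrained "best"
print(best_threshold(probs, died, min_sens=0.8))  # best with sensitivity >= 0.8
print(best_threshold(probs, died, min_spec=0.9))  # best with specificity >= 0.9
```

Returning None when no threshold in the grid meets the constraint mirrors the Populus deltoides case described above, where the required threshold falls outside reasonable bounds.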
We did not evaluate all steps in the FOFEM modeling process. The pre-fire and R-A models most commonly use percentage crown scorch (either percentage of crown length or percentage of crown volume), which cannot be observed prior to the fire. Instead, FOFEM allows users to enter either predicted flame length (often generated based on experience or via fire behavior prediction software from fire intensity) or scorch height, as well as tree height and live crown ratio; percentage crown scorch is then calculated from these inputs (Lutes et al. 2012). The error we showed between field-based and calculated CVS scores reflects error in FOFEM in one step of this process, when crown length scorched (CLS) is converted to CVS using a standardized equation (Peterson and Ryan 1986). We should expect more error in modeled variables; nevertheless, this equation is also in need of external validation, and may need to be modified for species with different canopy architecture than the species for which it was developed. Barker et al. (2019) evaluated mortality based on multiple simulated weather scenarios, and assessed errors in predicted mortality at stand, forest type, and species scales. They found high model errors, with mortality being over-predicted for more extreme fire-weather scenarios due to over-predictions of flame lengths (Barker et al. 2019). Thus, validating other steps in the simulation modeling process with independent datasets is very much needed. Information on how uncertainty is compounded through multiple steps in the model is also needed.
Table 11 Statistical comparison of AUCs (area under the receiver operating characteristic [ROC] curve) between samples for which percentage crown volume scorched (CVS) was sampled in the field and calculated based on other measurements of canopy injury (e.g., crown length scorch, change in crown ratio, change in canopy base height). We tested for statistical differences in AUCs between these two groups of samples using the method of DeLong et al. (1988) as modified for the pROC package in the statistical program R to test unpaired ROC curves (Robin et al. 2011). U = generalized U-statistic; R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
Table 12 Descriptive summaries of our visual evaluations, using the maps in Additional file 2, of the presence of regional patterns in fire-scale model prediction error, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. Models are defined in Tables 5 and 6. Columns: model, direction of errors, regional patterns in errors.
Table 13 The direction of error in model predictions by species evaluated, from this study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. Models are defined in Tables 5 and 6; R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire; Post-fire models predict mortality after fire based on predictors available after fire. Criteria for assigning qualitative descriptors of frequency of under-prediction and over-prediction are defined in Table 7.
Bark thickness is modeled based on DBH using species-specific bark thickness coefficients. The coefficients and the assumption of a linear relationship between DBH and bark thickness have not been evaluated for many species.
Substantial evidence exists that bark on the most fire-resistant species follows a negative allometry; that is, these species add bark at a proportionally higher rate as juveniles, thereby protecting their underlying cambium at the tree's most vulnerable time (Jackson et al. 1999). Bark is an important protective trait for woody plant species, and thickness is only one of a suite of protective properties provided by bark. Other heat-protective traits include moisture content, surface roughness, and thermal diffusivity (Hare 1965; Dickinson and Johnson 2001). Likewise, the bark thickness relationship in FOFEM assumes that the bark thickness of a tree just reaching 1.37 m in height (i.e., breast height) is 0. This further discounts fire-prone species that over-allocate to bark, particularly as saplings, a pattern found in Quercus, Pinus, and other fire-prone genera (Jackson et al. 1999; Hammond et al. 2015). The R-A model consistently over-predicted mortality of small-diameter trees, which may reflect underestimated bark thickness of these small-diameter trees (Additional file 4). Non-linearity over size (or age) could be an important addition to FOFEM, given its importance in differentiating species and models in our analysis, but additional data on bark thickness relationships to size by species are needed to test this hypothesis.
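To make the bark thickness argument concrete, the Python sketch below compares the linear rule described above (bark thickness = coefficient × DBH) with a hypothetical power-law allometry, and passes both through an R-A-style logistic equation. Several elements are assumptions and should be checked against Tables 5 and 6 and the FOFEM documentation before any use: the logistic coefficients are given as commonly reported for the Ryan and Amman model (with bark thickness in inches and CVS in percent), the bark thickness coefficient of 0.06 is illustrative and not tied to any species, and the power-law form, its exponent, and its reference diameter are invented purely to show the direction of the effect.

```python
import math

CM_PER_INCH = 2.54


def bark_thickness_linear(dbh_cm, bt_coef):
    """Linear rule described in the text: bark thickness = coefficient x DBH."""
    return bt_coef * dbh_cm


def bark_thickness_power(dbh_cm, bt_coef, exponent=0.7, ref_dbh_cm=50.0):
    """Hypothetical power-law alternative (exponent < 1).

    Small stems carry relatively more bark than under the linear rule.
    The curve is scaled to match the linear rule at ref_dbh_cm.
    """
    scale = bt_coef * ref_dbh_cm ** (1.0 - exponent)
    return scale * dbh_cm ** exponent


def ra_mortality_probability(bark_cm, cvs_percent):
    """R-A-style probability of mortality from bark thickness and CVS.

    Assumption: coefficients as commonly reported for the Ryan and Amman
    model, with bark thickness converted to inches and CVS in percent.
    """
    bark_in = bark_cm / CM_PER_INCH
    z = -1.941 + 6.316 * (1.0 - math.exp(-bark_in)) - 0.000535 * cvs_percent ** 2
    return 1.0 / (1.0 + math.exp(z))


BT_COEF = 0.06  # illustrative coefficient, not a FOFEM value for any species
CVS = 50.0      # percent crown volume scorched

for dbh in (5, 10, 25, 50, 100):
    p_linear = ra_mortality_probability(bark_thickness_linear(dbh, BT_COEF), CVS)
    p_power = ra_mortality_probability(bark_thickness_power(dbh, BT_COEF), CVS)
    print(f"DBH {dbh:>3} cm: linear {p_linear:.2f}, power-law {p_power:.2f}")
```

Under these assumptions, the power-law curve assigns thicker bark to stems below the reference diameter and therefore lower predicted mortality at the same CVS, which is the direction of correction suggested by the over-prediction for small-diameter trees.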
Our maps of fire-scale error indicate that model performance may vary regionally for some species. Potential causes of this spatial variation in model error include differences in the size of trees and range of fire injuries in individual fires (i.e., error that is correlated with predictor variables), local environmental differences (e.g., deeper litter and duff in more productive regions and long-unburned regions), intraspecific variation in fire defense traits, and regional stress from drought and bark beetles. Although wildfires typically burn at higher intensities than prescribed fires, in theory there should be no difference among fire types (i.e., wildfire, wildland fire use, or prescribed fire) if fire injuries to trees are similar. Conversely, differences in the patchiness of surface fuel combustion between wildfires and prescribed fires (Blomdahl et al. 2019) or between prescribed fires with differing ignition patterns (Hiers et al. 2020) could translate to differences in model performance between wildfires and prescribed fires. Future research could explore these sources of spatial error and determine if regionally specific models are needed for some species. In the meantime, managers can use these maps, together with the individual species results, as an additional consideration when they are using FOFEM to make operational predictions.

Conclusion
We suggest a three-pronged approach to future development and use of post-fire tree and stem mortality models. First, existing empirical models should continue to be validated and modified to improve prediction accuracy. For example, allowing choice of thresholds for different management outcomes (Table 1) could be integrated into decision support software systems. Estimates of model uncertainty are also needed in decision support systems so that managers can apply precautionary principles to planned operations. Second, researchers could target data collection and modeling on data gaps in the FTM database and poorly performing models identified in this study, and track stem versus individual mortality in basal resprouting species. Additional variable collection for thin-barked gymnosperms and angiosperms and thick-barked Eastern conifers may be necessary to parameterize accurate models. Third, researchers should explore development of models for species with a common set of traits using a diversity of statistical approaches to produce models with stronger mechanistic linkages to processes and, hopefully, greater prediction accuracy.
Fig. 8 Model evaluation summary statistics and qualitative ratings for species for which multiple models were evaluated. AUC = area under the receiver operator characteristic curve; Acc. = accuracy; Sens. = sensitivity; Spec. = specificity; PPV = positive predictive value; NPV = negative predictive value; see Table 1 for formulas. Warmer colors indicate higher values. Group represents statistically significant differences between model area under receiver operator characteristic curve (AUC) values, and group labels are assigned alphabetically to groups with higher AUC values. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
Additional file 1. Species-level model evaluations and data visualizations for 68 logistic regression models predicting mortality after fire used in the First Order Fire Effects Model (FOFEM version 6.7) software system. Data used in this study to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016. Each page shows information on data quality and model performance for a model as applied to an individual species, presented in alphabetical order by species name, and then by model type: Ryan and Amman (R-A), pre-fire species-specific models, and then post-fire species-specific models. Model equations are listed in Tables 5 and 6. The bi-plot shows where the observations used to evaluate models (orange points) fall within the species' bioclimatic niche space (black points) in terms of temperature (x-axis) and precipitation (y-axis). (Middle left) Model evaluation summary statistics including the AUC (area under the receiver operator characteristic curve) and the confidence intervals (CI) around the AUC. Model evaluation statistics also include accuracy, sensitivity (Sens.), specificity (Spec.), positive predictive values (PPV), and negative predictive values (NPV), summarized over a range of probability thresholds (0.1 to 0.9; rows), with the commonly used threshold of 0.5 shown in bold (see Table 1 for a complete description of model summary statistics). Warmer colors indicate greater values. The top three bold rows show model performance metrics for the "best" threshold, which optimizes sensitivity and specificity, the best threshold with sensitivity >0.8, and the best threshold with specificity >0.8. (Middle right) Bi-plots show the distribution of defense (diameter at breast height [DBH], as an interpretable representation of bark thickness) and injury (i.e., crown volume scorch, crown length scorch, crown volume kill, bark char height) variables used in the model. Box plots in the margins show median (center bar), interquartile range (IQR; box; 25th and 75th percentiles), and whiskers show the minimum and maximum values within 1.5 × IQR. Dots are values beyond the whiskers. (Bottom left and bottom right) We assessed species-level error, grouping data in relation to the primary crown injury variable used in this model (i.e., crown volume scorched, crown length scorched, crown length killed, or bark char height), and the DBH, as a measure of defense from heating. Plots show the predicted probability of mortality using a 0.5 threshold (Pm; orange points show values and shading shows range) and the observed proportion of trees or stems killed (gray points) within binned observations of the primary injury variables (bottom left), and the DBH (bottom right). Qualitative ratings of data quality, model performance, and direction of error in model predictions are listed at the bottom of each figure. Table 7 defines the qualitative ratings.
Additional file 2. Evaluation of fire-scale model error for 21 models that had ≥10 fires with ≥10 trees in each fire, in our study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. Each page shows a map, at the scale of the conterminous United States, with the location of fires. Fire locations are colored by the direction of model error. Pages show models in alphabetical order by species name, and then by model type: Ryan and Amman (R-A), and then pre-fire species-specific models. The species and the model are listed on the top of each figure.
Additional file 3. Difference between the probability of predicted mortality (Pm) and proportional observed mortality (y-axis) over the range of crown volume scorch (CVS; x-axis). Larger dot sizes represent exponentially larger sample sizes; missing dots mean no data exist. Values above the bold dashed line mean that the model over-predicts mortality, and values below the line mean that the model under-predicts mortality. Red shaded areas show over-prediction of >0.25 (light red) and >0.50 (dark red), and blue shaded areas show under-prediction of <−0.25 (light blue) and <−0.5 (dark blue). Shaded classes correspond to qualitative summary descriptions of over-prediction and under-prediction (Table 13). Lines and points are orange for angiosperms and aqua for gymnosperms. Species are ordered from thin-barked to thick-barked, and the species-specific bark thickness coefficient (BT coef.) is shown at the top of each pane. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
Additional file 4. Difference between the probability of predicted mortality (Pm) and proportional observed mortality (y-axis) over the range of diameter at breast height (DBH; x-axis) values for R-A models. Larger dots represent exponentially larger sample sizes; missing dots mean no data exist. Values above the bold dashed line mean that the model over-predicts mortality, and values below the bold dashed line mean that the model under-predicts mortality. Red shaded areas show over-prediction of >0.25 (light red) and >0.50 (dark red), and blue shaded areas show under-prediction of <−0.25 (light blue) and <−0.5 (dark blue). Lines and points are orange for angiosperms and aqua for gymnosperms. Species are ordered from thin-barked to thick-barked, and the species-specific bark thickness coefficient (BT coef.) is shown at the top of each pane. Data used to evaluate post-fire tree mortality models are from the USA, from fires occurring from 1981 to 2016.
Additional file 5. Species and models included in the First Order Fire Effects Model (FOFEM) software system that we were not able to evaluate in our study of post-fire tree mortality models from the USA, from fires occurring from 1981 to 2016. Models are defined in Tables 5 and 6; R-A = Ryan and Amman model, which can be applied to many species; Pre-fire models predict mortality after fire based on predictors available before fire; Post-fire models predict mortality after fire based on predictors available after fire. For the R-A models, FOFEM has built-in bark thickness coefficients for these species. We also included the number of observations with relevant predictor variables for any year post fire, and at 3 years post fire, and the total number of observations in the FTM database. We excluded FTM data from M. Battaglia, S. Hood, and V. McDaniel that were used to create species-specific models (Battaglia et al. 2009; Hood and Lutes 2017; Keyser et al. 2018) from the totals provided here, because data used to create models cannot be used for external validation of those models.
developed analyses, and co-wrote the manuscript. JMV and PVM provided data, revised analyses, and co-wrote the manuscript. The author(s) read and approved the final manuscript.

Funding
We acknowledge funding from the Joint Fire Science Program under project JFSP 16-1-04-8, USFS Forest Health Protection, and Rocky Mountain Research Station, and the National Fire Plan.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.