Topological data analysis reveals parameters with prognostic skill for extreme wildfire size

A topological data analysis (TDA) of 200 000 U.S. wildfires larger than 5 acres indicates that events with the largest final burned areas are associated with systematically low fuel moistures, low precipitation, and high vapor pressure deficits in the 30 days prior to the fire start. These parameters are widely used in empirical fire forecasting tools, thus confirming that an unguided, machine learning (ML) analysis can reproduce known relationships. The simple, short time scale parameters identified can therefore provide quantifiable forecast skill for wildfires with extreme sizes. In contrast, longer aggregates of weather observations for the year prior to fire start, including specific humidity, normalized precipitation indices, average temperature, average precipitation, and vegetation indices, are not strongly coupled to extreme fire size and thus afford limited or no enhanced forecast skill. The TDA demonstrates that fuel moistures and short-term weather parameters should optimize the training of ML algorithms for fire forecasting, whilst longer-term climate and ecological measures could be downweighted or omitted. The most useful short-term meteorological and fuels metrics are widely available with low latency for the conterminous U.S., and are not computationally intensive to calculate, suggesting that ML tools using these data streams may suffice to improve situational awareness for wildfire hazards in the U.S.


Introduction
Many previous studies demonstrate the importance of climate and weather parameters in predicting wildfire size, either directly (Westerling et al 2006, Holden et al 2007, Dennison et al 2014, Parks et al 2018) or as components of the U.S. National Fire Danger Rating System (NFDRS) (e.g. Fang et al 2015, Price et al 2015, Freeborn et al 2016, Preisler et al 2016, Williams et al 2019). However, complementary studies also indicate a role for physiographic and contextual parameters such as slope, aspect, and time since previous fire (e.g. Keyser and Westerling 2017) as well as measures of vegetation and fuel (Parks et al 2018). The utility of particular parameters or combinations thereof for forecasting wildfire spread is highly dependent on the methodological approach (Holden et al 2018). Furthermore, parameters based on discrete temporal or spatial observations such as weather measurements can be averaged and normalized in a variety of ways and over different time scales, approaches that are likely to influence their statistical relationship to wildfire characteristics. Many of these weather and fuels parameters are highly correlated to each other and combined into composite metrics (e.g. Holden et al 2018), confounding efforts to either empirically derive prognostic relationships using statistical tools or determine physics-based parameterizations of wildfire growth.
Despite the inherent challenges, optimizing algorithms or statistical tools with an ability to forecast the probability that a wildfire may become large is a high-priority goal as the economic cost associated with catastrophic wildfire grows (Thomas et al 2017). While tools currently exist (i.e. NFDRS), improving forecasts of fire propagation (spread) risk can further reduce hazard by identifying conditions when special care should be taken to reduce ignition events, when suppression response should be at high readiness, or when protective measures such as evacuations or alerts should be emplaced. Machine learning (ML) approaches can address this need, and have been used for analogous problems where many different parameters, measured with variable precision and resolution, may contribute in complex and nonlinear ways to an outcome of interest. For example, a range of different ML techniques have produced recent advances in earthquake aftershock forecasting (Devries et al 2018), weather forecasting (Scher and Messori 2018), human genomics (Libbrecht and Noble 2015), and financial risk analysis (Leo et al 2019).
One ML approach particularly well-suited to identifying patterns in datasets with a large number of events with many incomplete, correlated, or noisy characteristics is topological data analysis (TDA). This approach represents comparisons of event parameters with a shape, or topology, in which distances and connections between sets of events quantify the similarity of their observational properties and the persistence of similarity within groups. Such methods can be highly effective at identifying groups of events that share many characteristics or have unique properties, even if other measured properties are not sorted in the same way or are distributed stochastically. These methods are also extremely efficient at differentiating which of many observable characteristics associate most and least strongly with a targeted parameter, such as fire size.
We therefore use TDA as a novel method to investigate which of several readily available weather and fuel parameters are most strongly associated with the final size of a wildfire, with special emphasis on the largest wildfires. Final size is an operationally useful test parameter because it efficiently captures the ecological and economic costs of the fire. In contrast, the simple presence of a wildfire regardless of size is less demonstrative of event impacts and more sensitive to highly stochastic ignition events, such as lightning strikes.

Methods
We used the most recent update of Short (2017), a quality-checked and georeferenced catalog of 1.88 million wildfires occurring in the U.S. between 1992 and 2015, to analyze continental scale fire dynamics. There are some limitations with the catalog discussed in detail in the supporting publication, but our analyses assume that it is representative of the distribution of recorded parameters over the catalog interval. Because we are predominantly interested in large fires, and therefore in fire spread and persistence rather than simply ignition, we first filtered the catalog for fires with area greater than 5 acres. We then randomly selected two independent (no common events) catalogs containing 100 000 events from the >5 acre set. These two catalogs were analyzed separately with identical approaches. The independent analyses of two separate data sets serve as a check on the robustness of the results, ensuring that we do not interpret topological characteristics that arise in only one catalog.
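The filtering and subsampling step described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name, the seed, and the toy fire sizes are hypothetical, and only the 5 acre threshold and the requirement of no common events come from the text.

```python
import random


def draw_disjoint_subcatalogs(fire_sizes_acres, min_acres=5.0,
                              n_per_catalog=100_000, seed=0):
    """Return two lists of event indices that share no events.

    Only events strictly larger than `min_acres` are eligible.
    """
    eligible = [i for i, size in enumerate(fire_sizes_acres) if size > min_acres]
    if len(eligible) < 2 * n_per_catalog:
        raise ValueError("not enough events larger than the size threshold")
    rng = random.Random(seed)
    # Sampling without replacement guarantees the two halves are disjoint.
    picked = rng.sample(eligible, 2 * n_per_catalog)
    return picked[:n_per_catalog], picked[n_per_catalog:]


# Toy catalog of final fire sizes in acres (made-up values).
sizes = [0.5, 12.0, 7.5, 3.0, 640.0, 5.0, 88.0, 9.1]
catalog_a, catalog_b = draw_disjoint_subcatalogs(sizes, n_per_catalog=2, seed=1)
```

Drawing both subcatalogs from a single sample without replacement is one simple way to guarantee independence; any equivalent partitioning scheme would serve.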
To analyze the sub-catalogs, we extracted the date of discovery, fire size (acres within the final perimeter of the fire), latitude, and longitude for each event from the full catalog. Then, using the discovery date and geographic location, we calculated 15 weather and fuel parameters using publicly available data from the gridMET data set (Abatzoglou 2013): mean annual precipitation (meanP), mean annual daily maximum temperature (meanT), total precipitation for the 30 days prior to the event date (precip30), total precipitation for the 365 days prior to the event date (precip365), mean humidity for the 30 days prior to the event date (humid30), mean daily maximum temperature for the 30 days prior (maxT30), mean daily maximum temperature for the 365 days prior (maxT365), 100 hour fuel moisture for the 30 days prior (100h30), 1000 hour fuel moisture for the 30 days prior (1000h30), vapor pressure deficit for the 30 days prior (VPD30), vapor pressure deficit for the 365 days prior (VPD365), normalized total precipitation (observed/meanP) for the 30 days prior (normP30), normalized total precipitation for the 365 days prior (normP365), normalized maximum temperature (observed/meanT) for the 30 days prior (normT30), and normalized maximum temperature for the 365 days prior (normT365). These parameters were chosen in part because they can be expected to have some influence on fire spread or persistence based on a physical understanding of wildfire processes and in part because they are readily available in public data sets and do not require specialized sensors or operational capabilities. We also extracted two additional parameters, median net primary productivity (NPP; calculated using Robinson et al 2018) and landcover from the National Land Cover Dataset (NLCD; Yang et al 2018) for a supplementary analysis considering vegetation characteristics (table S1).
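As an illustration of how the windowed and normalized parameters could be derived from a daily series at a fire location, consider the sketch below. It assumes that the window ends on the day before discovery and that the daily values have already been extracted for the grid cell containing the fire; the function names and the toy precipitation series are hypothetical, and the normalization follows the observed/meanP definition given above.

```python
def window_total(daily_values, event_index, window_days):
    """Sum of the `window_days` daily values immediately before the event day."""
    start = event_index - window_days
    if start < 0:
        raise ValueError("series does not cover the full window")
    return sum(daily_values[start:event_index])


def window_mean(daily_values, event_index, window_days):
    """Mean of the `window_days` daily values immediately before the event day."""
    return window_total(daily_values, event_index, window_days) / window_days


def normalized(observed, climatological_mean):
    """Observed value divided by the long-term mean (observed/mean)."""
    return observed / climatological_mean


# Toy daily precipitation (mm): 370 wet days followed by 30 dry days.
daily_precip = [1.0] * 370 + [0.0] * 30
event_index = 400          # position of the discovery date in the series
mean_annual_precip = 365.0  # assumed climatological value for this toy cell

precip30 = window_total(daily_precip, event_index, 30)
precip365 = window_total(daily_precip, event_index, 365)
normP365 = normalized(precip365, mean_annual_precip)
```

The same two helpers cover the temperature, humidity, fuel moisture, and vapor pressure deficit parameters, which differ only in the input series and in whether a total or a mean is taken.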
We then loaded the resulting matrices into a TDA engine in the Symphony AyasdiAI computing environment. As briefly summarized in the introduction, the TDA approach identifies relationships among N parameters by calculating an N-dimensional topology for large data sets using distance measures of similarity (Carlsson 2009, Lum et al 2013). The output of a single analysis is a network in which the most similar events are grouped together in single network nodes, slightly less similar events are linked to one another, and then similarity decreases with increasing distance between nodes. The number of nodes and linkages can be changed by adjusting the gain and resolution of the algorithm, and the form of the topology depends on the choice of distance calculation (metric, in AyasdiAI) and the choice of statistics selected for comparison of events (lenses, in AyasdiAI).
A range of topologies using different lenses and metrics were generated for each of the two master subcatalogs. We compared these topologies to optimize resolution of correlates to fire size by searching for lenses and metrics that effectively separate large fires from smaller ones. The preferred metric for this study is normalized correlation on lenses of L-infinity centrality (which calculates, for each point x, the maximal distance to any other point in the data set) and Gaussian density (which applies a Gaussian kernel density estimator, considering each row as Euclidean) (figures 1 and S1 (available online at stacks.iop.org/ERL/15/104039/mmedia)). The resulting topologies with segregated large fires were then used to identify sets of similar large fires for the statistical analyses presented in the results section and described below. The same fires are also located in other topological models using different lenses and metrics (figure S2) to check for persistence of clusters across metrics and lenses. Such persistence of grouping over many topologies is a critical indicator that groupings are significant rather than stochastic outcomes in a single topology; similarity of topology in the analyses of the two subcatalogs also confirms that the groupings of events are not stochastic.
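The two lenses named above can be illustrated with a short sketch. This is not the AyasdiAI implementation: for brevity it uses Euclidean distance throughout instead of the normalized correlation metric, the bandwidth is an arbitrary assumption, the density omits the kernel normalization constant, and the toy rows are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length rows of parameters."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def linf_centrality(rows, dist=euclidean):
    """For each row, the maximal distance to any other row in the data set."""
    return [max(dist(row, other) for other in rows) for row in rows]


def gaussian_density(rows, bandwidth=1.0, dist=euclidean):
    """Unnormalized Gaussian kernel density estimate evaluated at each row."""
    n = len(rows)
    return [
        sum(math.exp(-dist(row, other) ** 2 / (2.0 * bandwidth ** 2))
            for other in rows) / n
        for row in rows
    ]


# Three similar events and one anomalous event (toy values).
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
centrality = linf_centrality(rows)
density = gaussian_density(rows)
```

In this toy case the anomalous row has the lowest density and one of the highest centrality values, which is the kind of contrast that lets a lens pull unusual events away from the bulk of a catalog. Both functions compare every pair of rows, so a 100 000 event catalog would need a more efficient implementation.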
Finally, we extract statistical measures of the persistent large fire groups, including p-values and Kolmogorov-Smirnov scores for comparisons of the groups with the rest of the catalog. These measures are used to identify which parameters map closely with fire size to distinguish sets of large fires that also share other characteristic values. Kernel density estimates of these parameters are generated and compared with the distribution of parameters for the whole test catalog. All of the topological analyses and statistical comparisons are repeated for both independent subsamples of 100 000 fires to confirm cluster persistence and parameter mapping with fire size.
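For readers unfamiliar with the Kolmogorov-Smirnov score, the sketch below computes the two-sample statistic for one parameter: the largest gap between the empirical cumulative distributions of a fire group and of the remaining events. It is a generic textbook calculation, not the routine used by the TDA software, and the sample values are invented for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: max |ECDF_a(v) - ECDF_b(v)| over all values v."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at a data value.
    for v in a + b:
        cdf_a = bisect_right(a, v) / len(a)
        cdf_b = bisect_right(b, v) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical fuel moisture values (%) for a large-fire group and the rest.
group_values = [4.0, 5.0, 6.0]
rest_values = [10.0, 11.0, 12.0, 13.0]
score = ks_statistic(group_values, rest_values)
```

A score near 1 means the group and the rest of the catalog barely overlap in that parameter, while a score near 0 means the parameter does not distinguish the group.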

Results
Two independent subcatalogs of 100 000 wildfires larger than 5 acres show persistent and statistically significant topological separation of the largest fires (figures 1 and S1; table 1). Because the TDA represents fire characteristics as a spatial form (topology), sets of fires that are similar to one another and different from the rest of the catalog can be identified as groups of linked nodes that are separated from the rest of the network (figure 1). We therefore identify three distinct fire groups, shown in labeled boxes in figure 1. Groups 1 and 3 are the most different from the distribution of parameter values of the rest of the catalog (tables 1 and 2); they also contain the largest fires by three and two orders of magnitude, respectively (figure 2 and table 2). The topological segregation of these extreme fires demonstrates their statistical distinctness from most fires (the main grouping of nodes in the topology) and implies that these extreme events differ from the vast majority of events in many parameters. This separation alone implies that ML methods are able to distinguish extreme fires from typical fires, and that the anomalous parameter values associated with extreme events could be used to forecast their likelihood. Group 2 contains the largest fires that are not strongly distinct from the distribution of parameter values in the bulk of the catalog, and therefore serves as a useful comparison for the two more extreme groups. The groups containing the largest fires have systematically low values for fuel moistures in the prior 30 days (both hundred hour fuels [100h30] and thousand hour fuels [1000h30]), together with low precipitation and high vapor pressure deficits over the same window. These parameters therefore have high potential to add skill to fire hazard forecasts or ML algorithms for extreme fire size prediction. They also have a reasonable physical relationship to fire propagation insofar as very low fuel moisture, little precipitation, and high vapor pressure deficits are all indicators of environmental dryness that facilitate wildfire activity. In contrast, the largest fires have no systematic relationship to specific humidity in the prior 30 days (humid30), normalized maximum daily temperature over the prior 365 days (normT365), normalized precipitation over the prior 365 days (normP365), or calendar year; a relationship to calendar year would be expected if there were particularly 'bad' fire years over the whole study area. These parameters therefore have little potential to contribute to fire hazard forecasts or algorithms for fire size prediction. In general, the very largest fires are strongly associated with anomalous values for parameters sampling moisture over the 30 day window prior to the fire start, such as fuel moisture, precipitation and VPD. They are less consistently or strongly associated with anomalous values for temperature or parameters integrated over the 365 day window prior to a fire start (table 1). Interestingly, the vegetation indices analyzed here do not appear to be coupled to extreme fire sizes (table S1). This indicates that the incidence of extreme fire size is more sensitive to the near-term, short time scale weather conditions immediately prior to the start of the fire than to a longer record of antecedent conditions, consistent with other studies (e.g. Freeborn et al 2016, Holden et al 2018). Our results suggest this is more strongly the case for the very largest fires.
Figure 1. Preferred topology for subsampled catalog 1. Node color is by fire size, with distinct large fire groups used in the analyses boxed and labeled. In TDA, a topology is generated by calculating a distance function based on all of the event parameters. Events combined into a single node in the network are most similar; node similarity to other nodes is depicted by distance, with closer nodes and nodes linked by lines more similar than those separated by greater distances.
Extracting the groups of large fires identified with the preferred topological analysis, we can explicitly compare the frequentist statistics of these sets to the remainder of the catalog (table 2) and the distribution of critical parameters to their distribution in the whole catalog (figure 2). As indicated by the nonparametric statistics, large fire groups (groups 1 and 3) differ significantly (≫99% confidence) from the entirety of the event catalog with respect to fire size, fuel moistures, and 30 day temperature and precipitation parameters. Because the 100 hour and 1000 hour fuel moistures are highly correlated to one another and the 30 day VPD is correlated to the 30 day temperature and precipitation parameters, the simple t-test results may overestimate the statistical significance of the composite set of parameters, so the K-S scores better represent the potential contribution of each of these parameters to a forecasting model. Group 2 also differs from the entirety of the event catalog, but the average fire size is much closer to the overall distribution of sizes, and these events are more distinguished by the average annual precipitation and the prior year's total precipitation than the shorter-term weather conditions.

Discussion
Two complementary trends offer the possibility of major advances in wildfire hazard forecasting. First, interest and potential investment in operational forecasts has grown in response to several particularly catastrophic wildfires in the past decade, which are becoming more frequent in the context of climate change (Abatzoglou and Williams 2016, Di Virgilio et al 2019). Second, very large and high-quality catalogs of past events have recently been published, especially Short (2017), that are well-suited for data mining and ML approaches to data sets with many events and many parameters. Although many prior studies have applied linear regressions or other statistical approaches to characterizing empirical relationships for either wildfire starts or wildfire propagation (e.g. Preisler and Westerling 2007, Keyser and Westerling 2017, Parks et al 2018), the large number of possible parameters, their interdependence, and the underlying nonlinearity of wildfire dynamics have the potential to confound traditional statistical modeling. An advantage of TDA is its use of mathematical tools such as persistent homology to determine fundamental patterns in datasets, which yield simple and easy to understand results when complemented with traditional (in this case frequentist) approaches.
The TDA approach has been used in analogous cases to identify parameters most likely to improve forecast or prediction skill for some outcome of interest, such as genomics (Camara et al 2016, Cámara 2017, Rizvi et al 2017), econometric analysis (Gidea and Katz 2018), brain and spinal cord injury (Nielson et al 2015), and engineering failures (Perelman and Ostfeld 2011). Our analyses of wildfires that exceed 5 acres identify short-term weather parameters and fuel moisture parameters as the most likely to improve skill in forecasting the most extreme fires (those with very large total areas). Further, these parameters can provide prognostic skill for a threshold large final fire size prior to and at the time of ignition without requiring updating during the time the wildfire is burning. These results are consistent with recent studies indicating that fire season precipitation and VPD strongly influence burned area (Holden et al 2018), that fuels measures have the strongest correlations to fire severity, followed by fire weather (Parks et al 2018), and that short term (monthly) measures have more significant prediction skill (Preisler and Westerling 2007) than longer term (previous year) weather metrics for fire-danger forecasting. This consistency suggests that ML approaches can either complement or explicitly quantify existing empirical model relations.
Other kinds of ML tools (such as convolutional and/or recurrent neural networks) could therefore be most efficiently trained to forecast potential catastrophic fire risk using the parameters identified by the TDA as useful based on a fixed K-S score cutoff, or using parameters weighted by K-S scores reported in table 1. For example, because the 365 day aggregate parameters are more computationally intensive to calculate and are not distinct for large fires, they can be omitted from or downweighted within neural network training sets or forecasting algorithms without degrading forecast skill. The 30 day aggregated weather and fuel parameters provide much better return on computational and observational investments, and their prognostic skill suggests that some information about potential fire size is available tens of days in advance of ignition up to the time of ignition. Because the fuel moisture parameters are highly correlated to each other and have comparable utility, whichever is most readily available or easily measurable could be integrated into a forecasting algorithm. Although we did not implement or assess the performance of an operational forecasting tool based on our results, this study represents an initial step towards this contemporary method of prediction. Future efforts have the opportunity to evaluate the performance of different ML approaches to develop a new class of fire forecasting models which may enhance situational awareness of extreme wildfire risk and complement existing forecasting tools.
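The two options mentioned above, a fixed K-S cutoff and K-S weighting, could look like the following sketch. The scores are placeholder numbers invented for illustration and are not the values reported in table 1; the cutoff and the function names are likewise hypothetical.

```python
def select_by_cutoff(ks_scores, cutoff):
    """Keep only the parameters whose K-S score meets the cutoff."""
    return [name for name, score in ks_scores.items() if score >= cutoff]


def weights_from_scores(ks_scores):
    """Scale the K-S scores so that the parameter weights sum to one."""
    total = sum(ks_scores.values())
    return {name: score / total for name, score in ks_scores.items()}


# Placeholder scores (NOT from table 1), chosen only to show the mechanics.
example_scores = {"100h30": 0.8, "VPD30": 0.7, "precip365": 0.2}

selected = select_by_cutoff(example_scores, cutoff=0.5)
weights = weights_from_scores(example_scores)
```

A hard cutoff gives a smaller and cheaper training set, whereas weighting keeps every parameter but lets the weakly separating ones contribute less; which is preferable would need to be tested against forecast performance.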
The analyses reported here used parameters that are readily available for the entire continental United States at sufficiently low latency and high resolution (Abatzoglou 2013) to be integrated into an ML-based generalized forecasting tool. A tool that provided a regional aggregated time-dependent risk of larger fires would have considerable situational utility, even if the specific location and timing of single events cannot be precisely predicted. Additional analyses on shorter window aggregates of weather data or near real-time weather and fuel parameters should be undertaken to test whether such specialized data streams provide sufficient improvements in operational forecasting of potential catastrophic fire risk to warrant development of the necessary sensor and communications infrastructure.

Conclusion
Our analysis investigated complex topological relationships between environmental parameters and extreme wildfire size using a novel ML approach, TDA. Our analysis identifies persistent topologies that separate very large wildfire events, and their associated environmental conditions, from the rest of a 200 000 event catalog of wildfires over 5 acres across the continental United States. Meteorological processes and fuel moisture in the 30 days leading to ignition are the strongest determinants of extreme fire size, while conditions in the year prior to ignition and climatological context are much less relevant. This indicates that rapidly evolving moisture conditions throughout the fire season regulate the template for extreme fire spread, determining the vulnerability of local communities to significant ecological and economic cost. Rapid fuel desiccation is becoming increasingly likely in the context of enhanced atmospheric evaporative demand and climate change, suggesting short timescale, high frequency monitoring and modeling as a priority for accurate fire risk prediction and prompt, on-the-ground action. ML models trained on datasets highlighted by this TDA are likely to improve situational awareness for wildfire hazards across the U.S.