Deep graphical regression for jointly moderate and extreme Australian wildfires

Recent wildfires in Australia have led to considerable economic loss and property destruction, and there is increasing concern that climate change may exacerbate their intensity, duration, and frequency. Hazard quantification for extreme wildfires is an important component of wildfire management, as it facilitates efficient resource distribution, adverse effect mitigation, and recovery efforts. However, although extreme wildfires are typically the most impactful, both small and moderate fires can still be devastating to local communities and ecosystems. Therefore, it is imperative to develop robust statistical methods to reliably model the full distribution of wildfire spread. We do so for a novel dataset of Australian wildfires from 1999 to 2019, and analyse monthly spread over areas approximately corresponding to Statistical Areas Level 1 and 2 (SA1/SA2) regions. Given the complex nature of wildfire ignition and spread, we exploit recent advances in statistical deep learning and extreme value theory to construct a parametric regression model using graph convolutional neural networks and the extended generalized Pareto distribution, which allows us to model wildfire spread observed on an irregular spatial domain. We highlight the efficacy of our newly proposed model and perform a wildfire hazard assessment for Australia and population-dense communities, namely Tasmania, Sydney, Melbourne, and Perth.


Introduction
Due to their unprecedented scale and severity, the infamous Black Summer bushfires, which occurred from late 2019 to early 2020, were a significant environmental disaster. Their effects are believed to have been heavily exacerbated by climate change (Ward et al., 2020; Haque et al., 2021).
Destruction of property on a massive scale, with an estimated 2,000 homes destroyed and billions of dollars in damages, motivates a real need for informed urban planning and updates to both building codes and disaster management strategies, with a view towards minimizing losses and maximizing resilience in wildfire-prone areas (Gibbons et al., 2012; Price and Bradstock, 2012). Such extreme wildfires also harm human health by creating hazardous air quality, which can lead to respiratory issues in vulnerable populations (Finlay et al., 2012). Aside from their human impact, the Black Summer fires had devastating effects on local wildlife and biodiversity, with an estimated excess of 3 billion animals thought to have been affected through habitat loss (Levin et al., 2021). In order to mitigate the impacts of future, similarly devastating events, there is a need to develop statistical models that can perform robust wildfire hazard assessments in Australia. To that end, we build a hybrid statistical deep-learning model for the occurrence and spread of Australian wildfires. We utilize this model to build continental hazard maps that facilitate hazard evaluation of population-dense communities.
Studies of wildfires often make use of either point-pattern (Genton et al., 2006; Hering et al., 2009; Juan et al., 2012) or areal (Serra et al., 2012; Bivand et al., 2015) datasets. Whilst the former extract information from the (random) locations of wildfire occurrences, it can be difficult to build parsimonious models for them without relying on complex point process models (see, e.g., Koh et al., 2023). By contrast, areal datasets, which describe wildfire characteristics aggregated over space-time polygons, may be better suited to facilitate hazard assessment, as they describe the characteristics of wildfires in a fixed frame of reference. Whilst many areal datasets of Australian burnt areas exist (Matthews et al., 2012; Giglio et al., 2018; Bouguettaya et al., 2022), they are often criticised because the burnt area is aggregated over regular spatial grid boxes. A hazard analysis based on such data would not truly reflect the hazard to local populations and communities, as these are not arranged in a regular fashion. In this paper, we construct a novel dataset of aggregated burnt areas over irregular spatial regions built from the Level 1 and 2 Australian Statistical Geography Standard (ASGS) boundaries (Australian Bureau of Statistics, 2011). The smaller Level 2 boundaries are utilised in the population-dense "greater" capital city areas, that is, Sydney, Melbourne, Brisbane, Perth, Darwin, Hobart, and Adelaide, with the coarser Level 1 boundaries adopted elsewhere.
GPDs are asymptotically justified by extreme-value theory and are thus often used to estimate quantiles far into the upper tails of a distribution (Coles, 2001). They are typically exploited in a peaks-over-threshold framework and fitted only to excesses above a pre-specified threshold, with empirical estimates of the distribution used to model the distribution below the threshold; see Coles (2001) for details. As such, they do not provide a characterisation of the behaviour of non-extreme (i.e., low or moderate) events. Moreover, their inference may be sensitive to the choice of threshold, which can be problematic in practice (Naveau et al., 2016). Modeling non-extreme, as well as extreme, wildfires is still crucial for hazard analysis, as such fires can have significant impacts on local ecosystems, homes, and communities, and can quickly develop into extreme and devastating wildfires. This is particularly relevant for Australia, where wildfires are a common occurrence (Valente and Laurini, 2021). Whilst the GPD can be fitted to the entire range of data, a more appropriate model may be one that has asymptotically justified GPD upper tails but a different parametric characterisation for the bulk of the distribution. Mixture distributions that can capture both the upper tails and bulk of data have been constructed by stitching together GPDs with parametric distributions, see, for example, Frigessi et al. (2003) and Carreau and Bengio (2009), but these models often place restrictive constraints on the parameter space that make their inference difficult. More recently, there have been extensive research efforts to develop models with more flexible lower-tail, bulk, and upper-tail behaviours (e.g., Yadav et al., 2021; Stein, 2021). In particular, Papastathopoulos and Tawn (2013) and Naveau et al. (2016) propose the extended generalized Pareto distribution (eGPD) as a flexible alternative to mixture distributions, which addresses both their limitations and those of standard GPD models. The eGPD has been successfully applied in the context of precipitation (Naveau et al., 2016; Tencaliec et al., 2020; Carvalho et al., 2022; Haruna et al., 2023) and landslide (Yadav et al., 2023) modelling, but its potential for modelling wildfires has not yet been fully exploited. We adopt the eGPD as our distributional model for wildfire spread.
Machine learning methods have shown promising success in the modelling of wildfire spread and susceptibility, see, for example, Radke et al. (2019), Zhang et al. (2019), Bergado et al. (2021), and Cisneros et al. (2023a,b), as well as the propagation of wildfire fronts (Dabrowski et al., 2023; Yoo and Wikle, 2023). Hence, we leverage their predictive power to model wildfire spread through a parametric regression framework. In particular, we adopt a deep regression framework, as neural networks are capable of capturing the highly complex structure that we expect to be exhibited by the processes that drive Australian wildfire ignition and spread (Tolhurst et al., 2008). Deep extreme value regression models have been successfully implemented by, for example, Carreau and Bengio (2009), Carreau and Vrac (2011), Rietsch et al. (2013), and Pasche and Engelke (2022), but they adopt neural network architectures that do not exploit spatial structure in data. Richards and Huser (2022) and Richards et al. (2023) advocate the use of convolutional neural networks (CNNs), see the review by Gallego and Ríos Insua (2022), to capture complex spatial structure in the predictor variables when modelling U.S. and European wildfires, respectively. Both studies illustrate that better-fitting models for extreme wildfire spread, relative to those that use standard densely-connected neural networks, can be produced by exploiting spatial structure in the covariates. A crucial drawback of CNNs is that they are only applicable to regularly gridded data, a property not exhibited by the data that we use in our study. We instead develop a graphical regression model using graph convolutional neural networks (GCNNs), see Kipf and Welling (2017), which extend the notion of convolutions in CNNs from regularly gridded to irregularly gridded or graphical input data.
The rest of the paper is organized as follows. In Section 2, we introduce the novel dataset of burnt areas resulting from Australian wildfires that we use in our wildfire hazard assessment study, and we discuss its construction. In Section 3, we detail our novel hybrid statistical deep regression model, with details of the eGPD and GCNN components provided in Sections 3.2 and 3.3, respectively. In Section 4, we use our novel framework to jointly model low, moderate, and extreme Australian burnt areas, and provide a wildfire hazard assessment of four population-dense communities: Sydney, Tasmania, Perth, and Melbourne. We identify spatial trends in the frequency and severity of wildfire burnt areas across Australia and these communities, and highlight spatial variation in temporal trends of the wildfire distribution across the domain. We also identify important predictors that drive the frequency and severity of Australian wildfires. In Section 5, we conclude the paper with some discussion of future research directions.

Australian burnt area data
We begin by describing the construction of our novel burnt area data. We first collect burnt area boundaries from the data portal of each Australian state government and combine them into a single database, which consists of the boundaries of areas burnt by individual wildfires from June 1999 to December 2018, inclusive (see Ryu et al., 2023). Boundaries of events in the Northern Territory are of relatively poor quality (i.e., low resolution) or sparsely available, and so we choose to exclude the events in this region from our dataset. We assign, to each boundary, the month of the ignition of its corresponding wildfire. Where the ignition date is missing, we assign, in its place, the date of wildfire extinguishment, and assume that its month coincides with the missing ignition month. In the very rare cases where both ignition and extinguishment dates are missing (1,100 cases out of 241,801), the wildfire events are removed.
For our newly constructed database of wildfire boundaries, we aggregate the burnt areas into monthly values and onto a temporally-consistent set of spatial reference units that are defined by Level 1 Australian Statistical Geography Standard (ASGS) boundaries. These boundaries are used by the Australian Bureau of Statistics to form Statistical Areas Level 1 (SA1) regions and perform population census (Australian Bureau of Statistics, 2011). In most risk reduction activities, local governments have limited jurisdiction outside of their governmental units. By using Level 1 units of ASGS boundaries, we ensure that the considered regions fall well within the smallest governmental units, thus allowing local agencies to use the results of our analysis to inform actionable mitigation policies. In the "greater" city areas (i.e., Sydney, Melbourne, Brisbane, Perth, Darwin, Hobart, and Adelaide), where governmental jurisdiction is city-wide, we instead construct polygons using the higher-resolution Level 2 ASGS boundaries. In total, we obtain 7,901 artificially constructed and temporally-consistent polygons that encompass mainland Australia and Tasmania; 311 of these polygons form the Northern Territory. With these reference units, we perform a spatial overlay of the wildfire burnt area polygons and calculate the total burnt area (BA; km²), due to wildfires, within the reference unit in each month.
To model wildfire occurrence and spread, we incorporate meaningful covariates that play a key role in wildfire dynamics, for example, land-cover types and meteorological conditions (Fusco et al., 2019; Nadeem et al., 2020). For the former, we use the monthly average Normalized Difference Vegetation Index (NDVI), which gives a measure of the types of biomass available for ignition in a spatial region (see Figure 1). NDVI is a satellite-derived index that reflects the "greenness" or density of vegetation. Higher NDVI values indicate denser and healthier vegetation, while lower NDVI values suggest sparse or senescing (deteriorating) vegetation. For the meteorological conditions, we include five variables: 2 m air temperature (K), precipitation (m), evaporation (m), and both northerly and easterly wind speed components (m/s). For each of these five variables, we use both its monthly average and monthly maximum. Descriptors of the domain topography (i.e., a polygon's average slope (°) and aspect (°)) are also available. The monthly covariates we consider are aggregated onto the same spatial reference units as the response variable. Spatial overlay is performed using covariate images obtained from different sources. The static topographical properties (slope and aspect) are derived from the Shuttle Radar Topography Mission digital elevation model (Farr and Kobrick, 2000). For the dynamic predictors, monthly averages and maxima of the meteorological variables are obtained from the same hourly values used to derive the ERA5 monthly land averages (Muñoz Sabater, 2019), and the NDVI is taken from the Tier-1 orthorectified Landsat-7 scenes converted to top-of-atmosphere reflectance (Chander et al., 2009). Note that, whilst the response data are missing for the Northern Territory, all covariates are available for polygons located within these regions (and can thus be used to make predictions with our model). Monthly aggregated BA and the covariates are publicly available at https://github.com/Jbrich95/pinnEV, and the individual wildfire boundaries are available upon request.
We focus on modeling the monthly burnt area, which is a measure of the hazard exhibited by a region experiencing wildfires. Following Richards and Huser (2022), we model the square-root of monthly BA. Notable historical events in our data were part of a larger series of fires that burned across multiple states, including New South Wales, Victoria, and South Australia, and which were exacerbated by prolonged drought, hot temperatures, and strong winds (Chafer et al., 2004). In the span of the following years (2002-2009), Victoria experienced three mega-fires that burnt 40% of the state's public land. The top panels of Figure 2 display the 5% and 95% quantiles of non-zero values of monthly radial wildfire spread across the entire observation period. While the 5% quantile tends to have higher magnitudes nearer the mid-eastern region of Australia, the 95% quantile shows larger magnitudes near the southern coastal areas, indicating different spatial patterns in the bulk and tail of the response distribution. The bottom panel of Figure 2 provides a time series of the monthly 5% and 95% quantiles of the response pooled over the entire spatial domain. We observe a slight positive trend in the 95% quantile of the response, but a negative trend in the lower tail. An exploratory analysis of the data reveals that the total number of wildfires in Australia increased from 3,685 to 7,299 between 1999 and 2018, with an average of 538 fires per year during the study period. The southern regions of Australia appear to exhibit a higher frequency of wildfires compared to the northern regions, with Victoria and New South Wales having the highest number of fires. The wildfire season also seems to be extending, with a significant increase in the number of fires in January and February. Additionally, the analysis shows that the BA of the wildfires within each polygon varies, with the majority of the fires being small (less than 50 km²) and only a small proportion (10%) being extremely large (greater than 1,200 km²). Motivated by these findings, we include spatial coordinates and time as covariates in our model, allowing us to approximate spatial random effects and account for temporal seasonality and trends when combined, non-linearly, using neural networks.

Methodology
We present our two-stage deep-learning framework for separate modeling of the occurrence probability of wildfires and their subsequent radial spread, conditional on an ignition.

Overview
Define a spatio-temporal process {Y(s, t) : s ∈ S, t ∈ T}, indexed by a discrete time domain T ⊂ N and a discrete spatial partitioning S = {s_1, . . . , s_V}, which comprises V irregular polygons that partition the study domain. For ease of exposition, we hereafter treat each polygon s ∈ S as a point location, and note that V = 7,901 in our application. We then let X(s, t) = (X_1(s, t), . . . , X_d(s, t))′ denote a vector of d predictors available at all space-time locations (s, t) ∈ S × T. We further denote by y(s, t) and x(s, t) the observations of Y(s, t) and X(s, t), respectively.
For our data, the response Y(s, t) is taken to be the square-root of the monthly BA observed over region s in the t-th month. The predictor set X(s, t) is composed of the covariates described in Section 2, as well as the latitude and longitude coordinates of the centroid of s and the month and year indices of t. We seek to estimate the marginal distribution function of radial spread Y(s, t) conditional on predictors X(s, t) = x(s, t), which we denote by F_(s,t)(y) = Pr{Y(s, t) ≤ y | X(s, t) = x(s, t)}. We assume the form

F_(s,t)(y) = 1 − p_0(s, t) + p_0(s, t) F_(s,t),+(y),  y ≥ 0,   (1)

where p_0(s, t) ∈ (0, 1) is the probability of a wildfire occurrence in region s at time t and where

F_(s,t),+(y) = Pr{Y(s, t) ≤ y | Y(s, t) > 0, X(s, t) = x(s, t)}   (2)

is the distribution function of non-zero values of the radial spread Y(s, t), conditional on a wildfire occurrence. We model p_0(s, t) using a deep logistic regression framework: we represent log[p_0(s, t)/{1 − p_0(s, t)}] as a function of x(s, t), which is estimated using the graph convolutional neural network (GCNN) that we describe in Section 3.3. For F_(s,t),+(·), we assume the parametric form determined by the extended generalised Pareto distribution (eGPD), which we describe in Section 3.2.
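To make the mixture form in (1) concrete, it can be sketched in a few lines of Python; `p0` and `F_plus` below are placeholder inputs (an arbitrary occurrence probability and a unit-exponential stand-in for the eGPD), not the fitted neural-network outputs.

```python
import math

def mixture_cdf(y, p0, F_plus):
    # Distribution of monthly spread Y: an atom of mass 1 - p0 at zero
    # (no wildfire that month) mixed with the conditional spread cdf F_plus.
    if y < 0:
        return 0.0
    return (1.0 - p0) + p0 * F_plus(y)

# Toy illustration: unit-exponential conditional spread, 30% occurrence probability.
F_plus = lambda y: 1.0 - math.exp(-y)
print(mixture_cdf(0.0, 0.3, F_plus))  # mass at zero is 1 - p0 = 0.7
```

The jump of size 1 − p_0 at zero is what allows the model to describe months without any fire alongside the continuous spread distribution.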

Extended generalized Pareto distribution
Classical asymptotic peaks-over-threshold models utilise the generalised Pareto distribution (GPD) to characterise the upper-tail behavior of a random variable (see, e.g., Pickands, 1975; Davison and Smith, 1990); see Davison and Huser (2015) for a review of extreme value theory. For a random variable Y, we assume some high threshold u exists such that (Y − u) | (Y > u) follows the GPD, denoted GPD(σ_u, ξ), for scale and shape parameters σ_u > 0 and ξ ∈ R, respectively. Its distribution function is

H(y) = 1 − (1 + ξy/σ_u)_+^{−1/ξ},  y > 0,   (3)

where a_+ = max(a, 0), with H(y) = 1 − exp(−y/σ_u) in the limiting case ξ = 0. The shape parameter ξ determines the rate of tail decay, with slow power-law decay for ξ > 0, exponential decay for ξ = 0, and decay towards a finite upper bound when ξ < 0. Although a number of studies have shown that wildfire burnt areas are heavy-tailed, and that an appropriate model for them requires ξ > 0 (Pereira and Turkman, 2019; Richards and Huser, 2022), we note that our monthly burnt area data y(s, t) exhibit the natural physical constraint that they cannot be larger than the area of the corresponding SA1/SA2 region. This would suggest that a model with ξ < 0 might be more appropriate for our data far in the upper tail. Following Richards et al. (2023), we instead constrain ξ > 0 throughout, as this circumvents numerical issues related to inference of neural network-based regression models with parameter-dependent support (see Richards and Huser, 2022), and also leads to very good fits to our data all the way from low to high (but not exceedingly high) quantiles (see Section 4.1).
Moreover, in Section 4, we estimate ξ to be significantly non-zero, suggesting that ξ > 0 is a reasonable assumption for our data.
The GPD is a popular choice for modelling the upper tails of random variables and has been widely applied in environmental settings. However, it suffers from three major drawbacks: i) it requires specification of the high exceedance threshold u; ii) it does not provide any model for the bulk, Y | Y < u; and iii) the scale parameter σ_u depends on u, making its interpretation difficult (Richards et al., 2023). Efficient and optimal selection of u is an ongoing topic of discussion in the literature, see, for example, Dupuis (1999) and Deidda (2010); with u too low, the asymptotic arguments that justify the use of the GPD do not hold, and, with u too large, too little data remain available for reliable inference. Typically, the bulk of the distribution of Y is modelled empirically (Davison and Smith, 1990). However, in that case it is difficult to model the impacts of predictors on the distribution of Y | Y < u, which here corresponds to the spread of non-extreme wildfires. Parametric mixture models can be adopted instead (Behrens et al., 2004; de Melo Mendes and Lopes, 2004; Carreau and Bengio, 2009; Carreau and Vrac, 2011), but they impose strong restrictions on the parameter space that would make inference using neural networks computationally problematic. This motivates our use of the extended generalised Pareto distribution (eGPD) as a model for (2), as it can characterise the behaviour of the entirety of Y, has asymptotically-justified GPD upper (and lower) tails, and has simple box constraints on the parameter space.
Papastathopoulos and Tawn (2013), with extensions by Naveau et al. (2016), propose several families of extended GPDs. We adopt the first family (as named by Naveau et al., 2016), with distribution function

F_(s,t),+(y) = {H(y)}^κ,  y > 0,

for κ > 0 and H(·) the GPD distribution function in (3), with scale σ(s, t) > 0 and shape ξ > 0. This model has been previously applied in a regression setting by, for example, Carvalho et al. (2022) and Carrer and Gaetan (2022). Here, we take only σ(s, t) to be a function of the predictors x(s, t), and leave ξ and κ fixed over space and time. Following Richards et al. (2023), we use an offset term in the scale function: defining a(s) > 0 as the area of polygon s, we let σ(s, t) ∝ a(s). We note that different formulations, with heterogeneous κ and ξ, were also considered in our analysis, but these did not lead to a significant improvement in model fits.
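A minimal sketch of this first eGPD family, with the closed-form quantile function implied by inverting F_+(y) = {H(y)}^κ, is given below. The scale σ = 1 is a toy choice (in the fitted model σ(s, t) varies with predictors), while the shape values are of the same order as the estimates reported in Section 4.

```python
def gpd_cdf(y, sigma, xi):
    # GPD distribution function H(y) = 1 - (1 + xi*y/sigma)**(-1/xi),
    # written for the xi > 0 case used throughout the paper.
    if y <= 0.0:
        return 0.0
    return 1.0 - (1.0 + xi * y / sigma) ** (-1.0 / xi)

def egpd_cdf(y, sigma, xi, kappa):
    # First eGPD family (Naveau et al., 2016): F(y) = H(y)**kappa.
    return gpd_cdf(y, sigma, xi) ** kappa

def egpd_quantile(p, sigma, xi, kappa):
    # Closed-form inverse: y = (sigma/xi) * ((1 - p**(1/kappa))**(-xi) - 1).
    return (sigma / xi) * ((1.0 - p ** (1.0 / kappa)) ** (-xi) - 1.0)

# Round trip at the median, with toy sigma and shape values near the
# Section 4 estimates (kappa ~ 0.831, xi ~ 0.161).
q = egpd_quantile(0.5, sigma=1.0, xi=0.161, kappa=0.831)
```

Setting κ = 1 recovers the ordinary GPD, which is why a κ estimated away from one (as in Section 4) indicates a genuine departure from the standard peaks-over-threshold model.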
The parameter σ can be interpreted as a measure of the severity of monthly radial wildfire spread, conditional on at least one wildfire occurrence.We refer to σ as the "conditional spread severity".
Whilst we do not use an offset term for p_0(s, t), we do include a(s) as an extra covariate in x(s, t) to model wildfire occurrences.

Deep regression
We model the occurrence probability p_0(s, t) and conditional spread severity σ(s, t) as functions of the predictor set {x(s, t)}. Specifically, we let logit{p_0(s, t)} = m_p{x(s, t)} and log{σ(s, t)} = log{a(s)} + m_σ{x(s, t)} for neural networks m_p and m_σ. In the context of modelling GPD parameters, Rietsch et al. (2013), Carreau and Bengio (2009), Carreau and Vrac (2011), and Richards et al. (2023) illustrate excellent model fits when using standard densely-connected neural networks (DNNs). Richards and Huser (2022) advocate the use of convolutional neural networks (CNNs) in place of DNNs, as these can capture spatial structure in the predictors (see the review by Gu et al., 2018). One limitation of CNNs is that they are only applicable to data observed over a regular spatial grid, which is not the case for our data. An alternative family of neural networks are graph neural networks (Zhou et al., 2020), which are applicable to graphical data and have already been applied in the context of logistic regression (see, e.g., Tonks et al., 2022). A sub-class of graph neural networks are graph convolutional neural networks (GCNNs), which extend the concept of grid-based convolutions to graphs (Duvenaud et al., 2015; Kipf and Welling, 2017). We represent our spatial data as a graph and investigate the efficacy of GCNNs relative to DNNs. We now describe both architectures for a general function m_⋆, where the subscript ⋆ can be replaced by p or σ as above.
A general feed-forward neural network with J hidden layers, each of width n_j, can be written as an iterative composition of functions. For the i-th node of the j-th layer, with i = 1, . . . , n_j, define a function m_{j,i}(s*, t), s* ∈ S, and the vector of outputs from layer j by m_j(s*, t) = (m_{j,1}(s*, t), . . . , m_{j,n_j}(s*, t))′, with the convention that n_0 = d is the number of predictors and m_{0,i}(s*, t) := x_i(s*, t) for i = 1, . . . , d; note that the 0-th layer output is simply the input data x(s*, t). The output of the final layer, at region s* ∈ S and time t, is

m_⋆{x(s*, t)} = w^{(J+1)′} m_J(s*, t) + b^{(J+1)},

for estimable weights and biases w^{(J+1)} ∈ R^{n_J} and b^{(J+1)} ∈ R, respectively. The output at the i-th node of the j-th layer, for j = 1, . . . , J and i = 1, . . . , n_j, is

m_{j,i}(s*, t) = a_j[g_{j,i}(s*, {m_{j−1}(s, t) : s ∈ S}) + b^{(j,i)}],   (4)

for an estimable bias b^{(j,i)} ∈ R, fixed activation function a_j(·), and parametric function g_{j,i}(·, ·); note that the latter is a function of both the region s* and {m_{j−1}(s, t) : s ∈ S}, that is, the set of the previous layer's features evaluated across the entire spatial domain. Following advocacy by Glorot et al. (2011), we take the rectified linear unit (ReLU) activation function, that is, a_j(x) = max{0, x}, for all j = 1, . . . , J. If our neural network is a DNN, then we set g_{j,i}(·, ·), for all i = 1, . . . , n_j, j = 1, . . . , J, to be

g_{j,i}(s*, {m_{j−1}(s, t) : s ∈ S}) = w^{(j,i)′} m_{j−1}(s*, t),   (5)

for weights w^{(j,i)} ∈ R^{n_{j−1}}; note that this is a function of s* only and of no other s ∈ S.
Graph neural networks learn representations of features on a graph G = (V, E), where V and E are a set of V vertices (numbered 1 to V) and corresponding edges, respectively. We represent our spatial domain S as a graph whereby the centroid of each polygon s ∈ S corresponds to a vertex and there exists a (possibly zero) weighted edge between all pairs of regions. The graphical structure can be encoded within a symmetric adjacency matrix A, where each element A[ij] ≥ 0 denotes the strength of connection between regions s_i and s_j. We define a graph without self-loops, such that the diagonal elements of A are exactly zero. A typical graph convolutional neural network (Kipf and Welling, 2017) can be constructed by setting g_{j,i}(·, ·) in (4), for all i = 1, . . . , n_j, j = 1, . . . , J, to

g_{j,i}(s*, {m_{j−1}(s, t) : s ∈ S}) = w^{(j,i)′} Σ_{s ∈ S} Ã[s*, s] m_{j−1}(s, t),   (6)

for estimable weights w^{(j,i)} ∈ R^{n_{j−1}}, and where Ã = D^{−1/2} A D^{−1/2} is the normalised adjacency matrix with D = diag(δ_1, . . . , δ_V) a diagonal matrix with diagonal elements δ_i set to the degree of the i-th vertex (i.e., the sum of elements in the i-th row of A). Unlike layers in a DNN, the GCNN layer in (6) pools information across regions in S that are neighbours of s*, with their contribution to the output determined by the weights in A. However, as A (and equivalently Ã) has zeroes along its diagonal, representation (6) does not exploit information at region s* directly when updating the features at the same region. We overcome this issue by adding a so-called "skip connection", which has been shown to improve training convergence of GCNNs (Xu et al., 2021). This is achieved by combining (5) and (6) and adopting, for all i = 1, . . . , n_j, j = 1, . . . , J,

g_{j,i}(s*, {m_{j−1}(s, t) : s ∈ S}) = w_1^{(j,i)′} Σ_{s ∈ S} Ã[s*, s] m_{j−1}(s, t) + w_2^{(j,i)′} m_{j−1}(s*, t),   (7)

for weights w_1^{(j,i)}, w_2^{(j,i)} ∈ R^{n_{j−1}}; we adopt (7) throughout. Note that equivalence between our GCNN and the DNN is achieved by setting either A to be a matrix of zeroes, that is, having a fully-disconnected graph, or fixing w_2^{(j,i)} ∈ R^{n_{j−1}} to have all-zero entries.
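The skip-connected layer map can be sketched in NumPy as follows. This is an illustrative sketch with arbitrary toy shapes and weights, not the paper's Spektral/Keras implementation; here the whole layer (all nodes at once) is written as one matrix product, with feature rows indexed by region.

```python
import numpy as np

def gcnn_layer(M_prev, A, W1, W2, b):
    # One hidden GCNN layer with a skip connection: pool the previous layer's
    # features over graph neighbours via the normalised adjacency
    # A_tilde = D^{-1/2} A D^{-1/2}, add each node's own features, apply ReLU.
    deg = A.sum(axis=1)
    inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.where(deg > 0, deg, 1.0)), 0.0)
    A_tilde = (A * inv_sqrt[:, None]) * inv_sqrt[None, :]
    return np.maximum(A_tilde @ M_prev @ W1 + M_prev @ W2 + b, 0.0)

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 3))          # V = 4 regions, 3 features from layer j-1
W1 = rng.normal(size=(3, 2))         # neighbour-pooling weights
W2 = rng.normal(size=(3, 2))         # skip-connection weights
b = np.zeros(2)
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])     # a simple path graph, no self-loops
out = gcnn_layer(M, A, W1, W2, b)
```

With A set to a matrix of zeroes the pooling term vanishes and the layer collapses to a standard densely-connected layer, matching the DNN-equivalence remark above.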
In typical applications of graph neural networks, the adjacency matrix A is composed only of binary weights, where A[ij] = 1 if and only if the i-th and j-th vertices are connected (Wu et al., 2020). In our setting, it may be more appropriate to assign non-binary weights to A that reflect the strength of dependence between covariates observed at regions s ∈ S. In classical geostatistics, this is assumed to be a monotonically decreasing function of pairwise distance h_ij = ∥s_i − s_j∥ for s_i, s_j ∈ S, where ∥·∥ is some distance metric (Diggle et al., 1998). In particular, we use the great-circle distance (measured in km) for ∥·∥. A parameterised distance metric accommodating directionality was also considered, but this did not lead to significant improvements in model fits.
For the weighted adjacency matrix A, we adopt a thresholded-kernel model in which, for i ≠ j, the weight A[ij] decays with distance as exp{−(h_ij/λ)^α} whenever h_ij ≤ ∆, and A[ij] = 0 otherwise,   (8)

for range λ > 0, shape α > 0, and cut-off distance ∆ ≥ 0. The cut-off distance ∆ induces sparsity in A; note that if ∆ = 0, then A has all-zero entries and equivalence between our GCNN and DNN is achieved. Note that A[ij] = 0 does not explicitly encode into the model that, for regions s_i and s_j such that ∥s_i − s_j∥ > ∆, covariates X(s_i, t) do not impact the distribution of the response Y(s_j, t). Information can pass between non-adjacent nodes in a graph neural network due to a phenomenon called "message passing" (Zhou et al., 2020), which occurs when composing multiple layers of the form given by (7). Due to computational constraints, the adjacency matrix A in (8) is not optimised during training of the neural networks characterising p_0(s, t) and σ(s, t); instead, its hyperparameters λ, α, and ∆ are manually optimised using cross-validation with a coarse grid search. To reduce computational time for hyperparameter optimisation, we constrain α to take only two values (i.e., 1 or 2); the case α = 2 is equivalent to the thresholded-Gaussian kernel proposed by Shuman et al. (2013).
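A small sketch of building such a weighted, thresholded adjacency matrix from polygon centroids follows. The kernel form and the haversine great-circle formula are our reading of the construction described in the text; the default hyperparameter values are the tuned values reported in Section 4, and the three centroids are purely illustrative.

```python
import math

def great_circle_km(p, q):
    # Haversine great-circle distance between (lat, lon) pairs in degrees.
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def adjacency(centroids, lam=650.0, alpha=2.0, delta=700.0):
    # Thresholded kernel weights: regions further apart than the cut-off
    # delta are disconnected, and the diagonal stays zero (no self-loops).
    V = len(centroids)
    A = [[0.0] * V for _ in range(V)]
    for i in range(V):
        for j in range(i + 1, V):
            h = great_circle_km(centroids[i], centroids[j])
            if h <= delta:
                A[i][j] = A[j][i] = math.exp(-((h / lam) ** alpha))
    return A

# Three toy centroids on the equator, one degree of longitude apart (~111 km).
A = adjacency([(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)])
```

Note that the double loop is O(V²); with V = 7,901 regions, the resulting matrix is large but, thanks to the cut-off ∆, sparse enough to store efficiently.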
The use of similarly weighted adjacency matrices in graph neural networks has been exploited previously by, for example, Li et al. (2018), Yu et al. (2018), and Yao et al. (2018) in the context of traffic prediction, where the data are point-referenced. We adopt a similar weighting mechanism for our areal data, and illustrate in Section 4 that it facilitates good model fits.

Inference
Inference is conducted by fitting the logistic model and the eGPD model (2) separately, and combining these to produce estimates of the full distribution in (1). The neural network representations of the occurrence probability p_0(s, t) and the eGPD scale σ(s, t) are trained using the R interface to Keras (Allaire and Chollet, 2021) and the Spektral Python package (Grattarola and Alippi, 2021); this procedure is implemented in the R package pinnEV (Richards, 2022). We use the Adaptive Moment Estimation (Adam) algorithm (Kingma and Ba, 2014), with default parameters and maximal mini-batch size, to minimise the negative log-likelihoods associated with p_0(s, t) and the eGPD model.
For the eGPD model, we note that the parameters κ, σ(s, t), and ξ are jointly estimated. For parameter uncertainty quantification, we use a stationary bootstrap (Politis and Romano, 1994).
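The index-resampling step of the stationary bootstrap can be sketched as follows. This is a generic sketch of the Politis and Romano (1994) scheme, not the paper's implementation; the mean block length of 12 months is an illustrative choice (the paper does not report its block parameter here).

```python
import random

def stationary_bootstrap_indices(n, mean_block, seed=0):
    # Politis-Romano stationary bootstrap: resample time indices in blocks
    # whose lengths are geometric with mean `mean_block`, wrapping circularly
    # so every resampled series has length n.
    rng = random.Random(seed)
    p = 1.0 / mean_block
    idx = [rng.randrange(n)]
    while len(idx) < n:
        if rng.random() < p:
            idx.append(rng.randrange(n))   # start a new block at a random time
        else:
            idx.append((idx[-1] + 1) % n)  # continue the current block
    return idx

# Resample 235 months (June 1999 to December 2018) with mean block length 12.
idx = stationary_bootstrap_indices(235, mean_block=12)
```

Resampling whole months in blocks preserves the short-range temporal dependence of the wildfire series within each bootstrap replicate, which an i.i.d. bootstrap would destroy.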

Overview
In order to improve numerical stability, prior to training we standardise each predictor by subtracting its marginal mean and dividing by its marginal standard deviation. To select the best-fitting model, and to optimise hyperparameters (i.e., A and the Adam learning rate) and the neural network architecture, we minimise goodness-of-fit metrics evaluated on the validation data. To assess the classification performance of the occurrence probability model (i.e., the logistic model for p_0(s, t) in (1)), we evaluate the area under the receiver operating characteristic curve (AUC). To compare fits of the eGPD model for the conditional radial wildfire spread Y(s, t) | Y(s, t) > 0, we utilize the threshold-weighted continuous ranked probability score (twCRPS; Gneiting and Ranjan, 2011), which is a proper scoring rule. The twCRPS can be estimated empirically using a sequence of monotonically increasing thresholds and an appropriate weight function. In our analysis, we use 22 irregularly spaced thresholds ranging from u_1 = 0.01 to u_22 = 200, where u_22 > max_{(s,t)∈S×T}{y(s, t)}, and set the weight function to r̃(x) = r(x)/r(u_19), where r(x) = 1 − {1 + (x + 1)²/10}^{−1/4}, which puts strong focus on calibration in the upper tail. The estimator of the twCRPS is then

twCRPS ≈ Σ_{(s,t)} Σ_{k=1}^{22} r̃(u_k) [F̂_(s,t),+(u_k) − 1{y(s, t) ≤ u_k}]²,

where the outer sum runs over validation pairs (s, t), F̂_(s,t),+(·) denotes the estimate of distribution function (2), and 1(·) is the indicator function.
To assess models in the bulk, we also evaluate the regular CRPS (i.e., the twCRPS with a constant weight function, r̃(x) ≡ 1).
We present model results as pointwise quantiles across 250 bootstrap samples. All neural networks are trained for 3,500 epochs, with the loss evaluated at each iteration using all available observations, that is, with maximal mini-batch size. The optimal architecture for the occurrence probability model p 0 (s, t) is a GCNN with J = 3 hidden layers and 18 nodes per layer (1,483 parameters). For the conditional spread severity σ(s, t) from the eGPD model, the optimal architecture is a simpler GCNN with J = 3, but with 8 nodes per layer (531 parameters), which is unsurprising as fewer data are available for inference with the eGPD model. For the adjacency matrix A in (8), we found that the optimal range λ, cut-off distance Δ, and shape α are 650 km, 700 km, and 2, respectively. The optimal hyperparameters are the same for both the occurrence probability and eGPD models. The medians (and 2.5% and 97.5% quantiles) of the bootstrap eGPD shape parameter estimates are κ = 0.831 (0.812, 0.851) and ξ = 0.161 (0.145, 0.175), suggesting that Australian burnt areas are heavy-tailed. A value of κ significantly different from one suggests that our model is not equivalent to a regular GPD model. Hence, we have constructed a more flexible model for the bulk and tails of wildfire burnt areas that obviates the need for intermediate threshold estimation.
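The exact functional form of the adjacency matrix in (8) is not reproduced in this section; as an illustration of how the three hyperparameters could interact, the sketch below uses a powered-exponential kernel with range λ and shape α, sparsified by the hard cut-off Δ. This specific kernel is an assumption, not the paper's definition.

```python
import numpy as np

def adjacency(dists, lam=650.0, delta=700.0, alpha=2.0):
    """Illustrative distance-based adjacency matrix: powered-exponential
    decay with range `lam` (km) and shape `alpha`, with entries beyond
    the cut-off `delta` (km) set to zero.  `dists` holds pairwise
    great-circle distances between region centroids."""
    A = np.exp(-(dists / lam) ** alpha)   # decay with distance
    A[dists > delta] = 0.0                # sparsify beyond the cut-off
    np.fill_diagonal(A, 1.0)              # self-loops, as is common for GCNs
    return A
```

Under this form, the cut-off Δ controls graph sparsity (and hence the computational cost of each graph convolution), while λ and α control how quickly neighbouring regions' influence decays.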
To investigate the increase in model performance when exploiting the GCNN over standard neural networks, we fit the logistic and eGPD models with a DNN in place of the GCNN (see Section 3.3). For fair comparison, we keep the complexity of both models comparable by ensuring that the DNN-based models use the same width and number of layers as the GCNN-based models (as above) and are fitted to the exact same bootstrap samples; we report the median (and 2.5% and 97.5% quantiles) of each diagnostic measure evaluated across all bootstrap samples. For the occurrence probability (logistic) model, the GCNN-based and DNN-based models provide excellent AUC values of 0.919 (0.918, 0.919) and 0.907 (0.906, 0.908), respectively. For the eGPD model, the GCNN-based and DNN-based models provide CRPS values of 16,643 (16,521, 16,770) and 16,658 (16,532, 16,809), respectively, and twCRPS values of 116,488 (114,763, 118,233) and 116,558 (114,464, 118,660), respectively. Whilst the difference in these model diagnostics is fairly moderate overall, we still observe that the use of the GCNN generally yields better performance, especially for the eGPD model. Unreported results also show that the validation loss for the GCNN-based models is consistently lower.
To visualise the goodness-of-fit of the conditional spread eGPD model, we present a pooled quantile-quantile (Q-Q) plot (Heffernan and Tawn, 2001) in Figure 3. Observations of non-zero radial spread Y (s, t) | {Y (s, t) > 0, X(s, t)} are transformed onto standard exponential or Gaussian margins using the estimated eGPD model. We then compare the empirical quantiles of the standardised data against their theoretical counterparts. This process is repeated for each bootstrap sample to build 95% tolerance intervals. Transforming the data onto standard margins allows us to better assess the fit, with the exponential and Gaussian scales used to better illustrate fits in the upper tail and bulk, respectively. We observe good model fits in both cases, as the estimated 95% tolerance bands include the diagonal.
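The transformation to standard margins is a probability integral transform through the fitted conditional distribution function. A sketch of the exponential-margin case used for the upper-tail diagnostics (the Gaussian case simply replaces the final map with the standard normal quantile function); function names are ours:

```python
import numpy as np

def to_exponential_margins(F_vals):
    """Map fitted cdf values F_vals in (0, 1), evaluated at the observed
    non-zero spreads, onto standard exponential margins via
    z = -log(1 - F), the inverse of the exponential cdf 1 - exp(-z)."""
    return -np.log1p(-np.asarray(F_vals))

def qq_points(z):
    """Sorted standardised data versus theoretical standard-exponential
    quantiles at plotting positions (i - 1/2)/n, for a pooled Q-Q plot."""
    z = np.sort(np.asarray(z))
    probs = (np.arange(1, len(z) + 1) - 0.5) / len(z)
    return z, -np.log1p(-probs)
```

If the fitted model is well calibrated, the returned empirical and theoretical quantiles lie close to the diagonal; repeating the computation per bootstrap sample yields the tolerance bands.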
Figure 3 illustrates good fits of our wildfire spread model globally, that is, over the entirety of Australia. Figure 4 presents a similar diagnostic (for the eGPD model), albeit using only data sampled within four population-dense communities, namely Tasmania, and areas encompassing Perth, Melbourne, and Sydney. We observe satisfactory local fits of the eGPD model within these population-dense communities, as there is generally strong alignment between the observed and expected quantiles, though some slight model overestimation of the quantiles is visible for Tasmania and Perth. This suggests that our newly proposed model adequately captures the distribution of the data at local levels, providing further evidence of its efficacy. Amongst the months we consider below is August 2011, when fires scorched 2.5 million hectares and destroyed 1,800 homes. These fires resulted from a mix of dry weather, strong winds, and lightning strikes, and coincided with a drought, which exacerbated the wildfire-prone conditions.
Figure 5 displays the median estimates (across all bootstrap samples) of the wildfire occurrence probability (i.e., p 0 (s, t)), the relative log-conditional spread severity log{σ(s, t)/√a(s)}, and a relative compound hazard measure, denoted by CH, which we take to be the 99% quantile of √{BA/a(s)}, the square-root of the burnt area divided by the polygon area; note that the latter two measures are standardised by the polygon area a(s) to remove its influence on their values and to provide a clearer indication of the areas most hazardous in terms of extreme wildfires.
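Quantiles of the full zero-inflated model are available in closed form, which makes hazard measures such as CH cheap to compute. The sketch below assumes the eGPD distribution function F(y) = {1 − (1 + ξy/σ)^{−1/ξ}}^κ and the mixture structure of (1), in which Y = 0 with probability 1 − p 0 (s, t); both are assumptions about the exact forms of (1)-(2), and function names are ours. Dividing the result by √a(s) gives the relative metric.

```python
import numpy as np

def egpd_quantile(p, kappa, sigma, xi):
    """Quantile of the assumed eGPD, obtained by inverting
    F(y) = {1 - (1 + xi*y/sigma)^(-1/xi)}^kappa."""
    return (sigma / xi) * ((1.0 - p ** (1.0 / kappa)) ** (-xi) - 1.0)

def compound_quantile(q, p0, kappa, sigma, xi):
    """q-quantile of the zero-inflated mixture: Y = 0 with probability
    1 - p0, and Y | Y > 0 follows the eGPD.  Below the atom at zero the
    quantile is exactly zero; above it, the eGPD quantile is rescaled."""
    if q <= 1.0 - p0:
        return 0.0
    return egpd_quantile((q - (1.0 - p0)) / p0, kappa, sigma, xi)
```

With a low occurrence probability p0, even the 99% quantile of the mixture can be zero, which is consistent with the largely low hazard estimates reported for most regions.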
The log-conditional spread severity can be interpreted as a relative measure of wildfire spread severity, given a wildfire occurrence, whilst the compound hazard measure combines information from both the spread and occurrence models.
For all selected months, we observe generally low occurrence probability estimates across most regions of Australia. In the January months, the locations with higher probability of wildfire occurrence (i.e., larger p 0 (s, t)) are generally concentrated in the western and southern regions of Australia, whereas in August 2011 we instead observe the highest estimates in the north-east. Across all selected months, we observe generally lower estimates of the conditional spread severity in the eastern regions of Australia and higher values in the northern areas. Similarly to p 0 (s, t), different spatial patterns are observed in January and August; the areas of lowest severity tend to be located in the south for August. We note that regions of high probability of wildfire occurrence do not necessarily correspond to regions of high conditional spread severity, suggesting different spatial trends in the behaviour of these two model components. Maps of the estimated compound hazard do not exhibit the same spatial patterns as those for p 0 (s, t) and σ(s, t). In January, there is little spatial coherence in the regions most at risk of wildfires; in August, the regions that experience the highest compound hazard are generally located along the north coast.
Figures 6-8 display estimates of the model summaries, detailed above, for the population-dense communities of Tasmania and the neighborhoods surrounding Perth, Melbourne, and Sydney, for the same three selected months detailed above. During the summer months, a noticeable trend emerges whereby the probability of wildfire occurrence (Figure 6) is higher in the regions surrounding the major cities. In contrast, the conditional spread severity (Figure 7) is notably higher within densely populated regions, particularly in Perth, which stands out as the sole area exhibiting any significant hazard during the winter season. We observe generally higher estimates of both the wildfire occurrence probability and conditional spread severity in the hotter January months across all considered communities. Amongst the regions, and across all months, Tasmania exhibits the lowest values of the estimated relative compound hazard metric (see Figure 8). However, we note that the Southwest National Park in Tasmania, a major national park known for its abundance of burnable material (Colhoun et al., 1999), exhibits the highest hazard within Tasmania. For the other regions, low compound hazard estimates are generally observed in the central, densely populated areas. Figure 9 illustrates slight positive and negative trends in the frequency and severity (conditional on an ignition), respectively, of wildfires across the time period. However, the resulting trend in the compound hazard metric is slightly negative, which may suggest that increasing wildfire frequency is not generating increased impact, due to decreasing severity. Figure 10 illustrates that there may be spatial variation in the temporal trends. We observe slight negative trends for the relative conditional spread severity across all four communities, but only Sydney appears to exhibit a negative trend in wildfire frequency. In Tasmania, positive trends in p 0 (s, t) and negative trends in σ(s, t)/√a(s) appear to cancel each other out, as (visually) there is no significant trend in CH.

Variable importance
To rank the importance of individual covariates in the estimation of p 0 (s, t) and σ(s, t), we use Deep Learning Important Features (DeepLIFT); see Shrikumar et al. (2017) and Yuan et al. (2022). This algorithm quantifies the impact of a covariate on the parameter estimates, in a post hoc manner, via contribution scores. Such scores measure the change in the partial gradient of the parameter function at location (s, t) (with respect to the i-th covariate at the same space-time location) evaluated for both the observed and a reference set of covariates. DeepLIFT can be viewed as a fast approximation to Shapley values, which have previously been used to interpret deep learning regression models (see, e.g., Lundberg and Lee, 2017; Dahal and Lombardo, 2023; Wikle et al., 2023).
We detail the derivation of the individual contribution scores for the parameter p 0 (s, t) only; however, they can be similarly derived for the conditional spread severity σ(s, t). Define g_{(s,t),i}(x*_t), i = 1, . . . , d, (s, t) ∈ S × T, as the scalar partial derivative ∂p 0 (s, t)/∂x_i(s, t) evaluated at the observed covariate set x*_t := {x(s, t) : s ∈ S} (for details on how p 0 (s, t) functionally relates to x_i(s, t), see Section 3.3). We then construct a reference set x^(0)_t, here taken to be x*_t but with all entries set to zero; recall from Section 4.1 that this corresponds to all covariates fixed to their respective mean values. Following Shun et al. (2020), the contribution score for the effect of predictor i at (s, t) is then CS_{i,(s,t)} = g_{(s,t),i}(x*_t){x_i(s, t) − x^(0)_i(s, t)}.
We can interpret CS_{i,(s,t)} as the amount of difference-from-mean in p 0 (s, t) that can be attributed to the difference-from-mean of x_i(s, t) (Shrikumar et al., 2017). We estimate the contribution scores for both p 0 (s, t) and σ(s, t), for all covariates i = 1, . . . , d, and for all space-time locations (s, t) ∈ S × T. Note that we cannot directly infer the direction of the effect of a covariate x_i(s, t), for a single (s, t), on a model parameter from its score CS_{i,(s,t)} alone, as the value of the latter depends on whether or not x_i(s, t) exceeds its marginal mean. Instead, we pool estimates of CS_{i,(s,t)} over all (s, t). As these estimates span zero, we rank the impact of covariates on the model parameters by the magnitude of the variability in their corresponding estimates.
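As a concrete sketch of this gradient-times-input-difference computation, consider the toy logistic model below, for which the partial gradients are available in closed form. The model, its weights, and the function names are illustrative only, not those of the fitted GCNN, and the score formula reflects our reading of the construction above.

```python
import numpy as np

def contribution_scores(grad_fn, x_obs, x_ref=None):
    """Gradient-times-input-difference scores, a simple first-order
    stand-in for DeepLIFT: CS_i = g_i(x_obs) * (x_obs_i - x_ref_i).
    With standardised covariates, the all-zero reference corresponds
    to every covariate fixed at its marginal mean."""
    x_ref = np.zeros_like(x_obs) if x_ref is None else x_ref
    return grad_fn(x_obs) * (x_obs - x_ref)

# Toy illustration: a logistic "occurrence" model p0(x) = sigmoid(w . x).
w = np.array([0.5, -1.0, 2.0])                         # illustrative weights
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
grad = lambda x: sigmoid(w @ x) * (1.0 - sigmoid(w @ x)) * w   # d p0 / d x
cs = contribution_scores(grad, np.array([1.0, 1.0, 0.0]))
```

Note that a covariate sitting exactly at its marginal mean receives a score of zero regardless of its weight, which is why the scores are pooled over all (s, t) and ranked by their variability rather than interpreted individually.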
In Figure 11, we present boxplots of the estimated contribution scores for both p 0 (s, t) and σ(s, t), pooled over all space-time locations. For the occurrence probability p 0 (s, t), estimates from the model suggest that all covariates, except the area a(s), have approximately equivalent importance. For the eGPD model, the five most important covariates for estimating the conditional spread severity σ(s, t) are the monthly means of temperature (TemperatureMe) and evaporation (EvaporationMe), the monthly maxima of both wind speed components (UwMx and VwMx), and NDVI. We note that characteristics of the land surface, including slope and aspect, have a comparatively less important impact on the conditional distribution of the monthly radial spread, given a wildfire occurrence. This difference between the covariate effects retrieved for the occurrence probability and eGPD models could be interpreted as both terrain and climatic influences being required to initiate a fire, whereas the spreading process may be mostly controlled by meteorological conditions and vegetation type. Interestingly, we find that monthly maxima of precipitation, temperature, and evaporation play a much weaker role than their corresponding monthly means, suggesting that it is typical climate conditions that drive wildfire spread, rather than climatic extremes.

Conclusion
We have developed a hybrid statistical deep-learning regression model for burnt areas from Australian wildfires. Our approach uses a semi-parametric regression model with graph convolutional neural networks and the extended generalized Pareto distribution. Whilst we focus on capturing only spatial structure in our data, graph neural networks that capture space-time structures do exist and their efficacy could be explored; one choice could be the recurrent graph convolutional neural network of Seo et al. (2018), which extends the regular convolutional long short-term memory network (Shi et al., 2015) to a graphical setting.
A further methodological pursuit is to allow the parameters in the adjacency matrix, A in (8), to be jointly estimable with the neural network weights, rather than optimising A using a grid search, or to consider different forms for A. We use the great-circle distance metric in (8) and, whilst we found this works well for our data, it is not designed to measure distances between polygons (which we treat as point locations).More appropriate forms for A could be chosen which account for the irregular lattice structure of our data, e.g., with elements A[ij] having more weight if regions s i and s j share a longer border.Alternatively, we could exploit graph neural networks that benefit from learnable edge weights (see, e.g., Jiang et al., 2020), where the initial specification for A may be somewhat arbitrary.
Finally, we note that, whilst our framework exploits spatial structure in the covariates, it does not explicitly account for dependence in the burnt area response Y (s, t). Hence, it cannot be used to quantify the joint risk of wildfires at different space-time locations (s, t), or to provide estimates of high quantiles for aggregates of wildfire burnt area over space-time regions. This limitation could be addressed by couching our deep eGPD regression model within a specialised spatial extremal dependence model (see, e.g., Davison and Huser, 2015; Huser and Wadsworth, 2020), which we leave for future consideration.

Figure 1 :
Figure 1: Maps of the monthly radial wildfire spread [√BA; km] (top-left), polygon area [km²] (top-right), monthly average air temperature [K] (bottom-left), and NDVI [unitless] (bottom-right) for January 2002. Grey regions in the top-left panel are Statistical Areas Level 1 within the Northern Territory where the response, BA, is not available.

Figure 2 :
Figure 2: Top panels: region-wise 5% (left) and 95% (right) quantiles of the monthly radial wildfire spread [√BA; km] pooled across the observation period. Grey regions are Statistical Areas Level 1 within the Northern Territory where the response, BA, is not available. Bottom: time series of the 5% (orange) and 95% (red) quantiles of √BA pooled across the entire spatial domain. Straight lines denote estimated least-squares linear trends.

Figure 2 presents a climatology of the observed wildfires. The top panels map the region-wise 5% and 95% quantiles of the monthly radial wildfire spread, pooled across the observation period. Similar models have been adopted in the context of modelling wildfires by, for example, Mendes et al. (2010), Turkman et al. (2010), Pimont et al. (2021), and Richards et al.

Figure 3: Pooled Q-Q plot for the eGPD model for the monthly radial wildfire spread on standard exponential margins (left) and standard Gaussian margins (right). Data are sampled across the entire spatial domain. 95% tolerance bounds are given by the blue dashed lines. Black points correspond to observations.
Figure 4: As Figure 3, but with data sampled only within the four population-dense communities of Tasmania, Perth, Melbourne, and Sydney.

Figure 6 :
Figure 6: Bootstrap median of the estimated probability of wildfire occurrence p 0 (s, t) [unitless] for January 2002 (top row), January 2005 (second row), and August 2011 (third row). Columns correspond to the spatial domains of Melbourne, Tasmania, Sydney, and Perth (left to right).

Figure 9 :
Figure 9: Time series of bootstrap median estimates of spatially-averaged hazard metrics: probability of wildfire occurrence p 0 (s, t) [unitless] (red), relative conditional spread severity σ(s, t)/√a(s) [unitless] (orange), and relative compound hazard metric CH [unitless] (blue). Straight lines denote estimated least-squares linear trends. The left-hand y-axis (black) corresponds to p 0 (s, t) and CH, whilst the right-hand y-axis (orange) corresponds to σ(s, t)/√a(s). Metrics are averaged over the entire spatial domain.

Figure 10 :
Figure 10: Time series of bootstrap median estimates of spatially-averaged hazard metrics: probability of wildfire occurrence p 0 (s, t) [unitless] (red), relative conditional spread severity σ(s, t)/√a(s) [unitless] (orange), and relative compound hazard metric CH [unitless] (blue). Straight lines denote estimated least-squares linear trends. The left-hand y-axis (black) corresponds to p 0 (s, t) and CH, whilst the right-hand y-axis (orange) corresponds to σ(s, t)/√a(s). Metrics are averaged over Melbourne, Tasmania, Sydney, and Perth (top to bottom rows).
Our deep learning framework, which combines graph convolutional neural networks with the extended generalized Pareto distribution, enables us to accurately model extreme wildfire spread observed over an irregular spatial domain. Using a novel dataset of Australian wildfires spanning June 1999 to December 2019, we analyze monthly burnt area over irregularly arranged polygons of approximately equal population density. We assess the proportional hazard attributed to wildfires in each polygon and identify those regions most susceptible to wildfires. We also quantify the relative importance of the considered covariates in the estimation of the distribution of wildfire occurrence and spread. Our framework facilitates excellent model fits by capturing the spatial structure exhibited by the covariates. We adopt a relatively simple, but powerful, convolution-based graph neural network, and so future work may include exploiting alternative graph neural networks to further improve model performance; for a recent review of such methods, see, for example, Zhou et al. (2020).
The stationary bootstrap is used to account for temporal dependence, and we set the expected size of the spatio-temporal blocks to k = 2 months. A single bootstrap sample is created by repeating the following until obtaining a sample of length greater than or equal to |T| months: draw a starting time t uniformly at random and a block length from a geometric distribution with mean k, and append the corresponding consecutive months (wrapping around the end of the observation period).
In order to mitigate over-fitting, we use a robust validation scheme. Each bootstrap sample is randomly partitioned into 80% training and 20% validation data by assigning space-time locations (s, t) to a validation set at random, with equal probability for each (s, t). Neural networks are then trained for a finite number of iterations by minimising the associated negative log-likelihood on the training data, but the final model fit is taken to be that which minimises the negative log-likelihood, estimated on the validation set, across all training iterations. For a single bootstrap sample, the same training and validation sets are used for estimation of p 0 (s, t) and the eGPD parameters.
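The random 80%/20% partition of space-time locations described above can be sketched as follows; the function name and mask construction are ours.

```python
import numpy as np

def train_val_split(n_locations, val_frac=0.2, seed=None):
    """Assign each space-time location, independently and with equal
    probability, to a 20% validation set; the remainder form the
    training set used to minimise the negative log-likelihood."""
    rng = np.random.default_rng(seed)
    val_mask = rng.random(n_locations) < val_frac
    return np.where(~val_mask)[0], np.where(val_mask)[0]
```

The validation indices are then held fixed across training iterations, so that the checkpoint minimising the validation negative log-likelihood can be selected as the final fit for both model components.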