Statistical Inference for Hüsler-Reiss Graphical Models Through Matrix Completions

The severity of multivariate extreme events is driven by the dependence between the largest marginal observations. The Hüsler-Reiss distribution is a versatile model for this extremal dependence, and it is usually parameterized by a variogram matrix. In order to represent conditional independence relations and obtain sparse parameterizations, we introduce the novel Hüsler-Reiss precision matrix. Similarly to the Gaussian case, this matrix appears naturally in density representations of the Hüsler-Reiss Pareto distribution and encodes the extremal graphical structure through its zero pattern. For a given, arbitrary graph we prove the existence and uniqueness of the completion of a partially specified Hüsler-Reiss variogram matrix so that its precision matrix has zeros on non-edges in the graph. Using suitable estimators for the parameters on the edges, our theory provides the first consistent estimator of graph structured Hüsler-Reiss distributions. If the graph is unknown, our method can be combined with recent structure learning algorithms to jointly infer the graph and the corresponding parameter matrix. Based on our methodology, we propose new tools for statistical inference of sparse Hüsler-Reiss models and illustrate them on large flight delay data in the U.S., as well as Danube river flow data.


Introduction
In statistical modelling, conditional independence and graphical models are well-established concepts for analyzing structural relationships in data (Lauritzen, 1996; Wainwright and Jordan, 2008). Particularly important are Gaussian graphical models, also known as Gaussian Markov random fields (Rue and Held, 2005). The graphical structure of a multivariate normal distribution with positive definite covariance matrix Σ can be read off from the zeros of its precision matrix Σ^{-1}.
For risk assessment in fields such as climate science, hydrology, or finance, the primary interest is in extreme observations, with attention to both the marginal tails and the dependence between multiple risk factors. In view of the growing complexity and dimensionality of modern data sets, sparsity and graphical models are becoming crucial notions for the analysis of extremes (e.g., Engelke and Ivanovs, 2021).
There are two different ways of defining graphical models for extreme value distributions. The first is based on max-linear models (Gissibl and Klüppelberg, 2018) and the second studies multivariate Pareto distributions (Engelke and Hitz, 2020). We follow the second approach since its notions of conditional independence and extremal graphical models link naturally to the well-known Hammersley-Clifford theorem for density factorizations. Moreover, in the case of tree graphs, Segers (2020) shows that the extremes of regularly varying Markov trees converge to these extremal tree models. Lee and Joe (2018) propose parsimonious models for extreme value copulas; the link with extremal graphical models is made in Asenova et al. (2021).
The class of extremal graphical models with Hüsler-Reiss Pareto distributions is of particular interest. In d dimensions, the parameter of this family is a variogram matrix Γ ∈ R d×d . Because of their flexibility and stochastic properties, Hüsler-Reiss distributions can be seen as the counterpart of the Gaussian family for multivariate extremes. In combination with extremal graphical models, the Hüsler-Reiss family constitutes a powerful tool for sparse extreme value modelling, with many open questions still to explore.
For a connected, undirected graph G = (V, E) with nodes V = {1, . . . , d} and edges E, Engelke and Hitz (2020) show that such a distribution's graphical structure can be read off from a set of (d − 1) × (d − 1) precision matrices Θ^{(k)}, for k ∈ V. While zeros in Θ^{(k)} correspond to extremal conditional independence of nodes i, j ≠ k, the information on edges involving the kth node is encoded only indirectly through the row sums of this matrix. A natural question, appearing also in the discussion of Engelke and Hitz (2020), is whether there exists a symmetric approach involving a single d × d precision matrix.
Statistical inference for Hüsler-Reiss graphical models has so far been limited to the simple structures of trees and block graphs (Engelke and Volgushev, 2020; Asenova and Segers, 2023). The parameter matrix Γ is then additive on the graph, and the maximum likelihood estimator is an explicit combination of the bivariate estimators on the edges. Since block graphs lack flexibility for general applications, several discussion contributions of Engelke and Hitz (2020) have emphasized the need for estimators suitable for more general graphs.
In this paper, we obtain new theoretical results on Hüsler-Reiss distributions that answer the two open questions above and enable statistical inference for extremal graphical models on decomposable and non-decomposable graphs. Our main contributions are threefold. Firstly, in Section 3, we introduce the Hüsler-Reiss precision matrix Θ ∈ R^{d×d} with entries Θ_{ij} = Θ^{(k)}_{ij} for any k ≠ i, j, a definition that, surprisingly, is independent of the particular choice of k ∈ V. This positive semi-definite matrix indeed reflects the sparsity of the extremal graph by zero off-diagonal entries. We give several characterizations of this matrix, one of them as the Moore-Penrose inverse of a projection of the parameter matrix Γ.
Secondly, we study how a Hüsler-Reiss distribution on a given, general graph G = (V, E) with fixed marginal distributions on the edges of the graph can be constructed. Thanks to the new Hüsler-Reiss precision matrix, this task can be framed as a matrix completion problem: find a conditionally negative definite variogram matrix Γ that has specified values Γ_{ij} in the entries corresponding to the edges (i, j) ∈ E of the graph G and whose precision matrix Θ has zeros in the remaining entries, i.e., Θ_{ij} = 0 for (i, j) ∉ E. A concrete example in dimension d = 4 for the graph in Figure 1b is given in (1.1), where the completion problem corresponds to finding the entries marked with "?". In Section 4, we show that such a completion exists and is unique. Our results can be seen as a semi-definite extension of matrix completion problems for the covariance matrix of Gaussian distributions studied in Speed and Kiiveri (1986) and Bakonyi and Woerdeman (2011).

Thirdly, we leverage our theoretical results to provide effective statistical tools for extremal graphical models. The Hüsler-Reiss precision matrix allows us to represent the maximum (surrogate-)likelihood estimate of Γ on the graph G as the maximizer of a constrained optimization problem in which Γ̂ is the empirical variogram (Engelke and Volgushev, 2020) and |·|_+ denotes the pseudo-determinant. In Section 5, we prove that the solution to this optimization problem is given by the solution of the above matrix completion problem. We further combine recent structure learning methods (Engelke et al., 2022c) with our completion to yield the first estimator that is jointly sparsistent for the extremal graph and consistent for the model parameters. Our methodology enables new model assessment plots that allow for model interpretation and comparison of models with different degrees of sparsity. This is illustrated in a case study of large delays in the domestic U.S. air travel network in Section 6.
The supplementary material contains another case study, together with mathematical details and proofs. The Hüsler-Reiss precision matrix and the properties we derive in this paper have already been used for multiple purposes, including the parameterization of sparse statistical models (Röttger and Schmitz, 2023;Röttger et al., 2023a), structure estimation (Engelke et al., 2022c;Wan and Zhou, 2023), efficient statistical inference (Röttger et al., 2023b;Lederer and Oesting, 2023), and a characterization in terms of pairwise interaction models (Lalancette, 2023).

Extremal graphical models

Multivariate generalized Pareto distributions
Multivariate extreme value theory studies the tail behavior of a random vector X = (X_1, . . . , X_d). A first summary of the extremal dependence structure of the bivariate margins for i, j ∈ V := {1, . . . , d} is the extremal correlation

    χ_{ij} = lim_{q→1} P(F_i(X_i) > q | F_j(X_j) > q),    (2.1)

defined whenever the limit exists and where F_i is the distribution function of X_i. When χ_{ij} > 0 we say that X_i and X_j are asymptotically dependent, and when χ_{ij} = 0 we speak of asymptotic independence. In the former case, there are two different, but closely related approaches for modelling extremal dependence: through componentwise maxima of independent copies of X, leading to max-stable distributions (de Haan and Resnick, 1977); and through threshold exceedances of X, resulting in multivariate generalized Pareto distributions (Rootzén and Tajvidi, 2006). Here, we concentrate on the threshold exceedance approach since it is well-suited for graphical modelling (Engelke and Hitz, 2020; Segers, 2020). For statistical models for asymptotic independence, we refer to Heffernan and Tawn (2004), for instance, and to Papastathopoulos et al. (2017) in the context of extremes of Markov chains, as well as Casey and Papastathopoulos (2023) for extremes in decomposable graphical models.
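In practice, the extremal correlation is estimated by replacing the marginal distribution functions with their empirical counterparts and fixing a high quantile level q < 1. A minimal numpy sketch (the function name and the rank-based approximation are ours, not from the paper):

```python
import numpy as np

def empirical_extremal_correlation(x, y, q=0.95):
    """Empirical estimate of chi_ij = lim P(F_i(X_i) > q | F_j(X_j) > q),
    with the limit replaced by a fixed high quantile level q."""
    n = len(x)
    # rank transforms approximate F_i(X_i) and F_j(X_j)
    u = np.argsort(np.argsort(x)) / n
    v = np.argsort(np.argsort(y)) / n
    # P(U > q, V > q) / P(V > q), with P(V > q) ~ 1 - q
    return np.mean((u > q) & (v > q)) / (1 - q)
```

For asymptotically dependent data the estimate is close to one; under exact independence the finite-level estimate is close to 1 − q rather than to its limit zero, which illustrates the bias of working at a fixed level q.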
To abstract from the univariate marginal distributions and concentrate on the extremal dependence, it is usually assumed that all variables X_i follow a given continuous distribution. Throughout, we use standard exponential margins, that is, P(X_i ≤ x) = 1 − exp(−x) for x ≥ 0 and i ∈ V. Let 0, 1, and ∞ denote vectors of adequate size with all elements equal to 0, 1, and ∞, respectively. A random vector Y = (Y_1, . . . , Y_d) is said to follow a multivariate generalized Pareto distribution (Rootzén and Tajvidi, 2006) if for any z ∈ L = {x ∈ R^d : x ≰ 0}, we have

    P(Y ≤ z) = lim_{u→∞} P(X − u1 ≤ z | X ≰ u1),    (2.2)

for some random vector X, which is then said to be in the domain of attraction of Y; the simple normalization by subtracting u1 comes from the assumption of exponential margins of X. Multivariate generalized Pareto distributions are threshold-stable in the sense that for a vector a ≥ 0, the conditional random vector Y − a given Y ≰ a is again multivariate generalized Pareto. This also implies useful stochastic representations; see Rootzén et al. (2018). The tail behavior of X is described by its exponent measure, the measure Λ on L arising as the limit Λ(A) = lim_{u→∞} e^u P(X − u1 ∈ A).

We assume Λ to be absolutely continuous with respect to the d-dimensional Lebesgue measure and let λ denote its Radon-Nikodym derivative. The set of valid exponent measure densities λ is characterized by two properties: homogeneity under shifts along the diagonal, λ(y + t1) = e^{−t} λ(y) for all t ∈ R, and the normalization ∫_{x_i > 0} λ(x) dx = 1 for every i ∈ V. Since the distribution of Y is proportional to the restriction of Λ to L, its density f then also exists and is proportional to the exponent measure density λ as f(y) = λ(y)/Λ_c(0) for all y ∈ L, since Λ_c(0) = Λ(L). When considering marginal distributions of multivariate generalized Pareto distributions, it is often more useful to work with the limit distributions that arise in (2.2) by replacing X with the sub-vector X_I, also in the conditioning event, for some non-empty I ⊂ {1, . . . , d}. The resulting "generalized Pareto marginal distribution", here denoted Y^{(I)}, is supported on L_I = {x ∈ R^{|I|} : x ≰ 0}, and its corresponding exponent measure and density are the actual marginals Λ_I and λ_I.
Note that Y^{(I)} is equal in distribution to Y_I conditioned on the event Y_I ≰ 0. See Rootzén and Tajvidi (2006, Section 4) or Rootzén et al. (2018, Section 6) for a detailed discussion of these marginals. Due to the choice of standard exponential univariate margins for X, the univariate generalized Pareto marginals Y^{({i})} =_d (Y_i | Y_i > 0) are standard exponential as well. For a discussion of how to transform exponent measure densities between different choices of margins, see Remark S.5.12.

Hüsler-Reiss distributions
A popular class of multivariate generalized Pareto distributions is the class of Hüsler-Reiss distributions, which are parametrized by symmetric conditionally negative definite matrices, here denoted by Γ (Hüsler and Reiss, 1989; Kabluchko et al., 2009). The set of these so-called variogram matrices is defined as

    D_d = {Γ ∈ [0, ∞)^{d×d} : Γ = Γ^⊤, diag(Γ) = 0, a^⊤Γa < 0 for all a ≠ 0 with 1^⊤a = 0}.

Variogram matrices are closely related to covariance matrices and can be constructed from a centered random vector W with covariance matrix Σ as

    Γ_{ij} = E[(W_i − W_j)^2] = Σ_{ii} + Σ_{jj} − 2Σ_{ij}, i, j ∈ V.    (2.4)

Notably, this so-called covariance transform (Farris et al., 1970) is not injective. One collection of covariance matrices corresponding to a given Γ, used for example in Engelke and Hitz (2020), can be obtained for k ∈ V as

    Σ̃^{(k)}_{ij} = (Γ_{ik} + Γ_{jk} − Γ_{ij})/2, i, j ∈ V.    (2.5)

The kth row and column of these matrices are identically zero. Let Σ^{(k)} ∈ R^{(d−1)×(d−1)} denote the matrices with these entries removed, conveniently indexed by V ∖ {k}, and let φ_k denote the corresponding mapping Γ ↦ Σ^{(k)}, with φ̃_k(Γ) = Σ̃^{(k)} its d × d counterpart.
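The covariance transform and the grounded matrices Σ̃^{(k)} are straightforward to vectorize. A small numpy sketch (function names are ours; illustration only):

```python
import numpy as np

def gamma_transform(S):
    """Covariance transform (2.4): Gamma_ij = S_ii + S_jj - 2 S_ij,
    i.e., Gamma_ij = E[(W_i - W_j)^2] for Cov(W) = S."""
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2 * S

def sigma_tilde_k(Gamma, k):
    """The matrix (2.5) with entries (Gamma_ik + Gamma_jk - Gamma_ij) / 2;
    its k-th row and column are identically zero."""
    return 0.5 * (Gamma[:, [k]] + Gamma[[k], :] - Gamma)
```

The non-injectivity is visible directly: `gamma_transform(sigma_tilde_k(Gamma, k))` recovers Γ for every k, although the matrices `sigma_tilde_k(Gamma, k)` themselves differ.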

Definition 2.1. The Hüsler-Reiss Pareto distribution with parameter matrix Γ ∈ D_d is the multivariate generalized Pareto distribution whose exponent measure density is, for any k ∈ V,

    λ(y; Γ) = e^{−y_k} φ_{d−1}( y_{∖k} − y_k 1 + Γ_{∖k,k}/2 ; Σ^{(k)} ), y ∈ R^d,

where φ_p(·; Σ) denotes the density of a centered p-dimensional normal distribution with covariance matrix Σ.

Notably, this expression does not depend on the choice of k (e.g., Engelke et al., 2015, Theorem 3.3). The marginals Y^{(I)} (in the generalized Pareto sense of Section 2.1) of a Hüsler-Reiss Pareto distributed vector with variogram matrix Γ are also Hüsler-Reiss Pareto distributed, with variogram matrix Γ_{I×I} (Engelke and Hitz, 2020, Example 7).
The popularity of Hüsler-Reiss distributions has theoretical and practical origins. From known properties about this model class, it becomes obvious that the Hüsler-Reiss family takes the same role among multivariate generalized Pareto distributions as the Gaussian distribution in the non-extreme setting. The results of our paper will further shed light on this analogy. Moreover, Lalancette (2023) recently showed that Hüsler-Reiss distributions constitute the only pairwise interaction models among multivariate generalized Pareto distributions. From a practical point of view, Hüsler-Reiss distributions have become a standard model in multivariate extreme value modeling because of their flexibility and tractability. Importantly, they are the finite dimensional distributions of the widely used spatial Brown-Resnick (Kabluchko et al., 2009) processes. We refer to Davison et al. (2012) and Engelke and Ivanovs (2021) for overviews of applications of these models.

Extremal conditional independence
In classical statistics, graphical models are defined through conditional independence relations between the components of a random vector (e.g., Lauritzen, 1996). Since the support L of a multivariate generalized Pareto distribution is not a product space, this definition is impractical in our setting. Instead, Engelke and Hitz (2020) introduce a novel notion of extremal conditional independence in terms of the vectors Y^{(k)}, k ∈ {1, . . . , d}, defined as the multivariate generalized Pareto vector Y conditioned on the event {Y_k > 0} and supported on the product spaces L^{(k)} = {x ∈ L : x_k > 0} = {x ∈ R^d : x_k > 0}. The extremal version of conditional independence between sub-vectors Y_A and Y_B given Y_C, for non-empty, disjoint sets A ∪ B ∪ C = V, is then defined as

    Y_A ⊥_e Y_B | Y_C  :⟺  Y^{(k)}_A ⊥ Y^{(k)}_B | Y^{(k)}_C for all k ∈ V.    (2.6)

We consider the index set V as a set of nodes and let E ⊆ E(V) be the set of undirected edges of a connected graph G = (V, E), where E(V) denotes the set of all possible edges. Figure 1 shows four different graph structures in increasing generality; see Section S.1.3 for definitions of the related terms. We say that Y follows an extremal graphical model on G if Y_i ⊥_e Y_j | Y_{V∖{i,j}} for all (i, j) ∉ E. Engelke and Hitz (2020) show that the existence of a positive and continuous exponent measure density λ of Y implies connectedness of the corresponding graph G and the equivalence of (2.6) to a factorization of this density. If Y is an extremal graphical model on a decomposable graph G, this result, applied recursively, yields a Hammersley-Clifford theorem stating that the exponent measure density of Y factorizes on the graph into lower-dimensional marginal densities; for an extension to unconnected graphs, see Engelke et al. (2022b).

The extremal conditional independence Y_A ⊥_e Y_B | Y_C has an intuitive interpretation in terms of classical conditional independence. Suppose that we observe an extreme event in the variables corresponding to C, that is, max_{j∈C} Y_j > 0; then, conditionally on Y_C, the sub-vectors Y_A and Y_B are independent in the usual sense.
For an extremal graphical structure G, this translates into an extremal local prediction property: conditionally on an extreme event in the neighborhood of a node i, the value of Y_i depends on the remaining variables only through its neighbors Y_{δ(i)}, where δ(i) denotes the neighborhood of vertex i in G (see Section S.1.3). Thus, once an extreme event has occurred in the neighborhood of Y_i, it suffices to know the values of the neighbors of Y_i for predicting its value.

Hüsler-Reiss precision matrix
An important property of Gaussian graphical models with covariance matrix Σ is that the conditional independence structure can be read off from the zeros in the (Gaussian) precision matrix Σ^{-1}. Hüsler-Reiss graphical models satisfy a similar property. Indeed, let Y be a Hüsler-Reiss Pareto vector with parameter matrix Γ ∈ D_d that is an extremal graphical model on the undirected graph G = (V, E). Engelke and Hitz (2020) show that the graph G is necessarily connected and that the graphical structure can be read off from the precision matrices Θ^{(k)} = (Σ^{(k)})^{-1} for any k ∈ V; in the following we index these (d − 1) × (d − 1) matrices by V ∖ {k} for the sake of simpler notation. Two distinct nodes i, j ≠ k are extremal conditionally independent if and only if the entry Θ^{(k)}_{ij} is zero. If one of the nodes, say j, is equal to k, extremal conditional independence is equivalent to the row sum ∑_{l≠k} Θ^{(k)}_{il} being zero. Each Θ^{(k)} thus contains all information on conditional independence of Y. A natural question, already raised in the discussion of Engelke and Hitz (2020), is whether there is a single d × d precision matrix that contains this information but does not require a choice of k. In fact, such a matrix can be defined as follows.
Definition 3.1. For Γ ∈ D_d, the Hüsler-Reiss precision matrix is the matrix Θ ∈ R^{d×d} with entries Θ_{ij} = Θ^{(k)}_{ij} for i, j ∈ V and an arbitrary k ∈ V ∖ {i, j}.

Lemma 1 and Proposition 3 in Engelke and Hitz (2020) imply that the matrix Θ is well-defined and represents extremal conditional independence of distinct nodes i, j ∈ V in the corresponding Hüsler-Reiss model by

    Y_i ⊥_e Y_j | Y_{V∖{i,j}}  ⟺  Θ_{ij} = 0.

While Definition 3.1 is a natural way to jointly represent the information contained in the Θ^{(k)} matrices, it remains to be shown that the matrix Θ allows for useful mathematical representations. In order to give a first such representation, let (·)^+ denote the Moore-Penrose inverse (see Section S.1.1 for details), let I_d be the d × d identity matrix, write e_d = d^{-1} 1, and let Π = I_d − 1 e_d^⊤ be the d × d centering matrix, i.e., the projection matrix onto the orthogonal complement of 1. Recall (2.4), which can be written as a mapping for arbitrary matrices S ∈ R^{d×d} as

    γ(S)_{ij} = S_{ii} + S_{jj} − 2S_{ij}, i, j ∈ V;    (3.2)

see Definition S.5.9 and Lemma S.5.10 for details.
Proposition 3.2. Let Γ ∈ D_d and S ∈ R^{d×d} satisfy γ(S) = Γ. Then

    ΠSΠ = Π(−Γ/2)Π.    (3.3)

Furthermore, the matrix Θ from Definition 3.1 satisfies

    Θ = (ΠSΠ)^+.    (3.4)

Since the mapping γ preserves (a)symmetry and Γ is symmetric, the condition γ(S) = Γ requires S to be a symmetric matrix as well. It turns out that, restricted to symmetric matrices, (3.3) is in fact equivalent to γ(S) = Γ; see Lemma S.5.1.
In practice, the matrix S is usually a (definite or semi-definite) covariance matrix, so it is natural to consider only symmetric matrices here. For the sake of completeness, we show in the proof of Proposition 3.2 that (3.3) is a sufficient and necessary condition for (3.4), also allowing asymmetric matrices S. Lemma S.5.8 shows that (3.3) is further equivalent to S being a generalized inverse of Θ in the sense ΘSΘ = Θ.
In light of (3.3), an obvious choice of S for a given Γ is S = −Γ/2, while other interesting choices for S are the matrices Σ̃^{(k)} = φ̃_k(Γ) defined in (2.5). Furthermore, these matrices satisfy condition (3.3), so that (3.4) applies to them as well. Further connections between Γ, Σ̃^{(k)}, and Θ are investigated in Wan and Zhou (2023, Propositions 3.1 and 3.2).
To better understand representation (3.4), let P^1_d ⊂ R^{d×d} denote the set of symmetric positive semi-definite matrices with kernel equal to span({1}). Let σ and θ be the mappings from a variogram matrix Γ to the corresponding matrices Σ and Θ, respectively:

    σ(Γ) = Π(−Γ/2)Π  and  θ(Γ) = σ(Γ)^+.

Proposition 3.3. The mappings σ and θ are homeomorphisms between D_d and P^1_d. Using γ(·) defined in (3.2), their inverses are

    σ^{-1}(Σ) = γ(Σ)  and  θ^{-1}(Θ) = γ(Θ^+).

From this proposition it follows that the matrix Θ from Definition 3.1 is always positive semi-definite with kernel equal to span({1}). Furthermore, the class of Hüsler-Reiss distributions, parametrized by the set D_d of conditionally negative definite matrices, can just as well be parametrized by P^1_d, interpreted either in the role of Θ or Σ. As discussed after Proposition 3.4, the matrix Σ is the degenerate covariance matrix of a particular transformation of the Hüsler-Reiss Pareto vector Y. Similarly to the Gaussian case, the precision matrix Θ can be obtained from this covariance matrix as Θ = Σ^+, using the Moore-Penrose inverse due to its singularity.
For a given Γ, the characterizations in Proposition 3.2 give a straightforward way to compute Θ in Definition 3.1 as Θ = θ(Γ) = (Π(−Γ/2)Π)^+ and thus to retrieve the conditional independence structure of the corresponding extremal graphical model. In a Gaussian model, another property of the precision matrix is that it can be used to express the probability density function in a concise way. The following result shows that a similar expression is also possible for Hüsler-Reiss distributions.
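In code, computing Θ and reading off the graph reduce to a projection and a Moore-Penrose inverse. A numpy sketch (function names and the zero tolerance are ours; illustration only):

```python
import numpy as np

def hr_precision(Gamma):
    """Hüsler-Reiss precision matrix Theta = (Pi (-Gamma/2) Pi)^+,
    with Pi = I_d - 1 1^T / d the centering projection."""
    d = Gamma.shape[0]
    Pi = np.eye(d) - np.ones((d, d)) / d
    return np.linalg.pinv(Pi @ (-0.5 * Gamma) @ Pi)

def extremal_graph(Gamma, tol=1e-8):
    """Edges (i, j) of the extremal graphical model: Theta_ij != 0."""
    Theta = hr_precision(Gamma)
    d = Gamma.shape[0]
    return [(i, j) for i in range(d) for j in range(i + 1, d)
            if abs(Theta[i, j]) > tol]
```

For instance, for the tree-structured variogram with Γ_{12} = Γ_{23} = 1 and Γ_{13} = 2 (additive along the path 1–2–3), the recovered edge set consists of the two path edges only, since Θ_{13} = 0.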
Proposition 3.4. Let Γ ∈ D_d and Θ = θ(Γ). Then the exponent measure density λ(·; Γ) from Definition 2.1 can be expressed explicitly in terms of Θ. The resulting expression is remarkably similar to a degenerate Gaussian density on the hyperplane {1}^⊥, differing only by the value of the constant and the term exp(−y^⊤e_d).
In fact, when restricting the support to the half-space {x ∈ R^d : 1^⊤x ≥ 0}, this density is proportional to the probability density function of a random vector Y_1 = W_Σ + R1, with R ∼ Exp(1) and W_Σ a degenerate normal vector with covariance matrix Σ = Π(−Γ/2)Π, independent of R. A careful look at other characterizations of λ such as in Wadsworth and Tawn (2014), or the slightly more general definitions of the Hüsler-Reiss exponent measure density in Ho and Dombry (2019) and Kiriliouk et al. (2018), shows that the precision matrix Θ appears naturally in these parameterizations as well.
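This representation immediately suggests how to simulate from the density on the half-space: draw a degenerate Gaussian vector with covariance Σ = Π(−Γ/2)Π and add an independent Exp(1) variable along the diagonal 1. A sketch (the sampler name is ours; it illustrates the representation only, not a sampler for the Pareto distribution itself):

```python
import numpy as np

def sample_hr_representation(Gamma, n, seed=None):
    """Draw Y_1 = W_Sigma + R * 1 with R ~ Exp(1) independent of the
    degenerate normal W_Sigma with Cov(W_Sigma) = Pi (-Gamma/2) Pi."""
    rng = np.random.default_rng(seed)
    d = Gamma.shape[0]
    Pi = np.eye(d) - np.ones((d, d)) / d
    Sigma = Pi @ (-0.5 * Gamma) @ Pi
    # the svd method handles the singular (rank d-1) covariance matrix
    W = rng.multivariate_normal(np.zeros(d), Sigma, size=n,
                                check_valid="ignore", method="svd")
    R = rng.exponential(size=(n, 1))
    return W + R  # R * 1, broadcast across coordinates
```

Since W_Σ lies in {1}^⊥, the coordinate sum of every sample equals d·R > 0, so all samples indeed fall in the half-space 1^⊤x ≥ 0.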
Remark 3.5. Besides the extremal conditional independence structure and exponent measure density, the precision matrix Θ is also useful to describe other stochastic properties of Hüsler-Reiss distributions. Based on the results of the present paper, Röttger et al. (2023b) and Röttger and Schmitz (2023) show that multivariate total positivity of order two, a notion of positive dependence, can be encoded as Θ_{ij} ≤ 0 for all i ≠ j. Engelke et al. (2022c) suggest that the precision matrix can be used to estimate extremal graphical structures by penalizing its L^1-norm ∥Θ∥_1, and a similar approach is taken in Wan and Zhou (2023) and Lederer and Oesting (2023). Compared to the Gaussian case, a difficulty of dealing with Θ in mathematical derivations and statistical implementations is that it is not invertible.

Matrix completion problems
A well-studied problem related to Gaussian graphical models is the task of constructing a model with a given graphical structure and specified marginal distributions on the fully connected subsets of vertices, or equivalently, to complete a partially defined covariance matrix such that its precision matrix has zeros where there are no edges in the specified graph (Speed and Kiiveri, 1986;Bakonyi and Woerdeman, 2011). For extremal Hüsler-Reiss models, the same problem can be posed, expressed as a matrix completion problem on the variogram and precision matrix.
To formalize the notion of a partially specified matrix, we define the set R̄ := R ∪ {"?"}, consisting of the real numbers and the placeholder "?" for unspecified values (see e.g., Bakonyi and Woerdeman, 2011, for this use of "?"). For an undirected graph G = (V, E), a matrix is said to be "specified on G" if it is specified on the diagonal and the entries corresponding to the edges of G. A matrix is said to be "partially conditionally negative definite" if it is symmetric, its diagonal is fully specified, and all fully specified principal submatrices are conditionally negative definite (a principal submatrix is any submatrix obtained by removing the same index set from both the columns and rows of the matrix). In computations involving partially specified matrices, an entry in the result is "?" as soon as any of the entries used to compute it is itself "?". For definitions of graph theoretical concepts used in this section, we refer to Section S.1.3.
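Using np.nan as the placeholder "?", partial conditional negative definiteness can be checked directly for small d by enumerating all principal submatrices. A brute-force sketch (exponential in d; function names and tolerance are ours):

```python
import numpy as np
from itertools import combinations

def is_cnd(G, tol=1e-10):
    """Strict conditional negative definiteness: a^T G a < 0 for all
    a != 0 with 1^T a = 0; equivalently Pi(-G/2)Pi is psd of rank d-1."""
    d = G.shape[0]
    if d < 2:
        return True
    Pi = np.eye(d) - np.ones((d, d)) / d
    ev = np.linalg.eigvalsh(Pi @ (-0.5 * G) @ Pi)
    return ev[1] > tol  # ev[0] ~ 0 corresponds to the kernel vector 1

def is_partially_cnd(Gbar):
    """Partial matrix check: fully specified diagonal and every fully
    specified principal submatrix conditionally negative definite."""
    d = Gbar.shape[0]
    if np.isnan(np.diag(Gbar)).any():
        return False
    for r in range(2, d + 1):
        for idx in combinations(range(d), r):
            sub = Gbar[np.ix_(idx, idx)]
            if not np.isnan(sub).any() and not is_cnd(sub):
                return False
    return True
```

This mirrors the definition literally; submatrices containing a "?" are simply skipped, which is exactly why partial definiteness does not guarantee a full completion in the non-decomposable case (see Example 4.8).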
Definition 4.1. Let G = (V, E) be an undirected graph and let Γ̄ ∈ R̄^{d×d} be a partially conditionally negative definite matrix, specified on G. The corresponding matrix completion problem is to find a conditionally negative definite matrix Γ ∈ D_d and Θ = θ(Γ) such that

    Γ_{ij} = Γ̄_{ij} for (i, j) ∈ Ē,  and  Θ_{ij} = 0 for (i, j) ∉ Ē,    (4.1)

where Ē denotes the edge set E augmented by the diagonal entries {(i, i) : i ∈ V}.
An example of such a matrix completion problem on the graph in Figure 1b is given in (1.1). In the Gaussian case, a similar problem can be posed with the covariance matrix Σ instead of Γ and with Θ = Σ^{-1}. For this positive definite completion problem, an explicit solution for decomposable graphs and a convergent algorithm for general graphs are given in Speed and Kiiveri (1986). In this section we discuss semi-definite matrix completion problems for Hüsler-Reiss models as in Definition 4.1, starting from simple graph structures such as trees and finishing with general graphs. We assume throughout that the graph G is connected, since only connected graphs can be associated with non-degenerate Hüsler-Reiss models; see Section 2.3.
Example 4.2. Block graphs are simple graph structures whose separator sets contain only single nodes. Trees are a special case of this class in which all cliques consist of exactly two nodes. The particular structure of block graphs makes them appealing for the construction of parametric models and for statistical inference (Engelke and Hitz, 2020; Asenova et al., 2021). In fact, this structure also yields a simple explicit solution to the matrix completion problem in Definition 4.1. If Γ̄ is a partially conditionally negative definite matrix on the connected block graph G = (V, E), then a unique completion exists (Engelke and Hitz, 2020, Proposition 4) and can be expressed as

    Γ_{ij} = ∑_{(k,l) ∈ path(i,j)} Γ̄_{kl},

where path(i, j) is the unique shortest path between i and j in G; see also Engelke and Volgushev (2020) and Asenova and Segers (2023). Using this result, the missing entries in (1.1) can be computed to be Γ_{13} = Γ_{31} = 13 and Γ_{34} = Γ_{43} = 12.
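For trees (block graphs whose cliques all have size two), the path-sum completion can be implemented with a breadth-first search from every node. A sketch, assuming the specified entries are given as a dictionary of edge values (names are ours):

```python
import numpy as np
from collections import deque

def tree_completion(d, edges, values):
    """Complete a variogram on a tree: Gamma_ij is the sum of the edge
    values along the unique path between i and j (Example 4.2)."""
    adj = {i: [] for i in range(d)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    Gamma = np.zeros((d, d))
    for s in range(d):
        dist = {s: 0.0}  # accumulated path sums from s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    w = values.get((u, v), values.get((v, u)))
                    dist[v] = dist[u] + w
                    queue.append(v)
        for t, g in dist.items():
            Gamma[s, t] = g
    return Gamma
```

On the path 0–1–2 with both edge values equal to one, the completion gives Γ_{02} = 2, matching the additivity of Γ on the graph used in Example 4.2.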

Decomposable graphs
In this section, a solution to the matrix completion problem is given for connected, decomposable graphs. First, consider the simplest non-trivial example from this class of graphs, a graph consisting of exactly two cliques. Recall Σ^{(k)}, Σ̃^{(k)}, and φ_k from (2.5), γ from (3.2), and observe that φ_k^{-1}(Σ^{(k)}) = γ(Σ̃^{(k)}).
Lemma 4.3. Let G = (V, E) be a connected decomposable graph consisting of two cliques C_1 and C_2, separated by D_2 = C_1 ∩ C_2 ≠ ∅. Let Γ̄ be a partially conditionally negative definite matrix, specified on G. For some k ∈ D_2, let Σ̄^{(k)} = φ_k(Γ̄) and let Σ^{(k)} be its unique positive definite completion with graphical structure G_{V∖{k}}. Then Γ = φ_k^{-1}(Σ^{(k)}) is the unique solution of the matrix completion problem 4.1 for Γ̄ and graph G.
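For two cliques, these steps can be carried out explicitly: ground the partial matrix at some k in the separator, fill the missing covariance block with the usual conditional-independence formula Σ_AB = Σ_AD Σ_DD^{-1} Σ_DB, and map back via the covariance transform. A numpy sketch with np.nan as "?" (names are ours; a sketch of the lemma's construction under our reading, not the paper's code):

```python
import numpy as np

def two_clique_completion(Gamma_bar, C1, C2):
    """Complete a variogram specified on two cliques C1, C2 with
    non-empty separator D = C1 ∩ C2 (Lemma 4.3); "?" entries are np.nan."""
    D = sorted(set(C1) & set(C2))
    k = D[0]
    # grounded matrix (2.5): entries (Gamma_ik + Gamma_jk - Gamma_ij) / 2
    S = 0.5 * (Gamma_bar[:, [k]] + Gamma_bar[[k], :] - Gamma_bar)
    A = [i for i in C1 if i not in D]   # nodes only in C1
    B = [j for j in C2 if j not in D]   # nodes only in C2
    Dt = [m for m in D if m != k]       # separator without the ground node
    if Dt:
        fill = S[np.ix_(A, Dt)] @ np.linalg.solve(S[np.ix_(Dt, Dt)],
                                                  S[np.ix_(Dt, B)])
    else:
        fill = np.zeros((len(A), len(B)))
    S[np.ix_(A, B)] = fill
    S[np.ix_(B, A)] = fill.T
    # map back via the covariance transform (2.4)
    dS = np.diag(S)
    return dS[:, None] + dS[None, :] - 2 * S
```

With a single-node separator the filled block is zero and the result reduces to the path-sum rule of Example 4.2.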
The existence and uniqueness of such a positive definite completion for general graph structures is shown in Speed and Kiiveri (1986), and a short proof of the case with two cliques is given in Lemma S.5.13. Using this result, a completion for a general decomposable graph G can be constructed as follows. Let {C_1, . . . , C_N} be the cliques of G, ordered according to the running intersection property (cf. Section S.1.3), and, without loss of generality, assume that the vertices in V are ordered accordingly, in the sense that i ≤ j for all i ∈ C_k and j ∈ C_l with k < l. Let Γ_1 = Γ̄_{C_1}, V_n = C_1 ∪ . . . ∪ C_n, and K_n = V_n × V_n. Further, for n = 2, . . . , N, iteratively define the partial matrix Γ̄_n ∈ R̄^{|V_n|×|V_n|} that agrees with Γ_{n−1} on K_{n−1} and with Γ̄ on C_n × C_n, and is unspecified otherwise; the completion Γ_n is then obtained by applying Lemma 4.3 to the two vertex sets V_{n−1} and C_n, which is possible since Γ̄_n is fully specified on V_{n−1}.

Proposition 4.4. Let G = (V, E) be a connected, decomposable graph and Γ̄ a partially conditionally negative definite matrix, specified on G. Then there exists a unique solution to the matrix completion problem 4.1 with Γ̄ and G. This solution can be computed explicitly as Γ_N in the construction above.
The class of decomposable graphs is a significant extension of the class of block graphs. For instance, a non-decomposable graph can be approximated in a non-trivial fashion by computing a so-called decomposable completion. In contrast, the only block graph completion of any biconnected graph (i.e., a graph that remains connected after removal of any one vertex) is already the complete graph, since the only biconnected block graph is the complete graph, and adding edges to a graph preserves biconnectivity. See for instance Lauritzen (1996, Section 5.3) for further details about decomposable graphical models.
Example 4.5. Figure 2 shows an example of the matrix completion algorithm from Proposition 4.4. Starting with the clique {1, 2, 3}, the missing values are computed clique by clique. Edges whose corresponding matrix entries have already been computed are considered part of the graph in subsequent steps, so that each computation is a direct application of Lemma 4.3. Thanks to the running intersection property, the required conditional independence structure is preserved.

General graphs
The class of general connected graphs is much larger than the class of decomposable ones; see Figure 1d for an example. In applications, when the graph is estimated from data without restrictions, non-decomposable structures often arise. The following results provide a more general but slightly weaker solution to the matrix completion problem in Definition 4.1, where, as before, Θ = θ(Γ).

Proposition 4.6. Let G = (V, E) be a connected graph and let Γ̄ ∈ R̄^{d×d} be specified on G, such that there exists a fully specified conditionally negative definite matrix that agrees with Γ̄ on the entries (i, j) ∈ E. Then there exists a unique conditionally negative definite matrix Γ that solves the matrix completion problem (4.1) for Γ̄ and G.
This result provides the same theoretical existence of a unique solution as Proposition 4.4 and allows the definition of the following mapping. We use the notation Γ_G ∈ R̄^{d×d} to denote the restriction of a fully specified matrix Γ to a graph G = (V, E), in the sense

    (Γ_G)_{ij} = Γ_{ij} if (i, j) ∈ Ē,  and  (Γ_G)_{ij} = "?" otherwise.

Definition 4.7. For a connected graph G = (V, E), let D̄_G = {Γ′_G : Γ′ ∈ D_d} be the restriction of conditionally negative definite matrices to G. The function

    C_G : D̄_G → D_d

maps a partial matrix Γ̄ ∈ D̄_G to its unique completion Γ satisfying the matrix completion problem in Definition 4.1 with respect to G.
The uniqueness of this completion can be understood intuitively by counting the numbers of parameters and constraints involved in (4.1). Notably, the existence of any conditionally negative definite completion of Γ̄ is sufficient for the existence of a graphical completion. In the decomposable case, this can be guaranteed by verifying the definiteness of all fully specified principal submatrices, but in the general case this criterion does not work, as the following counter-example shows.
Example 4.8. For d ≥ 4, let Γ̄ ∈ R̄^{d×d} be a partial matrix on the d-dimensional ring graph with entries Γ̄_{ij} = 1 if |i − j| = 1, Γ̄_{1d} = Γ̄_{d1} = (2d)^2, zeros on the diagonal, and "?" elsewhere. Then all fully specified principal submatrices of this Γ̄ are conditionally negative definite, but there exists no conditionally negative definite completion of the entire matrix (see Section S.5.5.3 for details).
In general, computing the completion is less straightforward than in the decomposable case, but the following algorithm follows from the proof of Proposition 4.6. This procedure is similar to the one described in Speed and Kiiveri (1986, Section 4), using the above results instead of positive definite matrix completions.
Corollary 4.9. Let G and Γ̄ be as in Proposition 4.6, let Γ_0 ∈ D_d be a conditionally negative definite completion of Γ̄, and let G_1, . . . , G_m be decomposable graphs containing G whose edge sets intersect in E. Define iteratively Γ_n = C_{G_{t_n}}((Γ_{n−1})_{G_{t_n}}), with t_n ∈ {1, . . . , m}, t_n ≡ n mod m, and C_{G_{t_n}} computed as in Proposition 4.4. Then Γ_n converges to the unique completion C_G(Γ̄) as n → ∞.

In order to apply Corollary 4.9, an initial (non-graphical) completion of Γ̄, to be used as Γ_0, is required. The problem of finding such a matrix is also known as the Euclidean distance matrix completion problem, with solution algorithms for example in Bakonyi and Johnson (1995) and Fang and O'Leary (2012).
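The stationarity condition behind the completion, Θ_{ij} = 0 on non-edges, also admits a naive numerical alternative to Corollary 4.9: starting from any valid initial completion, drive the off-edge precision entries to zero by coordinate-wise Newton steps, using the derivative dΘ_{ij}/dΓ_{ij} = (Θ_{ii}Θ_{jj} + Θ_{ij}^2)/2. The following sketch rests on our own derivation of this derivative and is not the authors' algorithm:

```python
import numpy as np

def hr_precision(Gamma):
    """Theta = (Pi (-Gamma/2) Pi)^+ with Pi the centering projection."""
    d = Gamma.shape[0]
    Pi = np.eye(d) - np.ones((d, d)) / d
    return np.linalg.pinv(Pi @ (-0.5 * Gamma) @ Pi)

def complete_newton(Gamma0, non_edges, sweeps=50):
    """Naive completion: Newton steps on each unspecified entry (i, j)
    to solve Theta_ij(Gamma_ij) = 0, holding the other entries fixed."""
    Gamma = Gamma0.copy()
    for _ in range(sweeps):
        for i, j in non_edges:
            Theta = hr_precision(Gamma)
            slope = 0.5 * (Theta[i, i] * Theta[j, j] + Theta[i, j] ** 2)
            step = Theta[i, j] / slope
            Gamma[i, j] -= step
            Gamma[j, i] -= step
    return Gamma
```

On the four-cycle with all edge entries equal to one (a non-decomposable graph), starting from the valid completion with both diagonals set to 1.5, the iteration converges to Γ_{13} = Γ_{24} = 4/3, the unique value for which the corresponding Θ vanishes on the two non-edges.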
Example 4.11. To illustrate the algorithm from Corollary 4.9, consider the matrix Γ_0 below, which does not have any non-trivial graphical structure, and its completion Γ, whose conditional independence structure is described by the graph G in Figure 3. These two matrices differ only in the highlighted entries, corresponding to non-edges in G. During the computation of Γ, the entries of Γ_0 that correspond to edges in G do not change, and the convergence of the remaining entries of Θ to zero is plotted in Figure 3. It can be seen that the entries corresponding to edges (2, 5) and (3, 5) stay at zero (up to numerical precision of magnitude 10^{−15}) after the first few iterations, whereas the maximum of entries (2, 4) and (1, 3) converges to zero at a slower rate. This observation is in line with the fact that vertices {1, 2, 3, 4} induce the chordless cycle that makes the graph non-decomposable.

Statistical inference
Estimation of a Hüsler-Reiss parameter matrix Γ that is an extremal graphical model on a given graph G = (V, E) is currently restricted to the simple structures of trees (Engelke and Volgushev, 2020;Hu et al., 2022), latent trees (Asenova et al., 2021) and block graphs (Engelke and Hitz, 2020; Asenova and Segers, 2023). In the discussion of Engelke and Hitz (2020), this has been pointed out as too restrictive in practice, since many data sets require more general graph structures.
In this section we solve this issue by applying our results on matrix completions for variograms. In particular, we show how any consistent estimator Γ̂ can be transformed into a consistent estimator Γ̂_G with the desired graph structure. When the graph is unknown, it can be recovered by existing structure learning methods as an estimate Ĝ. Our completion then produces the first estimator Γ̂_Ĝ that jointly estimates the graph and the variogram matrix for general graphs G in a consistent way.

Matrix completion as likelihood optimization
For a more statistical perspective on the matrix completions in Section 4, we first characterize them as constrained maximum likelihood estimators; see Uhler (2017) for the Gaussian case. For a variogram matrix Γ ∈ D_d and a connected graph G = (V, E), we will show that the maximizer of the positive semi-definite Gaussian log-likelihood

log |Θ|_+ + ½ tr(ΓΘ), (5.1)

under suitable graph constraints, is equal to our completion operator C_G from Definition 4.7 applied to Γ_G. The connection to likelihood estimation for Hüsler-Reiss distributions arises by choosing for Γ the empirical variogram Γ̂ (Engelke and Volgushev, 2020). In this case, (5.1) is the (surrogate) log-likelihood of the Hüsler-Reiss model parameterized in terms of the precision matrix Θ (Röttger et al., 2023b, Section 5.1); see Section S.2.1 for a simple derivation.
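The objective (5.1) can be evaluated numerically using the pseudo-determinant, i.e., the product of non-zero eigenvalues. A minimal sketch (function names are ours); as a sanity check, at Θ = Σ^+ with Γ = γ(Σ) the trace term equals −(d − 1), because the rows of Θ sum to zero and Θ^+Θ is a rank-(d − 1) projection:

```python
import numpy as np

def pseudo_log_det(A, tol=1e-10):
    """log|A|_+ : sum of logs of the non-zero eigenvalues (A symmetric PSD)."""
    eig = np.linalg.eigvalsh(A)
    return float(np.sum(np.log(eig[eig > tol])))

def gamma_map(S):
    """gamma(S)_ij = S_ii + S_jj - 2 S_ij, turning a covariance into a variogram."""
    v = np.diag(S)
    return v[:, None] + v[None, :] - 2 * S

def surrogate_loglik(Gamma, Theta):
    """Positive semi-definite Gaussian surrogate log-likelihood (5.1)."""
    return pseudo_log_det(Theta) + 0.5 * float(np.trace(Gamma @ Theta))

# a valid precision matrix with zero row sums (Laplacian of the path 1-2-3,
# non-zero eigenvalues 1 and 3, so log|Theta|_+ = log 3)
Theta = np.array([[1.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 1.0]])
Gamma = gamma_map(np.linalg.pinv(Theta))

# tr(Gamma Theta) = -2 tr(Theta^+ Theta) = -2 (d - 1), so the value
# equals log(3) - 2 here
val = surrogate_loglik(Gamma, Theta)
```

This is an illustration of the objective only; the constrained maximization itself requires an optimizer over the graph-structured precision matrices.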
Recall the space P¹_d of Hüsler-Reiss precision matrices, as well as the map θ and its inverse in Proposition 3.3. Proposition 5.1 states that the completion C_G(Γ_G) corresponds to Θ_G, where Θ_G is the unique maximizer of (5.1) over all Hüsler-Reiss precision matrices Θ ∈ P¹_d under the constraint that Θ_ij = 0 for all (i, j) ∉ E. We note that the result holds for any variogram matrix Γ, but only for the empirical variogram Γ̂ is there an interpretation in terms of maximum (surrogate) likelihood estimation. In practice, solving this optimization problem provides an alternative to Corollary 4.9 for computing the graphical completion of a matrix. We leave this approach for future research.

Consistency
In order to show consistency results, a useful property of the completion C_G is that it is a continuous mapping from the space D̃_G of partially specified variogram matrices to the space of variogram matrices D_d.
Lemma 5.2. The mapping C G from Definition 4.7 is continuous.
By continuity of C G , a consistent estimator on the edges of a Hüsler-Reiss graphical model can be extended to a consistent estimator of the whole parameter matrix.
Theorem 5.3. Consider a Hüsler-Reiss graphical model with graphical structure G = (V, E) and variogram matrix Γ. Let Γ̂ ≡ Γ̂_n be an estimator sequence of Γ_G which satisfies diag(Γ̂) ≡ 0, is symmetric, and is consistent in the sense that for all ε > 0,

P( |Γ̂_ij − Γ_ij| > ε ) → 0, n → ∞, for all (i, j) ∈ E.

Then with probability tending to one there exists a completion of Γ̂, that is, P(Γ̂ ∈ D̃_G) → 1 as n → ∞. Let Γ̂_G = C_G(Γ̂) denote this completion and Θ̂_G the corresponding precision matrix (and set both matrices to the zero matrix if the completion does not exist). Then Γ̂_G is a consistent estimator for Γ with the correct graph structure, that is, for all ε > 0,

P( max_{i,j ∈ V} |(Γ̂_G)_ij − Γ_ij| > ε ) → 0, n → ∞,

and, whenever the completion exists, (Θ̂_G)_ij = 0 for all (i, j) ∉ E.

Theorem 5.3 provides the first consistent estimator of a Hüsler-Reiss variogram matrix that respects the structure of a general graph G = (V, E). Indeed, the only ingredient that is needed is a consistent estimator of Γ on the edge set E. There are many different possibilities for such estimators in the literature. A natural and computationally efficient estimator is the empirical variogram (Engelke et al., 2015; Engelke and Volgushev, 2020), which for Hüsler-Reiss distributions is the empirical version of the parameter matrix Γ. Other proposals include M-estimators (Einmahl et al., 2012, 2016; Lalancette et al., 2021), proper scoring rules (de Fondeville and Davison, 2018) and likelihood methods. The latter have the advantage that they can incorporate censoring of components of the data that are not extreme (Ledford and Tawn, 1996; Wadsworth and Tawn, 2014). Pairwise likelihood methods (Padoan et al., 2010) reduce the high computational cost of censoring.
In order to guarantee that even for a fixed sample size n there exists a completion, it can be advantageous to start with a consistent estimator Γ̂ ∈ D_d of the entire parameter matrix Γ. Since Γ̂_G ∈ D̃_G by definition, Proposition 4.6 ensures that the completion C_G(Γ̂_G) exists and is a consistent estimator with graph structure G. In larger dimensions d = |V|, estimating all entries of Γ by likelihood optimization or censoring is often infeasible, and the only option is then the empirical variogram or estimators based on simple summary statistics such as extremal coefficients (Einmahl et al., 2018). If the graph G is sparse, then there is an efficient alternative to estimating every entry of Γ, which requires only estimation on all cliques separately; see Section S.2.2. Furthermore, in Section S.2.3, we perform a simulation study that illustrates the performance gains achieved by this method, while yielding a similar quality of estimation of Γ.
The graph G is in practice often unknown and has to be estimated from the data. Consistent structure estimation methods for extremal graphs exist for trees (Engelke and Volgushev, 2020; Hu et al., 2022), and for general graphs based on lasso-type L1 penalization (Engelke et al., 2022c; Wan and Zhou, 2023). The latter paper proposes the EGlearn method and shows, under certain conditions, its sparsistency even in the high-dimensional case where the dimension may grow with the sample size. More precisely, if G is the graph implied by the zero entries of the true Θ, and Ĝ = (V, Ê) is the estimated graph from EGlearn, then

P( Ĝ = G ) → 1, n → ∞. (5.4)

While these methods consistently recover a general graph structure, they do not provide an estimate of the Γ matrix on the estimated graph structure. Our theory complements the structure estimation in Engelke et al. (2022c). In combination, we are now able to jointly estimate any graph structure and the corresponding Hüsler-Reiss parameter matrix consistently.
Corollary 5.4. Consider a Hüsler-Reiss graphical model with graphical structure G = (V, E) and variogram matrix Γ. Let Γ̂ be a consistent estimator for Γ and Ĝ = (V, Ê) a sparsistent estimator of G as in (5.4). If Γ̂_Ĝ = C_Ĝ(Γ̂_Ĝ) is the completion on Ĝ and Θ̂_Ĝ its precision matrix, then Γ̂_Ĝ is a consistent and sparsistent estimator for Γ, that is, for all ε > 0,

P( max_{i,j ∈ V} |(Γ̂_Ĝ)_ij − Γ_ij| > ε ) → 0 and P( Ĝ = G ) → 1, n → ∞.

In Corollary 5.4, not only is Γ̂_Ĝ a consistent estimator of Γ, it is also, with probability going to one, a graphical model with respect to the true graph G.

Application
Evaluation of the risk of climate extremes such as heatwaves (e.g., Reich et al., 2014) or floods (e.g., Asadi et al., 2015;Cooley and Thibaud, 2019) is the bread and butter of extreme value analysis, and our methods are perfectly suitable for such data. In order to illustrate our methods on a widely used data set and to underline their applicability to a broad range of domains, we study the river flow data of Asadi et al. (2015) in the Supplementary Material (Section S.4). In this section, we extend the range of possible applications and use our methodology to study large flight delays in the U.S. flight network. Excessive delays have a variety of negative effects, ranging from inconveniences for passengers and congestion of critical airport infrastructure to financial losses for involved parties.

Data and exploratory analysis
The United States Bureau of Transportation Statistics 1 provides records of domestic flights in the U.S. that are operated by major carriers (at least 1% market share) at airports accounting for at least 1% of domestic enplanements. We use this data from 2005 to 2020, selecting only airports from the contiguous U.S. with a minimum of 1000 flights per year. For each airport, we compute the accumulated (positive) flight delays (in minutes) on a given day. As there are a few days for which no data are available, this results in a data set with n = 5601 observations x 1 , . . . , x n ∈ R d of daily accumulated flight delays for the 16 years at d = 169 airports. This pre-processed data set is available in the R-package graphicalExtremes (Engelke et al., 2022a).
The models in this paper are suitable for variables where the largest observations are asymptotically dependent. For our data, such dependence between the largest daily accumulated flight delays at different airports may result from several factors.
For instance, large delays at a hub in the network may induce delays at other airports that have frequent connections to that hub. On the other hand, independently of flight connections, meteorological events such as severe storms or heavy snowfall can cause simultaneous delays at airports in the same geographical area.
In order to find groups of airports at risk of simultaneous large delays, we first run a k-medoids clustering (Kaufman and Rousseeuw, 2009), where similarities are defined in terms of the strength of extremal dependence. Clustering is frequently used in the extreme value literature to identify regions that are homogeneous in terms of dependence properties (e.g., Bernard et al., 2013; Saunders et al., 2020; Vignotto et al., 2021). We follow these approaches but use a different dissimilarity measure, namely the empirical version χ̂_ij(p) of the extremal correlation (2.1) at probability level p = 0.95. The clusters and subsequent results are stable with respect to the exact probability level in a range of about p ∈ [0.8, 0.95], yielding both a sufficiently high threshold and enough exceedances.
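A common rank-based form of the empirical extremal correlation can be sketched as follows. This is a sketch under the assumption that (2.1) is estimated by normalized joint exceedance probabilities, which is a standard choice; the function name is ours:

```python
import numpy as np

def emp_extremal_corr(X, p=0.95):
    """Rank-based empirical extremal correlation matrix:
    chi_hat_ij(p) = P_hat(F_i(X_i) > p, F_j(X_j) > p) / (1 - p),
    with margins replaced by empirical ranks (pseudo-observations)."""
    n, d = X.shape
    ranks = np.argsort(np.argsort(X, axis=0), axis=0) + 1
    U = ranks / (n + 1.0)                 # pseudo-observations in (0, 1)
    exceed = (U > p).astype(float)        # marginal exceedance indicators
    joint = exceed.T @ exceed / n         # pairwise joint exceedance probs
    chi = joint / (1.0 - p)
    np.fill_diagonal(chi, 1.0)            # chi_ii = 1 by convention
    return chi
```

A dissimilarity such as 1 − χ̂_ij(p) can then be passed to any k-medoids implementation.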
The resulting clusters are shown in Figure 4, for an ad-hoc choice of four clusters. Even though no information on the airport locations is used, the resulting groups exhibit strong geographical proximity, supporting our intuition on the importance of flight connectivity and meteorological factors. Figure S.3 shows that the extremal dependence of large delays is much stronger within clusters than between different clusters. In the following we focus on the southern cluster around Texas, consisting of 29 airports, which is particularly stable even when modifying the number k of clusters and threshold value p; similar analyses can be conducted for the other clusters. The IATA codes of this cluster are given in Table 4. The existence of a (regular) flight connection between two airports defines a natural graph, denoted by G flight and shown in the left panel of Figure 5.

Graphical modeling
The flight graph G flight is based on domain knowledge and is certainly a good first candidate for statistical modeling. It is however not necessarily the best graph in terms of conditional independence properties. We therefore also consider extremal graph structures estimated from data. To evaluate the fitted models out-of-sample, all rows with missing values were removed, and the data was split into a training set with 1764 observations (2005-01-01 to 2010-12-31) for estimation, and a validation set with 1839 observations (2011-01-01 to 2020-12-31) for selection of tuning parameters and model comparisons. As a base estimator, we use the empirical extremal variogram Γ computed at probability threshold p = 0.95 on the training data.
Figure 4: Clusters of airports in the contiguous U.S., using k-medoids clustering with empirical extremal correlation as dissimilarity. Edges represent airports connected by (regular) flights. Airport sizes are proportional to the average number of daily flights.

The first, sparse graph is obtained as the minimum spanning tree with weight matrix Γ̂ as proposed in Engelke and Volgushev (2020); this tree is denoted by T̂ and is shown in the center of Figure 5. Alternatively, the empirical extremal correlation χ̂ as weight matrix can also recover an underlying tree (Engelke and Volgushev, 2020; Hu et al., 2022). As a second family of estimated extremal graph structures Ĝ_ρ we apply the EGlearn algorithm (Engelke et al., 2022c) to the flights data set with different regularization parameters ρ ≥ 0. The latter governs the amount of sparsity in the estimated graph, where larger ρ values correspond to sparser graphs. EGlearn produces a whole sequence of estimated graphical models, but without estimates of the corresponding parameter matrices.
We apply our methodology to fit a Hüsler-Reiss multivariate generalized Pareto distribution with different extremal graphical structures. As before, the empirical variogram Γ̂ acts as base estimator. Applying the completion operator C_G to the restriction of Γ̂ to a graph G as in (5.3), we obtain graph structured estimators Γ̂_{G_flight}, Γ̂_T̂ and Γ̂_{Ĝ_ρ} for the flight graph, the extremal tree and the ρ-regularized general graph, respectively. We note that, previously, parameter estimation was only possible on the tree through the tree metric property (4.2), but not on the more general graphs.
Our method thus makes it possible to estimate the Hüsler-Reiss parameters of an extremal graphical model based on a graph learned from the data, the only restriction being that the graph must be connected.
A first benefit of our approach is that it complements structure learning methods such as EGlearn by allowing for data-driven sparsity tuning. One approach would be to compare the Hüsler-Reiss log-likelihood values of the different parameter matrices Γ̂_{Ĝ_ρ}, for instance by AIC or BIC applied to the training data. Instead, we directly compare the log-likelihood values on the validation set and plot them against the tuning parameter ρ in Figure 6. The likelihood is computed based on the Hüsler-Reiss Pareto density, and since we are mainly interested in the dependencies between different airports, the univariate marginals were normalized to the standard exponential scale using the empirical distribution functions; Figure S.4 shows the univariate shape parameters before normalization. The structural estimators in this section are all part of the family of Hüsler-Reiss distributions, allowing the comparison of the log-likelihood values, in the same way as normal distributions with different degrees of sparsity are compared in classical graphical modelling (e.g., Friedman et al., 2007). The best fitting model on the independent validation data set comes from EGlearn at ρ* = 0.15, and the corresponding graph with 101 edges is shown on the right-hand side of Figure 5. The flight graph performs significantly worse, and the tree seems to be too sparse. The Hüsler-Reiss precision matrix defined in Section 3 can be used to track how sparsity is induced in the EGlearn algorithm. Figure 6 shows the entries of the precision matrices Θ̂_{Ĝ_ρ} as a function of the tuning parameter ρ. Similarly to the usual lasso, we observe that they tend to zero in a possibly non-monotone way.
The completed variogram matrices also serve to assess the fitted model and evaluate its goodness of fit. Figure 7 compares the values of the empirical extremal variogram Γ̂ to the variogram estimates implied by the fitted graphical model Γ̂_G, for the graph G being the flight graph, the extremal minimum spanning tree and the optimal EGlearn graph, respectively. The extremal correlations are χ = 2 − 2Φ(√Γ/2) for a Hüsler-Reiss distribution with variogram matrix Γ, where Φ is the standard normal distribution function and all functions are applied componentwise. The results confirm the likelihood considerations and, in particular, convincingly show that a tree model is not flexible enough to capture all extremal dependencies in this data set. It is worthwhile to note that this finding is in contrast to the results of the application to river flow data in Section S.4, where the performance of the tree model is very similar to the EGlearn method, outperforming both the complete graph and the physical flow graph of the river network. This stresses the usefulness and applicability of our methodology in various situations with different amounts of sparsity.

S.1.1 Pseudo-inverse

Definition S.1.1. For a matrix A ∈ R^{m×n} there exists a unique matrix B ∈ R^{n×m} satisfying the four equations ABA = A, BAB = B, (AB)⊤ = AB, and (BA)⊤ = BA. This solution is called the Moore-Penrose inverse or simply "pseudo-inverse" of A and is denoted by A^+ := B.
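The four standard Moore-Penrose conditions and the projection property can be checked numerically with `numpy.linalg.pinv`; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3)) @ rng.standard_normal((3, 5))  # rank <= 3
B = np.linalg.pinv(A)  # Moore-Penrose inverse A^+

# the four defining (Penrose) equations of the pseudo-inverse
assert np.allclose(A @ B @ A, A)
assert np.allclose(B @ A @ B, B)
assert np.allclose((A @ B).T, A @ B)
assert np.allclose((B @ A).T, B @ A)

# A A^+ is the orthogonal projection onto the image of A:
# symmetric, idempotent, and acting as the identity on Im A
P = A @ B
assert np.allclose(P @ P, P) and np.allclose(P @ A, A)
```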
Remark S.1.2. The pseudo-inverse is defined in a similar way for matrices with complex-valued entries, using the conjugate transpose in place of the transpose.
Lemma S.1.3. The pseudo-inverse has the following properties: (1) (A^+)^+ = A; (2) (A⊤)^+ = (A^+)⊤; (3) AA^+ is the orthogonal projection onto the image of A; (4) A^+A is the orthogonal projection onto the image of A⊤.
Proof. The first statement follows from the interchangeability of A and B in Definition S.1.1. The second statement can be proven by substituting the left-and right-hand side in the defining equations from Definition S.1.1. The last two statements are from Golub and Van Loan (1996, Section 5.5.4).
Lemma S.1.4. Let A and B be two symmetric matrices of equal size and let P_A denote the orthogonal projection matrix onto the image of A. Then the following two conditions are necessary and sufficient for A = B^+:
• Im B ⊆ Im A (or equivalently ker A ⊆ ker B);
• AB = P_A.
Proof. In the first bullet point, the equivalence of Im B ⊆ Im A and ker A ⊆ ker B follows from the identity (ker M ) ⊥ = Im M for symmetric matrices M . For A = B + , the two conditions follow directly from Lemma S.1.3; note that by the lemma and by symmetry of A and B, we have AB = (AB) ⊤ = B ⊤ A ⊤ = BA.
Conversely, using the fact that the projection matrix P_A is symmetric, the defining equations from Definition S.1.1 follow from the two conditions: by symmetry of A and B we have BA = (AB)⊤ = P_A⊤ = P_A, so that AB and BA are symmetric, and moreover ABA = P_A A = A and BAB = P_A B = B, where we used Im B ⊆ Im A in the last step.

S.1.2 Pseudo-determinant
Definition S.1.5 and Lemma S.1.6 are from Knill (2013, Section 2) and only adapted in scope and notation for their use here.
Definition S.1.5. Let A be a square matrix with eigenvalues {λ_i}. Then its pseudo-determinant, denoted by |A|_+, is defined as the product of its non-zero eigenvalues, |A|_+ := ∏_{λ_i ≠ 0} λ_i. If all eigenvalues are zero, |A|_+ = 1.
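Definition S.1.5 translates directly into code; a minimal sketch (the function name and tolerance are ours):

```python
import numpy as np

def pseudo_det(A, tol=1e-10):
    """Pseudo-determinant |A|_+ : product of the non-zero eigenvalues,
    equal to 1 if all eigenvalues are zero (Definition S.1.5)."""
    eig = np.linalg.eigvals(A)
    nz = eig[np.abs(eig) > tol]
    return float(np.real(np.prod(nz))) if nz.size else 1.0
```

For example, `pseudo_det(np.diag([2.0, 3.0, 0.0]))` gives 6, and `pseudo_det(np.zeros((2, 2)))` gives 1, matching the convention for all-zero eigenvalues.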
Lemma S.1.7. Let V be a linear subspace of R d and put Then for any A 1 , A 2 ∈ Q V , with B ′ i ∈ R l×l invertible and 0 a×b ∈ R a×b equal to zero in all entries. Using the properties from Lemma S.1.6 it follows that Since M is chosen independently of i, Furthermore, continuity of matrix multiplication implies that B 0 = lim i→∞ B i and hence lim i→∞ B ′ i = B ′ 0 . Using the continuity of the regular determinant, it follows that

S.1.3 Graph theory
All graphs considered in this paper are undirected, simple graphs without loops. A graph G = (V, E) is defined by a set of vertices V = {1, . . . , d} and a set of edges E ⊆ V × V. Since we only consider undirected graphs without loops, the set of all possible edges is E(V) := (V × V) \ {(i, i) : i ∈ V}, and it must always hold that (i, j) ∈ E ⇒ (j, i) ∈ E. The loops (i, i), i ∈ V, are not edges in the sense defined above, but can be useful when identifying edges and subgraphs with matrix entries and submatrices, respectively; for a set of edges E ⊆ E(V), we therefore also consider E augmented by all loops. A subgraph G' = (V', E') of G is a graph consisting of a subset of vertices V' ⊆ V and a subset E' of E such that all endpoints lie in V', i.e., E' ⊆ E ∩ E(V'). If E' is maximal (i.e., the latter set inclusion is an equality), G' is called the subgraph induced by V'. A subset of vertices C ⊆ V is called complete if its induced subgraph is a complete graph. A subset of vertices is called a clique if it is complete and not a strict subset of another complete subset.
The neighborhood of a vertex i is defined as δ(i) = {j ∈ V : (i, j) ∈ E}. The neighborhood of a vertex including the vertex itself is denoted as δ̄(i) := δ(i) ∪ {i}. A path of length m between vertices i and j is a sequence of m + 1 distinct vertices p_0, p_1, . . . , p_m such that p_0 = i, p_m = j, and (p_{i−1}, p_i) ∈ E for all i = 1, . . . , m. If there exists a path between two vertices, they are said to be connected. A graph is connected if any two of its vertices are connected. A cycle of length m is a sequence of m distinct vertices p_1, . . . , p_m such that (p_m, p_1) ∈ E and (p_{i−1}, p_i) ∈ E for all i = 2, . . . , m. A chord is an edge between two vertices of a cycle that is not itself part of the cycle.

Definition S.1.8 (Decomposable Graph). A decomposable graph is a graph in which all cycles of four or more vertices have a chord.
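The chordality condition of Definition S.1.8 can be tested with maximum cardinality search: the reverse of an MCS ordering is a perfect elimination ordering exactly for chordal graphs (the Tarjan-Yannakakis test). A minimal sketch (function name is ours):

```python
def is_decomposable(adj):
    """Chordality test: the reverse of a maximum cardinality search
    ordering is a perfect elimination ordering if and only if the
    graph is chordal, i.e. decomposable.

    adj: dict mapping each vertex to the set of its neighbors."""
    order, weight = [], {v: 0 for v in adj}
    unnumbered = set(adj)
    while unnumbered:
        v = max(unnumbered, key=lambda u: weight[u])  # MCS step
        order.append(v)
        unnumbered.remove(v)
        for u in adj[v]:
            if u in unnumbered:
                weight[u] += 1
    order.reverse()                                   # candidate PEO
    pos = {v: i for i, v in enumerate(order)}
    for v in order:
        later = [u for u in adj[v] if pos[u] > pos[v]]
        if later:
            w = min(later, key=lambda u: pos[u])
            # all other later neighbors of v must be adjacent to w
            if any(u != w and u not in adj[w] for u in later):
                return False
    return True
```

On the four-cycle without a chord this returns False, and True once a chord is added, in line with Definition S.1.8.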
A useful property of decomposable graphs is the following running intersection property (see e.g., Lauritzen, 1996, Proposition 2.17).
Lemma S.1.9 (Running intersection property). A graph G is decomposable if and only if its set of cliques C = {C_1, . . . , C_N} can be ordered such that the running intersection property is fulfilled, that is, for all i = 2, . . . , N there exists k(i) ∈ {1, . . . , i − 1} such that D_i := C_i ∩ (C_1 ∪ · · · ∪ C_{i−1}) ⊆ C_{k(i)}. The multiset D = {D_2, . . . , D_N} is independent of the chosen ordering of C and its elements are called separators. For connected graphs, all separators are non-empty.
Definition S.1.10 (Block graph). A block graph is a decomposable graph in which all non-empty separators consist of single vertices, that is, |D| = 1 for all non-empty D ∈ D.

Definition S.1.11 (Tree Graph). A tree graph or a tree is a connected graph that does not contain any cycle.
Remark S.1.12. For d ≥ 2, the set of trees is identical to the set of connected block graphs in which all cliques consist of exactly two vertices.
Definition S.1.13 (Graph Laplacian). For an undirected graph G = (V, E) the graph Laplacian matrix L ∈ R^{d×d} is defined by L_ij = deg(i) if i = j, L_ij = −1 if (i, j) ∈ E, and L_ij = 0 otherwise, where the degree deg(i) of a vertex i is defined as |δ(i)|.
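Definition S.1.13 can be sketched as follows (function name is ours); by construction every row of the Laplacian sums to zero:

```python
import numpy as np

def graph_laplacian(d, edges):
    """Graph Laplacian: L_ii = deg(i), L_ij = -1 if (i, j) in E, else 0.

    edges: iterable of undirected edges (i, j), each listed once."""
    L = np.zeros((d, d))
    for i, j in edges:
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L
```

For the path on three vertices this yields the matrix [[1, −1, 0], [−1, 2, −1], [0, −1, 1]].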

S.2 Details on matrix completion S.2.1 Matrix completion as likelihood optimization
For a Hüsler-Reiss distribution Y with parameter matrix Γ, the random vectors Y^(k), k ∈ V, defined in Section 2.3 follow a (d − 1)-dimensional normal distribution, denoted here by (S.2.1), with covariance matrix Σ^(k) and a mean vector depending on Γ (Engelke et al., 2015). For an independent sample of size n from Y, one can obtain samples from this (d − 1)-dimensional normal distribution by selecting only data with Y_k > 1 and following (S.2.1). We denote the corresponding empirical covariance matrix by Σ̂^(k), augmented by a kth row and column of zeros to make a d × d matrix. Ignoring the information on the parameter matrix Γ in the mean vector, the (surrogate) log-likelihood of this model can be written in terms of our Hüsler-Reiss precision matrix Θ as

log |Θ|_+ − tr( Σ̂^(k) Θ ), (S.2.2)

where | · |_+ is the pseudo-determinant. Setting Γ̂^(k) = γ(Σ̂^(k)) gives a nonparametric estimator of Γ. By Proposition 3.2 it holds that Σ̂^(k) = Π(− ½ Γ̂^(k))Π and therefore the cyclic permutation property of the trace operator together with the fact that ΘΠ = ΠΘ = Θ for any Θ ∈ P¹_d shows that (S.2.2) is equal to log |Θ|_+ + ½ tr(Γ̂Θ), with Γ̂ = Γ̂^(k). In order to use all data in the sample, we can consider this likelihood with Γ̂ equal to a combined version of the variogram estimators over all k, defined as Γ̂ := d^{−1} ∑_{k=1}^d Γ̂^(k) and called the empirical variogram (Engelke and Volgushev, 2020).
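The construction of the empirical variogram can be sketched as follows. This is a simplified sketch assuming standardized Pareto margins, so that the conditioning event is simply Y_k > 1; function names are ours:

```python
import numpy as np

def gamma_map(S):
    """The map gamma in (3.2): gamma(S)_ij = S_ii + S_jj - 2 S_ij."""
    v = np.diag(S)
    return v[:, None] + v[None, :] - 2 * S

def empirical_variogram(Y):
    """Empirical extremal variogram: d^{-1} sum_k gamma(Sigma_hat^(k)),
    where Sigma_hat^(k) is the empirical covariance of log(Y_i / Y_k)
    among the samples with Y_k > 1."""
    n, d = Y.shape
    G = np.zeros((d, d))
    for k in range(d):
        Yk = Y[Y[:, k] > 1]
        W = np.log(Yk / Yk[:, [k]])      # k-th column is identically zero
        G += gamma_map(np.cov(W, rowvar=False))
    return G / d

# toy data with standard Pareto margins (independent, illustration only)
rng = np.random.default_rng(0)
Y = 1.0 / rng.random((2000, 3))
G_hat = empirical_variogram(Y)
```

By construction the result is symmetric with zero diagonal, as required of a variogram matrix.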
A natural way to obtain a graphical model is therefore to maximize this surrogate Hüsler-Reiss likelihood over Θ ∈ P 1 d under the constraint of a graph-structured precision matrix. As shown in Proposition 5.1, solving this optimization problem corresponds to our completion operator C G from Definition 4.7.

S.2.2 Estimation strategy for sparse graphs
Let C be the collection of all cliques C ⊆ V of a connected graph G = (V, E) and suppose that for every C ∈ C, Γ̂_C is a consistent estimator of the entries Γ_ij for i, j ∈ C.
Note that Γ̂_C can be computed using information only from the components in C, which reduces the computational cost drastically if the cliques are small, especially if censoring is used. Since the cliques overlap on the separator sets, we need to combine the different estimators to obtain a partial variogram matrix Γ̃ ∈ R^{d×d} on the whole graph. We do this by averaging the estimators on the intersections. Let C_(i,j) = {C ∈ C : i, j ∈ C} denote the set of all cliques containing the edge (i, j) ∈ E and put

Γ̃_ij := |C_(i,j)|^{−1} ∑_{C ∈ C_(i,j)} (Γ̂_C)_ij, (i, j) ∈ E.

This partial matrix is clearly a consistent estimator for Γ_G as in (5.2), since it is an average of consistent estimators. For fixed sample size n, the estimator Γ̃ is not guaranteed to be a valid (partial) variogram matrix; this is the price to pay for the more efficient estimation using only information within each clique, and by Theorem 5.3, the probability of Γ̃ being invalid converges to zero as n → ∞. If it exists, the completion Γ̂_G = C_G(Γ̃) can be computed using Proposition 4.4 or Corollary 4.9, and by Theorem 5.3 it is a consistent estimator of Γ with correct graph structure G.
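The averaging step over overlapping cliques can be sketched as follows (a sketch; the function name and NaN convention for unspecified entries are ours):

```python
import numpy as np

def combine_clique_estimates(d, clique_estimates):
    """Average clique-wise variogram estimates on overlapping entries.

    clique_estimates: list of (C, G_C) pairs, where C is a tuple of
    vertex indices and G_C the estimated submatrix on C.  Entry (i, j)
    of the partial matrix is the mean of (G_C)_ij over all cliques
    containing both i and j; unspecified entries are returned as NaN."""
    total = np.zeros((d, d))
    count = np.zeros((d, d))
    for C, G_C in clique_estimates:
        idx = np.array(C)
        total[np.ix_(idx, idx)] += G_C
        count[np.ix_(idx, idx)] += 1
    with np.errstate(invalid="ignore"):
        return total / count              # NaN where no clique covers (i, j)
```

The NaN entries mark the "?" positions of the partial matrix, which are then filled by the completion operator.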
A natural question is whether it is possible to replace the arithmetic mean by another function that guarantees a valid partial variogram. The following example shows that this is not possible, as long as entries that belong to only a single clique are not altered as well.
Then there are x_1, x_2 ∈ R such that with x = x_1 the principal submatrix Γ̃_{{1,2,3}} is conditionally negative definite, and with x = x_2 the principal submatrix Γ̃_{{2,3,4}} is conditionally negative definite, but there exists no x such that both submatrices are conditionally negative definite at the same time.
Proof. Gower (1982) shows that Γ can be interpreted as a Euclidean distance matrix with Γ ij = d 2 ij , where d ij , for i, j ∈ {1, . . . , d}, is the pairwise distance between the points generating Γ (see also Section S.5.5.3). The triangle inequality then requires 9 ≤ x 1 ≤ 25 and 0 ≤ x 2 ≤ 4, which can be satisfied for each submatrix but not simultaneously for a single value of x.

S.2.3 Simulation study
We perform a simulation study to illustrate Theorem 5.3 and the estimation strategies from Section S.2.2. Throughout, the underlying graph G is assumed to be known.

S.2.3.1 Setup
We randomly generate two decomposable graphs of sizes d = 6 and d = 10, shown in Figure , with the following parameter matrices (rounded to two decimals). From each of these Hüsler-Reiss Pareto distributions we generate n samples and then estimate the parameter matrix Γ using the following methods. First, we use the empirical variogram (Engelke and Volgushev, 2020) to directly estimate all entries of Γ ("Full variogram"). This method can be modified as described in Section S.2.2 by applying the empirical variogram to each clique of G and combining the results through the matrix completions from Section 4 ("Clique-wise variogram").
Second, we use maximum likelihood estimation with respect to the probability density function f (y) = λ(y)/Λ c (0), where λ(y) = λ(y; Θ) from Proposition 3.4. Here, we consider three different approaches: estimating all entries of Γ directly ("Full MLE"); estimating only the entries in Θ corresponding to edges of G, while setting all others to zero ("Graphical MLE"); and estimating the entries of Γ C for each clique C separately, with respect to the "marginals" f (I) (y) = λ I (y)/Λ c I (0), combining them as described in Section S.2.2 ("Clique-wise MLE").

S.2.3.2 Results
We compare the methods in terms of their computational cost and the quality of their estimate Γ̂. The computational cost is expressed by the time taken to compute each estimate, using our implementation in R. The quality of the estimate is measured by the mean squared error

MSE(Γ̂) = (d(d − 1)/2)^{−1} ∑_{i<j} (Γ̂_ij − Γ_ij)².

This expression considers only the entries in the upper triangular part, since any valid variogram matrix is symmetric with zero diagonal. The results are illustrated in Figure S.2, Table 1 for n = 20 and d = 6, and Table 2 for n = 200 and d = 10; other combinations of n and d exhibit the same general trends. Throughout, the two methods based on the empirical variogram are very fast, requiring less than a second. The next fastest method is the clique-wise MLE, followed by the graphical MLE, with the full MLE being the slowest. Notably, for d = 6 the clique-wise MLE is faster than the full MLE by a factor of 10, whereas for d = 10 it is faster by a factor of 200.
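The error criterion over the upper triangular entries can be sketched as follows (a sketch in Python rather than the R implementation used in the study; the function name is ours):

```python
import numpy as np

def variogram_mse(G_hat, G):
    """Mean squared error over the strict upper triangular entries only,
    since any valid variogram matrix is symmetric with zero diagonal."""
    iu = np.triu_indices_from(G, k=1)
    diff = G_hat[iu] - G[iu]
    return float(np.mean(diff ** 2))
```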
In terms of MSE, the two variogram based methods perform similarly, with the clique-wise variogram slightly outperforming the full variogram. In line with their significantly higher computational cost, the MLE based methods perform better than the variogram based methods, and quite similarly to each other.

S.2.3.3 Discussion
The simulation clearly illustrates advantages of sparse models in terms of computational performance. The estimation accuracy is highly comparable for all three MLE based methods, while the graphical methods are significantly faster. In fact, for larger dimensions d, full MLE quickly becomes infeasible, as the number of parameters that need to be optimized simultaneously is d(d − 1)/2. For the graphical MLE, the number of parameters is identical to the number of edges |E|, which can be significantly smaller, bounded below by d − 1 for connected graphs.
In the clique-wise MLE method, a separate optimization problem with |C|(|C| − 1)/2 parameters is solved for each clique C ∈ C. The resulting total number of parameters is ∑_{C ∈ C} |C|(|C| − 1)/2, which is generally larger than the number of edges |E|. However, since the optimization problem for each clique is solved separately, this can be faster to compute. Furthermore, on suitable hardware significant speedups can be achieved by solving the problems in parallel (to allow a better comparison, we did not perform any parallel computations here).
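The three parameter counts compared above can be sketched in a few lines (function name is ours):

```python
def n_params(cliques, edges, d):
    """Parameter counts: full MLE has d(d-1)/2 parameters, the graphical
    MLE has |E|, and the clique-wise MLE solves one problem with
    |C|(|C|-1)/2 parameters per clique C."""
    full = d * (d - 1) // 2
    graphical = len(edges)
    cliquewise = sum(len(C) * (len(C) - 1) // 2 for C in cliques)
    return full, graphical, cliquewise
```

For two triangles sharing an edge (d = 4, 5 edges, cliques of size 3), this gives counts 6, 5 and 6, illustrating that the clique-wise total exceeds |E| whenever cliques overlap.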
In this simulation study we considered the underlying graph G as known during the estimation of Γ. In practice, this graph might need to be estimated in a separate step, possibly leading to worse performance of the graphical methods. Due to the modeling choices and hyperparameter-tuning involved, we did not include structure estimation in the context of this simulation study but refer to Section 6 and Section S.4. Furthermore, we only considered decomposable graphs G, for which the matrix completion step can be performed very quickly using Proposition 4.4. A graph estimated from data is likely not decomposable, implying that the iterative method in Corollary 4.9 has to be invoked, which is slower and has to be truncated at some point.

S.4 Application to Danube data
We perform a statistical analysis of the Danube river flow data from Asadi et al. (2015), which has also been investigated for example in Engelke and Hitz (2020). We follow the same procedure as in Section 6, but do not perform the initial clustering step, since there is significant extremal correlation between all pairs of measuring stations over a large range of thresholds; see Figure S.5. The data consist of clustered measurements from 1960 to 2010, representing flow volumes measured at 31 different stations situated on tributaries of the Danube river, yielding a total of 428 observations. For details on the preprocessing, we refer to Asadi et al. (2015).
The flow graph G flow , shown in the left panel of Figure S.7, is based on domain knowledge and constitutes a good first candidate for the statistical graph structure of the data. Furthermore, we consider extremal graph structures estimated in a data-driven way, using the minimum spanning tree method suggested in Engelke and Volgushev (2020) and the EGlearn method from Engelke et al. (2022c).
In order to evaluate the fitted models out-of-sample, the data set was split into a training set with 220 observations (1960 to 1985) and a test set containing the remaining 208 observations (1986 to 2010). All models are fitted using the probability threshold p = 0.9 on the training data. This choice of threshold value follows Engelke and Hitz (2020) and Röttger et al. (2023b) and is slightly lower than in Section 6, since the data set at hand is smaller and already consists of clustered maxima. The EGlearn tuning parameter ρ * is chosen on the basis of the log-likelihood of the estimated models on the test set; see Section 6 for further details on the applied methods.

Figure S.7 shows the estimated graph structures. The minimum spanning tree T is rather similar to the flow graph, whereas EGlearn estimates a significantly denser graph structure G ρ * with 69 edges. Figure S.8 shows the log-likelihoods and edge counts of the models estimated by EGlearn for different values of the tuning parameter ρ, as well as the log-likelihoods of the models on T , G ρ * , and the complete graph. Notably, the log-likelihood corresponding to T is quite close to that of EGlearn, outperforming both the flow graph and the complete graph. This is in stark contrast to the performance of the minimum spanning tree in Section 6, indicating that this data set can be better described by a sparse model. However, the denser graph estimated by EGlearn still performs best in terms of log-likelihood. This trend is further supported by the comparison of empirical and fitted extremal variogram entries, shown in Figure S.

Figure S.7: The flow graph G flow (left), the estimated tree graph T (center), and the graph estimated using EGlearn, G ρ * , for ρ * = 0.06 (right).

S.5.1 Proof of Proposition 3.2

Before proving the Proposition, we establish some auxiliary results. The actual proof of Proposition 3.2 is a rather short combination of these results and is given at the end of the Section. The auxiliary results in this Section are given in sequential order, in the sense that each result uses only previously established results.
In the following, for a given Γ ∈ D d we write Σ = Π(−Γ/2)Π. We assume that the matrix S ∈ R d×d satisfies ΠSΠ = Σ, the matrix S not necessarily being symmetric. Lemma S.5.1 shows that this is in fact a slightly weaker assumption than γ(S) = Γ, which is used in the Proposition. Furthermore, for any t ∈ R we introduce the abbreviation Σ [t] := S + t11 ⊤. The choices S = −Γ/2 or S = Σ simplify many of the proofs significantly.

Lemma S.5.1. Let Γ ∈ D d and recall the map γ in (3.2). Then γ(S) = Γ if and only if S = S ⊤ and ΠSΠ = Σ.

Proof. Since the map γ preserves (a)symmetry, and all Γ ∈ D d are symmetric, it follows that S = S ⊤ is a necessary condition for γ(S) = Γ. Furthermore, on the one hand, if γ(S) = Γ, then, since Π1 = 0, we get Σ = Π(−Γ/2)Π = Π(−γ(S)/2)Π = ΠSΠ. On the other hand, for symmetric S and writing v := Se d with e d = d −1 1, we have ΠSΠ = S − 1v ⊤ − v1 ⊤ + (e d ⊤ v)11 ⊤. Since γ is invariant under adding terms of the form 1w ⊤ + w1 ⊤, and since Γ has zero diagonal, it follows that γ(S) = γ(ΠSΠ) = γ(Σ) = γ(−Γ/2) = Γ.

Lemma S.5.2. The matrix Σ is symmetric and positive semi-definite, and its kernel is span({1}).
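The interplay between Γ and Σ can be checked numerically. The sketch below assumes the standard entrywise covariance transform γ(S) ij = S ii + S jj − 2S ij (the displayed definition in the paper is Definition S.5.9; the entrywise form is our assumption) and verifies γ(Π(−Γ/2)Π) = Γ for a valid variogram matrix; all variable names are illustrative:

```python
import numpy as np

d = 5
Pi = np.eye(d) - np.ones((d, d)) / d  # orthogonal projection onto the complement of 1

def gamma(S):
    # assumed entrywise covariance transform: gamma(S)_ij = S_ii + S_jj - 2 S_ij
    dg = np.diag(S)
    return dg[:, None] + dg[None, :] - 2 * S

def sigma(G):
    # sigma(Gamma) = Pi (-Gamma/2) Pi
    return Pi @ (-0.5 * G) @ Pi

# a strictly conditionally negative definite variogram matrix:
# squared Euclidean distances between d generic points
rng = np.random.default_rng(0)
P = rng.normal(size=(d, d))
G = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)

S = sigma(G)
assert np.allclose(gamma(S), G)                 # gamma(sigma(Gamma)) = Gamma
assert np.allclose(S @ np.ones(d), 0.0)         # 1 lies in the kernel of Sigma
assert np.all(np.linalg.eigvalsh(S) > -1e-10)   # Sigma is positive semi-definite
```

The assertions reflect Lemma S.5.2: Σ is positive semi-definite with 1 in its kernel, and γ inverts σ on variogram matrices.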
Proof. To prove the invertibility of Σ [t] for t ̸ = t 0 consider the following two cases.
Setting t 0 := 0, observe that x ⊤ Σ [t] x = x ⊤ (Š + t11 ⊤ )x for all x ∈ R d , where Š := (S + S ⊤ )/2 is the symmetric part of S. Further, since Σ = Π(−Γ/2)Π is symmetric, the matrix S ∈ R d×d satisfies ΠSΠ = Σ if and only if ΠS ⊤ Π = Σ, and thus if and only if ΠŠΠ = Σ. Hence, to show that Σ [t] is positive definite for t > t 0 , we can without loss of generality assume that S is symmetric (or, more precisely, replace S by its symmetric part).
So assume S is symmetric and let v 0 be the vector from the proof of Lemma S.5.3. First, we show that Σ [t 0 ] is positive semi-definite; to do so, it is sufficient to show that its non-zero eigenvalues are positive. To this end, recall that Σ [t 0 ] v 0 = 0 and consider an eigenvector v 1 ∉ span({v 0 }) of Σ [t 0 ] with eigenvalue α 1 ≠ 0. The vector u := v 1 − (1 ⊤ v 1 /1 ⊤ v 0 )v 0 satisfies u ⊥ 1 and thus Πu = u; it is well-defined since 1 ⊤ v 0 ≠ 0, and since v 1 ∉ span({v 0 }), it follows that u ≠ 0. Hence, since Γ is conditionally negative definite and since v 0 and v 1 are orthogonal (as eigenvectors of the symmetric matrix Σ [t 0 ] associated to distinct eigenvalues), we have 0 < −u ⊤ Γu/2 = u ⊤ Σu = u ⊤ ΠSΠu = u ⊤ Su = α 1 ∥v 1 ∥ 2 , implying α 1 > 0, as required. A similar argument shows that Σ [t 0 ] has rank d − 1: otherwise, we could find an eigenvector v 1 ∉ span({v 0 }) of Σ [t 0 ] , orthogonal to v 0 , with eigenvalue 0 as well, and this would lead to a contradiction by the same calculation as above. For t > t 0 , the matrix Σ [t] is invertible by Lemma S.5.3, and since it is the sum of the two positive semi-definite matrices Σ [t 0 ] and (t − t 0 )11 ⊤ , it is also positive semi-definite and hence positive definite.
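The last step can be illustrated numerically. This is a sketch, not part of the formal proof; it assumes the abbreviation Σ [t] = S + t11 ⊤ and takes the admissible choice S = Σ, for which t 0 = 0:

```python
import numpy as np

d = 5
Pi = np.eye(d) - np.ones((d, d)) / d
rng = np.random.default_rng(1)
P = rng.normal(size=(d, d))
G = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)  # variogram matrix
Sigma = Pi @ (-0.5 * G) @ Pi                              # PSD with kernel span({1})

lam = np.sort(np.linalg.eigvalsh(Sigma))
assert abs(lam[0]) < 1e-8 and lam[1] > 0   # rank d - 1: only the direction 1 is in the kernel

# adding t * 1 1^T for t > 0 lifts the zero eigenvalue: Sigma^[t] is positive definite
for t in (0.1, 1.0, 10.0):
    assert np.linalg.eigvalsh(Sigma + t * np.ones((d, d))).min() > 0
```

The singular direction 1 of Σ is exactly the direction made positive by the rank-one perturbation t11 ⊤, while the quadratic form on 1-perp is untouched.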

for some constant c ≠ 0 and C ∈ R (d−1)×(d−1) satisfying the stated relation, which is symmetric positive definite, since Σ is symmetric positive semi-definite with kernel span({1}). Hence, using this representation, the latter matrix is symmetric, since W and W̃ differ only in the first column, and, since v j ⊥ 1 for all j = 1, . . . , d − 1, its kernel contains 1.
Proof. We will check the two criteria in Lemma S.
Since Σ = ΠSΠ, it follows that Im Π = Im Σ and thus that Π is the projection matrix onto the image of Σ. Further, the results from Lemma S.5.2 and Lemma S.5.5 yield the required properties of Σ and lim t→∞ Θ [t] . Hence, the claimed equality follows from Lemma S.1.4 with A = Σ and B = lim t→∞ Θ [t] .

S.5.2 S as generalized inverse of Θ
The following result yields an interesting alternative characterization of the matrices S that can be used in Proposition 3.2, in terms of generalized inverses, of which the Moore–Penrose inverse is a specific instance.

Proof. First, suppose ΠSΠ = Σ = Θ + . Since Π is the projection matrix onto the image of Σ, and thus also onto the image of Θ = Σ + (see the proof of Lemma S.5.6), we obtain ΘSΘ = ΘΠSΠΘ = ΘΣΘ = Θ, which is (S.5.2). Second, let (S.5.2) be satisfied. Since Θ + Θ = ΘΘ + is the projection matrix onto the image of Θ, and thus equal to Π, we find ΠSΠ = Θ + ΘSΘΘ + = Θ + ΘΘ + = Θ + = Σ.

S.5.3 Proof of Proposition 3.3

Definition S.5.9. The covariance transform γ is defined for arbitrary square matrices S ∈ R d×d as γ(S) ij := S ii + S jj − 2S ij , i, j = 1, . . . , d. This definition does not require S to have any special properties, but it also does not guarantee many useful properties of γ(S). A more useful mapping can be obtained by restricting the domain as follows.

Proof. For Σ satisfying the conditions above, let Γ = γ(Σ). Then clearly Γ = Γ ⊤ and diag(Γ) = 0. Moreover, for 0 ≠ v ⊥ 1 we have v ⊤ Γv = −2v ⊤ Σv < 0, since Σ is positive semi-definite with kernel span({1}); hence Γ is conditionally negative definite.

Proof of Proposition 3.3. Lemmas S.5.2 and S.5.10 show that σ and γ do indeed map between D d and P 1 d . Since both maps consist only of elementary matrix operations and the continuous map diag(·), they are also continuous. Using the intermediate results from the proof of Lemma S.5.1, it can be seen that γ(σ(Γ)) = γ(Π(−Γ/2)Π) = Γ for all Γ ∈ D d , and σ(γ(Σ)) = Π(−γ(Σ)/2)Π = Σ for all Σ ∈ P 1 d , implying that σ is bijective between D d and P 1 d with continuous inverse σ −1 = γ.
By Lemma S.1.3, Item (4), the Moore–Penrose inverse of a symmetric positive semi-definite matrix is again symmetric positive semi-definite. Hence, θ also maps D d into P 1 d and can be written as the composition φ ∘ σ, with φ : P 1 d → P 1 d , Σ ↦ Σ + . As shown in Rakočević (1997), the map φ is continuous, and hence it is a continuous, involutive bijection; note that φ is surjective, since every Σ ∈ P 1 d can be written as Σ = φ(φ(Σ)). Therefore, θ is also continuous, with continuous inverse θ −1 = γ ∘ φ.

X A ⊥ X C | X B , and hence Cov(X A , X C | X B ) = 0. Using, e.g., Rue and Held (2005, Section 2.1.7) to compute this conditional covariance yields the claimed expression, from which the Lemma immediately follows.
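Returning to the generalized-inverse characterization of Section S.5.2, it can be checked numerically that any S with ΠSΠ = Σ, for instance S = Σ or S = −Γ/2, is a generalized inverse of Θ = Σ +. A small illustrative sketch (variable names are ours):

```python
import numpy as np

d = 5
Pi = np.eye(d) - np.ones((d, d)) / d
rng = np.random.default_rng(2)
P = rng.normal(size=(d, d))
G = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)  # variogram matrix Gamma
Sigma = Pi @ (-0.5 * G) @ Pi
Theta = np.linalg.pinv(Sigma)                             # precision matrix Theta = Sigma^+

for S in (Sigma, -0.5 * G):          # two admissible choices with Pi S Pi = Sigma
    assert np.allclose(Pi @ S @ Pi, Sigma)
    assert np.allclose(Theta @ S @ Theta, Theta)          # S is a generalized inverse of Theta
assert np.allclose(np.linalg.pinv(Theta), Sigma)          # and Theta^+ recovers Sigma
```

Note that S = −Γ/2 is neither positive semi-definite nor a Moore–Penrose inverse of Θ, yet it still satisfies the weaker generalized-inverse identity ΘSΘ = Θ.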
Lemma S.5.14. Let G = (V, E), for V = {1, . . . , d}, be an undirected graph, with some node k ∈ V being connected to all other nodes. Let Γ̃ be a matrix specified on G that is conditionally negative definite on all fully specified principal submatrices. Let Σ̃ (k) = φ k (Γ̃) for φ k in Definition 2.1, for ease of notation indexed by V \ {k}. Then Σ̃ (k) is positive definite on all fully specified principal submatrices and preserves specified entries, in the sense that Σ̃ (k) ij is specified if and only if Γ̃ ij is specified.

Proof. To show that specified entries are preserved, recall that Σ̃ (k) ij = (Γ̃ ik + Γ̃ jk − Γ̃ ij )/2 for i, j ∈ V \ {k}. Since k is connected to all other nodes in G, Γ̃ ik is specified for all i ∈ V \ {k}, yielding the claimed equivalence.
To show positive definiteness of fully specified principal submatrices, let M ⊆ V \ {k} be the index set of such a submatrix and observe that the corresponding submatrix of Σ̃ (k) is the image under φ k of the fully specified principal submatrix of Γ̃ indexed by M ∪ {k}; by the conditional negative definiteness of the latter, this submatrix is positive definite.
Proof of Lemma 4.3. First, note that φ k is in fact bijective, with inverse φ k −1 (Σ (k) ) = γ(Σ̄ (k) ), where Σ̄ (k) ∈ R d×d is defined as Σ (k) with a zero-valued kth row and column added, and γ is from Definition S.5.9. Upon a permutation of the indices, the (d − 1) × (d − 1) matrix Σ̃ (k) takes the form (S.5.5), with blocks of indices A = C 1 \ {k}, B = D 2 \ {k} (possibly empty), and C = C 2 \ {k}. Lemma S.5.13 allows us to find the unique positive definite completion of Σ̃ (k) , say Σ (k) . The matrix Γ := γ(Σ̄ (k) ) is thus well-defined, and since Σ̄ (k) is symmetric positive semi-definite, with ker Σ̄ (k) = span({e k }), it follows from Lemma S.5.10 that Γ is conditionally negative definite. Since φ k and φ k −1 preserve specified entries, the condition Γ ij = Γ̃ ij for all (i, j) ∈ E is satisfied by construction. Furthermore, from Definition 3.1 and Lemma S.5.13 it follows that Θ ij = 0 for all (i, j) ∉ E. The uniqueness of this completion follows from Corollary S.5.21 below.
Lemma S.5.15. Let Y be a random variable following a multivariate generalized Pareto distribution with positive, continuous exponent measure density λ. Let G = (V, E) be a connected, decomposable graph consisting of two cliques C 1 and C 2 . Suppose Y satisfies the (extremal) pairwise Markov property relative to G, and the marginal Y C 1 , conditionally on max i∈C 1 Y i > 0, satisfies the (extremal) pairwise Markov property relative to G ′ . Then Y satisfies the (extremal) pairwise Markov property relative to the graph G ′′ obtained from G by replacing the subgraph on C 1 by G ′ .

Remark S.5.16. Since the pairwise Markov property only requires conditional independence for non-edges (i, j) ∉ E, but not conditional dependence for edges, a distribution can satisfy this property for a number of distinct graphs. In particular, adding edges to a graph strictly weakens the condition above.
Proof. Since all graphs involved in the lemma are connected and decomposable, Theorem 1 from Engelke and Hitz (2020) can be used to show that λ(y) = ∏ C∈C ′′ λ C (y C ) / ∏ D∈D ′′ λ D (y D ), with C ′′ , D ′′ being the cliques and separators of G ′′ , obtained by combining the cliques and separators C ′ , D ′ of G ′ with those of G. Here, the function λ I , for non-empty I ⊂ V , is the exponent measure density corresponding to the I-th marginal Y I , conditionally on max i∈I Y i > 0. By the same theorem, the above decomposition of λ(y) implies that Y satisfies the pairwise Markov property relative to G ′′ .
Proof of Proposition 4.4. Setting Γ := Γ N , the condition Γ ij = Γ̃ ij for all (i, j) ∈ E is satisfied by construction, and Lemma 4.3 guarantees that each Γ n is conditionally negative definite. The condition Θ ij = 0 for all (i, j) ∉ E is satisfied as well, since Lemma S.5.15 can be applied in each step. The uniqueness of this completion follows from Corollary S.5.21.

S.5.5.2 General graphs
The proofs shown here closely follow the proofs in Speed and Kiiveri (1986) and are adjusted where necessary to hold for conditionally negative definite variogram matrices instead of positive definite covariance matrices. Recall the map σ( · ) in (3.6).
Corollary S.5.21. Let G = (V, E) be a connected graph, Γ̃ a partially conditionally negative definite matrix on G, and Γ a graphical completion of Γ̃, in the sense of Definition 4.1. Then Γ is the unique such completion.

Proof. Lemma S.5.20, Item (2), with Γ 1 = Γ 3 = Γ and S = E, shows the uniqueness of the completions from Lemma 4.3 and Propositions 4.4 and 4.6.
Recall S ⊆ V × V as introduced prior to Lemma S.5.20.
∥p 1 − p d ∥ = 2d. However, by the triangle inequality this leads to a contradiction. Bakonyi and Johnson (1995) show that such a counter-example can be constructed for any non-decomposable graph.
S.5.6 Proofs of statistical inference related results

S.5.6.1 Matrix completion as likelihood optimization

Proof of Proposition 5.1. Let U : R d×d → R d(d−1)/2 denote the mapping that sends a matrix to the vector containing the entries of its upper triangular part (excluding the diagonal). Note that the restriction of U to P d is invertible, since the lower triangular part of Θ ∈ P d is defined by symmetry and the diagonal is such that the row sums are zero. In the computations below, we consider P ∈ U (P d ) and write Θ := Θ(P ) for notational convenience.
Proposition 4.6 shows that the unique solution satisfying this condition and the constraint is θ(C G (Γ)).

S.5.6.2 Matrix completion is continuous
Proof of Lemma 5.2. Enumerate the edges as E = {e 1 , . . . , e m } with m = |E| the cardinality of E. Let θ denote the mapping sending a variogram matrix Γ to its precision matrix Θ = θ(Γ), as constructed in Proposition 3.2. Let π G be the restriction map π G : R d×d → R m , M ↦ v, where v has entries v k for k = 1, . . . , m, with v k = M ij for e k = (i, j), i < j.
In words, π G extracts from a matrix M the m elements M ij corresponding to the edges (i, j) ∈ E and stacks them in a vector. Let Q G be the set of symmetric d × d matrices with zero row sums and with zeros in off-diagonal positions corresponding to non-edges of G: Q G := {M ∈ R d×d : M = M ⊤ , M 1 = 0, M ij = 0 for all i ≠ j with (i, j) ∉ E}. Clearly, M ∈ Q G is determined by its elements M ij for (i, j) ∈ E such that i < j, i.e., by π G (M ). Indeed, the elements of M below the diagonal are determined by the symmetry constraint M = M ⊤ , and its diagonal elements are determined by the zero row-sum constraint M 1 = 0. Since all restrictions imposed on M are linear, Q G is a linear subspace of R d×d of dimension m. The space Q G is thus isomorphic to R m : identify M ∈ Q G with the elements M ij for index pairs (i, j) ∈ E such that i < j. Furthermore, the map π G above, when restricted to Q G , is a linear isomorphism between Q G and R m ; in particular, it is continuous.
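The maps just described can be sketched in code (illustrative names; a path graph serves as example). The inverse of π G on Q G rebuilds the symmetric zero-row-sum matrix from the m edge values:

```python
import numpy as np

edges = [(0, 1), (1, 2), (2, 3)]  # a path graph on d = 4 nodes
d, m = 4, len(edges)

def pi_G(M):
    # extract the edge entries of M and stack them in a vector of length m
    return np.array([M[i, j] for (i, j) in edges])

def pi_G_inv(v):
    # off-diagonal entries from v, lower triangle by symmetry,
    # then the diagonal from the zero row-sum constraint
    M = np.zeros((d, d))
    for (i, j), x in zip(edges, v):
        M[i, j] = M[j, i] = x
    np.fill_diagonal(M, -M.sum(axis=1))
    return M

v = np.array([-1.0, -2.0, -0.5])
M = pi_G_inv(v)
assert np.allclose(M, M.T)               # symmetric
assert np.allclose(M @ np.ones(d), 0.0)  # zero row sums: M lies in Q_G
assert np.allclose(pi_G(M), v)           # pi_G inverts the construction
```

This makes the dimension count explicit: a matrix in Q G carries exactly m = |E| free parameters.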
The set of precision matrices Θ ∈ P 1 d with Θ ij = 0 for (i, j) ̸ ∈ E, i.e., those that figure in the matrix completion problem in Definition 4.1, is equal to P G := {M ∈ Q G : ker M = span({1}), M positive semi-definite}.
Let λ 1 (S) ≥ . . . ≥ λ d (S) be the d real eigenvalues of a symmetric matrix S ∈ R d×d , counted with multiplicities and ordered decreasingly. We have P G = {M ∈ Q G : λ d−1 (M ) > 0}: indeed, any M ∈ Q G satisfies M 1 = 0, so if λ d−1 (M ) > 0, then the only remaining zero eigenvalue is λ d (M ) and ker M = span({1}). The functions λ j are well known to be Lipschitz on {S ∈ R d×d : S = S ⊤ }. It follows that P G is open in Q G . Upon the above identification of Q G with R m via π G , we can thus view P G as an open subset of R m . Formally, this subset is denoted as P G := π G (P G ) ⊂ R m .
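The membership criterion can be illustrated as follows (a hypothetical helper, not from the paper; graph Laplacians serve as examples of matrices in Q G ):

```python
import numpy as np

def in_P_G(M, tol=1e-10):
    # for symmetric M with zero row sums, M lies in P_G iff its
    # second-smallest eigenvalue is strictly positive: the zero
    # eigenvalue (direction 1) is then the only one, so ker M = span({1})
    lam = np.sort(np.linalg.eigvalsh(M))  # ascending order
    return lam[1] > tol

# Laplacian of the path graph 1-2-3-4: PSD with kernel span({1})
L_path = np.array([[ 1., -1.,  0.,  0.],
                   [-1.,  2., -1.,  0.],
                   [ 0., -1.,  2., -1.],
                   [ 0.,  0., -1.,  1.]])
assert in_P_G(L_path)

# Laplacian of a disconnected graph: two-dimensional kernel, hence not in P_G
L_disc = np.array([[ 1., -1.,  0.,  0.],
                   [-1.,  1.,  0.,  0.],
                   [ 0.,  0.,  1., -1.],
                   [ 0.,  0., -1.,  1.]])
assert not in_P_G(L_disc)
```

Since λ d−1 is continuous in M, small perturbations within Q G cannot leave P G , which is the openness used in the proof.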
Let f 1 denote the inverse of the restriction of π G to P G , that is, f 1 := (π G | P G ) −1 , mapping P G = π G (P G ) back to P G . Since π G was continuously invertible on the set Q G that contains P G , the function f 1 is a continuous bijection too. The set of variogram matrices Γ corresponding to precision matrices M in P G is D G := {Γ ∈ D d : θ(Γ) ∈ P G }. By definition, these are exactly the variogram matrices Γ ∈ D d of which the associated precision matrices M = θ(Γ) have zero elements M ij = 0 for (i, j) ∉ E. In other words, D G corresponds to all possible solutions of the matrix completion problem in Definition 4.1 with respect to G. By Proposition 4.6, a variogram matrix Γ in D G is uniquely determined by the values of Γ ij for (i, j) ∈ E. Since Γ has zero diagonal, this means that Γ ∈ D G is uniquely determined by π G (Γ) = (Γ ij : (i, j) ∈ E, i < j). Another way to say the same thing is that the map π G restricted to D G is injective. For clarity, let D G := π G (D G ) ⊂ R m denote the image of D G under π G and let f 2 denote the restriction of π G to D G . The inverse mapping of f 2 is the completion map C G in Definition 4.7. Knowing that f 2 is continuous (since π G is continuous), we need to show that C G is continuous as well.
To do this, consider the mapping f composed of the maps introduced above; the map f is represented schematically in Figure S. From this construction it follows that C G is continuous too, as required.
Proof of Theorem 5.3. First, we show that, with probability tending to one, Γ̂ allows a completion. To this end, recall the definition of the set of conditionally negative definite matrices from (2.3), restated in (S.5.8), and let K = {v ∈ R d : v ⊥ 1, ∥v∥ ∞ = 1}. The set K is compact and, hence, the value ∆ M := max v∈K v ⊤ M v is attained for every matrix M ∈ R d×d ; replacing v by v∥v∥ ∞ −1 ∈ K, the inequality condition in (S.5.8) is seen to be equivalent to ∆ M < 0. Next, let G, Γ, and Γ̂ be as in the Theorem. Let ε = −∆ Γ /(2d 2 ) > 0, and consider the ball B ε = {M ∈ R d×d : ∥M − Γ∥ ∞ ≤ ε}, where ∥M ∥ ∞ denotes the infinity norm applied to M interpreted as a d 2 -dimensional vector. Then any M ∈ B ε satisfies ∆ M ≤ ∆ Γ + d 2 ε = ∆ Γ /2 < 0. For a set of matrices S ⊆ R d×d denote S G = {M G : M ∈ S}, with · G as in (4.3). By assumption, any realization of Γ̂ is always in R d G , and hence, with probability tending to one, Γ̂ allows a completion. Together with the continuity from Lemma 5.2, this completes the proof.