Bayesian inference for multivariate extreme value distributions

Statistical modeling of multivariate and spatial extreme events has attracted broad attention in various areas of science. Max-stable distributions and processes are the natural class of models for this purpose, and many parametric families have been developed and successfully applied. Because the likelihoods are complicated, efficient statistical inference is still an active area of research, and composite likelihood methods based only on bivariate densities are typically used. Thibaud et al. (2016, Ann. Appl. Stat., to appear) use a Bayesian approach to fit a Brown-Resnick process to extreme temperatures. In this paper, we extend this idea to a methodology that is applicable to general max-stable distributions and that uses full likelihoods. We further provide simple conditions for the asymptotic normality of the median of the posterior distribution and verify them for the commonly used models in multivariate and spatial extreme value statistics. A simulation study shows that this point estimator is considerably more efficient than the composite likelihood estimator in a frequentist framework. From a Bayesian perspective, our approach opens the way for new techniques such as Bayesian model comparison in multivariate and spatial extremes.


Introduction
Extremes and the impacts of rare events have been brought into public focus in the context of climate change or financial crises. The temporal or spatial concurrence of several such events has often proven to be most catastrophic. Arising naturally as limits of rescaled componentwise maxima of random vectors, max-stable distributions are frequently used to describe this joint behavior of extremes. The generalization to continuous domains gives rise to max-stable processes that have become popular models in spatial extreme value statistics [e.g., Davison and Gholamrezaee, 2012], and are applied in various fields such as meteorology [Buishand et al., 2008, Engelke et al., 2015, Einmahl et al., 2016] and hydrology [Asadi et al., 2015].
For a k-dimensional max-stable random vector Z = (Z_1, ..., Z_k) with unit Fréchet margins, there exists an exponent function V describing the dependence between the components of Z such that P[Z ≤ z] = exp{−V(z)}, z ∈ (0, ∞)^k. Many parametric models {F_θ, θ ∈ Θ} for the distribution function of Z have been proposed [cf. Schlather, 2002, Boldi and Davison, 2007, Kabluchko et al., 2009, Opitz, 2013], but likelihood-based inference remains challenging. The main reason is the lack of simple forms of the likelihood L(z; θ) in these models, which, by Faà di Bruno's formula, is given by

L(z; θ) = exp{−V(z; θ)} Σ_{τ ∈ P_k} Π_{j=1}^{|τ|} {−∂_{τ_j} V(z; θ)},    (1)

where P_k is the set of all partitions τ = {τ_1, ..., τ_{|τ|}} of {1, ..., k} and ∂_{τ_j} V(·; θ) denotes the partial derivative of the exponent function V = V_θ of F_θ with respect to the variables z_i, i ∈ τ_j. The fact that the cardinality of P_k is the kth Bell number, which grows super-exponentially in the dimension k, inhibits the use of maximum likelihood methods based on L(z; θ) in (1).
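The growth of |P_k| is easy to check numerically; a small sketch (Python) computing Bell numbers via the Bell triangle recurrence:

```python
def bell(k):
    """Number of partitions of {1, ..., k} (the kth Bell number),
    computed with the Bell triangle recurrence."""
    row = [1]
    for _ in range(k - 1):
        new_row = [row[-1]]            # each row starts with the last entry of the previous row
        for entry in row:
            new_row.append(new_row[-1] + entry)
        row = new_row
    return row[-1]

# the number of summands in the full likelihood (1) explodes with the dimension k
for k in (2, 5, 10, 20):
    print(k, bell(k))   # e.g. bell(10) = 115975
```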
The most common way to avoid this problem is to maximize the composite pairwise likelihood, which relies only on the information in bivariate sub-vectors of Z [Padoan et al., 2010]. Apart from the fact that this likelihood is misspecified, there may also be considerable losses in efficiency from using the composition of bivariate likelihoods instead of the full likelihood L(z; θ). To mitigate this efficiency loss, higher-order composite likelihoods have been considered [Genton et al., 2011, Huser and Davison, 2013, Castruccio et al., 2016].
In practice, inference is either based on threshold exceedances [Engelke et al., 2014, Wadsworth and Tawn, 2014, Thibaud and Opitz, 2015] or on the block maxima method. For the latter, the data, typically a multivariate time series, are split into disjoint blocks, and a max-stable distribution is fitted to the componentwise maxima within each block. To increase the efficiency, not only the block maxima but also additional information from the time series can be exploited. The componentwise occurrence times of the maxima within each block lead to a partition τ of {1, ..., k}, with indices belonging to the same subset if and only if the maxima in these components occurred at the same time. Knowledge of this partition makes inference much easier, as a single summand L(z, τ; θ) in the full likelihood L(z; θ) given in (1) corresponds to the likelihood contribution of the specific partition τ. This joint likelihood L(z, τ; θ) was introduced in Stephenson and Tawn [2005b]; it is tractable for many extreme value models and, consequently, can be used for inference if occurrence times are available.
In real data applications, however, the distribution of the block maxima is only approximated by a max-stable distribution, and the distribution of the observed partitions of occurrence times is only an approximation to the limiting distribution (as the block size tends to infinity) given by the likelihood L(z, τ; θ). This approximation introduces a significant bias in the Stephenson-Tawn estimator, and a bias correction has been proposed in Wadsworth [2015].
In many cases, only observations z^(1), ..., z^(N) ∈ R^k of the random max-stable vector Z are available, but there is no information about the corresponding partitions τ^(1), ..., τ^(N). In this case, the Stephenson-Tawn likelihood cannot be used since the partition information is missing. In the context of conditional simulation of max-stable processes, Dombry et al. [2013] proposed a Gibbs sampler to obtain conditional samples of τ^(l) given the observation z^(l), l = 1, ..., N. Thibaud et al. [2016] use this approach to treat the missing partitions as latent variables in a Bayesian framework to estimate the parameters of a Brown-Resnick model [cf., Kabluchko et al., 2009] for extreme temperature. Via a Markov chain Monte Carlo algorithm, they obtain samples from the posterior distribution

π(θ, τ^(1), ..., τ^(N) | z^(1), ..., z^(N)) ∝ π_θ(θ) Π_{l=1}^N L(z^(l), τ^(l); θ),    (2)

where π_θ is the prior distribution on Θ.
In this paper we extend the Bayesian approach to general max-stable distributions and provide various examples of parametric models F_θ where it can be applied. The first focus is to study the statistical efficiency of the point estimator obtained as the median of the posterior distribution (2). This frequentist perspective allows us to compare the efficiency of the Bayesian estimator, which uses the full likelihoods, to that of other frequentist estimators. A simulation study shows a substantial improvement of the estimation error when using full likelihoods rather than the commonly used pairwise likelihood estimator of Padoan et al. [2010].
From the Bayesian perspective, this approach opens up many new possibilities for Bayesian techniques in multivariate extreme value statistics. Besides readily available credible intervals, we discuss how Bayesian model comparison can be implemented. Thanks to the full, well-specified likelihoods in our approach, no adjustment of the posterior distribution, as in composite pairwise likelihood methods [Ribatet et al., 2012], is required.
In Section 2 we provide some background on max-stable distributions and their likelihoods, and we present the general methodology for the Bayesian full-likelihood approach. Section 3 develops an asymptotic theory for the resulting estimator. We show in Section 4 that our method and the asymptotic theory are applicable to the popular models in multivariate and spatial extremes, including the Brown-Resnick and extremal-t processes. The simulation studies in Section 5 quantify the finite-sample efficiency gains of the Bayesian approach when used as a frequentist point estimator of the extremal dependence parameters. Interestingly, this advantage persists when the dependence is a nuisance parameter and one is only interested in estimating marginal parameters (Section 5.3). The posterior distribution and genuinely Bayesian techniques are studied in Section 6, with a focus on Bayesian model comparison. Section 7 concludes the paper with a discussion of computational aspects.

Methodology
In Section 2.1 we review some facts about max-stable distributions and their likelihoods. We describe the general setup of our approach and review the Markov chain Monte Carlo algorithm from Thibaud et al. [2016] and the Gibbs sampler from Dombry et al. [2013] in Section 2.2.

Max-stable distributions, partitions and joint likelihoods
Let us assume from now on that the max-stable vector Z belongs to a parametric family {F_θ, θ ∈ Θ}, where Θ ⊂ R^p is the parameter space, and that it admits a density f_θ. The exponent function of F_θ is denoted by V(·; θ); when there is no risk of confusion, we omit the dependence on θ for simplicity.
Recall that if Z has standard Fréchet margins, it can be represented as the componentwise maximum

Z_i = max_{j ∈ N} ψ_i^(j), i = 1, ..., k,

where {ψ^(j) : j ∈ N} are the points of a Poisson point process on E = [0, ∞)^k \ {0} with intensity measure Λ, the exponent measure of Z. For more details and an exact simulation method for Z via this representation, we refer to Dombry et al. [2016].
Analogously to the occurrence times in the case of block maxima, the Poisson point process induces a random limit partition T of the index set {1, ..., k}, where two indices i_1 ≠ i_2 belong to the same subset if and only if Z_{i_1} = ψ_{i_1}^(j) and Z_{i_2} = ψ_{i_2}^(j) for the same j ∈ N [Dombry and Éyi-Minko, 2013]. The joint likelihood L(z, τ; θ) of the max-stable vector Z and the limit partition T under the model can be computed in closed form, and it equals the likelihood introduced in Stephenson and Tawn [2005b]. This fact provides another interpretation of Equation (1), namely that the likelihood of Z is the integrated joint likelihood of Z and T.
In Dombry et al. [2017] it has been shown that the existence of a density for the simple max-stable random vector Z with exponent measure Λ is equivalent to the existence of a density λ_I for the restriction of Λ to each of the faces

E_I = {z ∈ E : z_i > 0 for i ∈ I, z_i = 0 for i ∉ I}, ∅ ≠ I ⊂ {1, ..., k},

of E = [0, ∞)^k \ {0}. Thus, the Stephenson-Tawn likelihood L(z, τ; θ) can be rewritten as

L(z, τ; θ) = exp{−V(z; θ)} Π_{j=1}^ℓ ω(τ_j, z),    (5)

with weights

ω(τ_j, z) = Σ_{I ⊇ τ_j} ∫_{×_{i ∈ I \ τ_j} [0, z_i]} λ_I(z_{τ_j}, u) du,    (6)

where τ_1, ..., τ_ℓ denote the ℓ = |τ| different blocks of the partition τ, and z_{τ_j} and z_{τ_j^c} are the restrictions of z to τ_j and τ_j^c = {1, ..., k} \ τ_j, respectively. Equation (5) provides a formula for the joint likelihood of a max-stable distribution with unit Fréchet margins and its partition. From this we can deduce a formula for the joint likelihood of a general max-stable distribution that admits a density. More precisely, let Z̃ be a k-dimensional max-stable random vector whose ith component, i = 1, ..., k, has a generalized extreme value distribution with parameters (μ_i, σ_i, ξ_i). Then U_i(Z̃_i) has unit Fréchet distribution, where U_i denotes the marginal transformation

U_i(z) = {1 + ξ_i (z − μ_i)/σ_i}_+^{1/ξ_i}.

For a vector z_I = (z_i)_{i ∈ I} with I ⊂ {1, ..., k}, we define U(z_I) = (U_i(z_i))_{i ∈ I}, where V is the exponent function of the normalized max-stable distribution U(Z̃). Consequently, the joint density of the general max-stable vector Z̃ and the limit partition T is

L(z, τ; θ) = exp[−V{U(z); θ}] Π_{j=1}^ℓ ω{τ_j, U(z)} Π_{i=1}^k U_i'(z_i).    (7)
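The marginal transformation U_i can be sketched as follows (Python; the quantile computation at the end is only a sanity check of the identity exp{−1/U_i(z)} = G_i(z), where G_i is the GEV distribution function):

```python
import math

def gev_to_frechet(z, mu, sigma, xi):
    """Marginal transformation U_i: maps a GEV(mu, sigma, xi) variable to the
    unit Frechet scale, so that P[U_i(Z_i) <= u] = exp(-1/u)."""
    if abs(xi) < 1e-12:                     # Gumbel limit xi -> 0
        return math.exp((z - mu) / sigma)
    t = 1.0 + xi * (z - mu) / sigma
    if t <= 0.0:
        raise ValueError("z outside the support of the GEV distribution")
    return t ** (1.0 / xi)

def gev_quantile(p, mu, sigma, xi):
    """p-quantile of the GEV distribution (xi != 0)."""
    return mu + sigma * ((-math.log(p)) ** (-xi) - 1.0) / xi

# pushing the GEV p-quantile through U recovers exp(-1/U) = p
p = 0.7
u = gev_to_frechet(gev_quantile(p, 1.0, 2.0, 0.3), 1.0, 2.0, 0.3)
print(math.exp(-1.0 / u))   # -> 0.7 up to floating point error
```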

Bayesian inference and Markov chain Monte Carlo
Extreme value statistics is concerned with the estimation and uncertainty quantification of the parameter vector θ ∈ Θ. Here, θ might include both marginal and dependence parameters of the max-stable model. In a Bayesian setup, we introduce a prior π_θ(θ) on the parameter space Θ. Given independent data z^(1), ..., z^(N) ∈ R^k from the max-stable distribution Z ∼ F_θ, we are interested in the posterior distribution of the parameter θ conditional on the data. As explained in Section 1, the complex structure of the full likelihood L({z^(l)}; θ) = Π_{l=1}^N L(z^(l); θ) prevents a direct assessment of the posterior distribution, which is proportional to the product of L({z^(l)}; θ) and the prior density π_θ(θ). Instead, we introduce the corresponding limit partitions T^(1), ..., T^(N) as latent variables and sample from the joint distribution of (θ, T^(1), ..., T^(N)) conditional on the data z^(1), ..., z^(N), which is given in Equation (2).
It is customary to use Markov chain Monte Carlo methods to sample from a target distribution that is known only up to a multiplicative constant. The aim is to construct a Markov chain that has the target distribution as its stationary distribution and that has good mixing properties. To this end, in each step of the Markov chain, the parameter vector θ and the partitions T^(1), ..., T^(N) are updated separately, by a Metropolis-Hastings step and a Gibbs sampler, respectively [cf., Thibaud et al., 2016].
For fixed partitions T^(l) = τ^(l), l = 1, ..., N, and the current state θ of the parameter vector, we propose a new state θ* according to a probability density q(θ, ·) which satisfies q(θ_1, θ_2) > 0 if and only if q(θ_2, θ_1) > 0 for θ_1, θ_2 ∈ Θ. The proposal is accepted, that is, θ is updated to θ*, with probability

a(θ, θ*) = min{1, [π_θ(θ*) Π_{l=1}^N L(z^(l), τ^(l); θ*) q(θ*, θ)] / [π_θ(θ) Π_{l=1}^N L(z^(l), τ^(l); θ) q(θ, θ*)]},

where L(z, τ; θ) is given by (7). In general, there are various ways of choosing an appropriate proposal density q. For instance, it might be advisable to update the vector θ component by component. It has to be ensured that any state θ_2 with positive posterior density can be reached from any other state θ_1 with positive posterior density in a finite number of steps, that is, that the Markov chain is irreducible. The convergence of the Markov chain to its stationary distribution (2) is then guaranteed. Note that the framework described above enables estimation of marginal and dependence parameters simultaneously. In particular, it allows for response surface methodology such as (log-)linear models for the marginal parameters. For a fixed parameter vector θ ∈ Θ we use the Gibbs sampler of Dombry et al. [2013] to update the current states of the partitions τ^(1), ..., τ^(N) ∈ P_k conditional on the data z^(1), ..., z^(N). Thanks to independence, for each l = 1, ..., N, we can update τ = τ^(l) conditional on z = z^(l) separately, where the conditional distribution is

P_θ[T = τ | Z = z] = C_z^{−1} Π_{j=1}^{|τ|} ω{τ_j, U(z)},

with normalization constant C_z = Σ_{τ' ∈ P_k} Π_{j=1}^{|τ'|} ω{τ'_j, U(z)}.
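A generic Metropolis-Hastings update of θ with the partitions held fixed can be sketched as follows (Python; log_post stands for the log of the unnormalized posterior π_θ(θ) Π_l L(z^(l), τ^(l); θ), and the Beta target in the usage example is a made-up toy, not the max-stable likelihood):

```python
import math, random

def mh_step(theta, log_post, propose, log_q=None):
    """One Metropolis-Hastings update: propose theta* ~ q(theta, .) and accept
    with probability min(1, ratio); log_q(a, b) = log q(a, b) may be omitted
    for a symmetric proposal."""
    theta_star = propose(theta)
    log_alpha = log_post(theta_star) - log_post(theta)
    if log_q is not None:   # Hastings correction for asymmetric proposals
        log_alpha += log_q(theta_star, theta) - log_q(theta, theta_star)
    if math.log(random.random()) < log_alpha:
        return theta_star   # accept
    return theta            # reject

# toy usage: random-walk chain targeting a Beta(3, 2) density on (0, 1)
random.seed(1)
log_post = lambda t: 2 * math.log(t) + math.log(1 - t) if 0 < t < 1 else -math.inf
propose = lambda t: t + random.uniform(-0.1, 0.1)   # symmetric proposal
chain = [0.5]
for _ in range(2000):
    chain.append(mh_step(chain[-1], log_post, propose))
```

Proposals that leave the support get log-posterior −∞ and are always rejected, so the chain stays in (0, 1) automatically.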
For i ∈ {1, ..., k}, let τ_{−i} be the restriction of τ to the set {1, ..., k} \ {i}. As usual with Gibbs samplers, our goal is to simulate from the conditional distribution

P_θ[T = τ* | Z = z, T_{−i} = τ_{−i}],    (10)

where τ is the current state of the Markov chain and P_θ denotes the probability under the assumption that Z follows the law F_θ. It is easy to see that the number of possible updates according to (10) is always at most k, so that a combinatorial explosion is avoided. Indeed, the index i can be reallocated to any of the components of τ_{−i} or to a new component consisting of the single point {i}: with ℓ = |τ|, the number of possible updates τ* ∈ P_k such that τ*_{−i} = τ_{−i} equals ℓ if {i} is a partitioning set of τ, and ℓ + 1 otherwise.
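The enumeration of these local moves can be sketched as follows (Python; partitions are represented as lists of index blocks):

```python
def local_moves(tau, i):
    """All partitions tau* with tau*_{-i} = tau_{-i}: remove index i and
    re-attach it to each remaining block, or open a new singleton block {i}."""
    base = [sorted(b - {i}) for b in map(set, tau) if b - {i}]
    moves = []
    for j in range(len(base)):          # re-attach i to an existing block
        move = [list(b) for b in base]
        move[j] = sorted(move[j] + [i])
        moves.append(move)
    moves.append(base + [[i]])          # new singleton block {i}
    return moves

tau = [[1, 2], [3], [4, 5]]             # a partition of {1, ..., 5} with l = 3 blocks
print(len(local_moves(tau, 3)))         # {3} is a block of tau: l = 3 moves
print(len(local_moves(tau, 1)))         # otherwise: l + 1 = 4 moves
```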
The conditional probability (10) satisfies

P_θ[T = τ* | Z = z, T_{−i} = τ_{−i}] ∝ Π_{j=1}^{|τ*|} ω{τ*_j, U(z)}    (11)

for all τ* ∈ P_k with τ*_{−i} = τ_{−i}. Since τ and τ* share many components, all the factors on the right-hand side of (11) cancel out except at most four of them. This makes the Gibbs sampler particularly convenient.
We suggest a random scan implementation of the Gibbs sampler, meaning that one iteration randomly selects an element i ∈ {1, ..., k} and then updates the current state τ according to the proposal distribution (10). For the sake of simplicity, we use the uniform random scan, that is, i is selected according to the uniform distribution on {1, ..., k}.

Asymptotic results
In the previous section, we presented a procedure that allows us to sample from the posterior distribution of the parameter θ of a parametric model {f_θ, θ ∈ Θ} given a sample of N observations. In this section, we discuss the asymptotic properties of the posterior distribution as the sample size N tends to infinity.
The asymptotic analysis of Bayes procedures usually relies on the Bernstein-von Mises theorem, which provides an asymptotic normal approximation of the posterior distribution of √N(θ − θ_0), given observations z^(1), ..., z^(N) from the parametric model f_{θ_0}. The theorem then implies the asymptotic normality and efficiency of Bayesian point estimators, such as the posterior mean or posterior median, with the same asymptotic variance as the maximum likelihood estimator.
Theorem 1 (Bernstein-von Mises, Theorems 10.1 and 10.8 in van der Vaart [1998]). Let the parametric model {f_θ, θ ∈ Θ} be differentiable in quadratic mean at θ_0 with non-singular Fisher information matrix I_{θ_0}, and assume that the mapping θ ↦ f_θ(z) is differentiable at θ_0 for f_{θ_0}-almost every z. Suppose that, for every ε > 0, there exists a sequence of uniformly consistent tests φ_N, that is,

P_{θ_0}(φ_N) → 0 and sup_{‖θ − θ_0‖ ≥ ε} P_θ(1 − φ_N) → 0, as N → ∞,    (12)

where P_θ denotes the probability measure induced by N independent copies of Z ∼ f_θ. Suppose that the prior distribution π_prior(dθ) is absolutely continuous in a neighborhood of θ_0 with a continuous positive density at θ_0. Then, under the distribution f_{θ_0}, the posterior distribution satisfies

‖ π{√N(θ − θ_0) ∈ · | z^(1), ..., z^(N)} − N(Δ_{N,θ_0}, I_{θ_0}^{−1}) ‖_TV → 0 in probability,

where Δ_{N,θ_0} = N^{−1/2} Σ_{l=1}^N I_{θ_0}^{−1} ℓ̇_{θ_0}(z^(l)) and ℓ̇_{θ_0} denotes the score function of the model. As a consequence, if the prior distribution π_prior(dθ) has a finite mean, the posterior median θBayes_N is asymptotically normal and efficient, that is, √N(θBayes_N − θ_0) converges in distribution to N(0, I_{θ_0}^{−1}). In order to apply this theorem to max-stable distributions, two main assumptions are required: the differentiability in quadratic mean of the statistical model and the existence of uniformly consistent tests satisfying (12). Differentiability in quadratic mean is a technical condition that imposes a certain regularity on the likelihood f_{θ_0}. For the case of multivariate max-stable models this property has been studied in detail in Dombry et al. [2017], where equivalent conditions on the exponent function and the spectral density are given.
We now discuss the existence of uniformly consistent tests and propose a criterion based on pairwise extremal coefficients. This criterion turns out to be simple and yet general enough, since it applies to most of the standard models in extreme value theory. Indeed, in many cases, pairwise extremal coefficients can be computed explicitly and allow one to identify the parameter θ.
For a max-stable vector Z with unit Fréchet margins, the pairwise extremal coefficient τ_{i_1,i_2} is defined through

P[Z_{i_1} ≤ z, Z_{i_2} ≤ z] = exp(−τ_{i_1,i_2}/z), z > 0.

It is the scale exponent of the unit Fréchet variable Z_{i_1} ∨ Z_{i_2} and hence satisfies 1 ≤ τ_{i_1,i_2} ≤ 2, the two boundary cases corresponding to complete dependence and independence, respectively. In the case that Z follows the distribution f_θ, we write τ_{i_1,i_2}(θ) for the associated pairwise extremal coefficient.
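For the logistic model of Subsection 4.1, for instance, the pairwise extremal coefficient has the closed form 2^θ (assuming the standard parametrization V(z) = (Σ_i z_i^{−1/θ})^θ); a small sketch (Python) evaluating it through the exponent function:

```python
def logistic_V(z, theta):
    """Exponent function of the logistic model: V(z) = (sum_i z_i^(-1/theta))^theta."""
    return sum(x ** (-1.0 / theta) for x in z) ** theta

def pairwise_extremal_coeff(theta):
    """tau_{i1,i2}(theta) = V(1, 1) for a bivariate margin; equals 2^theta,
    strictly monotone in theta (the situation of Remark 2)."""
    return logistic_V((1.0, 1.0), theta)

for theta in (0.1, 0.5, 0.9):
    print(theta, pairwise_extremal_coeff(theta))   # values between 1 and 2
```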
Remark 1. Condition (13) implies, in particular, the identifiability of the model parameters θ ∈ Θ through the pairwise extremal coefficients τ_{i_1,i_2}(θ), 1 ≤ i_1 < i_2 ≤ k.

Remark 2. If θ = (θ_1, ..., θ_p) ∈ Θ, and for any 1 ≤ j ≤ p there exists a pair (i_1, i_2) such that τ_{i_1,i_2}(θ) depends only on θ_j and is strictly monotone with respect to this component, then Equation (13) is satisfied.

Examples
In the previous sections we discussed the practical implementation of a Markov chain Monte Carlo algorithm to obtain samples from the posterior distribution L(θ, {τ^(l)}_{l=1}^N | {z^(l)}_{l=1}^N) and its asymptotic behavior as N → ∞. The only model-specific quantities needed to run the algorithm are the weights ω(τ_j, z) in (6). In this section, we provide explicit formulas for these weights for several classes of popular max-stable models and prove that the models satisfy the assumptions of the Bernstein-von Mises theorem; see Theorem 1. It follows that the posterior median θBayes_N is asymptotically normal and efficient for these models.
For the calculation of the weights ω(τ_j, z), we first note that all of the examples in this section admit densities as a simple consequence of Prop. 2.1 in Dombry et al. [2017]. We further note that, for the models considered in Subsections 4.1-4.4, we have λ_I ≡ 0 for all I ⊊ {1, ..., k}, that is, the exponent measure Λ puts no mass on the lower-dimensional faces E_I of E, and, consequently, Equation (6) simplifies to

ω(τ_j, z) = ∫_{×_{i ∉ τ_j} [0, z_i]} λ(z_{τ_j}, u) du = −∂_{τ_j} V(z; θ),

where λ denotes the density of Λ on the interior of E. For the posterior median θBayes_N, in the sequel, we will always assume that the prior distribution is absolutely continuous with strictly positive density in a neighborhood of θ_0, and that it has a finite mean. Given the differentiability in quadratic mean of the model, it suffices to verify condition (13) in Proposition 1. This implies the existence of a uniformly consistent sequence of tests and, by Remark 1, the identifiability of the model. Theorem 1 then ensures asymptotic normality of the posterior median.
Analogously to the notation z_I = (z_i)_{i ∈ I} for a vector z ∈ R^k and an index set ∅ ≠ I ⊂ {1, ..., k}, we write A_{I,J} = (A_{ij})_{i ∈ I, j ∈ J} for a matrix A = (A_{ij})_{1 ≤ i,j ≤ k} and index sets ∅ ≠ I, J ⊂ {1, ..., k}. Proofs for the results presented in this section can be found in Appendix A.

The logistic model
One of the simplest multivariate extreme value distributions is the logistic model with exponent function

V(z; θ) = (Σ_{i=1}^k z_i^{−1/θ})^θ, z ∈ (0, ∞)^k,    (15)

where θ ∈ (0, 1). The logistic model is symmetric in its variables and interpolates between independence as θ ↑ 1 and complete dependence as θ ↓ 0.
Proposition 2. Let τ = (τ_1, ..., τ_ℓ) ∈ P_k and z ∈ E. The weights ω(τ_j, z) in (6) for the logistic model with exponent function (15) are

ω(τ_j, z) = θ^{1−|τ_j|} {Γ(|τ_j| − θ)/Γ(1 − θ)} (Σ_{i=1}^k z_i^{−1/θ})^{θ−|τ_j|} Π_{i ∈ τ_j} z_i^{−1/θ−1}.    (16)

Remark 3. From (16) it can be seen that we can also write ω(τ_j, z) = c(τ_j, z) ω̃(τ_j, z) with the simplified weights

ω̃(τ_j, z) = θ (Σ_{i=1}^k z_i^{−1/θ})^θ Γ(|τ_j| − θ)/Γ(1 − θ),

where the product Π_{j=1}^ℓ c(τ_j, z) does not depend on the partition τ. This suggests using the simplified weights ω̃ for the Gibbs sampler.
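Since ω(τ_j, z) = −∂_{τ_j} V(z; θ) for the logistic model, the weights can be checked numerically against a finite-difference mixed derivative of the exponent function; a sketch (Python, with arbitrary test values for z and θ):

```python
import math

def V(z, theta):
    """Logistic exponent function V(z) = (sum_i z_i^(-1/theta))^theta."""
    return sum(x ** (-1.0 / theta) for x in z) ** theta

def omega(block, z, theta):
    """Logistic weight omega(tau_j, z) = -d_{tau_j} V(z; theta) for a block
    of indices tau_j with m = |tau_j|, computed on the log scale for stability."""
    m = len(block)
    s = sum(x ** (-1.0 / theta) for x in z)
    log_w = ((1 - m) * math.log(theta)
             + math.lgamma(m - theta) - math.lgamma(1 - theta)
             + (theta - m) * math.log(s)
             + sum((-1.0 / theta - 1.0) * math.log(z[i]) for i in block))
    return math.exp(log_w)

# numerical check: -d^2 V / (dz_0 dz_1) via central finite differences (k = 3)
z, theta, h = [1.0, 2.0, 3.0], 0.6, 1e-4
V01 = lambda a, b: V([a, b, z[2]], theta)
fd = -(V01(z[0] + h, z[1] + h) - V01(z[0] + h, z[1] - h)
       - V01(z[0] - h, z[1] + h) + V01(z[0] - h, z[1] - h)) / (4 * h * h)
print(omega([0, 1], z, theta), fd)   # the two values should agree closely
```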
Proposition 3. The logistic model is differentiable in quadratic mean for every θ_0 ∈ (0, 1), and the posterior median θBayes_N is asymptotically normal and efficient as N → ∞.

The Dirichlet model
The Dirichlet model [Coles and Tawn, 1991] is defined by its spectral density h on the simplex S_{k−1} = {w ∈ [0, 1]^k : w_1 + ... + w_k = 1}, and it has no mass on lower-dimensional faces of S_{k−1} [Coles and Tawn, 1991]. Equivalently, the exponent function of the Dirichlet model is given by

V(z) = k E[max_{i=1,...,k} (W_i/z_i)],

where W is a random vector with density h(w).
Remark 4. We believe that the result for the posterior median holds true even for every θ_0 ∈ Θ. In the proof of Proposition 5, we need the partial derivatives of (α_1, α_2) ↦ τ(α_1, α_2) to be negative, but this can only be concluded almost everywhere.
A popular model in spatial extremes is the extremal-t process [Opitz, 2013], a max-stable process {Z(x), x ∈ R^d} whose finite-dimensional distributions (Z(x_1), ..., Z(x_k)), x_1, ..., x_k ∈ R^d, have an exponent function of the form (19), where the Gaussian vector is replaced by a standardized stationary Gaussian process {W(x), x ∈ R^d} evaluated at x_1, ..., x_k. The correlation matrix Σ then has entries Σ_{ij} = ρ(x_i − x_j), where ρ : R^d → [−1, 1] is the correlation function of the Gaussian process W. The special case ν = 1 corresponds to the extremal Gaussian process [Schlather, 2002], also called the Schlather process.
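Building the matrix Σ from a correlation function can be sketched as follows (Python; the powered exponential family ρ(h) = exp{−(‖h‖/s)^α} is used purely as an illustrative choice of stationary correlation function):

```python
import math

def corr_matrix(locs, s, alpha):
    """Correlation matrix Sigma_ij = rho(x_i - x_j) of a stationary Gaussian
    process, for the (illustrative) powered exponential correlation
    rho(h) = exp(-(|h|/s)^alpha), s > 0, alpha in (0, 2]."""
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    def rho(h):
        return math.exp(-((h / s) ** alpha))
    return [[rho(dist(x, y)) for y in locs] for x in locs]

# three locations in R^2
locs = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
Sigma = corr_matrix(locs, s=1.5, alpha=1.0)
```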
Corollary 1. Let Z be a Schlather process on R^d with correlation function ρ coming from the parametric family ρ_{(s,α)}(h) = exp{−(‖h‖/s)^α}, s > 0, α ∈ (0, 2]. Suppose that Z is observed at pairwise distinct locations t_1, ..., t_k ∈ R^d such that not all pairs of locations have the same Euclidean distance. Then the posterior median of θ = (s, α) is asymptotically normal.
Proposition 9. The Hüsler-Reiss model is differentiable in quadratic mean for every parameter θ_0 = Λ that is a strictly conditionally negative definite matrix. Furthermore, the posterior median θBayes_N is asymptotically normal and efficient as N → ∞.
Hüsler-Reiss distributions are the finite-dimensional distributions of the max-stable Brown-Resnick process, a popular class in spatial extreme value statistics. Here, the Gaussian vectors (W_1, ..., W_k) in (21) are the finite-dimensional distributions of a centered Gaussian process {W(x), x ∈ R^d}, which is parameterized via a conditionally negative definite variogram γ : R^d → [0, ∞), and the resulting Brown-Resnick process is stationary [Brown and Resnick, 1977, Kabluchko et al., 2009]. The most common parametric class of variograms belonging to Gaussian processes with stationary increments is the class of fractional variograms, which we consider in the following corollary.
Corollary 2. Consider a Brown-Resnick process on R^d with variogram coming from the parametric family γ_{(s,α)}(h) = (‖h‖/s)^α, s > 0, α ∈ (0, 2]. Suppose that the process is observed at a finite set of locations t_1, ..., t_m ∈ R^d such that the pairwise Euclidean distances are not all equal. Then the posterior median of θ = (s, α) is asymptotically normal.
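In the spirit of Remark 2, identifiability for the Brown-Resnick model can be seen through the well-known pairwise extremal coefficient 2Φ(√γ(h)/2), which is strictly monotone in the variogram; a sketch (Python) with the fractional variogram γ(h) = (‖h‖/s)^α:

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def br_extremal_coeff(h, s, alpha):
    """Pairwise extremal coefficient 2 * Phi(sqrt(gamma(h)) / 2) of a
    Brown-Resnick process with fractional variogram gamma(h) = (|h|/s)^alpha."""
    gamma = (h / s) ** alpha
    return 2.0 * std_normal_cdf(math.sqrt(gamma) / 2.0)

# grows from 1 (complete dependence) towards 2 (independence) with distance
for h in (0.1, 1.0, 10.0, 100.0):
    print(h, br_extremal_coeff(h, s=1.0, alpha=1.0))
```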

Simulation Study
Let z^(l) = (z_1^(l), ..., z_k^(l)), l = 1, ..., N, be N realizations of a k-dimensional max-stable vector Z whose distribution belongs to some parametric family {F_θ, θ ∈ Θ}. As described in Section 2, including the partition τ^(l) associated with a realization z^(l) in a Bayesian framework allows us to obtain samples from the posterior distribution L(θ | z^(1), ..., z^(N)) of θ given the data. This procedure uses the full dependence information of the multivariate distribution of Z. This is in contrast to frequentist maximum likelihood estimation for the max-stable vector Z, where even in moderate dimensions the likelihoods are too complicated for practical applications. Instead, at the price of likelihood misspecification, it is common practice to use only pairwise likelihoods, where the pairs are treated as mutually independent. The maximum pairwise likelihood estimator [Padoan et al., 2010] is then

θPL = argmax_{θ ∈ Θ} Σ_{l=1}^N Σ_{1 ≤ i < j ≤ k} log f_{θ;i,j}(z_i^(l), z_j^(l)),    (25)

where f_{θ;i,j} denotes the joint density of the ith and jth components of Z under the model F_θ. Using only bivariate information on the dependence results in efficiency losses.
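A crude sketch (Python) of the pairwise estimator (25) for the logistic model, using the closed-form bivariate density f = exp(−V)(V_1 V_2 − V_12) and a grid search over θ (the data and the grid are made up for illustration):

```python
import math
from itertools import combinations

def bilog_density(z1, z2, theta):
    """Bivariate logistic density f = exp(-V) (V_1 V_2 - V_12), where
    V(z) = (z1^(-1/theta) + z2^(-1/theta))^theta."""
    a = -1.0 / theta
    s = z1 ** a + z2 ** a
    prod = (z1 * z2) ** (a - 1.0)
    return (math.exp(-s ** theta)
            * (s ** (2 * theta - 2) + ((1 - theta) / theta) * s ** (theta - 2))
            * prod)

def pairwise_loglik(theta, data):
    """Composite pairwise log-likelihood: sum over samples l and pairs i < j
    of log f_{theta; i, j}(z_i^(l), z_j^(l))."""
    return sum(math.log(bilog_density(z[i], z[j], theta))
               for z in data for i, j in combinations(range(len(z)), 2))

# theta_PL maximizes the pairwise log-likelihood; here over a coarse grid
data = [(0.8, 1.3, 2.1), (1.5, 0.9, 1.1)]       # made-up toy observations
grid = [i / 100 for i in range(5, 100, 5)]
theta_pl = max(grid, key=lambda t: pairwise_loglik(t, data))
```

In practice the maximization is done with a numerical optimizer rather than a grid, but the objective function is exactly the double sum above.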
In this section, we analyze the performance of our proposed Bayesian estimator and compare it to θPL and other existing methods. Since the latter are all frequentist approaches, we turn the posterior into a point estimator: from a Markov chain whose stationary distribution is the posterior, we obtain the estimate θBayes of θ as the posterior median, that is, θBayes = median L(θ | z^(1), ..., z^(N)).
As the parametric model we choose the logistic distribution introduced in Subsection 4.1 with parameter space Θ = (0, 1) and a uniform prior. This choice covers a range of situations from strong to very weak dependence. Other choices of parametric models will result in different efficiency gains, but the general observations in the next sections should remain the same.
We note that other functionals of the posterior distribution can be used to obtain point estimators. Simulations based on the posterior mean, for instance, gave very similar results, and we therefore restrict to the posterior median in the sequel. Similarly, changing the prior distribution does not have a strong effect on the posterior distribution for the sample sizes we consider; see also Section 6.1.

[Fig 1: root mean squared error as a function of θ, with panels for the dimensions k = 6, 10 and 50.]

Max-stable data
We first take the marginal parameters to be fixed and known and quantify the efficiency gains of θBayes compared to θPL .We simulate N = 100 samples z (1) , . . ., z (N ) from the logistic distribution for different dimensions k ∈ {6, 10, 50} and different dependence parameters θ = 0.1 × i, i = 1, . . ., 9. For each combination of dimension k and parameter θ we then run a Markov chain with length 1500, where we discard the first 500 steps as the burn-in time.
The empirical median of the remaining 1000 elements gives θBayes. The chain is sufficiently long to reliably estimate the posterior median; see also the mixing properties in Section 6.1. The maximum pairwise likelihood estimator θPL is obtained according to (25). The whole procedure is repeated 1500 times to compute the corresponding root mean squared errors shown in Figure 1.
As expected, the use of the full dependence information substantially decreases the root mean squared errors and thus increases the efficiency of the estimates. In extreme value statistics, where typically only small data sets are available, this allows a reduction of the uncertainty due to parameter estimation. The advantage of this additional information becomes stronger for both higher dimensions and weaker dependence, analogously to the observations in Huser et al. [2015]. This behavior can to some extent be understood from the results in Shi [1995] on the Fisher information of the logistic distribution for different dimensions and dependence parameters. When θ ↓ 0, pairwise likelihood performs just as well as full likelihood, which is sensible since, up to a multiplicative constant, the pairwise likelihood equals the full likelihood when θ = 0.
It is interesting to note that the estimates θBayes appear to be unbiased in almost all cases, whereas the pairwise estimator has a finite sample bias.

Data in the max-domain of attraction
In applications, the max-stable vector Z might not be observed exactly but only as an approximation by componentwise block maxima of data vectors X^(1), ..., X^(b) in its max-domain of attraction with standard Fréchet margins, where b ∈ N is the block size. Indeed, the random vector

Z̃ = b^{−1} max_{t=1,...,b} X^(t),

where the maximum is taken componentwise, approximates the distribution of Z, and the approximation improves with increasing b. In this situation we can associate with Z̃ the partition of occurrence times of the maxima, say τ. Stephenson and Tawn [2005b] proposed to use this information on the partition to simplify the likelihood of the max-stable distribution. For N observations z^(1), ..., z^(N) of Z̃ with partitions τ^(1), ..., τ^(N) they defined the estimator

θST = argmax_{θ ∈ Θ} Σ_{l=1}^N log L(z^(l), τ^(l); θ).

This estimator suffers from two kinds of misspecification biases. Firstly, the z^(l) are only approximately distributed according to Z and, secondly, the partitions τ^(l) are only finite-sample approximations to the true distribution of the limit partition T.
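The construction of block maxima and the associated occurrence-time partition can be sketched as follows (Python; independent unit Fréchet entries serve as a toy stand-in for a series in the max-domain of attraction):

```python
import math, random

def block_max_and_partition(block):
    """Componentwise maxima of a block of iid vectors, rescaled by the block
    size b (which preserves unit Frechet margins), together with the
    occurrence-time partition tau: components i1, i2 fall into the same set
    iff their maxima occurred at the same time point."""
    b, k = len(block), len(block[0])
    argmax = [max(range(b), key=lambda t: block[t][i]) for i in range(k)]
    z = [max(block[t][i] for t in range(b)) / b for i in range(k)]
    groups = {}
    for i, t in enumerate(argmax):
        groups.setdefault(t, []).append(i)
    return z, sorted(groups.values())

random.seed(7)
frechet = lambda: -1.0 / math.log(random.random())       # unit Frechet draw
block = [[frechet() for _ in range(5)] for _ in range(50)]   # b = 50, k = 5
z, tau = block_max_and_partition(block)
```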
For the latter, Wadsworth [2015] proposed a bias reduction method for moderate dimensions and showed in a simulation study that it significantly decreases the bias of the Stephenson-Tawn estimator in the case where the X^(t), and thus also Z̃, follow exactly a max-stable logistic distribution. However, if the X^(t) are samples from the outer power Clayton copula [cf. Hofert and Mächler, 2011] and thus only in the max-domain of attraction of the logistic distribution, then even the bias-reduced estimator suffers from significant bias [cf., Wadsworth, 2015, Table 3].
We repeat the simulation study from Section 5.1 with the only difference that, instead of sampling from Z, we simulate N = 100 samples z^(1), ..., z^(N) of Z̃, the rescaled maximum of b = 50 samples from the outer power Clayton copula, for different parameters. Based on these data in the max-domain of attraction of the logistic distribution, we estimate the dependence parameter θ using our Bayes estimator and compare it to the pairwise likelihood estimator. Both approaches ignore the additional information on the partitions τ^(l) that we have in this setup. On the other hand, we can also compute the Stephenson-Tawn estimator and its bias-reduced version by Wadsworth [2015], which explicitly include the partition information.
Table 1 shows the root mean squared errors of the four estimators. All of them have a bias that plays a significant role in the overall estimation error and that is due to the model misspecification for only approximately max-stable data. This bias is, however, much stronger for θST and θW, which use the, again misspecified, partitions. In this case, the Bayes estimator, which treats the partitions as unknown and samples from them automatically, seems to be more robust and does not need a bias correction. At the same time it has a small variance and thus, in many cases, the smallest root mean squared error. Especially in higher dimensions (k ≥ 20), where the bias reduction of Wadsworth [2015] can no longer be used, the Bayes estimator still provides a robust and efficient method of inference.

Estimation of marginal extreme value parameters
In spatial settings, the marginal extreme value parameters are often estimated by using the independence likelihood [Chandler and Bate, 2007], where all locations are assumed to be independent. This avoids specifying a dependence structure but can result in efficiency losses, even if only the marginal parameters are of interest.
We perform a simulation study to assess how using the full likelihoods in a Bayesian framework improves the estimation of the marginal parameters. We fix the dimension k = 10 and set the marginal parameters to μ = 1, σ = 1 and ξ ∈ {−0.2, 0.4, 1}, equal for all k margins. The dependence is logistic with unknown nuisance parameter θ_0 ∈ {0.1, 0.4, 0.7, 0.9}.
Based on N = 100 independent samples from this model, we compare three different estimation procedures. The first one is our Bayesian approach using the full joint likelihood of the marginal parameters and the dependence parameter. We use a uniform prior for θ, and independent normal priors for μ, log σ and ξ with large standard deviations. For the univariate case, more sophisticated choices of prior distributions are possible, including dependencies between the three extreme value parameters [e.g., Stephenson and Tawn, 2005a, Northrop and Attalides, 2016].
The second procedure is the maximum pairwise likelihood estimator, which only uses bivariate densities, and the third is the maximum independence likelihood estimator, which completely ignores dependence between different components. Each simulation and estimation is repeated 1500 times.
Table 2 contains the root mean squared errors of the marginal parameters for the three approaches. Interestingly, for the location and scale parameters we see only little difference between the three methods, meaning that they can be estimated efficiently without taking dependence into account. For the shape parameter, however, there are substantial improvements in the estimation error from including the unknown dependence structure in the model and estimating it simultaneously. Since the shape is both the most difficult and the most important of the three extreme value parameters to estimate, the Bayesian approach is promising also for marginal tail estimation. Finally, we observe that there is already an efficiency gain for the shape parameter when only the pairwise dependence is considered, but it is even more remarkable in the Bayesian setting with full likelihoods. Table 2 also shows that these observations hold across different ranges of the shape parameter ξ.

Applications in a Bayesian framework
In the previous sections we discussed the efficiency gains of the Bayesian full likelihood approach in the frequentist framework of point estimation. The Markov chain from Section 2.2, however, produces not only a point estimate but an estimate of the entire posterior distribution. For instance, this can be used directly to produce credible intervals for the parameter of interest. As a further application of our approach in the Bayesian framework, we present Bayesian model comparison in this section.

The posterior distribution and credible intervals
As an illustration of the methodology, we simulate a sample of N = 15 data z^(l) = (z_1^(l), ..., z_k^(l)), l = 1, ..., N, from a k-dimensional max-stable vector Z whose distribution belongs to the parametric family of logistic distributions introduced in Subsection 4.1 with parameter space Θ = (0, 1). We run the Markov chain from Subsection 2.2. The left panel of Figure 2 shows the Markov chain for the parameter θ with simulated data from the logistic distribution in dimension k = 10 with θ_0 = 0.8. The prior distribution is uniform, that is, π_θ = Unif(0, 1). The chain seems to have converged to its stationary distribution, namely the posterior distribution, after a burn-in period of about 200 steps. The autocorrelation of the Markov chain in Figure 3 suggests that there is serial dependence up to a lag of 30 steps. The parallel chain that updates the partitions is difficult to plot. The right panel of Figure 2 therefore shows, as a summary in each step, the mean number m of sets in the partitions τ^(1), ..., τ^(N), that is, m = (1/N) Σ_{l=1}^N |τ^(l)|. For complete independence (θ_0 = 1) we must have m = k = 10, whereas for complete dependence (θ_0 = 0) we have m = 1.
The left panel of Figure 4 shows a histogram and an approximated smooth version of the posterior distribution, together with the uniform prior. In order to assess the impact of the prior distribution on the posterior, the two other panels contain the corresponding plots for the same data set but with different priors, namely the beta distributions π_θ = Beta(0.5, 0.5) (center) and π_θ = Beta(4, 4) (right). Even for a relatively small sample of N = 15 data points, the influence of the prior is not very strong. The Bayesian setup provides us with a whole distribution for the parameter instead of only a point estimate. From this we can readily deduce credible intervals for the parameter θ. This is an advantage over frequentist composite likelihood methods, where the Fisher information matrix has a "sandwich" form adjusting for the misspecified likelihood, so that confidence intervals are not easily computed [Padoan et al., 2010]. When composite likelihoods are used in a Bayesian setup, the posterior distributions are much too concentrated and the empirical coverage rates are very small; adjustments are necessary to obtain appropriate inference [Ribatet et al., 2012]. Since our approach uses the full, correct likelihood, no adjustment is needed to obtain accurate empirical coverage rates. Indeed, in Table 3 we provide the coverage rates of the 95% credible intervals obtained in the simulation study in Section 5.1 for some values of θ_0.
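In the bivariate case the full logistic likelihood is available in closed form, so the posterior can be explored without the latent-partition augmentation used in the paper. The following sketch is illustrative only, not the paper's algorithm: a random-walk Metropolis chain for θ under a Unif(0, 1) prior, from which the posterior median and a 95% credible interval can be read off.

```python
import numpy as np

def loglik_bilogistic(theta, z):
    """Full log-likelihood of the bivariate logistic max-stable density
    f(z1, z2) = exp(-V) (z1 z2)^(-r-1) A^(1/r-2) (V + r - 1),
    with r = 1/theta, A = z1^-r + z2^-r and V = A^(1/r)."""
    r = 1.0 / theta
    a = z[:, 0] ** (-r) + z[:, 1] ** (-r)
    v = a ** (1.0 / r)
    return np.sum(-v - (r + 1) * np.log(z).sum(axis=1)
                  + (1.0 / r - 2) * np.log(a) + np.log(v + r - 1))

def mh_theta(z, n_iter=5000, step=0.05, seed=1):
    """Random-walk Metropolis for theta in (0, 1) under a uniform prior."""
    rng = np.random.default_rng(seed)
    theta, ll = 0.5, loglik_bilogistic(0.5, z)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.standard_normal()
        if 0 < prop < 1:                 # uniform prior: zero mass outside (0, 1)
            ll_prop = loglik_bilogistic(prop, z)
            if np.log(rng.uniform()) < ll_prop - ll:
                theta, ll = prop, ll_prop
        chain[i] = theta
    return chain

# after burn-in: posterior median and 95% credible interval, e.g.
# chain = mh_theta(z); np.median(chain[500:]); np.quantile(chain[500:], [0.025, 0.975])
```

At θ = 1 the density factorizes into independent unit Fréchet densities, which gives a convenient sanity check for the likelihood formula.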

Bayesian model comparison
Starting from data z from a family of max-stable distributions {F_θ, θ ∈ Θ}, we consider two sub-models M_1: θ ∈ Θ_1 and M_2: θ ∈ Θ_2 for disjoint sets Θ_1, Θ_2 ⊂ Θ. In Bayesian statistics, comparison of such models is often based on the Bayes factor B_{1,2}, which translates the prior odds into the posterior odds [e.g., Kass and Raftery, 1995]. The Bayes factor can be written as B_{1,2} = m(z | M_1) / m(z | M_2), where m(z | M_i) = ∫_{Θ_i} f_θ(z) π(θ | M_i) dθ are the so-called marginal probabilities of the data and π(· | M_i) is the prior density of the parameter θ under the model M_i. Since the max-stable likelihood f_θ(z) cannot be computed, the integral in (28) is computationally infeasible. However, we can use the estimation of the posterior probability (26) discussed in the previous subsection and estimate the Bayes factor as the ratio of the posterior odds to the prior odds. As an example, we consider a simple regression model for the marginal shape parameters ξ_1, ..., ξ_k of the k-dimensional max-stable distribution. One might be interested in testing whether there is a linear trend in the shape parameters and, thus, in comparing the models M_1: {β = 0} and M_2: {β ≠ 0}. In order to compute the Bayes factor as the ratio of the posterior probabilities of the two models according to (29), the prior distribution π_β of β must be a mixture 0.5 × δ_{0} + 0.5 × π^c_β of a Dirac point mass δ_{0} at 0 and an appropriate continuous distribution π^c_β on R. This ensures that both sets {β = 0} and {β ≠ 0} have positive posterior probability and that the Bayes factor is well-defined.
Similarly as in Section 5.3, we simulate N = 15 data from a max-stable logistic distribution with dimension k = 10 and dependence parameter θ_0 = 0.5, with marginal parameters µ_i = 1, σ_i = 1 and ξ_i as in (30) with α = 1 and different values for β, i = 1, ..., k. The prior distributions for the dependence, location and scale parameters are chosen as in Section 5.3. The prior for α is standard normal, and the prior for β is a mixture 0.5 × δ_{0} + 0.5 × π^c_β of a point mass and a centered normal distribution with standard deviation 0.5 as the continuous component π^c_β. A Markov chain whose stationary distribution is the posterior distribution of the parameters given the data can be constructed analogously to Section 2.2. However, given the current state β of the Markov chain, the proposal β* is not drawn from a continuous distribution with density q, but from a mixture of a Dirac point mass at {0} and a continuous distribution with density q^c(β, ·), with mixture weight p_0(β) ∈ (0, 1). To ensure convergence of the Markov chain, the densities q^c(β, ·) should be chosen such that q^c(β, β*) > 0 if and only if q^c(β*, β) > 0.

Fig 5. Bayes factors for different values of β for the full likelihood (blue) and the independence likelihood (red).
Figure 5 shows the Bayes factors obtained with the procedure described above. The true trend varies from β = 0, in which case M_1 is correct, over positive values up to β = 0.08, where M_2 is the correct model. For comparison, we implemented a Bayesian approach based on the independence likelihood [Chandler and Bate, 2007], which is the product of the marginal densities and ignores the dependence structure. The results show that using the full likelihood, which takes the dependence into account and treats it as a nuisance parameter, significantly facilitates the distinction between the two models.
The Bayes factors for the full likelihood show stronger support for M_1 if β = 0, and decrease more rapidly to 0 if β > 0, than those for the independence likelihood.
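The mixed point-mass proposal above can be sketched in a toy setting. The code below is our own illustration, using a simple Gaussian-mean model y_i ~ N(β, 1) instead of max-stable data: a Metropolis-Hastings sampler whose acceptance ratio is taken with respect to the dominating measure δ_0 + Lebesgue, with the Bayes factor for M_1: {β = 0} against M_2: {β ≠ 0} estimated from the posterior frequencies of the spike (the prior odds are 1 here).

```python
import math
import numpy as np

def log_norm_pdf(x, mu, sd):
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def spike_slab_mh(y, n_iter=20000, p0=0.5, step=0.3, slab_sd=0.5, seed=2):
    """MH for beta under the prior 0.5*delta_0 + 0.5*N(0, slab_sd^2) in the toy
    model y_i ~ N(beta, 1). The proposal mixes a point mass at 0 (weight p0)
    with a Gaussian random walk; target and proposal densities are taken
    relative to the mixed dominating measure delta_0 + Lebesgue."""
    rng = np.random.default_rng(seed)

    def log_target(b):  # log unnormalized posterior w.r.t. delta_0 + Lebesgue
        ll = -0.5 * np.sum((y - b) ** 2)
        prior = math.log(0.5) if b == 0 else math.log(0.5) + log_norm_pdf(b, 0.0, slab_sd)
        return ll + prior

    def log_q(b_from, b_to):  # proposal density w.r.t. the same measure
        if b_to == 0:
            return math.log(p0)
        return math.log(1 - p0) + log_norm_pdf(b_to, b_from, step)

    beta, chain = 0.0, np.empty(n_iter)
    for i in range(n_iter):
        prop = 0.0 if rng.uniform() < p0 else beta + step * rng.standard_normal()
        log_a = (log_target(prop) + log_q(prop, beta)
                 - log_target(beta) - log_q(beta, prop))
        if math.log(rng.uniform()) < log_a:
            beta = prop
        chain[i] = beta

    # equal prior weights: Bayes factor = posterior odds of {beta = 0}
    p_m1 = np.mean(chain == 0.0)
    return chain, p_m1 / (1 - p_m1)
```

The key point is that moves between the atom {0} and the continuous part remain reversible because both the target and the proposal are expressed as densities with respect to the same mixed measure.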
Finally, we note that a similar approach has been proposed in the univariate setting by Stephenson and Tawn [2005a] for estimation of the shape parameter, in order to allow for the Gumbel case ξ = 0 with positive probability.

Discussion
We present an approach that allows for inference on max-stable distributions based on full likelihoods by perceiving the underlying random partition of the data as a latent variable in a Bayesian framework. The formulas for ω(τ_j, z) provided in Section 4 allow, in principle, to perform Bayesian inference based on full likelihoods for many popular max-stable distributions in any dimension. However, computational challenges arise for both the extremal-t and the Brown-Resnick model in higher dimensions, since the corresponding ω(τ_j, z) require the evaluation of multivariate Student and Gaussian distribution functions, respectively, which have to be approximated numerically; see also Thibaud et al. [2016].
Making use of the weights ω(τ_j, z), the posterior distribution of the parameters becomes numerically available through samples based on Markov chain Monte Carlo techniques. As the results in Section 6.1 indicate, the posterior distribution does not show a strong influence of the prior distribution, even for a rather small amount of data; cf. Figure 4. In most of the examples presented here, the proposal distributions for the model parameters in the Metropolis-Hastings algorithms were chosen to be centered around the current state of the Markov chain with an appropriate standard deviation, resulting in chains with satisfactory convergence and mixing properties; cf. Figures 2 and 3, for instance. Further improvements of these properties might be possible, for example by implementing an adaptive design of the Markov chain Monte Carlo algorithms.
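A simple instance of such an adaptive design, sketched here generically and not tied to the max-stable setting, tunes the random-walk standard deviation during burn-in toward a target acceptance rate and then freezes it, so that the final chain is a valid Metropolis-Hastings sampler:

```python
import numpy as np

def tune_step(log_target, x0, step=0.1, batches=20, batch_size=100,
              target=0.234, seed=3):
    """Burn-in tuning of a Gaussian random-walk proposal: after each batch,
    the step size is scaled up or down depending on whether the observed
    acceptance rate lies above or below the target rate. Adaptation must be
    stopped afterwards to preserve the stationary distribution."""
    rng = np.random.default_rng(seed)
    x, lt = x0, log_target(x0)
    for _ in range(batches):
        acc = 0
        for _ in range(batch_size):
            prop = x + step * rng.standard_normal()
            lp = log_target(prop)
            if np.log(rng.uniform()) < lp - lt:
                x, lt, acc = prop, lp, acc + 1
            # rejected proposals keep the current state
        rate = acc / batch_size
        step *= 1.2 if rate > target else 1 / 1.2
    return x, step
```

The target rate of about 0.234 is the classical rule of thumb for high-dimensional random-walk proposals; for a one-dimensional parameter a somewhat higher target (around 0.44) is often used instead.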
In the frequentist framework, we propose to use the posterior median as a point estimator for the model parameters. As the simulation studies in Section 5 show, the use of full likelihoods considerably reduces the estimation errors compared to the commonly used composite likelihood methods, even for rather small sample sizes. This complements our theoretical results on the asymptotic efficiency of the posterior median. Besides the point estimator in the frequentist setting, we can also make use of the posterior distribution in a Bayesian framework. In Section 6, we discuss the use of credible intervals and Bayesian model comparison for max-stable distributions. Further applications, such as Bayesian prediction, are possible.
Using this, Equation (5) simplifies accordingly for the logistic model. Both the proofs of Prop. 4 and Prop. 5 rely on the following lemma.
Proof. For the proof of the first part, we note that the intensity of the spectral measure is given by the density of the random vector Ỹ. A direct computation shows that the restriction of λ to the simplex S_{k−1} is equal to h, which proves the claim. The first statement of the second part is a direct consequence of the first part. The proof of the strict monotonicity relies on the notion of convex order; see Chapter 3.4 in Denuit et al. [2005]. For two real-valued random variables X_1, X_2, we say that X_1 is lower than X_2 in convex order if E[ϕ(X_1)] ≤ E[ϕ(X_2)] for all convex functions ϕ: R → R such that the expectations exist. It is known that the family of random variables (Y(α)/α)_{α>0} is non-increasing in convex order [Ramos et al., 2000, Section 4.3], and, in this case, the Lorenz order is equivalent to the convex order [Denuit et al., 2005, Property 3.4.41]. We show below that this implies that τ(α_1, α_2) is strictly decreasing in its arguments.

and T_k(·; Σ) denotes a multivariate Student distribution function with k degrees of freedom and scale matrix Σ.

Proposition 7. Consider the extremal-t model with θ_0 = (Σ, ν), where Σ is a positive definite correlation matrix and ν > 0. Then, for fixed ν > 0, the posterior median θ̂_N^Bayes is asymptotically normal and efficient as N → ∞.
Fig 1. Root mean squared errors (RMSE) of θ̂_Bayes (blue) and θ̂_PL (red) for different dimensions k and different parameters θ. Values have been multiplied by 10000.

Fig 2. Markov chains for θ (left) and the mean partition size (right) with uniform prior.

Fig 3. Autocorrelation of the Markov chain for θ.

Appendix A: Proofs postponed from Section 4

A.1. Proofs from Subsection 4.1

Proof of Prop. 2. The claim follows by taking partial derivatives of the exponent function (15), which yields (32) for α_1, α_2 > 0, and applying standard theorems for integrals depending on a parameter.

Proof of Prop. 4. The claim follows from the construction given in the first part of Lemma 1, using that F_α(x) = Γ(α)^{−1} ∫_0^x t^{α−1} e^{−t} dt is the distribution function of a Gamma variable with shape α > 0.

Table 1
Root mean squared errors of θ̂_Bayes, θ̂_PL, θ̂_ST and θ̂_W, estimated from 1500 estimates; figures have been multiplied by 10000.

Table 2
Root mean squared errors of (µ, σ, ξ) estimates with different values of ξ for the Bayesian approach, pairwise likelihoods and independence likelihoods, respectively, where θ is an unknown nuisance parameter; figures have been multiplied by 1000.