Dimension reduction in multivariate extreme value analysis

Non-parametric assessment of extreme dependence structures between an arbitrary number of variables, though quite well-established in dimension 2 and recently extended to moderate dimensions such as 5, still represents a statistical challenge in larger dimensions. Here, we propose a novel approach that combines clustering techniques with angular/spectral measure analysis to find groups of variables (not necessarily disjoint) exhibiting asymptotic dependence, thereby reducing the dimension of the initial problem. A heuristic criterion is proposed to choose the threshold over which it is acceptable to consider observations as extreme and the appropriate number of clusters. When empirically evaluated through numerical experiments, the approach we promote here is found to be very efficient under some regularity constraints, even in dimension 20. For illustration purpose, we also carry out a case study in dietary risk assessment. MSC 2010 subject classifications: 62H12, 62H30, 62G32.


Introduction
High dimension raises important issues in applied multivariate statistics; while sample sizes are finite, the set on which probability measures are defined can be so large that extrapolation is intricate. Referred to as the curse of dimensionality (Donoho, 2000;Massart, 1989), this phenomenon makes the variance of classical estimators explode, thereby impeding inference. In extreme value analysis, the quality of estimation is all the more degraded as it is not carried out on the entire sample, but on some relatively small number of largest observations that are considered representative of the tail of the distribution. Whereas a plethora of techniques has been developed in the field of statistical learning to overcome this issue (Friedman et al., 2009), multivariate extremes in dimensions larger that 2 are still handled with difficulty. It is the main purpose of the present paper to address this issue, by developing a non-parametric technique for identifying groups of variables (not necessarily disjoint) exhibiting asymptotic dependence. Beyond a possible overall description of the tail dependence structure, when these classes are of small dimension, this method would enable further and more efficient assessment of multivariate tails. It combines recent statistical learning 383 384 E. Chautru algorithms with multivariate extreme value theory (MEVT). From a practical perspective, it should be also pointed out that it includes a heuristic criterion to help select the sub-sample of extreme observations on which inference should be performed.
From a theoretical perspective, non-parametric assessment of multivariate extreme dependencies is already well documented. Under the mild assumption that there exists a tail dependence function, focus is usually on an angular measure, often referred to as the spectral measure, which characterizes extreme dependencies. In the bivariate setting, many estimators were proposed to assess this angular measure (Beirlant et al., 2004;Einmahl and Segers, 2009;Einmahl et al., 2001;Resnick, 2007). Bayesian models have also flourished (Boldi and Davison, 2007;Guillotte et al., 2011), in which vein Sabourin and Naveau (2014) recently proposed a novel algorithm that handles moderate dimensions. Unfortunately, their technique is only efficient when all variables considered are asymptotically dependent; higher-complexity angular measures may not be studied with their method. Hence, were we able to first identify groups of dependent variables in regard to their extreme behavior, the aforementioned estimators would enable more precise estimation up to dimension 5. Lately, Haug et al. (2009) have adapted a classical dimension reduction method, Principal Components Analysis (PCA), to multivariate extremes analysis. Under an elliptical copula assumption, they recover the set of straight lines summarizing best the extreme covariance function, thereby leading to a clustering of variables based on extreme dependence. Following in their footsteps, we propose to borrow algorithms from statistical learning to achieve dimension reduction, without making any parametric assumption in contrast. Our goal is to identify and interpret hopefully small groups of asymptotically dependent variables, possibly overlapping. For this, we exhibit a natural mixture model of the angular measure that corresponds to a partition of the space it lives in: the simplex. Useful properties arising from this setting are revealed and exploited. Inference aims at recovering the components of the mixture, which define all groups of variables exhibiting asymptotic dependence. Mimicking classical non-parametric angular measure estimation, it focuses on the cloud of observation angles related to the L 2 -norm. In a first step, they are projected on a space with drastically lower dimension by using a recent algorithm that adapts PCA to Riemannian manifolds (Jung et al., 2012). Then, identification of the groups of interest is subsequently achieved by implementing an appropriate clustering technique on the obtained sub-space (Dhillon et al., 2002). To illustrate the assets and liabilities of our method, we perform numerical experiments and conduct a real case study for dietary risk assessment.
The paper is organized as follows: we start off in Section 2 by recalling a few basic notions in angular measure analysis and introducing the main hypotheses, subsequently used throughout the methodological part of our work in Section 3. There, we introduce a mixture model for the angular probability measure and emphasize the ensuing fruitful properties it enjoys, when viewed as a latent variable model. Then we turn to the practical aspects of the approach we promote and depict our strategy for statistical inference under the assumed model, based on dimension reduction techniques, in Section 4. It is supported by numerical experiments carried through in Section 5, and subsequently applied for illustration purposes to dietary risk assessment in Section 6. In view of both simulation and case study results, assets, liabilities and natural extensions of our method are finally listed and discussed in Section 7.

Angular measures
Throughout this article, we consider X : ÔX 1 , . . . , X d Õ a d-dimensional random vector, d 2, with Lebesgue-dominated probability distribution P on the positive orthant C : Ö0, × d and cumulative distribution function (cdf) F , whose tail structure we wish to assess. For all j in Ø1, . . . , dÙ, we denote by P j the j-th 1-dimensional marginal distribution of P, i.e. the probability distribution of X j , with corresponding continuous cdf F j ÔxÕ : P j ÔÖ0, x×Õ, x 0.
Statistical inference on the extreme behavior of F will be based on the observation of a sample X 1 , . . . , X n of n 1 independent copies of X (we shall write X i ÔX i,1 , . . . , X i,d Õ for 1 i n). Also define the random vector Z : ÔZ 1 , . . . , Z d Õ of standardized components and the corresponding transformed sample Z 1 , . . . , Z n . All Z j , j È Ø1, . . . , dÙ, are standard Pareto distributed, i.e. for all x 1 we have P ÔZ j xÕ x ¡1 .
It is customary in MEVT to assume that the cdf F of X is in the maximum domain of attraction (MDA) of a multivariate extreme value distribution (EVD) G, i.e. for all j È Ø1, . . . , dÙ there exists sequences a n,j 0 and b n,j È R such that P £ max 1 i n X i,1 ¡ b n,1 a n,1 x 1 , . . . , In this equation, 0 denotes the null vector in R d and the notation " v " stands for the vague convergence of measures in C AE : for all continuous functions with compact support f : The distribution of a random vector Z that fulfills Eq. (3) is said to be regularly varying in the multivariate sense. Actually, when combined with the assumption that for all j È Ø1, . . . , dÙ the marginal F j is in the maximum domain of attraction of a univariate EVD G j , Eq. (3) implies Eq.
(2). Here we do not require the existence of univariate maximum domains of attraction. The exponent measure exhaustively describes the extreme dependence structure between the random variables X 1 , . . . , X d , see for instance Section 8.2.3 in Beirlant et al. (2004) or Section 6.5.6 in Resnick (2007). It is homogeneous, i.e. for all 0 s and Borel subsets B of C AE : µÔs BÕ s ¡1 µÔBÕ, (4) and fulfills d marginal constraints expressing the nature of the marginal survival functions, namely for all j 1, . . . , d and 0 z : µ ÔØx È C AE : x j zÙÕ z ¡1 , (5) see for instance Section 8.2.2 in Beirlant et al. (2004) and Section 6.1.4 in Resnick (2007). When switching to pseudo-polar coordinates, µ enjoys a particularly useful representation. Choose two norms . Ô1Õ and . Ô2Õ on R d with corresponding unit hyperspheres S Ô1Õ and S Ô2Õ and define the following mapping: T : Typical choices of norms include the L p -norm or the sup-norm L . Then, the homogeneity property stated in Eq. (4) implies µ ¥ T ¡1 : µ ¡1 S, where the radius measure µ ¡1 , defined on Ô0, ×, is such that for all x 0, µ ¡1 ÔÔx, ×Õ x ¡1 , and the angle measure S, referred to as the angular (or spectral ) measure, has support on Ω : S Ô2Õ C AE and satisfies SÔBÕ µ Øx È C AE : x Ô1Õ 1, xß x Ô2Õ È BÙ¨ (6) for all Borel subsets B of Ω. A simple normalization of S yields the so-termed angular probability measure Q on Ω, Set ω Zß Z Ô2Õ and ρ Z Ô1Õ , then Equations (3), (6) and (7) imply where " D " stands for the convergence in distribution. In words, Q is the limit distribution of the angles when the radius gets infinitely large. It thereby Dimension reduction in multivariate extreme value analysis 387 encapsulates the extreme (or asymptotic) dependence structure between the d variables in dimension d ¡ 1. Observe that Eq. (5) can be expressed in terms of moment constraints for S and Q respectively: for all j in Ø1, . . . , dÙ, 3. Mixture model of the angular probability measure The extreme dependence structure between the variables X 1 , . . . , X d can be expressed in terms of the geometry of the support of Q (or S), which we denote by suppÔQÕ. Indeed, recall that suppÔQÕ is included in Ω, the positive orthant of the unit hypersphere S Ô2Õ , or the simplex associated with . Ô2Õ . The latter can be partitioned into It is of dimension #h ¡ 1, where #h denotes the cardinal of h. See Figure 1 for an illustration in dimension 3. By convention, the set referring to the empty face is denoted by Ω À : À.
Given this decomposition, for any h È P AE d , QÔΩ h Õ 0 implies that all X j such that j È h exhibit asymptotic dependence (see Beirlant et al., 2004, Section 8.2.3). Observe that the converse is not necessarily true: QÔΩ h Õ 0 does not imply the asymptotic independence of ØX j : j È hÙ. For instance, take X 1 , X 2 , X 3 Ω t1,2,3u The 7 nonempty open faces in the L 2 -norm simplex Ω in R 3 : the 3 vertices (left), the 3 edges (right), and the interior (left).
three asymptotically dependent variables. Then Q has no mass on Ω Ø1,2Ù (only on Ω Ø1,2,3Ù ) even though X 1 and X 2 are asymptotically dependent. Therefore, recovering the set of faces that intersect with the support of Q suffices to identify the sets of variables that are dependent in the extremes, but only then is it possible to deduce which ones are not. This motivates the following mixture model : where for all h È P d , π h : QÔΩ h Õ and In addition, denote by H the set made of all open faces intersecting suppÔQÕ: . Then for all h È H, Q h is by definition a probability distribution on Ω h . Even so, it cannot be interpreted directly as the angular measure of ØZ j : j È hÙ in the sense that it does not fulfill the marginal condition stated in Eq. (10) (see Section A.1, p.412). Obviously, we have π h È Ö0, 1× for all h È P AE d , π À 0 and hÈP d π h hÈH π h 1. Following in the footsteps of standard mixture model analysis (see for instance McLachlan and Peel, 2000), we shall assume that there exists an intrinsic (unknown) clustering of the data into #H classes leading to an identification of the set of interest H.
We conjecture that in the present setting, Eq. (3) is a sufficient condition for Assumption 3.1 to be fulfilled.
Remark 3.1. It is always possible to construct a categorical vector λ È Ø0, 1Ù 2 d that meets the requirements ÔiÕ-ÔiiiÕ of Assumption 3.1 from a categorical vector Ö λ È Ø0, 1Ù 2 d that only fulfills conditions ÔiÕ and ÔiiÕ. Indeed, let ÔÖ p h Õ hÈP d be the parameters of the distribution of such a random vector Ö λ. Consider the sets A 0,0 : Øh È P d : π h 0, p h 0Ù, A AE,AE : Øh È P d : π h 0, p h 0Ù and A 0,AE : Øh È P d : π h 0, p h 0Ù and define λ È Ø0, 1Ù 2 d with components where I Ø.Ù is the indicator function. Then λ has Categorical distribution with parameters Ôp h Õ hÈP d defined for all h È P d as and it fulfills all three points of Assumption 3.1.
Some useful properties can be established in such a setting. We display here two results which are subsequently exploited for inference, as shall be seen in the next section. Proofs and technical details are deferred to the appendix, in Sections A.1 and A.2 respectively. The proposition below exhibits the asymptotic behavior of conditional marginals under the latent variable model. Proposition 3.1. We place ourselves in the framework of Section 3 and denote by HÔjÕ the set Øh È H : j È hÙ. Then, for all j È Ø1, . . . , dÙ, h È P d , x 1, where hÈP d p h c j,h 1 and c j,h È Ö0, 1ßp h × is non-null if and only if h È HÔjÕ.
Therefore, nonempty open faces intersecting suppÔQÕ are identifiable by remaining only in the univariate level. In particular, the following result reveals that there exists a function, the asymptotic behavior of which enables the characterization of ØHÔjÕ, 1 j dÙ, and by extension of H.
Proposition 3.2. We place ourselves in the framework of Proposition 3.1.
For all j È Ø1, . . . , dÙ, h È P d , x 1, define the functional κ j,h ÔtÕ : and assume that there exists some constants γ AE È Ô0, 1Õ, c AE 0 and t AE 1, such that for all j È Ø1, . . . , dÙ, h Ê HÔjÕ, As a consequence, for fixed dimension j È Ø1, . . . , dÙ, the set HÔjÕ consists of all sets of indexes h such that κ j,h ÔtÕ diverges as t tends to infinity, instead of converging towards a finite constant, possibly zero. Since for all h È H we can write h Øj : h È HÔjÕÙ, we have that H is the set of all h È P AE d such that there exists at least one index j È Ø1, . . . , dÙ for which κ j,h ÔtÕ as t .
Remark 3.2. The assumption in Proposition 3.2 simply requires that the extreme dependence structure is reached at a reasonably fast rate. It can be directly linked to the concept of hidden regular variation introduced in Das and Resnick (2011); Das et al. (2013); Heffernan and Resnick (2005); Resnick (2002Resnick ( , 2007Resnick ( , 2008. A function f : R R is regularly varying with index α if for all x 0, f Ôx tÕßf ÔtÕ x α as t . Roughly speaking, if the distribution of Z had hidden regular variation, there would be an angular measure on ΩÞ ä hÈH Ω h when making the radius increase with some regularly varying function bÔtÕ oÔtÕ with index 1ßα 1 instead of t. In that case, our assumption guarantees that 1ßα γ AE 1, which is a rational condition for hidden regular variation not to be mistaken for multivariate regular variation in practice.
The proposed approach to statistical inference is based on Proposition 3.2 as explained in the next section. Numerical experiments illustrating the relevance of the method we promote here are subsequently presented in Section 5.

Statistical inference
Relying on the probabilistic framework detailed in Sections 2 and 3, we now review the various steps of the proposed methodology to assess the dependence structure governing the extreme values of X 1 , . . . , X d . The ensuing algorithm, which combines techniques borrowed from multivariate extreme value theory with clustering procedures, is depicted step by step in the next paragraphs.
Just as in classical angular measure assessment, we consider that for some high enough threshold t, asymptotic relations such as in Equations (8), (9), (13) and (14) are sufficiently well approached to enable estimation. In keeping with the literature, we use t nßk, where k represents a number of upper radii. Asymptotic statistical results in EVT are typically obtained under the assumption that k k n is an intermediate sequence such that k n and k n ßn 0 as n (see for instance Beirlant et al., 2004, Resnick, 2007, or De Haan and Ferreira, 2006. For fixed n, this suggests to pick a number k large enough to get a reasonable variance but also small enough to avoid biasing the estimation with non-tail observations. Then, the analysis of extremes is carried on the set of most extreme observations ℵ k : Øi È Ø1, . . . , nÙ :ρ i nßkÙ , with cardinal #ℵ k : n k .

Estimation of the marginals
From the beginning, we have worked with the standardized vector Z defined in Eq. (1) instead of the vector of interest X. In practice Z cannot be computed directly since it depends on the unknown marginals F 1 , . . . , F d ; it has to be estimated. To avoid restrictive hypotheses, we privilege here a non-parametric procedure, usually referred to as the rank transform: for all units i È Ø1, . . . , nÙ and dimensions j È Ø1, . . . , dÙ, set Beirlant et al., 2004;Einmahl and Segers, 2009;Einmahl et al., 2001;Resnick, 2007). Angles and radii are subsequently denoted byω i andρ i respectively.
For geometrical reasons explained in the next subsection, we set . Ô2Õ as the L 2 -norm. In addition, we use the L -norm for . Ô1Õ . Observe that whereas it is unimportant regarding the angle, selecting a specific norm for the radius can have major implications. This is due to the selection process of tail observations, defined as those with radius larger than nßk; clearly, different norms are bound to produce different sub-samples (Einmahl and Segers, 2009). However, such issues go beyond the scope of our analysis and are not discussed further here.

Principal Nested Spheres
Because of the curse of dimensionality, clustering in high dimensions can be problematic. Thus, it is customary in statistical learning to start by projecting the data on a manifold of smaller dimension in order to reduce the noise and synthesize the information. Here, we propose to mimic a classical approach in statistical learning, namely Principal Components Analysis (PCA, Friedman et al., 2009). We work on the angles instead of the raw data and set . Ô2Õ as the L 2 -norm, with unit hypershpere S d¡1 in R d . This enables the use of algorithms that respect the intrinsic distance of S d¡1 , like the Principal Nested Spheres (PNS) technique developed by Jung et al. (2012). It consists of an iterative projection of the data on sub-spheres of smaller and smaller dimension, called PNS, which are then identified with the unit spheres S d¡2 , . . . , S 1 . Sub-spheres of the unit circle being points, the last PNS (of dimension 0) corresponds to the Fréchet mean defined in Algorithm 1 (which is not necessarily unique, see Jung et al., 2012, p.555). More formally, let ℓ È Ø1, . . . , d ¡ 1Ù. The geodesic distance between two vectors x and y of S ℓ (the unit sphere in R ℓ 1 ) is written d ℓ

Algorithm 1 Principal Nested Spheres
The data projected on the selected PNS S ℓ End procedure to the north pole and by R ¡ ÔvÕ the ℓ ¢ Ôℓ 1Õ matrix consisting of the first ℓ rows of RÔvÕ (see Jung et al., 2012, p. 567 for more details on RÔvÕ). A subsphere A ℓ¡1 È S ℓ can be identified with the unit sphere S ℓ¡1 using the function is denoted by The procedure PnsÔÕ in Algorithm 1 summarizes the main steps of the PNS algorithm of Jung et al. (2012), the code of which was made available by its authors at http://www.stat.pitt.edu/sungkyu/MiscPage.html. An illustration of the key steps is given in Figure 2. Practical issues like the possible multiplicity of Fréchet means go beyond the scope of our work and are not discussed here. In the end it is practical to restrict the rest of the analysis to one of the d¡2 PNS of positive dimension, chosen for instance by the simple rule-of-thumb procedure SelectPnsÔÕ in Algorithm 1. It relies on the calculation of the relative variance encapsulated in each PNS (Jung et al., 2012, Section 2.4): for all where ξ ÔℓÕ i is the scaled residual of the projection of observation i from S ℓ onto S ℓ¡1 defined in Algorithm 1 -scaling the residuals is required to compare geodesic distances in different spaces (Jung et al., 2012, Section 4). In other words, V ÔℓÕ is supposed to give an indication of the level of data variability explained by the ℓ-th PNS. Then, the SelectPnsÔÕ procedure simply consists in recursively checking that V ÔℓÕ is greater than a user defined threshold for ℓ 1, 2, . . . and select the first PNS after which the condition is no longer satisfied; we stop as soon as the gain in explained variance is considered too low to justify raising the dimension. As was pointed out by Jung et al. (2012), projecting the data onto spheres of very low dimension like S 1 and S 2 usually suffices to explain most of the the variability. Example in R 3 where the angles are concentrated around the vertices and the middle of the edges: using the L 2 -norm and applying PNS gives a good representation of the data in dimension 1 (left block); using the L 1 -norm and applying PCA gives a good representation of the data in dimension 2 but not in dimension 1 (right block).
Other algorithms could have been used to project the data on manifolds of smaller dimension, see for instance Fletcher et al. (2004); Huckemann and Ziezold (2006); Jung et al. (2011). Had the angles been calculated with the L 1 -norm, PCA would have been an acceptable choice too. Our preference for PNS is justified by its ability to synthesize the structure of the data in very small dimensions by taking into account the intrinsic property that the angles are of norm 1. Jung et al. (2012, p. 562-564) observed on real data analyses that when the observations are located on a Riemannian manifold, "Euclidean principal component analysis, which completely ignores the manifold nature of the data, gives the worst performance", whereas "the principal nested spheres capture more interesting variability in fewer components". In the present context, although there is no absolute guarantee that PNS is more suitable than PCA, it is nonetheless possible to find configurations where this postulate is verified, like shown in Figure 3.

Spherical k-means
The projected data obtained by running Algorithm 1 can be organized into groups that will later serve to assess the set H. For this, we propose to use an accurate clustering procedure such as spherical k-means (Dhillon et al., 2002;Maitra and Ramler, 2010), based again on the geodesic distance, which we describe hereinafter. Choose a number M of clusters and let m È Ø1, . . . , M Ù and ℓ È Ø1, . . . , d ¡ 1Ù. For any set of n 1 points x : Ôx 1 , . . . , x n Õ on the unit sphere S ℓ R ℓ 1 such that for all i È Ø1, . . . , nÙ, and SBÔxÕ : BÔxÕß BÔxÕ 2 its projection on S ℓ . Let I i,m be the binary variable that indicates whether observation i belongs to cluster m (I i,m 1) or not Algorithm 2 Spherical k-means The objective of the spherical k-means algorithm is to find the collection of cluster indicators that minimizes the intra-class geodesic variance This is achieved in the manner depicted in Algorithm 2, which corresponds to the function skmeans with option start "S" in the R package skmeans. The option "S" specifies the method used to choose the initial concept vectors (steps 3 and 4 of the initialization of SkmÔÕ in Algorithm 2). It produces an initial clustering with centers as scattered as possible. Obviously, many other initialization techniques may have been applied, e.g. picking the first concept vectors at random. Such considerations are disregarded here. The main advantage of the spherical k-means algorithm is that it is very simple to implement. However, it can often happen that it remains stuck at a local minimum of the intra-class geodesic variance function. To counteract this undesirable effect, many refinements have been proposed in the literature (see for instance Dhillon et al., 2002 and the references therein), which are overlooked here. Obviously, many other natural techniques in mixture models analysis could have been adopted (McLachlan and Peel, 2000); our preference for geometrical methods is based on a strong belief that Riemannian geometry is a key concept for understanding the structure of the angular (probability) measure, as suggested by the encouraging results of the numerical experiments conducted in Section 5.

Estimation of H
The clustering obtained after the successive application of Algorithms 1 and 2 can now serve to assess H in the following manner. Let k È Ø1, . . . , nÙ and M È Ø1, . . . , n k 2 d ¡ 1Ù, where n k 2 d ¡ 1 : minÔn k , 2 d ¡ 1Õ (notice that M depends on k). Assume that we have at our disposal a clustering into M groups of the set of most extreme observations ℵ k identified by the binary variables Assuming that such a mapping exists, we propose to assess it with the statistic Ô f M constructed as follows. Consider that I i,m is an estimator of λ i,fM ÔmÕ and for all m È Ø1, . . . , M Ù, j È Ø1, . . . , dÙ, set Ô κ j,m ÔkÕ :  (Cattell, 1966) described in Algorithm 3.
. Since in practice #H is usually unknown and k has to be picked at hand, we develop a heuristic criterion to measure the quality of a clustering, given some k È Ø1, . . . , nÙ and M È Ø1, . . . , n k 2 d ¡1Ù.
Observe that the condition I Øn m 1Ù makes it impossible for our procedure to work if #H exceeds 2 ¢ n k . This is not surprising: complex underlying models demand large datasets to be assessed. Finding the maximum in Eq. (15)  On account of the nice properties of the rank transform (Das and Resnick, 2011;Heffernan and Resnick, 2005;Resnick, 2007), we can reasonably hope that these statistical objects converge to the true quantities they approximate as n , provided at least that k AE fulfills the usual conditions k AE k AE n and k AE n ßn 0 as n . Unfortunately, due to the lack of probabilistic results on PNS and spherical k-means, which were originally introduced as geometrical techniques, we cannot provide here a thorough asymptotic analysis of the solution output by the statistical procedure described above. Nonetheless, as shall be seen in the next section, numerical experiments provide strong empirical evidence of the efficiency of the approach we propose.

Numerical experiments
We tested our method through a number of numerical experiments, for various values of n, d, and #H. In doing so, we tried to handle various types of extreme dependence structures, to illustrate the impact of the complexity of suppÔQÕ on our algorithm. In the next two subsections, we first describe the different scenarios analyzed, then present and comment on the simulation results.

Settings
We generated n i.i.d. copies of a d-dimensional random vector ÔX 1 , . . . , X d Õ with varying degrees of extreme dependence. Observations were drawn using the general asymmetric multivariate logistic model: for any x Ôx 1 , . . . , is positive, and set #H, d 1. The data was generated according to the following distribution: where r h È Ô0, 1× is a parameter that controls the strength of the asymptotic dependence within the group of variables indicated by h. In particular, r h 1 Dimension reduction in multivariate extreme value analysis 399

Algorithm 4 Dimension reduction of multivariate extreme values
12: gives asymptotic independence, whereas asymptotic perfect dependence occurs when r h 0. Observe that with such a model, though it was not required in our analysis, the marginal distributions of X 1 , . . . , X d are extreme values distributions, which naturally belong to their own maximum domain of attraction. Our preference for this model was purely practical: simulations were performed using function rmvevd in R package evd (Stephenson, 2003). We repeated 100 trials of Algorithm 4 under 3 scenarios listed in Table 1. Notice that in accordance with what was pointed out in the introduction and suggested in Sections 3 and 4, in scenarios 2 and 3 some classes overlap (Ø1, 2Ù Ø2, 3Ù Ø2Ù).
To limit computation time, we tested its performance on 5 different sample sizes, namely n 500, 1000, 5000, 10 000, and 10 thresholds t nßk, with k n ¢ 0.001, n ¢ 0.002, . . . , n ¢ 0.01. For the same reason, we disregarded situations where d 20. However, in MEVT, d 6 and d 20 can already be considered as high dimensions.

Results
Results are displayed in Table 2, Table 3 and Table 4. The highlighted row reports the number of trials where we managed to exactly recover the set of open faces intersecting with the support of the angular probability measure. As expected, in all scenarios, results improve when n increases, and success rates become particularly satisfactory as soon as n 5000, for they then exceed 85% in all 3 scenarios. The best performance is obtained in scenario 1, where d 20 and #H 2. Indeed, even with a very small sample (n 500), only 10 trials out of 100 fail to recover the true decomposition of suppÔQÕ, while in scenario 3, where d 6 and #H 4, this rate never goes below 12% for any n. This   suggests that rather than the dimension, the complexity of suppÔQÕ may be one of the principal determinants of the performance of our procedure. Actually, given two angular probability measures with equivalently complex supports, increasing dimensionality can produce better outcomes. This is the case with scenarios 2 and 3, where suppÔQÕ is contained on small subsets of 4 open faces, but d 6 in the former while d 20 in the latter. These results are not surprising and illustrate a typical phenomenon called the blessing of dimensionality (Donoho, 2000); as d increases, observations occur in relatively small subsets of the original space and are therefore easier to detect and separate. This property is the basis for common techniques in statistical learning, such as the widely celebrated support vector machine (Friedman et al., 2009, Chapter 12), which projects the data onto some space with higher dimension in which they are well divided. In our numerical experiments, switching from scenario 3 to scenario 2 significantly reduces the risk of overriding either ΩÔØ1, 2ÙÕ or ΩÔØ2, 3ÙÕ, which are very close to one another in the unit hypersphere and may be wrongfully confused during the PNS procedure. Observe nonetheless that these simulations were performed for very small values of parameter r h , i.e. all dependencies were strong. Since we used the multivariate logistic model, this means that for all h È H, subsets suppÔQÕ Ω h did not cover the entire open faces Ω h but were concentrated around small neighborhoods of one of their points. Had we considered less obvious extreme dependencies, these results would have probably been significantly degraded. This remark can be linked to the influence of the hidden angular/spectral measure on inference (Resnick, 2002), which controls the rate at which extreme structure is reached and thus dangerously impacts statistical analysis if the chosen threshold nßk is too small. In fine, these results are quite encouraging, and underline the usefulness of algorithms from the field of statistical learning for MEVT, provided of course that the underlying model is not too complex.

Application to dietary risk assessment
While eating is the privileged way of providing the necessary nutrients for the human organism, it also conveys toxic elements that, due to various environmental causes, contaminate the food. When consumed over certain tolerable doses, called dietary intake limits (DIL), these toxic elements can have a non-negligible impact on health. Similar phenomena also occur when diets are either too rich or too poor in nutrients. More importantly, further noxious effects may be caused by possible interactions between elements that are ingested simultaneously (Carpenter et al., 2002). For international institutes concerned about public health issues such as the WHO (World Health Organization), FAO (Food and Agriculture Organization), UNEP (United Nations Environment Program), EFSA (European Food Safety Authority) or for national agencies such as the Anses (the French Agency for Food, Environmental and Occupational Health & Safety), it is then of major interest to identify cocktails of food chemicals to which populations are indeed highly exposed. Extreme value theory has already proven useful to assess the probability of getting over a single dietary intake limit, in both univariate (Tressou et al., 2004) and bivariate settings (Paulo et al., 2006). Here, we propose to apply Algorithm 4 to examine the relationships between high simultaneous long-term exposure to 6 common nutrients and contaminants, namely iron (Fe), calcium (Ca), sodium (Na), methylmercury (MeHg), cadmium (Cd) and dioxins and dioxin-like polychlorinated biphenyls (PCB-DL). Their longterm toxicity is well-known, see for instance Anses (2011) andCarpenter et al. (2002). Methylmercury, cadmium, and PCB-DL are three contaminants found mainly in seafood products. While cadmium was recognized in 2004 as a type 2 carcinogen by the European Union, methylmercury and PCB-DL can attack the nervous system. Sodium, calcium and iron are three minerals principally found in animal products such as meat or dairy products. Long-term over-exposure to these nutrients is also harmful, e.g. consuming too much calcium can provoke urinary and renal calculi and excessive ingestion of sodium favors cardiac issues. As for iron, some studies have underlined a probable link between its excessive ingestion and Parkinson disease (Jenner et al., 1992). The current knowledge about possible synergistic effects between these chemicals, which may increase sanitary risks, is still quite poor, due to the complexity of these phenomena. Only methylmercury and PCB-DL have been studied jointly, and their simultaneous consumption was observed to amplify health issues in a number of experimental surveys (Bemis and Seegal, 1999;Carpenter et al., 2002). Henceforth, recovering groups of nutrients or contaminants to which the population is observed to be simultaneously over-exposed can help orient future biological and chemical research, which would in turn provide a better understanding of dietary risks. This is the purpose, for instance, of the PERICLES research program (PEsticide Residue In vitro Combined Level of Exposure Study), recently launched by the Anses to identify and quantify the risk due to the exposure to mixtures of pesticides Crépet and Tressou, 2011;Crépet et al., 2013;Crépet et al., 2013). In terms of statistical analysis, thus reducing the dimension would also enable a more accurate estimation of the complex relationships between these types of exposure. Indeed, even though they are clearly linked by the type of food (fish or meat) introduced in the diet, there are differences of composition between species -like tuna or salmon -that can imply independence between types of extreme long-term exposure. In particular, exceedance of the DIL of more than 3 of these elements are never observed in the data. Because of the variety of individual dietary habits and the complexity of the contamination process, simultaneous types of high exposure are not an obvious phenomenon, are rarely observed, and need to be analyzed in detail.

Description of the data
Our vectors of 6 types of exposure were calculated on the n : 2488 nonpregnant, non-lactating adults of the INCA2 database for which no important variable was missing. Excluding pregnant and lactating women is due to the specificity of their dietary needs, which significantly differ from those of the rest of the population. INCA2 is a nation-wide survey conducted by the Anses from 2005 to 2007 (Afssa, 2009). Carried out in collaboration with the French National Institute of Statistics and Economic Studies (INSEE), it took inventory of the amounts of 1342 foods eaten by 2624 adults during 7 consecutive days. Hence, we consider weekly exposure. Levels of nutrients within each of these 1342 food items were given in the CIQUAL database (Anses, 2008), and equivalents for contaminants were found in TDS2 (Anses, 2011). Both tables result from surveys conducted by the Anses and were designed to match the food nomenclature of INCA2. Then, by simply multiplying quantities of food with the corresponding average amounts of chemical elements contained, we obtained the vectors of exposure X 1 , . . . , X n that are to be examined.

Analysis of extreme dependencies
We applied the DimRedMevÔÕ procedure of Algorithm 4 on the sample of exposure X 1 , . . . , X n , with hyper-parameters v AE 0.1 and v AE  default in R package skmeans). The resulting dependence structure is represented in Figure 4. To get further confidence in this outcome, we summarize in Table 5 the strongest relationships that were found over all thresholds t nßk.
The evolution of Ö ΥÔk, M k Õ with k is displayed on Figure 5. Here M k refers to the optimal number of clusters for fixed k defined on step 19 of Algorithm 4. Our criterion reaches its maximum when k 564, i.e. when calculations are based on the 1591 observations with largest radii.
In fact, the dependence structure represented in Figure 4 is found on all 16 largest values of Ö ΥÔk, M k Õ. The corresponding number of largest values k can be divided into two groups, one where k is in a neighborhood of 360, and another where k is around 560, as illustrated by the highlighted regions in Figure 5. Moreover, Table 5 shows that some dependencies are spotted whatever the number of largest values. In particular, methylmercury is almost always associated to PCB-DL, while cadmium and calcium get separated from all other chemicals. Concerning iron and sodium, uncertainty remains quite high, and a complementary bivariate analysis seems necessary to confirm the nature of their relationship. Figure 6 shows the estimated bivariate angular probability measures of joint exposure first to MeHg and PCB-DL, then to Fe and Na. They were obtained using the maximum empirical likelihood (abbreviated MEL) approach of Einmahl and Segers (2009). Clearly, the strong asymptotic dependence between methylmercury and PCB-DL is confirmed, on whatever value of k the estimation may be carried out. The presence of a sub-population reaching extreme exposure to PCB-DL alone is also suggested by the form ofQ, which gets close to vertical height on the extreme left part of the plot, for many values of k. However, methylmercury does not exhibit such a behavior, and given that a specific class of independent exposure to MeHg only occurs for 7.60% of the largest values, we decide to disregard it. In terms of dietary habits, getting two clusters of individuals, one highly exposed to both MeHg and PCB-DL and another solely to PCB-DL, makes perfect sense. Contrary to PCB-DL, methylmercury is a contaminant found exclusively in seafood products. Hence, it is possible to get over-exposed to PCB-DL without ingesting high amounts of MeHg. As for iron and sodium, according to the evolution ofQ with k shown on Figure 6, if these two types of exposure exhibit asymptotic dependence, the latter is clearly weak. In fact, we are more inclined to believe in the presence of a mixture of three sub-populations, one ingesting high amounts of both Fe and Na, and the other two getting over-exposed to only one of these nutrients. It is also possible that k 564 being quite high, the relationship appearing in Figure 4 corresponds not to extreme but moderately high levels of exposures. This inconclusive example suggests that extending our approach to the analysis of the hidden angular measure (Resnick, 2002(Resnick, , 2008 would be of major interest.

Discussion
Non-parametric analysis of extreme dependencies via the angular measure in high dimension d is still an open issue in multivariate extreme value theory. Though the bivariate setting has already been thoroughly investigated (Beirlant et al., 2004;de Haan, 1985;Einmahl and Segers, 2009;Einmahl et al., 2001;Guillotte et al., 2011;Resnick, 2007), and moderate dimensions are now accessible when all variables are asymptotically dependent (Sabourin and Naveau, 2014), the matter is still unresolved for d 5. Following in the footsteps of Haug et al. (2009), who adapted Principal Components Analysis to extreme dependence assessment, we proposed a method combining multivariate extreme value theory with statistical learning and data mining standards so as to identify sub-groups of variables exhibiting asymptotic dependence. Once these clusters are identified, if they each encompass less than 5 variables, it becomes possible to further estimate the corresponding sub-parts of the angular measure with any existing method, for instance those cited herein-before.
We started in Section 3 by developing the theoretical context under which our approach was constructed. In a non-parametric setting, we focused our attention on the angular probability measure Q. After recalling that it can be viewed as the limit distribution of observation angles given that their radius is getting infinitely large, we underlined the adequacy between the geometry of its support on the positive orthant of the unit hypersphere and the nature of extreme dependencies. Therefore, we proposed a natural model of the angular probability measure as a mixture of angular distributions with supports on each of the 2 d ¡1 non-empty open faces of the simplex. Tackled from a latent variable point of view, this model provided particularly useful properties, formulated in Proposition 3.1 and Proposition 3.2. In particular, we showed that open faces intersecting the support of Q, namely ØΩ h , h È HÙ, could be identified by means of a simple functional κ j,h ÔtÕ, 1 j d, h È P d , introduced in Proposition 3.2.
Then, we moved to the practical part of our method in Section 4. Because we had pinpointed the major role of geometry for analyzing angular measures in the preceding section, we adopted geometrical techniques suited for Riemannian objects such as the unit hypersphere for statistical inference. Borrowed from the statistical learning field, they consist in first projecting the initial cloud of points on a lower-dimensional space by means of the Principal Nested Spheres algorithm of Jung et al. (2012) to reduce the noise, then clustering the obtained data with spherical k-means (Dhillon et al., 2002;Maitra and Ramler, 2010). Resulting clusters were then used to assess H, the set of open faces that intersect suppÔQÕ. Heuristics were constructed based on the empirical counterpart of the functional κ j,h ÔtÕ, in particular to select both the appropriate numbers of groups of dependent variables and of "extreme" observations. Unfortunately, due mainly to the absence of probabilistic analysis of PNS and spherical k-means in the literature, we were not able to provide asymptotic results about the aforementioned objects (this is the object of an ongoing work). Hence, assets and liabilities of our technique were discussed based solely on numerical experiments.
In Section 5, we tested our method on a set of simulated databases. Three scenarios were considered varying in dimension d, the number #H of open faces charged by the angular measure and the complexity of suppÔQÕ. In spite of a clearly improvable practical algorithm, the encouraging results we obtained enabled us to define which characteristics of Q have most influence on estimation. In particular, we saw that unlike #H, d is of negligible importance to the complexity of suppÔQÕ and the strength of extreme dependence. The closer ØΩ h , h È HÙ are to one another (e.g. both ΩÔØ1, 2ÙÕ and ΩÔØ2, 3ÙÕ intersect suppÔQÕ), the harder it seems to be to separate and correctly identify each of them. Though they were not considered in the simulations, we added some comments on rates of convergence to the asymptotic dependence structure that were sensed as a determining factor in assessment efficiency. Specifically, we insisted on the role that the hidden angular measure may play when selecting an optimal number of largest values and suggested the interest of generalizing our approach to its analysis. In fine, what emerged from these numerical experiments is that improvement in the estimation of the angular measure can be hoped for, provided that some regularity and sparsity hypotheses are fulfilled.
Further insight into our method was provided by a case-study illustration. Applied to real databases about exposures to 6 food contaminants, it produced stable outcomes, thereby giving confidence in the results. We were able to conclude that only two pairs of chemicals are actually linked in extremes, namely methylmercury and PCB-DL on the one hand, and iron and sodium on the other hand. These associations were confirmed by further computing the MEL estimator of Einmahl and Segers (2009) on these two pairs of variables. In addition, our method spotted a configuration usually hard to notice with traditional estimators, but quite natural given the underlying mixture model on which we based the analysis: it underlined the presence of a mixture of populations, some being jointly over-exposed to a couple of elements, while others ingest high quantities of only one of them (PCB-DL or Na). In terms of public health implications, this means that people who are over-exposed to methylmercury tend to ingest simultaneously high amounts of dioxins and PCB. Knowing that these two toxicants have similar noxious effects on the human organism (Fischer et al., 2008;Weihe et al., 1996), and that when combined, synergistic effects can occur (Bemis and Seegal, 1999;Carpenter et al., 2002), this suggests paying particular attention to the populations that do not respect the corresponding DIL. It also justifies the need for specific research on potential combined effects of these two contaminants, which would help in assessing the sanitary risks brought upon the concerned population.
In view of these results, one advantage of our multivariate approach is that people in the data are dispatched into multiple classes that embody different types of extreme dependencies. In our case-study example, it facilitates the understanding of over-exposure categories by allowing classical discriminant analyses. An interesting alternative would be to model the various π h appearing in the mixture model of the angular probability measure in function of auxiliary covariates, e.g. some sociologic or economic variables here. More than providing easily interpretable results, this would probably increase the performance of our procedure by helping discriminate between the various clusters. Such generalizations of the present work will be the subject of further investigation in the near future.
We shall prove that for all h È P AE d , for an arbitrary small ǫ. This result can be obtained by applying the Portmanteau theorem to Eq. (9), provided that we find at least a decreasing sequence of positive constants ǫ 1 , ǫ 2 , . . . that tends to 0 such that for any m 1 and open face Ω h , the frontier of V ǫm ÔΩ h Õ has null measure relative to Q. Since Q is a finite measure, its associated cdf admits at most countably many discontinuity sets, hence the requirement is met. Now we shall prove that for all h È P AE d , where Ω h denotes the closure of Ω h in Ω. Observe that ÔV ǫ ÔΩ h ÕÕ ǫ is a decreasing sequence of sets that tends to Ω h as ǫ tends to 0. Therefore, Eq. (A.17) can be deduced from the monotone continuity property of probability distributions.
By combining Eq. (A.16) and Eq. (A.17), we obtain for all h È P AE d : Now let D j , j È Ø1, . . . , dÙ denote the set Øh È P AE d : #h jÙ. We shall prove Lemma A.1 by strong induction, starting with j 1. First, observe that for all h È D 1 we have Ω h Ω h and that the events Øλ h 1, h È P d Ù are disjoint by construction. Hence, for any h È D 1 , Eq. (A.18) can be rewritten as follows: Since Eq. (12) ensures that lim t P ω È V ǫ ÔΩ h Õ § § ρ t, λ h 1¨ 1 for all ǫ 0 and that for all ℓ h, lim ǫ 0 lim t Invoking again Eq. (12), lim t P ω È V ǫ ÔΩ h Õ § § ρ t, λ h 1¨ 1 for all ǫ 0 and for all ℓ h, Combined with the induction hypothesis, these equations entail Now that we have proved Lemma A.1 for all h È P AE d , there remains to check that lim t P Ôλ À 1 ρ tÕ π À 0. By construction the distribution of λ is Categorical with parameters Ôp h Õ hÈP d , therefore the events Øλ h 1Ù hÈP d are mutually exclusive and Øλ À 0Ù ä hÈP AE d Øλ h 1Ù. As a result, we have: This concludes the proof.
The second preliminary result in the lemma below states that the distribution of vector Z given that λ h 1 is multivariate regularly varying when h È H.
Proof. By Lemma A.1, Eq. (8) and Eq. (12), we have that where by definition, Recall that C h : Øx È C AE : xß x Ô2Õ È Ω h Ù, and set then we can rewrite S h in function of µ h as below: Since C h is a cone, the homogeneity property of µ stated in Eq. (4) is passed on According to Theorem 6.1 in Resnick (2007), it naturally follows that where, just like µ, µ h can be written as the product of a measure on the radius with a measure on the angles when switching to pseudo-polar coordinates: We can now tackle the proof of Proposition 3.1. Going back to the marginal level, multivariate regular variation of conditional distributions gives for all x 1, 1 j d, Notice that we now have a null limit for all h Ê HÔjÕ. Furthermore, for all h È HÔjÕ, we have Hence, for all h È P d , j È Ø1, . . . , dÙ, x 1, we can write where c j,h 0 when h È HÔjÕ, and c j,h 0 otherwise. Based on the marginal constraints on S stated in Eq. (10) and because ØΩ h Ù hÈP d forms a partition of Ω, we have that c j,h È Ö0, 1ßp h × for all h È P d and hÈP d p h c j,h hÈHÔjÕ p h c j,h 1.

A.2. Proof of Proposition 3.2
We shall handle the situations where h È HÔjÕ and h Ê HÔjÕ separately. To simplify notations, for all h È P d and x 0 we will denote byF j,h ÔxÕ the conditional probability that Z j exceeds x given λ h equals 1: • h È HÔjÕ : p h 0 and π h 0 From Eq. (13) in Proposition 3.1, it is straightforward thatF j,h is regularly varying with index ¡1, i.e. for any x 1, F j,h Ôx tȬ F j,h ÔtÕ t x ¡1 .
Hence,F j,h may be written as follows: F j,h ÔxÕ x ¡1 L j,h ÔxÕ, where L j,h ÔxÕ is a slowly varying function ( s 0, L j,h Ôs xÕßL j,h ÔxÕ x 1) that converges to c j,h as x .
Remark A.1. Define x AE j,h : infØx 1 :F j,h ÔxÕ 0Ù, the right endpoint of the survival functionF j,h for any j È Ø1, . . . , dÙ and any h È P d . Then for all h È HÔjÕ, x AE j,h , that is t 1,F j,h ÔtÕ 0.
• h È HÞHÔjÕ : p h 0 and π h 0 Contrary to the case where h È HÔjÕ, the conditional cdfF j,h can have either finite or infinite right endpoint. When its support is bounded, relying on the Bayes decomposition exhibited in the previous paragraph, the desired result is straightforward: because there exists some t 0 1 such that for all t t 0 , F j,h ÔtÕ 0, then as t , the integral also becomes null. If on the contrary, F j,h 0 for all t 1, then, as previously, we can rewrite the quantity of interest in the following form: κ j,h ÔtÕ t p hFj,h ÔtÕ P λ h 1 § § ρ t¨÷ 1F j,h Ôx tȬ F j,h ÔtÕ dx. Since as t tends to infinity tF j,h ÔtÕ tends to 0 (Proposition 3.1), P λ i,h § § ρ i tẗ ends to π h 0 (Lemma A.1) and since p h 0, for the integral of interest to converge to 0 it suffices to prove that there exists some t 0 1 such that for all t t 0 , Ôh È H c ÞØÀÙÕ Ôπ h 0Õ Ôp h 0Õ. Consequently, when h Ê H ØÀÙ, we have P Z j x § § ρ t, λ h 1¨ 0 for all x 0, and by extension ÷ 1 t P Ôρ tÕ P Z j x t § § ρ t, λ h 1¨dx 0, for all t 0. This remains true as t .
We have already seen that according to the assumption in Proposition 3.2, there exists some constants γ AE È Ô0, 1Õ, c AE 0 and t AE 1 such that for all t t AE , ÷ 1F j,À Ôx tȬ F j,À ÔtÕ dx ¡c AE 1 ¡ 1ßγ AE .
Moreover, by virtue of Eq. (8), for all ǫ 0 there exists some t ǫ 0 such that for all t t ǫ , t P Ôρ tÕ ¡ SÔΩÕ ǫ. Fix some ǫ 0 and set ǫ AE : ¡ǫ Ô1 ¡ 1ßγ AE Õßc AE , then for all t maxÔt AE , t ǫ Õ, we have Observe that the smaller γ AE , i.e. the faster the limit dependence structure is reached, the smaller the bound of κ j,À ÔtÕ. Ideally, when allF j,À , 1 j d, are rapidly varying, i.e. c AE 0, we obtain the same result as in the case where h È HÞHÔjÕ. This would correspond in fact to the absence of hidden regular variation, like mentioned in Section 5 and Section 7 (Heffernan and Resnick, 2005;Resnick, 2002Resnick, , 2008.