A generalized spatial sign covariance matrix

The well-known spatial sign covariance matrix (SSCM) carries out a radial transform which moves all data points to a sphere, followed by computing the classical covariance matrix of the transformed data. Its popularity stems from its robustness to outliers, fast computation, and applications to correlation and principal component analysis. In this paper we study more general radial functions. It is shown that the eigenvectors of the generalized SSCM are still consistent and the ranks of the eigenvalues are preserved. The influence function of the resulting scatter matrix is derived, and it is shown that its breakdown value is as high as that of the original SSCM. A simulation study indicates that the best results are obtained when the inner half of the data points are not transformed and points lying far away are moved to the center.


1 Introduction
Robust estimation of the covariance (scatter) matrix is an important and challenging problem. Over the last decades, many robust estimators for the covariance matrix have been developed. Many of them possess the attractive property of affine equivariance, meaning that when the data are subjected to an affine transformation the estimator transforms in the same way as the true covariance matrix, so in particular the eigenstructure of the underlying distribution is preserved.
The SSCM is easy to compute, and has been used extensively in several applications.
The most common use of the SSCM is probably in the context of (functional) spherical PCA as developed by Locantore et al. (1999), Visuri et al. (2001), Croux et al. (2002) and Taskinen et al. (2012). There has also been a lot of recent research on its use for robust correlation estimators (Dürre et al., 2015; Dürre and Vogel, 2016; Dürre et al., 2017), testing for sphericity (Sirkiä et al., 2009), as a general way of preprocessing data (Serneels et al., 2006) and as an initial estimate for more involved robust scatter estimators (Croux et al., 2010; Hubert et al., 2012). Boente et al. (2018) study the SSCM as an operator for functional data analysis.
The next section introduces a generalization of the SSCM and studies its properties.
Section 3 compares the performance of several members of this class in a small simulation study, and Section 4 concludes. All proofs can be found in the Appendix.
2 Methodology

2.1 Definition

Definition 1. Let X be a p-variate random variable and µ a vector serving as its center.
Define the generalized spatial sign covariance matrix (GSSCM) of X by

S_g(X) = E[ g_X(X − µ) g_X(X − µ)^T ]   (1)

where the function g_X is of the form

g_X(x) = x ξ_X(||x||)   (2)

where we call ξ_X : R^+ → R^+ the radial function and || · || is the Euclidean norm.
Note that the form of g X in (2) precisely characterizes an orthogonally equivariant data transformation as shown by Hampel et al. (1986), p. 276. Also note that the regular covariance matrix corresponds to ξ X (r) = 1, and that ξ X (r) = 1/r yields the SSCM.
For a finite data set X = {x_1, . . . , x_n} the GSSCM is given by

S_g(X) = (1/n) Σ_{i=1}^n g_X(x_i − T(X)) g_X(x_i − T(X))^T   (3)

where T is a location estimator. Note that the SSCM gives the x_i with ||x_i − T(X)|| < 1 a weight higher than 1, but in general this is not required. In fact, the other functions we will propose satisfy ξ_X(r) ≤ 1 for all r.
In the above definitions, we added the subscript X or X to the functions g and ξ to indicate that they can depend on the random variable X or the dataset X. In what follows we will drop these subscripts to ease the notational burden. We will study the following functions ξ:

1. Winsorizing (Winsor): ξ(r) = 1 if r ≤ Q_2, and ξ(r) = Q_2/r if r > Q_2.   (4)

2. Quadratic Winsor (Quad): ξ(r) = 1 if r ≤ Q_2, and ξ(r) = Q_2²/r² if r > Q_2.   (5)

3. Ball: ξ(r) = 1 if r ≤ Q_2, and ξ(r) = 0 if r > Q_2.   (6)

4. Shell: ξ(r) = 1 if Q_1 ≤ r ≤ Q_3, and ξ(r) = 0 otherwise.   (7)

5. Linearly Redescending (LR): ξ(r) = 1 if r ≤ Q_2, ξ(r) = (Q*_3 − r)/(Q*_3 − Q_2) if Q_2 < r ≤ Q*_3, and ξ(r) = 0 if r > Q*_3.   (8)

The cutoffs Q_1, Q_2, Q_3 and Q*_3 depend on the Euclidean distances d_i = ||x_i − T(X)|| by

Q_1 = ( hmed_i(d_i^{2/3}) − hmad_i(d_i^{2/3}) )^{3/2}
Q_2 = ( hmed_i(d_i^{2/3}) )^{3/2}
Q_3 = ( hmed_i(d_i^{2/3}) + hmad_i(d_i^{2/3}) )^{3/2}

with Q*_3 constructed in the same way from a more extreme Gaussian quantile. Here hmed and hmad are variations on the median and median absolute deviation given by the order statistic hmed(y_1, . . . , y_n) = y_(h) with h = ⌊(n + p + 1)/2⌋ and hmad(y_1, . . . , y_n) = hmed(|y_1 − hmed(y)|, . . . , |y_n − hmed(y)|). The 2/3 power in these formulas is the Wilson-Hilferty transformation (Wilson and Hilferty, 1931) to near normality. In Section A.1 of the Appendix it is verified that this transformation brings the above cutoffs close to the theoretical ones, which are quantiles of a convolution of Gamma random variables with different scale parameters.

Figure 1 shows the above functions ξ and that of the SSCM for distances whose square follows the χ²_2 distribution. The ξ of the SSCM is the only one which upweights observations close to the center. The Winsor ξ and its square (the Quad ξ) have a similar shape, but the latter goes down faster. The Ball and Shell ξ functions are both designed to give a weight of 1 to half (in fact, h) of the data points and 0 to the remainder, to make them comparable. Ball does this by giving a weight of 1 to the h points with the smallest distances. Shell is inspired by the idea of Rocke (1996) to downweight observations with very high as well as very low distances from the center. The Linearly Redescending ξ is a compromise between the Ball and the Quad ξ functions.
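As an illustration of this construction, the finite-sample GSSCM with the Winsor radial function can be sketched in a few lines. This is an illustrative Python version, not the authors' R implementation; the coordinatewise median as a stand-in for the location estimator T and the order-statistic rank h used for hmed are simplifying assumptions.

```python
import numpy as np

def hmed(y, h):
    # variation on the median: the h-th order statistic y_(h)
    return np.sort(y)[h - 1]

def winsor_xi(r, q2):
    # Winsorizing radial function: 1 up to the cutoff Q2, Q2/r beyond it
    return np.where(r <= q2, 1.0, q2 / np.maximum(r, 1e-12))

def gsscm_winsor(X, T=None):
    # Finite-sample GSSCM: (1/n) * sum_i g(x_i - T) g(x_i - T)^T
    # with g(x) = x * xi(||x||) and the Winsor radial function.
    n, p = X.shape
    if T is None:
        T = np.median(X, axis=0)          # simple stand-in location estimate
    Xc = X - T
    d = np.linalg.norm(Xc, axis=1)        # Euclidean distances d_i
    h = (n + p + 1) // 2                  # assumed order-statistic rank for hmed
    q2 = hmed(d ** (2 / 3), h) ** 1.5     # cutoff via the Wilson-Hilferty scale
    G = Xc * winsor_xi(d, q2)[:, None]
    return G.T @ G / n
```

The weights are bounded by 1, so far-away points can no longer dominate the sum, in contrast with the classical covariance matrix.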

2.2 Preservation of the eigenstructure
In what follows, we assume that the distribution F_X of X has an elliptical density with center zero and that its covariance matrix Σ = E_{F_X}[XX^T] exists. Then X can be written as X = UDZ where U is a p×p orthogonal matrix, D is a p×p diagonal matrix with strictly positive diagonal elements δ_1, . . . , δ_p, and Z is a p-variate random variable which is spherically symmetric, i.e. its density is of the form f_Z(z) ∝ w(||z||) where w is a decreasing function.
Assume w.l.o.g. that the covariance matrix of Z is I_p, so that Σ = U D² U^T. The following proposition says that S_g(X) has the same eigenvectors as Σ and preserves the ranks of the eigenvalues.

Proposition 1. Under the above assumptions it holds that S_g(X) = U Λ_g U^T with Λ_g = diag(λ_{g,1}, . . . , λ_{g,p}), so S_g(X) and Σ have the same eigenvectors. Moreover, δ_i > δ_j implies λ_{g,i} > λ_{g,j}, and δ_i = δ_j implies λ_{g,i} = λ_{g,j}, where δ_1, . . . , δ_p denote the diagonal elements of D.

This proposition justifies the generalized SSCM approach.
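The preservation of the eigenstructure is easy to check numerically. The sketch below (illustrative Python, using the classical SSCM, i.e. ξ(r) = 1/r) generates elliptical data with a known Σ and verifies that the eigenvectors of the SSCM align with those of Σ and that the eigenvalue ranks agree.

```python
import numpy as np

rng = np.random.default_rng(42)

# True covariance matrix with distinct eigenvalues 9 > 4 > 1
# and known eigenvectors (the columns of U)
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
Sigma = U @ np.diag([9.0, 4.0, 1.0]) @ U.T

# Elliptical (Gaussian) sample with center zero
Z = rng.standard_normal((100_000, 3))
X = Z @ np.linalg.cholesky(Sigma).T

# Classical SSCM: covariance of the data radially projected onto the unit sphere
signs = X / np.linalg.norm(X, axis=1, keepdims=True)
S = signs.T @ signs / len(signs)

# Eigenvectors sorted by decreasing eigenvalue should match U up to sign,
# and the eigenvalue ranks should agree with those of Sigma
vals, vecs = np.linalg.eigh(S)
vals, vecs = vals[::-1], vecs[:, ::-1]
alignment = np.abs(np.diag(vecs.T @ U))   # |cos| of angles to true eigenvectors
```

Note that the SSCM eigenvalues themselves differ from those of Σ; only the eigenvectors and the eigenvalue ranks carry over.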

2.3 Location estimator
So far we have not specified any location estimator T. For the SSCM the most often used location estimator is the spatial median, see e.g. Gower (1974) and Brown (1983), which we denote by T_0. The spatial median of a dataset X = {x_1, . . . , x_n} is defined as

T_0(X) = argmin_µ Σ_{i=1}^n ||x_i − µ|| .

In order to improve its robustness against a substantial fraction of outliers we propose to use the k-step least trimmed squares (LTS) estimator. The LTS method was originally proposed in regression (Rousseeuw, 1984), and for multivariate location it becomes

T_LTS(X) = argmin_µ Σ_{i=1}^h ( ||x_i − µ||² )_{(i)}

where the subscript (i) stands for the i-th smallest squared distance. (Without the square this becomes the least trimmed absolute distances estimator studied in Chatzinakos et al. (2016).) For the multivariate location LTS the C-step of Rousseeuw and Van Driessen (1999) simplifies to:

Definition 2. (C-step) Fix h = ⌊(n + 1)/2⌋. Given a location estimate T_{j−1}(X) we take H_j ⊂ {1, . . . , n} as the set of indices of the h points with the smallest distances ||x_i − T_{j−1}(X)||. The C-step then yields

T_j(X) = (1/h) Σ_{i ∈ H_j} x_i .

The C-step is fast to compute, and guaranteed to lower the LTS objective. The k-step LTS is then the result of k successive C-steps starting from the spatial median T_0(X).
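The k-step LTS described above can be sketched as follows (an illustrative Python version, not the authors' R code; Weiszfeld's algorithm is a standard way to compute the spatial median T_0):

```python
import numpy as np

def spatial_median(X, n_iter=200, tol=1e-10):
    # Weiszfeld iterations for T0 = argmin_mu sum_i ||x_i - mu||
    mu = X.mean(axis=0)
    for _ in range(n_iter):
        d = np.maximum(np.linalg.norm(X - mu, axis=1), 1e-12)
        w = 1.0 / d
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        if np.linalg.norm(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu

def kstep_lts(X, k=2):
    # k C-steps started from the spatial median: each step averages the
    # h points closest to the current center, with h = floor((n + 1) / 2)
    n = len(X)
    h = (n + 1) // 2
    T = spatial_median(X)
    for _ in range(k):
        d = np.linalg.norm(X - T, axis=1)
        H = np.argsort(d)[:h]     # indices of the h smallest distances
        T = X[H].mean(axis=0)     # C-step: mean of that half sample
    return T
```

On data with a cluster of far outliers, the k-step LTS stays close to the bulk of the points while the sample mean is pulled toward the contamination.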
It is also possible to avoid the estimation of location altogether, by calculating the GSSCM on the O(n²) pairwise differences of the data points, as Visuri et al. (2001) did for the SSCM. This approach is called the "symmetrization" of an estimator, but it is more computationally intensive.

2.4 Robustness properties
A major reason for the SSCM's popularity is its robustness against outliers. Robustness can be quantified by the influence function and the breakdown value. We will study both for the GSSCM.
The influence function (Hampel et al., 1986) quantifies the effect of a small amount of contamination on a statistical functional T. Consider the contaminated distribution F_{ε,z} = (1 − ε)F + εΔ_z, where Δ_z is the distribution that puts all its mass in z. The influence function of T at F is then given by

IF(z, T, F) = lim_{ε↓0} [ T(F_{ε,z}) − T(F) ] / ε = (∂/∂ε) T(F_{ε,z}) |_{ε=0} .

For the generalized SSCM class we obtain the following result:

Proposition 2. Denote S_g(F) = Ξ_g and let µ = 0 in (1). The influence function of S_g at the distribution F is given by:

IF(z, S_g, F) = g(z) g(z)^T − Ξ_g + E_F[ ġ(X) g(X)^T + g(X) ġ(X)^T ]   (9)

where ġ(X) := (∂/∂ε) g_ε(X) |_{ε=0} and g_ε denotes the function g computed from F_{ε,z}. If g does not depend on F, the last term of (9) vanishes. For example, for g(t) = t we recover the influence function of the classical covariance matrix, and for the SSCM we recover the expression derived by Croux et al. (2002). For the GSSCM estimators defined by the functions (4)-(8) the last term of (9) remains, and the expressions of their IF can be found in Section A.3 of the Appendix.
In order to visualize the influence function we consider the bivariate standard normal case, i.e. F = N(0, I_2). We put contamination at (z, z) or (z, 0) for different values of z and plot the IF for the diagonal elements and the off-diagonal element. Note that we cannot compare the raw IFs directly since S_g(F) = Ξ_g = c_g I_2 where c_g = ∫ (g(x))_1² dF(x), hence Ξ_g is only equal to I_2 up to a factor. In order to make the estimators consistent for this distribution we can divide them by c_g, and so we plot IF(z, S_g, F)/c_g in Figure 2.
The rows in Figure 2 correspond to the IF of the first diagonal element S_11 (top), the off-diagonal element S_12 (middle) and the element S_22 (bottom). Let us first consider the left part of the figure, which contains the IFs for an outlier in (z, z). By symmetry, the IFs of the diagonal elements S_11 and S_22 are the same here. In the regions where the function ξ equals 1 the IFs are quadratic in z, with jumps at the cutoffs. The right panel of Figure 2 shows the influence functions for an outlier in (z, 0). In this case the IFs of the diagonal elements S_11 and S_22 are no longer the same, as the symmetry is broken. The IFs of S_11 are again quadratic where ξ = 1, with jumps at the cutoffs. Note that these cutoffs are now located at different values of z, as ||(z, 0)|| ≠ ||(z, z)||. The IF of the off-diagonal element is constant at 0, indicating that S_12 remains zero even when there is an outlier at (z, 0). Finally, for the second diagonal element S_22 the IF of the SSCM is −1. This is because adding ε of contamination at (z, 0) reduces the mass of the remaining part of F by a factor 1 − ε, which lowers the estimated scatter in the vertical direction. For the other estimators there is an additional effect of (z, 0) on the cutoffs, which causes the discontinuities.
A second tool for quantifying the robustness of an estimator is the finite-sample breakdown value (Donoho and Huber, 1983). For a multivariate location estimator T and a dataset X of size n, the breakdown value is the smallest fraction of the data that needs to be replaced by contamination to make the resulting location estimate lie arbitrarily far away from the original estimate T(X). More precisely:

ε*(T, X) = min{ m/n : sup_{X*_m} ||T(X*_m) − T(X)|| = ∞ }

where X*_m ranges over all datasets obtained by replacing any m points of X by arbitrary points.
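As a toy numerical illustration of this definition (not taken from the paper): replacing a single point by ever more distant outliers makes the sample mean drift arbitrarily far, so its breakdown value is only 1/n, whereas the coordinatewise median stays bounded.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((20, 2))

shifts = {}
for lam in [1e2, 1e4, 1e8]:
    X_star = X.copy()
    X_star[0] = lam               # replace one point (m = 1) by a far outlier
    shifts[lam] = (
        # how far the sample mean moves away from its original value
        np.linalg.norm(X_star.mean(axis=0) - X.mean(axis=0)),
        # how far the coordinatewise median moves
        np.linalg.norm(np.median(X_star, axis=0) - np.median(X, axis=0)),
    )
```

The supremum in the definition is taken over all possible replacements; the increasing sequence of outliers above is one witness that it is infinite for the mean.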
For a multivariate estimator of scatter S, the breakdown value is defined as the smallest fraction of contamination needed to make an eigenvalue of S either arbitrarily large or arbitrarily close to zero. We denote the eigenvalues of S(X) by λ_1(S(X)) ≥ . . . ≥ λ_p(S(X)).
The breakdown value of S is then given by:

ε*(S, X) = min{ m/n : sup_{X*_m} max( λ_1(S(X*_m)), 1/λ_p(S(X*_m)) ) = ∞ } .

For the results on breakdown we assume the following conditions on the function ξ:

1. The function ξ takes values in [0, 1].

2. For any dataset X it holds that #{ x_i : ξ(||x_i − T(X)||) = 1 } ≥ ⌊(n + p + 1)/2⌋ .

Note that all functions ξ proposed in (4)-(8) satisfy these assumptions. The following proposition gives the breakdown value of the GSSCM scatter estimator S_g.
Proposition 3. Let X = {x_1, . . . , x_n} be a p-dimensional dataset in general position, meaning that no p + 1 of its points lie on the same hyperplane. Also assume that the location estimator T has a breakdown value of at least ⌊(n − p + 1)/2⌋/n. Then

ε*(S_g, X) = ⌊(n − p + 1)/2⌋ / n .
As we would like the GSSCM scatter estimator to attain this breakdown value, we have to use a location estimator whose breakdown value is at least ⌊(n − p + 1)/2⌋/n. The following proposition verifies that the k-step LTS estimator satisfies this, and even attains the best possible breakdown value for translation equivariant location estimators.

Proposition 4. For any dataset X of size n and any k ≥ 0, the k-step LTS location estimator T_k satisfies ε*(T_k, X) = ⌊(n + 1)/2⌋ / n, the best possible breakdown value of a translation equivariant location estimator.

3 Simulation study
We now perform a simulation study comparing the GSSCM versions (4)-(8). As the estimators are orthogonally equivariant, it suffices to generate diagonal covariance matrices.
For measuring how much the estimate Σ̂ deviates from the true Σ we use the Kullback-Leibler divergence (KLdiv) given by

KLdiv(Σ̂, Σ) = trace(Σ̂ Σ^{−1}) − log det(Σ̂ Σ^{−1}) − p .

We also consider the shape matrices Γ = (det Σ)^{−1/p} Σ and Γ̂ = (det Σ̂)^{−1/p} Σ̂, which have determinant 1.
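The two quantities used in this section can be sketched as follows (an illustrative Python version; we assume the standard zero-mean Gaussian form of the KL divergence, trace(Σ̂Σ^{−1}) − log det(Σ̂Σ^{−1}) − p):

```python
import numpy as np

def kl_div(Sigma_hat, Sigma):
    # Kullback-Leibler-type divergence between N(0, Sigma_hat) and N(0, Sigma):
    # trace(Sigma_hat Sigma^{-1}) - log det(Sigma_hat Sigma^{-1}) - p,
    # which is 0 if and only if Sigma_hat equals Sigma
    p = Sigma.shape[0]
    M = Sigma_hat @ np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(M)   # numerically stable log-determinant
    return np.trace(M) - logdet - p

def shape_matrix(Sigma):
    # shape matrix (det Sigma)^{-1/p} * Sigma, normalized to determinant 1
    p = Sigma.shape[0]
    return Sigma / np.linalg.det(Sigma) ** (1.0 / p)
```

Comparing shape matrices separates the estimation of the orientation and relative eigenvalues from the estimation of the overall scale.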

4 Conclusions
The spatial sign covariance matrix (SSCM) can be seen as a member of a larger class of generalized SSCM (GSSCM) scatter matrices, which consistently estimate the eigenvectors of the underlying covariance matrix while preserving the ranks of its eigenvalues. Their computation is as fast as that of the SSCM. We have studied five GSSCM methods with intuitively appealing radial functions, and shown that their breakdown values are as high as that of the original SSCM. We also derived their influence functions and carried out a simulation study.
The radial function of the SSCM is ξ(r) = 1/r, which implies that points near the center are given a very high weight in the covariance computation. Our alternative radial functions give these points a weight of at most 1, which yields better performance on uncontaminated Gaussian data (Figure 3) as well as contaminated data (Figures 4 and 5). In particular, Winsor is the most similar to the SSCM since its ξ(r) is 1 for the central half of the data and proportional to 1/r for the outer half. It performs best for uncontaminated data, but still suffers when far outliers are present. It is almost uniformly outperformed by Quad, whose ξ(r) is 1 in the central half and proportional to 1/r² outside it. The influence of outliers on Quad smoothly redescends to zero. The other three estimators are hard redescenders, whose ξ(r) = 0 for large enough r.
Among them, the linear redescending (LR) radial function performed best overall.
A potential topic for further research is to investigate principal component analysis based on a GSSCM covariance matrix.
Software availability. R-code for computing these estimators and an example script are available from the website wis.kuleuven.be/stat/robust/software .

A Appendix
Here the proofs of the results are collected.

A.1 Distribution of Euclidean distances
Exact distribution.
The exact distribution of the squared Euclidean distances ||X||² of a multivariate Gaussian distribution with general covariance matrix is given by the following result:

Proposition 5. Let X ∼ N(0, Σ), and suppose the eigenvalues of Σ are given by λ_1, . . . , λ_p. Then ||X||² follows the convolution of p gamma distributions with a constant shape of 1/2 and varying scale parameters equal to twice the eigenvalues of the covariance matrix, i.e. ||X||² ∼ Σ_{i=1}^p G_i with independent G_i ∼ Gamma(1/2, 2λ_i).

Proof. We can write X = U D Z where U is an orthogonal matrix, D is the diagonal matrix with elements √λ_1, . . . , √λ_p, and Z follows the p-variate standard Gaussian distribution. Then ||X||² = ||D Z||² = Σ_{i=1}^p λ_i Z_i², and each λ_i Z_i² ∼ Gamma(1/2, 2λ_i) since Z_i² ∼ χ²_1 = Gamma(1/2, 2).
As p goes to infinity it holds that

( ||X||² − Σ_{i=1}^p λ_i ) / ( 2 Σ_{i=1}^p λ_i² )^{1/2} → N(0, 1)

by the Lyapunov central limit theorem.
Approximate distribution of a sum of gamma variables.
Proposition 5 gives the exact distribution of the squared Euclidean distances ||X|| 2 . The distribution of a sum of gamma distributions has been studied by Moschopoulos (1985).
Quantiles of this distribution can be computed by the R package coga (Hu et al., 2018) for convolutions of gamma distributions. However, this computation requires the knowledge of the eigenvalues λ 1 , . . . , λ p that we are trying to estimate. Therefore we need a transformation of the Euclidean distances such that the transformed distances have an approximate distribution whose quantiles do not require knowing λ 1 , . . . , λ p .
In the simplest case λ_1 = . . . = λ_p (constant eigenvalues), ||X||²/λ_1 follows a χ²_p distribution. It is known that when p increases the distribution of ||X||² tends to a Gaussian distribution, but this also holds for some other powers of ||X||. Wilson and Hilferty (1931) found that the best transformation of this type was ||X||^{2/3}, in the sense of coming closest to a Gaussian distribution. The quantiles q_α of a Gaussian distribution are easy to compute and can then be transformed back to quantiles q_α³ of ||X||².

It turns out that the same Wilson-Hilferty transformation also works quite well in the more general situation where the eigenvalues λ_1, . . . , λ_p need not be the same. We came to this conclusion by a simulation study, a part of which is illustrated here. The dimension p ranged from 1 to 20 in steps of 1. For each p we generated n = 10^6 observations y_1, . . . , y_n from the coga distribution with shape parameters (0.5, . . . , 0.5). The scale parameters had three settings: constant (2, 2, . . . , 2), linear (p, p − 1, . . . , 1), and quadratic (p², (p − 1)², . . . , 1), after which the scale parameters were further standardized in order to sum to 2p. These settings correspond to the distribution of the squared Euclidean norms of a multivariate normal distribution whose covariance matrix has eigenvalues that are constant, proportional to (p, p − 1, . . . , 1) (linear eigenvalues), or proportional to (p², (p − 1)², . . . , 1) (quadratic eigenvalues).

Denote the unsquared Euclidean norms as r_i = √y_i. We then estimate quantiles such as Q_3 by assuming normality of the transformed values h_1(r_i) = r_i² (square), h_2(r_i) = r_i (Fisher), and h_3(r_i) = r_i^{2/3} (Wilson-Hilferty), i.e. by computing the third quartile of a Gaussian distribution with μ̂ = median_i(h(r_i)) and σ̂ = mad_i(h(r_i)). Finally, we evaluated the cumulative distribution function of the coga distribution in Q̂_3². Ideally, we would obtain F_coga(Q̂_3²) = 0.75.
The result of this experiment is shown in Figure 6. We clearly see that the Wilson-Hilferty transform brings the approximate quantile closest to its target value. The results for the first quartile Q1 (not shown) are very similar.
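A minimal version of this experiment can be sketched as follows (illustrative Python instead of the R package coga; the convolution is sampled directly as a sum of independent gamma draws, shown here for the linear eigenvalue pattern with p = 10):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 10
scales = np.arange(p, 0, -1.0)        # "linear" eigenvalue pattern (p, ..., 1)
scales *= 2 * p / scales.sum()        # standardize the scales to sum to 2p
n = 10**6

# ||X||^2 is a convolution of Gamma(shape 1/2, scale 2*lambda_i) variables,
# so we can sample it as a sum of independent gamma draws
y = sum(rng.gamma(0.5, s, size=n) for s in scales)
r = np.sqrt(y)                        # unsquared Euclidean norms

# Estimate Q3 assuming normality of the Wilson-Hilferty transform r^(2/3)
t = r ** (2 / 3)
mu = np.median(t)
sigma = 1.4826 * np.median(np.abs(t - mu))  # MAD with Gaussian consistency factor
q3_hat = (mu + 0.6745 * sigma) ** 1.5       # Gaussian third quartile, transformed back

coverage = np.mean(r <= q3_hat)       # ideally equal to 0.75
```

The empirical coverage plays the role of F_coga(Q̂_3²) in the experiment: the closer it is to 0.75, the better the transformation works.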

A.2 Proof of Proposition 1
Part 1: Preservation of the eigenvectors.
First note that g is orthogonally equivariant, i.e. g(Hx) = Hg(x) for any orthogonal matrix H. For S_g(X) = E_{F_X}[g(X) g(X)^T] this implies S_g(HX) = E_{F_X}[g(HX) g(HX)^T] = H S_g(X) H^T.
The distribution of Z is spherically symmetric, hence invariant to reflections along a coordinate axis, which are described by diagonal matrices R with one entry equal to −1 and all other entries +1. For every reflection matrix R it thus holds that

E_{F_Z}[g(DZ) g(DZ)^T] = E_{F_Z}[g(DRZ) g(DRZ)^T] = E_{F_Z}[g(RDZ) g(RDZ)^T] = R E_{F_Z}[g(DZ) g(DZ)^T] R^T

where the first equality holds because RZ has the same distribution as Z, the second equality holds because DR = RD as both D and R are diagonal, and the last equality because g is orthogonally equivariant and R is orthogonal. Therefore E_{F_Z}[g(DZ) g(DZ)^T] is invariant under conjugation by all such reflections, so its off-diagonal entries vanish and it is a diagonal matrix, which we can denote as Λ_g := diag(λ_{g,1}, . . . , λ_{g,p}). Now take U an arbitrary orthogonal matrix and let X = U D Z. Then

S_g(X) = E_{F_Z}[g(U D Z) g(U D Z)^T] = U E_{F_Z}[g(D Z) g(D Z)^T] U^T = U Λ_g U^T .

For the plain covariance matrix Σ of X we have Σ = E_{F_Z}[U D Z (U D Z)^T] = U Λ U^T where Λ = D D^T = diag(δ_1², . . . , δ_p²). Therefore, the same matrix U diagonalizes both Σ and S_g, hence S_g and Σ have the same eigenvectors.
Part 2: Preservation of the ranks of the eigenvalues.
Let i > j and suppose that δ_i > δ_j. We will show that λ_{g,i} > λ_{g,j}. Note that

λ_{g,i} = E_{F_Z}[ (g(DZ))_i² ] = ∫ δ_i² z_i² ξ(||Dz||)² f_Z(z) dz

where f_Z is the density of Z. Similarly, we have

λ_{g,j} = ∫ δ_j² z_j² ξ(||Dz||)² f_Z(z) dz .

This means that λ_{g,i} > λ_{g,j} is equivalent to:

∫ ( δ_i² z_i² − δ_j² z_j² ) ξ(||Dz||)² f_Z(z) dz > 0 .   (A.2)

Note that we can change the variable of integration as follows. Let y_k = δ_k z_k and write Y = (y_1, . . . , y_p). Then (A.2) is equivalent to

(1/(δ_1 · · · δ_p)) ∫ ( y_i² − y_j² ) ξ(||Y||)² w( (Σ_k y_k²/δ_k²)^{1/2} ) dY > 0 .

We can ignore the positive constant 1/(δ_1 · · · δ_p) and split the integral over the domains A = {Y : |y_i| > |y_j|} and B = {Y : |y_i| < |y_j|}, which yields

∫_{A∪B} ( y_i² − y_j² ) ξ(||Y||)² w( (Σ_k y_k²/δ_k²)^{1/2} ) dY
= ∫_A ( y_i² − y_j² ) ξ(||Y||)² [ w( (Σ_k y_k²/δ_k²)^{1/2} ) − w( (Σ_k y_k²/δ_k² + Δ_ij)^{1/2} ) ] dY

where in the second equality we have changed the variables of the integration over B by replacing (y_i, y_j) by (−y_j, y_i), which has Jacobian 1, maps B onto A, and leaves ||Y|| unchanged. The Δ_ij in that step is the correction

Δ_ij = ( y_i² − y_j² ) ( 1/δ_j² − 1/δ_i² )

by which the argument of w increases after the swap. Note that on A it holds that |y_i| > |y_j| hence y_i² − y_j² > 0, so Δ_ij > 0. Since w is a decreasing function it follows that the integrand on A is nonnegative, and it is strictly positive on a set of positive measure, so λ_{g,i} > λ_{g,j}. If on the other hand δ_i and δ_j are tied, i.e. δ_i = δ_j, it follows that Δ_ij = 0 hence λ_{g,i} = λ_{g,j}.

A.3 Influence function
Proof of Proposition 2.
Consider the contaminated distribution F_{ε,z} = (1 − ε)F + εΔ_z where z ∈ R^p and ε ∈ [0, 1]. We then have:

S_g(F_{ε,z}) = (1 − ε) E_F[ g_ε(X) g_ε(X)^T ] + ε g_ε(z) g_ε(z)^T .

If we take the derivative with respect to ε and evaluate it in ε = 0, we get:

IF(z, S_g, F) = −Ξ_g + g(z) g(z)^T + E_F[ (∂/∂ε g_ε(X))|_{ε=0} g(X)^T + g(X) ((∂/∂ε g_ε(X))|_{ε=0})^T ] .

Calculation of the IF.
While the expression of the influence function might seem relatively simple, its (numerical) calculation is rather involved. We can write:

Ξ_g = S_g(F) = ∫ g(x) g(x)^T dF(x) .
Under contamination the cutoffs become functionals of F_{ε,z}; for instance the Winsor radial function becomes

ξ_ε(r) = 1_{r ≤ Q_{2,ε}} + (Q_{2,ε}/r) 1_{r > Q_{2,ε}}

where Q_{2,ε} is the cutoff Q_2 computed from F_{ε,z}. We then have:

(∂/∂ε) ξ_ε(r) |_{ε=0} = IF(z, Q_2, F) [ (Q_2/r − 1) δ(r − Q_2) + (1/r) 1_{r > Q_2} ] .

As (Q_2/r − 1) δ(r − Q_2) is 0 everywhere, we only need to integrate the last term. This yields

E_F[ (∂/∂ε g_ε(X))|_{ε=0} g(X)^T ] = IF(z, Q_2, F) ∫_{||x|| > Q_2} (x/||x||) g(x)^T dF(x) .

The influence function of S_g is thus given by:

IF(z, S_g, F) = −Ξ_g + g(z) g(z)^T + IF(z, Q_2, F) [ ∫_{||x|| > Q_2} (x/||x||) g(x)^T dF(x) + ∫_{||x|| > Q_2} g(x) (x/||x||)^T dF(x) ] .

Note that the last 2 terms in the sum are each other's transpose. The integration is done numerically.
The derivation of the influence function of the Quad GSSCM is entirely similar to that of Winsor. The main difference is that now

(∂/∂ε) g_ε(X) |_{ε=0} = IF(z, Q_2, F) (2 Q_2 / ||X||²) X 1_{||X|| > Q_2} .

The linearly redescending (LR) method uses a second cutoff. In the contaminated case we obtain g_ε(x) = x ξ_ε(||x||) with

ξ_ε(r) = 1_{r ≤ Q_{2,ε}} + ( (Q*_{3,ε} − r) / (Q*_{3,ε} − Q_{2,ε}) ) 1_{Q_{2,ε} < r ≤ Q*_{3,ε}} .

Taking the derivative with respect to ε and evaluating it in ε = 0, the delta functions arising at the cutoffs again vanish, since ξ is continuous in Q_2 and equals zero in Q*_3. When integrating, only the term supported on {Q_2 < ||x|| ≤ Q*_3} plays a role, and it involves the influence functions of both Q_2 and Q*_3. For the Ball GSSCM we analogously derive

(∂/∂ε) ξ_ε(r) |_{ε=0} = IF(z, Q_2, F) δ(r − Q_2) ,

so the integration concentrates on the sphere {||x|| = Q_2}. Finally, for the Shell GSSCM the same reasoning applies at both of its cutoffs Q_1 and Q_3.
Next we show that the smallest eigenvalue of S_g(X*) has a positive lower bound for all contaminated datasets X*. By condition 2 on ξ we know that #{ x_i : ξ(||x_i − T(X*)||) = 1 } ≥ ⌊(n + p + 1)/2⌋. Therefore, we have at least ⌊(n + p + 1)/2⌋ − (⌊(n − p + 1)/2⌋ − 1) = p + 1 regular points for which ξ(||x_i − T(X*)||) = 1; let us assume w.l.o.g. that these are x_1, . . . , x_{p+1}. We can now write

λ_min = min_{||u||=1} u^T S_g(X*) u ≥ min_{||u||=1} (1/n) Σ_{i=1}^{p+1} ( u^T (x_i − T(X*)) )² > 0 ,

where the last inequality holds because T(X*) stays bounded and the points x_1, . . . , x_{p+1} are in general position.

Part 2. It remains to show that ε*(S_g, X) ≤ ⌊(n − p + 1)/2⌋/n. This is the known upper bound for affine equivariant scatter estimators, but that result does not apply here, so we need to show it for this case. Take any m ≥ ⌊(n − p + 1)/2⌋ and replace the last m points of X, keeping the points x_1, . . . , x_{n−m} unchanged. By location equivariance we can assume w.l.o.g. that the average of x_1, . . . , x_{n−m} is zero. For j = n − m + 1, . . . , n put x*_j = λ a_j where the a_j are such that min_{i=n−m+1,...,n, i≠j} ||a_j − a_i|| ≥ 1 and such that for all λ > 1 it holds that min_{i=1,...,n−m} ||λ a_j − x_i|| ≥ λ. This is possible by placing the a_j outside of the convex hull of X and far enough from each other and from X. Now consider an unbounded increasing sequence of λ_k > 1. For every λ_k the set {x*_{n−m+1}, . . . , x*_n} must contain at least one point for which w_i := ξ(||x*_i − T(X*)||) = 1; call this point x*_b. Take another point of X* for which w_i = 1, and name it x*_c. Note that x*_c can be an original data point or a replaced point. We now have that ||x