Elsevier

Computers & Geosciences

Volume 39, February 2012, Pages 77-85
Interpretation of multivariate outliers for compositional data

https://doi.org/10.1016/j.cageo.2011.06.014

Abstract

Compositional data—and most data in geochemistry are of this type—carry relative rather than absolute information. For multivariate outlier detection methods this implies that not the given data but appropriately transformed data need to be used. We use the isometric logratio (ilr) transformation, which seems to be generally the most proper one for theoretical and practical reasons. In this space it is difficult to interpret the outliers, because the reason for outlyingness can be complex. Therefore we introduce tools that support the interpretation of outliers by representing multivariate information in biplots, maps, and univariate scatterplots.

Highlights

► A special transformation needs to be used for compositional data. ► Tools are developed that help to interpret multivariate outliers. ► The interpretation can be done in different graphical displays. ► The type of symbol and color is the same in all displays. ► R code is available, which allows for a flexible interaction within the plots.

Introduction

In many practical applications in the geosciences one has to deal with compositional data, i.e., multivariate observations that quantitatively describe the parts of some whole. Their components therefore carry exclusively relative information about the parts (Aitchison, 1986). Typically these observations are expressed as data with a constant sum constraint, such as proportions, percentages, or mg/kg. Standard statistical methods usually fail when they are applied directly to compositional data (Filzmoser and Hron, 2008; Filzmoser et al., 2009; Hron et al., 2010). Many authors appear to be under the impression that the main reason lies in a nonnormal distribution of the samples (for example, of chemical elements in a rock) and thus recommend applying a logarithmic transformation to the data set in order to achieve normality (Reimann et al., 2008). Depending on the transformation chosen, normality can often be reached, so that the data pass a statistical test. However, the actual problem lies deeper. The original data follow, in fact, another geometry (usually called the Aitchison geometry; see, e.g., Egozcue and Pawlowsky-Glahn, 2006, for details) on the sample space of compositions, the simplex, defined for a D-part composition $x=(x_1,\dots,x_D)$ as

$$S^D=\left\{x=(x_1,\dots,x_D),\ x_i>0,\ i=1,\dots,D,\ \sum_{i=1}^{D}x_i=\kappa\right\}.$$

The positive constant $\kappa$ stands for 1 in the case of proportions, 100 for percentages, or $10^6$ for mg/kg.
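To make the constant sum constraint concrete, the following minimal sketch rescales a vector of strictly positive parts so that it sums to a chosen $\kappa$. The function name `closure` and the sample values are illustrative assumptions, not part of the paper or of any package it mentions. The sketch also shows that rescaling leaves the ratios between parts unchanged.

```python
def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa."""
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [kappa * p / total for p in parts]


# Hypothetical raw measurements in arbitrary units (illustration only).
raw = [12.0, 30.0, 58.0, 100.0]

proportions = closure(raw)           # kappa = 1
percentages = closure(raw, 100.0)    # kappa = 100
mg_per_kg = closure(raw, 1e6)        # kappa = 10^6

# The ratio between any two parts is the same in every representation.
ratio_raw = raw[0] / raw[1]
ratio_pct = percentages[0] / percentages[1]
```

Whatever value of $\kappa$ is used, the ratios between parts stay the same, which is why only relative information is carried by a composition.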

Because geochemical data follow the Aitchison geometry, standard statistical methods that rely mostly on the Euclidean geometry cannot be used for raw compositional data. Whether or not the data follow a normal distribution is of no importance at all. To transform the data to the Euclidean space, the family of logratio transformations from the simplex $S^D$ to the Euclidean real space was proposed. Only after such a transformation is the use of standard statistical methods possible. The three main types in this family of transformations are the additive logratio (alr), the centered logratio (clr), and the isometric logratio (ilr) transformation. The alr (Aitchison, 1986) is simple and could be used in the context of outlier detection. However, it is not recommended because it does not result in an orthogonal basis system, which is necessary for diagnostic tools following outlier detection. The clr (Aitchison, 1986) results in data singularity, which is in conflict with the usual tools for outlier detection. The ilr (Egozcue et al., 2003) is recommended because it forms a one-to-one relation between the Aitchison geometry on the simplex and the standard Euclidean geometry, with excellent geometrical properties.
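The drawbacks named for alr and clr can be illustrated numerically. The sketch below is only an illustration under stated assumptions: the helper names `alr`, `clr`, and `euclid` and the two three-part compositions are hypothetical. It shows that the Euclidean distance between alr-transformed compositions depends on which part is used as the denominator, and that clr coordinates sum to zero.

```python
import math


def alr(x, ref):
    """Additive logratio: ln(x_i / x_ref) for every part except the reference."""
    return [math.log(v / x[ref]) for k, v in enumerate(x) if k != ref]


def clr(x):
    """Centered logratio: log of each part minus the mean of all logs."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def euclid(a, b):
    """Ordinary Euclidean distance between two coordinate vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


# Two hypothetical three-part compositions (proportions).
x = [0.1, 0.2, 0.7]
y = [0.3, 0.3, 0.4]

d_alr_first = euclid(alr(x, 0), alr(y, 0))   # first part as denominator
d_alr_last = euclid(alr(x, 2), alr(y, 2))    # last part as denominator
d_clr = euclid(clr(x), clr(y))               # equals the Aitchison distance
clr_sum = sum(clr(x))                        # zero up to rounding
```

In this example the two alr distances differ from each other and from the clr-based distance, while the clr coordinates add up to zero.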

The $D-1$ ilr variables are coordinates with respect to an orthonormal basis on the simplex (in the Aitchison geometry); thus a proper choice of the basis seems to be crucial for their interpretation. Here the major step forward is the sequential binary partition procedure (Egozcue and Pawlowsky-Glahn, 2005), which enables interpretation of the orthonormal coordinates in the sense of balances between groups of compositional parts. Additionally, each ilr variable explains all the logratios, i.e., terms of type $\ln(x_i/x_j)$, $i,j=1,\dots,D$, between parts of the corresponding groups (Fišerová and Hron, in press); conversely, each logratio in the composition is exclusively explained by one balance. This point of view seems to be meaningful, because the definition of compositions implies that the only relevant information is contained in (log)ratios of compositional parts. Although the sequential binary partition can also be tailored to the specific geochemical problem (Buccianti et al., 2008), in practice the following D choices of the orthonormal bases seem to be the most useful (Egozcue et al., 2003; Hron et al., 2010). Explicitly, we obtain $(D-1)$-dimensional real vectors $z^{(l)}=(z_1^{(l)},\dots,z_{D-1}^{(l)})$, $l=1,\dots,D$,

$$z_i^{(l)}=\sqrt{\frac{D-i}{D-i+1}}\,\ln\frac{x_i^{(l)}}{\sqrt[D-i]{\prod_{j=i+1}^{D}x_j^{(l)}}},\quad i=1,\dots,D-1,$$

where $(x_1^{(l)},x_2^{(l)},\dots,x_l^{(l)},x_{l+1}^{(l)},\dots,x_D^{(l)})$ stands for a permutation of the parts $(x_1,\dots,x_D)$ such that the $l$th compositional part always fills the first position, $(x_l,x_1,\dots,x_{l-1},x_{l+1},\dots,x_D)$. In such a configuration, the first ilr variable $z_1^{(l)}$ explains all the relative information (logratios) about the original compositional part $x_l$; the coordinates $z_2^{(l)},\dots,z_{D-1}^{(l)}$ then explain the remaining logratios in the composition (Fišerová and Hron, in press). Note that the only important position is that of $x_1^{(l)}$ (because it can be fully explained by $z_1^{(l)}$); the other parts can be ordered arbitrarily, because different ilr transformations are orthogonal rotations of each other (Egozcue et al., 2003). Of course, we cannot say that $z_1^{(l)}$ is the original compositional part $x_l$, but it explains all the information concerning $x_l$; thus, it stands for $x_l$.
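The coordinate formula can be sketched in a few lines of code. This is a minimal illustration, assuming a hypothetical helper named `pivot_ilr` and 0-based indexing for the part that is moved to the first position; it is not the implementation used by the authors. It uses the fact that the log of a geometric mean is the mean of the logs.

```python
import math


def pivot_ilr(x, l=0):
    """ilr coordinates z^(l) with part x[l] moved to the first position.

    `l` is 0-based here, whereas the text counts parts from 1.
    """
    # Permute the parts so that x[l] comes first; the rest keep their order.
    xp = [x[l]] + [v for k, v in enumerate(x) if k != l]
    D = len(xp)
    logs = [math.log(v) for v in xp]
    z = []
    for i in range(1, D):                    # i = 1, ..., D-1 as in the formula
        rest = logs[i:]                      # ln x_{i+1}, ..., ln x_D
        mean_rest = sum(rest) / (D - i)      # log of their geometric mean
        scale = math.sqrt((D - i) / (D - i + 1))
        z.append(scale * (logs[i - 1] - mean_rest))
    return z


def norm(v):
    """Euclidean norm of a coordinate vector."""
    return math.sqrt(sum(c * c for c in v))


x = [0.1, 0.2, 0.3, 0.4]                     # hypothetical 4-part composition
z_first = pivot_ilr(x, 0)                    # first part in the pivot position
z_third = pivot_ilr(x, 2)                    # third part in the pivot position
```

Because the different choices of `l` are orthogonal rotations of each other, `norm(z_first)` and `norm(z_third)` agree up to rounding, even though the individual coordinates differ.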

An interesting consequence follows for the well-known clr transformation from $S^D$ to $\mathbb{R}^D$, resulting in

$$y=(y_1,\dots,y_D)=\left(\ln\frac{x_1}{\sqrt[D]{\prod_{i=1}^{D}x_i}},\dots,\ln\frac{x_D}{\sqrt[D]{\prod_{i=1}^{D}x_i}}\right).$$

It is easy to see that there exists a linear relationship between $y_l$ and $z_1^{(l)}$. Inserting the definitions of both coordinates gives $y_l=\frac{D-1}{D}\left(\ln x_l-\frac{1}{D-1}\sum_{j\neq l}\ln x_j\right)$, and hence

$$y_l=\sqrt{\frac{D-1}{D}}\,z_1^{(l)}.$$

Thus, up to a constant, the single clr variables have the same interpretation as the corresponding ilr coordinates: they explain all logratios concerning the $l$th compositional part. However, as a consequence, some of the logratios are explained more than once by the D clr variables (in contrast to the ilr transformation). This is also an intuitive reason for the resulting singularity $y_1+\dots+y_D=0$ of clr variables, which rules out, e.g., the use of robust multivariate statistical methods. On the other hand, the clr transformation is a cornerstone of the compositional biplot (Aitchison and Greenacre, 2002), which will be employed further in the paper.
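The proportionality between a clr variable and the first ilr coordinate for the same part can be checked numerically. The sketch below evaluates both definitions directly; working from those definitions, the constant comes out as $\sqrt{(D-1)/D}$, i.e., $y_l=\sqrt{(D-1)/D}\,z_1^{(l)}$. The helper names and the sample composition are illustrative assumptions.

```python
import math


def clr(x):
    """Centered logratio coordinates y_1, ..., y_D."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def first_ilr_coordinate(x, l):
    """z_1^(l): part x[l] (0-based) against the geometric mean of the others."""
    D = len(x)
    others = [math.log(v) for k, v in enumerate(x) if k != l]
    return math.sqrt((D - 1) / D) * (math.log(x[l]) - sum(others) / (D - 1))


x = [0.1, 0.2, 0.3, 0.4]          # hypothetical 4-part composition
D = len(x)
y = clr(x)
z1 = [first_ilr_coordinate(x, l) for l in range(D)]

# Ratio y_l / z_1^(l) for every part; each equals sqrt((D - 1) / D).
ratios = [y[l] / z1[l] for l in range(D)]
clr_total = sum(y)                # singularity: zero up to rounding
```

All entries of `ratios` coincide, and `clr_total` vanishes up to rounding, which is the singularity described in the text.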

From the above-mentioned properties of logratio transformations, it follows that the ordered D-tuple of the ilr coordinates, $z_1^{(l)}$, $l=1,\dots,D$, can be obtained from clr-transformed data as $\sqrt{D/(D-1)}\,y$. Nevertheless, note that it would not be meaningful to interpret the relations between the clr variables or even between the variables $z_1^{(l)}$ using their correlation structure; here the subcompositional incoherence (which means that the results of statistical modeling might be incompatible if only a subset of the parts were used; see, e.g., Aitchison, 1986; Filzmoser et al., 2010, for details) could lead to wrong conclusions. Some kind of “incompatibility” is obtained also for the single variables $z_1^{(l)}$, $l=1,\dots,D$; however, here it is a natural consequence of the fact that the information available in $x$ was reduced just to a subcomposition.

In contrast to univariate outliers, whose identification as extreme observations is straightforward, for multivariate outliers the covariance structure of the data set needs to be considered as well (Filzmoser et al., 2005). Moreover, when working with compositional data, one has to consider the data structure in view of the Aitchison geometry; see, e.g., Hron et al. (2010) and Filzmoser and Hron (in press). For example, an elliptical point cloud arising from a multivariate normal distribution in the usual Euclidean geometry can look very different in the Aitchison geometry, depending on its position in space (see, for instance, the back-transformed ellipses in Fig. 1B to the Aitchison geometry in Fig. 1A). This is important for multivariate outlier detection methods, which are usually based on distances from an elliptically symmetric distribution. Moreover, each compositional data point can be shifted along the line from the origin through the point without changing the ratios of the compositional parts. Formally, an observed composition $x=(x_1,\dots,x_D)$ is defined as a member of the corresponding equivalence class of $x$,

$$\underline{x}=\{cx,\ c\in\mathbb{R}^{+}\}.$$

Thus, two compositions that are elements of the same equivalence class $\underline{x}$ (we also call them compositionally equivalent; see Egozcue and Pawlowsky-Glahn, 2006) contain the same information and have zero Aitchison (1986) distance. From this point of view, the “extremeness” of the outliers can be even more misleading than in the case of standard (Euclidean) multivariate outliers.
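Compositional equivalence can be illustrated with a small numerical sketch. It assumes the standard pairwise-logratio form of the Aitchison distance, $d_A(x,y)=\sqrt{\frac{1}{D}\sum_{i<j}\left(\ln\frac{x_i}{x_j}-\ln\frac{y_i}{y_j}\right)^2}$; the function names and the sample values are hypothetical.

```python
import math


def aitchison_distance(x, y):
    """Aitchison distance computed from all pairwise logratios."""
    D = len(x)
    total = 0.0
    for i in range(D):
        for j in range(i + 1, D):
            diff = math.log(x[i] / x[j]) - math.log(y[i] / y[j])
            total += diff ** 2
    return math.sqrt(total / D)


def euclidean_distance(x, y):
    """Ordinary Euclidean distance on the raw values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


x = [2.0, 5.0, 13.0]                  # hypothetical composition
scaled = [50.0 * v for v in x]        # same equivalence class, c = 50

d_aitchison = aitchison_distance(x, scaled)   # zero up to rounding
d_euclid = euclidean_distance(x, scaled)      # large, although ratios agree
```

The rescaled point lies far from the original in the Euclidean sense, yet the two are at zero Aitchison distance, which is why raw "extremeness" can mislead.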

The methods for outlier detection of compositional data will be discussed in Section 2, where both theoretical aspects and possibilities for graphical representations will be considered. Section 3 proposes several tools for the interpretation of multivariate outliers that have been implemented in the statistical software environment R (R Development Core Team, 2011). In Section 4 we show how the tool is used for real problems, and how results can be interpreted. Section 5 concludes.

Section snippets

Methods for multivariate outlier detection and graphical representation

As with the other multivariate methods applied to compositional data, it is important to use an appropriate data transformation first. Either the clr transformation or a proper choice of the ilr transformation can be used for this purpose; see Filzmoser and Hron (2008).

Tools for interpreting multivariate outliers

The tools discussed in this section are implemented and freely available in the R package mvoutlier; see Filzmoser and Gschwandtner (2011). Mainly, two functions are relevant for the user:

  • mvoutlier.CoDa() requires an untransformed input data matrix with at least three compositional parts. Robust location and covariance estimations are derived using the adaptive approach of Filzmoser et al. (2005) (with sensible default values) for the ilr-transformed data. These are used for computing robust

Examples

In this section we demonstrate the use of the outlier tools for two data sets from geochemistry.

Conclusions

Multivariate outliers are often the most interesting data points because they show atypical phenomena. Several methods have been proposed for the identification of multivariate outliers, making use of the technology of robust statistics (Maronna et al., 2006). Such tools have also been developed in the context of compositional data (Filzmoser and Hron, 2008). As a result, the data investigator gets the information on the samples that are potential multivariate outliers, but not the information

Acknowledgments

The authors are grateful to the editor and to the referees for helpful comments and suggestions. This work was supported by the Council of the Czech Government, MSM 6198959214.

References (24)

  • Eilu, P., Hallberg, A., Berman, T., Feoktistov, V., Korsakova, M., Krasotkin, S., Kitosmanen, E., Lampio, E.,...
  • Filzmoser, P., Gschwandtner, M., 2011. mvoutlier: multivariate outlier detection based on robust methods. Manual and...