Interpretation of multivariate outliers for compositional data

doi:10.1016/j.cageo.2011.06.014

Computers & Geosciences

Volume 39, February 2012, Pages 77-85

https://doi.org/10.1016/j.cageo.2011.06.014

Abstract

Compositional data, which include most data in geochemistry, carry relative rather than absolute information. For multivariate outlier detection this implies that the methods must be applied not to the given data but to appropriately transformed data. We use the isometric logratio (ilr) transformation, which seems to be generally the most appropriate choice for both theoretical and practical reasons. In the ilr space, however, outliers are difficult to interpret, because the reason for outlyingness can be complex. We therefore introduce tools that support the interpretation of outliers by representing multivariate information in biplots, maps, and univariate scatterplots.

Highlights

► A special transformation needs to be used for compositional data.
► Tools are developed that help to interpret multivariate outliers.
► The interpretation can be done in different graphical displays.
► The type of symbol and color is the same in all displays.
► R code is available, which allows for a flexible interaction within the plots.

Introduction

In many practical applications in the geosciences one has to deal with compositional data, i.e., with multivariate observations that quantitatively describe the parts of some whole. Their components thus carry exclusively relative information about the parts (Aitchison, 1986). Typically these observations are expressed as data with a constant sum constraint, such as proportions, percentages, or mg/kg. Standard statistical methods usually fail when they are applied directly to compositional data (Filzmoser and Hron, 2008; Filzmoser et al., 2009; Hron et al., 2010). Many authors appear to be under the impression that the main reason lies in a nonnormal distribution of the samples (for example, of chemical elements in a rock) and thus recommend applying a logarithmic transformation in order to achieve normality of the data set (Reimann et al., 2008). Depending on the transformation chosen, normality can often be reached, so that the data pass a statistical test. However, the reality is more complicated. The original data in fact follow another geometry (usually called the Aitchison geometry; see, e.g., Egozcue and Pawlowsky-Glahn, 2006, for details) on the sample space of compositions, the simplex, defined for a D-part composition $x = (x_{1}, \dots, x_{D})'$ as $S^{D} = \{x = (x_{1}, \dots, x_{D})', x_{i} > 0, i = 1, \dots, D, \sum_{i = 1}^{D} x_{i} = \kappa\} .$ The positive constant $\kappa$ stands for 1 in the case of proportions, 100 for percentages, or $10^{6}$ for mg/kg.
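The role of the constant $\kappa$ can be illustrated with a minimal sketch. The function name `closure` and the measurement values are assumptions made for illustration only; the point is that rescaling to a different constant sum changes the numbers but not the ratios between parts.

```python
def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa.

    `closure` is an illustrative helper name, not taken from the paper.
    """
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive")
    total = sum(parts)
    return [kappa * p / total for p in parts]


# Hypothetical raw measurements of a 3-part composition.
raw = [12.0, 30.0, 18.0]

as_proportions = closure(raw, kappa=1.0)   # kappa = 1
as_percent = closure(raw, kappa=100.0)     # kappa = 100
as_mg_per_kg = closure(raw, kappa=1e6)     # kappa = 10^6

# The ratio between the first two parts is the same in every representation.
ratio_raw = raw[0] / raw[1]
ratio_percent = as_percent[0] / as_percent[1]
ratio_mg_per_kg = as_mg_per_kg[0] / as_mg_per_kg[1]
```

All three representations describe the same composition, which is why only the ratios, and not the individual values, are treated as informative.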

Because geochemical data follow the Aitchison geometry, standard statistical methods, which rely mostly on the Euclidean geometry, cannot be used for raw compositional data. Whether or not the data follow a normal distribution is of no importance here. To transform the data to the Euclidean space, the family of logratio transformations from the simplex $S^{D}$ to the Euclidean real space was proposed. Only after such a transformation can the standard statistical methods be used. The three main types in this family of transformations are the additive logratio (alr), the centered logratio (clr), and the isometric logratio (ilr) transformation. The alr (Aitchison, 1986) is simple and could be used in the context of outlier detection. However, it is not recommended because it does not result in an orthogonal basis system, which is necessary for the diagnostic tools that follow outlier detection. The clr (Aitchison, 1986) results in singular data, which is in conflict with the usual tools for outlier detection. The ilr (Egozcue et al., 2003) is recommended because it forms a one-to-one relation between the Aitchison geometry on the simplex and the standard Euclidean geometry, with excellent geometrical properties.

The D−1 ilr variables are coordinates with respect to an orthonormal basis on the simplex (in the sense of the Aitchison geometry); thus a proper choice of the basis seems to be crucial for their interpretation. An important advance here is the sequential binary partition procedure (Egozcue and Pawlowsky-Glahn, 2005), which enables interpretation of the orthonormal coordinates in the sense of balances between groups of compositional parts. Additionally, each ilr variable explains all the logratios, i.e., terms of type $\ln (x_{i} / x_{j}), i, j = 1, \dots, D$ , between parts of the corresponding groups (Fišerová and Hron, in press); conversely, each logratio in the composition is exclusively explained by one balance. This point of view seems to be meaningful, because the definition of compositions implies that the only relevant information is contained in (log)ratios of compositional parts. Although the sequential binary partition can also be tailored to the concrete geochemical problem (Buccianti et al., 2008), in practice the following D choices of orthonormal bases seem to be the most useful (Egozcue et al., 2003; Hron et al., 2010). Explicitly, we obtain (D−1)-dimensional real vectors $z^{(l)} = (z_{1}^{(l)}, \dots, z_{D - 1}^{(l)})', l = 1, \dots, D$ , $z_{i}^{(l)} = \sqrt{\frac{D - i}{D - i + 1}} \ln \frac{x_{i}^{(l)}}{\sqrt[D - i]{\prod_{j = i + 1}^{D} x_{j}^{(l)}}}, i = 1, \dots, D - 1,$ where $(x_{1}^{(l)}, x_{2}^{(l)}, \dots, x_{l}^{(l)}, x_{l + 1}^{(l)}, \dots, x_{D}^{(l)})$ stands for a permutation of the parts $(x_{1}, \dots, x_{D})$ such that the lth compositional part always fills the first position, $(x_{l}, x_{1}, \dots, x_{l - 1}, x_{l + 1}, \dots, x_{D})$ . In such a configuration, the first ilr variable $z_{1}^{(l)}$ explains all the relative information (logratios) about the original compositional part $x_{l}$; the coordinates $z_{2}^{(l)}, \dots, z_{D - 1}^{(l)}$ then explain the remaining logratios in the composition (Fišerová and Hron, in press).
Note that the only important position is that of $x_{1}^{(l)}$ (because it can be fully explained by $z_{1}^{(l)}$ ); the other parts can be arranged arbitrarily, because different ilr transformations are orthogonal rotations of each other (Egozcue et al., 2003). Of course, we cannot say that $z_{1}^{(l)}$ is the original compositional part $x_{l}$, but it explains all the information concerning $x_{l}$; thus, it stands for $x_{l}$.
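The coordinate formula above can be written out directly. The following is a minimal sketch using only the Python standard library; the function name `pivot_ilr` and the example composition are assumptions made for illustration and are not part of the R implementation discussed later. The sketch also checks two properties stated in the text: the coordinates do not depend on the constant sum, and the vectors $z^{(l)}$ for different $l$ have the same Euclidean length, as expected for orthogonal rotations of each other.

```python
import math


def pivot_ilr(x, l):
    """Return z^(l) = (z_1, ..., z_{D-1}) with part l (1-based) moved to the front.

    `pivot_ilr` is an illustrative name, not a function of any package.
    """
    x = list(x)
    D = len(x)
    if D < 2 or any(p <= 0 for p in x):
        raise ValueError("need at least two strictly positive parts")
    if not 1 <= l <= D:
        raise ValueError("l must be between 1 and D")
    # Permutation (x_l, x_1, ..., x_{l-1}, x_{l+1}, ..., x_D).
    perm = [x[l - 1]] + x[:l - 1] + x[l:]
    logs = [math.log(p) for p in perm]
    z = []
    for i in range(1, D):  # i = 1, ..., D-1
        # Log of the geometric mean of the D - i parts following position i.
        mean_rest = sum(logs[i:]) / (D - i)
        z.append(math.sqrt((D - i) / (D - i + 1)) * (logs[i - 1] - mean_rest))
    return z


x = [0.2, 0.3, 0.5]  # made-up 3-part composition
coords = {l: pivot_ilr(x, l) for l in (1, 2, 3)}
norms = {l: math.sqrt(sum(v * v for v in z)) for l, z in coords.items()}

# The same composition expressed in percent gives identical coordinates.
coords_percent = pivot_ilr([20.0, 30.0, 50.0], 1)
```

Here `coords[l][0]` plays the role of $z_{1}^{(l)}$, the coordinate that stands for the part $x_{l}$.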

An interesting consequence follows for the known clr transformation from $S^{D}$ to $R^{D}$ , resulting in $y = (y_{1}, \dots, y_{D})' = (\ln \frac{x_{1}}{\sqrt[D]{\prod_{i = 1}^{D} x_{i}}}, \dots, \ln \frac{x_{D}}{\sqrt[D]{\prod_{i = 1}^{D} x_{i}}})' .$ It is easy to see that there exists a linear relationship between $y_{l}$ and $z_{1}^{(l)}$ , namely $y_{l} = \sqrt{\frac{D - 1}{D}} z_{1}^{(l)} .$ Thus, up to a constant, the single clr variables have the same interpretation as the corresponding ilr coordinates: they explain all logratios concerning the lth compositional part. However, as a consequence, some of the logratios are explained more than once by the D clr variables (in contrast to the ilr transformation). This is also an intuitive reason for the resulting singularity $y_{1} + \dots + y_{D} = 0$ of clr variables, which rules out, e.g., the use of robust multivariate statistical methods. On the other hand, the clr transformation is a cornerstone of the compositional biplot (Aitchison and Greenacre, 2002), which will be employed further in the paper.

From the above-mentioned properties of logratio transformations, it follows that the ordered D-tuple of the ilr coordinates, $z_{1}^{(l)}, l = 1, \dots, D$ , can be obtained from clr-transformed data as $\sqrt{D / (D - 1)} y$ . Nevertheless, note that it would not be meaningful to interpret the relations between the clr variables or even between the variables $z_{1}^{(l)}$ using their correlation structure; here the subcompositional incoherence (which means that the results of statistical modeling might be incompatible if only a subset of the parts were used; see, e.g., Aitchison, 1986; Filzmoser et al., 2010, for details) could lead to wrong conclusions. Some kind of “incompatibility” is obtained also for the single variables $z_{1}^{(l)}, l = 1, \dots, D$ ; however, here it is a natural consequence of the fact that the information available in $x$ was reduced just to a subcomposition.
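The proportionality between $y_{l}$ and $z_{1}^{(l)}$ can be checked numerically from the two definitions alone. The sketch below is only an illustration under stated assumptions: the helper names `clr` and `first_pivot_coordinate` and the example composition are hypothetical. It computes the ratio $y_{l} / z_{1}^{(l)}$ for every $l$ rather than assuming its value; with the definitions given above, the ratio evaluates to $\sqrt{(D - 1)/D}$ for each part. It also confirms the zero-sum singularity of the clr variables.

```python
import math


def clr(x):
    """Centered logratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def first_pivot_coordinate(x, l):
    """z_1^(l): part l (1-based) against the geometric mean of the other parts."""
    D = len(x)
    others = [math.log(p) for j, p in enumerate(x, start=1) if j != l]
    return math.sqrt((D - 1) / D) * (math.log(x[l - 1]) - sum(others) / (D - 1))


x = [0.1, 0.2, 0.3, 0.4]  # made-up 4-part composition
D = len(x)
y = clr(x)

# Ratio y_l / z_1^(l) for each part; no value is assumed in advance.
ratios = [y[l - 1] / first_pivot_coordinate(x, l) for l in range(1, D + 1)]

# Sum of the clr variables, which should vanish up to rounding error.
clr_sum = sum(y)
```

Because the ratio is the same constant for every part, a single rescaling of the clr vector yields all D coordinates $z_{1}^{(l)}$ at once.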

In contrast to univariate outliers, whose identification as extreme observations is straightforward, for multivariate outliers the covariance structure of the data set needs to be considered as well (Filzmoser et al., 2005). Moreover, when working with compositional data, one has to consider the data structure in view of the Aitchison geometry; see, e.g., Hron et al. (2010) and Filzmoser and Hron (in press). For example, an elliptical point cloud arising from a multivariate normal distribution in the usual Euclidean geometry can look very different in the Aitchison geometry, depending on its position in space (see, for instance, the ellipses in Fig. 1B back-transformed to the Aitchison geometry in Fig. 1A). This is important for multivariate outlier detection methods, which are usually based on distances from the center of an elliptically symmetric distribution. Moreover, each compositional data point can be shifted along the line from the origin through the point without changing the ratios of the compositional parts. Formally, an observed composition $x = (x_{1}, \dots, x_{D})'$ is defined as a member of the corresponding equivalence class of $x$ , $\underline{x} = \{c x, c \in R^{+}\} .$ Thus, two compositions that are elements of the same equivalence class $\underline{x}$ (we also call them compositionally equivalent; see Egozcue and Pawlowsky-Glahn, 2006) contain the same information and have zero Aitchison (1986) distance. From this point of view, the “extremeness” of the outliers can be even more misleading than in the case of standard (Euclidean) multivariate outliers.
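The statement about compositionally equivalent observations can be made concrete with a short sketch. It relies on the standard fact that the Aitchison distance between two compositions equals the Euclidean distance between their clr-transformed vectors; the function names and the example values are assumptions chosen for illustration.

```python
import math


def clr(x):
    """Centered logratio: log of each part over the geometric mean of all parts."""
    logs = [math.log(p) for p in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def aitchison_distance(x, y):
    """Euclidean distance between the clr vectors of two compositions."""
    cx, cy = clr(x), clr(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


x = [0.2, 0.3, 0.5]              # made-up composition
x_scaled = [7.5 * p for p in x]  # same equivalence class, c = 7.5
other = [0.6, 0.3, 0.1]          # a genuinely different composition

d_equivalent = aitchison_distance(x, x_scaled)  # zero up to rounding error
d_different = aitchison_distance(x, other)      # strictly positive
euclid_raw = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_scaled)))
```

In the raw data `x_scaled` lies far from `x` in the Euclidean sense (`euclid_raw` is large), although both carry identical relative information. This is one way in which apparent "extremeness" in untransformed data can mislead.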

The methods for outlier detection of compositional data will be discussed in Section 2, where both theoretical aspects and possibilities for graphical representations will be considered. Section 3 proposes several tools for the interpretation of multivariate outliers that have been implemented in the statistical software environment R (R Development Core Team, 2011). In Section 4 we show how the tool is used for real problems, and how results can be interpreted. Section 5 concludes.

Section snippets

Methods for multivariate outlier detection and graphical representation

As with the other multivariate methods applied to compositional data, it is important to use an appropriate data transformation first. Either the clr transformation or a proper choice of the ilr transformation can be used for this purpose; see Filzmoser and Hron (2008).

Tools for interpreting multivariate outliers

The tools discussed in this section are implemented and freely available in the R package mvoutlier; see Filzmoser and Gschwandtner (2011). Mainly, two functions are relevant for the user:

• mvoutlier.CoDa() requires an untransformed input data matrix with at least three compositional parts. Robust location and covariance estimations are derived using the adaptive approach of Filzmoser et al. (2005) (with sensible default values) for the ilr-transformed data. These are used for computing robust

Examples

In this section we demonstrate the use of the outlier tools for two data sets from geochemistry.

Conclusions

Multivariate outliers are often the most interesting data points because they show atypical phenomena. Several methods have been proposed for the identification of multivariate outliers, making use of the technology of robust statistics (Maronna et al., 2006). Such tools have also been developed in the context of compositional data (Filzmoser and Hron, 2008). As a result, the data investigator gets the information on the samples that are potential multivariate outliers, but not the information

Acknowledgments

The authors are grateful to the editor and to the referees for helpful comments and suggestions. This work was supported by the Council of the Czech Government, MSM 6198959214.

References (24)

P. Filzmoser et al., Multivariate outlier detection in exploration geochemistry, Computers & Geosciences (2005)
P. Filzmoser et al., The bivariate statistical analysis of environmental (compositional) data, Science of the Total Environment (2010)
K. Hron et al., Imputation of missing values for compositional data using classical and robust methods, Computational Statistics and Data Analysis (2010)
J. Aitchison, The Statistical Analysis of Compositional Data (1986)
Aitchison, J., 1997. The one-hour course in compositional data analysis or compositional data analysis is simple. In:...
J. Aitchison et al., Biplots of compositional data, Applied Statistics (2002)
A. Buccianti et al., Another look at the chemical relationships in the dissolved phase of complex river systems, Mathematical Geosciences (2008)
J.J. Egozcue et al., Isometric logratio transformations for compositional data analysis, Mathematical Geology (2003)
J.J. Egozcue et al., Groups of parts and their balances in compositional data analysis, Mathematical Geology (2005)
J.J. Egozcue et al., Simplicial geometry for compositional data
Eilu, P., Hallberg, A., Berman, T., Feoktistov, V., Korsakova, M., Krasotkin, S., Kitosmanen, E., Lampio, E.,...
Filzmoser, P., Gschwandtner, M., 2011. mvoutlier: multivariate outlier detection based on robust methods. Manual and...
