E XPLORATORY FUNCTIONAL DATA ANALYSIS OF MULTIVARIATE DENSITIES FOR THE IDENTIFICATION OF AGRICULTURAL SOIL CONTAMINATION BY RISK ELEMENTS

distribution, going well beyond a superficial detection of extreme concentration anomalies. We illustrate the proposed methodology on a dataset gathered according to the Czech national legislation (1990–2009), whose information content has not yet been fully exploited. Taking into account specific properties of probability density functions and recent results for orthogonal decomposition of multivariate densities enabled us to reveal real contamination patterns that were


Introduction
Detection of anomalies in geochemical data is a traditional task in mineralisation surveys (Sinclair, 1991;McKinley et al., 2016;Borojerdnia et al., 2020) and soil and sediment geochemistry studies (Reimann et al., 2005;Ander et al., 2013;Matys Grygar and Popelka, 2016;Fabian et al., 2017;Lučić et al., 2023).Focus on food production safety and resident health resulted in systematic state-governed mapping of risk element concentrations in soils and the production of highly valuable geochemical datasets in many countries (Ander et al., 2013;Vácha et al., 2015;Tóth et al., 2016;Ballabio et al., 2018;Reimann et al., 2018;Zhang et al., 2020).The existing geochemical datasets cover districts (Galán et al., 2008;Zhang et al., 2020), states (Vácha et al., 2015;Bednářová et al., 2016), and continents (Tóth et al., 2016;Reimann et al., 2018).This poses non-negligible challenges for data processing, which must respect specific properties of compositional data (Grunsky, 2010;McKinley et al., 2016;Zhang et al., 2020) and simultaneously produce meaningful results regarding the original purpose of the data gathering (Vácha et al., 2015;Tóth et al., 2016;Greenacre, 2018;Matys Grygar et al., 2023), which in our case is food production safety.In the case of soil geochemical maps, a common task is to distinguish anthropogenic contamination and natural variability including geogenic anomalies (Ander et al., 2013;Amorosi et al., 2014;McKinley et al., 2016;Matys Grygar et al., 2023).Agricultural use of anthropogenically polluted areas should be limited if it were to endanger food consumers or residents, but the legislative limits should be applied carefully in natural geogenic anomalies (Amorosi et al., 2014), where risk elements are usually less bioavailable.If the legislative limits for soil contamination were simply set too high so as not to discriminate against owners of land in areas with geogenic anomalies, it could worsen the evaluation of anthropogenic contamination elsewhere.Geogenic anomalies also hinder the identification of diffuse contamination, which can be dangerous in the long term, because weak but persistent worsening of soil composition would be overlooked.This happened recently with Cu (Ballabio et al., 2018) and Cd (Reimann et al., 2019) in European agricultural soils, for example.
In recent work, Matys Grygar et al. (2023) concluded that natural variability in large-scale soil datasets considerably decreases the sensitivity of local anomaly detection that could actually be a consequence of anthropogenic contamination.Geochemical maps can be distorted by variable quartz and organic matter contents (McKinley et al., 2016;Négrel et al., 2015), bedrock composition (Galán et al., 2008;Ander et al., 2013;Amorosi et al., 2014), and pedogenesis (Fabian et al., 2017).If a dataset of soil composition which is too diverse is processed by univariate statistical tools such as the Tukey upper fence (TUF) (Tukey, 1977;Reimann et al., 2005Reimann et al., , 2018) ) or Carling upper fence (CUF) (Carling, 2000;Matys Grygar et al., 2023), the anomaly identification has too low a sensitivity due to too large heterogeneities between the first and the third quartile (Matys Grygar et al., 2023).Many ways to overcome this have been proposed, but most of them focus on processing such big datasets as a whole (McKinley et al., 2016;Fabian et al., 2017;Ballabio et al., 2018;Zhang et al., 2020).Instead, Matys Grygar et al. (2023) proposed to zoom in on smaller areas, where natural factors, such as parent geology, topography, and climate, are as homogeneous as possible, and to examine the data distribution in the resulting compartments using traditional exploratory data analysis (EDA) tools, in particular empirical cumulative distribution functions (ECDF).The compartmentalisation of big data proposed by Matys Grygar et al. (2023) is similar to the intuitive separation of soil geochemical datasets into subsets (post-stratification) according to the bedrock geology (Galán et al., 2008;Zhang et al., 2020) and other factors (Ballabio et al., 2018;Négrel et al., 2015).
When multiple contaminants are being examined, univariate EDA is not sufficient because it ignores possible interactions among elements.In contrast, the Bayes space methodology (van den Boogaart et al., 2014), allows to process uni-as well as multivariate distributions via their equivalent representation through probability density functions (PDFs).Furthermore, the recent results of Genest et al. (2023) enable an orthogonal decomposition of multivariate PDFs into univariate patterns and interactions among variables.In the present context, this brings additional advantages as multiple contaminants can be studied simultaneously.In traditional EDA, soil contamination can be defined as anomalously high element concentrations, separated from a population of non-anomalous (usual) concentrations (main population) by gaps in the data distribution (Sinclair, 1991;Reimann and Filzmoser, 2000;Grunsky, 2010).To this end, a population of usual concentrations should be as narrow as possible.This in turn requires the processing of data from areas with rather homogeneous natural factors controlling for soil composition, such as bedrock geology and topography.The practical challenge of such "narrowing" is how to proceed with a mosaic of smaller (more homogeneous) areas without causing an extensive amount of manual work.
Most previous attempts to identify soil contamination reduced information from data distribution functions to a threshold (or thresholds) separating usual and anomalous values (Reimann and Filzmoser, 2000;Reimann et al., 2018).Such an approach seems too crude as it fails to take the whole complexity of real soil systems into account.In contrast, in the functional data analysis (FDA) approach we advocate here, the entire probability distribution function can be exploited; a possible information loss is limited to smoothing of the input concentration data via compositional splines (Machalová et al., 2016(Machalová et al., , 2021) ) or kernel density estimation.The latter is preferred here because it is also available in the multivariate case and the resulting density estimates can be used for a subsequent exploratory functional data analysis (EFDA).
The aim of this work is to propose novel, sensitive, and feasible ways to identify anthropogenic contamination of soils by risk elements.The Czech Republic has varied surface geology, topography, and land use.All these factors scatter natural risk element concentrations in agricultural soils.Historical contamination also occurred in the Czech Republic, because it was part of the most industrially developed countries in central Europe around the turn of the 19th and 20th centuries.Localised contamination is related to ore mining, metallurgy, and heavy industry that caused point contamination of soils; this is relatively well documented in several case studies and overviews of the entire republic (Vácha et al., 2015;Bednářová et al., 2016;Matys Grygar et al., 2023).The extent of diffuse contamination is less clear but expected due to heavy industry development in the 19th and 20th centuries (Sucharová et al., 2014;Vácha et al., 2015;Bednářová et al., 2016) and agricultural activities with the use of impure fertilisers and pesticides (Skála et al., 2020;Matys Grygar et al., 2023).Risk element concentrations in agricultural soils of the Czech Republic have been monitored since the early 1990s (Podlešáková et al., 1996;Poláková et al., 2011) according to the Czech legislation, Act No. 156/1998 Sb.However, the resulting dataset does not include information on natural factors that determine the soil composition, such as bedrock geology, soil texture, and lithogenic element concentrations.This makes the use of this dataset for distinguishing anthropogenic contamination from natural variability challenging (Matys Grygar et al., 2023).The first step made in this paper was to split the entire state database into compartments, in particular into 76 administrative districts (Fig. 1), in which natural variability is expected to be smaller than in the entire dataset.Risk element patterns in those compartments were examined by advanced tools rooted in the Bayes space methodology and tailored to recognise natural and anthropogenic contamination patterns.This work paves a novel way for the statistical processing of big soil geochemistry datasets through compartmentalisation, clustering similar parts into distinct patterns of contamination, and mode-sensitive identification of contamination by taking into account specific properties of probability density functions from the compartments.
2 Study area, data, and methods

Study area
The Czech Republic has an area of 78,700 km 2 , of which more than a half is used for agricultural purposes.Some of these soils have been contaminated by industrial activities (Vácha et al., 2015;Matys Grygar et al., 2023), in particular, historical ore mining and metallurgy and heavy industry in the 19th and 20th centuries.Historical ore mining and metallurgy left clear marks in the soil contamination around the most critical mining cities at the time (Sucharová et al., 2014;Bednářová et al., 2016;Kylich, 2022;Matys Grygar et al., 2023).It also concerned some river systems draining ore mining districts and metallurgy centres, such as the Ohře River (Skála et al., 2020;Kylich, 2022).The major historical metallurgical centres are known, but their actual impact, in particular, diffuse contamination and differentiation between geogenic, metallurgical, and agricultural contamination have not been unequivocally distinguished (Skála et al., 2020;Matys Grygar et al., 2023).
Administrative districts of the Czech Republic (Figure 1) have been historically defined to encompass geographically uniform parts of the landscape and thus their geomorphology and geology are relatively consistent.The homogeneity of those compartments is favourable for detailed examination of element distribution functions, similarly as in historical ECDFs proposed for ore prospection (Sinclair, 1991).In the Czech Republic, there are 76 districts (besides the capital city), which is enough to zoom in from the overall area to compartments, while still having a sufficiently large number of samples per compartment for EFDA, except perhaps for some big cities with a small amount of agricultural soils.Fractions of agricultural soils are unevenly distributed in individual districts (Figure 1).
Preceding case studies in the Czech Republic revealed typical patterns of soil contamination and geogenic anomalies for Cu, Pb, and Zn summarised in Table 1.All non-ferrous ores mined in the Czech Republic were sulphidic, and polymetallic, with Ag accompanied by Cu (Kutná Hora) or Zn (in all other ore regions) and Pb usually accompanied by Zn (Matys Grygar et al., 2023).Iron ore processing and iron and steel production were linked to emissions of Pb (Sucharová et al., 2014) and to a lesser extent Zn, but never Cu due to the contrasting melting points of those metals and the volatility of their oxides.Mixing of contamination sources in districts is not exceptional: the floodplain of the Ohře River has laterally deposited sediments contaminated by half a millennium of mining in the Ore Mountains and hop gardens contaminated by Cu pesticides (Skála et al., 2020).Several other contaminated floodplains used for agriculture are also present in the Czech Republic (Skála et al., 2020;Matys Grygar et al., 2023).

Dataset of soil data and geographic information systems
The Register of Contaminated Areas (RKP) is a database of risk element concentrations in agricultural soils of the Czech Republic, gathered by the state Ministry of Agriculture (MZe) to guarantee food production safety.From RKP, Cu, Pb, and Zn concentrations in soil extracts by cold 2 M HNO 3 (Podlešáková et al., 1996;Zbíral et al., 2004;Poláková et al., 2011) were used.Although this kind of extraction can be considered historical, as it was later replaced by aqua regia extraction (AR), it covered rather evenly the agricultural lands of the Czech Republic.RKP entries from HNO 3 extracts were obtained between 1990 and 2009 and included 49,567 analyses of Cu, Pb, and Zn.Cold 2 M HNO 3 extraction recovers ca.50 % AR content of Cu, 80 % AR content of Pb, and 30 % AR content of Zn (Vácha et al., 2013;Matys Grygar et al., 2023).The advantage of the cold diluted HNO 3 extraction is higher extractability of atmospheric and fluvial contamination relative to geogenic portions of risk elements (Vácha et al., 2013) and generally better relationship with the bioavailable portion of risk elements than pseudo-total or total concentrations (Podlešáková et al., 1996;Groenenberg et al., 2017).At least one sample per km 2 of agricultural soils was obtained from the top 40 cm in ploughed fields, vineyards, hop gardens, and orchards and 30 cm in grasslands.Sampling followed several strategies: the initial one was designed to characterise the entire area of the Czech Republic, and later more detailed analyses were targeted in areas revealed as contaminated.

Data processing by traditional univariate EDA
Boxplots and ECDFs are used as conventional EDA tools and constructed using OriginPro software (OriginLab, USA).Quartiles in distributions are denoted conventionally Q1 to Q4. Q2 and Q3 are also highlighted in ECDFs and considered as roughly corresponding to the main population of samples.For some purposes, interquartile ranges (IQR = Q3 -Q1) are used.The Tukey upper fence (TUF) is defined as 1.5 × IQR on the linear scale.

Statistical processing of PDFs using Bayes spaces
The relative character of probability density functions (PDFs), which is best conveyed by their common unit integral representation, can be captured by the Bayes space methodology (Egozcue et al., 2006;van den Boogaart et al., 2014).Being a Hilbert space, a Bayes space provides a convenient algebraic-geometrical structure for scale-invariant functions.Scale invariance of PDFs means that given a common domain and a positive real multiple c > 0, two positive functions f and g that are proportional in the sense that g(x) = cf (x) for all x carry essentially the same, relative information.From this perspective, imposing the usual unit-integral constraint of PDFs simply chooses a proper representative in the corresponding equivalence class of scale-invariant functions, as is also known from the concept of proportionality in Bayesian statistics.Consequently, we set up the corresponding Bayes space B 2 (I) as a space of such equivalence classes of PDFs.Because we focus on trivariate PDFs for Cu, Pb, and Zn in this study, it will suffice to restrict our presentation to the trivariate case.We will further assume that all trivariate densities have common support of the form The Bayes space B 2 (I) is then defined as the space of all trivariate densities on I whose logarithm is square integrable.The vector space structure of B 2 (I) results from introducing operations of perturbation and powering (standing for the usual addition and multiplication operations of functions), defined for f, g ∈ B 2 (I) , respectively (Egozcue et al., 2006).Note that the result of these operations can be arbitrarily rescaled due to the scale invariance of PDFs.The difference between two elements in the Bayes space can be obtained as the perturbation of one element with the reciprocal of another, i.e.
).To endow the B 2 (I) with a Hilbert space structure, Egozcue et al. (2006) defined the inner product ⟨f, g⟩ = 1/(2η The introduced inner product naturally defines a norm and distance on B 2 (I) The centred logratio (clr) transformation allows to express a PDF f ∈ B 2 (I) as an equivalent real function in the standard L 2 (I) space of square-integrable functions on the cuboid I, given as clr(f Although the clr transformation imposes an additional zero integral constraint on the transformed densities, it enables further statistical processing of PDFs using popular FDA methods, which are commonly designed for functions in an L 2 space. To distinguish unusual properties of one contaminant from those of two or more contaminants, we will decompose the trivariate densities into their so-called geometric marginals and their interactive parts.Geometric marginals are similar to the usual (arithmetic) marginal densities known from standard probability theory, but marginalise the PDFs on the log-scale so as to be compatible with the clr transformation.The idea is similar to that of the geometric mean, where the relative (ratio) scale of positive data is maintained by computing the arithmetic mean of log-transformed observations and transforming it back using the exponential function.
Specifically, in the three-dimensional case, the univariate geometric margins are given by Consequently, the bivariate geometric margins are defined as Unlike the usual arithmetic marginals, geometric marginals are affected by the dependence structure of the trivariate density (Genest et al., 2023).An important geometric property of geometric marginals is that these are, in fact, orthogonal projections of the original trivariate PDF f onto a suitable lower-dimensional subspace, as shown by Genest et al. (2023).These authors exploited it to derive the following key decomposition, which is an analogue of the Hoeffding-Sobol identity (Hoeffding, 1948;Sobol, 1993), The first term f 1 ⊕ f 2 ⊕ f 3 in the above expression is known as the independence part and equals to the product of the univariate (geometric) marginals; the rest constitutes the so-called interactive part.Note that when the trivariate density corresponds to stochastically independent variables, the univariate geometric marginals coincide with their arithmetic counterparts, and the interactive part is simply the uniform density (i.e.absent from the above decomposition).Accordingly, analysing the usual arithmetic marginals alone neglects possible interactions (either bivariate or trivariate) among variables.In contrast, geometric marginals could be considered as arithmetic marginals extended by the information about the relationship of a particular element to the others.In the above decomposition, the bivariate interaction between X 1 and X 2 corresponds to f 12 ⊖ f 1 ⊖ f 2 , i.e., to the bivariate geometric margin f 12 from which the respective univariate geometric marginals have been filtered out.The bivariate interactions between X 1 and X 3 , resp.X 2 and X 3 are constructed analogously.Finally, the trivariate interaction results from filtering out all univariate and bivariate geometric marginals from f itself.
All elements of the decomposition (1) given in brackets are mutually orthogonal (Genest et al., 2023).This justifies the use of FDA techniques on each of these terms separately and enables us to effectively identify the source of dependence between the variables.Probably the most important practical implication of the orthogonality of the terms in the decomposition (1) is the Pythagorean theorem for norms, viz.
where for f ∈ B 2 (I), ∥f ∥ denotes its norm in the Bayes space as defined above.
In the context of Bayes spaces, the norm of a PDF can be considered as a scalar measurement of information (Egozcue and Pawlowsky-Glahn, 2018).Specifically, the PDF of the uniform distribution has zero norm; the more information a density carries (resulting usually in peaky PDFs), the higher the value of its norm.Accordingly, a lower norm of the density indicates the presence of multimodality or heavy tails resulting, e.g., from anthropogenic contamination.The Pythagorean theorem then allows for the decomposition of the overall information, contained in the joint density, into information carried by single variables (through the respective marginals) and their interactions; see also Section 5.4 in Genest et al. (2023).Moreover, to reflect the above interpretation, the vector of squared norms, viz.
is called information composition and will be used under this name in the sequel.
One of the most straightforward, yet naive, approaches to the analysis of multivariate density data is to ignore the dependence structure and to focus on the individual analysis of the distribution of the concentration values of the observed elements independently.This amounts to considering the arithmetic marginals of the trivariate density f .As the first part of EFDA, we thus focus on the univariate arithmetic marginals, which we analyse using hierarchical clustering and the adaptation of the DDC algorithm for univariate functional data described below.
Of course, the dependence structure among variables cannot simply be ignored.Given the existence and complexity of the dependence structure between the concentrations of the three elements we study, it is necessary to identify suitable tools that are easy to grasp by practitioners in a geochemical context.The first source of information on the dependence structure is at hand: the information composition vector in (2).Because the elements of this vector represent relative contributions to the overall information carried by the (squared) norm of f , they carry relative rather than absolute information.The very same property is characteristic of compositional data (Aitchison, 1986;Pawlowsky-Glahn et al., 2015;Filzmoser et al., 2018).For each trivariate PDF, we thus use the seven-component information composition vector in (2) for subsequent analysis, in particular in the hierarchical clustering (see Section 3.3).

Deviating Data Cells algorithm
The Deviating Data Cells (DDC) algorithm for detecting cellwise outliers in multivariate data was introduced by Rousseeuw and Van Den Bossche (2018).It has since proved to be a useful tool for the identification of outlying cells in the data matrix which takes correlations between variables into account.The algorithm can be briefly summarised in the following steps: 1. Standardisation of the variables by means of robust statistics: Robust estimators of location and scale are used to standardise the observed values within the data matrix.Commonly, the column-wise median is subtracted from the original observed value and the difference is then divided by the column-wise median absolute deviation (MAD).2. Flagging outliers in terms of individual variables: A cell is marked as an outlier when its absolute value exceeds the cut-off value χ 2 1 (p), where χ 2 1 (p) is the p-th quantile of the chi-squared distribution with one degree of freedom.

Computation of predicted values:
Predicted values for each variable are based on the set of correlated variables, i.e. those with absolute robust correlation higher than 0.5.Each cell is then predicted from the entries of the correlated variables in the same row.Subsequently, a deshrinking step returns the values to the original scale.
In case there are no correlated variables to be used for prediction, the predicted value is set to 0. Residuals r ij , given as the standardised difference between observed and predicted values are then again compared to χ 2 1 (p) in absolute value -if exceeded, the cell is marked as an outlier.4. Identification of the rowwise outliers: The identification of a subject as a row-wise outlier (i.e.outlier at the observation level) is based on a test statistic obtained as a suitable function of the squared residuals r 2 ij (Rousseeuw and Van Den Bossche, 2018, equation (12)).The realisation of this statistic is then compared to the respective criterion χ 2 1 (p).When exceeded, the row as a whole is then marked as an outlier.
The extension of the DDC algorithm to the case of (univariate) PDFs -here, in the form of arithmetic marginals -is quite natural.As commonly used in FDA, we first represent each functional observation as a linear combination of predefined basis functions, and apply the DDC algorithm to the vector of the corresponding weights or coefficients.
Using the B-spline basis (de Boor, 1978) proved to be an efficient tool for nonperiodic functional data, with ZB-splines (compositional splines; (Machalová et al., 2016(Machalová et al., , 2021))) being introduced specifically for the case of clr-transformed PDFs.Although the use of ZB-splines is technically more straightforward, we opted for an alternative approach which is preferable here for interpretation purposes.The latter uses the usual B-spline basis in the clr space with additional zero-integral constraints.Note that the values of the basis functions range between 0 and 1 at each individual point in the domain and that the coefficients sum up to 1.The latter can thus be perceived as naturally obtained weights for distinguishing the 'level' of anomaly for that point.One concrete choice of the B-spline basis is illustrated in Figure 2.This way, by applying the DDC algorithm on the matrix of the B-spline coefficients, one can not only locate the parts of densities that are outlying from the common behaviour but also quantify the 'degree of outlyingness' at the pointwise level.

Clustering of districts
As discussed above, the grouping of districts according to their similarity is an important part of the analysis.To this end, we employ hierarchical clustering (Nielsen, 2016).The procedure begins by considering each individual observation as an individual cluster.Subsequently, it iteratively identifies the two clusters that exhibit the closest proximity to each other and proceeds by merging these two clusters together.The proximity of clusters is, in our case, measured by a suitable distance of the two furthest elements in the two clusters.This is also known as complete-linkage clustering; see, e.g., Everitt et al. (2001).The procedure is iterated until all the observations are placed in the same cluster.To perform the clustering, we utilised the gplot package in R (Warnes et al., 2022).The output is presented as When clustering the districts based on the squared Pythagorean norm decomposition, the districts are first characterised using the clr transformed seven-dimensional vector of squared norms in (2).We then use these vectors to group the districts into several most similar groups using the complete-linkage clustering algorithm with the Euclidean distance.This way, districts are clustered based on the amount of information carried by the individual components in the decomposition (1).When using arithmetic marginals to characterise the distribution of the concentration values in the districts, we apply the complete-linkage clustering algorithm directly to the corresponding arithmetic margins.The latter are then grouped by their similarity based on the matrix of Bayes distances.

Univariate data analysis by ECDF and high concentrations as possible contamination
The overall variability of risk element concentrations in selected districts and between districts is demonstrated in boxplots (Figure 3) and ECDF plots (Figures 4 and 5).The main mass of the concentration distributions, usually corresponding roughly to Q2 and Q3, is quite variable.A comparison of boxplots for selected districts and the entire Czech Republic (Figure 3) documents the necessity of zooming in (compartmentalisation) from the whole RKP to districts.For example, Teplice (TP) and Jablonec nad Nisou (JN) districts have a high background of Pb, but generally rather low numbers of values exceeding the district-specific TUF.This is consistent with mostly natural causes of the high concentrations: the JN district represents a natural contamination Type 6 according to Table 1 due to a prevailing portion of granitic bedrocks.Application of the entire-Czech Republic TUF (orange arrow in Fig. 3) would thus mark the entire Q3 and Q4 of JN district as anomalous, which is not the case in the context of the district.On the other hand, the entire Czech Republic TUF would be insensitive to identify outliers in the low-background BV and VY districts.IQRs are also district-specific -in Teplice (TP), the Pb concentration IQR is extremely high (Figure 3) due to a considerable portion of anomalous soils from the Ore Mountains (metallogenic zone, i.e. naturally contaminated area) in the dataset that greatly increases the TUF and worsens the sensitivity of the detection of anomalies by a univariate analysis.Boxplots are thus not always suitable for a detailed evaluation and methods such as ECDFs or FDA of PDFs in  the framework of Bayes spaces that utilise more information from the data should be preferred.While both ECDF and PDF representations of the concentration distributions are equivalent as such, the clr transformation, which is defined for PDFs, should better capture anomalies within small concentration values.The reason is that the clr transformation is defined to take into account the relative scale of PDFs, which highlights the fact that the small PDF values are the main source of variability in the concentration density.
In ECDF stacks (Figures 4 and 5), selected districts that are well known (or suspected) to be contaminated are plotted in the upper row; the lower row displays plots for neighbouring, less contaminated districts with very similar geology and geography.The vertical pairs of ECDFs are thus suitable for very detailed individual evaluation of concentration patterns.In ECDFs for Cu (Figure 4), PS and CR districts show monomodal distributions with only exceptional high-value outliers, while the RA district (neighbour of PS) and KH district (neighbour of CR) have much more skewed distributions at high values and considerable occurrences of very high Cu concentrations relative to the main populations of the district, that all resulted in significant departures from the fitted lines at concentrations above the values corresponding to Q3.In KH and UH districts, already the Q3 boundaries are elevated relative to the Q3 in the neighbouring districts that are less contaminated (CR and ZL, respectively), documenting increased concentrations in at least the entire Q4.UH and ZL districts are, however, both anomalous in the entire Czech Republic comparison: their main populations are close to the entire Czech Republic TUF (their bedrock geology includes Cu-rich marine sediments), again underpinning the need to evaluate districts with anomalous background with particular care.
Similar features as for Cu can be found in ECDFs for Zn in pairs of selected districts (Figure 5).ECDFs (Figure 5) show important diagnostic features of contamination: (i) slower decrease in occurrence of high concentrations in Q4, sometimes starting already in Q3 in considerably contaminated districts, (ii) presence of heterogeneities (minor populations of high concentration values) shown by red arrows in Figures 4 and 5, and (iii) considerably larger percentage of extreme outliers in the districts contaminated by ore mining and smelting (LN and KH).Those features are not explicitly dependent on the position of the main mode in the concentration distribution that determines the TUF.

Univariate data analysis using PDFs and its use for clustering of compartments
The concentration distributions in districts can also be compared using PDF heatmaps of clr transformed arithmetic marginals of univariate concentration series (Figure 6), i.e. assuming (unlike for geometric marginals) patterns that correspond to one element at a time.Here the densities of concentrations per district are transformed to a colour scale, with the horizontal axis corresponding to element concentrations.Districts with narrow main concentration populations and without contamination are those with warm colours in a narrow range of low concentrations in the heatmaps and sharp colour change to cold hues at the high concentration end.Contamination (i) broadens the main population and/or (ii) leads to heavy tails above the modal maximum with the gradual decline of the data density at the high end.
Subsequent clustering of districts according to the PDF representation of their Pb concentrations (Figure 6) compares more features of concentration distributions than captured by the EDA tools described above.Clustering groups, whose number was determined visually from the dendrogram, recognised districts with higher main modes (Figure 6, cluster C) together, including JN discussed above (Figure 3), with the relatively sharp decline of the density above the main mode and thus little contamination, in spite of high overall concentrations.Figure 6 also shows districts with lower main modes and additionally with the sharp decline of densities at concentrations higher than the main mode (sharp change to cold colours), documenting both low background and negligible contamination (Figure 6, cluster D).Severe contamination is shown by relatively high densities at high concentrations (light blue shades on the right side of the heatmap, Figure 6, cluster B), actually including districts known to be affected by types 1, 4, and 5 contamination (Table 1).Noteworthy for that cluster is the broad main mode that is shown by considerably low colour contrast in the entire concentration range.This would make the use of the TUF in these districts inefficient (insufficiently sensitive) due to a large IQR.Particular attention in Figure 6 should be paid to cluster A, which shares similar features with cluster B but with a bit less broadened concentration distribution and weaker contamination, i.e. decay of the concentration densities above the main mode occurring at somewhat lower concentrations than in cluster B. The clusters A and B thus share features expected for areas contaminated by Pb.
FDA can also be used for the detection of anomalies in the data set.Here, the DDC algorithm is applied to the arithmetic marginals of Pb data set represented via a matrix of B-spline coefficients.These coefficients are assigned their level of anomaly between -1 (maximal deviation in the negative sense) and 1 (maximal deviation in the positive sense) via scaled residuals computed in step 3 of the algorithm sketched in Section 2.5.To visualise the anomaly in the sense of functions rather than coefficients, the levels of the anomaly were weighted using the functional values of the B-spline basis for a discretised grid of time points.The matrix of weighted anomaly levels then serves as the input for hierarchical clustering, as shown in Figure 7.
Districts severely impacted by ore mining or industry (contamination types 1, 4, and 5) are identified by the presence of local outliers at concentrations that are one order of magnitude higher than the global background of ca.20 mg kg −1 (cluster D in Figure 7).The districts JN (and SO) with considerably high background Pb concentration are also indicated as anomalous (cluster A in Figure 7) although they are not contaminated anthropogenically (type 6 in Table 1).
In clusters B and C, the districts with anomalies at high concentrations (FM, BM, and VY) are really contaminated anthropogenically (types 3 and 4 in Table 1).Several contaminated samples per district are sufficient to be flagged as anomalous by the DDC algorithm (Figure 7).

Contamination as heterogeneities at high concentrations and a multivariate analysis
The traditional EDA definition of anomalies (in particular high anomalies) is based on distinguishing cluster(s) of high concentrations above the main population that are separated from the main population by breaks or gaps in the ECDF (Sinclair, 1991).Those traditional breaks or gaps are marked by red arrows in the ECDF plots for Cu (top row in Figure 4) and Zn (Figure 5).Although placing the arrows into ECDFs is subjective because the actual shape of the natural data distribution for the majority of concentration values is unknown, the presence of some heterogeneity at the high end of the distribution is obviously a diagnostic sign of contamination.Coordinates of the breaks or gaps in ECDFs in individual districts show a remarkable scatter of (i) concentration values at those thresholds and (ii) percentiles of anomalies above those thresholds (portions of anomalously high samples), again underpinning the need for compartmentalisation and an individual approach based on smaller areas.Noteworthy is the variable (district-specific) position of the TUF and of the breaks or gaps in the ECDFs in Figures 4 and 5, documenting the insufficiency of the TUF to capture the real, fine structure of concentration distributions.It is thus necessary to define local anomalies within each particular distribution, represented either using ECDFs or PDFs, without a priori assumptions on the shape of the distribution in the main concentration population or fraction of anomalies.
While FDA of PDFs can characterise heterogeneities as those marked by arrows in ECDFs (Figures 4 and 5) in a different way as shown already in the previous section, it has the additional large advantage of allowing an extension to the multivariate case.Using multivariate PDFs allows the joint distribution of several elements to be characterised simultaneously, which would be hardly possible with traditional EDA tools.
Clr transformed PDFs can find heterogeneities in univariate data distributions, as demonstrated in panels A in Figures 8  and 9 and highlighted there by red or blue arrows.Smoothing of PDFs can be additionally applied to identify bi-and trivariate heterogeneities (panels C and D in Figures 8 and 9). Figure 8 shows plots for typical Cu-pesticide impacted districts (BV, type 2 contamination in Table 1) with a highly heterogeneous distribution of Cu (red arrows in Figure 8A).
Here the contamination in Cu is truly monometallic (univariate), without the substantial impact of other elements on the Cu concentration at the high end.It is, however, important to take into account some lithogenic effects in the main concentration population controlling jointly all three elements (red ellipses in Figure 8B).Neither ore mining nor heavy industry have impacted the BV district and neither Pb nor Zn were used in pesticides in vineyards in the Czech Republic.This justifies the use of arithmetic rather than geometric marginals.The heterogeneity of the Cu concentration distribution is amplified in clr transformed arithmetic marginals that better highlight the variability in the outskirts of the data distribution (Figure 8A, middle panel), similarly as in ECDFs of raw concentration data.Because in BV elevated Cu is not related to high values of Pb and Zn, there are clouds of elevated Cu marked by thick red arrows in the scatter plots (Figure 8B) outside the main concentration population (ellipses in Figure 8B).This is reflected by the bivariate PDFs (in terms of clr transformed arithmetic marginals), which highlight these heterogeneities as islands of higher density values around the main population (red arrows in Figure 8C and D).
Visualisation of multimetallic contamination is shown in Figure 9 for the Jihlava district (JI), with mixed contamination from mining polymetallic sulfide ores and silver smelting in the city of Jihlava from the Middle Ages to the 18th century.Here, elevated Cu is associated with elevated Pb and less with elevated Zn (Figure 9B) and highlighted by multiple maxima in the density maps on the clr-scale (Figure 9C).While the arithmetic marginals (Figure 9C) are still reasonable here, geometric marginals (Figure 9D), which show similar patterns, better reflect the impact of multimetallic contamination in the univariate and bivariate plotting.
The heterogeneity of the densities can be quantified via norms and used for fingerprinting individual districts according to mono-, bi-, and trimetallic distribution heterogeneities.Figure 10 shows districts clustered according to clr transformed information composition vectors defined in (2).Their components corresponding to the univariate geometric marginals are denoted f (Cu), f (Pb), and f (Zn), to the bivariate interactions f (Cu, Pb), f (Cu, Zn), and f (Pb, Zn), and to the trivariate interaction f (Cu, Pb, Zn).Bi-and trivariate interaction heterogeneities clearly cannot be examined by univariate ECDFs.Yet they are promising indicators of contamination according to specific interelement relationships (Table 1) and as such extend the EDA diagnostic tools to recognise contamination.For example, districts contaminated by Cu-bearing pesticides (type 2 contamination in Table 1, districts RA and UH in Figure 4, and district BV in Figure 8) are special because of the high heterogeneity of the Cu distribution and simultaneously its mono-metallic character, that in turn produce high bivariate interaction heterogeneities Cu-Pb and Cu-Zn (i.e.these interactions do not indicate a strong uni-, or multimodal pattern and are thus rather uninformative).The bi-and trivariate heterogeneities (blue arrows in Figure 8) are sensitive to monometallic contamination that pushes points outside the red ellipses in Figure 8, panel B, as the ellipses show joint lithogenic controls of element composition.The simultaneous heterogeneities in uniand bivariate densities with Cu are joint features of clusters C with RA and other pesticide-contaminated districts (LT, PR, and ZN) and D1 with BV, OC, VY, and UH in Figure 10.In some cases, specific Cu contamination also increases trivariate interaction heterogeneity, such as in the case of districts HO and LN in cluster C1 (Figure 10).
Clusters B (without subcluster A) and G in Figure 10 include districts with highly heterogeneous Pb distribution, in particular those with ore mining (type 5 in Table 1) HB, ZR, TC, PB, JI (Figure 9), and KH (Figure 3), big cities UL, PM and DC with abundant high Pb pollution, and districts, in which there are two types of soils: normal ones and those in the metallogenic zone of the Ore Mountains -this is the case for TC (Figure 3) and KV.Distinguishing anthropogenically contaminated districts and such geologically heterogeneous districts is, however, out of the scope of this exploratory analysis, as both can exhibit the same element concentration patterns.
Additionally, the DDC algorithm from Section 2.5 was applied to the information composition vectors as shown in Figure 11.Due to compositional nature of the data objects, the original algorithm was adapted to work on all pairwise logratios, while the graphical output is still interpretable in sense of the original components (Šimíček et al., 2023); accordingly, we refer to LR-DDC algorithm.By following Section 3.2, the level of anomaly is assigned to each pairwise logratio of the information composition, and these levels are aggregated according to each compositional part.The corresponding cell in the graphical output is highlighted if at least some proportion of the corresponding logratios (here 30%, in line with the default setting) is marked as outlying.Based on the results presented in Figure 11, each district flagged by LR-DDC as an outlier has contamination with some specific characteristics.BM, OV, and PM are large cities with severe polymetallic contamination (contamination type 4) with particularly high homogeneity in trivariate densities due to all three elements jointly enriched near the contamination sources.The majority of the flagged districts have been impacted by ore mining (KV and PB, contamination type 5) and silver production (KV, KH, and PB, contamination type 1), which left severe contamination but with distinct character.PB is a Czech district with the most severe Pb pollution near the major smelter (high heterogeneity in Pb), with high Pb associated with high Zn (polymetallic contamination, type 5).TC has been contaminated by Pb and Zn from several important metallurgy-related sources of contamination (contamination types 1 and 3), and thus this district shows non-overlapping hotspots of those elements typical of high heterogeneity in polymetallic patterns.RA district is flagged in Figure 11 because of the pesticide-born Cu contamination discussed above.

Specificity of diffuse contamination and how to recognise it
Diffuse contamination (types 2 and 3 in Table 1) is common in industrialised countries.It refers to a weak, large scale, and thus poorly spatially constrained elevation of risk elements, with continuous increase of risk element concentrations above the natural levels of the main population and no sharp threshold between uncontaminated and contaminated    samples.In ECDF, diffuse contamination is visualised as a bend of the main trend in Q4 or even already in Q3 (Figure 4, district KH; Figure 5, district FM).Diffuse contamination can also be accompanied by heterogeneities at the high end of the concentration distributions (Figures 4 and 5, majority of contaminated districts in the upper lines show this feature), but this heterogeneity can generally be lacking or it is poorly discernible (Fig. 4, district UH) and thus cannot be used as an unequivocal diagnostic sign for diffuse contamination.True diffuse contamination is more resistant to unequivocal identification because it is relatively weak and widespread and thus coalesced with the main concentration population.
Diffuse contamination can be identified by expert-based comparison of contaminated districts with neighbouring ones that are uncontaminated and have the same geology and topography.Such pairs are shown in columns in Figure 4, where UH has been impacted by Cu from pesticide use in vineyards (contamination type 2), while ZL is not suitable for vine production.The diffusively contaminated district UH shows already an elevated Q3 relatively to ZL, while the ECDF of UH has only a poorly discernible change in slope.Here the multi-element pattern helps: Single-element contamination in UH is highlighted by more heterogeneity in the joint density of Cu and Zn relative to ZL; this is why these districts belong to different clusters in Figure 10 (UH and ZL are present in distinct clusters D1 and E, respectively).
In contrast to monometallic contamination, the elevated concentrations of two or all three examined elements are associated when there is multi-element contamination, and their bi-element interaction heterogeneities are lower than in the absence of contamination.Diffuse contamination by Pb and Zn jointly (type 3 in Table 1) impacted the FM district; here indeed the main population has an increased Q3 and more samples above the breaks or gaps in the ECDF of Zn relative to the neighbouring district VS (Figure 5).An elevated main mode and diffuse enrichment of Pb in FM is shown in Figure 6, where the districts FM and VS belong to different clusters C and E, respectively.Otherwise, FM and VS districts have similar element patterns and both belong to cluster E in Figure 10.However, in line with the contamination type, the univariate Pb and Zn distributions are more heterogeneous while the bivariate interaction between Pb and Zn is more homogeneous in FM relative to VS.In any case, Pb concentrations in the FM district are themselves sufficiently high to be evaluated as anomalous by the DDC algorithm (cluster D in Figure 7 with high Pb concentrations).Clearly, all EFDA tools presented above provide complementary information on the univariate and joint variability of the element patterns by considering the main populations (represented by ellipses in panels B, Figures 8 and 9) as well as the respective anomalies.

Novelty
Contamination is frequently associated with high outliers for which TUF (or CUF) has become an almost routine tool in EDA (Reimann et al., 2018;Matys Grygar et al., 2023).While the TUF approach is satisfactory for the preliminary data examination and swift visualisation in maps, it has considerable limitations.The first is the assumption that the lognormal distribution is the underlying "truth" for the TUF, which seems to be too naive (Reimann and Filzmoser, 2000).The inadequacy of the TUF for real-life datasets is demonstrated in Figure 3.The ECDF approach (Sinclair, 1991) is more adequate for real data sets as it (i) preserves the fine structure of the data distributions and (ii) works with them individually (Figures 4 and 5), irrespective of how large a portion of samples belongs to the main population and how many are anomalous (Sinclair, 1991).The novelty brought by the FDA of PDFs using the Bayes space approach and demonstrated in the Results section are (i) detailed graphical presentation of the fine structure of univariate data that takes changes in small PDF values into account (Figure 6) and (ii) clustering of similar patterns in multivariate data structures to simplify the data interpretation (Figure 10). Figure 6 is an example of how the univariate structure of Pb concentration data can be visualised using PDFs in the clr space while preserving information on the data distribution.The colour scale in Figure 6 highlights the main populations of Pb with a skewed high end and an elevated proportion of high outliers that are not affected by the highly variable main modes in the individual distributions (clusters B and A in Figure 6).Districts with high natural backgrounds such as JN (Figure 3) are not considered contaminated by this approach (Figure 6), although their main mode is anomalous as compared to other Czech districts (Figures 3 and 7).
Another novelty the FDA of PDFs brings to environmental geochemistry exploration is a visualisation and quantification of multivariate heterogeneities in data distributions without making explicit assumptions on what constitutes the "ideal" data distribution.The heterogeneity of the data distribution, whose identification is a key feature of the ECDF analysis, is captured by compositional data analysis of the information compositions.Such analysis can help to identify monometallic diffuse contamination by Cu pesticides because it increases both heterogeneities of the Cu distribution and of the bivariate distribution of Cu and Zn (Figure 10).Identification of diffuse contamination is a challenge for soil data exploration as it needs a comparison of contaminated and uncontaminated areas with similar factors controlling the soil composition (Figures 4 and 5).In contrast, Cu contamination from ore mining and smelting is usually polymetallic and thus it is characterised by lower heterogeneity of the joint distributions of Cu, Pb, and Cu, Zn, respectively.
The methodology introduced in this paper could be applied to multivariate analyses including more than three elements.That extension would be technically feasible, but the geochemical interpretation of results could be more challenging.The choice of Cu, Pb, and Zn in this paper was not arbitrary: those elements have similar fates in soils (grain-size control, mobility, adsorption to particles) and thus similar natural controls, which makes their relationships rather simple and understandable.Further risk elements, that could be attractive in contamination tracing, show many specific features, such as larger mobility in soil profiles (Cd), strong control by local geology (As), and prevalent spread by the atmosphere (As, Hg).Those specificities would weaken interelement relationships relative to the Cu, Pb, and Zn triad.Still, multivariate EFDA would deserve testing, for example for the identification of polymetallic diffuse contamination.

From big datasets to compartments and back to the whole
The majority of environmental geochemistry studies apply the same thresholds to the entire available dataset (Tóth et al., 2016;Reimann et al., 2018) or evaluates ECDFs for entire datasets (Fabian et al., 2017).Such a "mass processing" of big data must inevitably destroy details of the entire picture.Négrel et al. (2015) and Matys Grygar et al. (2023) concluded that the interpretation of big geochemical datasets is challenging due to too broad scale of factors controlling the soil composition.Our idea for the Czech Republic data was thus to separate the big dataset and "zoom in" on smaller areas, assuming that factors that control the soil composition are more homogeneous and thus more easily deciphered (Matys Grygar et al., 2023).The idea of zooming in is generally opposite to the prevailing current trend of processing big data by brute computational force of explanatory models, frequently working in a black box manner.The risks of this general trend have been critically discussed by Rudin (2019) and Loyola-González ( 2019).Together with Rudin (2019) we promote the use of interpretable models, also called white-box models (Loyola-González, 2019).
In this paper, the big soil chemistry dataset of the entire Czech Republic was divided into compartments, which are historically established administrative districts, typically with certain natural geographic boundaries, a rather homogeneous geology, and consistent agricultural activities.Ballabio et al. (2018) used NUTS2 regions for compartmentalisation of their pan-European dataset, which was a good step but perhaps a bit too coarse: In Ballabio et al. (2018) the entire Czech Republic was divided into eight NUTS2 regions, boundaries of which were defined by Eurostat, while in this work, we used 76 districts, i.e., we worked on a much finer spatial scale.The compartmentalisation of big datasets would considerably extend the volume of work needed for data processing if it needs to be done manually using traditional EDA tools (Sinclair, 1991).If this is not done, only several contaminated areas can be identified in the whole Czech Republic as was demonstrated on a low-resolution map by Bednářová et al. (2016).Novel FDA tools from Genest et al. (2023), tailored for the use in geochemistry in this paper, in particular their powerful visual representations and clustering according to internal similarity, make it possible to zoom in without losing the fine structure of the data set.The power of the Bayes space approach is documented by a graphical representation of Pb concentration distributions for various districts in the Czech Republic (Figure 6) and district clustering according to uni-and multivariate density heterogeneity (Figure 10).Of course, the potential of FDA of PDFs goes beyond this initial study, which can be further expanded in many directions (Kokoszka and Reimherr, 2017).

Sensitivity test of EFDA for diffuse contamination by Cu pesticides
The need for larger sensitivity in distinguishing contamination from a highly variable, geographically specific natural background is documented in Figure 3: the use of the TUF for the entire Czech Republic makes no sense on the district level.Too broad spatial scales and too high thresholds prevented the identification of soil contamination in the Czech Republic by Cu pesticides in vineyards and hop gardens in the maps of Bednářová et al. (2016) and Ballabio et al. (2018), although it can be visualised if proper data mining tools are used (Matys Grygar et al., 2023).
Clusters C and D1 in Figure 10 include districts with high variability in Cu concentrations combined with a relatively weak relation between Cu and Zn; taken together these are diagnostic signs of a monometallic Cu contamination.Those clusters contain 20 districts, in particular all 8 districts with hop gardens and vineyards that actually represent more than 1% of agricultural soils in those districts.The fraction of hop gardens and vineyards is more than 0.4% in 15 districts, 12 of which are included in the C and D1 clusters.Figure 12 shows the coincidence of the intensity of hop and wine production and the position of districts with monometallic Cu contamination according to Figure 10.Possibly also abundance of orchards is relevant, as some fruit trees or shrubs are also treated by pesticides.Part of the districts where the Cu contamination is not related to those specific crops was impacted by metal mining; natural factors may also play a role, such as local mafic rocks (Matys Grygar et al., 2023).Also, information on hop and wine production does not include historical activities.Still, the match of EFDA results and wine-and hop production intensity shown in Figure 12 documents good EFDA performance.

Conclusion
This work paves new ways to data mining in soil geochemistry datasets.A considerable volume of sampling and analytical work has been performed in the recent decade.However, there is a pressing need to develop powerful data processing tools that would allow us to fully use these data to gain an understanding of natural and anthropogenic element patterns in soils.Regarding the geochemical principles of our approach, we returned to traditional exploratory data analysis, developed for ore prospection by the end of the 20th century, and extended it to a multidimensional analysis.To make big datasets processable, novel tools from the domain of functional data analysis of probability density functions using the Bayes space methodology were developed here.The graphical presentation of results was tailored to process big data without the loss of details and while respecting a wide variety of natural factors that govern the soil composition.The proposed tools, tested on a soil geochemistry dataset of the Czech Republic, can be used in any soil mapping project and can identify anthropogenic contamination with larger sensitivity than conventional data mining tools.For better accessibility of the new concepts, trivariate PDFs were analysed in this paper.The presented mathematical tools could naturally also deal with higher dimensional data, but this would require resolving issues related to data processing (e.g., kernel density estimation becomes computationally intensive in higher dimensions) and interpretation (the complex data structure can have a substantial impact on both geometric and arithmetic marginals and a geochemical interpretation may be elusive).Nevertheless we hope to address these challenges in the near future.Role of the authors: T. Matys Grygar designed the geochemical part of the work and prepared the initial manuscript draft.U. Radojičić performed the preliminary statistical analysis, as well as the first draft of the methodology section, and helped to finalise the manuscript.I. Pavlů worked on the outlier detection and prepared the initial draft of the DDC section.S. Greven and J.G. Nešlehová worked on the methodological part of the manuscript and helped to finalise the paper.Š. Tůmová performed work with geographic information systems and produced maps.K. Hron worked on the methodological part of the paper, helped to finalise the manuscript, and coordinated the entire teamwork.

Figure 1 :
Figure 1: The position and abbreviations of administrative districts in the Czech Republic.The shades of brown show percentages of agricultural fields in the total area of individual cadastral units.

Figure 2 :
Figure 2: One concrete choice of a B-spline basis.For each point on the x-axis, the function values of the basis functions sum up to 1.

Figure 3 :
Figure3: Boxplots of Pb concentrations pooled across the entire Czech Republic (left, in grey) along with seven districts covering typical patterns observed in that country.The numbers (white ink in black boxes) refer to contamination patterns according to Table1.Orange arrows highlight TUF for the entire Czech Republic.

Figure 4 :
Figure 4: ECDFs of Cu (mg kg −1 ) for contaminated (top line) and neighbouring uncontaminated districts (bottom line).Vertical rectangles are boundaries of Q2 and Q3 in individual districts.Numbers (white ink in black boxes) refer to contamination patterns according to Table 1.TUFs are shown by black arrows for individual data sets and orange arrows for the entire Czech Republic.Red arrows indicate heterogeneities at high concentration levels.

Figure 5 :
Figure 5: ECDFs of Zn concentrations (mg kg −1 ) in selected districts.Captions and explanations of abbreviations are analogous to those in Figure 4.

Figure 6 :
Figure 6: PDFs of Pb concentrations, row-wise per district, showing the distribution of concentrations in a colour scale in individual districts and clustered accordingly.

Figure 7 :
Figure 7: Districts clustered according to anomaly detection by the DDC algorithm applied to arithmetic marginals of the Pb distribution.

Figure 8 :
Figure 8: Representations of Cu distribution densities (A), scatterplots of pairs of elements (B), and the respective clr transformed bivariate PDFs as arithmetic (C) and geometric marginals (D) for Břeclav (BV) with contamination type 2 (Table 1).Red arrows show populations of high Cu concentrations unrelated to elevated Pb nor Zn, i.e. single-element contamination.Minor bimetallic contamination by Pb and Zn is highlighted by a blue arrow.Red ellipses in panel B correspond to the robust, MCD covariance ellipses.They highlight the main concentration population for the district and indicate a positive interelement association.

Figure 9 :
Figure 9: Representations of Cu distribution densities (A), scatterplots of element pairs (B), and the respective clr transformed bivariate PDFs as arithmetic (C) and geometric marginals (D) for Jihlava (JI) district with polymetallic contamination type 1 (Table1).Blue arrows show samples with multi-element contamination.Red ellipses in panel B correspond to the robust, MCD covariance ellipses.They highlight the main concentration population, and show only weak lithogenic correlation between Cu, Pb, and Zn.

Figure 10 :
Figure 10: Compositional clustering of districts according to their respective clr transformed information compositions.Specific values of norms indicate heterogeneity of the respective distributions: low norms (in blue hues) correspond to high heterogeneities, while high norms (in red hues) evidence narrow distributions.

Figure 11 :
Figure 11: Anomaly levels of information composition vectors based on the scaled residuals of pairwise logratios and aggregated in sense of the original components.The color key corresponds to the case of Figure 7.

Figure 12 :
Figure 12: Clusters C and D1 (districts bordered by thicker lines), the proportion of hop gardens and vineyards (blue shades) as well as orchards (green columns) in Czech administrative districts.
with the geographic data search and processing.T. Matys Grygar, K. Hron, I. Pavlů, J.G. Nešlehová and U. Radojičić were supported by the Czech Science Foundation, Project 22-15684L, and by the Austrian Science Foundation, Project I 5799-N, respectively.S. Greven and J.G. Nešlehová acknowledge funding from the German Research Foundation (DFG grant GR 3793/8-1) and the Natural Sciences and Engineering Research Council of Canada (RGPIN-2022-03614), respectively.Part of this work was completed while K. Hron visited the Centre de recherches mathématiques (CRM) in Montréal (Québec, Canada) in May 2023 as part of the CRM-Simons Scholar-in-Residence program.