Mineral–nutrient relationships in African soils assessed using cluster analysis of X-ray powder diffraction patterns and compositional methods

Highlights
• Cluster analysis applied to soil X-ray powder diffraction patterns.
• Nine mineralogically distinct clusters of soils defined.
• Statistically significant differences in nutrient compositions between clusters.
• Feldspars and Fe/Ti/Al/Mn-(hydr)oxides drive total nutrient concentrations.
• 2:1 phyllosilicates drive extractable (Mehlich-3) nutrient concentrations.

Most ordinary statistical methods are designed for real-valued variables and use the absolute magnitude of the measurements as the basic input for comparing samples. Compositions, however, consist of inter-related parts of a whole whose values are relative to each other; they are commonly expressed in units such as percentages, parts per million or weight percent. Without loss of generality, when a practitioner closes the data to add up to a constant (e.g. 1 when expressing the data as proportions), the composition is equivalently represented on the so-called unit simplex. The construction and analysis of log-ratio transformations, which provide a one-to-one mapping from the simplex to ordinary real space, have become the mainstream approach to dealing with data carrying relative information (Pawlowsky-Glahn et al., 2015). Log-ratios are real-valued, which generally enables the use of ordinary data analysis techniques on them, with the possibility of transferring results and conclusions back so that they are expressed in terms of the original compositions. This type of data transformation additionally guarantees that results do not change with the units of measurement used (e.g. if data were rescaled from mg kg⁻¹ to proportions), nor with whether the full composition or only a subset of its parts (a sub-composition) is of interest, thereby avoiding conflicting interpretations under different scaling procedures.
The issues with data representing parts of a whole, and the corresponding compositional methods, have long been known in the geological sciences (see e.g. Buccianti et al. (2006) and Grunsky and de Caritat (2019)). In recent decades, however, they have found applications in a wide range of areas, from modern molecular biology and epidemiology to economics and social sciences. Recent applications in soil science can be found in, for example, Reimann et al. (2012), Abdi et al. (2015) and Neiva et al. (2019).
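The two invariance properties described above can be illustrated with a minimal Python sketch. The composition values and the helper names `closure` and `pairwise_logratios` are hypothetical and chosen only for illustration; they are not taken from the study.

```python
import numpy as np

def closure(x, total=1.0):
    """Rescale a composition so that its parts sum to `total`."""
    x = np.asarray(x, dtype=float)
    return total * x / x.sum()

def pairwise_logratios(x):
    """Matrix whose (i, j) entry is log(x_i / x_j)."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx[:, None] - logx[None, :]

# Hypothetical 4-part composition in mg/kg (illustrative values only).
raw_mg_kg = np.array([1200.0, 300.0, 45.0, 5.0])

lr_raw = pairwise_logratios(raw_mg_kg)
# Same data expressed as proportions (closed to 1).
lr_prop = pairwise_logratios(closure(raw_mg_kg))
# Sub-composition of the first three parts, re-closed to 1.
lr_sub = pairwise_logratios(closure(raw_mg_kg[:3]))

# Log-ratios are unchanged by the change of units ...
scale_invariant = np.allclose(lr_raw, lr_prop)
# ... and the log-ratios among the retained parts are unchanged
# when the fourth part is dropped.
subcomp_coherent = np.allclose(lr_raw[:3, :3], lr_sub)
```

Both checks hold because any common multiplicative factor cancels inside each ratio, which is the reason log-ratios do not depend on the closure constant or on the parts left out.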
In this work we use so-called isometric log-ratio (ilr) coordinates, in particular a family of ilr-coordinates known as balances. A balance represents the relative importance or weight of one part or group of parts of the composition with respect to another part or group of parts, as summarised by their corresponding geometric means (Pawlowsky-Glahn et al., 2015). A collection of $D - 1$ balances is obtained from a $D$-part composition. Formally, given a composition $x = (x_1, \ldots, x_D)$, the balances $b_i$, $i = 1, \ldots, D - 1$, are constructed from a sequential binary partition, which produces contrasts between two subsets of parts as
$$ b_i = \sqrt{\frac{r_i s_i}{r_i + s_i}} \, \log \frac{\left( \prod_{k=1}^{r_i} x^{+}_{ik} \right)^{1/r_i}}{\left( \prod_{k=1}^{s_i} x^{-}_{ik} \right)^{1/s_i}}, $$
where $x^{+}_{ik}$ and $x^{-}_{ik}$ refer to the subsets of $r_i$ and $s_i$ parts of $x$ going, respectively, into the + (numerator) and − (denominator) groups. The multiplicative factor preceding the log-ratio term is a normalisation factor required for the balances to form an orthonormal coordinate system.
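As a minimal sketch of how balances can be computed from a sequential binary partition, the Python function below encodes the partition as a sign matrix with one row per balance (+1 for numerator parts, −1 for denominator parts, 0 for parts not involved). The function name `balances`, the sign-matrix encoding and the example values are assumptions made for illustration only.

```python
import numpy as np

def balances(x, sbp):
    """Balances of composition(s) x for a sign matrix sbp of shape (D-1, D).

    Each balance is sqrt(r*s/(r+s)) times the log of the ratio between the
    geometric means of the numerator (+1) and denominator (-1) parts.
    """
    logx = np.log(np.asarray(x, dtype=float))
    sbp = np.asarray(sbp)
    out = []
    for row in sbp:
        num, den = row > 0, row < 0
        r, s = num.sum(), den.sum()
        # The mean of logs is the log of the geometric mean.
        log_ratio = logx[..., num].mean(axis=-1) - logx[..., den].mean(axis=-1)
        out.append(np.sqrt(r * s / (r + s)) * log_ratio)
    return np.stack(out, axis=-1)

# Hypothetical partition of a 4-part composition:
# part 1 vs parts 2-4, then part 2 vs parts 3-4, then part 3 vs part 4.
sbp = np.array([[1, -1, -1, -1],
                [0,  1, -1, -1],
                [0,  0,  1, -1]])

x = np.array([0.4, 0.3, 0.2, 0.1])   # illustrative values only
b = balances(x, sbp)                 # D - 1 = 3 balances
```

A composition with all parts equal gives balances of zero, and rescaling `x` by any positive constant leaves `b` unchanged.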
In accordance with the relative scale of the data, the co-dependence or association between parts is determined in terms of proportionality between pairs of parts, instead of using the ordinary Pearson correlation measure. Following Aitchison (1986), proportionality was measured by the matrix of log-ratio variances $T = [t_{ij}]_{D \times D}$, the so-called variation matrix, where $t_{ij} = \mathrm{var}(\log(x_i / x_j))$, with $i, j = 1, \ldots, D$. A log-ratio variance $t_{ij}$ close to 0 indicates that the two components $x_i$ and $x_j$ are nearly proportional (highly co-dependent); that is, their log-ratio is nearly constant.
Tables S1 and S2 show the variation matrices obtained for the total and M3 datasets respectively.
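The variation matrix can be sketched in a few lines of Python. Note the assumptions: this sketch uses the ordinary sample variance for simplicity, which may differ from the scale estimate used for Tables S1 and S2, and the simulated data and the name `variation_matrix` are hypothetical.

```python
import numpy as np

def variation_matrix(X, ddof=1):
    """Variation matrix T with t_ij = var(log(x_i / x_j)) over the rows of X.

    Assumption: the ordinary sample variance is used here.
    """
    logX = np.log(np.asarray(X, dtype=float))
    # Pairwise log-ratios for every sample: shape (n, D, D).
    log_ratios = logX[:, :, None] - logX[:, None, :]
    return log_ratios.var(axis=0, ddof=ddof)

# Simulated data (illustrative only): parts 0 and 1 are exactly
# proportional, parts 2 and 3 vary independently.
rng = np.random.default_rng(0)
base = rng.lognormal(size=(50, 1))
X = np.hstack([base, 2.0 * base, rng.lognormal(size=(50, 2))])

T = variation_matrix(X)
```

In this example `T[0, 1]` is numerically zero because the log-ratio of the two proportional parts is constant, whereas entries involving the independent parts are clearly positive.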
Note that the variance (var) was computed here using the mean absolute deviation (MAD) as a […]
The clr-variables $y = \mathrm{clr}(x)$ are computed as
$$ y = \mathrm{clr}(x) = \left( \log \frac{x_1}{g(x)}, \ldots, \log \frac{x_D}{g(x)} \right), \qquad g(x) = \left( \prod_{j=1}^{D} x_j \right)^{1/D}, $$
where $g(x)$ is the geometric mean of the $D$ parts of $x$.
The relationships between components in Figure S1 are reflected in the length of the links between rays for the different parts. The shorter the link between two arrowheads, the higher the co-dependence (proportionality) between the corresponding parts. This is in agreement with the variation matrices, although note that the biplot is a 2-dimensional representation and some information is lost; in this case the biplots based on the first two principal components account for 71% and 67% of the total variability in the total and M3 datasets respectively. For the M3 dataset, the co-dependences $\mathrm{Ca}_M$-$\mathrm{Mg}_M$ and $\mathrm{Fe}_M$-$\mathrm{Zn}_M$ are clearly illustrated. Note that the rays for these two pairs lie nearly perfectly on a line, so they form a one-dimensional pattern of variation.
The information in the variation matrix $T$ was used in this work as input to perform hierarchical clustering of variables (R-mode Ward method), and the derived overall grouping structure of the parts (according to their proportionality) was displayed in the associated dendrograms, which are shown for the total and M3 datasets in Figure 7 of the manuscript. These groupings were used here to define data-driven balances between nutrient concentrations, with the split at each node of the dendrogram determining the parts going into the numerator (left-hand branch) and denominator (right-hand branch) of the log-ratio term of each balance ($b_i^T$ or $b_i^M$, $i = 1, \ldots, 7$). Thus, for the M3 dataset for instance, $b_1^M$ is given by

which represents a contrast between $\mathrm{B}_M$ and the other nutrients (summarised by their geometric mean), and so on. Note that a balance is equal to zero in equilibrium, that is, when the geometric means of the two groups of parts are equal, so the sign of the balance indicates the predominance of one subset of components over the other. Moreover, these balances account for decreasing amounts of total variability in the dataset. Thus, for the M3 dataset, they account for 51.75%, 17.47%, 11.48%, 7.81%, 5.61%, 4.21% and 1.68% from $b_1^M$ to $b_7^M$ respectively. In the total dataset, these contributions are 40.94%, 16.64%, 14.42%, 9.01%, 7.54%, 5.79% and 5.65% from $b_1^T$ to $b_7^T$ respectively. Hence, the first balance explains the largest share of the variability, and the explained variability then decreases gradually.
Figure S1: Compositional biplots of a) total and b) M3 nutrient concentration datasets.
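The percentage of total variability attributed to each balance can be sketched as follows. This is a hedged illustration on simulated data with an arbitrary partition, so the shares need not decrease as they do for the data-driven balances reported above. The names `balances`, `sbp` and `shares` are hypothetical, and the ordinary sample variance is assumed.

```python
import numpy as np

def balances(X, sbp):
    """Balances for the rows of X given a sign matrix sbp of shape (D-1, D)."""
    logX = np.log(np.asarray(X, dtype=float))
    out = []
    for row in np.asarray(sbp):
        num, den = row > 0, row < 0
        r, s = num.sum(), den.sum()
        log_ratio = logX[:, num].mean(axis=1) - logX[:, den].mean(axis=1)
        out.append(np.sqrt(r * s / (r + s)) * log_ratio)
    return np.column_stack(out)

# Simulated 4-part data and an arbitrary partition (illustrative only).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(100, 4))
sbp = np.array([[1, -1, -1, -1],
                [0,  1, -1, -1],
                [0,  0,  1, -1]])

B = balances(X, sbp)
balance_var = B.var(axis=0, ddof=1)
# Percentage of the total variability carried by each balance.
shares = 100.0 * balance_var / balance_var.sum()

# Cross-check: the total variance also equals the sum of all entries of
# the variation matrix divided by 2D.
logX = np.log(X)
T = (logX[:, :, None] - logX[:, None, :]).var(axis=0, ddof=1)
total_var_from_T = T.sum() / (2 * X.shape[1])
```

Because balances are orthonormal coordinates, their variances add up to the total variance of the composition, which is why the percentages sum to 100.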
The collections of log-ratio balances for the total and M3 nutrient concentration compositions are defined on ordinary real space, so ordinary statistical techniques can be applied to them. Thus, the multivariate analysis of variance described in the manuscript was conducted on balances.
Finally, note that balances, and ilr-coordinate systems in general, can be defined in infinitely many ways; from a geometrical point of view, any two such systems are simply orthogonal rotations of each other. Hence, it is important to verify that a particular data analysis method or model is invariant to those rotations, in which case an arbitrary choice of ilr-coordinates can be used to obtain the required output. Even so, in many practical settings it is preferable to tailor the balance representation according to a meaningful criterion for the benefit of interpretability. For example, balances can be driven by information contained in the data (as we did above based on the co-dependency structure between elements) or by the scientific questions, so as to represent particular relationships of interest. It can be checked that the statistical methods used in this study are invariant to orthogonal rotations of ilr-coordinates.
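The rotation property can be checked numerically with a small Python sketch. It builds two different balance systems for the same simulated data and verifies that they are related by an orthogonal matrix, so that rotation-invariant quantities such as pairwise distances between samples and the total variance coincide. The partitions, the data and the helper name `sbp_to_basis` are hypothetical; this is not a check of the specific methods used in the study.

```python
import numpy as np

def sbp_to_basis(sbp):
    """Orthonormal contrast matrix (D-1, D) for a sign matrix sbp.

    Row i holds the coefficients of balance i on the log parts:
    +sqrt(s/(r(r+s))) for numerator parts, -sqrt(r/(s(r+s))) for
    denominator parts and 0 elsewhere.
    """
    sbp = np.asarray(sbp)
    psi = np.zeros(sbp.shape, dtype=float)
    for i, row in enumerate(sbp):
        r, s = (row > 0).sum(), (row < 0).sum()
        psi[i, row > 0] = np.sqrt(s / (r * (r + s)))
        psi[i, row < 0] = -np.sqrt(r / (s * (r + s)))
    return psi

# Two hypothetical sequential binary partitions of a 4-part composition.
psi_a = sbp_to_basis([[1, -1, -1, -1], [0, 1, -1, -1], [0, 0, 1, -1]])
psi_b = sbp_to_basis([[1, 1, -1, -1], [1, -1, 0, 0], [0, 0, 1, -1]])

rng = np.random.default_rng(0)
X = rng.lognormal(size=(30, 4))          # simulated data, illustrative only

Z_a = np.log(X) @ psi_a.T                # balances under partition a
Z_b = np.log(X) @ psi_b.T                # balances under partition b

# Matrix taking one coordinate system into the other.
Q = psi_a @ psi_b.T
is_orthogonal = np.allclose(Q @ Q.T, np.eye(3))
is_rotation_of_each_other = np.allclose(Z_a, Z_b @ Q.T)

def pairwise_distances(Z):
    return np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)

same_distances = np.allclose(pairwise_distances(Z_a), pairwise_distances(Z_b))
same_total_var = np.isclose(Z_a.var(axis=0, ddof=1).sum(),
                            Z_b.var(axis=0, ddof=1).sum())
```

The individual balance variances differ between the two systems, but any quantity that depends only on distances or on the total variance is the same in both, which is the sense in which a rotation-invariant method allows an arbitrary choice of ilr-coordinates.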