Short Communication: The Wasserstein distance as a dissimilarity metric for comparing detrital age spectra, and other geological distributions

. Distributional data such as detrital age populations or grain size distributions are common in the geological sciences. As analytical techniques become more sophisticated, increasingly large amounts of distributional data are being gathered. These advances require quantitative and objective methods, such as multidimensional scaling (MDS), to analyse large numbers of samples. Crucial to such methods is choosing a sensible measure of dissimilarity between samples. At present, the Kolmogorov-5 Smirnov (KS) statistic is the most widely used of these dissimilarity measures. However, the KS statistic has some limitations such as high sensitivity to differences between the modes of two distributions, and insensitivity to their tails. Here we propose the Wasserstein-2 distance ( W 2 ) as an additional and alternative metric for use in geochronology. Whereas the KS-distance is defined as the maximum vertical distance between two empirical cumulative distribution functions, the W 2 -distance is a function of the horizontal distances (i.e., age differences) between observations. Using a variety of synthetic and real datasets we 10 explore scenarios where W 2 may provide greater geological insight than the KS statistic. We find that in cases where absolute time differences are not relevant (e.g., mixing of known, discrete age peaks), the KS statistic can be more intuitive. However, in scenarios where absolute age differences are important (e.g., temporally/spatially evolving sources, thermochronology, and overcoming laboratory biases) W 2 is preferable. The W 2 -distance has been added to the R package IsoplotR, for immediate use in detrital geochronology and other applications. The W 2 distance can be generalised to multiple dimensions, which opens 15 opportunities beyond distributional data.

In the following sections, we first introduce the Wasserstein distance in a more realistic setting, and formally define it. Next we discuss how it can be decomposed into intuitive terms that accord with how qualitatively, as geologists, we might compare distributions. We then proceed to compare the Wasserstein distance to the KS distance using a simple yet realistic synthetic 55 example. Finally, we analyse a series of case studies, analysing real datasets using both the Wasserstein and KS distances. We thus evaluate the benefits and drawbacks of both metrics, identifying scenarios when one metric may be preferred to the other.
Whilst we focus primarily on detrital age distributions, we emphasise that much of the following discussion applies equally to any form of distributional data.
2 The Wasserstein distance 60 The Wasserstein distance is a distance metric between two probability measures from a branch of mathematics called 'optimal transport'. Optimal transport is often intuited in terms of moving piles of sand from one location to another with no loss or gain of material (e.g., Villani 2003). The problem that optimal transport solves is finding the way to transport the sand such that the least sand is moved the least distance. The Wasserstein distance is the cost associated with this most efficient transportation.
The association with moving piles of sand is why the Wasserstein distance is often termed the Earth-mover's distance. Figure   65 1a shows an example of how one univariate probability distribution, µ, based on a detrital age spectrum, is transformed into another, ν according to the optimal transport plan. Elsewhere in the Earth sciences, the Wasserstein distance is increasingly used for solving non-linear geophysical inverse problems (e.g., Engquist and Froese 2014;Métivier et al. 2016;Sambridge et al. 2022) and has been proposed as a tool for fitting hydrographs (Magyar and Sambridge, 2023). Full mathematical treatments of the Wasserstein distance and optimal transport are beyond the scope of this paper, but interested readers are referred to Villani 70 (2003) or Peyré and Cuturi (2019). A geophysical perspective is given in Sambridge et al. (2022).

Formal definition
We consider two univariate probability distributions µ and ν which have cumulative distribution functions (CDFs) M and N respectively. The p th Wasserstein distance between µ and ν is given by: (1) 75 where M −1 indicates the inverse of the CDF M and 0 ≤ t ≤ 1 (Villani, 2003). Note that this definition of W p assumes that the cost-function is given by |x − y| p (e.g., the Euclidean distance where p = 2), which is the case for most distributional data in geology. In the further special case of p = 1 (i.e., the first Wasserstein distance, W 1 ), Equation 1 can be re-written simply as: which is the area between two CDFs (e.g., Figure 1b). Recall that the KS-distance between two distributions is the maximum 80 distance between the two corresponding CDFs. Whilst the W 1 is easily visualised, we actually use the W 2 going forwards as Semi-transparent coloured lines are probability distributions spaced equally in Wasserstein space between µ and ν (termed 'barycentres'; Benamou et al. 2015). b) Empirical Cumulative Distribution Functions (ECDFs) of the detrital ages used to calculate the distributions shown in panel a, same colours. The first Wasserstein (W1) distance corresponds to the total area between the two ECDFs (shaded pink). The Kolmogorov-Smirnov (KS) distance is the maximum distance between the two ECDFs (black double-headed arrow).
the squared distance (i.e., p = 2) between observations is the standard distance metric in most statistical analyses (e.g., least squares regression). Additionally, W 2 decomposes into readily interpretable terms, as discussed below.
We focus on these univariate instances as they apply to the most common geological distributional data including detrital age distributions and grain size distributions. However, we note that the Wasserstein distance is, in general, multivariate. As 85 a result, some form of the Wasserstein distance could prove useful for analysing a number of other geological datasets such as the geochemical compositions of detrital minerals, or joint U-Pb and Lu-Hf isotope analysis (see Vermeesch et al. 2023).
Statistics for comparing distributional data in multiple dimensions are increasingly needed (Sundell and Saylor, 2021).
Like the KS distance, W 2 satisfies the triangle inequality, and as such is a true metric. This property means that classical, as well as metric & non-metric MDS can be used with a W 2 defined dissimilarity matrix. As W 2 is sensitive to absolute time 90 differences, metric (or classical) MDS, which seek to preserve absolute distances, may be preferable to non-metric MDS. For the rest of this manuscript, metric MDS is used.

Decomposition
A particularly useful property of W 2 between two univariate distributions is that it can be decomposed in terms of the differences between the two distributions' location, spread and shape. Irpino and Romano (2007) show that: whereμ is the mean of µ, σ µ is the standard deviation of µ and ρ µν is the Pearson correlation coefficient between the quantiles of the distributions µ and ν. These three terms also accord with, qualitatively, how as geologists we might compare two distributions.

Discrete data 100
Most distributional data in the Earth sciences do not, in raw form, follow continuous probability distributions. Instead, samples may be discrete sets of observations, e.g., lists of individual mineral ages. The above formulations can be easily applied to such cases by describing the probability functions µ and ν as weighted sums of δ functions. For example, let us consider two samples x m and x n with p and q numbers of observations respectively:

A synthetic example
To demonstrate the intuition of W 2 we explore a simple synthetic example. We consider two probability density functions of 110 mineral ages: a bimodal distribution and a unimodal distribution, both constructed from Gaussians with the same scale ( Figure   2a). We fix the bimodal distribution at 1000 Ma, but translate the unimodal distribution along the time axis. For each translated distribution we calculate both the KS-distance and W 2 . Figure 2b displays the behaviour of both distances under this scenario.
The KS-distance shows an unexpectedly complex response containing a series of steps, as the peaks of the distributions align and misalign. At around ±400 Ma, once the distributions stop overlapping, the KS-distance plateaus at its maximum value of 115 1. By contrast, W 2 increases monotonically with increasing distance. Away from the origin, W 2 approximates a linear function of the amount of translation, as is predicted from Equation 3. At the origin, the non-zero value of W 2 is the cost of turning the unimodal distribution into the bimodal distribution without translation.
We argue that the behaviour of W 2 is more geologically intuitive than the KS-distance under this scenario. It is useful geological information if two distributions differ in their means by 400, 500 or 1000 Ma, but if the distributions do not overlap, overlapping distributions. Additionally, the stepped response of the KS-distance under translation is undesirable. Under the simple operation of translating a unimodal distribution, we would expect our dissimilarity to increase at a constant, or at least predictable (e.g., quadratic) rate. The change of the KS-distance with translation is, unintuitively, non-linear. By contrast, the W 2 increases linearly with respect to translation. 125 We reiterate that at a translation of 0 Ma, W 2 (and the KS distance) is still non-zero, reflecting the fact that even when the average ages are aligned, the shapes of the uni-modal and bi-modal distributions do not match. This illustrates the tendency of W 2 in geochronological data to prioritise aligning the average ages of distributions before considering matching individual peaks. Such behaviour contrasts with approaches that seek to only match probability peaks neglecting any information of absolute ages (e.g., Saylor and Sundell 2016).

Discussion
As stated above, the most appropriate dissimilarity metric to use will depend on the scientific question being answered. In general, the Wasserstein distance is most appropriate when absolute differences along the time axis (or more generally, the x-axis) provide useful information to solving the geologic problem. The KS distance however is more appropriate when the size of the time differences between peaks is not relevant. Both the KS distance and the W 2 are calculated in terms of differences between 135 ECDFs. Due to these similarities in construction, in many cases the results from using the KS and W 2 are, encouragingly, similar. One exception is whether ages are log transformed prior to analysis. Because the KS distance considers only the order of the ages, it will be the same whether a log transform is used or not. W 2 however will be different, and will consider relative not absolute age differences. Such an example is discussed below ( Figure 5).
Here we discuss a variety of realistic scenarios where the KS and W 2 may result in different interpretations. In each, we 140 evaluate the advantages and disadvantages of using W 2 or KS. These case-studies can be used to determine which metric is most appropriate for a particular scenario.

Discriminating contributions from discrete endmembers
We first consider a scenario where the samples are assumed to be mixtures, in differing proportions, of some known or unknown fixed endmembers. This situation is one where absolute distance along the time-axis is not relevant, as the nature of the 145 endmembers is not sought, simply their relative contributions to a set of mixtures. Instead, it is vertical differences in the probability at a given age that is relevant. The KS distance, which is sensitive to such vertical differences in age distributions is better suited for this than W 2 . Indeed, in such a scenario the W 2 can result in some unintuitive behaviour.
For example, let us consider three unimodal potential sediment sources, as shown in Figure 3a. We now consider two mixture samples. The first is an equal mixture of X and Y, and the second an equal mixture of Y and Z (bottom two plots, Figure 3a).

150
Geologically, we would expect these samples to be about half as similar to the two source endmembers. However, a W 2 MDS map identifies these samples as being removed from their two endmembers (Figure 3b). Additionally, because of the absolute time difference between Source Z and the other sources, Sample 2 is treated as a considerable outlier. The KS distance performs better here, placing the mixtures approximately halfway between the expected endmembers. However, in such a well defined mixing scenario as this, methods such as endmember mixture modelling may be more appropriate than statistical dimension 155 reduction (e.g., Weltje 1997; Sharman and Johnstone 2017; Dietze and Dietze 2019).

Temporally varying source age distributions
In contrast, scenarios where the shape of sediment source age distributions evolves in space and time are well suited to using W 2 . This is because W 2 considers all parts of a distribution, whereas the KS only compares one point, the location of maximum Figures 4b-c display MDS maps calculated using W 2 and KS respectively. The W 2 map clearly identifies the stratigraphic 165 order of the samples by the changing distribution shape. Additionally, it clusters the four unimodal samples together. By contrast, the KS map does not identify the stratigraphic trend, locating the lowermost stratigraphic sample GV64 with the uppermost samples KDS3 and GV44. We conclude then that the W 2 has better captured the geological information in this scenario.

170
In thermochronology, age distributions shift along the time-axis according to thermal signals (e.g., exhumation). In many thermochronological studies, we may seek to characterise how such a signal evolves in space and time. For this question MDS configuration using W2, following a log transform (stress = 0.02). c) MDS map using KS statistic (stress = 0.18). In this example, W2 performs better than the KS distance at identifying the geographic trend. absolute distance along the time-axis is useful information and so the W 2 may be more effective than the KS distance. For example, Wobus et al. (2003) use 40 Ar/ 39 Ar detrital mica thermochronometry to explore spatially varying exhumation along a spatial transect in the Himalaya. The KDEs of the samples are shown in Figure 5a arranged south to north. The southern 175 samples (WBS1, WBS2, WBS3, WBS8) show old exhumation signals, but a dramatic shift to younger ages is observed north of a distinct physiographic transition. MDS maps of these samples are shown using the KS distance and W 2 in Figures 5bc respectively. As there is limited overlap between the samples, the KS distance struggles to capture the NS progression in exhumation age. Whilst the physiographic division is found, it weights it equally to variation within one cluster. By contrast, the W 2 map correctly identifies the simple temporal and geographical trend of the samples from south to north.  Figure 6. Comparing samples from an inter-laboratory calibration study. KDEs (left) and ECDFs (right) of two samples from the interlaboratory comparison study of (Košler et al., 2013), plus a purposefully misaligned synthetic sample. Dashed lines mark the true ages of the detrital mixture. According to the KS-statistic, the age distribution produced by Lab 4 is more similar to the synthetic distribution than it is to the distribution produced by Lab 1, despite the absence of any shared age components. The W2 distance correctly deems the distribution produced by Lab 4 to be closer to that of Lab 1 than to the synthetic mixture.

Combining data from multiple laboratories
A final scenario where the W 2 could be preferable is when comparing samples from different laboratories which are affected by inter-laboratory bias. Košler et al. (2013) provided ten different laboratories with identical synthetic zircon samples with a known age distribution. Different instruments introduced small differences in the ages of each peak. For example, in Figure 6 we display the results from Lab 1 (red) and Lab 4 (pink) as KDEs. The expected peak at ∼ 1200 Ma (dashed line) is offset 185 between the two samples. As it is the maximum distance between two ECDFs, the KS distance is very sensitive to minor offsets in sharply defined peaks. In this case, the KS distance between these theoretically identical samples is large at 0.348, which is over one third of the maximum possible distance between samples. Indeed, the KS distance considers a synthetic, purposefully misaligned series of peaks (black KDE) to be more similar to the Lab 4 results than the results from Lab 1. The W 2 distance, does not suffer from this oversensitivity to minorly offset peaks and correctly identifies the samples from Lab 1 and Lab 4 as 190 being much more similar than the random synthetic distribution.
spatially/temporally evolving source distributions, thermochronological data, and combining detrital samples from different laboratories. The Wasserstein distance has been added to the IsoplotR software, and example scripts are provided in Python and R.

225
Code availability. The code and data repository is found at github.com/AlexLipp/detrital-wasserstein Author contributions. AGL conceived the project, both authors contributed to development, writing, and software production.
Competing interests. PV is an Associate Editor of Geochronology