Optimal transport for automatic alignment of untargeted metabolomic data

Untargeted metabolomic profiling through liquid chromatography-mass spectrometry (LC-MS) measures a vast array of metabolites within biospecimens, advancing drug development, disease diagnosis, and risk prediction. However, the low throughput of LC-MS poses a major challenge for biomarker discovery, annotation, and experimental comparison, necessitating the merging of multiple datasets. Current data pooling methods encounter practical limitations due to their vulnerability to data variations and hyperparameter dependence. Here, we introduce GromovMatcher, a flexible and user-friendly algorithm that automatically combines LC-MS datasets using optimal transport. By capitalizing on feature intensity correlation structures, GromovMatcher delivers superior alignment accuracy and robustness compared to existing approaches. This algorithm scales to thousands of features requiring minimal hyperparameter tuning. Manually curated datasets for validating alignment algorithms are limited in the field of untargeted metabolomics, and hence we develop a dataset split procedure to generate pairs of validation datasets to test the alignments produced by GromovMatcher and other methods. Applying our method to experimental patient studies of liver and pancreatic cancer, we discover shared metabolic features related to patient alcohol intake, demonstrating how GromovMatcher facilitates the search for biomarkers associated with lifestyle risk factors linked to several cancer types.

Untargeted metabolomics is a powerful analytical technique used to identify and measure a large number of metabolites in a biological sample without preselecting targets [1]. This approach allows for a comprehensive overview of an individual's metabolic profile, provides insights into the biochemical processes involved in cellular and organismal physiology [2,3], and allows for the exploration of how environmental factors impact metabolism [4,5]. It creates new opportunities to investigate health-related conditions, including diabetes [6], inflammatory bowel diseases [7], and various cancer types [8,9]. However, a major challenge in biomarker discovery, metabolic signature identification and other untargeted metabolomic analyses lies in the low throughput of experimental data, necessitating the development of efficient pooling algorithms capable of merging datasets from multiple sources [8].
A common experimental technique in untargeted metabolomics is liquid chromatography-mass spectrometry (LC-MS), which assembles a list of thousands of unlabeled metabolic features characterized by their mass-to-charge ratio (m/z), retention time (RT) [10], and intensity across all biological samples. Combining LC-MS datasets from multiple experimental studies remains challenging due to variation in the m/z and RT of a feature from one study to another [10,11]. This problem is further compounded by differing instruments and analytical protocols across laboratories, resulting in seemingly incompatible metabolomic datasets.
* Marie Breeur and George Stepaniants contributed equally and are joint first authors.

Manual matching of metabolic features can be a laborious and error-prone task [8]. To address this challenge, several automated methods have been developed for metabolic feature alignment. One such method is
MetaXCMS, which matches LC-MS features based on user-defined m/z and RT thresholds [12]. More advanced tools use information on feature intensities measured in samples. For instance, PAIRUP-MS uses known shared metabolic features to impute the intensities of all features from one dataset to another [13]. MetabCombiner [14] and M2S [15] compare average feature intensities, along with their m/z and RT values, to align datasets without requiring extensive knowledge of shared features. These automated alignment methods have accelerated our ability to pool and annotate datasets as well as extract biologically meaningful biomarkers. However, they demand substantial fine-tuning of user-defined parameters and ignore correlations among metabolic features which provide a wealth of additional information on shared features.
Here we introduce GromovMatcher, a user-friendly flexible algorithm which automates the matching of metabolic features across experiments. The main technical innovation of GromovMatcher lies in its ability to incorporate the correlation information between metabolic feature intensities, thereby harnessing the powerful mathematical framework of optimal transport (OT) [16,17]. OT has proven effective in solving various matching problems and has found applications in multiomics analysis [18], cell development [19,20], and chromatogram alignment [21]. Here we leverage the Gromov-Wasserstein (GW) method [22,23], which matches datasets based on their distance structure and has been seminally applied to spatial reconstruction problems in genomics [24]. GromovMatcher uses the GW algorithm to automatically uncover the shared correlation structure among metabolic feature intensities while also incorporating m/z and RT information in the final matching process.
To assess the performance of GromovMatcher, we systematically benchmark it on synthetic data with varying levels of noise, feature overlap, and data normalization, where it outperforms the prior state-of-the-art methods metabCombiner [14] and M2S [15]. Next we apply GromovMatcher to align experimental patient studies of liver and pancreatic cancer to a reference dataset and associate the shared metabolic features with each patient's alcohol intake. Through these efforts, we demonstrate how GromovMatcher data pooling improves our ability to discover biomarkers of lifestyle risk factors associated with several types of cancer.

GromovMatcher algorithm
GromovMatcher uses the mathematical framework of OT to find all matching metabolic features between two untargeted metabolomic datasets (Fig. 1). It accepts two LC-MS datasets with possibly different numbers of metabolic features and samples. Each feature, fx_i in Dataset 1 and fy_j in Dataset 2, is identified by its m/z, RT, and vector of feature intensities across samples (Fig. 1a). The primary tenet of GromovMatcher is that shared metabolic features have similar correlation patterns in both datasets and can be matched based on the distances/correlations between their feature intensity vectors. Specifically, GromovMatcher computes the pairwise distances between the feature intensity vectors of all metabolic features in a dataset and saves them in a distance matrix, one per dataset (Fig. 1b). In practice, we use either the Euclidean distance or the cosine distance (one minus the correlation) to perform this step (Methods). The resulting distance matrices contain information about the feature intensity similarity within each study. Using optimal transport, we can deduce shared subsets of metabolic features in both datasets which have corresponding feature intensity distance structures.
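As a concrete illustration of this step (Fig. 1b), the per-dataset distance matrices can be computed in a few lines of numpy. The function name and the array layout (features as rows, samples as columns) are our own choices for this sketch, not part of the GromovMatcher implementation:

```python
import numpy as np

def feature_distance_matrix(X, metric="cosine"):
    """Pairwise distances between feature intensity vectors.

    X : (p, n) array with one row per metabolic feature and one
        column per biospecimen sample.
    Returns a (p, p) symmetric distance matrix.
    """
    if metric == "euclidean":
        # normalized Euclidean distance ||u - v|| / sqrt(n)
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2) / X.shape[1])
    # cosine distance: one minus the Pearson correlation
    return 1.0 - np.corrcoef(X)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))          # 5 features measured in 100 samples
D = feature_distance_matrix(X)
print(D.shape)                          # (5, 5)
```

Either metric yields a symmetric matrix with zero diagonal; the choice between them is discussed in the Methods.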
OT was originally developed to optimize the transportation of soil for the construction of forts [25] and was later generalized through the language of probability theory and linear programming [26], leading to efficient numerical algorithms and direct applications to planning problems in economics. The ability of OT to efficiently match source to target locations found applications in data science for the alignment of distributions [27,28] and was generalized by the Gromov-Wasserstein (GW) method [29,30] to align datasets with features of differing dimensions.
In practice, a sizeable fraction of the metabolic features measured in one study may not be present in the other. Hence, in most cases only a subset of features in both datasets can be matched. Recent GW formulations for unbalanced matching problems [31] allow for matching only subsets of metabolic features with similar intensity structures (Fig. 1c). To incorporate additional feature information, we modify the optimization objective of unbalanced GW to penalize feature matches whose m/z differences exceed a fixed threshold (Methods, Supplementary Information). The optimization of this objective computes a coupling matrix Π where each entry Π_ij ≥ 0 indicates the level of confidence in matching metabolic feature fx_i in Dataset 1 to fy_j in Dataset 2.
Differences in experimental conditions can induce variations in RT between datasets that can be nonlinear and large in magnitude [10,14,15]. In the spirit of previous methods for LC-MS batch or dataset alignment [14,15,21,32-35], the learned coupling Π is used to estimate a nonlinear map (drift function) between the RTs of both datasets by weighted spline regression, which allows us to filter unlikely matches and obtain a refined coupling matrix (Fig. 1d, Methods). An optional thresholding step removes matches with small weights from the coupling matrix. The final output of GromovMatcher is a binary matching matrix M where M_ij is equal to 1 if features fx_i and fy_j are matched and 0 otherwise. Throughout the paper, we refer to the two variants of GromovMatcher, with and without the optional thresholding step, as GMT and GM respectively.
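The final binarization of a refined coupling into a matching matrix M can be sketched as follows. The mutual best-match rule used here is an illustrative choice for turning soft weights into one-to-one matches, not necessarily the exact rule implemented in GromovMatcher:

```python
import numpy as np

def coupling_to_matching(Pi, threshold=0.0):
    """Binarize a coupling matrix into a one-to-one matching.

    A pair (i, j) is kept when Pi[i, j] is the largest entry in both
    its row and its column and exceeds `threshold` (mutual best match).
    """
    row_best = Pi.argmax(axis=1)            # best j for each feature i
    col_best = Pi.argmax(axis=0)            # best i for each feature j
    M = np.zeros_like(Pi, dtype=int)
    for i, j in enumerate(row_best):
        if col_best[j] == i and Pi[i, j] > threshold:
            M[i, j] = 1
    return M

Pi = np.array([[0.8, 0.1, 0.0],
               [0.1, 0.0, 0.6],
               [0.0, 0.0, 0.1]])
print(coupling_to_matching(Pi, threshold=0.05))
# matches (0, 0) and (1, 2); the weak third row stays unmatched
```

Raising `threshold` corresponds to the optional thresholding step that distinguishes GMT from GM.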

Validation on ground-truth data
We first evaluate the performance of GromovMatcher using a real-world untargeted metabolomics study of cord blood across 499 newborns containing 4,712 metabolic features characterized by their m/z, RT, and feature intensities [36]. To generate ground-truth data, we randomly divide the initial dataset into two smaller datasets sharing a subset of features (Fig. 2). We simulate diverse acquisition conditions by adding noise to the m/z and RT of dataset 2, and to the feature intensities in both datasets. Moreover, we introduce an RT drift in dataset 2 to replicate the retention time variations observed in real LC-MS experiments (Methods and Supplementary Information). For comparison, we also test M2S and metabCombiner, both of which use m/z, RT, and median or mean feature intensities to match features (Fig. 3). MetabCombiner is supplied with 100 known shared metabolic features to automatically set its hyperparameters, while M2S parameters are manually fine-tuned to optimize the F1-score in each scenario (Supplementary Information). We assess the performance of GM, GMT, metabCombiner, and M2S in terms of their precision (fraction of true matches among the detected matches) and recall/sensitivity (fraction of true matches detected), averaged across 20 randomly generated dataset pairs.
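A minimal sketch of such a split procedure is shown below. The noise models and parameter values are placeholders for illustration, and the nonlinear RT drift and intensity noise of the real procedure are omitted here; only Gaussian jitter on m/z and RT of dataset 2 is applied:

```python
import numpy as np

def split_with_overlap(intensities, mz, rt, overlap=0.5,
                       mz_sd=1e-3, rt_sd=0.1, seed=0):
    """Split one feature table into two noisy datasets sharing a
    fraction `overlap` of features, with disjoint sample halves."""
    rng = np.random.default_rng(seed)
    p, n = intensities.shape
    perm = rng.permutation(p)
    n_shared = int(overlap * p)
    n_only = (p - n_shared) // 2
    shared, only1, only2 = np.split(perm, [n_shared, n_shared + n_only])
    idx1 = np.sort(np.concatenate([shared, only1]))
    idx2 = np.sort(np.concatenate([shared, only2]))
    samples = rng.permutation(n)
    s1, s2 = samples[: n // 2], samples[n // 2 :]
    d1 = dict(X=intensities[np.ix_(idx1, s1)], mz=mz[idx1], rt=rt[idx1])
    d2 = dict(X=intensities[np.ix_(idx2, s2)],
              mz=mz[idx2] + rng.normal(0, mz_sd, idx2.size),  # m/z jitter
              rt=rt[idx2] + rng.normal(0, rt_sd, idx2.size))  # RT jitter
    return d1, d2, shared

X = np.random.default_rng(1).normal(size=(100, 40))
mz = np.linspace(80, 900, 100)
rt = np.linspace(0.5, 12, 100)
d1, d2, shared = split_with_overlap(X, mz, rt, overlap=0.5)
print(d1["X"].shape, d2["X"].shape, shared.size)   # (75, 20) (75, 20) 50
```

The indices in `shared` provide the ground-truth matches against which precision and recall can be computed.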
FIG. 1. An optimal transport approach for combining untargeted metabolomics datasets (GromovMatcher). a, Inputs are two LC-MS datasets of unlabeled metabolic features (rows) identified by their m/z, RT, and feature intensities across biospecimen samples. Both studies can have differing numbers of metabolic features and samples. b, In both datasets, the intensities across samples of each metabolic feature are formed into a vector, and Euclidean distances between these feature vectors are computed and stored in a distance matrix. c, Based on the technique of optimal transport, the unbalanced GW algorithm learns a coupling matrix Π that places large weights Π_ij ≥ 0 when fx_i and fy_j likely correspond to the same metabolic feature. It optimizes Π to match features with similar pairwise distances (red outlined boxes) whose m/z ratios are close. d, The final step of GromovMatcher plots the retention times of features from both datasets against each other and fits a spline interpolation f weighted by the estimated coupling weights Π. This retention time drift function is then used to set to zero all entries Π_ij of outlier pairs (fx_i, fy_j) which exceed twice the median absolute deviation (MAD) around f (green highlighted region). Finally, the coupling matrix Π is filtered and/or thresholded to obtain a refined coupling, which is then binarized to obtain a one-to-one matching M between a subset of metabolite pairs in both datasets.

To investigate how the number of shared features affects dataset alignment, we generate pairs of LC-MS
datasets with low, medium, and high feature overlap (25%, 50%, and 75%), while maintaining a medium noise level (Methods). Here we find that GM and GMT generally outperform existing alignment methods, with a recall above 0.95, while metabCombiner and M2S tend to be less sensitive (Fig. 3b). All methods drop in precision as the feature overlap is decreased, with GM and GMT still maintaining an average precision above 0.8. Next we evaluate all four methods at low, moderate, and high noise levels for pairs of datasets with 50% overlap in their features (Methods). Our results show that GMT, GM, and M2S maintain an average recall above 0.89, while metabCombiner's recall drops below 0.6 for high noise. At large noise levels, RT drift estimation becomes more challenging, leading to a higher rate of false matches between metabolites (lower precision) for all four methods (Supplementary Fig. 1). Nevertheless, GMT obtains a high average precision and recall of 0.86 and 0.92 respectively.
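The precision and recall figures above can be computed from sets of feature pairs as follows; this is a generic sketch of the metrics, not the evaluation code used in the paper:

```python
def precision_recall(predicted, truth):
    """Precision and recall of predicted feature pairs against
    ground-truth pairs (each argument is a set of (i, j) tuples)."""
    tp = len(predicted & truth)                       # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

truth = {(0, 0), (1, 2), (3, 4)}
pred = {(0, 0), (1, 2), (2, 2)}
print(precision_recall(pred, truth))   # two of three predictions correct
```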

A notable difference between GM, metabCombiner, and M2S lies in their use of feature intensities. MetabCombiner expects the mean feature intensity rankings to be identical across studies, while M2S assumes that shared features have similar median intensities. In contrast, GM uses both the mean feature intensities and their variances and covariances. In practice, differences in experimental assays or study populations can lead to greater variation in feature intensities, making matchings based on these statistics less reliable. Centering and scaling the feature intensities to unit variance avoids potential biases arising from inconsistent feature intensity magnitudes, while preserving the correlations that GM leverages.
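This invariance is easy to check numerically: z-scoring each feature (centering and scaling to unit variance) leaves the feature-feature correlation matrix unchanged, so the structure GM relies on survives normalization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.exp(rng.normal(size=(6, 200)))   # skewed "intensities", 6 features

# center each feature and scale it to unit variance
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# the feature-feature correlation matrix is unchanged
print(np.allclose(np.corrcoef(X), np.corrcoef(Z)))   # True
```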
Exploring this further, we test how sensitive all four methods are to centering and scaling of feature intensities. MetabCombiner and M2S are tuned using the same methodology as for non-centered and non-scaled data. For M2S, we match features solely based on their m/z and RT. In this experiment (Supplementary Fig. 2), the absence of intensity magnitude information significantly affects the performance of metabCombiner and, to a lesser extent, M2S. GM and GMT still obtain accurate matchings, owing to their use of correlation structures, which are preserved under centering and scaling.
Application to EPIC data

Next, we apply GM, metabCombiner and M2S to align datasets from the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, a prospective study conducted across 23 European centers. EPIC comprises more than 500,000 participants who provided blood samples at recruitment [37]. Untargeted metabolomics data were successively acquired in several studies nested within the full cohort.
In the present work, we use LC-MS data from the EPIC cross-sectional (CS) study [38] and two matched case-control studies nested within EPIC, on hepatocellular carcinoma (HCC) [39,40] and pancreatic cancer (PC) [41]. LC-MS untargeted metabolomic data were acquired at the International Agency for Research on Cancer, making use of the same platform and methodology (Methods). The number of samples and features in each study is displayed in Fig. 4a.
Loftfield et al. [8] previously matched features from the CS, HCC, and PC studies in EPIC for alcohol biomarker discovery. The authors first identified 205 features (163 in positive and 42 in negative mode) associated with alcohol intake in the CS study. These features were then manually matched by an expert to features in both the HCC and PC studies (Methods). In our analysis, we use these features as a validation set and compare each method's matchings to the expert manual matchings on this subset. Due to the imbalance between the number of positive and negative mode features in the validation subset, our main analysis focuses on the alignment results of CS with HCC and CS with PC in positive mode. We defer the matching results for the negative mode studies to the Supplementary Information. In this section, we use the same settings for GM as in our simulation study, and do not apply an additional thresholding step. The parameters of metabCombiner and M2S are calibrated using the validation subset as prior knowledge (Methods).
Preliminary analysis of the validation subset reveals inconsistencies in the mean feature intensities (Supplementary Fig. 3), but Figure 4b shows that, on centered and scaled data, the 90 expert-matched features shared between the CS and HCC studies have similar correlation structures. Hence, to avoid potential errors, we center and scale the feature intensities, which improves the performance of all three methods tested below (Supplementary Information, Supplementary Tables 2 and 3).

Hepatocellular carcinoma
Here we analyze the quality of the matchings obtained by GM, M2S, and metabCombiner between the CS and HCC datasets in positive mode. Both GM and M2S identify approximately 1000 shared features, while metabCombiner finds a smaller set of about 700 shared features. We refer the reader to Supplementary Figure 4a for the precise matched feature sizes and details on the agreement between the feature matchings of all three methods.
We evaluate the performance of metabCombiner, M2S, and GM on the validation subset in positive mode (Fig. 4c), which consists of 90 features from the CS study manually matched to features from the HCC study and 73 features specific to the CS study. MetabCombiner demonstrates precise matching but lacks sensitivity. M2S's precision and recall are comparable with GM's, in contrast to its performance on simulated data. This can be attributed to the shape of the RT drift between the CS and HCC studies (Supplementary Information), which is estimated to be close to linear (Supplementary Fig. 5). Because the parameters of M2S are fine-tuned on the validation subset, it is able to learn this linear drift and apply tight RT thresholds to achieve accurate matchings. In contrast to metabCombiner and M2S, the GM algorithm is not given any prior knowledge of the validation subset, and nevertheless demonstrates the highest precision and recall of the three methods (Fig. 4c). Figure 4b shows how GM recovers the majority of the expert-matched pairs by leveraging the shared correlations (Supplementary Information).

Pancreatic cancer
Matching features between the CS and PC studies in positive mode, GM and M2S identify approximately 1000 common features, while metabCombiner detects approximately 600 matches (Supplementary Fig. 4b). We examine the performance of all three methods on the validation subset consisting of 66 manually matched features between CS and PC along with 97 features specific to the CS study. As before, GM and M2S have high recall, while the recall of metabCombiner is less than 0.5.
A decrease in precision is observed for both GM and M2S compared to the previous CS-HCC matchings. We therefore manually inspect the false positive matches: the set of CS features matched by a method to the PC study but explicitly examined and left unmatched in the expert manual matching. Assessing the GM results, we identify 7 false positive feature matches. Upon secondary inspection, 3 of these pairs are revealed to be correct matches that were not initially identified in the expert matching. M2S finds 11 false positive matches, which include the 7 false positives recovered by GM. Manual examination of the 4 remaining pairs reveals 2 clear mismatches. These results highlight the advantage of using automated methods for data alignment, as both GM and M2S detect correct matches that were not identified by experts, with GM being more precise than M2S.

Illustration for alcohol biomarker discovery
Loftfield et al. [8] identified biomarkers of habitual alcohol intake by first performing a discovery step, where they examined the relationship between alcohol intake and metabolic features in the CS study. They then manually matched the significant features in CS to features from the HCC and PC studies, and repeated the analysis with samples from the HCC and PC studies to determine whether the association with alcohol intake persisted. This led to the identification of 10 features possibly associated with alcohol intake (Extended Data Fig. 5a).
To extend this analysis and illustrate the benefit of GM automatic matching for biomarker discovery, we use GM to pool features from the CS, HCC, and PC studies, and examine the relationship between metabolic features and alcohol intake in the pooled study (Methods and Extended Data Fig. 5b).
Applying an FDR correction on the pooled study, we identify 243 features associated with alcohol intake, including 185 features consistent with the discovery step of Loftfield et al. [8], and 55 newly discovered features (Extended Data Fig. 5c). Using the more stringent Bonferroni correction on the pooled data, we identify 36 features shared by all three studies that are significantly associated with alcohol intake. These features include all 10 features identified in Loftfield et al. (Extended Data Fig. 5c). These findings highlight the potential benefits of using GM automatic matching for biomarker discovery in untargeted metabolomics data. Additional information regarding the methodology and findings of our GM and Loftfield et al. analyses can be found in Methods and Supplementary Information.
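For readers wishing to reproduce this style of analysis, the Benjamini-Hochberg FDR procedure used for multiple-testing correction can be sketched in a few lines of numpy. This is a textbook implementation, and the p-values below are made up for illustration:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean mask of discoveries under the Benjamini-Hochberg
    FDR procedure at level alpha."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / m       # i * alpha / m
    below = p[order] <= thresh
    keep = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])           # largest i with p_(i) <= i*alpha/m
        keep[order[: k + 1]] = True                # reject all hypotheses up to rank k
    return keep

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, alpha=0.05))
```

The more stringent Bonferroni correction would instead keep only p-values below alpha/m, which is why it yields fewer discoveries on the pooled data.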

DISCUSSION
LC-MS metabolomics has emerged as an increasingly powerful tool for biological and biomedical research, offering promising opportunities for epidemiological and clinical investigations. However, integrating data from different sources remains challenging. To address this issue, we introduce GromovMatcher, a method based on optimal transport that automatically aligns LC-MS data from pairs of studies. Our method exhibits superior performance on both simulated and real data when compared to existing approaches. Additionally, it presents a user-friendly interface with few hyperparameters.

FIG. 4. Application of GromovMatcher and comparison to existing methods on EPIC dataset. a, Dimensions of the three EPIC studies used. For each ionization mode, the cross-sectional (CS) study is aligned successively with the hepatocellular carcinoma (HCC) study and the pancreatic cancer (PC) study. b, Demonstration of expert manual matching and GromovMatcher (GM) matching between the CS and HCC studies in positive mode. Experts manually matched 90 features from [8], and the correlation matrices of these features in both datasets have similar structure (bottom two matrices). GM discovers 996 shared features between the CS and HCC datasets which have similar correlation structure (top two matrices). We validate that 88 of the 90 features from the manually expert-matched subset are contained in the set of features matched by GM. c, Performance of metabCombiner (mC), M2S and GM in positive mode. Precisions and recalls are measured on a validation subset of 163 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. d, Performance of mC, M2S and GM in negative mode. Precision and recall are measured on a validation subset of 42 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. See Extended Data Table I and Table II for exact precisions, recalls, and confidence intervals in positive and negative mode respectively.
While GromovMatcher is robust to noise and variations in data, it may face limitations when aligning LC-MS studies from populations with different characteristics, where the correlation structures between features may be inconsistent across studies. In this case, the base assumption of GromovMatcher can be relaxed by focusing on subsamples with similar characteristics, as exemplified in a recent study [42].
A current limitation is that GromovMatcher does not account for more than two datasets simultaneously, although this can be overcome by aligning multiple studies to a chosen reference dataset, as demonstrated in our biomarker experiments. The extension of Gromov-Wasserstein to multiple distributions [43] is another promising approach for generalizing GromovMatcher to multiple dataset alignment. Further improvements can be made by incorporating existing knowledge about the studies being matched, such as known shared features, samples in common, or MS/MS data.
The results obtained from GromovMatcher are highly promising, opening the door for various analyses of metabolomic datasets acquired in different experimental laboratories. Here we demonstrated the potential of GromovMatcher in expediting the combination and meta-analysis of data for biomarker and metabolic signature discovery. The matchings learned by GromovMatcher also allow for comparison between experimental protocols by assessing the drift in m/z, RT, and feature intensities across studies. Finally, inter-institutional annotation efforts can directly benefit from incorporating this method to transfer annotations between aligned datasets. Bridging the gap between otherwise incompatible LC-MS data, GromovMatcher enables seamless comparison of untargeted metabolomics experiments.

GromovMatcher method overview
GromovMatcher accepts as input two feature tables from separate LC-MS untargeted metabolomics studies. The feature tables for dataset 1 and dataset 2 consist of n_1 and n_2 biospecimen samples and p_1 and p_2 detected metabolic features, respectively. Features in dataset 1 are given the label fx_i for i = 1, ..., p_1. Every feature is characterized by a mass-to-charge ratio (m/z) denoted by m^x_i, a retention time (RT) denoted by RT^x_i, and a vector of intensities across all samples written as X_i ∈ R^{n_1}. Similarly, features in dataset 2 are labeled fy_j for j = 1, ..., p_2 and are characterized by their m/z m^y_j, retention time RT^y_j, and vector of intensities Y_j ∈ R^{n_2}. Our goal is to identify pairs of indices (i, j) with i ∈ {1, ..., p_1} and j ∈ {1, ..., p_2} such that fx_i and fy_j correspond to the same metabolic feature. More formally, we aim to identify a matching matrix M ∈ {0, 1}^{p_1 × p_2} such that M_ij = 1 if fx_i and fy_j correspond to the same feature, hereafter referred to as matched features. Otherwise, we set M_ij = 0.
Because the m/z and RT values of metabolomic features are often noisy and subject to experimental bias, our matching algorithm leverages the metabolite feature intensities X_i, Y_j to produce accurate dataset alignments. The GromovMatcher method is based on the idea that signal intensities of the same metabolites measured in two different studies should exhibit similar correlation structures, in addition to having compatible m/z and RT values. Here we define the Pearson correlation for vectors u, v ∈ R^n as

corr(u, v) = ⟨u − ū1_n, v − v̄1_n⟩ / (‖u − ū1_n‖ ‖v − v̄1_n‖),

where ū = (1/n) Σ_{i=1}^n u_i denotes the mean value and ‖·‖, ⟨·,·⟩ denote the Euclidean norm and inner product respectively. If measurements X_i, Y_j correspond to the same underlying feature, and similarly measurements X_k, Y_l share an underlying feature, we expect that

corr(X_i, X_k) ≈ corr(Y_j, Y_l).

This idea that the feature intensities of shared metabolites have the same correlation structure in both datasets also holds more generally for distances, under a suitable choice of distance. For example, the correlation coefficient corr(u, v) can be turned into a dissimilarity metric by defining

d_cos(u, v) = 1 − corr(u, v),   (4)

commonly referred to as the cosine distance. Preservation of feature intensity correlations then trivially amounts to the preservation of cosine distances. Another classical notion of distance between vectors u, v ∈ R^n is the normalized Euclidean distance

d(u, v) = (1/√n) ‖u − v‖,   (5)

which is equal to the cosine distance (up to constants) when the vectors u, v are centered and scaled to have zero mean and a standard deviation of one. The Euclidean distance depends on the magnitude or mean intensity of metabolic features, and hence is a useful metric for matching metabolites as long as these mean feature intensities are reliably collected.
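The equivalence between the cosine distance (4) and the normalized Euclidean distance (5) on standardized vectors can be verified numerically: for z-scored vectors, the squared normalized Euclidean distance equals twice the cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 500))

# z-score both vectors (zero mean, unit standard deviation)
zu = (u - u.mean()) / u.std()
zv = (v - v.mean()) / v.std()

cosine_dist = 1.0 - np.corrcoef(u, v)[0, 1]          # d_cos in (4)
norm_eucl_sq = ((zu - zv) ** 2).mean()               # squared distance in (5)

# for standardized vectors: ||zu - zv||^2 / n = 2 (1 - corr(u, v))
print(np.isclose(norm_eucl_sq, 2.0 * cosine_dist))   # True
```

This identity follows from expanding ‖zu − zv‖² = 2n − 2⟨zu, zv⟩ and noting that ⟨zu, zv⟩/n = corr(u, v).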
To summarize, the main tenet of GromovMatcher is that if measurements X_i, Y_j correspond to the same feature and X_k, Y_l correspond to the same feature, then

d_x(X_i, X_k) ≈ d_y(Y_j, Y_l),   (6)

for suitably chosen distances d_x, d_y across both datasets. In this paper, the distances d_x, d_y are taken to be the normalized Euclidean distances in (5).
We take care to specify those experiments in which the metabolic features X and Y are centered and scaled. In these cases, the Euclidean distance between normalized feature vectors implicitly becomes the cosine distance (4) between the original (unnormalized) feature vectors.

Unbalanced Gromov-Wasserstein
The goal of GromovMatcher is to learn a matching matrix M ∈ {0, 1}^{p_1 × p_2} that gives an alignment between a subset of metabolites in both datasets. However, searching over the combinatorially large set of binary matrices would be an inefficient approach for dataset alignment. The mathematical framework of optimal transport [16] instead enlarges this space of binary matrices to the set of coupling matrices with real nonnegative entries Π ∈ R_+^{p_1 × p_2}. Entries Π_ij with large weights indicate that feature fx_i in dataset 1 and feature fy_j in dataset 2 are a likely match. Taking inspiration from (6), we minimize the following objective function to estimate the coupling matrix Π:

E(Π) = Σ_{i,k=1}^{p_1} Σ_{j,l=1}^{p_2} |d_x(X_i, X_k) − d_y(Y_j, Y_l)|² Π_ij Π_kl.   (7)
A standard approach is to optimize this objective over all coupling matrices Π under the exact marginal constraints

Π_1 = a,  Π_2 = b,

where 1_n denotes the ones vector of length n, Π_1 = Π1_{p_2} and Π_2 = Π^T 1_{p_1} denote the row and column sums of the coupling matrix, and a, b are uniform weight vectors over the features of each dataset. Objective (7) under these exact marginal constraints defines a distance between the two sets of metabolic features known as the Gromov-Wasserstein distance [22], a generalization of optimal transport to metric spaces. Note that for pairs (i, k) and (j, l) with d_x(X_i, X_k) ≈ d_y(Y_j, Y_l), the entries Π_ij, Π_kl are penalized less, and hence matches between features fx_i, fy_j and features fx_k, fy_l are more favored. In our optimization, we avoid enforcing exact marginal constraints on the marginals Π1_{p_2} and Π^T 1_{p_1} of our coupling matrix, as this would force all metabolites in both datasets to be matched (Supplementary Information). However, without any marginal constraints on the coupling Π, the objective function (7) is trivially minimized by Π = 0, leaving all metabolites in both datasets unmatched.
To account for this, we follow the ideas of unbalanced Gromov-Wasserstein (UGW) [31] and add three regularization terms to our objective:

UGW(Π) = E(Π) + ρ D_KL(Π_1 ⊗ Π_1 ‖ a ⊗ a) + ρ D_KL(Π_2 ⊗ Π_2 ‖ b ⊗ b) + ε D_KL(Π ⊗ Π ‖ (ab^T) ⊗ (ab^T)),   (8)

where ρ, ε > 0 and we define a = 1_{p_1}, b = 1_{p_2}. Here ⊗ denotes the Kronecker product. We define D_KL as the Kullback-Leibler (KL) divergence between two discrete distributions µ, ν ∈ R_+^p,

D_KL(µ ‖ ν) = Σ_{i=1}^p [µ_i log(µ_i/ν_i) − µ_i + ν_i],   (9)

which measures the closeness of probability distributions. The first two regularization terms in (8) enforce that the row sums and column sums of the coupling matrix Π do not deviate too much from a uniform distribution, leading our optimization to match as many metabolic features as possible. The magnitude of the regularizer ρ roughly controls the fraction of metabolites in both datasets that are matched, where large ρ implies most metabolites are matched across datasets. The final regularization term, weighted by ε, controls the smoothness (entropy) of the coupling matrix Π: larger values of ε encourage Π to put uniform weights on many of its entries, leading to less precision in the metabolite matches. However, increasing ε also leads to better numerical stability and a significant speedup of the alternating minimization algorithm used to optimize the objective function (Supplementary Information). In our implementation, we set ρ and ε to the lowest values under which our optimization converges, namely ρ = 0.05 and ε = 0.005.
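To make the structure of the objective concrete, the numpy sketch below evaluates a simplified version of (8) for a candidate coupling. For readability, the tensorized KL penalties of [31] are replaced by plain KL penalties on the marginals and on the coupling itself, so this is illustrative rather than the exact UGW objective:

```python
import numpy as np

def kl(mu, nu):
    """Generalized KL divergence between nonnegative vectors, as in (9)."""
    mask = mu > 0
    return (mu[mask] * np.log(mu[mask] / nu[mask])).sum() - mu.sum() + nu.sum()

def ugw_objective(Pi, Dx, Dy, rho=0.05, eps=0.005):
    """Simplified unbalanced GW objective for a candidate coupling Pi:
    GW distortion (7) plus marginal KL penalties and an entropic term."""
    # sum_{i,j,k,l} |Dx[i,k] - Dy[j,l]|^2 Pi[i,j] Pi[k,l]
    distortion = ((Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2
                  * Pi[:, :, None, None] * Pi[None, None, :, :]).sum()
    a, b = np.ones(Dx.shape[0]), np.ones(Dy.shape[0])
    marg = rho * (kl(Pi.sum(1), a) + kl(Pi.sum(0), b))
    ent = eps * kl(Pi.ravel(), np.outer(a, b).ravel())
    return distortion + marg + ent

Dx = np.array([[0.0, 1.0], [1.0, 0.0]])
Dy = np.array([[0.0, 1.0], [1.0, 0.0]])
ident = np.eye(2)                       # match feature i to feature i
swap = ident[::-1]                      # match feature i to the other one
# both couplings preserve pairwise distances exactly, so costs are equal
print(np.isclose(ugw_objective(ident, Dx, Dy), ugw_objective(swap, Dx, Dy)))
```

The example also illustrates a known symmetry of GW-type objectives: couplings that preserve all pairwise distances are indistinguishable by the distortion term alone, which is why the m/z constraint described below is valuable for disambiguation.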
Our full optimization problem (10) combines objective (7) with the three regularization terms in (8). The UGW objective function is optimized through alternating minimization based on the code of [31], using the unbalanced Sinkhorn algorithm [44] from optimal transport (Supplementary Information).

Constraint on m/z ratios
Matched metabolic features must have compatible m/z values, so we enforce that Π_ij = 0 whenever |m^x_i − m^y_j| > m_gap, where m_gap is a user-specified threshold. Based on prior literature [8,13-15,45] we set m_gap = 0.01 ppm. Note that m_gap is not explicitly used in (10), but is instead enforced in each iteration of our alternating minimization algorithm for the UGW objective (Supplementary Information).
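As an illustration, this constraint can be implemented as a binary mask applied to the coupling at each iteration. The sketch below uses hypothetical m/z values and an arbitrary coupling; only the masking logic reflects the description above:

```python
import numpy as np

# Hypothetical m/z values for features in the two datasets.
mz1 = np.array([100.001, 150.500, 200.250])   # dataset 1, p1 = 3
mz2 = np.array([100.005, 150.520, 300.000])   # dataset 2, p2 = 3
m_gap = 0.01

# mask[i, j] is True when features i and j have compatible m/z;
# coupling entries outside the mask are forced to zero.
mask = np.abs(mz1[:, None] - mz2[None, :]) <= m_gap

# During each iteration the coupling is projected onto the mask.
Pi = np.full((3, 3), 0.1)
Pi_masked = np.where(mask, Pi, 0.0)
```

Here only the (1, 1) pair (|150.500 − 150.520| = 0.02 > m_gap) and the far-apart pairs are zeroed out.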
Unlike the m/z ratios discussed above, RTs often exhibit a non-linear deviation (drift) between studies so we cannot enforce compatibility of RTs directly in our optimization. Instead, in the following step of our pipeline we ensure matched metabolite pairs have compatible RTs by estimating the drift function and subsequently using it to filter out metabolite matches whose RT values are inconsistent with the estimated drift.

Estimation of the RT drift and filtering
Estimating the drift between RTs of two studies is a crucial step in assessing the validity of metabolite matches and discarding those pairs which are incompatible with the estimated drift.
Let Π̂ ∈ R^{p_1×p_2}_+ be the minimizer of (10) obtained after optimization. We seek to estimate the RT drift function f : R_+ → R_+ which relates the retention times of matched features between the two studies. Namely, if feature fx_i and feature fy_j correspond to the same metabolic feature, then we must have RT^y_j ≈ f(RT^x_i). We propose to learn the drift f through the weighted spline regression

f̂ = argmin_{f ∈ B_{n,k}} Σ_{i,j} Π̂_ij (RT^y_j − f(RT^x_i))²,    (11)

where B_{n,k} is the set of n-order B-splines with k knots. All pairs (RT^x_i, RT^y_j) in objective (11) are weighted by the coefficients of Π̂, so that larger weights are given to pairs identified with high confidence in the first step of our procedure. The order of the B-splines is set to n = 3 by default, while the number of knots k is selected by 10-fold cross-validation.
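A minimal sketch of the weighted fit, substituting scipy's smoothing spline for the cross-validated B-spline described above (the RT pairs and weights below are synthetic stand-ins for the matched retention times and coupling coefficients):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Synthetic matched RT pairs related by a nonlinear drift plus noise.
rt_x = np.sort(rng.uniform(0, 10, 200))
drift = lambda t: 1.1 * t + 1.3 * np.sin(1.2 * np.sqrt(t))
rt_y = drift(rt_x) + rng.normal(0, 0.05, rt_x.size)
weights = rng.uniform(0.5, 1.0, rt_x.size)  # stand-in for coupling entries

# Weighted cubic spline estimate of the RT drift (the paper selects the
# number of knots by cross-validation; here scipy's smoothing criterion
# with smoothing factor s plays that role).
f_hat = UnivariateSpline(rt_x, rt_y, w=weights, k=3, s=1.0)
residuals = rt_y - f_hat(rt_x)
```

Pairs receiving larger coupling weights pull the fitted curve more strongly, mirroring the role of Π̂ in (11).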
Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we first take the estimated RT drift f̂ and the set of pairs S = {(i, j) : Π̂_ij ≠ 0} recovered in Π̂. We then define the residual associated with (i, j) ∈ S as r_ij = RT^y_j − f̂(RT^x_i). The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by

PI = 1.96 · std({r_ij : (i, j) ∈ S}),    MAD = median({|r_ij − median(r)| : (i, j) ∈ S}),

where |S| is the size of S and the functions std and median denote the standard deviation and median respectively. Similar to the approach in [15], we create a new filtered coupling matrix Π̄ ∈ R^{p_1×p_2}_+ with Π̄_ij = Π̂_ij if |r_ij| ≤ r_thresh and Π̄_ij = 0 otherwise, where r_thresh is a given filtering threshold. Following [14], the estimation and outlier detection steps can be repeated for multiple iterations to remove pairs that deviate significantly from the estimated drift and to improve the robustness of the drift estimation. In our main algorithm, we use two preliminary iterations in which we estimate the RT drift and discard outliers outside of the 95% prediction interval by setting r_thresh = PI. We then re-estimate the drift and perform a final filtering step with the more stringent threshold r_thresh = 2 × MAD. At this stage, it is possible for Π̄ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than τ max(Π̄) to zero, for some user-defined τ ∈ [0, 1]. Lastly, a feature from either study could have multiple possible matches, since Π̄ can have more than one nonzero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher is a one-to-one matching: we only keep those metabolite pairs (i, j) where the entry Π̄_ij is largest in its corresponding row and column. All nonzero entries of Π̄ which do not satisfy this criterion are set to zero.
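The MAD-based filtering and the one-to-one reduction can be sketched as follows (the coupling and residual values are hypothetical; the threshold and the row/column-maximum rule follow the description above):

```python
import numpy as np

# Hypothetical coupling and RT residuals r[i, j] from the drift fit.
Pi = np.array([[0.8, 0.1, 0.0],
               [0.0, 0.6, 0.3],
               [0.2, 0.0, 0.7]])
r = np.array([[0.1, 2.5, 0.0],
              [0.0, -0.2, 1.8],
              [3.0, 0.0, 0.05]])

# MAD threshold computed over residuals of recovered pairs (Pi != 0).
res = r[Pi != 0]
mad = np.median(np.abs(res - np.median(res)))
r_thresh = 2 * mad
Pi_f = np.where(np.abs(r) <= r_thresh, Pi, 0.0)   # drop incompatible pairs

# Keep only entries maximal in both their row and column: one-to-one matching.
row_max = Pi_f == Pi_f.max(axis=1, keepdims=True)
col_max = Pi_f == Pi_f.max(axis=0, keepdims=True)
M = ((Pi_f > 0) & row_max & col_max).astype(int)
```

In this toy example the two large-residual pairs are filtered out and the surviving entries already form a one-to-one matching.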
Finally, we convert Π into a binary matching matrix M ∈ {0, 1} p1×p2 with ones in place of its nonzero entries and this final output is returned to the user.
As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional τ -thresholding step with τ = 0.3.

Existing alignment methods
In this paper, we consider two existing alignment methods for comparison: metabCombiner [14] and M2S [15]. Both take the same kind of input as GromovMatcher, i.e. feature tables with features identified by their m/z, RT, and intensities across samples.
MetabCombiner is a three-step process that begins by grouping features based on their m/z within user-specified bins. This creates a search space for potential feature pairs. In the second step, metabCombiner estimates the RT drift using the potential feature pairs identified in the first step, and eliminates outlying pairs over several iterations. This step can incorporate prior knowledge by identifying shared features and marking them as anchors, which are not discarded. In the final step, metabCombiner scores the remaining feature pairs based on their m/z, RT, and relative intensity compatibility to discriminate between multiple matches for one feature. The scoring system relies on weights assigned to m/z, RT, and feature intensities, with the magnitude of those weights reflecting the reliability of the corresponding measurements across studies. In our experiments, we use a bin width of 0.01, identical to the one used in GromovMatcher. We choose not to rely on prior knowledge for drift estimation, as Habra et al. [14] show their drift estimation to be efficient and robust even without prior knowledge. Our own tests confirm this claim, showing the drift estimation to be very consistent with or without shared entities (Supplementary Information). However, we use prior knowledge to calibrate the scoring weights, as this is a crucial step in the metabCombiner pipeline and the weights can be difficult to set manually.
Pinto et al. [15] introduce M2S as a more versatile alternative to metabCombiner, while still adhering to most of its core principles. Like metabCombiner, M2S follows a three-step process. First it searches for matches within user-defined thresholds for m/z, retention time, and mean feature intensity. Next, M2S estimates m/z, RT and feature intensity drifts between datasets and removes any outlier pairs. Finally, M2S selects the best match using a scoring system that weighs each measurement, similar to metabCombiner. M2S notably stands out by providing greater flexibility in the methods and measurements used at each step of the procedure, resulting however in a larger number of parameters that require manual fine-tuning. To address this, we adopt two different approaches for the simulation study and the EPIC study alignment. In the simulation study, we set the initial thresholds to oracle values and investigate technical parameters such as the method to define 'neighbouring' features used for drift estimation, the degree of freedom when estimating the RT drift, the strictness of the outlier detection, and the method to detect poor matches, among others (Supplementary Information). We determine the optimal parameter combination for each setting, and the combination with the best average F1-score across all settings. For the EPIC study alignment, we use the combination of technical parameters with the best average F1-score in the simulation study and select the best threshold values based on the performance on the validation subset.

Metrics for dataset alignment
Every alignment method studied in this paper returns a binary partial matching matrix M ∈ {0, 1}^{p_1×p_2} which has at most one nonzero entry in each row and column. Specifically, M_ij = 1 if metabolic features i and j in the two datasets correspond to each other and M_ij = 0 otherwise. In our simulated experiments, we compare the partial matching M to a known ground-truth partial matching matrix M* ∈ {0, 1}^{p_1×p_2}.
To do this, we first compute the number of true positives, false positives, true negatives, and false negatives as

TP = Σ_ij 1{M_ij = 1, M*_ij = 1},  FP = Σ_ij 1{M_ij = 1, M*_ij = 0},
TN = Σ_ij 1{M_ij = 0, M*_ij = 0},  FN = Σ_ij 1{M_ij = 0, M*_ij = 1},

where 1 denotes the indicator function. Then we use these values to compute the precision and recall as Precision = TP/(TP + FP) and Recall = TP/(TP + FN). Precision measures the fraction of correctly found matches out of all discovered metabolite matches, while recall, also known as sensitivity, measures the fraction of correctly matched pairs out of all truly matched pairs. These two statistics can be summarized into one metric, called the F1-score, by taking their harmonic mean F1 = 2 · Precision · Recall/(Precision + Recall). These three metrics, precision, recall, and the F1-score, are used throughout the paper to assess the performance of dataset alignment methods, both on simulated data where the ground-truth matching is known, and on the validation subset in EPIC, using the results of the manual examination as the ground-truth benchmark.
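These metrics can be computed directly from the matching matrices, for example:

```python
import numpy as np

def alignment_metrics(M, M_star):
    """Precision, recall, and F1 of a matching M against ground truth M_star."""
    tp = np.sum((M == 1) & (M_star == 1))
    fp = np.sum((M == 1) & (M_star == 0))
    fn = np.sum((M == 0) & (M_star == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one correct match, one wrong match, one missed match.
M_star = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
M      = np.array([[1, 0, 0], [0, 0, 1], [0, 0, 0]])
prec, rec, f1 = alignment_metrics(M, M_star)
```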

Validation on simulated data
To assess the performance of GromovMatcher and compare it to existing dataset alignment methods, we simulate realistic pairs of untargeted metabolomics feature tables with known ground-truth matchings. This allows us to analyze the dependence of alignment methods on the number of shared metabolites, the dataset noise level, and feature intensity centering and scaling.

Dataset generation
Our pairs of synthetic feature tables are generated from one real untargeted metabolomics study of 500 newborns within the EXPOsOMICS project, which uses a reversed-phase liquid chromatography-quadrupole time-of-flight mass spectrometry (UHPLC-QTOF-MS) system in positive ion mode [36]. The original dataset is first preprocessed following the procedure detailed in Alfano et al. [36], resulting in p = 4,712 features measured in n = 499 samples available for subsequent analysis. Features and samples from the original study are then divided into two feature tables of respective sizes (n_1, p_1) and (n_2, p_2), with n_1 + n_2 = n and p_1, p_2 ≤ p. To do this, n_1 = ⌈n/2⌉ randomly chosen samples from the original study are placed into dataset 1 and the remaining n_2 = ⌊n/2⌋ samples are placed into dataset 2. Here ⌊·⌋ and ⌈·⌉ denote the integer floor and ceiling functions. The features of the original study are randomly assigned to dataset 1, dataset 2, or both, allowing the resulting studies to have both common and study-specific features (Fig. 2). Specifically, for a fixed overlap parameter λ ∈ [0, 1], we assign a random subset of ⌊λp⌋ features to both dataset 1 and dataset 2, while the remaining p − ⌊λp⌋ features are divided equally between the two studies such that p_1 = p_2. We choose λ ∈ {0.25, 0.5, 0.75}, corresponding to low, medium and high overlap.
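The feature split with overlap parameter λ can be sketched as follows (here on p = 100 hypothetical feature indices with λ = 0.5):

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 100, 0.5

# Assign a random fraction lam of features to both datasets and split
# the remainder evenly so that p1 = p2.
perm = rng.permutation(p)
n_shared = int(np.floor(lam * p))
shared = perm[:n_shared]
rest = perm[n_shared:]
only1, only2 = rest[: rest.size // 2], rest[rest.size // 2 :]
features1 = np.sort(np.concatenate([shared, only1]))
features2 = np.sort(np.concatenate([shared, only2]))
```

Each dataset ends up with 75 features, of which 50 are shared.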
After generating a pair of studies, random noise is added to the m/z, RT and intensity levels of features in dataset 2 to mimic variations in data acquisition across two different experiments. The noise added to each m/z value in dataset 2 is sampled from a uniform distribution on the interval [−σ_M, σ_M] with σ_M = 0.01 [15]. The RTs of dataset 2 are first deviated by the function f(x) = 1.1x + 1.3 sin(1.2√x), corresponding to a systematic inter-dataset drift [14,15,33]. Uniformly distributed noise on the interval [−σ_RT, σ_RT] is then added to the deviated RTs of dataset 2, with σ_RT ∈ {0.2, 0.5, 1} (in minutes) corresponding to low, moderate and high variation [14,15,35]. Finally, we add Gaussian noise N(0, σ²_FI) to the feature intensities of both studies, where σ_FI is the standard deviation of the noise. This noise perturbs the correlation matrices of dataset 1 and dataset 2, making matching based on feature intensity correlations more challenging. We vary σ_FI over the set of values {0.1, 0.5, 1}.
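A sketch of this noise model on synthetic values (the drift function and noise levels are those given above; the m/z, RT, and intensity arrays are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_M, sigma_RT, sigma_FI = 0.01, 0.5, 0.5

mz = rng.uniform(100, 1000, 50)
rt = rng.uniform(0, 12, 50)
X = rng.normal(size=(50, 20))          # hypothetical feature-by-sample intensities

# Systematic RT drift between studies plus per-measurement noise.
drift = lambda t: 1.1 * t + 1.3 * np.sin(1.2 * np.sqrt(t))
mz_noisy = mz + rng.uniform(-sigma_M, sigma_M, mz.shape)
rt_noisy = drift(rt) + rng.uniform(-sigma_RT, sigma_RT, rt.shape)
X_noisy = X + rng.normal(0, sigma_FI, X.shape)
```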
Given this data generation process, we test the performance of the four alignment methods (M2S, metab-Combiner, GM, and GMT) under the parameter settings described below.

Dependence on overlap
We first assess how the performance of the four methods is affected by the number of metabolic features shared in both datasets. For each value of λ = 0.25, 0.5, 0.75 (low, medium and high overlap), we randomly generate 20 pairs of datasets with noise on the m/z, RT and feature intensities set to σ M = 0.01, σ RT = 0.5, σ FI = 0.5. The precision and recall of each method at low, medium, and high overlap is recorded for each of the repetitions.

Noise robustness
Next we test the robustness to noise of each method by fixing the metabolite overlap fraction at λ = 0.5 and generating 20 random pairs of datasets at low (σ RT = 0.2, σ FI = 0.1), medium (σ RT = 0.5, σ FI = 0.5), and high (σ RT = 1, σ FI = 1) noise levels. Similarly, the precision and recall of each method is saved for each noise level across the 20 repetitions.

Feature intensity centering and scaling
In order to test how all four methods are affected when the mean feature intensities and variance are not comparable across studies, we assess their performance when the feature intensities in both studies are mean centered and standardized to have unit standard deviation across all samples. We again generate 20 random pairs of datasets with medium overlap and medium noise, normalize the feature intensities in each pair of datasets, and compute the precision and recall of each method across the 20 repetitions.

EPIC data
We also evaluate our method on data collected within the European Prospective Investigation into Cancer and Nutrition (EPIC) cohort, an ongoing multicentric prospective study with over 500,000 participants recruited between 1992 and 2000 from 23 centers in 10 European countries, and who provided blood samples at the inclusion in the study [37]. In EPIC, untargeted metabolomics data were successively acquired in several studies nested within the full cohort.
In the present work, we use untargeted metabolomics data acquired in three studies nested in EPIC, namely the EPIC cross-sectional (CS) study [38] and two matched case-control studies nested within EPIC, on hepatocellular carcinoma (HCC) [39,40] and pancreatic cancer (PC) [41], respectively. All data were acquired at the International Agency for Research on Cancer, making use of the same platform and methodology: UHPLC-QTOF-MS (1290 Binary Liquid chromatography system, 6550 quadrupole time-of-flight mass spectrometer, Agilent Technologies, Santa Clara, CA) using reversed-phase chromatography and electrospray ionization in both positive and negative ionization modes.
In a previous analysis aiming at identifying biomarkers of habitual alcohol intake in EPIC, the 205 features associated with alcohol intake in the CS study were manually matched to features in both the HCC and PC studies [8].
The results from this manual matching are presented in Supplementary Table 1. The matching process was based on the proximity of m/z and RT, using a matching tolerance of ±15 ppm and ±0.2 min, and on the comparison of the chromatograms of features in quality control samples from both studies.

Preprocessing
In the HCC and PC studies, samples corresponding to participants selected as cases in either study (i.e., participants selected in the study because of a diagnosis of incident HCC or PC) are excluded. Indeed, the metabolic profiles of participants selected as controls are expected to be more comparable across studies than those of cases, especially if certain features are associated with the risk of HCC or PC. Apart from this additional exclusion criterion, the untargeted metabolomics data of each study is pre-processed following the steps described in Loftfield et al. [8], to eliminate unreliable features and samples, impute missing values and minimize technical variations in the feature intensity levels.

Alcohol biomarker discovery
Loftfield et al. [8] used the untargeted metabolomics data of the CS, HCC and PC studies in their alcohol biomarker discovery study in EPIC, without being able to automatically match their common features and pool the 3 datasets. Instead, the authors first implemented a discovery step, examining the relationship between alcohol intake and metabolic features measured in the CS study and accounting for multiple testing using a false discovery rate (FDR) correction. This led to the identification of 205 features significantly associated with alcohol intake in the CS study. In order to gauge the robustness of these associations, the authors of Loftfield et al. [8] then implemented a validation step using data from two independent test sets. The first test set was composed of data from the EPIC HCC and PC studies, while the second was derived from the Finnish Alpha-Tocopherol, Beta-Carotene Cancer Prevention (ATBC) study. The 205 features identified in the discovery step were manually investigated for matches in the EPIC test set, and 67 features were effectively matched to features in the HCC or PC study, or both. The authors then evaluated the association between alcohol intake and those 67 features, applying a more conservative Bonferroni correction to determine whether the association with alcohol intake persisted. This step led to the identification of 10 features associated with alcohol intake (Extended Data  Fig. 5a). The second test set was then used to determine whether those 10 features were also significant in the ATBC population, which was indeed the case.
To conduct a more in-depth investigation of the matchings produced by the GromovMatcher algorithm, we build upon the analysis previously conducted by Loftfield et al. [8] by exploring potential alcohol biomarkers using a pooled dataset created from the CS, HCC, and PC studies. Our goal is to assess whether pooling the data leads to increased statistical power and allows for the detection of more features associated with alcohol intake. Namely, we generate the pooled dataset by aligning a chosen reference dataset (CS study) with the HCC and PC studies successively using the GM matchings computed in both positive and negative mode (Methods and Extended Data Fig. 5b). Features that are not detected in either the HCC or PC studies are designated as 'missing' in the final pooled dataset for samples belonging to the respective studies where the feature is not found.
To evaluate the potential relationship between alcohol consumption and pooled metabolic features, we use a methodology akin to that of Loftfield et al. [8]. The self-reported alcohol intake data is adjusted for various demographic and lifestyle factors (age, sex, country, body mass index, smoking status and intensity, coffee consumption, and study) via the residual method in linear regression models. Feature intensities are also adjusted for technical variables (plate number and position within the plate) via linear mixed effect models. The significance of each association is assessed using correlation coefficients computed from the residuals of both self-reported alcohol intake and feature intensities. P-values are corrected using either false discovery rate (FDR) or Bonferroni correction to account for multiple testing. Corrected p-values less than 5% are considered significant.
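A simplified sketch of the residual method on synthetic data: both the exposure and a feature intensity are residualized on a confounder before computing the correlation test (a single covariate stands in for the full adjustment set, and the mixed-model adjustment for technical variables is omitted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 300

# Synthetic example: confounder z, alcohol intake, and two features
# (one truly associated with alcohol, one associated only with z).
z = rng.normal(size=n)
alcohol = 0.5 * z + rng.normal(size=n)
feat_assoc = 0.4 * alcohol + 0.3 * z + rng.normal(size=n)
feat_null = 0.3 * z + rng.normal(size=n)

def residualize(v, covar):
    """Residuals of v after ordinary least squares on the covariates."""
    A = np.column_stack([np.ones_like(covar), covar])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

# Correlate the residualized exposure with each residualized feature.
r_alc = residualize(alcohol, z)
pvals = [stats.pearsonr(r_alc, residualize(f, z))[1]
         for f in (feat_assoc, feat_null)]
```

The truly associated feature yields a small p-value even after removing the confounder's contribution to both variables.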

DATA AVAILABILITY
The LC-MS data used to generate our simulated validation experiments is located at the bottom of the "Files" section at https://www.ebi.ac.uk/metabolights/MTBLS1684/files under the filename 'metabolomics_normalized_data.xlsx'. The EPIC data is not publicly available, but access requests can be submitted to the Steering Committee at https://epic.iarc.fr/access/index.php.

CODE AVAILABILITY
All code for the data preprocessing, figure generation, the GromovMatcher algorithm, and its comparison to other methods is available at: https://github.com/sgstepaniants/GromovMatcher. Instructions and examples for how to run the GromovMatcher method are provided in the GitHub repository. The metabCombiner implementation written by the original authors was taken from their GitHub codebase: https://github.com/hhabra/metabCombiner. The M2S implementation of the original authors was taken from their GitHub codebase: https://github.com/rjdossan/M2S.

ACKNOWLEDGMENTS

We thank the Principal Investigators of each of the EPIC centers for sharing the data for our experimental application.

COMPETING INTERESTS
The authors declare no competing interests.

IARC DISCLAIMER
Where authors are identified as personnel of the International Agency for Research on Cancer/World Health Organization, the authors alone are responsible for the views expressed in this article and they do not necessarily represent the decisions, policy, or views of the International Agency for Research on Cancer/World Health Organization.

Figure: Comparison of GromovMatcher and Loftfield et al. [8] analysis for alcohol biomarker discovery on EPIC data. a, The Loftfield study implemented a discovery step, examining the relationship between alcohol intake and metabolic features in the CS study. The significant features in CS were manually matched to features from the HCC and PC studies and the analysis was repeated using samples from the HCC and PC studies. After this step, 10 features associated with alcohol intake were identified. b, The GromovMatcher analysis begins by matching features from the CS study to the HCC and PC studies respectively (top blue, yellow, and red boxes). Samples corresponding to each CS feature are combined with the samples of its matched feature in the HCC study, PC study, or both. This generates a larger pooled data matrix with the same number of features as the CS study but with more samples pooled across the three original studies (center matrix). Because some features in the CS study may not have matches in HCC or PC, the corresponding entries in the pooled matrix are set to NaN/missing values (white regions in the matrix). Each column/feature of this matrix is statistically tested for association with alcohol intake (ignoring missing values), and an FDR or a stricter Bonferroni correction is performed to retain only the subset of features from the pooled study that have a strong association. c, Venn diagrams show the intersection of feature sets (in positive and negative mode) found to be associated with alcohol intake by each of the four different analyses.
In this paper, we study how to match metabolic features across two datasets, where Dataset 1 has p_1 metabolic features measured across n_1 patients and Dataset 2 has p_2 metabolic features measured across n_2 patients. Our goal is to identify pairs of indices (i, j), with i ∈ {1, ..., p_1} and j ∈ {1, ..., p_2}, such that feature i in Dataset 1 and feature j in Dataset 2 correspond to the same metabolic feature. More formally, we aim to identify a matching matrix M* ∈ {0, 1}^{p_1×p_2} such that M*_{i,j} = 1 if feature i in Dataset 1 and feature j in Dataset 2 correspond to the same feature, hereafter referred to as matched features, and M*_{i,j} = 0 otherwise. We emphasize that a matching matrix M* can have at most one nonzero entry in each row and column.

Both of the datasets we aim to match are obtained from liquid chromatography-mass spectrometry (LC-MS) experiments. Hence, for Dataset 1 each metabolite i ∈ [p 1 ] is labeled with a mass-to-charge (m/z) ratio m 1 i as well as a retention time (RT) given by RT 1 i . Additionally, each metabolite has a vector of intensities across patients denoted by X i ∈ R n1 . Similarly, each metabolite j ∈ [p 2 ] in Dataset 2 is labeled by its m/z ratio m 2 j , its retention time RT 2 j and its vector of intensities across samples Y j ∈ R n2 .

arXiv:2306.03218v1 [q-bio.QM] 5 Jun 2023
Correlations and distances between metabolomic features

Features cannot be aligned based on their m/z and RT alone, as these are often too inconsistent across studies. Our method is based on the idea that, in addition to their m/z and RT being compatible, the signal intensities of metabolites measured in two different studies should exhibit similar correlation structures, or more generally, similar distances between their intensity vectors. In other words, if feature intensity vectors X_i ∈ R^{n_1}, Y_j ∈ R^{n_2} correspond to the same underlying feature (M*_{ij} = 1) and similarly X_k ∈ R^{n_1}, Y_l ∈ R^{n_2} correspond to the same feature (M*_{kl} = 1), then we expect that

corr(X_i, X_k) ≈ corr(Y_j, Y_l).

Here we define corr(u, v) to be the Pearson correlation coefficient between two feature intensity vectors u, v ∈ R^n,

corr(u, v) = ⟨u − ū1_n, v − v̄1_n⟩ / (‖u − ū1_n‖ ‖v − v̄1_n‖),

where ū denotes the mean value of the entries of u, and ‖·‖, ⟨·, ·⟩ denote the Euclidean norm and inner product respectively. More generally, with d_x and d_y denoting two given distances on R^{n_1} and R^{n_2} respectively, we expect that

d_x(X_i, X_k) ≈ d_y(Y_j, Y_l).    (4)

Throughout this paper, we use the normalized Euclidean distance defined for any u, v ∈ R^n as

d(u, v) = (1/(2n)) ‖u − v‖²,    (5)

where for d_x and d_y we take n = n_1, n_2 respectively. If the signal intensity vectors u, v are mean centered and normalized by their standard deviation, i.e. u ← (u − ū1_n)/std(u) and likewise for v, then it follows that

d(u, v) = 1 − corr(u, v) = d_cos(u, v),    (7)

where we denote by d_cos(u, v) = 1 − corr(u, v) the cosine distance. For the purposes of this paper, we will always assume that d_x and d_y denote the normalized Euclidean distance from (5). As shown above, this is implicitly equal to the cosine distance from (7) on centered and scaled data. The goal of metabolomic feature matching is to learn the binary matching matrix M* that aligns the distances between pairs of features in the most consistent way possible, as shown in (4). To formalize this notion into a practical algorithm, we use the mathematical theory of optimal transport [1], which we discuss next.
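The stated equivalence between the normalized Euclidean distance and the cosine distance on centered, scaled vectors can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
u, v = rng.normal(size=n), rng.normal(size=n)

# Center and scale both vectors to zero mean and unit standard deviation.
zu = (u - u.mean()) / u.std()
zv = (v - v.mean()) / v.std()

# Normalized Euclidean distance vs cosine distance 1 - corr.
d_euc = np.sum((zu - zv) ** 2) / (2 * n)
d_cos = 1 - np.corrcoef(zu, zv)[0, 1]
```

Since ‖zu‖² = ‖zv‖² = n after standardization, ‖zu − zv‖²/(2n) = 1 − ⟨zu, zv⟩/n, which is exactly one minus the correlation.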

Optimal transport
Optimal transport (OT) applies in the setting when the points {X i } p1 i=1 and {Y j } p2 j=1 being matched live in the same dimensional space n 1 = n 2 = n. It aims to find a matching between each point X i and its corresponding point Y j such that the sum of distances between matches is minimized. Matches between each pair of points can be stored in a matching matrix M ∈ {0, 1} p1×p2 such that M ij = 1 if X i and Y j are matched, and M ij = 0 otherwise. Again we note that M must have at most one nonzero entry in each row and column to be a valid matching matrix.
Instead of searching over this space of binary matching matrices, optimal transport places masses a_i ≥ 0 at all points X_i for i = 1, ..., p_1 and masses b_j ≥ 0 at all points Y_j for j = 1, ..., p_2, and optimizes over the space of probabilistic couplings Π ∈ R^{p_1×p_2}_+ which move an amount of mass Π_ij from X_i to Y_j. We assume here for simplicity that the sums of masses in both datasets are equal to one, Σ_{i=1}^{p_1} a_i = Σ_{j=1}^{p_2} b_j = 1, and that the coupling Π transports all mass from a into b. More formally, optimal transport optimizes over the constrained set of couplings

U(a, b) = {Π ∈ R^{p_1×p_2}_+ : Π1_{p_2} = a and Π^T 1_{p_1} = b},    (8)

where 1_p denotes the all-ones vector of length p. In practice, the points X_i and Y_j in each dataset are all treated the same and the masses placed on the data are chosen to be uniform, a = (1/p_1) 1_{p_1} and b = (1/p_2) 1_{p_2}. The cost function which optimal transport minimizes is the total distance traveled by its transported mass,

min_{Π ∈ U(a,b)} Σ_{ij} d(X_i, Y_j) Π_ij.

The distance d(X_i, Y_j) in the OT objective can be replaced more generally with a cost matrix C ∈ R^{p_1×p_2} that is not necessarily a distance matrix. In this case the cost function becomes

min_{Π ∈ U(a,b)} Σ_{ij} C_ij Π_ij = min_{Π ∈ U(a,b)} ⟨C, Π⟩.

When the transport cost C_ij is a distance, the OT optimization defines a valid distance metric, known as the optimal transport distance, between the discrete distributions {(a_i, X_i)} and {(b_j, Y_j)}. When d(u, v) is Euclidean, this OT distance is also referred to as the L1 optimal transport distance, the Wasserstein 1-distance, or the Earth mover's distance. As formulated, the computation of the optimal transport objective involves an optimization over coupling matrices Π which can be solved by linear programming [1]. The OT optimization problem becomes time consuming for problems with many points p_1, p_2 ≫ 1. We show in the next section how augmenting this distance with a regularization term leads to a more efficient algorithm for learning the optimal coupling Π.
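As a small illustration of the linear-programming formulation, the following sketch solves a 3×3 balanced OT problem with scipy's LP solver (the cost matrix is arbitrary; variables are the flattened coupling entries):

```python
import numpy as np
from scipy.optimize import linprog

p1, p2 = 3, 3
a = np.full(p1, 1 / p1)
b = np.full(p2, 1 / p2)
C = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])

# Equality constraints: row sums of Pi equal a, column sums equal b.
A_eq = np.zeros((p1 + p2, p1 * p2))
for i in range(p1):
    A_eq[i, i * p2:(i + 1) * p2] = 1.0   # row-sum constraint for row i
for j in range(p2):
    A_eq[p1 + j, j::p2] = 1.0            # column-sum constraint for column j

res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
              bounds=(0, None))
Pi = res.x.reshape(p1, p2)
cost = res.fun
```

With zero cost on the diagonal, the optimal coupling puts all mass on the diagonal and attains zero transport cost.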

Entropic regularization
Define the Kullback-Leibler (KL) divergence between two positive vectors µ, ν ∈ R^p_+ as

D_KL(µ, ν) = Σ_i µ_i ln(µ_i/ν_i) − Σ_i µ_i + Σ_i ν_i.

Given fixed marginals a ∈ R^{p_1} and b ∈ R^{p_2} from the previous section, we can define the entropy of a coupling matrix Π ∈ R^{p_1×p_2}_+ with respect to these fixed marginals as D_KL(Π, a ⊗ b), where (a ⊗ b)_ij = a_i b_j denotes the outer product. When the entries of a, b, and Π sum to one, this can be further simplified as

D_KL(Π, a ⊗ b) = Σ_ij Π_ij ln Π_ij − Σ_ij Π_ij ln(a_i b_j) = H(Π) + ln(p_1 p_2),

where we define H(Π) = Σ_ij Π_ij ln Π_ij. In the first equality we used the fact that the entries of Π and a ⊗ b sum to one, and in the second we used the fact that the marginals a and b are uniform. Under these assumptions, we see that the KL divergence D_KL(Π, a ⊗ b) is independent of the values of the marginals a, b and is equal to H(Π) up to constants. Although here the general definition of entropy through the KL divergence reduces to the simpler formula H(Π), in the following sections we will need to extend our analysis to cases where a, b, and Π have positive values that do not sum to one (i.e. are not distributions). In this context, we will no longer have that D_KL(Π, a ⊗ b) = H(Π) + const, but we will still be able to use D_KL(Π, a ⊗ b) as a general notion of entropy for Π.
The entropy of a coupling D_KL(Π, a ⊗ b) is an important notion because it quantifies how uniform or smooth Π is with respect to the product distribution a ⊗ b. In particular, if a and b are set to uniform distributions, as is commonly done in practice, then D_KL(Π, a ⊗ b) is small when Π has close to uniform entries and is large otherwise. This notion of smoothness allows us to use D_KL(Π, a ⊗ b) as a regularizer in our optimal transport problem,

min_{Π ∈ U(a,b)} ⟨C, Π⟩ + ε D_KL(Π, a ⊗ b),

where ε is a small regularization parameter. Note that here we have denoted the transport cost matrix by C ∈ R^{p_1×p_2}, which is not necessarily a distance matrix. The introduction of the regularizer ε D_KL(Π, a ⊗ b) gives us an efficient iterative algorithm, known as the Sinkhorn algorithm, for optimizing Π, which we describe in the following sections.
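A small numerical illustration of D_KL(Π, a ⊗ b) as a smoothness measure, using the generalized KL divergence for positive arrays: the product coupling itself attains the minimum value zero, while a coupling concentrated on few entries has large divergence:

```python
import numpy as np

def kl_div(P, Q):
    """Generalized KL divergence sum P ln(P/Q) - sum P + sum Q for positive arrays."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask])) - P.sum() + Q.sum()

p1, p2 = 4, 5
a = np.full(p1, 1 / p1)
b = np.full(p2, 1 / p2)
ref = np.outer(a, b)                        # product distribution a ⊗ b

uniform = ref.copy()                        # maximally smooth coupling
peaked = np.zeros((p1, p2))
peaked[np.arange(4), np.arange(4)] = 1 / 4  # mass concentrated on 4 entries

d_uniform = kl_div(uniform, ref)
d_peaked = kl_div(peaked, ref)
```

The peaked coupling's divergence equals ln 5, since each of its 4 entries of mass 1/4 sits on a reference cell of mass 1/20.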

Unbalanced optimal transport
Before we introduce the Sinkhorn algorithm, we make a final modification to our optimal transport distance that allows us to learn couplings between distributions a ∈ R^{p_1}_+, b ∈ R^{p_2}_+ without exactly preserving mass. In other words, the coupling Π is not required to perfectly satisfy the marginal constraints Π1_{p_2} = a and Π^T 1_{p_1} = b. In our metabolite matching problem, this is particularly useful, as not all metabolites in one dataset necessarily appear in the other dataset, and hence some should be left unmatched. This modification of optimal transport, known as unbalanced optimal transport (UOT) [2], optimizes the cost function

min_{Π ∈ R^{p_1×p_2}_+} ⟨C, Π⟩ + ρ D_KL(Π1_{p_2}, a) + ρ D_KL(Π^T 1_{p_1}, b) + ε D_KL(Π, a ⊗ b),

where we have added two KL terms with regularization parameter ρ to enforce that the marginals Π1_{p_2}, Π^T 1_{p_1} of the coupling are approximately close to the prescribed marginals a, b respectively. We have also kept the smoothness/entropy regularizer ε D_KL(Π, a ⊗ b) from the previous section.

Unbalanced Sinkhorn algorithm
Now we are ready to present the unbalanced Sinkhorn algorithm [1] for optimizing the unbalanced optimal transport cost defined above. First we rewrite our optimization as

min_{u ∈ R^{p_1}_+, v ∈ R^{p_2}_+} min_{Π1_{p_2} = u, Π^T 1_{p_1} = v} ⟨C, Π⟩ + ε D_KL(Π, a ⊗ b) + ρ D_KL(u, a) + ρ D_KL(v, b).

The inner minimization can be solved exactly by introducing dual variables f ∈ R^{p_1}, g ∈ R^{p_2} and writing out the Lagrange dual problem

max_{f,g} min_Π ⟨C, Π⟩ + ε D_KL(Π, a ⊗ b) + ⟨f, u − Π1_{p_2}⟩ + ⟨g, v − Π^T 1_{p_1}⟩,

where we have removed the terms ρ D_KL(u, a) and ρ D_KL(v, b) since they do not depend on Π. Taking the gradient in Π in the inner minimization and setting it to zero we get

C_ij + ε ln(Π_ij/(a_i b_j)) − f_i − g_j = 0,

which implies that

Π_ij = a_i b_j exp((f_i + g_j − C_ij)/ε).

Now we can substitute this expression for Π back into our Lagrange dual problem. First we compute

⟨C, Π⟩ + ε D_KL(Π, a ⊗ b) − ⟨f, Π1_{p_2}⟩ − ⟨g, Π^T 1_{p_1}⟩ = −ε Σ_ij a_i b_j exp((f_i + g_j − C_ij)/ε) + ε Σ_ij a_i b_j.

Hence, the outer maximization in our Lagrange dual problem for f and g can now be written as

max_{f,g} ⟨f, u⟩ + ⟨g, v⟩ − ε Σ_ij a_i b_j exp((f_i + g_j − C_ij)/ε),

where we have removed the last constant sum in a_i b_j. Finally we can rewrite our entire minimization from the start of this section as

min_{u,v} max_{f,g} ⟨f, u⟩ + ⟨g, v⟩ − ε Σ_ij a_i b_j exp((f_i + g_j − C_ij)/ε) + ρ D_KL(u, a) + ρ D_KL(v, b).

By strong duality, we can interchange the minimum and maximum above to write

max_{f,g} −ε Σ_ij a_i b_j exp((f_i + g_j − C_ij)/ε) + U*(f) + V*(g),

where we define the functions

U*(f) = min_u ⟨f, u⟩ + ρ D_KL(u, a),    V*(g) = min_v ⟨g, v⟩ + ρ D_KL(v, b).

In fact, we can solve the minimizations in U* and V* in closed form to get the minimizers u* = a ⊙ exp(−f/ρ) and v* = b ⊙ exp(−g/ρ), which we can substitute back in to get

U*(f) = ρ ⟨a, 1 − exp(−f/ρ)⟩.

Likewise we can see that

V*(g) = ρ ⟨b, 1 − exp(−g/ρ)⟩.

Thus, we can rewrite our full optimization as

max_{f,g} −ε Σ_ij a_i b_j exp((f_i + g_j − C_ij)/ε) − ρ ⟨a, exp(−f/ρ)⟩ − ρ ⟨b, exp(−g/ρ)⟩,

where we have removed the terms independent of f and g.
Note that we can now optimize the cost function above by performing an alternating minimization on the dual variables f and g. Taking the gradient in f_i and setting it to zero we see that

−a_i exp(−f_i/ρ) + a_i exp(f_i/ε) Σ_j b_j exp((g_j − C_{ij})/ε) = 0,

which implies that

f_i = −(ερ/(ε + ρ)) ln Σ_j b_j exp((g_j − C_{ij})/ε).

Similarly, we can write out

g_j = −(ερ/(ε + ρ)) ln Σ_i a_i exp((f_i − C_{ij})/ε).

We are now ready to write out the full unbalanced Sinkhorn algorithm, which performs an alternating minimization on the dual potentials f, g as outlined above. We remind the reader that the coupling matrix can be recovered from the dual potentials by the formula Π_{ij} = a_i b_j exp((f_i + g_j − C_{ij})/ε). The unbalanced Sinkhorn algorithm proceeds as follows.
Algorithm 1: UnbalancedSinkhorn
input: Transport cost C, marginals a, b, marginal relaxation ρ, entropic regularization ε
output: Return the coupling matrix Π
Initialize g = 0
while (f, g) has not converged do
    Set f_i ← −(ερ/(ε + ρ)) ln Σ_j b_j exp((g_j − C_{ij})/ε)
    Set g_j ← −(ερ/(ε + ρ)) ln Σ_i a_i exp((f_i − C_{ij})/ε)
Return the coupling matrix Π_{ij} = a_i b_j exp((f_i + g_j − C_{ij})/ε)

The final output of the Sinkhorn optimization is a real-valued coupling matrix Π ∈ R_+^{p1×p2}. In some cases, it is desirable to transform the coupling matrix into a binary-valued matching matrix M ∈ {0, 1}^{p1×p2}, possibly with the added restriction that there is at most one nonzero element in each row and column (to obtain a valid partial matching). This can be done either by thresholding the real matrix Π or by assigning the maximal entries in each row (or column) to one and setting the remaining entries to zero. For our metabolomics matching problem, we describe our procedure for transforming the real-valued coupling into a binary matching matrix in the section on the GromovMatcher algorithm below.
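The updates in Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the function name and fixed iteration count are our own choices, and for clarity it works directly in the exponential domain rather than the log-domain stabilized form one would use in practice for very small ε.

```python
import numpy as np

def unbalanced_sinkhorn(C, a, b, rho, eps, n_iter=2000):
    """Sketch of unbalanced Sinkhorn: alternate updates of the dual potentials
    f, g, then recover the coupling Pi_ij = a_i b_j exp((f_i + g_j - C_ij)/eps)."""
    scale = eps * rho / (eps + rho)
    f = np.zeros(len(a))
    g = np.zeros(len(b))
    for _ in range(n_iter):
        # f_i <- -(eps*rho/(eps+rho)) ln sum_j b_j exp((g_j - C_ij)/eps)
        f = -scale * np.log(np.sum(b[None, :] * np.exp((g[None, :] - C) / eps), axis=1))
        # g_j <- -(eps*rho/(eps+rho)) ln sum_i a_i exp((f_i - C_ij)/eps)
        g = -scale * np.log(np.sum(a[:, None] * np.exp((f[:, None] - C) / eps), axis=0))
    return a[:, None] * b[None, :] * np.exp((f[:, None] + g[None, :] - C) / eps)
```

For large ρ the KL penalties dominate and the marginals of the returned coupling approach a and b, recovering the balanced Sinkhorn behavior.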

Gromov-Wasserstein
Now that we have introduced the general formulation of unbalanced optimal transport and its corresponding Sinkhorn algorithm, we can extend this formulation to matching problems between distributions of points that live in different dimensional spaces. In our metabolomics setting, we aim to match two datasets of p1 and p2 metabolic features respectively, where the features are associated with feature intensity vectors {X_i}_{i=1}^{p1} ⊂ R^{n1} and {Y_j}_{j=1}^{p2} ⊂ R^{n2} measured across n1 and n2 samples respectively. We assume that there exists a true matching matrix M* ∈ {0, 1}^{p1×p2} with at most one nonzero entry in each row and column such that two metabolites (i, j) are matched if M*_{ij} = 1. We make the further assumption that if feature vectors X_i, Y_j are matched and feature vectors X_k, Y_l are matched under M*, then we expect that

d_x(X_i, X_k) ≈ d_y(Y_j, Y_l),

where d_x is a distance metric on R^{n1} and d_y is a distance metric on R^{n2}. In practice, we always choose these distance metrics to be the normalized Euclidean distance, defined for any u, v ∈ R^n as

d(u, v) = ‖u − v‖² / (2n),

which is equal to the cosine distance d_cos (i.e. one minus the correlation) for centered and scaled data. Given these two distance matrices, we would like to infer the true matching matrix M* by solving an optimization problem.
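As a quick numerical check of the claim above, the following snippet verifies that for centered and scaled vectors the normalized Euclidean distance coincides with one minus the Pearson correlation. Here we take d(u, v) = ‖u − v‖²/(2n); the exact normalization constant is our assumption, chosen so that the identity holds.

```python
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.random(200), rng.random(200)

# center and scale: zero mean, unit (population) standard deviation
zu = (u - u.mean()) / u.std()
zv = (v - v.mean()) / v.std()

# normalized Euclidean distance between the standardized vectors
d = np.sum((zu - zv) ** 2) / (2 * len(zu))

# one minus the Pearson correlation of the original vectors
one_minus_corr = 1 - np.corrcoef(u, v)[0, 1]

assert np.isclose(d, one_minus_corr)
```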
Consider the objective function

E_{X,Y}(Π) = Σ_{i,j,k,l} |d_x(X_i, X_k) − d_y(Y_j, Y_l)|² Π_{ij} Π_{kl},

where the matching matrices Π ∈ {0, 1}^{p1×p2} we optimize over are constrained to satisfy the marginal constraints Π1_{p2} > 0 and Π^T 1_{p1} > 0. These marginal constraints simply impose that there is at least one nonzero entry in each row and column (i.e. each metabolite in both datasets has at least one corresponding match). Searching for the Π minimizing E_{X,Y}(Π) consists of placing the nonzero entries of Π such that the distance profiles of the matched features are similar, so that the minimizer of this criterion provides a good candidate estimate of M*. This is closely related to the Gromov-Hausdorff distance [3], an extension of optimal transport to the case where the sets to be coupled do not lie in the same metric space. In practice, it is often desirable to optimize over a different set of matrices in order to make the optimization problem more tractable. Here we take intuition from optimal transport and search over the set of coupling matrices with marginal constraints

U(a, b) = {Π ∈ R_+^{p1×p2} : Π1_{p2} = a and Π^T 1_{p1} = b}, (22)

where as before, a ∈ R_+^{p1} and b ∈ R_+^{p2} are desired marginals which are typically set to be uniform distributions a = (1/p1) 1_{p1} and b = (1/p2) 1_{p2}. These marginal vectors can be interpreted as distributions of masses a_i and b_j on the feature vectors X_i and Y_j respectively for i ∈ [p1], j ∈ [p2].
Coupling matrices in U(a, b) transport the distribution of masses a in the first dataset to the distribution of masses b in the second dataset. Now we can formulate the Gromov-Wasserstein (GW) distance, introduced by Mémoli [4], as

GW(a, b) = min_{Π ∈ U(a,b)} E_{X,Y}(Π).

By optimizing this objective, each entry Π_{ij} now reflects the strength of the matched pair (X_i, Y_j). Optimizing GW(a, b) then amounts to placing larger entries of Π on pairs of features with similar distance profiles. Before we develop an algorithm to optimize this objective, we first modify it to allow for unbalanced matchings where the marginal constraints are not enforced exactly (e.g. features in both datasets can remain unmatched).
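For small problems, the GW objective E(Π) = Σ_{i,j,k,l} |d_x(X_i, X_k) − d_y(Y_j, Y_l)|² Π_{ij} Π_{kl} can be evaluated directly by forming the four-way tensor of pairwise distance discrepancies. A minimal sketch (the function name is ours; this direct form is only suitable for small p1, p2):

```python
import numpy as np

def gw_objective(Dx, Dy, Pi):
    """Direct evaluation of the GW objective
    E(Pi) = sum_{i,j,k,l} |Dx[i,k] - Dy[j,l]|^2 Pi[i,j] Pi[k,l]
    by building the full (p1, p2, p1, p2) tensor of squared discrepancies."""
    L = (Dx[:, None, :, None] - Dy[None, :, None, :]) ** 2  # L[i,j,k,l]
    return np.einsum('ijkl,ij,kl->', L, Pi, Pi)
```

Memory grows as p1²p2², so real implementations instead expand the square and work with matrix products, as done in the optimization section below.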

Unbalanced Gromov-Wasserstein
In an untargeted context, the features measured in one study are not necessarily all observed in another, either because these features are truly not shared or because of measurement error. However, the constraint Π ∈ U(a, b) in the original GW optimization criterion (23) ensures that all the mass is transported from one set to another, resulting in all features being matched across studies. In order to discard study-specific features during the GW computation, we use the unbalanced Gromov-Wasserstein (UGW) distance with an additional entropic regularization for computational purposes, described in Séjourné et al. [5]. The optimization problem therefore reads

UGW_{ρ,ε} = min_{Π ∈ R_+^{p1×p2}} E_{X,Y}(Π) + ρ D_KL(Π1_{p2} ⊗ Π1_{p2}, a ⊗ a) + ρ D_KL(Π^T 1_{p1} ⊗ Π^T 1_{p1}, b ⊗ b) + ε D_KL(Π ⊗ Π, (a ⊗ b) ⊗ (a ⊗ b)),

with ρ, ε > 0. Here D_KL is the Kullback-Leibler divergence defined in the previous sections and the tensor product is defined by (P ⊗ P)_{i,j,k,l} = P_{i,j} P_{k,l}. As before, we set the desired marginals to a = (1/p1) 1_{p1} and b = (1/p2) 1_{p2}. As in the case of unbalanced optimal transport [2], the Kullback-Leibler divergences weighted by ρ allow for the relaxation of the marginal constraints Π1_{p2} = a and Π^T 1_{p1} = b. The value of ρ > 0 controls the extent to which we allow for mass destruction: smaller values of ρ loosen the constraint on the marginals of Π, while balanced GW is recovered as ρ → +∞. As proposed in the original paper [5], our UGW cost modifies the UOT formulation by applying the Kullback-Leibler divergence to the quadratic quantities Π1_{p2} ⊗ Π1_{p2} and Π^T 1_{p1} ⊗ Π^T 1_{p1} instead, hence preserving the quadratic form of the GW cost function E(Π).
The last term ε D_KL(Π ⊗ Π, (a ⊗ b) ⊗ (a ⊗ b)) serves as an entropic regularization, inspired again by optimal transport. Adding such a penalty is a standard way to compute an approximate solution to the optimal transport problem using the Sinkhorn algorithm, as we show in the following section. Here again, we modify the entropic penalty in UGW to have a quadratic form in Π ⊗ Π to agree with the quadratic form of the GW cost E(Π). The parameter ε controls the smoothness (entropy) of the coupling matrix Π: larger values of ε encourage Π to put uniform weights on many of its entries, leading to less precision in the feature matches. However, increasing ε also leads to better numerical stability and a significant speedup of the alternating Sinkhorn algorithm used to optimize the objective function described below.

UGW optimization algorithm
Now we are ready to write out an algorithm to optimize the UGW objective in (25), which we denote by L_{ρ,ε}(Π). Using the quadratic nature of our cost function, we aim to perform an alternating minimization in the two copies of Π. For the moment, let us differentiate these two copies by Π and Γ and write the new cost

F_{ρ,ε}(Π, Γ) = Σ_{i,j,k,l} |D^x_{ik} − D^y_{jl}|² Π_{ij} Γ_{kl} + ρ D_KL(Π1_{p2} ⊗ Γ1_{p2}, a ⊗ a) + ρ D_KL(Π^T 1_{p1} ⊗ Γ^T 1_{p1}, b ⊗ b) + ε D_KL(Π ⊗ Γ, (a ⊗ b) ⊗ (a ⊗ b)),

which satisfies F_{ρ,ε}(Π, Π) = L_{ρ,ε}(Π). Before we expand this cost, we introduce the notation m(π) to denote the sum of the elements of π, which can be a vector, matrix or tensor. In general, for four positive distributions π, a ∈ R_+^p and γ, b ∈ R_+^q, the KL divergence satisfies the tensorization property

D_KL(π ⊗ γ, a ⊗ b) = m(γ) D_KL(π, a) + m(π) D_KL(γ, b) + (m(π) − m(a))(m(γ) − m(b)).

Applying this property to the marginal constraints a ∈ R_+^{p1}, b ∈ R_+^{p2} and couplings Π, Γ ∈ R_+^{p1×p2}, each KL term in F_{ρ,ε}(Π, Γ) decomposes into a part linear in m(Γ) and a part proportional to m(Π) times a KL divergence in Γ, where all terms independent of Γ may be removed. Finally, expanding out F_{ρ,ε}(Π, Γ) and keeping only those terms that depend on Γ, we get

⟨C_Π, Γ⟩ + m(Π) [ ρ D_KL(Γ1_{p2}, a) + ρ D_KL(Γ^T 1_{p1}, b) + ε D_KL(Γ, a ⊗ b) ],

where the cost matrix C_Π ∈ R^{p1×p2} is given (up to additive constants depending only on m(Π) and the marginals) by

(C_Π)_{ij} = Σ_{k,l} |D^x_{ik} − D^y_{jl}|² Π_{kl} + ρ D_KL(Π1_{p2}, a) + ρ D_KL(Π^T 1_{p1}, b) + ε D_KL(Π, a ⊗ b),

where we have hidden the dependence of C_Π on the distance matrices D_x, D_y, the marginals a, b, and the regularization parameters ρ, ε for ease of notation.
Remarkably, the cost above in Γ for fixed Π is in the form of an unbalanced optimal transport problem which can be solved through unbalanced Sinkhorn iterations (Algorithm 1). Note that in our derivation above, it did not matter whether we optimized Γ with Π fixed or vice versa because the cost F ρ,ε (Π, Γ) is symmetric in both of its arguments.
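The dominant quadratic part of the local cost, Σ_{k,l} |D^x_{ik} − D^y_{jl}|² Π_{kl}, need not be formed as a four-way tensor: expanding the square reduces it to a few matrix products. A sketch of this standard expansion (the function name is ours, and the entrywise-constant KL offsets of C_Π are omitted):

```python
import numpy as np

def quadratic_cost(Dx, Dy, Pi):
    """Quadratic part of the local cost:
    (C_Pi)_ij = sum_{k,l} (Dx[i,k] - Dy[j,l])^2 * Pi[k,l],
    computed via (a - b)^2 = a^2 - 2ab + b^2 using matrix products only."""
    p = Pi.sum(axis=1)              # row marginal Pi 1
    q = Pi.sum(axis=0)              # column marginal Pi^T 1
    tx = (Dx ** 2) @ p              # sum_k Dx[i,k]^2 p_k,  shape (p1,)
    ty = (Dy ** 2) @ q              # sum_l Dy[j,l]^2 q_l,  shape (p2,)
    cross = Dx @ Pi @ Dy.T          # sum_{k,l} Dx[i,k] Pi[k,l] Dy[j,l]
    return tx[:, None] + ty[None, :] - 2 * cross
```

This brings the per-iteration cost down from O(p1²p2²) to a handful of dense matrix multiplications.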
Our iterative algorithm for solving the unbalanced GW problem will proceed at each iteration by optimizing Γ to minimize the cost above using the unbalanced Sinkhorn method, setting Π equal to Γ and repeating. With each iteration, we expect this iterative procedure to make smaller and smaller updates to Γ until convergence. By definition, at the end of each iteration we assign Π = Γ so the minimizer of F ρ,ε (Π, Γ) we converge to should also be a minimizer of the original UGW cost L ρ,ε (Π) in the sense that the relaxation of L ρ,ε (Π) to F ρ,ε (Π, Γ) is tight. This is proven rigorously under strict mathematical assumptions in [5]. We state the full UGW optimization algorithm below.
Algorithm 2: UnbalancedGromovWasserstein
input: Distance matrices D_x, D_y, marginals a, b, marginal relaxation ρ, entropic regularization ε
output: Return the coupling matrix Π

Following the implementation of the UGW algorithm in [5], we initialize both Π and Γ to the product distribution a ⊗ b / √(m(a) m(b)) before we begin the optimization. Also, we note that if (Π, Γ) is a minimizer of our UGW objective F_{ρ,ε}(Π, Γ), then so is (Γ, Π) by the symmetry of F. Returning to our metabolomics matching problem, we further guide our UGW optimization procedure by discouraging it from matching metabolic feature pairs whose mass-to-charge ratios are incompatible. Namely, we choose a value m_gap such that for all pairs (i, j) with i ∈ [p1], j ∈ [p2] and mass-to-charge ratios m^x_i, m^y_j we enforce that

|m^x_i − m^y_j| ≤ m_gap.

In practice, this is done by taking the optimal transport cost C_Π in every iteration of the UGW algorithm and premultiplying it elementwise by a factor W ∈ R_+^{p1×p2} given by

W_{ij} = 1 + c · 1{|m^x_i − m^y_j| > m_gap},

where 1_X denotes the indicator function that is one when the condition X is satisfied and zero otherwise, and c is a large constant. Such a prefactor makes the transport cost very large for feature matches with incompatible mass-to-charge ratios, and hence the coupling Π places very little mass on these entries. Our weighted UGW algorithm is stated below as Algorithm 3.
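The m/z compatibility prefactor W can be built in one vectorized step. A sketch (the function name and the penalty constant are illustrative choices; any sufficiently large penalty serves to suppress incompatible pairs):

```python
import numpy as np

def mz_weight_matrix(mz_x, mz_y, m_gap=0.01, penalty=1e6):
    """Elementwise prefactor for the transport cost: entries whose m/z
    difference exceeds m_gap receive a very large multiplier, so the
    coupling places negligible mass there."""
    diff = np.abs(mz_x[:, None] - mz_y[None, :])  # |m^x_i - m^y_j|, shape (p1, p2)
    return np.where(diff <= m_gap, 1.0, penalty)
```

The weighted cost W ⊙ C_Π is then passed to the unbalanced Sinkhorn step at every iteration.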
As mentioned before, the coupling matrix returned by our weighted UGW algorithm is a real-valued matrix rather than a binary matching matrix. In the next section, we describe how we incorporate metabolite retention time information to filter out unlikely pairs in our coupling matrix and transform it into a valid one-to-one matching of features across two datasets.

Retention time drift estimation and filtering
To filter out unlikely matches from the coupling matrix returned by Algorithm 3 below, we use the retention times (RTs) of the metabolites in both datasets.

Algorithm 3: WeightedUnbalancedGromovWasserstein
input: Distance matrices D_x, D_y, marginals a, b, marginal relaxation ρ, entropic regularization ε, mass-to-charge ratios m^x, m^y, mass-to-charge ratio gap m_gap
output: Return the coupling matrix Π

We remind the reader that RTs were not incorporated into the weighted UGW algorithm, since they often exhibit a non-linear deviation between datasets and hence are not directly comparable. However, using the metabolite coupling Π̂ ∈ R_+^{p1×p2} obtained from Algorithm 3, it is possible to estimate this RT drift. The estimated RT drift f̂ : R_+ → R_+ allows us to assess the plausibility of the pairs recovered by the restricted UGW coupling Π̂, and to discard pairs incompatible with the estimated drift.
We propose to learn the drift f̂ through the weighted spline regression

f̂ ∈ argmin_{f ∈ B_{n,k}} Σ_{i,j} Π̂_{ij} (RT^y_j − f(RT^x_i))², (34)

where B_{n,k} is the set of n-order B-splines with k knots. All pairs (RT^x_i, RT^y_j) in objective (34) are weighted by the coefficients of Π̂ so that larger weights are given to pairs identified with high confidence in the first step of our procedure.
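The weighted regression above can be sketched as follows. As a self-contained stand-in for the B-spline fit we use a weighted cubic polynomial via numpy.polyfit (passing the square root of the coupling weights, since polyfit squares its weights internally); the function name is ours:

```python
import numpy as np

def estimate_rt_drift(Pi, rt_x, rt_y, deg=3):
    """Weighted fit of the RT drift: each candidate pair (i, j) with
    Pi[i, j] != 0 contributes the point (rt_x[i], rt_y[j]) with weight
    Pi[i, j]. A weighted cubic polynomial stands in for the B-spline."""
    idx_i, idx_j = np.nonzero(Pi)
    x = rt_x[idx_i]                      # RTs in dataset 1
    y = rt_y[idx_j]                      # RTs in dataset 2
    w = np.sqrt(Pi[idx_i, idx_j])        # polyfit minimizes sum (w*(y - f(x)))^2
    coeffs = np.polyfit(x, y, deg=deg, w=w)
    return np.poly1d(coeffs)             # callable drift estimate
```

In the actual pipeline a B-spline basis with cross-validated knots would replace the polynomial, but the weighting scheme is the same.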
Pairs identified as incompatible with the estimated RT drift are then discarded from the coupling matrix. To do this, we first take the estimated RT drift f̂ and the set of pairs S = {(i, j) : Π̂_{i,j} ≠ 0} recovered in Π̂ with nonzero entries. We then define the residual associated with (i, j) ∈ S as

r_{i,j} = RT^y_j − f̂(RT^x_i).

The 95% prediction interval and the median absolute deviation (MAD) of these residuals are given by

PI = 1.96 × std({r_{i,j} : (i, j) ∈ S}),   MAD = median({|r_{i,j} − median(r)| : (i, j) ∈ S}),

where the functions std and median denote the standard deviation and median computed over the |S| residuals, with |S| the size of S. Following [6], we then create a new filtered coupling matrix Π̃ ∈ R_+^{p1×p2} given by

Π̃_{i,j} = Π̂_{i,j} 1{|r_{i,j}| < r_thresh},

where r_thresh is a given filtering threshold. The procedure of estimating the drift function f̂ in (34) and filtering the coupling can be repeated for multiple iterations to improve the drift and coupling estimation. In our main algorithm, we use two preliminary iterations where we estimate the RT drift and discard outliers with r_thresh = PI, defined as points falling outside of the 95% prediction interval. We then re-estimate the drift and perform a final filtering step with the more stringent threshold r_thresh = 2 × MAD. At this stage, it is possible for Π̃ to still contain coefficients of very small magnitude. As an optional postprocessing step, we discard these coefficients by setting all entries smaller than τ max(Π̃) to zero for some scaling constant τ ∈ [0, 1]. Lastly, a feature from either study could have multiple possible matches, since Π̃ can have more than one nonzero coefficient per row or column. Although reporting multiple matches can be helpful in an exploratory context, for the sake of simplicity in our analysis, the final output of GromovMatcher returns a one-to-one matching.
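The residual-based filtering step can be sketched as follows (function names ours; the drift estimate is passed as any callable):

```python
import numpy as np

def filter_by_rt_drift(Pi, rt_x, rt_y, drift, r_thresh):
    """Zero out coupling entries whose RT residual
    r_ij = rt_y[j] - drift(rt_x[i]) exceeds the threshold in magnitude."""
    idx_i, idx_j = np.nonzero(Pi)
    residuals = rt_y[idx_j] - drift(rt_x[idx_i])
    keep = np.abs(residuals) < r_thresh
    filtered = Pi.copy()
    filtered[idx_i[~keep], idx_j[~keep]] = 0.0
    return filtered

def mad(residuals):
    """Median absolute deviation of the residuals."""
    return np.median(np.abs(residuals - np.median(residuals)))
```

In the pipeline, two passes with the prediction-interval threshold precede a final pass with the stricter threshold 2 × MAD.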
Consequently, we only keep those metabolite pairs (i, j) where the entry Π ij is largest in its corresponding row and column. All nonzero entries of Π which do not satisfy this criterion are set to zero. Finally, we convert Π into a binary matching matrix M ∈ {0, 1} p1×p2 with ones in place of its nonzero entries and this final output is returned to the user.
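Keeping only entries that are maximal in both their row and column can be sketched as (function name ours):

```python
import numpy as np

def one_to_one_matching(Pi):
    """Binary matching matrix M keeping only pairs (i, j) where Pi[i, j]
    is the largest nonzero entry of both its row and its column."""
    M = np.zeros_like(Pi, dtype=int)
    for i, j in zip(*np.nonzero(Pi)):
        if Pi[i, j] == Pi[i, :].max() and Pi[i, j] == Pi[:, j].max():
            M[i, j] = 1
    return M
```

By construction M has at most one nonzero entry per row and column (up to exact ties), yielding a valid partial one-to-one matching.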
As a naming convention, we use the abbreviation GM for our GromovMatcher method, and use the abbreviation GMT when running GromovMatcher with the optional τ -thresholding step.

GromovMatcher algorithm summary
In summary, our full GromovMatcher algorithm consists of (1) UGW optimization followed by (2) retention time drift estimation and filtering.
The tuning of ρ and ε was computationally driven, and the two parameters were set as low as possible, with ρ = 0.05 and ε = 0.005. Based on the literature [6][7][8][9][10] and what is considered to be a plausible variation of a feature's m/z, we set m_gap = 0.01. For RT drift estimation, the order of the B-splines was set to n = 3 by default, while the number of knots k was selected by 10-fold cross-validation. If the optional thresholding step is applied in GMT, we set τ = 0.3; otherwise, we let τ = 0, which gives the unthresholded GM algorithm. The algorithm returns the matching matrix M and the estimated RT drift f̂.

M2S hyperparameter experiments
M2S [9] depends on a collection of hyperparameters that we fine-tune before applying it to our simulated and experimental datasets. This algorithm follows a three-step process to align two LC-MS datasets. First, it matches all pairs of metabolic features whose absolute differences in m/z, RT, and median of log10 FI are within the user-defined thresholds 'MZ_intercept', 'RT_intercept' and 'log10FI_intercept'. On simulated data experiments, we set these thresholds to MZ_intercept = 0.01, RT_intercept = 3.5 and log10FI_intercept = 0.2, which are large enough not to exclude any true feature matches in any of the scenarios for our simulated data under low, medium, and high overlap/noise (see Methods). M2S also offers more detailed options to match features whose absolute difference stays within two lower and upper bound lines with a given slope, where the intercepts of these lines are defined using the values above. In our analysis, we set the slopes of these linear boundaries to zero so as to not remove any true matches. Because the reference and target studies we are matching in the simulated analysis are on the same scale, we set the FI adjust method to 'none'.
The second step of M2S involves calculating penalization scores for every pair of matches which are used to determine the best set of matches between metabolic features of both datasets. This step depends on a set of hyperparameters which we perform a grid search over to optimize the performance of M2S. For estimating the m/z, RT, and FI drift, the hyperparameters are the percentage of neighbors 'nrNeighbors', the neighborhood shape 'neighMethod', and the LOESS span percentage 'pctPointsLoess' used to smooth the estimated drift functions. After the drifts are estimated, they are normalized using a method specified by 'residPercentile' that puts the m/z, RT, and FI residuals on the same scale. We always fix residPercentile = NaN which defaults to the standard 2 × MAD normalization. Next, for every remaining metabolic feature match, the residuals/drifts of the m/z, RT, and FI are added together by taking the weighted square root sum of squares. For unnormalized data where feature intensity magnitudes are important, we weight all three drifts equally using W = (1, 1, 1) and for data with normalized feature intensities we set the FI drift weight to zero such that W = (1, 1, 0). Finally, using these weighted penalization scores, M2S selects the best matched pair within a multiple match cluster to obtain a one-to-one matching between datasets.
The third and final step of M2S involves removing those remaining matches which have large differences in m/z, RT, or FI. This can be performed using several methods indicated by the hyperparameter 'methodType'. Each method excludes those matched pairs whose differences in m/z, RT, or FI exceed a certain number of median absolute deviations indicated by the parameter 'nrMAD'. The remaining one-to-one metabolic feature matches are returned as the final result of the M2S algorithm.
To optimally tune M2S on our simulated experiments, we determine the optimal M2S parameter combination for each individual simulation setting (low, medium, high overlap and noise) by performing a grid search over the product of parameter lists, including:
• nrMAD = [1, 3, 5]
Each parameter combination for M2S is tested across 20 randomly generated datasets at the same overlap and noise settings. For each setting, the combination of parameters with the best average F1-score across these 20 trials is used as the optimal parameter choice.
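The grid search itself is a plain Cartesian product over the parameter lists. A generic sketch, where `score_fn` is a hypothetical stand-in for running M2S on the 20 simulated dataset pairs and averaging the F1-scores:

```python
import itertools

def grid_search(param_grid, score_fn):
    """Exhaustive search over the Cartesian product of parameter lists,
    returning the combination with the best score. `param_grid` maps
    parameter names to lists of candidate values."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(params)  # e.g. mean F1 over the 20 simulated trials
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

The same helper applies to any of the M2S parameter lists; only the scoring callable changes between simulation settings.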
M2S applies initial RT thresholds to search for candidate pairs, which may favor settings where the RT drift follows a linear trend. Therefore, as a sensitivity analysis, we apply M2S to simulated data with a linear drift. The simulation process is identical to that of our main simulation study, except for the deviation of the RT in dataset 2. Specifically, for a given overlap value, we divide the original real-world dataset into two smaller datasets and introduce random noise to the m/z, RT and intensities of the features, without introducing a systematic deviation to the RT in dataset 2. M2S parameters are kept identical to the ones used in our main analysis in comparable settings. The results obtained by M2S on three pairs of datasets generated for three overlap values (0.25, 0.5 and 0.75) and a medium noise level are reported in Supplementary Table II. While the results obtained in the high overlap setting are close to those of our main analysis, M2S demonstrates better performance in the low overlap setting when the RT drift is linear than it does in our main analysis. This observation is consistent with the results obtained by M2S on EPIC data, considering the relatively low estimated overlap between the aligned EPIC studies in our main analysis.
For the EPIC data, we select the parameter combination that yields the highest F1-score across all simulated settings. However, due to the unavailability of oracle values for setting initial thresholds, we perform a search over several MZ intercept values (0.01, 0.05, and 0.1), RT intercept values (0.1, 0.5, 1, and 5), and logFI intercept values (1, 10, and 100).

MetabCombiner hyperparameter experiments
MetabCombiner [6] includes adjustable parameters throughout the pipeline. We set most of them to default values unless otherwise stated. The 'binGap' parameter sets the m/z tolerance of metabCombiner, similar to m gap in GromovMatcher. We used the same value of 0.01 as in GromovMatcher.
MetabCombiner first establishes candidate pairs by binning features in the m/z dimension with a width of binGap, and pairing the features sorted by relative intensities. It then estimates the RT drift using basis splines, and removes pairs associated with a high residual (twice the mean model error) from the candidate set. In our main experiment, the RT drift is estimated exclusively using candidate pairs selected by the pipeline. However, it is also possible to include known ground truth pairs as 'anchors' to estimate the RT drift. Thus, we conduct a sensitivity analysis comparing the results obtained in our main experiment with those obtained when supplying metabCombiner with known shared metabolites to anchor the RT drift estimation. We randomly select 100 anchors from the ground truth matching and compute the metabCombiner matchings with otherwise identical settings as in our main experiment. The results from this analysis (reported in Supplementary Figure 6) show that the unsupervised RT drift estimation (using anchors selected by the pipeline only) performs as well as the supervised RT drift estimation.
After establishing candidate pairs and filtering out those that contradict the estimated RT drift, metabCombiner discriminates between multiple matches using a scoring system that considers m/z, RT, and rankings of the median feature intensities. Each dimension has a specific weight that can be left at default, manually adjusted, or automatically tuned using known matched pairs. Habra et al. [6] provide qualitative guidelines for tuning the weights manually, mainly based on the experimental conditions and visual inspection of the RT drift plot. Since this approach is difficult to implement in the various settings we consider for our simulation study, we rely on the quantitative tuning function included in the metabCombiner pipeline. This function takes into account known shared features and tunes the weights to optimize the scores of those known matches. We randomly select 100 known true matches to define the objective function metabCombiner maximizes. We search over the recommended range of values, with the m/z weight A ∈ [50, 150], the RT weight B ∈ [5,20] and the feature intensities weight C ∈ [0, 1]. Supplementary Figure 6 presents the results obtained with the weights set at default values (A = 100, B = 15, C = 0.5), as a sensitivity analysis.

Centered and scaled data -Negative mode
In this section, we present the results obtained on centered and scaled EPIC data in negative mode, shown in Figure  4 of our main paper. However, due to the smaller size of the validation subset (42 features examined in negative mode compared to 163 in positive mode), the evaluation of the performance of the three methods may be less reliable than in positive mode.
First, we align the CS and HCC studies in negative mode and detect a total of 449, 492, and 180 matches with GM, M2S, and metabCombiner, respectively. Similar to the positive mode analysis, we evaluate the precision and recall of the three methods on the 42-feature validation subset, of which 19 features were manually matched. GM and M2S demonstrate identical F1-scores of 0.98, while metabCombiner performs poorly in comparison. GM recovers all 19 true matches and identifies only 1 false positive, while M2S reports no false positives but misses 1 true positive.
Next, we align the CS and PC studies in negative mode and detect a total of 485, 569, and 314 matches with GM, M2S, and metabCombiner, respectively. Again, we evaluate the precision and recall of the three methods on the 42-feature validation subset, of which 26 features were manually matched. MetabCombiner performs better than in the other EPIC pairings with an F1-score of 0.857, but is still outperformed by the other two methods. GM is slightly outperformed by M2S in this setting: both achieve an almost identical precision of 0.93, but M2S attains a slightly higher recall by detecting 1 additional true positive. This nevertheless remains a good performance for GM, since M2S was optimally tuned using the validation subset itself.
Non-centered and non-scaled data

As a sensitivity analysis, we apply the three methods to EPIC data that has not been centered or scaled. The detailed results can be found in Supplementary Table III. M2S was tuned manually on the validation subset to ignore feature intensities in both cases. As a result, it maintains its performance compared to our main experiment. On the other hand, the performance of GM and metabCombiner is affected by the lack of consistency in feature intensities. MetabCombiner's recall drops slightly but its precision remains comparable to that of our main experiment, with the method clearly favoring precision over recall. Although GM's recall decreases slightly in positive mode, it remains more precise than the optimally tuned M2S, and it balances precision and recall better than metabCombiner. Interestingly, GM's results in negative mode are improved compared to our main experiment, and it outperforms both metabCombiner and M2S. However, since the validation subset in negative mode is relatively small, these differences may not be significant. Nonetheless, GM maintains a good performance, similar to that of the optimally tuned M2S.
Similar to the analysis we conducted on centered and scaled data, we find a high number of false positives when aligning the CS study and the PC study in positive mode. Therefore, we manually examine the matches recovered by GM. Our examination reveals 2 false positives, 4 unclear matches, and 3 additional good matches that GM also identifies in our main analysis. This demonstrates that the lack of centering and scaling results in two additional false positives for GM that are not present in our main results.

Illustration for alcohol biomarker discovery
Loftfield et al. [7] identified 205 features associated with alcohol intake in the CS study, using a false discovery rate (FDR) correction to account for multiple testing. By applying an FDR correction in our pooled analysis, we identify 243 features associated with alcohol intake. Out of those 243 features, 185 are consistent with the features identified in the discovery step of Loftfield et al. [7], while 55 features are newly discovered (Extended Data Fig. 5c). We examine the 20 features identified as significant in Loftfield et al.'s discovery analysis but that are not significant in our pooled analysis. Both manual and GM matching yield identical results for these features, indicating that the loss of significance is not due to incorrect matching. Upon further investigation, we find that these features do not demonstrate a meaningful association with alcohol intake in the HCC and PC studies. This observation is reinforced by the fact that none of these features are among the 10 features that persisted after the validation step in Loftfield et al.
Out of the 205 features initially discovered in Loftfield et al. [7], 10 are replicated in the EPIC HCC and PC studies using the more stringent Bonferroni correction. When using a Bonferroni correction in our pooled analysis, we find significant association between alcohol intake and 92 features, 36 of which are effectively shared by the three studies. Notably, these features include all 10 features that were retained in Loftfield et al. (Extended Data Fig. 5c).
This analysis illustrates how GromovMatcher can be used in the context of biomarker discovery, and its potential to allow for increased statistical power.

Supplementary Figure 6 (caption): The first setting corresponds to the design of our main analysis, where 100 randomly selected true pairs are supplied to metabCombiner to set the scoring weights automatically, but are not otherwise used. In the second setting, labelled 'Scores + RT', metabCombiner is allowed to use the 100 true pairs not only to set the scoring weights, but also to estimate the RT drift. Finally, in the third 'Default' setting, we do not use any prior knowledge for the RT drift estimation and keep the scoring weights' default values.

Supplementary Table III (caption): Precision and recall on the EPIC validation subset for unnormalized data in (a) positive mode, and (b) negative mode. 95% confidence intervals were computed using modified Wilson score intervals [11,12].