An optimal transport approach for combining untargeted metabolomics datasets (GromovMatcher).

(a) Inputs are two LC-MS datasets of unlabeled metabolic features (rows) identified by their m/z, RT, and feature intensities across biospecimen samples. Both studies can have differing numbers of metabolic features and samples. (b) In both datasets, the intensities across samples of each metabolic feature are formed into a vector and Euclidean distances between these feature vectors are computed and stored in a distance matrix. (c) Based on the technique of optimal transport, the unbalanced GW algorithm learns a coupling matrix that places large weights when fxi and fyy likely correspond to the same metabolic feature. It optimizes to match features with similar pairwise distances (red outlined boxes) whose m/z ratios are close. (d) The final step of GromovMatcher plots the retention times of features from both datasets against each other and fits a spline interpolation f weighted by the estimated coupling weights n. This retention time drift function is then used to set all entries to zero for those outlier pairs (fxi, fyy) which exceed twice the median absolute deviation (MAD) around (green highlighted region). Finally, the coupling matrix is filtered and/or thresholded to obtain a refined coupling which is then binarized to obtain a one-to-one matching M between a subset of metabolite pairs in both datasets.

Simulated data for testing untargeted metabolomics alignment methods.

(a) Initial LC-MS dataset taken from the EXPOsOMICS project with m/z, RT, and feature intensities of p = 4, 712 metabolites identified in cord blood across n = 499 newborns. (b) Newborns (rows) are split into two disjoint groups of sizes n1 = 249 and n2 = 250 respectively and metabolic features (columns) are split into two equal groups of size p1 = p2 with overlap λp where λ = 0.25, 0.5, 0.75 (Methods). Datasets are perturbed by additive noise of magnitude (σM, σRT, σFl) and a nonlinear drift f (x) is applied to the RTs of dataset 2. (c) The two resulting datasets share λ = 25%, 50%, or 75% of the original dataset’s metabolic features.

Comparison of MetabCombiner, M2S, and GromovMatcher on simulated data.

(a) Ground-truth matchings, and matchings inferred by metabCombiner, M2S, GM, and GMT. Pairs of datasets are generated for three levels of overlap (low, medium and high), with a medium noise level (Methods). Matches correctly recovered (true positives) are represented in green. True matches that are not recovered (false negatives) are highlighted in grey. Incorrect matches (false positives) are plotted in red. Features in rows and columns of matching matrices are reordered for visual clarity. (b) Average precision and recall on 20 randomly generated pairs of datasets, for three levels of overlap (low, medium and high) with a medium noise level.

Application of GromovMatcher and comparison to existing methods on EPIC dataset.

(a) Dimensions of the three EPIC studies used. For each ionization mode, the cross-sectional (CS) study is aligned successively with the hepatocellular carcinoma (HCC) study and the pancreatic cancer (PC) study. (b) Demonstration of expert manual matching and GromovMatcher (GM) matching between the CS and HCC studies in positive mode. Experts manually match 90 features from Loftfield et al. (2021) and the correlation matrices of these features in both datasets have similar structure (bottom two matrices). GM discovers 996 shared features between the CS and HCC datasets which have similar correlation structure (top two matrices). We validate that 88 of the 90 features from the manually expert matched subset are contained in the set of features matched by GM. (c) Performance of metabCombiner (mC), M2S and GM in positive mode. Precisions and recalls are measured on a validation subset of 163 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. (d) Performance of mC, M2S and GM in negative mode. Precision and recall are measured on a validation subset of42 manually examined features, and 95% confidence intervals are computed using modified Wilson score intervals. See Table 2 and Table 3 for exact precisions, recalls, and confidence intervals in positive and negative mode respectively.

Results from the manual matching conducted for Loftfield et al. Loftfield et al. (2021).

Features from the CS study (163 features in positive mode, 42 features in negative mode) were manually investigated for matches in the HCC and PC studies.

Precision and recall on the EPIC validation subset in positive mode.

95% confidence intervals were computed using modified Wilson score intervals Brown et al. (2001); Agresti and Coull (1998).

Precision and recall on the EPIC validation subset in positive mode.

95% confidence intervals were computed using modified Wilson score intervals Brown et al. (2001); Agresti and Coull (1998).

Comparison of GromovMatcher and Loftfield et al. Loftfield et al. (2021) analysis for alcohol biomarker discovery on EPIC data.

(a) Loftfield study implemented a discovery step, examining the relationship between alcohol intake and metabolic features in the CS study. The significant features in CS were manually matched to features from the HCC and PC and the analysis was repeated using samples from the HCC and PC studies. After this step, 10 features associated with alcohol intake were identified. (b) GromovMatcher analysis begins by matching features from CS study to HCC and PC studies respectively (top blue, yellow, and red boxes). Samples corresponding to each CS feature are combined with the samples of its matched feature in the HCC study, PC study, or both. This generates a larger pooled data matrix with the same number of features as the CS study but with more samples pooled across the three original studies (center matrix). Because some features in the CS study may not have matches in HCC or PC, the corresponding entries in the pooled matrix are set to NaN/missing values (white regions in matrix). Each column/feature in this matrix is statistically tested for association with alcohol intake (ignoring missing values) and an FDR or a stricter Bonferroni correction is performed to retain only a subset of features from the pooled study that have a strong association. (c) Venn diagrams show intersection of feature sets (in positive and negative mode) found to be associated with alcohol intake by one of the four different analyses.

UnbalancedSinkhorn

UnbalancedGromovWasserstein

WeightedUnbalancedGromovWasserstein

GromovMatcher

Performance of metabCombiner with the different parameter settings.

The first setting, labelled ‘Scores’ correspond to the design of our main analysis, where 100 randomly selected true pairs are supplied to metabCombiner to set the scoring weights automatically, but are not otherwise used. In the second setting, labelled ‘Scores + RT’, metabCombiner is allowed to use the 100 true pairs not only to set the scoring weights, but also to estimate the RT drift. Finally, in the third ‘Default’ setting, we do not use any prior knowledge for the RT drift estimation and keep the scoring weights’ default values.

Performance of M2S in a setting where the RT drift between studies is linear.

Sensitivity of thresholded GromovMatcher (GMT) to feature overlap fraction λ, feature imbalance fraction λf, and sample imbalance fraction λs between two datasets being matched.

Sensitivity of M2S to feature overlap fraction λ, feature imbalance fraction λf, and sample imbalance fraction λs between two datasets being matched.

Sensitivity of metabCombiner (mC) to feature overlap fraction λ, feature imbalance fraction λf, and sample imbalance fraction λs between two datasets being matched.

Precision and recall on the EPIC validation subset for unnormalized data in (a) positive mode, and (b) negative mode.

95% confidence intervals were computed using modified Wilson score intervals Brown et al. (2001); Agresti and Coull (1998).

Overlap between the 706 features common to the HCC and PC studies found via reference matching, and the 938 features common to HCC and PC found by direct matching

Overlap between the features identified as common to the three EPIC studies using either the CS study or the HCC study as a reference.

Average precision and recall obtained on simulated data, with fixed overlap λ = 0.5.

The noise level corresponds to different values of σRT and σFI. High, medium and low noise level correspond to (σRT, σFI) = (0.2, 0. 1), (0.5, 0.5) and (1, 1) respectively. We run 20 simulations for each setting.

Performance on centered and scaled data.

The feature intensities of both datasets are centered and scaled to have means of 0 and standard deviations of 1. The average precision and recall of the three methods are computed on 20 randomly generated pairs of datasets, for (a) three levels of overlap (low, medium and high corresponding to λ = 0.25, 0.5 and 0.75, respectively) with a medium noise level ((σRT, σFI) = (0.5, 0.5)), and (b) fixed medium overlap (λ = 0.5) and three different noise levels (low, medium and high corresponding to (σRT, σFI) = (0.2, 0.1), (0.5, 0.5) and (1, 1), respectively).

Consistency of the mean feature intensities (FI) in EPIC.

Each scatter plot represents the mean feature intensities of manually matched features from the validation subset. Each dot represents a pair of manually matched features. The axis represent the mean feature intensities recorded in the two different studies.

Overlap between the matching results obtained by metabCombiner, M2S and GromovMatcher in EPIC. Venn diagrams are not up to scale.

Estimated RT drift between the EPIC studies aligned in the main experiment.

Each dot correspond to a candidate matched pair after the first step of GM (m/z constrained GW matching), before the RT drift estimation and RT based filtering.