MS1Connect: a mass spectrometry run similarity measure

Abstract
Motivation
Interpretation of newly acquired mass spectrometry data can be improved by identifying, from an online repository, previous mass spectrometry runs that resemble the new data. However, this retrieval task requires computing the similarity between an arbitrary pair of mass spectrometry runs. This is particularly challenging for runs acquired using different experimental protocols.
Results
We propose a method, MS1Connect, that calculates the similarity between a pair of runs by examining only the intact peptide (MS1) scans, and we show evidence that the MS1Connect score is accurate. Specifically, we show that MS1Connect outperforms several baseline methods on the task of predicting the species from which a given proteomics sample originated. In addition, we show that MS1Connect scores are highly correlated with similarities computed from fragment (MS2) scans, even though these data are not used by MS1Connect.
Availability and implementation
The MS1Connect software is available at https://github.com/bmx8177/MS1Connect.
Supplementary information
Supplementary data are available at Bioinformatics online.


Introduction
Over the years, a wealth of proteomics data has been deposited into online repositories such as PRIDE (Perez-Riverol et al., 2019) and MassIVE (Choi et al., 2020). Researchers may be interested in finding data in these online repositories that are similar to their own data in order to perform some joint analysis. However, it is difficult to identify which repository runs should be analyzed with your data. This is because it is hard to measure the similarity of a pair of runs, especially across different studies. Specifically, biologically irrelevant differences in sample preparation, liquid chromatography and instrument parameters all affect the resulting data.
Currently, few methods exist for measuring the similarity of a pair of proteomics runs. One metric is to count the number of confidently detected peptides in common (Tabb et al., 2010). Unfortunately, this method requires knowing the species composition of the samples, which is not always known, and requires a database search. Another previously developed method for measuring the similarity between a pair of mass spectrometry runs directly compares the set of peptide fragment spectra (MS2 scans) from each run. Conceptually this method works by calculating the spectra dot product for all pairs of spectra in two runs and then counting the fraction of scores above a threshold (Rieder et al., 2017). This approach has been successfully used for species identification (Belghit et al., 2019), molecular phylogenetics (Palmblad and Deelder, 2012) and differentiating between experimental protocols (van der Plas-Duivesteijn et al., 2016).
Unfortunately, these MS2-based methods do not account for differences in how MS2 scans are collected. MS2 spectra collected by data-independent acquisition (DIA), data-dependent acquisition (DDA) and targeted analyses are all different from each other. Therefore, new methods are needed that can measure the similarity between a pair of proteomics runs regardless of how the data were acquired.
In this work, we describe a new method, MS1Connect, that only uses intact peptide scans (MS1 scans) to calculate the similarity between a pair of runs. Since MS1 scans are always collected in the same way, these data can be used to compare targeted, DDA and DIA data. To our knowledge, MS1 data have not been used to measure the overall similarity between a pair of proteomics runs. Several methods have been developed to align the MS1 feature maps of a pair of proteomics runs (Cox and Mann, 2008; Röst et al., 2016; Wang et al., 2019), but these methods do not report an overall similarity score.
Our method, MS1Connect, frames scoring the similarity between a pair of proteomics runs as a maximum bipartite matching problem. A bipartite graph consists of two disjoint sets of vertices and a set of edges that connect the two sets of vertices. In a maximum bipartite matching problem, the goal is to select a subset of edges in a bipartite graph that maximizes some objective value subject to some constraint. In our setting, each of the two disjoint sets of vertices represents the set of MS1 features found in a run, and edges link MS1 features, in different runs, whose masses match within some tolerance. In addition, we require that every MS1 feature be associated with at most one edge in the set of selected edges.
The MS1Connect objective function consists of a weighted combination of three modular terms and a fourth supermodular term. Modular and supermodular functions are both types of set functions. In a modular function, the whole is equal to the sum of its parts. More specifically, given two disjoint sets of items $X$ and $Y$ and a scoring function $f$, $f(X) + f(Y) = f(X \cup Y)$. On the other hand, for a supermodular function, the whole is greater than or equal to the sum of its parts. In this case, $f(X) + f(Y) \le f(X \cup Y)$. While there exists a robust body of literature dedicated to the theory of supermodular maximization (Bai and Bilmes, 2018; Feige, 1998; Ji et al., 2020; Lin and Bilmes, 2011), these functions have rarely been applied to the analysis of biological data. The one example we are aware of uses a surrogate function to estimate the supermodular relationship between fragment ions in database searching (Bai et al., 2016).
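The distinction between modular and supermodular set functions can be checked numerically on a toy example. The sketch below is illustrative only: the item names, the weights and the pairwise affinity `min(w_i, w_j)` are assumptions chosen for this example and are not the MS1Connect terms.

```python
# Hypothetical per-item weights (not taken from the paper).
WEIGHTS = {"e1": 1.0, "e2": 2.0, "e3": 0.5, "e4": 3.0}


def modular(items):
    """Modular score: every item contributes a fixed weight."""
    return sum(WEIGHTS[i] for i in items)


def supermodular(items):
    """Pairwise score: sum of a nonnegative affinity over all ordered pairs.

    Because every cross-pair contributes a nonnegative amount, the score of
    a union of disjoint sets is at least the sum of the separate scores.
    """
    return sum(min(WEIGHTS[i], WEIGHTS[j]) for i in items for j in items)


X = {"e1", "e2"}
Y = {"e3", "e4"}

# Modular: f(X) + f(Y) == f(X u Y)
assert modular(X) + modular(Y) == modular(X | Y)
# Supermodular (for disjoint X, Y): f(X) + f(Y) <= f(X u Y)
assert supermodular(X) + supermodular(Y) <= supermodular(X | Y)
```

The pairwise form is used here because, like the objective described later in the text, it sums nonnegative entries over pairs of selected items, so adding an item is worth more when more compatible items are already selected.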
Each of the MS1Connect terms measures a different aspect of proteomics run similarity. The first modular term ($M_1$) favors solutions with a large number of edges. The second modular term ($M_2$) favors selection of edges with high-intensity values. The third modular term ($M_3$) favors edges with small normalized retention time shifts. We define the normalized retention time shift as the difference in normalized retention times between the two MS1 features in an edge. Finally, the fourth supermodular term ($M_4$) favors solutions with pairs of edges that are similar to each other. Two edges are similar if they have similar normalized retention time shifts and if the normalized retention times of the MS1 features, in the same run, are similar to each other.
We show evidence that the MS1Connect score accurately measures the similarity between two proteomics runs. There are many ways to define whether two runs are similar to each other. For this work, we focus on the task of species prediction. We show that MS1Connect scores outperform baselines for predicting the species a sample originated from. In addition, we show that MS1Connect scores are able to recapitulate similarities based on MS2 spectra. Specifically, we show a high correlation between MS1Connect scores and the Jaccard index between the sets of confidently detected peptides for a pair of runs.

Representation of a mass spectrometry run
We represented each tandem mass spectrometry run as a bag of MS1 features, where each feature nominally corresponds to a peptide detected in a set of precursor scans. However, we note that not every MS1 feature will be a peptide; a feature could instead be a chemical contaminant or other analyte. Each MS1 feature is represented as a tuple of four values: m/z, intensity, charge and retention time (in seconds). All four of these values are reported by pyOpenMS (Röst et al., 2014), the tool we use for MS1 feature detection. To generate the input files for pyOpenMS, we used Proteowizard version 3.0 (Chambers et al., 2012) to convert Thermo RAW files to .mzML format (Martens et al., 2011). We use the N most intense MS1 features to represent each run, where N is a hyperparameter. In Supplementary Section S1.1, we discuss how we normalized the values returned by pyOpenMS.
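The representation described above can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class name `MS1Feature`, the helper `top_n_features` and the example values are hypothetical, the features are assumed to have already been detected (for example by pyOpenMS), and the normalization from Supplementary Section S1.1 is omitted.

```python
from typing import List, NamedTuple


class MS1Feature(NamedTuple):
    """One MS1 feature: the four values described in the text."""
    mz: float         # mass-to-charge ratio (m/z)
    intensity: float  # feature intensity
    charge: int       # charge state
    rt: float         # retention time in seconds


def top_n_features(features: List[MS1Feature], n: int) -> List[MS1Feature]:
    """Keep the n most intense features (n plays the role of N in the text)."""
    return sorted(features, key=lambda f: f.intensity, reverse=True)[:n]


# Hypothetical run with three detected features.
run = [
    MS1Feature(500.25, 1.0e6, 2, 610.0),
    MS1Feature(733.40, 4.0e5, 3, 1250.5),
    MS1Feature(421.76, 2.5e6, 2, 95.2),
]
bag = top_n_features(run, n=2)
```

Keeping the features as immutable tuples makes it straightforward to use them as graph vertices in the matching formulation that follows.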

Mass spectrometry run matching as a maximum bipartite matching problem
We frame the measurement of the similarity between a pair of mass spectrometry runs as a maximum bipartite matching problem. In this approach, we aim to select a set of edges in a given bipartite graph that achieves a maximum objective value.
A generic bipartite graph $G = (U, V, E)$ consists of two disjoint sets of vertices, $U$ and $V$, and a set of edges $E \subseteq U \times V$, where each edge $e$ connects a vertex $u \in U$ to a vertex $v \in V$. For our specific formulation, $U$ and $V$ are the sets of MS1 features from two different mass spectrometry runs, $r_1$ and $r_2$, and the edges $E$ link MS1 features between $U$ and $V$ (Supplementary Fig. S1). For this reason, we have one bipartite graph $G_{r_1,r_2} = (U_{r_1}, V_{r_2}, E_{r_1,r_2})$ associated with every run pair $r_1, r_2$, where $E_{r_1,r_2} \subseteq U_{r_1} \times V_{r_2}$. For notational simplicity, we drop the $r_1, r_2$ subscripts except when they are needed for run-pair disambiguation.
We include in the graph edges between all pairs of MS1 features whose charges match and whose m/z values match within some tolerance $\delta_1$ (in units of ppm): $E = \{(u, v) \in U \times V : c(u) = c(v) \text{ and } |m(u) - m(v)| \le \delta_1\ \text{ppm}\}$ (1), where $m(u)$ is the m/z of $u$, $m(v)$ is the m/z of $v$, $c(v)$ is the charge of MS1 feature $v$, and $c(u)$ is the charge of $u$. By connecting MS1 features with the same charge and similar m/z, these edges attempt to connect the same peptide precursor. The goal in maximum bipartite matching is to select a set of edges $A$ that achieves a maximum score, as measured by a specified objective function $S$, subject to some matching constraints. Specifically, because we expect that each peptide precursor will be detected at most once per run, we require that a valid matching connects each feature in $U$ to at most one feature in $V$ and vice versa. More formally, for two runs $r_1$ and $r_2$, we want to choose a subset of edges $A \subseteq E$ that achieves a maximum value of the objective defined below subject to the following constraints: $\forall e \in A$, $\mathrm{degree}(u(e)) = 1$ and $\mathrm{degree}(v(e)) = 1$, where $u(e)$ retrieves the relevant MS1 feature from $U$ and $v(e)$ retrieves the corresponding MS1 feature in $V$. To designate these constraints, we define $\mathcal{E}_U = \{A \subseteq E : \forall e \in A, |A \cap \mathrm{IncidentEdges}(u(e))| = 1\}$, which is the set of all subsets of edges that satisfy the required degree constraint of the corresponding nodes on the $U$ side, and we correspondingly define $\mathcal{E}_V = \{A \subseteq E : \forall e \in A, |A \cap \mathrm{IncidentEdges}(v(e))| = 1\}$ for the $V$ side. Note that $A$ represents any possible subset of $E$, whereas $\mathcal{E}_V$ is the collection of all subsets of $E$ that fulfill the degree constraint on the $V$ side. Also, $\mathrm{IncidentEdges}(v(e))$ gives the set of edges that are incident to the vertex $v(e)$.
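The edge construction and the degree constraint can be sketched as follows. This is an illustration under assumptions, not the MS1Connect code: features are plain `(mz, intensity, charge, rt)` tuples, the function names are hypothetical, and the ppm difference is computed relative to the mean of the two m/z values, which is one common convention and may differ from the one used by the authors.

```python
def ppm_difference(mz_a, mz_b):
    """Relative m/z difference in parts per million.

    Assumption: the reference mass is the mean of the two m/z values.
    """
    return abs(mz_a - mz_b) / ((mz_a + mz_b) / 2.0) * 1e6


def build_edges(run_u, run_v, tol_ppm):
    """Return index pairs (i, j) for features with equal charge and
    m/z values within tol_ppm of each other."""
    edges = []
    for i, u in enumerate(run_u):
        for j, v in enumerate(run_v):
            same_charge = u[2] == v[2]
            if same_charge and ppm_difference(u[0], v[0]) <= tol_ppm:
                edges.append((i, j))
    return edges


def is_valid_matching(selected):
    """True if every feature on either side is used by at most one edge."""
    us = [i for i, _ in selected]
    vs = [j for _, j in selected]
    return len(set(us)) == len(us) and len(set(vs)) == len(vs)


# Hypothetical features: (m/z, intensity, charge, retention time in s).
run_u = [(500.0000, 1.0e6, 2, 100.0), (600.0000, 5.0e5, 3, 200.0)]
run_v = [(500.0010, 9.0e5, 2, 110.0), (500.0015, 1.0e5, 2, 300.0),
         (600.0000, 1.0e5, 2, 210.0)]

edges = build_edges(run_u, run_v, tol_ppm=4.0)
```

In this toy example the first feature of `run_u` has two candidate partners in `run_v`, so selecting both edges would violate the matching constraint; the objective function decides which single edge to keep. The quadratic loop is kept for clarity; sorting by m/z would be the natural optimization for runs with thousands of features.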

Scoring a candidate matching
Our handcrafted score function $S_{r_1,r_2}$ uses a maximum matching approach which maximizes an objective that consists of a weighted combination of four terms. The score for two runs $r_1, r_2$ is defined as follows: $S_{r_1,r_2} = \left[\sum_{i=1}^{4} M_i(E_{r_1,r_2})\right] \max_{A \subseteq E_{r_1,r_2} : A \in (\mathcal{E}_{U_{r_1}} \cap \mathcal{E}_{V_{r_2}})} \sum_{i=1}^{4} \lambda_i \frac{M_i(A)}{M_i(E_{r_1,r_2})}$ (2), where $M_1$ through $M_4$ are terms that are defined in Supplementary Section S1.3, $\lambda_1$ through $\lambda_4$ are convex mixture hyperparameters that weight the relative importance of each term, and $E_{r_1,r_2}$ is the set of valid edges between runs $r_1$ and $r_2$. Each term in the inner maximization is normalized by the score for the full set $E$ with no matching constraints. The inner maximization maximizes over all subsets of edges that satisfy both degree constraints, which is indicated via $A \in (\mathcal{E}_{U_{r_1}} \cap \mathcal{E}_{V_{r_2}})$, since if $A$ is a member of both $\mathcal{E}_{U_{r_1}}$ and $\mathcal{E}_{V_{r_2}}$, then no node incident to any edge has degree greater than one in the matching. The leading summation serves to un-normalize the objective value via a multiplicative $(r_1, r_2)$-dependent constant. This two-step normalization and un-normalization process is used to make the $\lambda$ hyperparameters interpretable (i.e. each $\lambda_i$ indicates the relative contribution of each term) and also to ensure that the scores are calibrated over multiple distinct pairs of runs. As a result of the normalization process, the four $\lambda$ values must sum to 1 and range between 0 and 1, inclusive. We describe the $M_i$ terms, each of which measures a different aspect of proteomics run similarity, in Supplementary Section S1.3 (Fig. 1).

Selecting the best matching
In order to choose the best matching, we must choose a feasible subset of edges that maximizes our objective function, that is, we must compute $\max_{A \subseteq E : A \in (\mathcal{E}_U \cap \mathcal{E}_V)} M(A)$ (3), where $M(A) = \sum_{i=1}^{4} \lambda_i M_i(A) / M_i(E)$ with mixture weights $\lambda_i$. The first thing to note is that $M(A)$ is the summation over entries of a $|U||V| \times |U||V|$ matrix. That is, $M(A) = \sum_{e_1 \in A} \sum_{e_2 \in A} m(e_1, e_2)$, where $m(e_1, e_2) = \sum_{i=1}^{4} \lambda_i m_i(e_1, e_2) / M_i(E)$ and where $m_i(e_1, e_2)$ is the element of the corresponding sub-objective $M_i$ defined above (i.e. $M_i(A) = \sum_{e_1, e_2 \in A} m_i(e_1, e_2)$). Note that for $i \in \{1, 2, 3\}$, we have that $m_i(e_1, e_2) = 0$ whenever $e_1 \ne e_2$. We also note that $m_4(e_1, e_2) = 1$ whenever $e_1 = e_2$. Hence, we can define $m(e_1, e_2)$ as follows: $m(e_1, e_2) = \sum_{i=1}^{3} \lambda_i m_i(e_1, e_1) / M_i(E) + \lambda_4 / M_4(E)$ if $e_1 = e_2$, and $m(e_1, e_2) = \lambda_4 m_4(e_1, e_2) / M_4(E)$ otherwise. Ordinarily, computing such a maximization over an exponential number of subsets would be intractable. It turns out, however, that this maximization problem exhibits useful structure that allows an efficient algorithm to be used to obtain an approximate solution. Specifically, we use the greedy algorithm to solve an instance of supermodular maximization subject to matroid constraints. In Supplementary Section S1.2, we discuss our approach to select the best matching as well as the approximation bound of the solution provided by the greedy algorithm (Bai and Bilmes, 2018; Edmonds, 1970; Feige, 1998; Lin and Bilmes, 2011; Oxley, 2011).
Lastly, we note again that the above is described for a generic bipartite graph G ¼ ðU; V; EÞ. In practice, we have one distinct bipartite graph for each pair r 1 ; r 2 of runs; hence, the greedy algorithm is run for each run pair.
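A greedy selection over a pairwise objective with matching constraints can be sketched generically as follows. This is a minimal illustration and not the authors' implementation: the exact procedure and its approximation bound are given in Supplementary Section S1.2, and the function names, the tie-breaking rule (first edge in list order) and the toy pairwise score `toy_m` are assumptions made for this example.

```python
def greedy_matching(edges, m):
    """Greedily build a matching A that approximately maximizes
    M(A) = sum over e1 in A, e2 in A of m(e1, e2).

    edges: list of (u, v) index pairs; m: nonnegative pairwise score.
    """
    selected = []
    used_u, used_v = set(), set()
    remaining = list(edges)
    while True:
        best_gain, best_edge = 0.0, None
        for e in remaining:
            # Matching constraint: each vertex is used at most once.
            if e[0] in used_u or e[1] in used_v:
                continue
            # Marginal gain M(A + e) - M(A): the diagonal entry plus both
            # off-diagonal entries against every edge already selected.
            gain = m(e, e) + sum(m(e, s) + m(s, e) for s in selected)
            if gain > best_gain:
                best_gain, best_edge = gain, e
        if best_edge is None:
            break
        selected.append(best_edge)
        used_u.add(best_edge[0])
        used_v.add(best_edge[1])
        remaining.remove(best_edge)
    return selected


# Hypothetical retention time shift attached to each candidate edge.
SHIFT = {(0, 0): 5.0, (0, 1): 40.0, (1, 1): 6.0, (2, 2): 5.5}


def toy_m(e1, e2):
    """Toy score: 1 on the diagonal; off the diagonal, edges with similar
    retention time shifts reward each other (assumption for illustration)."""
    if e1 == e2:
        return 1.0
    return max(0.0, 1.0 - abs(SHIFT[e1] - SHIFT[e2]) / 10.0)


matching = greedy_matching([(0, 0), (0, 1), (1, 1), (2, 2)], toy_m)
```

Because the off-diagonal rewards grow as compatible edges accumulate, later picks are steered toward edges whose shifts agree with the edges already chosen, which is the intuition behind including a supermodular term. Greedy selection is only approximate, and the outcome can depend on how ties in the first step are broken.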

Materials and methods
Detailed descriptions of the representation of a mass spectrometry run, selection of the best bipartite matching (Bai and Bilmes, 2018;Edmonds, 1970;Feige, 1998;Lin and Bilmes, 2011;Oxley, 2011), baseline similarity scores, database searching (Lin et al., 2018;McIlwain et al., 2014;Park et al., 2008;The UniProt Consortium, 2019), hyperparameter searching and sample generation (Cole et al., 2014;Kelly et al., 2006) can be found in the Supplementary material. In addition, the Supplementary material contains a table of notation used in this manuscript (Supplementary Table S6). Finally, information regarding datasets used in this study can be found in the Supplementary material as well as Supplementary Files S1-S4.

MS1Connect versions
Since the objective function of MS1Connect contains four terms, we compared five different versions of MS1Connect against each other. Four of the versions used a single term of the objective function while the fifth version used all four terms of the objective function. We denote the version of MS1Connect that only uses the $M_1$ term as 'MS1Connect ($M_1$ only)', and we denote the version of MS1Connect that uses all four terms as 'MS1Connect ($M_1$-$M_4$)'.

Evaluation metrics
We use two different measures to quantify the performance of a given MS1 similarity score at predicting the metadata label of a query run given a repository of runs with known metadata labels.
The first performance measure is the per-query average precision (QAP), defined as the average precision across all queries $q \in R$: $\mathrm{QAP}(S, R) = \frac{1}{|R|} \sum_{q \in R} \sum_{k} P_k(S, q, R) \left[ R_k(S, q, R) - R_{k-1}(S, q, R) \right]$, where $P_k(S, q, R)$ and $R_k(S, q, R)$ are the precision and recall, respectively, after $k$ repository runs have been retrieved from $R$ using query $q$ and similarity $S$. Note that the query run $q$ is not included in the ranking when computing the precision and recall. The second measure, aggregate average precision (AAP), is similar to QAP except that the average precision is calculated once on an aggregated list of similarity scores. This aggregated list is produced by sorting together all pairs of runs, considering only the upper triangle of the run-by-run matrix. In this work, we focused on the QAP because these two performance metrics are highly correlated (Supplementary Fig. S2).
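The per-query average precision can be sketched as follows. This is an illustrative implementation of the standard average precision formula, not the evaluation code used in the study: the function names and the toy similarity matrix are hypothetical, and returning 0.0 for a query with no same-label run in the repository is an assumption.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans ordered by decreasing similarity, True when
    the retrieved run shares the query's label.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        return 0.0  # assumption: queries with no relevant run score 0
    hits, ap = 0, 0.0
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            # Recall rises by 1/total_relevant here; precision at k is hits/k.
            ap += hits / k
    return ap / total_relevant


def query_average_precision(similarity, labels):
    """Mean of the per-query average precision, excluding the query itself
    from its own ranking."""
    n = len(labels)
    aps = []
    for q in range(n):
        others = [r for r in range(n) if r != q]
        others.sort(key=lambda r: similarity[q][r], reverse=True)
        aps.append(average_precision([labels[r] == labels[q] for r in others]))
    return sum(aps) / n


# Hypothetical symmetric similarity matrix for four runs from two species.
labels = ["a", "a", "b", "b"]
similarity = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
qap = query_average_precision(similarity, labels)
```

In this toy matrix every run ranks its same-species partner first, so each per-query average precision is 1.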

MS1Connect scores can be used for species prediction
To determine whether the MS1Connect score successfully measures the similarity of a pair of runs, we investigated whether these scores can be used to predict the species label of a proteomics run. Given a set of runs, we designated a single run as the query run and the remaining runs as the repository. The repository runs were then ordered by their similarity to the query run. Then, the average precision was calculated based on whether the repository runs had the same species label as the query run. We repeated this process for each of the runs in the dataset to calculate our two metrics, QAP and AAP. MS1Connect has nine different hyperparameters (Table 1), and to determine which set of hyperparameters achieved the best performance on the training set, as measured by QAP, we sampled the hyperparameter space using a random grid search. We found that MS1Connect scores can be successfully used for predicting what species a sample was generated from (Table 1). Comparing the various versions of MS1Connect against each other, we discovered that the two supermodular versions of MS1Connect, MS1Connect ($M_1$-$M_4$) and MS1Connect ($M_4$ only), had the best performance. Specifically, these two versions had the best or second best performance, as measured by QAP or AAP, in both the training and test data. These two versions are supermodular because they incorporate the supermodular $M_4$ term. The higher performance of these two supermodular versions of MS1Connect, compared to the three modular versions of MS1Connect, suggests that using supermodularity may lead to higher performance. In Supplementary Note S3.1, we discuss a possible reason why supermodular methods outperform modular methods.
Next, we compared the performance of MS1Connect against our baseline similarity methods (Table 1). We found that the supermodular MS1Connect methods outperformed all of the baselines. In addition, we found the performance of MS1Connect ($M_1$ only) and MS1Connect ($M_3$ only) to be similar to the performance of the baselines while MS1Connect ($M_2$ only) under-performed the baselines. Note that Table 1 does not include the performance of the two baselines where G is the maximum of the two input values as these methods exhibited poor performance ( 0.13 for both QAP and AAP in the training set).
After we compared MS1Connect against the baselines, we compared the various baselines against each other. We found that performance generally decreased when retention time was considered (Table 1). This result may seem unexpected because retention times, in principle, should improve the specificity of a matching. However, binning retention times could lead to a significant number of edge effects because of systematic retention time shifts that reflect differences in chromatography conditions. As an example, the number of MS1 features in common between two Candida albicans runs, 1512006-2-TRIS1-10-Calbicans.raw and Control-60min_R3.raw, decreased by 30.8% when retention time was considered (n = 2) compared to when retention time was not considered (n = 1).
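The edge effect mentioned above can be illustrated with a toy calculation. The baselines themselves are defined in the Supplementary material, so this sketch only shows the general phenomenon; the bin width, the size of the shift and the retention times are hypothetical values chosen for illustration.

```python
def rt_bin(rt_seconds, bin_width):
    """Index of the retention time bin that contains rt_seconds."""
    return int(rt_seconds // bin_width)


BIN_WIDTH = 60.0  # hypothetical bin width in seconds
SHIFT = 2.0       # hypothetical systematic shift between two runs

# The same three analytes observed in two runs, the second shifted by 2 s.
rts_run_a = [61.0, 119.0, 200.0]
rts_run_b = [rt + SHIFT for rt in rts_run_a]

same_bin = [
    rt_bin(a, BIN_WIDTH) == rt_bin(b, BIN_WIDTH)
    for a, b in zip(rts_run_a, rts_run_b)
]
```

Here the feature near the bin boundary (119 s versus 121 s) lands in different bins even though the shift is small, so a bin-based comparison no longer counts it as shared between the two runs.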
Following the quantitative comparison, we examined the MS1Connect scores in the species training data to understand whether these scores made qualitative sense. We expected that runs generated by the same experiment should be highly similar to each other. Beyond that, we also expected runs from the same species to be similar to each other.
Using the hyperparameter set that yielded the best performance, as measured by QAP on the training data, we visualized the MS1Connect scores between all pairwise runs in the species training dataset as a heatmap (Fig. 2). In this heatmap, the rows and columns are ordered by expected similarity. In general, the resulting heatmap matched our expectation. For example, the strong diagonal component in the heatmap resulted from runs that were generated by the same experiment. In addition, the block-diagonal structure (delimited in the figure by the solid white lines) showed that runs generated from the same species tend to have high scores. For example, all the runs generated by Staphylococcus aureus and C. albicans generally only have high scores with each other. On the other hand, we also found a few examples of runs whose similarity profile did not match our expectation. One prominent example is a set of five Escherichia coli runs from the same experiment that only had high MS1Connect scores when comparing the same run to itself. A second example is a set of five Arabidopsis thaliana runs from the same experiment where a run often did not have a high MS1Connect score to itself. This result occurred because these 10 runs have few MS1 features. In general, larger MS1Connect scores can be achieved when a run has a larger number of MS1 features. In this specific situation, these 10 runs have the fewest MS1 features in the overall dataset. In addition, we found that MS1Connect scores may be sensitive enough to detect inter-species relationships. Considering the human and mouse runs, the MS1Connect scores indicated that these two species have some degree of similarity. This finding fits with our phylogenetic understanding of these two species, as humans and mice are both mammals. These two species are highly similar to each other in the context of our dataset, which also included bacteria, fungi and plants.
Another example is that Bacillus anthracis and Bacillus cereus runs are similar to each other, which is expected since these two species are in the same genus. In the species test data, we saw comparable results with the two gram-negative bacterial species, E. coli and Salmonella enterica, being similar to each other ( Supplementary Fig. S5).
A close inspection of the heatmap showed an unexpectedly high similarity between a group of S. aureus runs and all the Homo sapiens runs (Fig. 2). The S. aureus runs originated from a study that used both cultures and clinical samples of S. aureus. We hypothesized that these runs originated from clinical samples and therefore contained a large number of human peptides. This type of contamination would cause these bacterial runs to be similar to human runs. We were unable to confirm this information from the PRIDE metadata, because the submission did not track which samples were culture-based and which samples were clinically based. However, a database search of the S. aureus runs against a concatenated S. aureus and H. sapiens proteome detected a large number of human peptides, supporting our hypothesis. Figure 2 also suggested that the second set of human runs was unexpectedly dissimilar to all other human and mouse runs. We do not have access to metadata that might explain this phenomenon, but we speculate that the dissimilarity may arise because the second set of human runs originated from single cell data, whereas the remaining runs were derived from bulk samples. Note that these human runs are most similar to the previously discussed human-contaminated S. aureus samples. The high similarity between these runs is expected since the MS1Connect score is symmetric.

MS1Connect scores replicate database search results
Our analyses so far suggest that MS1Connect scores can successfully measure the similarity between pairs of proteomics runs. Next, we compared whether our MS1-based similarity method replicates similarities generated from MS2 data. We expect that these two methods should generally agree with each other.
To test this hypothesis, we compared the MS1Connect scores against the similarities generated by an MS2-based method on a set of samples with known species composition. For the MS2-based method, we calculated the Jaccard index between the sets of confidently detected peptides at a 1% FDR threshold for a pair of runs. The dataset we used contained a set of 16 samples, labeled 'A' through 'P', where each sample contained one, two or three different organisms (Supplementary Table S7). Each sample, except for the sample labeled 'P', was run on a mass spectrometer four times. Three of the runs were analyzed by a single instrument in one laboratory while the fourth run was analyzed on a different instrument in a second laboratory. The two laboratories are independent from each other, and therefore the samples were run with differing instrument parameters and chromatography conditions.
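The Jaccard index used for the MS2-based comparison is the size of the intersection of two sets divided by the size of their union. A minimal sketch follows, assuming each run has already been reduced to its set of confidently detected peptide sequences; the function name, the peptide strings and the choice to return 0.0 for two empty sets are assumptions for illustration.

```python
def jaccard_index(peptides_a, peptides_b):
    """Jaccard index |A n B| / |A u B| between two peptide collections."""
    a, b = set(peptides_a), set(peptides_b)
    if not a and not b:
        return 0.0  # assumption: two empty runs are treated as dissimilar
    return len(a & b) / len(a | b)


# Hypothetical peptide detections for two runs.
run_1 = {"PEPTIDEK", "ELVISK", "LIVESK"}
run_2 = {"ELVISK", "LIVESK", "SAMPLER"}
similarity = jaccard_index(run_1, run_2)
```

Two of the four distinct peptides are shared in this toy example, so the index is 0.5. Unlike MS1Connect, this measure requires a database search and therefore knowledge of the species present in each sample.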
Our results showed that MS1Connect scores indeed have a high correspondence to similarities measured from MS2 data. A heatmap where the upper triangle is MS1Connect scores and the lower triangle is the Jaccard index showed that these two methods are highly consistent with each other (Fig. 3). In addition, we found that these two scores are highly correlated, with a Spearman rank correlation of 0.91. Overall, this result showed that MS1Connect scores can replicate MS2-based analyses. In addition, our result suggested that MS1Connect scores can measure the similarity between runs that contain multiple species in the face of differing chromatography conditions and instrumentation.
A distinct checkerboard pattern in the lower triangle of the heatmap occurred because three out of the four runs from each sample are run on a single instrument while the fourth is run on a second instrument (Fig. 3). This result matched our expectation since the three runs from the same instrument should be more similar to each other than the run from the second instrument. This pattern, while not as prominent, also exists on the MS1Connect side of the heatmap and further showed that MS1Connect scores can replicate database search results.

MS1Connect scores can identify mislabeled runs
Having shown the ability of MS1Connect to accurately measure the similarity between pairs of proteomics runs, we hypothesized that our method could be used to identify runs with mislabeled metadata. To this end, we analyzed data collected from seven different Bacillus species (Pfrunder et al., 2016). Each species was analyzed by mass spectrometry six times. We calculated the MS1Connect scores between all of the runs. In addition, we calculated the Jaccard index between the sets of confidently detected peptides at 1% FDR threshold for all pairs of runs.
We found that MS1Connect scores identify runs whose metadata may have been mislabeled. Specifically, when visualizing a heatmap of the MS1Connect scores between all pairs of runs (upper triangle of Fig. 4), we observed that one of the runs labeled as Bacillus tonyonesis exhibits low similarity with other B.tonyonesis runs but has high similarity with other Bacillus cytotoxicus runs (last row/ column of the heatmap). A database search also confirmed this result (lower triangle of Fig. 4). Together, these results strongly suggest that a single run labeled as B.tonyonesis should instead be labeled B.cytotoxicus. This mislabeling could have occurred in many ways. For example, during the sample preparation process a vial could have been mislabeled or there could have been an error during some pipetting step. Alternatively, when placing samples into an autosampler, a vial could have been placed incorrectly onto the tray or the vial position could have been incorrectly recorded into the instrument.

Analysis of hyperparameters
After the assessment of our method, we investigated the performance of MS1Connect as a function of the hyperparameters to determine if any of the hyperparameters were correlated with performance. Specifically, we studied the performance of the 2500 different hyperparameterizations of MS1Connect, as described in Supplementary Section S2.2, with respect to each hyperparameter. This analysis revealed trends in the performance of various hyperparameterizations of MS1Connect as a function of N and the m/z tolerance $\delta_1$ (Fig. 5). Specifically, we found that the best performance, as measured by QAP on the training dataset, occurred when the number of MS1 features used to generate the bipartite graph N was set to 4000 and when the m/z tolerance $\delta_1$ was set to 4 ppm. We note that 4 ppm is within the range of precursor mass tolerances researchers typically use for database searching. On the other hand, we found no trends in the performance of MS1Connect as a function of the remaining hyperparameters (Supplementary Figures S6 and S7).

Fig. 3. [...] Supplementary Table S7. Within a box delineated by the white lines, the first three rows are the runs from one laboratory while the fourth row is the run from a second laboratory.
Given that the best performance occurred when N = 4000 and the m/z tolerance was 4 ppm, we fixed these two hyperparameters and then asked whether there were any trends in the performance as a function of the remaining hyperparameters. We found that $\lambda_4$, the weight on the supermodular term $M_4$, was correlated with performance (Supplementary Fig. S8A). The performance increased as $\lambda_4$ increased from 0.0 to 0.9, then performance decreased as $\lambda_4$ approached 1.0. This trend suggested that larger $\lambda_4$ may lead to higher performance and hence that supermodularity is important for achieving high performance. This is also shown by the fact that the best performing hyperparameterizations of MS1Connect occurred when $\lambda_4 = 0.9$. In addition to $\lambda_4$, the weight $\lambda_2$ on the intensity term $M_2$ was also found to be correlated with performance (Supplementary Fig. S8B). In general, the performance of MS1Connect decreased as $\lambda_2$ increased from 0.0 to 1.0. This result indicated that $M_2$, which measures intensity, is not useful for achieving high performance. However, we note that another hyperparameter, N, limits MS1 features in the bipartite graph to high-intensity features. None of the remaining hyperparameters were correlated with performance (Supplementary Figs S9 and S10).

Discussion
In this work, we introduced a new method, MS1Connect, that measures the similarity between pairs of proteomics runs. We showed evidence that MS1Connect successfully measures proteomics run similarity. Specifically, we showed that MS1Connect scores can be used for classifying what species a sample was generated from, and we showed evidence that MS1Connect scores can replicate database search results.
We also considered including in our empirical comparison a method based on dynamic time warping, but this approach turned out to be computationally infeasible. For this approach, we used FastDTW (Salvador and Chan, 2004), with a cosine distance as the distance metric, to score the warping between two runs. We estimated that running dynamic time warping between all pairs of runs in our training dataset for the largest m/z bin width (0.01 Da) in our hyperparameter range would take ~3 weeks and 300 Gb of memory. The large memory requirement is due to the large matrices being compared to each other. For example, the run with the largest matrix had ~30 000 time bins and 700 000 m/z bins (m/z bin width of 0.0035 Da). The long time requirement resulted from having to calculate the warping between all pairs of runs (>13 000 warpings).
Because MS1Connect only uses MS1 data, it is agnostic to acquisition style. Therefore, MS1Connect can calculate the similarity between runs that have been collected in different ways. Although we did not specifically test this idea, we note that our training set contained data collected by various means. For example, three of the human runs did not contain any MS2 data. As a result, standard MS2-based methods cannot be applied to these runs. In addition, we note that one set of five S.aureus runs in the training data was collected by DIA, whereas the remaining runs were collected by DDA. Our results suggested that MS1Connect scores can measure the similarity between DIA and DDA runs. In turn, this ability to measure similarity is the first step for allowing joint analysis of DDA and DIA data.
While we have shown that MS1Connect scores can be used to predict metadata labels of runs, future work needs to be conducted to determine the limits of our method. For example, none of the data used in this study was isobarically labeled. MS1Connect could be extended to include these types of samples. In addition, MS1Connect has not been tested on a wide variety of samples, such as fractionated samples. Additional work could be pursued to test the applicability of MS1Connect in these scenarios.

Fig. 4. Possible identification of a mislabeled run. The figure is a heatmap of pairwise similarity scores, where the upper triangle is MS1Connect scores and the lower triangle is the Jaccard index between the two sets of confidently detected peptides at 1% FDR threshold for the Bacillus genus dataset. The x- and y-axes show the species labels of each run, as described by the PRIDE repository, with six runs per species. The solid white lines denote the border between species. Both the MS1Connect scores and the Jaccard index indicate that a run labeled as B. tonyonesis (last row/column of the matrix) was mislabeled and should be labeled B. cytotoxicus.

Fig. 5. Performance as a function of no. of MS1 features or m/z tolerance. These figures plot the performance of 2500 different hyperparameterizations of MS1Connect as a function of (A) no. of MS1 features or (B) m/z tolerance. The best performance occurs when m/z tolerance is set to 4 ppm and no. of MS1 features is set to 4000. Note that the values near the top are the number of points found in each column.
Finally, we note that while we developed MS1Connect for use in proteomics, our method can be used to analyze data from other mass spectrometry-based fields such as metabolomics and lipidomics. Our method may be especially useful in these fields because MS2-based identifications are frequently missing. Therefore, MS2-based methods may not be as valuable. In the future, MS1Connect could be applied to data generated by metabolomics and lipidomics.