DART-ID increases single-cell proteome coverage

Analysis by liquid chromatography and tandem mass spectrometry (LC-MS/MS) can identify and quantify thousands of proteins in microgram-level samples, such as those comprised of thousands of cells. This process, however, remains challenging for smaller samples, such as the proteomes of single mammalian cells, because reduced protein levels reduce the number of confidently sequenced peptides. To alleviate this reduction, we developed Data-driven Alignment of Retention Times for IDentification (DART-ID). DART-ID implements principled Bayesian frameworks for global retention time (RT) alignment and for incorporating RT estimates towards improved confidence estimates of peptide-spectrum-matches. When applied to bulk or to single-cell samples, DART-ID increased the number of data points by 30–50% at 1% FDR, and thus decreased missing data. Benchmarks indicate excellent quantification of peptides upgraded by DART-ID and support their utility for quantitative analysis, such as identifying cell types and cell-type specific proteins. The additional datapoints provided by DART-ID boost the statistical power and double the number of proteins identified as differentially abundant in monocytes and T-cells. DART-ID can be applied to diverse experimental designs and is freely available at http://dart-id.slavovlab.net.

Introduction
Advancements in the sensitivity and discriminatory power of protein mass-spectrometry (MS) have enabled the quantitative analysis of increasingly limited amounts of samples. Recently, we have developed Single Cell Proteomics by Mass Spectrometry (SCoPE-MS). SCoPE-MS uses a barcoded carrier to boost the MS signal from single cells and enhance sequence identification [1,2]. While this design allows quantifying hundreds of proteins in single mammalian cells, sequence identification remains challenging because many low-abundance peptides generate only a few fragment ions that are insufficient for confident identification [3,4]. Such low-confidence peptides are generally not used for protein quantification, and thus reduce the data points available for further analyses. We sought to overcome this challenge by using both the retention time (RT) of an ion and its MS/MS spectra to achieve more confident peptide identifications. To this end, we developed a novel data-driven Bayesian framework for aligning RTs and for updating peptide confidence. DART-ID minimizes assumptions, aligns RTs with median residual error below 3 seconds, and increases the fraction of cells in which peptides are confidently identified.
Multiple existing approaches, including Skyline ion matching [5], moFF match-between-runs [6], MaxQuant match-between-runs [7,8], DeMix-Q [9] and Open-MS FFId [10], allow combining MS1 spectra with other informative features, such as RT and precursor ion intensity, to enhance peptide identification. These methods, in principle, may identify any ion detected in a survey scan (MS1 level) even if it was not sent for fragmentation and second MS scan (MS2) in every run. Thus, by not using MS2 spectra, these methods may overcome the limiting bottleneck of tandem MS: the need to isolate, fragment and analyze the fragments in order to identify and quantify the peptide sequence.
However, not using the MS2 spectra for identification has a downside: the MS2 spectra contain highly informative features even for ions that could not be confidently identified based on spectra alone. This is particularly important when MS/MSed ions are the only ones that can be quantified, as in the case of isobaric mass tags. Thus, the MS1-based methods have a strong advantage when quantification relies only on MS1 ions (e.g., LFQ [11] and SILAC [12]), while methods using all MS2 spectra can more fully utilize all quantifiable data from isobaric tandem-mass-tag experiments.
DART-ID aims to use all MS2 spectra, including those of very low confidence PSMs, and combines them with accurate RT estimates to update peptide-spectrum-match (PSM) confidence within a principled Bayesian framework. Unlike previous MS2-based methods which incorporate RT estimates into features for FDR recalculation [13], discriminants [14], filters [15][16][17], or scores [18,19], we update the ID confidence directly with a Bayesian framework [20,21]. Crucial to this method is the accuracy of the alignment method; the higher the accuracy of RT estimates, the more informative they are for identifying the peptide sequence.
The RT of a peptide is a specific and informative feature of its sequence, and this specificity has motivated approaches aiming to estimate peptide RTs. These approaches either (i) predict RTs from peptide sequences or (ii) align empirically measured RTs. Estimated peptide RTs have a wide range of uses, such as scheduling targeted MS/MS experiments [22], building efficient inclusion and exclusion lists for LC-MS/MS [23,24], or augmenting MS2 mass spectra to increase identification rates [14][15][16][17][18][19].
Peptide RTs can be estimated from physical properties such as sequence length, constituent amino acids, and amino acid positions, as well as chromatography conditions, such as column length, pore size, and gradient shape. These features predict the relative hydrophobicity of peptide sequences and thus RTs for LC used with MS [25][26][27][28][29][30][31]. The predicted RTs can be improved by implementing machine learning algorithms that incorporate confident, observed peptides as training data [15,19,32-35].
Predicted peptide RTs are mostly used for scheduling targeted MS/MS analyses where acquisition time is limited, e.g., multiple reaction monitoring [22]. They can also be used to aid peptide sequencing, as exemplified by "peptide fingerprinting", a method that identifies peptides based on an ion's RT and mass over charge (m/z) [28,36-38]. While peptide fingerprinting has been successful for low complexity samples, where MS1 m/z and RT space is less dense, it requires carefully controlled conditions and rigorous validation with MS2 spectra [37][38][39][40][41]. Predicted peptide RTs have more limited use with data-dependent acquisition, i.e., shotgun proteomics. They have been used to generate data-dependent exclusion lists that spread MS2 scans over a more diverse subset of the proteome [23,24], as well as to aid peptide identification from MS2 spectra, either by incorporating the RT error (difference between predicted and observed RTs) into a discriminant score [14], or filtering out observations by RT error to minimize the number of false positives selected [15][16][17]. In addition, RT error has been directly combined with search engine scores [18,19]. Besides automated methods of boosting identification confidence, proteomics software suites such as Skyline allow the manual comparison of measured and predicted RTs to validate peptide identifications [5].
The second group of approaches for estimating peptide RTs aligns empirically measured RTs across multiple experiments. Peptide RTs shift due to variation in sample complexity, matrix effects, column age, room temperature and humidity. Thus, estimating peptide RTs from empirical measurements requires alignment that compensates for RT variation across experiments. Usually, RT alignment methods align the RTs of two experiments at a time, and typically utilize either a shared, confidently-identified set of endogenous peptides, or a set of spiked-in calibration peptides [42,43]. Pairwise alignment approaches must choose a particular set of RTs that all other experiments are aligned to, and the choice of that reference RT set is not obvious. Alignment methods are limited by the availability of RTs measured in relevant experimental conditions, but can result in more accurate RT estimates when such empirical measurements are available [7,8,43]. Generally, RT alignment methods provide more accurate estimates than RT prediction methods, discussed earlier, but also generally require more extensive data and cannot estimate RTs of peptides without empirical observations.
Methods for RT alignment are various, and range from linear shifts to non-linear distortions and time warping [44]. Some have argued for the necessity of non-linear warping functions to correct for RT deviations [45], while others have posited that most of the variation can be explained by simple linear shifts [46]. More complex methods include multiple generalized additive models [47], or machine-learning based semi-supervised alignments [48]. Once experiments are aligned, peptide RTs can be predicted by applying experiment-specific alignment functions to the RT of a peptide observed in a reference run.
Peptide RTs estimated by alignment can be used to schedule targeted MS/MS experiments, similar to the use of predicted RTs estimated from the physical properties of a peptide [43]. RT alignments are also crucial for MS1 ion/feature-matching algorithms, as discussed earlier [5][6][7][8][9][10], as well as in targeted analyses of results from data-independent acquisition (DIA) experiments [49][50][51]. The addition of a more complex, non-linear RT alignment model that incorporates thousands of endogenous peptides instead of a handful of spiked-in peptides increased the number of identifications in DIA experiments by up to 30% [52].
With DART-ID, we implement a novel global RT alignment method that takes full advantage of SCoPE-MS data, which feature many experiments with analogous samples run on the same nano-LC (nLC) system [1,2]. These experimental conditions yield many RT estimates per peptide with relatively small variability across experiments. In this context, we used empirical distribution densities that obviated assumptions about the functional dependence between peptide properties, RT, and RT variability and thus maximized the statistical power of highly reproducible RTs. This approach increases the number of experiments in which a peptide is identified with high enough confidence and its quantitative information can be used for analysis. The DART-ID program is freely available and can easily be run over the output of peptide search engines such as MaxQuant [7,8].

Model for global RT alignment and PSM confidence update
Using RT for identifying peptide sequences starts with estimating the RT for each peptide, and we aimed to maximize the accuracy of RT estimation by optimizing RT alignment. Many existing methods can only align the RTs of two experiments at a time, i.e., pairwise alignment, based on partial least squares minimization, which does not account for the measurement errors in RTs [53]. Furthermore, the selection of a reference experiment is non-trivial, and different choices can give quantitatively different alignment results. In order to address these challenges, we developed a global alignment method, sketched in Fig 1a and 1b. The global alignment infers a reference RT for the $i$th peptide, $\mu_i$, as a latent variable with value $\mu_{ik}$ in the $k$th experiment. This can be related to the measured RT for peptide $i$ in experiment $k$, $\rho_{ik}$:
$$\rho_{ik} = \mu_{ik} + \epsilon_{ik} \qquad (1)$$
where $\mu_{ik} \triangleq g_k(\mu_i)$ and $\epsilon_{ik}$ is an independent mean-zero error term expressing residual (unmodeled) RT variation. As a first approximation, we assume that the observed RTs for any experiment can be well approximated using a two-segment linear regression model (Eq 2), where $s_k$ is the split point for the two-segment regression in each experiment, and the parameters are constrained to not produce a negative RT. This model can be generalized to more complex monotonically-constrained models, such as spline fitting or locally estimated scatterplot smoothing (LOESS).
Using the vectorized likelihood function from Eq 3 and the priors described in Methods, we solve Eq 4 to infer the joint posterior distribution of all reference RTs (and associated parameters):
$$\underbrace{P(a, b, \beta_0, \beta_1, s, \mu \mid \rho, \lambda)}_{\text{Posterior}} \propto \underbrace{P(\rho \mid a, b, \beta_0, \beta_1, s, \mu, \lambda)}_{\text{Likelihood (Eq 3)}} \; \underbrace{P(a, b, \beta_0, \beta_1, s, \mu)}_{\text{Prior}} \qquad (4)$$
The inference described above infers all reference RTs, $\mu$, from one global solution of Eq 4. It allows the alignment to take advantage of any peptide observed in at least two experiments, regardless of the number of missing observations in other experiments. Furthermore, the mixture model described in Eq 3 allows for the incorporation of low confidence peptides by using appropriate weights and accounting for the presence of false positives. Thus this method maximizes the data used for alignment and obviates the need for spiked-in standards. Furthermore, the reference RT provides a principled choice for a reference (rather than choosing a particular experiment) that is free of measurement noise. The alignment process accounts for the error in individual observations by inferring a per peptide RT distribution, as opposed to aligning to a point estimate, as well as for variable RT deviations in experiments by using experiment-specific weights.
Fig 1 caption: The observed RTs are modeled as a function of the reference RT, which allows incorporating experiment-specific weights and the uncertainty in measured RTs and peptide identification as shown in Eq 3. Then the global alignment model simultaneously infers the reference RT and aligns all experiments by solving Eq 4. (c) A conceptual diagram for updating the confidence in a peptide-spectrum-match (PSM). The probability to observe each PSM is estimated from the conditional likelihoods for observing the RT if the PSM is assigned correctly (blue density) or incorrectly (red density). For PSM 1, $P(\delta = 1 \mid \mathrm{RT}) < P(\delta = 0 \mid \mathrm{RT})$, and thus the confidence decreases. Conversely, for PSM 2, $P(\delta = 1 \mid \mathrm{RT}) > P(\delta = 0 \mid \mathrm{RT})$, and thus the confidence increases.
The conceptual idea based on which we incorporate RT information for sequence identification is illustrated in Fig 1c and formalized with Bayes' theorem in Fig 1d. We start with a peptide-spectrum-match (PSM) from a search engine and its associated probability to be incorrect (PEP; posterior error probability) and correct, 1-PEP.
If the RT of a PSM is far from the RT of its corresponding peptide, as PSM1 in Fig 1c, then the spectrum is more likely to be observed if the PSM is incorrect, and thus we can decrease its confidence. Conversely, if the RT of a PSM is very close to the RT of its corresponding peptide, as PSM2 in Fig 1c, then the spectrum is more likely to be observed if the PSM is correct, and thus we can increase its confidence. To estimate whether the RT of a PSM is more likely to be observed if the PSM is correct or incorrect, we use the conditional likelihood probability densities inferred from the alignment procedure in Eq 3 (Fig 1b). Combining these likelihood functions with Bayes' theorem in Fig 1d allows us to formalize this logic and update the confidence of analyzed PSMs, which we quantify with DART-ID PEPs.
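The confidence update described above can be sketched in a few lines of Python. This is a minimal illustration of Bayes' theorem applied to a single PSM, not the DART-ID implementation. The function names, the normal density used for correct matches, and the flat density used for incorrect matches are assumptions made for this example; DART-ID infers both conditional densities from the alignment.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution; stands in for P(RT | PSM correct)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def update_pep(spectral_pep, rt_density_correct, rt_density_incorrect):
    """Posterior probability that a PSM is incorrect, given its observed RT.

    spectral_pep          prior P(incorrect) reported by the search engine
    rt_density_correct    density of the observed RT if the PSM is correct
    rt_density_incorrect  density of the observed RT if the PSM is incorrect
    """
    incorrect = spectral_pep * rt_density_incorrect
    correct = (1.0 - spectral_pep) * rt_density_correct
    return incorrect / (incorrect + correct)


# Hypothetical numbers: reference RT of 30.0 min with sd 0.05 min, and a
# flat null density over a 60-minute run for incorrect matches.
null_density = 1.0 / 60.0
close = update_pep(0.05, normal_pdf(30.02, 30.0, 0.05), null_density)
far = update_pep(0.05, normal_pdf(31.00, 30.0, 0.05), null_density)
```

In this toy setting the PSM observed 0.02 min from its reference RT ends with a lower error probability than the starting 0.05, while the PSM observed 1 min away ends with a higher one. When the two densities are equal, the RT carries no information and the PEP is unchanged.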

Global alignment process reduces RT deviations
To evaluate the global RT alignment by DART-ID, we used a staggered set of 46 LC-MS/MS runs, each 60 minutes long, performed over a span of three months so that the measured RTs captured expected variance in the chromatography. Each run was a diluted 1 × M injection of a bulk 100 × M SCoPE-MS sample, as described in Table 1 and by Specht et al. [2]. The measured RTs were compared to RTs predicted from peptide sequences [30,31,34], and to top-performing alignment methods [7,8,43,52], including the reference RTs from DART-ID; see Fig 2a. All methods estimated RTs that explained the majority of the variance of the measured RTs, Fig 2a. As expected, the alignment methods provided closer estimates, explaining over 99% of the variance.
To evaluate the accuracy of RT estimates more rigorously, we compared the distributions of differences between the reference RTs and measured RTs, shown in Fig 2b. This comparison again underscores that the differences are significantly smaller for alignment methods, and smallest for DART-ID. We further quantified these differences by computing the mean and median absolute RT deviations, i.e., |ΔRT|, which is defined as the absolute value of the difference between the observed RT and the reference RT. For the prediction methods (SSRCalc [30], BioLCCC [31], and ELUDE [34]) the average deviations exceed 2 min, and ELUDE has the smallest average deviation of 2.5 min. The alignment methods result in smaller average deviations, all below 1 min, and DART-ID shows the smallest average deviation of 0.044 min (2.6 seconds). Detailed alignment statistics can be visualized in both the graphical output of the DART-ID program and in the DO-MS visualization platform [54].
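The |ΔRT| summary used above is straightforward to compute. The following Python sketch shows the calculation on made-up numbers; the function name and the example RTs are hypothetical and are not taken from the DART-ID code or data.

```python
import statistics


def absolute_rt_deviations(observed_rts, reference_rts):
    """Return |observed - reference| for each paired RT (same units as input)."""
    if len(observed_rts) != len(reference_rts):
        raise ValueError("observed and reference RTs must be paired")
    return [abs(o - r) for o, r in zip(observed_rts, reference_rts)]


# Hypothetical RTs in minutes for five PSMs.
observed = [12.31, 25.07, 33.90, 41.52, 55.10]
reference = [12.30, 25.00, 34.00, 41.50, 55.00]

deviations = absolute_rt_deviations(observed, reference)
mean_dev = statistics.mean(deviations)      # sensitive to outlying PSMs
median_dev = statistics.median(deviations)  # robust to outlying PSMs
```

Reporting both statistics is useful because a few badly aligned PSMs can inflate the mean while leaving the median almost unchanged.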

DART-ID increases proteome coverage of SCoPE-MS experiments
Search engines such as MaxQuant [7,8] use the similarity between theoretically predicted and experimentally measured MS2 spectra of ions to match them to peptide sequences, i.e., peptide-spectrum-matches (PSM). The confidence of a PSM is commonly quantified by the probability of an incorrect match: the posterior error probability (PEP) [21,55,56]. Since the estimation of PEP does not include RT information, we sought to update the PEP for each PSM by incorporating RT information within the Bayesian framework displayed in Fig 1c and 1d. This approach allowed us to use the estimated RT distributions for each peptide with minimal assumptions.
The Bayesian framework outlined in Fig 1c and 1d can be used with RTs estimated by other methods, and its ability to upgrade PSMs is directly proportional to the accuracy of the estimated RTs. To explore this possibility, we used our Bayesian model with RTs estimated by all methods shown in Fig 2. The updated error probabilities of PSMs indicate that all RT estimates enhance PSM discrimination, S5 Fig. Even lower accuracy RTs predicted from peptide sequence can be productively used to upgrade PSMs. However, the degree to which PSMs are upgraded, i.e. the magnitude of the confidence shift, increases with the accuracy of the RT estimates and is highest with the DART-ID reference RTs.
We refer to the PEP assigned by the search engine (MaxQuant throughout this paper) as "Spectral PEP", and after it is updated by the Bayesian model as "DART-ID PEP". Some peptides have low confidence from spectra alone, but high confidence when RT evidence is added to the spectral evidence. To visualize how these peptides are distributed across experiments, we marked them with red dashes in Fig 3b. The results indicate that the data sparsity decreases; thus DART-ID helps mitigate the missing data problem of shotgun proteomics. Fig 3b is separated into two subsets, DART-ID$_1$ and DART-ID$_2$, which correspond respectively to peptides that have at least one confident spectral PSM, and peptides whose spectral PSMs are all below the set confidence threshold of 1% FDR. While the PSMs of DART-ID$_2$ very likely represent the same peptide sequence (since by definition they share the same RT, MS1 m/z and MS2 fragments consistent with its sequence), we cannot be confident in the exact sequence assignment. Thus, they are labeled separately and their sequence assignment is further validated in the next section. The majority of PSMs whose confidence is increased by DART-ID have multiple confident Spectral PSMs, and thus reliable sequence assignment.
Analysis of newly identified peptides in Fig 3c shows that DART-ID helps identify about 50% more PSMs compared to spectra alone at an FDR threshold of 1%. This corresponds to an increase of ~30-50% in the fraction of PSMs passing an FDR threshold of 1%, as shown in the bottom panel of Fig 3c. Furthermore, the number of distinct peptides identified per experiment increases from an average of ~1000 to an average of ~1600, Fig 3d.
Fig 2 caption: [...] [30], BioLCCC [31], and ELUDE [34]. The right column displays comparisons for alignment methods: precision iRT [52], MaxQuant match-between-runs [7,8] [...]
Percolator, a widely used FDR recalculation method that incorporates peptide RTs [13], also increases identification rates, albeit to a lesser degree than DART-ID, Fig 3c and 3d. The visualizations in Fig 3a, 3c and 3d can be generated for user-supplied data by the DO-MS visualization platform [54].
We observe that DART-ID PEPs are bimodally distributed (S6 Fig), suggesting that DART-ID acts as an efficient binary classifier. Modifying error probabilities, however, does risk changing the overall false discovery rate (FDR) of the PSM set. To evaluate the effect of DART-ID on the overall FDR, we allowed the inclusion of decoy hits in both the alignment and confidence update process [55]. The results from this analysis in Fig 3e indicate that, as expected, the fraction of PSMs matched to decoys is proportional to the FDR estimated both from the Spectral PEP and from the updated DART-ID PEP. We encourage users of DART-ID to evaluate the results from applying DART-ID and other related methods on their datasets using this benchmark as well as the numerous quantitative benchmarks described in the subsequent sections.
The increase in confident PSMs when using DART-ID PEPs instead of spectral PEPs, Fig 4b, also fell into the expected range of 30-50% at 1% FDR. The increase in confident PSMs is shown in discrete terms in Fig 4c, where experiments in both the label-free and TMT-labelled sets receive thousands more confident PSMs that can then be used for further quantitative analysis.
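The decoy benchmark described above (Fig 3e) compares two quantities: an FDR estimated from the PEPs and the fraction of accepted PSMs that matched decoys. The Python sketch below illustrates one way to compute both. It assumes the common convention that the FDR of an accepted set is the mean PEP of that set, which may differ from the exact calculation behind Fig 3e; the function names and the example PSMs are hypothetical.

```python
def fdr_from_peps(peps):
    """Estimated FDR at each rank: running mean of PEPs sorted ascending."""
    running_sum = 0.0
    fdrs = []
    for rank, pep in enumerate(sorted(peps), start=1):
        running_sum += pep
        fdrs.append(running_sum / rank)
    return fdrs


def decoy_fraction(psms, pep_threshold):
    """Fraction of PSMs with PEP <= threshold that matched a decoy sequence."""
    accepted = [is_decoy for pep, is_decoy in psms if pep <= pep_threshold]
    if not accepted:
        return 0.0
    return sum(accepted) / len(accepted)


# Hypothetical PSMs as (PEP, is_decoy) pairs.
psms = [(0.001, False), (0.004, False), (0.02, False), (0.05, True), (0.5, True)]
fdrs = fdr_from_peps([pep for pep, _ in psms])
frac = decoy_fraction(psms, pep_threshold=0.05)
```

Running the same two functions on Spectral PEPs and on DART-ID PEPs gives a simple check of whether the updated error probabilities remain calibrated against decoys.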

DART-ID decreases missing datapoints
These increases of confident PSMs, in both the SCoPE-MS and bulk LC-MS/MS sets, decrease the amount of missing data per run. In Fig 5a we show qualitatively that DART-ID can fill in many of these missing values on the protein level. On the level of experimental runs, as shown quantitatively in Fig 5b, DART-ID significantly reduces the amount of missing data and mitigates the stochasticity that is inherent to data-dependent MS methods.
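The reduction in missing data can be quantified as the fraction of peptide-by-run cells that lack a confident PSM. The Python sketch below is a hypothetical illustration: the function name, the data layout, and the use of a PEP cutoff of 0.01 in place of an FDR threshold are assumptions, not the analysis behind Fig 5.

```python
def missing_fraction(peps_by_peptide, threshold=0.01):
    """Fraction of (peptide, run) cells without a confident PSM.

    peps_by_peptide maps a peptide to one PEP per run; None means that no PSM
    was recorded for that peptide in that run.
    """
    cells = [pep for peps in peps_by_peptide.values() for pep in peps]
    if not cells:
        raise ValueError("no peptide-by-run cells to evaluate")
    missing = sum(1 for pep in cells if pep is None or pep >= threshold)
    return missing / len(cells)


# Hypothetical PEPs for two peptides across three runs.
spectral = {"PEPTIDEA": [0.001, 0.2, None], "PEPTIDEB": [0.3, 0.002, 0.05]}
updated = {"PEPTIDEA": [0.001, 0.005, None], "PEPTIDEB": [0.004, 0.002, 0.6]}

before = missing_fraction(spectral)
after = missing_fraction(updated)
```

Cells with no PSM at all stay missing after the update, since the confidence update can only re-score spectra that were matched in the first place.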

Validation of PSMs upgraded by DART-ID
We next sought to evaluate whether the confident DART-ID PSMs without confident Spectral PSMs, i.e., DART-ID$_2$ from Fig 3b, are matched to the correct peptide sequences. To this end, we tested whether the RTs of such PSMs match the RTs for the corresponding peptides identified from high-quality, confident spectra. For this analysis, we split a set of experiments into two subsets, $A$ and $B$, Fig 6a. The application of DART-ID to $A$ resulted in two disjoint subsets of PSMs: $A_1$, corresponding to PSMs with confident spectra (Spectral PEP < 0.01), and $A_2$, corresponding to "upgraded" PSMs (Spectral PEP > 0.01 and DART-ID PEP < 0.01). We overlapped these subsets with PSMs from $B$ having Spectral PEP < 0.01, so that the RTs of PSMs from $B$ can be compared to the RTs of PSMs from subsets $A_1$ and $A_2$, Fig 6a. This comparison shows excellent agreement of the RTs for both subsets $A_1$ and $A_2$ with the RTs for high quality spectral PSMs from $B$, Fig 6b and 6c. This result suggests that even peptides upgraded without confident spectral PSMs are matched to the correct peptide sequences.

Validation by internal consistency
We ran DART-ID on SCoPE-MS method development experiments [2], all of which contain quantification data in the form of 11-plex tandem-mass-tag (TMT) reporter ion (RI) intensities. Out of the 10 TMT "channels", six represent the relative levels of a peptide in simulated single cells, i.e., small bulk cell lysate diluted to a single-cell level. These six single cell channels are made of T-cells (Jurkat cell line) and monocytes (U-937 cell line). We then used the normalized TMT RI intensities to validate upgraded PSMs by analyzing the consistency of protein quantification from distinct peptides.
Internal consistency is defined by the expectation that the relative intensities of PSMs reflect the relative levels of their corresponding proteins. If upgraded PSMs are consistent with Spectral PSMs for the same protein, then their relative RI intensities will have lower coefficients of variation (CV) within a protein than across different proteins [59]. CV is defined as σ/μ, where σ is the standard deviation and μ is the mean of the normalized RI intensities of PSMs. This demonstrates that the protein-specific variance in the relative quantification, due to either technical or biological noise, is preserved in these upgraded PSMs.
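The CV defined above can be computed as in the following Python sketch. The example intensities are invented for illustration, and the choice of the sample standard deviation (rather than the population standard deviation) is an assumption, since the text does not specify which one is used.

```python
import statistics


def coefficient_of_variation(values):
    """CV = sigma / mu, using the sample standard deviation."""
    mu = statistics.mean(values)
    if mu == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mu


# Hypothetical normalized RI intensities in one channel, three PSMs each.
same_protein = [1.00, 1.05, 0.95]
different_proteins = [0.40, 1.00, 2.20]

cv_within = coefficient_of_variation(same_protein)
cv_across = coefficient_of_variation(different_proteins)
```

Under the internal-consistency expectation, PSMs from the same protein should give a CV closer to `cv_within` than to `cv_across`.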

Proteins identified by DART-ID separate cell types
The upgraded PSMs from the DART-ID set are not just representative of proteins already quantified from confident spectral PSMs, but when filtering at a given confidence threshold (e.g., 1% FDR), they allow for the inclusion of new proteins for analysis. As the quantification of these new proteins from the DART-ID PSMs cannot be directly compared to that of the proteins from the Spectra PSMs, we instead compare how the new proteins from DART-ID can explain the biological differences between two cell types, T-cells (Jurkat cell line) and monocytes (U-937 cell line), present in each sample and experiment. The data was split into sets in the same manner as the previous section, as shown in Fig 7a, where the Spectra and DART-ID sets of PSMs are disjoint. We then filtered out all PSMs from DART-ID that belonged to any protein represented in Spectra, so that the sets of proteins between the two sets of PSMs were disjoint as well.
To test whether or not DART-ID identified peptides consistently across experiments, we [...]
Fig 8 caption: [...] Only peptides with less than 5% missing data were used for this analysis, and the missing data were imputed. (b) The distributions of some features of the Spectra and DART-ID PSMs differ slightly. These features include: precursor ion area, which is the area under the MS1 elution peak and reflects peptide abundance; precursor ion fraction, which reflects MS2 spectral purity; missed cleavages, which is the average number of internal lysine and arginine residues; and % missing data, which is the average fraction of missing TMT reporter ion quantitation per PSM. All distributions are significantly different, with $p < 10^{-4}$. https://doi.org/10.1371/journal.pcbi.1007082.g008
[...] cleavages, and missing data; see Fig 8b. However, the distributions of these features are largely overlapping, and the magnitude of these differences is relatively small; most spectra of DART-ID PSMs are still >90% pure, and have less than 16% missing data and missed cleavages. Of course, the intended use of DART-ID is not to separate these two groups of PSMs and analyze them separately, but instead to combine them and increase the number of data points available for analysis. Indeed, adding DART-ID PSMs to the Spectra PSMs doubles the number of differentially abundant proteins between T-cells and monocytes, Fig 9a, 9b and 9c.

Discussion
Here we present DART-ID, a new Bayesian approach that infers RTs with high accuracy and uses these accurate RT estimates to improve peptide sequence identification. We demonstrate that DART-ID can estimate and align RTs with accuracy of a few seconds for 60-minute LC-MS/MS runs and can leverage this high accuracy towards increasing the confidence in correct PSMs and decreasing the confidence in incorrect PSMs. This principled and rigorous estimation of the confidence of PSMs increases quantification coverage by 30-50%, primarily by increasing the number of experiments in which a peptide is quantified.
We validated the upgraded PSMs using methods for FDR estimation (Fig 3e), cross-validation (Fig 6), intra-protein CV validation (Fig 7), and biological signal validation (Fig 8). All of these methods strongly support the reliability of DART-ID inferences. We encourage the use of these methods for benchmarking the application of DART-ID (and any other related method) on other datasets.
DART-ID is applicable to any large set of LC-MS/MS analyses with a consistent LC setup. The more consistent the LC, the more powerful DART-ID is, since its statistical power is proportional to the accuracy of RT estimates. Our SCoPE-MS and SCoPE2 runs have highly consistent RTs [1,4,60] and motivated us to develop DART-ID. However, we found (shown in Fig 4) that DART-ID performs similarly well with bulk LC-MS/MS runs of TMT-labeled and label-free samples.
A principal advantage of DART-ID is that its probabilistic model naturally adapts to the RT reproducibility and obviates thresholds, e.g., a threshold on RT errors. Rather, DART-ID updates the confidence of each PSM using a rigorous quantitative model based on empirically derived distributions of RT reproducibility. Thus, it adapts and controls for the reproducibility of the LC and the accuracy of the RT estimates as shown in S5 Fig.
Another principal advantage of DART-ID is its ability to use all PSMs (including those with sparse observations and low confidence) to create a global RT alignment. This is possible because DART-ID alignment takes into account the confidence of PSMs as part of the mixture model in Eq 3. This results in accurate RT estimates (Fig 2) that are robust to missing data and benefit from all PSMs regardless of their identification confidence.
If the LC and RTs of a dataset are very variable, one may extend the alignment model beyond Eq 2 to capture the increased variability. The two-segment linear regression from Eq 2 demonstrated here captures more variation than a single-slope linear regression. DART-ID, however, is not constrained to these two functions and can implement any monotone function. Non-linear functions that are monotonically constrained, such as the logit function, have been implemented in our model during development. More complex models, for example monotonically-constrained generalized additive models, could increase alignment accuracy further, provided that the input data motivate the added complexity.
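To make the idea of a monotone two-segment alignment function concrete, the Python sketch below maps a reference RT to an experiment-specific RT. It is only an illustration: the parameter names (`intercept`, `slope_before`, `slope_after`, `split`) and the example values are hypothetical, and the parametrization need not match that of Eq 2.

```python
def two_segment_linear(mu, intercept, slope_before, slope_after, split):
    """Map a reference RT `mu` to an experiment-specific RT.

    The two segments join at `split`, so the function is continuous, and it
    is monotonically increasing because both slopes must be positive.
    """
    if slope_before <= 0 or slope_after <= 0:
        raise ValueError("slopes must be positive for a monotone alignment")
    if mu < split:
        return intercept + slope_before * mu
    return intercept + slope_before * split + slope_after * (mu - split)


# Hypothetical parameters for one experiment (times in minutes).
params = dict(intercept=0.5, slope_before=1.0, slope_after=1.2, split=20.0)
aligned = [two_segment_linear(mu, **params) for mu in (5.0, 20.0, 40.0)]
```

With a non-negative intercept and positive slopes, the mapped RT can never be negative for a non-negative reference RT. Any other monotone function with the same call signature could be swapped in without changing the rest of an alignment procedure.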
While DART-ID is focused on aligning and utilizing RTs from LC-MS/MS experiments, the alignment method could potentially be applied to other separation methods, including ion mobility, gas chromatography, supercritical fluid chromatography, and capillary electrophoresis. The ion drift times obtained from instruments with an ion mobility cell are particularly straightforward to align and incorporate into DART-ID's Bayesian framework. Another potential extension of DART-ID is to offline separations prior to analysis, i.e., fractionation. RT alignment would only be applicable between replicates of analogous fractions, but a more complex model could also take into account membership of a peptide to a fraction as an additional piece of evidence.
DART-ID is modular, and the RT alignment module and PEP update module may be used separately. For example, the RT estimates may be applied to increase the performance of other peptide identification methods incorporating RT evidence [14][15][16][17]. One application is integrating the inferred RT from DART-ID into the search engine score, as done by previous methods [18,19], to change the best hit for a spectrum, save a spectrum from filtering due to high score similarities (i.e., low delta score) [21], or provide evidence for hybrid spectra. Although DART-ID's alignment is based on point estimates of RT, the global alignment methodology could also be applied to feature-based alignments [6,8-10] to obviate the limitations inherent in pairwise alignments.

Data sources and experimental design
The data used for the development and validation of the DART-ID method were 263 method-development experiments for SCoPE-MS and its related projects. All samples were lysates of the Jurkat (T-cell), U-937 (monocyte), or HEK-293 (human embryonic kidney) cell lines. Samples were prepared with the mPOP sample preparation protocol, and then digested with trypsin [2]. All experiments used either 10 or 11-plex TMT for quantification. Most but not all sets followed the experimental design described in Table 1. All experiments were run on a Thermo Fisher (Waltham, MA) Easy-nLC system with a Waters (Milford, MA) 25cm x 75μm, 1.7μm BEH column with 130Å pore diameter, and analyzed on a Q-Exactive (Thermo Fisher) mass spectrometer. Gradients were run at 100 nL/min from 5-35%B in 48 minutes with a 12 minute wash step to 100%B. Solvent composition was 0% acetonitrile for A and 80% acetonitrile for B, with 0.1% formic acid in both. A subset of later experiments included the use of a trapping column, which extended the total run-time to 70 minutes. Detailed experimental designs and mass spectrometer parameters of each run can be found in S1 Table. All Thermo .RAW files are publicly available online. More details on sample preparation and analysis methods can be found in the mPOP protocol [2].

Searching raw MS data
Searching was done with MaxQuant v1.6.1.0 [7] against a UniProt protein sequence database with 443722 entries. The database contained only SwissProt entries and was downloaded on 5/1/2018. Searching was also done on a contaminant database provided by MaxQuant, which contained common laboratory contaminants and keratins. MaxQuant was run with trypsin specificity, allowing two missed cleavages, and with methionine oxidation (+15.99492 Da) and protein N-terminal acetylation (+42.01056 Da) as variable modifications. No fixed modifications apart from TMT were specified. TMT was searched using the "Reporter ion MS2" quantification setting on MaxQuant, which searches for the TMT addition on lysine and the N-terminus with a 0.003 Da tolerance. Observations were selected at a false discovery rate (FDR) of 100% at both the protein and PSM level to obtain as many spectrum matches as possible, regardless of their match confidence. All raw MS files, MaxQuant search parameters, the sequence database, and search outputs are publicly available online.

Data filtering
Only a subset of the input data is used for the alignment of experiments and the inference of RT distributions for peptides. First, decoys and contaminants are filtered out of the set. Contaminants may be problematic for RT alignment since their retention may be poorly defined, e.g., they may be poorly chromatographically resolved. Then, observations are selected at a threshold of PEP < 0.5. Observations are additionally filtered through a threshold of retention length, which is defined by MaxQuant as the range of time between the first matched scan of the peptide and the last matched scan. Any peptide with retention length > 1 min for a 60 min run is deemed to have too wide of an elution peak, or chromatography behavior more consistent with contaminants than retention on column. In our implementation, this retention length threshold can be set as a static number or as a fraction of the total run-time, i.e., (1/60) of the gradient length.
For our data, only peptide sequences present in 3 or more experiments were allowed to participate in the alignment process. The model can allow peptides only present in one experiment to be included in the alignment, but the inclusion of this data adds no additional information to the alignment and only serves to slow it down computationally. The definition of a peptide sequence in these cases is dynamic, and can include modifications, charge states, or any other feature that would affect the retention of an isoform of that peptide. For our data, we used the peptide sequence with modifications but did not append the charge state.
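The filtering steps above can be sketched as follows. This is a minimal illustration, not the DART-ID implementation: the dictionary keys (`sequence`, `experiment`, `pep`, `retention_length`, `is_decoy`, `is_contaminant`) and the function name are hypothetical, and counting experiments after the other filters have been applied is an assumption.

```python
from collections import defaultdict


def filter_for_alignment(psms, pep_max=0.5, max_retention_length=1.0,
                         min_experiments=3):
    """Return the PSMs that would take part in the alignment.

    Each PSM is a dict with hypothetical keys: 'sequence', 'experiment',
    'pep', 'retention_length' (minutes), 'is_decoy', 'is_contaminant'.
    """
    # Drop decoys, contaminants, low-confidence PSMs, and wide elution peaks.
    kept = [
        p for p in psms
        if not p["is_decoy"]
        and not p["is_contaminant"]
        and p["pep"] < pep_max
        and p["retention_length"] <= max_retention_length
    ]
    # Count the distinct experiments in which each sequence survives
    # (assumption: the count is taken after the filters above).
    experiments = defaultdict(set)
    for p in kept:
        experiments[p["sequence"]].add(p["experiment"])
    return [p for p in kept
            if len(experiments[p["sequence"]]) >= min_experiments]
```

To express the retention length threshold as a fraction of the run, one could pass, for example, `max_retention_length=run_time / 60`, where `run_time` is the gradient length in minutes.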
Preliminary alignments revealed certain experiments where chromatography was extremely abnormal, or where peptide identifications were too sparse to enable an effective alignment. These experiments were manually removed from the alignment procedure after a preliminary run of DART-ID. From the original 263 experiments, 37 had all of their PSMs pruned, leaving 226 experiments containing PSMs with updated confidences. The 37 pruned experiments are included in the DART-ID output but do not receive any updated error probabilities, as they did not participate in the RT alignment. All filtering parameters are publicly available as part of the configuration file that was used to generate the data used in this paper.

Global alignment model
Let ρ ik be the RT assigned to peptide i in experiment k. In order to infer peptide and experiment-specific RT distributions, we assume that there exists a set of reference retention times, μ i , for all peptides i. Each peptide has a unique reference RT, independent of experiment. We posit that for each experiment, there is a simple monotone increasing function, g k , that maps the reference RT to the predicted RT for peptide i in experiment k. An observed RT can then be expressed as in Eq 1. As a first approximation, we assume that the observed RTs for any experiment can be well approximated using a two-segment linear regression model as described by Eq 2. This model can be extended to more complex monotonic models, such as spline fitting, or non-linear monotonic models, such as a logit function or LOESS.
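A two-segment linear alignment function of the kind described above might look like the following sketch. The names `beta0` and `beta1` echo the alignment parameters named in the text; the second slope `beta2`, the split point `split`, and the continuity constraint at the split are assumptions made for illustration.

```python
def two_segment_rt(mu, beta0, beta1, beta2, split):
    """Map a reference RT `mu` to a predicted RT for one experiment.

    The function is a continuous, piecewise-linear curve with slope `beta1`
    below `split` and slope `beta2` above it (hypothetical parameter names).
    """
    if mu < split:
        return beta0 + beta1 * mu
    # Continue from the value reached at the split point.
    return beta0 + beta1 * split + beta2 * (mu - split)


def is_monotone_increasing(beta1, beta2):
    """Both slopes must be positive for the mapping to be monotone increasing."""
    return beta1 > 0 and beta2 > 0
```

Because both segments share the value at `split`, positive slopes are enough to make the mapping invertible, which is what allows observed RTs to be mapped back to the reference space.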
To factor in the spectral PEP given by the search engine, and to allow for the inclusion of low probability PSMs, the marginal likelihood of an RT in the alignment process can be described using a mixture model as described in S1 Fig. For a PSM assigned to peptide i in experiment k the RT density is
\[ \underbrace{P(\rho_{ik} \mid \mu_{ik}, \sigma_{ik}, \lambda_{ik})}_{\text{Likelihood}} \propto \mathbb{1}\{\rho_{ik} > 0\}\, \underbrace{(1 - \lambda_{ik}) \times f_{ik}(\rho_{ik} \mid \mu_{ik}, \sigma_{ik})}_{\text{PSM is correct}} + \underbrace{(\lambda_{ik}) \times f^{0}_{k}(\rho_{ik})}_{\text{PSM is incorrect}} \]
where λ ik is the error probability (PEP) for the PSM returned by MaxQuant, f ik is the inferred RT density for peptide i in experiment k and $f^0_k$ is the null RT density. In our implementation, we let:
\[ f_{ik} \sim \text{Laplace}(\mu_{ik}, \sigma_{ik}^2), \qquad f^{0}_{k} \sim \text{Normal}(\mu_{k}, \sigma_{k}^2) \]
where μ k is approximately the average of all RTs in experiment k and $\sigma_k^2$ is their variance, which we found worked well in practice (See S4 Fig). However, our framework is modular and it is straightforward to utilize different residual RT and null distributions if appropriate. For example, with non-linear gradients that generate a more uniform distribution of peptides across the LC run [22], it may be sensible for the null distribution to be defined as uniformly distributed, i.e. $f^0_k \sim \text{Uniform}(\text{RT}_{\min}, \text{RT}_{\max})$. Finally, to reflect the fact that residual RT variation increases with mean RT and varies between experiments (S3 Fig), we model the standard deviation of a peptide RT distribution, σ ik , as a linear function of the reference RT:
\[ \sigma_{ik} = a_k + b_k \mu_i \]
where μ i is the reference RT of the peptide sequence, and a k and b k are the intercept and slope which we infer for each experiment. a k , b k and μ i are constrained to be positive, and hence σ ik > 0 as well.
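As an illustration of the mixture likelihood, the sketch below evaluates the RT density for a single PSM, assuming a Laplace density for correct matches and a normal density for the null, as named in the confidence update section. Treating σ ik as the Laplace scale parameter, returning zero density for non-positive RTs, and all function names are assumptions for this example.

```python
import math


def laplace_pdf(x, loc, scale):
    """Laplace density with location `loc` and scale `scale`."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


def normal_pdf(x, mean, sd):
    """Normal density with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def rt_sd(mu_ref, a_k, b_k):
    """RT spread as a linear function of the reference RT."""
    return a_k + b_k * mu_ref


def mixture_rt_density(rt, pep, mu_ik, sigma_ik, null_mean, null_sd):
    """Mixture of the correct-match and null RT densities, weighted by PEP."""
    if rt <= 0:
        # RTs are positive; non-positive input gets zero density (assumption).
        return 0.0
    correct = (1.0 - pep) * laplace_pdf(rt, mu_ik, sigma_ik)
    incorrect = pep * normal_pdf(rt, null_mean, null_sd)
    return correct + incorrect
```

A PSM with a PEP near 0 contributes almost entirely through the narrow Laplace term, whereas a PSM with a PEP near 1 contributes mainly through the broad null, so low-confidence PSMs can be included without dominating the alignment.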
To estimate all unknown parameters, we consider the joint posterior distribution of the experiment-specific alignment parameters and the reference RTs given the observed retention times,
\[ P(a, b, \beta_0, \beta_1, s, \mu \mid \rho) \propto P(\rho \mid a, b, \beta_0, \beta_1, s, \mu)\, P(a, b, \beta_0, \beta_1, s, \mu) \]
where P (a, b, β 0 , β 1 , s, μ) are the prior distributions for all unknown alignment parameters and reference RTs and P (ρ | a, b, β 0 , β 1 , s, μ) is the likelihood, as determined by the mixture model above. a, b, β 0 , β 1 , s are all K-vectors of alignment parameters for each experiment. μ consists of the reference RTs for every peptide.
The priors for the Bayesian inference can be found in the .stan model files. For the analyses in this paper, they include the following prior on the reference RT:
Reference RT: μ i ~ Normal(RT mean , RT sd )
where RT mean and RT sd are the mean and standard deviation of all RTs across all experiments, respectively. max(RT) is the maximum observed RT of all RTs across all experiments. These priors were chosen for groups of 60 min LC-MS/MS runs, and can be adjusted accordingly for different run lengths, gradient shapes, and groupings of runs with different run times.

Alignment comparison
We compared the DART-ID alignment accuracy against five other RT prediction or alignment algorithms. As some methods returned absolute predicted RTs (such as BioLCCC [31]) and others returned relative hydrophobicity indices (such as SSRCalc [30]), a linear regression was built for each prediction method. Alignment accuracy was evaluated using three metrics: R 2 (the squared Pearson correlation), and the mean and median of |ΔRT|, the absolute value of the residual RT, defined as |Observed RT − Predicted RT|. We selected only confident PSMs (PEP < 0.01) for this analysis, and used data that consisted of 33383 PSMs from 46 LC-MS/MS experiments run over the course of 90 days in order to produce more chromatographic variation. A list of these experiments is found in S1 Table.
SSRCalc [30] was run from SSRCalc Online (http://hs2.proteome.ca/SSRCalc/SSRCalcQ.html), with the "100Å C18 column, 0.1% Formic Acid 2015" model, "TMT" modification, and "Free Cysteine" selected. No observed RTs were inputted along with the sequences.
BioLCCC [31] was run online from http://www.theorchromo.ru/ with the parameters of 250mm column length, 0.075mm column inner diameter, 130Å packing material pore size, 5% initial concentration of component B, 35% final concentration of component B, 48 min gradient time, 0 min delay time, 0.0001 ml/min flow rate, 0% acetonitrile concentration in component A, 80% acetonitrile concentration in component B, "RP/ACN+FA" solid/mobile phase combination, and no cysteine carboxyaminomethylation. As BioLCCC could only take in one gradient slope as the input, all peptides with observed RT > 48 min were not inputted into the prediction method.
ELUDE [34] was downloaded from the percolator releases page https://github.com/percolator/percolator/releases, version 3.02.0, Build Date 2018-02-02. The data were split into two equal sets with distinct peptide sequences to form the training and test sets. The elude program was run with the --no-in-source and --test-rt flags. Predicted RTs from ELUDE were obtained from the testing set only, and training set RTs were not used in further analyses.
For iRT [43], the same raw files used for the previous sets were searched with the Pulsar search engine [61], with iRT alignment turned on and filtering at 1% FDR. From the Pulsar search results, only peptide sequences in common with the previous set searched in MaxQuant were selected. Predicted RT was taken from the "PP.RTPredicted" column and plotted against the empirical RT column "PP.EmpiricalRT". Empirical RTs were not compared between those derived from MaxQuant and those derived from Pulsar.
MaxQuant match-between-runs [7,8] was run by turning the respective option on when searching over the set of 46 experiments, and given the options of 0.7 min match time tolerance and a 20 min match time window. The "Calibrated retention time" column was used as the predicted RT, and these predicted RTs were related to observed RTs with a linear model for each experiment run.
For DART-ID, predicted RTs are the same as the mean of the inferred RT distribution, and no linear model was constructed to relate the predicted RTs to the observed RTs.
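The three accuracy metrics used in this comparison can be computed as in the sketch below. The function name and the returned dictionary keys are illustrative and not part of DART-ID; the inputs are assumed to be paired lists of observed and predicted RTs in the same units.

```python
from statistics import mean, median


def alignment_metrics(observed, predicted):
    """Return R^2 (squared Pearson correlation) and mean/median |dRT|."""
    abs_resid = [abs(o - p) for o, p in zip(observed, predicted)]
    mean_obs, mean_pred = mean(observed), mean(predicted)
    cov = sum((o - mean_obs) * (p - mean_pred)
              for o, p in zip(observed, predicted))
    var_obs = sum((o - mean_obs) ** 2 for o in observed)
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return {
        "r2": cov * cov / (var_obs * var_pred),
        "mean_abs_residual": mean(abs_resid),
        "median_abs_residual": median(abs_resid),
    }
```

The median residual is less sensitive than the mean to a small number of badly misaligned PSMs, which is one reason to report both.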

Comparison to linear alignment model
To compare the performance of the two-piece linear model for RT alignment against a simple linear model, we ran both alignments separately on the same dataset as described in the RT alignment comparison section. For S2 Fig, we used one experiment, 180324S_QC_SQC69A, as an example to illustrate the qualitative differences between the two models. Panels b and c used all experiments from the set to give a more quantitative comparison.

Confidence update
We update the confidence for PSM i in experiment k according to Bayes' theorem. Let δ ik = 1 denote that PSM i in experiment k is assigned to the correct sequence (true positive), let δ ik = 0 denote that the PSM is assigned to the incorrect sequence (a false positive), and, as above, let ρ ik be an observed RT assigned to peptide i. At a high level, the probability that the peptide assignment is a true positive is
\[ P(\delta_{ik} = 1 \mid \rho_{ik}) = \frac{P(\rho_{ik} \mid \delta_{ik} = 1)\, P(\delta_{ik} = 1)}{P(\rho_{ik})} \]
Each term is described in more detail below.
The confidence update depends on the global alignment parameters. Let θ consist of the global alignment parameters and reference RTs, i.e. β 0k , β 1k , σ ik and μ i . If θ were known, then the Bayesian update could be computed in a straightforward manner as described above. In practice the alignment parameters are not known and thus must be estimated using the full set of observed RTs across all experiments, ρ. The PSM confidence update can be expressed unconditional on θ, by integrating over the uncertainty in the estimates of the alignment parameters:
\[ P(\delta_{ik} = 1 \mid \rho) = \int P(\delta_{ik} = 1 \mid \rho_{ik}, \theta)\, p(\theta \mid \rho)\, d\theta \]
Although we can estimate this posterior distribution using Markov Chain Monte Carlo (MCMC), it is prohibitively slow given the large number of peptides and experiments that we analyze. As such, we compute maximum a posteriori (MAP) estimates for the reference RTs μ i , alignment parameters β 0k , β 1k , and RT standard deviation σ ik using an optimization routine implemented in STAN [62]. If computation time is not a concern, it is straightforward to generate posterior samples in our model by running MCMC sampling in STAN, instead of MAP optimization. MAP optimization is computationally efficient but is limited in that parameter uncertainty quantification is not automatic.
To address this challenge, we incorporate estimation uncertainty using a computationally efficient procedure based on the parametric bootstrap. Note that uncertainty about the alignment parameters β 0k and β 1k is small since they are inferred using thousands of RT observations per experiment. By contrast, the reference RTs, μ i , have much higher uncertainty since we observe at most one RT associated with peptide i in each experiment (usually far fewer). As such, we choose to ignore uncertainty in the alignment parameters and focus on incorporating uncertainty in estimates of μ i .
Let $\hat{\mu}_{ik}$ and $\hat{\sigma}_{ik}$ denote the MAP estimates of the location and scale parameters for the RT densities. To approximate the posterior uncertainty in the estimates of μ i , we use the parametric bootstrap. First, we sample $\rho^{(b)}_{ik}$ from $f_{ik}(\rho_{ik} \mid \hat{\mu}_{ik}, \hat{\sigma}_{ik})$ with probability 1 − λ ik and from $f^0_k(\rho_{ik})$ with probability λ ik . We then map $\rho^{(b)}_{ik}$ back to the reference space using the inferred alignment parameters as $\hat{g}^{-1}(\rho^{(b)}_{ik})$ and compute a bootstrap replicate of the reference RT associated with peptide i as the median (across experiments) of the resampled RTs:
\[ \mu^{(b)}_{i} = \operatorname{median}_{k}\; \hat{g}^{-1}\big(\rho^{(b)}_{ik}\big), \]
as the maximum likelihood estimate of the location parameter of a Laplace distribution is the median of independent observations. For each peptide we repeat this process B times to get several bootstrap replicates of the reference RT for each peptide. We use the bootstrap replicates to incorporate the uncertainty of the reference RTs into the Bayesian update of the PSM confidence, approximating the confidence update in Eq 10 with these replicates. This process is depicted in S8 Fig.
In addition to updating the PEPs for each PSM, DART-ID also recalculates the set-wide false discovery rate (FDR, q-value). This is done by first sorting the PEPs and then assigning the q-value to be the cumulative sum of PEPs at that index, divided by the index itself, to give the fractional expected number of false positives at that index (i.e., the mean PEP) [56].
Terms of the confidence update:
δ ik : An indicator for whether or not the peptide sequence assignment, i in experiment k, is correct (i.e. a true or false positive).
P(δ ik = 1 | ρ ik ): The posterior probability that the PSM is assigned to the right sequence, given the observed RT, ρ ik .
P(ρ ik | δ ik = 1): The RT density for peptide i in experiment k given the assignment is correct (true positive). Conditional on the alignment parameters, the true positive RT density f ik (ρ ik | μ ik , σ ik ) is $\text{Laplace}(\mu_{ik}, \sigma_{ik}^2)$. In our implementation, we incorporate uncertainty in the estimation of the alignment parameters with a parametric bootstrap, explained in more detail in the text and in S8 Fig.
P(ρ ik | δ ik = 0): The RT density given the assignment is incorrect (false positive). We assume that a false positive match is assigned to a peptide at random and thus take $f^0_k(\rho_{ik})$ to be a broad distribution reflecting variation in all RTs in experiment k. We model this distribution as $\text{Normal}(\mu_k, \sigma_k^2)$, where μ k is approximately the average of all RTs in the experiment and $\sigma_k^2$ is the variance in RTs.
P(δ ik = 1): The prior probability that the PSM's assigned sequence is correct, i.e. one minus the posterior error probability (PEP) provided by MaxQuant, 1 − λ ik .
P(ρ ik ): The marginal likelihood for observing the RT assigned to peptide i in experiment k. By the law of total probability, this is simply the mixture density from Eq 5.
https://doi.org/10.1371/journal.pcbi.1007082.t003
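The bootstrap, Bayesian update, and q-value steps of this section can be sketched together as follows. This is a hedged illustration rather than the DART-ID code: the dictionary keys, function names, and the use of a caller-supplied `inverse_align` callable are hypothetical, σ is treated as the Laplace scale parameter, and how the bootstrap replicates enter the final update is left to the caller.

```python
import random
from statistics import median


def bootstrap_reference_rts(observations, n_boot=100, seed=0):
    """Bootstrap replicates of one peptide's reference RT.

    `observations` holds one dict per experiment with hypothetical keys:
    'pep', 'mu', 'sigma' (MAP location and scale), 'null_mean', 'null_sd',
    and 'inverse_align' (callable mapping an RT to the reference space).
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        aligned = []
        for obs in observations:
            if rng.random() < obs["pep"]:
                # With probability PEP, draw from the null (normal) density.
                rt = rng.gauss(obs["null_mean"], obs["null_sd"])
            else:
                # Difference of two exponentials with mean sigma is Laplace.
                rate = 1.0 / obs["sigma"]
                rt = obs["mu"] + rng.expovariate(rate) - rng.expovariate(rate)
            aligned.append(obs["inverse_align"](rt))
        replicates.append(median(aligned))
    return replicates


def updated_pep(pep, density_correct, density_null):
    """Bayes update: return 1 - P(correct | RT) from the two RT densities."""
    numerator = (1.0 - pep) * density_correct
    evidence = numerator + pep * density_null
    if evidence == 0.0:
        return pep  # no RT information: keep the spectral PEP (assumption)
    return 1.0 - numerator / evidence


def q_values(peps):
    """q-value of each PSM: mean of all PEPs at or below its sorted rank."""
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    q = [0.0] * len(peps)
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        running += peps[idx]
        q[idx] = running / rank
    return q
```

In `updated_pep`, an observed RT that is much more likely under the peptide's own RT density than under the null lowers the PEP, while an RT far from the expected value raises it.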

TMT reporter ion intensity normalization
Reporter ion (RI) intensities were obtained by selecting the tandem-mass-tag (TMT) 11-plex labels in MaxQuant, for both attachment possibilities of lysine and the peptide N-terminus, and with a mass tolerance of 0.003 Da. Data from different experiments and searches are all combined into one matrix, where the rows are observations (PSMs) and the 10 columns are the 10 TMT channels. Observations are filtered at a confidence threshold, normally 1% FDR, and observations with missing data are thrown out.
Before normalization, empty channels 127N, 128C, and 131C are removed from the matrix. Each column of the matrix is divided by the median of that column, to correct for the total amount of protein in each channel, pipetting error, and any biases between the respective TMT tags. Then, each row of the matrix is divided by the median of that row, to obtain the relative enrichment between the samples in the different TMT channels. In our data the relative enrichment was between the two cell types present in our SCoPE-MS sets, T-cells (Jurkat cell line) and monocytes (U-937 cell line).
Assuming that the relative RI intensities of PSMs are representative of their parent peptide, the peptide intensity can be estimated as the median of the RI intensities of its constituent PSMs. Similarly, if protein levels are assumed to correspond to the levels of its constituent peptides, then protein intensity can be estimated as the median of the intensities of its constituent peptides. The previous steps of RI normalization make all peptide and protein-level quantitation relative between the conditions in each channel.
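The column-then-row median normalization and the median roll-up described above can be sketched as follows. The function names are illustrative, and the input is assumed to be a complete matrix (PSMs on rows, retained TMT channels on columns) with no missing or zero values.

```python
from statistics import median


def normalize_reporter_ions(matrix):
    """Divide each column by its median, then each row by its median."""
    n_cols = len(matrix[0])
    col_medians = [median(row[j] for row in matrix) for j in range(n_cols)]
    by_column = [[row[j] / col_medians[j] for j in range(n_cols)]
                 for row in matrix]
    normalized = []
    for row in by_column:
        row_median = median(row)
        normalized.append([value / row_median for value in row])
    return normalized


def collapse_by_median(rows, labels):
    """Collapse rows sharing a label (e.g. PSMs of one peptide) by median."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: [median(column) for column in zip(*members)]
            for label, members in groups.items()}
```

Applying `collapse_by_median` once with peptide labels and again with protein labels gives the peptide-level and protein-level estimates in turn.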

Principal component analysis
For the principal component analysis as shown in Fig 8a, data was filtered and normalized in the same manner as discussed previously. Additional experiments were manually removed from the set due to different experimental designs or poorer overall coverage that would have required additional imputation on that experiment's inclusion.
PSMs were separated into two sets, as described in Fig 7a: Spectra and DART-ID. PSMs in the DART-ID set belonging to any parent protein in the Spectra set were filtered out, so that the two PSM sets contained no shared proteins. Additionally, proteins that were not observed in at least 95% of the selected experiments were removed in order to reduce the amount of imputation required.
Normalized TMT quantification data was first collapsed from PSM-level to peptide-level by averaging (mean) PSM measurements for the same peptide. This process was repeated to estimate protein-level quantitation from peptide-level quantitation. This data, from both sets, was then reshaped into an expression matrix, with proteins on the rows and "single cells" (TMT channel-experiment pairs) on the columns. As described earlier in the Results section, these samples are not actual single cells but are instead comprised of cell lysate at the expected abundance of a single cell; see Table 1.
Missing values in this expression matrix were imputed with the k-nearest-neighbors (kNN) algorithm, with Euclidean distance as the similarity measure and k set to 5. A similarity matrix was then derived from this expression matrix by correlating (Pearson correlation) the matrix with itself. Singular value decomposition (SVD) was then performed on the similarity matrix to obtain the principal component loadings. These loadings are the left singular vectors (the columns of U in the SVD $UDU^T$). Each circle was then colored based on the type of the corresponding cell from annotations of the experimental designs.
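The similarity-matrix step can be illustrated with the sketch below, which assumes an already imputed expression matrix (proteins on rows, samples on columns). Instead of a full SVD, it uses power iteration to recover only the leading component; for a symmetric positive semi-definite correlation matrix this matches the first left singular vector up to sign. The function names are illustrative, and the imputation step is not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def correlation_matrix(expression):
    """Sample-by-sample Pearson correlations (samples are the columns)."""
    columns = list(zip(*expression))
    return [[pearson(a, b) for b in columns] for a in columns]


def leading_component(similarity, n_iter=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    # Start from a unit vector; the first column of a correlation matrix
    # has a 1 on the diagonal, so the first product is never all zeros.
    v = [1.0] + [0.0] * (len(similarity) - 1)
    for _ in range(n_iter):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in similarity]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v
```

Samples of the same cell type are expected to receive loadings of the same sign on the leading component when cell type is the dominant source of variation.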

Protein inference
Our raw data was searched with both the PSM and protein FDR threshold set, in the search engine, to 100% to include as many PSMs as possible. Therefore, once PSM confidences were updated with RT evidence, we needed to propagate those new confidences to the protein level in order to avoid any spurious protein identifications from degenerate peptide sequences [63]. This is especially pertinent as many of the new DART-ID PSMs support proteins with no other confidently identified peptides, S9 Fig. Ideally we would run our updated PSMs back through our original search engine pipeline (MaxQuant/Andromeda) [7,21], but that is currently not possible due to technical restrictions.
Any interpretation of the DART-ID data on the protein-level was first run through the Fido protein inference algorithm [64], which gives the probability of the presence of a protein in a sample given the pool of observed peptides and the probabilities of their constituent PSMs. The Python port of Fido was downloaded from https://noble.gs.washington.edu/proj/fido and modified to be compatible with Python 3. The code was directly interfaced into DART-ID and is available to run as a user option.
For the data in this paper, protein-level analyses first had their proteins filtered at 1% FDR, where the FDR was derived from the probabilities given to each protein by the Fido algorithm. We ran Fido with the default parameters gamma: 0.5, alpha: 0.1, beta: 0.01, connected protein threshold: 14, protein grouping and using all PSMs set to false, and pruning low scores set to true.

Application to other datasets
In Fig 4 we

Implementation
The DART-ID pipeline is roughly divided into three parts. First, input data from search engine output files are converted to a common format, and PSMs unsuitable for alignment are marked for removal. Second, we estimate the alignment parameters and reference RTs by finding the maximum of the posterior distribution (Eq 4). Initial values for the algorithm are generated by running a simple estimation of reference RTs and linear regression parameters for f ik for each experiment. Third, inferred alignment parameters and reference RTs are used to update the confidence for the PEP of a PSM.
The model was implemented using the STAN modeling language [62]. All densities were represented on the log scale. STAN was interfaced into an R script with rstan. STAN was used with its optimizing function, which gave maximum a posteriori (MAP) estimates of the parameters, as opposed to sampling from the full posterior. R was further used for data filtering, PEP updating, model adjustment, and figure creation. The code is also ported to Python3 and pystan, and is available as a pip package dart_id that can be run from the command-line. DART-ID is run with a configuration file that specifies inputs and options. All model definitions and related parameters such as distributions are defined in a modular fashion, which supports the addition of other models or fits. Full instructions for using the Python program are available at https://dart-id.slavovlab.net.
Code for analysis and figure generation is available at: github.com/SlavovLab/DART-ID_2018. The Python program for DART-ID, as well as instructions for usage and examples, are available on GitHub as a separate repository: https://github.com/SlavovLab/DART-ID. All raw files, searched data, configuration files, and analyzed data are publicly available and deposited on MassIVE (ID: MSV000083149) and ProteomeXchange (ID: PXD011748).

Supporting information
S1 File. DART-ID Post-run Report. An optional HTML report generated by the dart_id Python script. The report gives a summary of the alignment for each experiment, as well as a broad overview of the performance of the run as a whole, by showing aggregate increases in PSMs at a chosen confidence threshold. (ZIP)
S1 Table. SCoPE-MS and mPOP Experimental Designs. An Excel spreadsheet of the experimental designs of all raw files. Included are parameters for the liquid chromatography and parameters for the mass spectrometer. Also specified is the TMT channel layout for each experiment, with labels for J (T-cells, Jurkat cell line), U (monocytes, U-937 cell line), and H (human embryonic kidney cells, HEK-293 cell line). (XLSX)
S2 Table. Mappings of raw files to figures. An Excel spreadsheet providing a map that relates figures/analyses to raw files listed in S1 Table. TRUE denotes that the figure/analysis used that raw file, whereas FALSE denotes that it did not. (XLSX)
S1 Fig. Mixture model incorporates spectral confidence to estimate likelihood of observing RTs. In the global alignment process, the likelihood of the alignment function and the reference RT is estimated from a mixture model, which combines the two possibilities of whether the peptide is assigned the correct or incorrect peptide sequence. These two distributions are then weighted by the error probability (PEP). This is similar to the update process, which updates the error probability and incorporates the previous error probability, as well as the two conditional probability distributions.
evidence about that peptide sequence across many different experiments. "Aligned RT" is the RT after applying the alignment function, and "Std" is the inferred RT standard deviation for the peptide in the given experiment. (b) For each RT observation for a sequence in an experiment, we infer two distributions: one corresponding to RT density given a correct PSM and the other to an incorrect PSM match. These densities are weighted by the 1-PEP and the PEP respectively and summed to produce the marginal RT distribution. (c) The marginal RT distribution is then used to sample B bootstrap replicates of the observed RTs. Each bootstrapped RT is then used to construct a bootstrapped reference RT for a given sequence. The reference RT is the median of the resampled RTs (in the aligned space).
FDR. "DART-ID new proteins" indicates PSMs boosted to below 1% FDR that have different protein assignments from "Spectra", i.e., this set of proteins and the "Spectra" set of proteins are disjoint. "DART-ID all proteins" contains all PSMs with updated DART-ID FDR < 1% regardless of protein assignment. All PSMs are filtered at < 1% FDR at the protein level. (TIF)
Writing - review & editing: Nikolai Slavov.