DeMix-Q: Quantification-Centered Data Processing Workflow*

For historical reasons, most proteomics workflows focus on MS/MS identification but consider quantification as the end point of a comparative study. The stochastic data-dependent MS/MS acquisition (DDA) gives low reproducibility of peptide identifications from one run to another, which inevitably results in problems with missing values when quantifying the same peptide across a series of label-free experiments. However, the signal from the molecular ion is almost always present among the MS1 spectra. Contrary to what is frequently claimed, missing values do not have to be an intrinsic problem of DDA approaches that perform quantification at the MS1 level. The challenge is to perform sound peptide identity propagation across multiple high-resolution LC-MS/MS experiments, from runs with MS/MS-based identifications to runs where such information is absent. Here, we present a new analytical workflow DeMix-Q (https://github.com/userbz/DeMix-Q), which performs such propagation that recovers missing values reliably by using a novel scoring scheme for quality control. Compared with traditional workflows for DDA as well as previous DIA studies, DeMix-Q achieves deeper proteome coverage, fewer missing values, and lower quantification variance on a benchmark dataset. This quantification-centered workflow also enables flexible and robust proteome characterization based on covariation of peptide abundances.

Label-free quantification (LFQ) is one of the most efficient approaches for quantifying proteome differences between multiple states of a biological system. LFQ aims to reproducibly identify and quantify peptides through multiple liquid-chromatography-coupled tandem mass spectrometry (LC-MS/MS) experiments. In the popular data-dependent acquisition (DDA) approach named Top-N DDA, the appearance of a peptide-like signal in a "survey" mass spectrum triggers a tandem mass spectrometry (MS/MS) event, targeting the N most-abundant precursor ions. Previous studies have shown that, due to the limited speed of a mass spectrometer, the majority of peptide ions detected in MS1 are not targeted in MS/MS, especially when a nonfractionated complex sample is analyzed (1,2). This low sampling efficiency (<50%), combined with the stochastic nature of precursor selection and a limited efficiency of MS/MS identification (<70%) (3), frequently causes the absence of MS/MS identification for an individual peptide in some LC-MS/MS experiments ("runs") within a larger dataset, even when replicate measurements are made (4). This deficiency is known as the missing value problem in LFQ. The problem significantly limits the size of the DDA-acquired proteomics dataset across which reliable quantification can be made for each protein (5,6).
One of the causes of the missing value problem is the traditional focus on identifying a peptide as opposed to quantifying it. For historical reasons, peptide sequence identification has been considered the focal point and the most important step in the whole proteomics procedure, while quantification came as almost an afterthought (7,8). This dominant proteomics paradigm can be characterized as the identification-centered approach, also known as a spectrum-centric approach (9). Only gradually has the missing value problem been recognized as one of the biggest drawbacks of the DDA approach (4,5). To address the reproducibility issue in MS/MS identification, several alternative data acquisition strategies have been suggested, including targeted (10) and semi-targeted (11,12) approaches. However, none of the improved DDA strategies has come close to data-independent acquisition (DIA) in solving the missing value problem (13,14). DIA, however, typically provides somewhat lower depth and breadth of proteome coverage than the DDA methods.
In our opinion, the DDA-associated missing value problem is caused by the sequential execution of two independent processes: peptide identification by MS/MS and its quantification by MS1. At first glance, performing MS1-based quantification simultaneously with MS/MS identification should provide an obvious solution to the missing value problem. Since MS1 spectra contain many more peptide ions than are selected for MS/MS in DDA (or identified in DIA), the peptide's mass information is practically always present when an identification is available (15). This information comes in several domains: accurate position on the m/z scale of the monoisotopic and other isotopic mass peaks and the position and the abundance profile on the retention time scale of the extracted ion trace for the above peaks, as well as the charge-state distribution of the analyte ions. Using this information, one can, at least in principle, identify the peptide ion even when MS/MS data are of low quality (16) or entirely missing (17).
In practice, during the last decade, features including monoisotopic m/z and retention time have been used for peptide identification, a strategy referred to as accurate mass and time tags (18-20). The accurate mass and time tag performs the feat of "peptide identity propagation" (PIP) from the LC-MS/MS runs with valid MS/MS information to those runs where such information is lacking. Today, one or another variant of the accurate mass and time tag-based PIP is employed in many MS1-based LFQ algorithms for analyzing DDA data (6,21). MS feature matching (22,23) and targeted extraction of ion chromatograms (XIC) (24,25) represent examples of such variants. Although these algorithms are not free from certain generic drawbacks (26), they allow large-scale comparison of DDA-analyzed complex proteomes (23,27,28).
However, conventional PIP approaches only alleviate, but do not fully solve, the missing value problem. Given a large enough sample cohort and set of analyzed peptides, missing values will be encountered. One might consider DIA as an alternative, but the process of signal extraction from DIA data is not fundamentally different from PIP. Therefore, while DIA reduces the occurrence of missing values, they are not completely absent in such data (29). The problem is, however, much more pronounced with DDA, regardless of which PIP procedure is used. On the other hand, DDA tends to identify more peptides and give deeper proteome coverage from the same sample than DIA, which is easy to understand, given the burden of peptide identification in DIA from severely convoluted data (2). As comparative proteomics datasets become larger, the impact of the missing values becomes progressively worse. Resorting to imputations (i.e. qualified guesses) (30) cannot be considered satisfactory unless no other approaches are available.
Is all information stored in survey (MS1) mass spectra fully recovered by conventional PIP algorithms? It appears that some information domains have not yet been fully used, especially the peptide abundances. This fact reflects today's dominance of the identification-centered approach to proteomics. It has become a natural part of every modern proteomics study to report the false discovery rate (FDR) of its lists of identifications, but the discussion of the coefficient of variation (CV) or even the FDR in peptide quantification, both essential features of any quantification workflow, is still conspicuously missing in many current studies. An unfortunate consequence of this omission is that sometimes MS/MS-inferred peptides with vastly deviating abundances in run-to-run or sample-to-sample comparisons are attributed to the same protein. These deviating peptides with questionable identities can drastically worsen the variances of protein abundances in the whole dataset and thus reduce the statistical power of the experiment (31).
In contrast, in a quantification-centered approach, peptide abundance is the central factor to be investigated (9,15,32-34). When peptide abundance is considered together with other parameters, such as RT difference and mass error, for the overall assessment of peptide reliability, deviating abundance behavior may result in exclusion of a given peptide from consideration (35). An expected abundance behavior should, on the other hand, strengthen the certainty of peptide identity. In other words, using only "well-behaved" peptides should enhance the quality of protein quantification by improving certainty in peptide identification and reducing the abundance variance, i.e. CV. The challenge lies in including a quantitative assessment of the "goodness of behavior" of peptide abundances in the overall PIP scoring scheme. In this study, we meet this challenge by introducing a new quantification-centered label-free workflow, DeMix-Q. It represents an LFQ extension of the previously developed DeMix identification workflow designed for maximizing proteome coverage by identifying co-fragmented peptides (2). In principle, however, DeMix-Q does not require DeMix for peptide identification and is compatible with any other peptide identification method. Besides reducing quantification variations, DeMix-Q aims to significantly alleviate, if not eliminate, the missing value problem in comparative studies of many complex proteomes, while preserving the DDA advantage of higher proteome coverage. This is achieved by introducing a hybrid PIP method with a scoring function for quality control, which takes into account deviations in RT and m/z, as well as peptide abundance. For testing the new workflow, we selected the iPRG-2015 dataset (36) as an easily accessible, well-characterized and high-quality reference.

EXPERIMENTAL PROCEDURES
Preprocessing and MS/MS Identification-Raw LC-MS/MS data and the protein database were downloaded from the FTP server (ftp://iprg_study:ABRF329@ftp.peptideatlas.org/) of the iPRG-2015 study (36). In the study, three replicates of four samples with the same amount (200 ng) of yeast digest were spiked with different concentrations of six exogenous marker proteins (Supplemental Table 1). The mixtures were digested by trypsin and analyzed by LC-MS/MS using an Orbitrap Q-Exactive mass spectrometer selecting the MS/MS precursors in the Top-10 DDA mode. Our DeMix workflow deconvoluted chimeric MS/MS spectra from the detected cofragmentation events for maximizing peptide identifications (2). The MS-GF+ (37) search engine (v10089) was used for matching the MS/MS spectra against the yeast database (6628 UniProt protein sequences), allowing up to two missed tryptic cleavage sites. Carbamidomethylation of Cys was set as a fixed modification, while acetylation of protein N terminus, oxidation of Met, and deamidation of Asn/Gln were set as variable modifications. A double-pass searching strategy was implemented. From the first-pass searching (10 ppm precursor tolerance), confident MS/MS identifications (<1% FDR) were used as software lock-masses for mass scale recalibration and mass error estimation. IDPicker (v3.1.643) (38) was used to merge the second-pass identifications (.mzid files) at maximum 1% peptide-spectral match FDR and with minimum two distinct peptides for protein inference. COMPASS (v. 1.0.4.5) (39) was used to assign peptide sequences to protein groups using the principle of maximum parsimony.
Abbreviations: LC-MS/MS, liquid chromatography coupled to tandem mass spectrometry; LFQ, label-free quantification; MS1, primary or survey mass spectrometry; PIP, peptide identity propagation; TOPP, The OpenMS Proteomics Pipeline.
Retention Time and Mass Scale Recalibration-Reliable peptide-spectral matches were converted to OpenMS-compatible format and employed for aligning multiple LC-MS/MS experiments using MapAlignerIdentification from the OpenMS proteomics pipeline (TOPP) (40,41). The individual run that gave the largest number of peptide identifications was chosen as a reference. The RT scales in all other runs were transformed to the scale of the reference run by aligning common peptide identifications. Next, using InternalCalibration with 5 ppm mass tolerance, the mass scale in each experiment was recalibrated to theoretical peptide masses. As a result, a new set of mzML files containing only MS1 (full-range) spectra was generated, in which the scales of RT and m/z for all runs were very similar. This processed dataset was then used as the basis for all following procedures (Fig. 1).
Feature Detection and Matching Across Runs (Feature-Based PIP)-Here, an LC-MS feature can be defined as a peptide-like XIC pattern assembled from a cluster of raw MS peaks (41,42). Each feature has information of its RT and m/z coordinates, as well as its integrated ion-current and charge state. Recalibrated MS1 spectra were loaded into FeatureFinderCentroided in TOPP for assembling chromatographic feature maps (m/z tolerance 0.01 Da, min spectra 5, feature min score 0.6). All features listed in the feature maps were de facto quantified independently and reported with integrated ion-current. Features were tentatively associated to peptide sequences using available MS/MS identifications. Afterward, the FeatureLinkerUnlabeledQT tool in TOPP was used for grouping features by similarity with a user-defined quality threshold. Here, we used RT difference <180 s and mass difference <5 ppm. This process served as feature-based identity propagation. As a result, features that matched across multiple experiments were linked into a single consensus map (a.k.a. reference map or master map) (40). In this consensus map, each consensus feature contained at least one subelement (the best-matched feature from a single run), with reference information on RT (centroid of the feature chromatographic shape), m/z, and integrated ion-current (called "intensity" in OpenMS). A fully quantified consensus feature contains the maximum possible number of subelements, equal to the total number of LC-MS runs; otherwise, the consensus feature contains missing values in one or more runs.
FIG. 1. Overview of DeMix-Q data processing workflow. Processes are colored in blue for TOPP, yellow for the search engine, and green for the postprocessing programs developed in-house. Internal processes are highly flexible and can be replaced by alternative tools or simply be skipped. In the latter case, DeMix-Q may become a traditional feature-, XIC-, or MS/MS-based quantification workflow. Note that MS1-based quantification procedures, including feature detection and between-run propagation, are independent of peptide identification and can even be done in the absence of MS/MS.
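The linking thresholds and the notion of a fully quantified consensus feature can be illustrated with a minimal Python sketch. It is not the FeatureLinkerUnlabeledQT algorithm (which performs quality-threshold clustering); it only shows the pairwise tolerance check (RT difference <180 s, mass difference <5 ppm) and how missing runs could be listed. The dictionary layout and the function names are hypothetical.

```python
def ppm_diff(mz_a, mz_ref):
    """Absolute m/z difference in parts per million, relative to mz_ref."""
    return abs(mz_a - mz_ref) / mz_ref * 1e6


def features_match(f, g, max_rt_s=180.0, max_ppm=5.0):
    """Pairwise tolerance check using the thresholds quoted in the text.

    f and g are hypothetical dicts with keys "rt" (seconds) and "mz".
    """
    return (abs(f["rt"] - g["rt"]) < max_rt_s
            and ppm_diff(f["mz"], g["mz"]) < max_ppm)


def missing_runs(consensus, n_runs):
    """Run indices for which a consensus feature has no subelement.

    consensus["subelements"] is assumed to map run index -> intensity.
    An empty result means the consensus feature is fully quantified.
    """
    return [r for r in range(n_runs) if r not in consensus["subelements"]]
```

A consensus feature for which `missing_runs` returns a non-empty list is the kind of entry that the ion-based step described next tries to complete.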
Recovering Missing Values by Targeted XIC (Ion-Based PIP) with Quality Control-Ion-based PIP considers only the existence of molecular ions in a given (RT ± ΔRT, m/z ± Δm/z) window, regardless of the chromatographic peak shape, its abundance, and the isotopic pattern. Therefore, ion-based PIP is more sensitive than the feature-based method. However, this advantage comes at the price of lower reliability, and thus feature-based approaches are normally preferred. Here, ion-based PIP was used only for recovering remaining missing values after feature-based PIP (i.e. the consensus feature map). For each consensus feature, local maxima (apexes) of the monoisotopic peaks (M) and the 13C isotopic peaks (M+1) were extracted in each individual run using EICExtractor of TOPP, with a matching window of RT ± 1 min and m/z ± 5 ppm around the consensus reference location (Fig. 2A). Since both RT and m/z were recalibrated and aligned, such a narrow matching window did not result in substantial loss of useful data (false negatives). The geometric average of ion intensities of the two peaks, √(I1·I2), was used as an estimate of the feature abundance in the corresponding run. This calculation served as a quality checkpoint, which required both isotopic peaks to be traced with nonzero intensities.
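The abundance estimate and its built-in checkpoint can be written as a short Python sketch. The function name and the convention of returning zero when either isotopic peak is absent are assumptions made for illustration.

```python
import math


def feature_abundance(i_mono, i_c13):
    """Geometric average of the M and M+1 apex intensities.

    Returns 0.0 when either peak was not traced (zero intensity), which
    mirrors the requirement that both isotopic peaks be present.
    """
    if i_mono <= 0 or i_c13 <= 0:
        return 0.0
    return math.sqrt(i_mono * i_c13)
```

Because the product is zero whenever one of the two intensities is zero, the geometric average acts as a filter as well as an estimate; an arithmetic average would not have this property.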
Central to the DeMix-Q approach, based on extracted m/z, RT, and ion intensities of the two isotopic MS1 peaks (M and M+1), a scoring scheme was established and applied to every feature in every LC-MS run, combining five deviation factors (Fig. 2A):
- ΔT1, the RT difference between the consensus feature and the apex of the monoisotopic peak (M);
- ΔT2, the RT difference between the two isotopic apexes (M and M+1);
- ΔM1, the deviation of the monoisotopic peak (M) from its theoretical mass;
- ΔM2, the difference of mass deviations between the two isotopic peaks (M and M+1);
- CV of extracted abundances in replicate runs.
Deviations (ΔT1 and ΔM1) between the consensus feature and the extracted monoisotopic peak reflect between-run variances. Relative deviations (ΔT2 and ΔM2) between the two isotopic peaks indicate within-run inconsistencies, with an assumption that the extracted 13C isotopic peak (M+1) should have the same deviations as the monoisotopic peak (M). The last factor (CV) penalizes the features that were not reliably quantified in the replicate runs. Since these five factors have different units and intervals of changes, they were all normalized by their own standard deviations and thus converted to unitless quantities that can be simply combined. One overall score function combining the five deviation factors was formulated as a negative logarithm of pooled variances, with larger variation resulting in a lower score (Equation 1). Here, the five deviation factors were assumed to have equal weights for simplicity. However, one could imagine expansions of this work that assign weights that are optimal by some criteria using more rigorous statistical methods or machine-learning techniques, e.g. Percolator (43).
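One plausible reading of this description can be sketched in Python. The exact form of Equation (1) is not reproduced here, so the formula below is an assumption: each factor is divided by its standard deviation, the squared normalized deviations are summed with equal weights, and the score is the negative natural logarithm of that sum. The key names, the logarithm base, and the handling of a zero sum are all illustrative choices.

```python
import math

# Hypothetical keys for the five deviation factors described in the text.
FACTORS = ("dT1", "dT2", "dM1", "dM2", "cv")


def deviation_score(dev, sd):
    """Score = -ln( sum_i (dev_i / sd_i)^2 ), an assumed form of Equation (1).

    dev: dict of the five deviation factors for one feature in one run.
    sd:  dict of the standard deviation of each factor over all features.
    Larger deviations give a lower score.
    """
    pooled = sum((dev[k] / sd[k]) ** 2 for k in FACTORS)
    if pooled == 0:
        return math.inf  # a perfect match has no finite score in this form
    return -math.log(pooled)
```

With equal weights, a feature whose five deviations each equal one standard deviation scores -ln(5); halving every deviation raises the score by ln(4).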
A target-decoy method was then applied to estimate the false discovery rate (FDR) for a given scoring cutoff. Assuming that complete features grouped by the FeatureLinkerUnlabeledQT are most statistically reliable, a reference feature set ("target") was generated by selecting all fully quantified features with the maximum number of subelements. A false feature set ("decoy") was generated from the target set by shifting each target feature outside its original extraction window through alteration of the retention time (+5 min) and m/z (+50 ppm), with small random noise being added. The score distribution of the decoy features formed a null-score distribution (Fig. 2B). FDR was then estimated among the assignments that passed a given score threshold: the number of decoy matches was divided by the total number of target matches. In this study, a 5% FDR threshold was applied for XIC-feature assignments. The features that failed to pass the threshold were considered missing (zero-intensity).
FIG. 2. (A) For each reference feature in every individual run, the precursor ion is traced in MS1 spectra. Within a matching window of RT ± 1 min and m/z ± 5 ppm, the ion with maximum intensity is picked. By comparing the apex of its monoisotopic peak with the consensus reference RT and m/z, the deviation factors ΔT1 and ΔM1 are obtained. Comparison of the apex of the monoisotopic peak with the apex of the 13C isotopic peak gives the deviation factors ΔT2 and ΔM2. The geometric average of ion intensities I1 and I2 represents the feature quantity, which is used to calculate the CV between the replicate runs. Five deviation factors are combined by the scoring function in Equation (1). (B) Score distribution and target-decoy comparison for FDR estimation. Consensus features linked across all runs by OpenMS are considered to be reliable "target" features. "Decoy" features are generated by arbitrarily shifting the target features' RT and m/z values. Any XIC extracted for a "decoy" feature is assumed to be false. In each individual run, the number of decoy hits does not exceed 5% of target hits, corresponding to FDR of <5%. This threshold is applied to feature-quantity assignments in the process of missing value recovery.
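The target-decoy estimate can be sketched as follows, assuming that each target and decoy extraction has already been scored. The function names are hypothetical, and choosing the most permissive cutoff that satisfies the FDR limit is an assumption about how the threshold would be picked.

```python
import math


def fdr_at(cutoff, target_scores, decoy_scores):
    """Decoy hits divided by target hits among scores >= cutoff."""
    n_target = sum(s >= cutoff for s in target_scores)
    n_decoy = sum(s >= cutoff for s in decoy_scores)
    if n_target == 0:
        return 0.0 if n_decoy == 0 else math.inf
    return n_decoy / n_target


def lowest_cutoff(target_scores, decoy_scores, max_fdr=0.05):
    """Most permissive observed score whose estimated FDR is <= max_fdr.

    The estimate is not guaranteed to be monotonic in the cutoff, so a
    production version might enforce monotonicity (q-values) first.
    """
    candidates = sorted(set(target_scores) | set(decoy_scores))
    passing = [c for c in candidates
               if fdr_at(c, target_scores, decoy_scores) <= max_fdr]
    return min(passing) if passing else None
```

Extractions scoring below the returned cutoff would then be treated as missing (zero intensity), as described above.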
Special treatment was then given to "gray zone" features, which were incompletely quantified by feature-based PIP and yet most likely present in ion-based PIP. For such features (subelements), missing abundances were estimated from a k-nearest neighbors (KNN) regression (k = 5), averaging abundances from other quantified features having the most similar XIC patterns in the same run. After filling missing values, the consensus map was further filtered by removing features that failed to be quantified in at least one replicate in each sample. Lastly, the remaining missing values were imputed as having the lowest detectable feature abundance in order to avoid extreme ratios (or divisions by zero) in sample-to-sample comparison. This approach explicitly assumes that every protein is present in every sample. Although this assumption is demonstrably untrue in the case of spike-in or knock-down experiments, it provides a useful approximation in the great majority of proteomics studies.
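A simplified Python sketch of the two filling steps is given below. How "similar XIC patterns" are measured is not specified in the text; here, as an assumption, each feature in a run is summarized by a single XIC-derived value and neighbors are chosen by absolute difference in that value. The function names and data layout are hypothetical.

```python
def knn_estimate(query_xic, quantified, k=5):
    """Average feature abundance of the k nearest quantified neighbors.

    quantified: list of (xic_value, feature_abundance) pairs from the same
    run; distance is the absolute difference in xic_value (an assumption).
    """
    if not quantified:
        raise ValueError("no quantified features to learn from")
    nearest = sorted(quantified, key=lambda p: abs(p[0] - query_xic))[:k]
    return sum(abundance for _, abundance in nearest) / len(nearest)


def fill_with_floor(values):
    """Replace remaining missing values (None or 0) by the lowest
    positive abundance observed, avoiding divisions by zero in ratios."""
    floor = min(v for v in values if v is not None and v > 0)
    return [floor if (v is None or v == 0) else v for v in values]
```

For example, `knn_estimate(2, [(1, 10), (2, 20), (3, 30), (10, 100)], k=3)` averages the three closest neighbors and ignores the distant one.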
Intensity Recalibration-Interrun (batch effects) and intrarun systematic biases (e.g. due to sample loading, column temperature, ESI current stability, etc.) can greatly affect analytical accuracy in label-free experiments (21,44,45). In order to correct fold-changes induced by systematic biases, a rescaling of feature abundances was performed, using a time-dependent median-shift approach. Chromatographic features from the consensus map were sorted by retention order, then a sliding window (step size = 50 features) containing 500 adjacent consensus features was used to compare the local median abundance of all features with those of the subset of features from each individual run (one-versus-all). This yielded a set of local median shifts for each run, based on which a nonlinear median shift was estimated as a function of retention time by another k-nearest neighbors regression (k = 15). The abundance of every feature in every single run was normalized by correcting for the predicted median shift (Fig. 3 and Supplementary Fig.).
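The time-dependent median-shift correction can be sketched for one run as follows. The sketch assumes that the input is, for each consensus feature sorted by RT, the log2 ratio between the abundance in this run and the all-run reference abundance; the window, step, and k defaults follow the text, while the function names and the use of window-median RT as the regression coordinate are illustrative assumptions.

```python
from statistics import median


def local_median_shifts(rt, log_ratio, window=500, step=50):
    """(median RT, median log-ratio) for each sliding window of features.

    rt must be sorted ascending; log_ratio[i] belongs to rt[i].
    """
    n = len(rt)
    shifts = []
    for start in range(0, max(n - window, 0) + 1, step):
        sl = slice(start, start + window)
        shifts.append((median(rt[sl]), median(log_ratio[sl])))
    return shifts


def predict_shift(rt_query, shifts, k=15):
    """KNN regression: mean shift of the k windows nearest in RT."""
    nearest = sorted(shifts, key=lambda p: abs(p[0] - rt_query))[:k]
    return sum(s for _, s in nearest) / len(nearest)


def correct_run(rt, log_ratio, window=500, step=50, k=15):
    """Subtract the predicted RT-dependent median shift from every feature."""
    shifts = local_median_shifts(rt, log_ratio, window, step)
    return [lr - predict_shift(t, shifts, k) for t, lr in zip(rt, log_ratio)]
```

After correction, a run with a constant loading bias ends up with log-ratios centered on zero, and a bias that drifts with RT is removed locally rather than by a single global factor.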
Detecting and Quantifying Differential Proteins-Since most protein groups have multiple peptides quantified in all runs, quantitative proteomic data bear similarity with gene expression microarray data, with peptide abundances being equivalent to probe intensities. Like microarray probes, peptide abundances quantified by LFQ are supposed to reflect the concentrations of their respective proteins and have linear responses to abundance changes. In theory, if one protein has an abundance difference between the samples, all its constituent peptides should show the same level of abundance difference, giving rise to strong covariation between the abundances of all peptides. Thus, a strong covariation of a peptide's abundance with that of other peptides from the same protein indicates that the peptide is "well-behaved" and gives higher certainty in both its identity and its quantification.
In the literature, we found no LFQ algorithm that would measure and utilize peptide covariations. However, given the basic similarity between proteomics and transcriptomics, tools developed for microarray analysis should also work for quantitative proteomics. In this study, an in-house implementation of factor analysis (Zhang et al., manuscript in preparation) was adapted for detecting differential proteins (i.e. proteins with abundances varying from sample to sample, as opposed to background proteins with unchanged abundances).
All identified features were grouped by peptide sequence, with summed abundances from all charge states. By applying the factor analysis to maximize covariation signals of peptides, a signal-to-noise ratio was obtained for a group of peptide "probes." This parameter was used to decide whether a peptide group (as a protein) is "informative": proteins with signal-to-noise ratio <1 were considered "noninformative" and thus excluded from differential analysis. Protein expression values summarized from peptide abundances were labeled by sample identities, and their sample-to-sample abundance changes were tested using one-way analysis of variance: proteins with p values lower than 0.05 were reported as differential.
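The final testing step can be illustrated with the F statistic of a one-way analysis of variance, computed here in plain Python for one protein whose expression values are grouped by sample. This sketch does not reproduce the factor analysis, and it stops at the F statistic; converting it to a p value requires the F distribution (for instance, `scipy.stats.f_oneway` returns both the statistic and the p value). The function name is hypothetical.

```python
import math
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic.

    groups: list of lists, one list of protein expression values per
    sample (e.g. four samples with three replicates each).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return math.inf  # no within-sample variation at all
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value means that the differences between sample means are large relative to the replicate scatter, which is the pattern expected for a spiked-in protein.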
Comparison to Traditional Methods-MS/MS spectral counts (SpC) were exported from IDPicker after integrating reliable identifications. MaxQuant analysis was performed with default instrumental parameters and database setting. The option of match-between-runs was enabled for feature-based identity propagation, with 20 min alignment window and 2 min matching tolerance. Peptide abundances were retrieved from the resultant peptide.txt file (intensity column), which was filtered by 1% FDR based on posterior error probability (PEP) values (Supplemental Table 2). Skyline MS1-filtering (XIC) results provided in the iPRG-2015 study materials were processed as follows: nonunique peptides were excluded from the analysis, and the abundances of the first two isotopic peaks (M and M+1) were utilized. The OpenMS results were exported right after generating the consensus map (feature-based PIP), without filling missing values by targeted XIC (ion-based PIP). For each method, peptide quantification variation (CV) was calculated using peptides that were quantified in all runs.

A Combined Identity Propagation Method Substantially Alleviates the Missing Value Problem-As shown in Fig. 4, SpC, which does not employ identity propagation, was greatly affected by the DDA randomness, yielding more than 40% missing values in peptide abundances. Only about one-fourth (26%) of all peptides could be found in all 12 LC-MS runs. The steeply declining trend of common quantifications with the dataset size makes SpC less suitable for deep proteome profiling in large-scale experiments. In contrast, OpenMS and MaxQuant, which apply feature-based PIP, reduced the fraction of missing values to smaller but still nonnegligible figures: 15% and 13%, respectively. The fraction of commonly quantified peptides increased to over 60%, i.e. more than double compared with SpC. Skyline, which applies ion-based PIP without matching chromatographic features, showed a very high sensitivity, providing a negligible 1.5% missing values and over 90% of peptides quantified in all runs. DeMix-Q, which combines two complementary PIP approaches to achieve highly sensitive signal extraction together with reliable feature matching, gave 2.8% missing values and quantified over 86% of peptides across all 12 runs (Fig. 4B).
Formally, Skyline (MS1 filtering) outperformed all other methods in terms of missing values. However, as mentioned before, ion-based PIP does not have a quality threshold, thus providing results with uncertain FDR. In contrast, DeMix-Q applies quality thresholds in feature-quantity assignments, which is reflected in a somewhat higher fraction of missing values. This drawback should be outweighed by the superiority of DeMix-Q in quantification: uncontrolled matching in Skyline may result in false assignments, thus introducing large abundance variations. This prediction was tested on the distributions of quantification variances discussed below.
A Quantification-Centered Workflow Provides High Coverage with Low Variance-DeMix-Q quantified in total 26,753 unique peptides representing 2912 proteins, with at least two distinct constituent peptides per protein (Supplemental Table 1). Notably, we found that MS/MS-based identifications explained only one-third (44,024/129,641) of the total chromatographic features that were quantified and matched across experiments (Fig. 5). This highlights the great potential of exploiting the hidden majority of quantified features. One way of using these features is for correcting the systematic bias in measured ion intensities, which is likely to be the largest source of peptide abundance variation. In this study, we applied a KNN regression of RT-dependent median shift to centralize the ion intensity variations around zero-fold change (Fig. 3 and Supplementary Fig.). As one can see from the regression curves, system biases are mostly nonlinear functions of RT. After normalization, all pairwise comparisons of feature abundances between runs showed zero-centered distributions. By correcting the systematic bias, we obtained significantly lower quantification variance compared with other quantification methods. From DeMix-Q, the median CV calculated for 23,097 fully quantified peptides in all 12 runs was 11.6%, distinctly lower than in other methods (Fig. 6). When using the average values from three replicate experiments for each sample, the median CV of peptide abundances among the four samples (background proteins) was only 6.0%.
In contrast, peptide abundances quantified by Skyline showed higher CVs on average and a long tail of peptides with CV > 40% (Fig. 6). As mentioned above, direct XIC methods do not ensure the correctness of signal extraction. Wrongly extracted XIC would introduce large variances in peptide quantification and lead to a long tail in the CV distribution.
FIG. 3. The lower-left part shows systematic errors in between-run comparisons of feature abundances, before correction. The upper-right part shows the effect of median-shift correction, where the between-run abundance differences (fold-changes) become approximately zero-centered along the whole RT range. Each subfigure has normalized RT (0-8000 s) as x axis and feature abundance ratio (-2.0 to 2.0 in log2 scale) as y axis. Correction of three replicate runs of one sample is demonstrated; the comparisons between all 12 runs are shown in Supplementary Fig.
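The CV comparison relies on peptides quantified in every run. A minimal Python sketch of that calculation is shown below; the table layout, the function names, and the use of the sample (n-1) standard deviation are assumptions.

```python
from statistics import mean, median, stdev


def cv_percent(values):
    """Coefficient of variation in percent (sample standard deviation)."""
    return 100.0 * stdev(values) / mean(values)


def median_cv(peptide_table):
    """Median CV over peptides that are quantified in all runs.

    peptide_table: dict mapping peptide sequence -> list of abundances,
    one per run, with None (or 0) marking a missing value.
    """
    cvs = [cv_percent(v) for v in peptide_table.values()
           if all(x is not None and x > 0 for x in v)]
    return median(cvs)
```

Restricting the calculation to fully quantified peptides keeps the methods comparable, but it also means that a method with many missing values is evaluated on a smaller and possibly easier subset.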
Accurate and Robust Protein Quantification-So far, we performed the peptide-level quantification by aggregating precursors' ion currents. However, the estimation of protein abundances is not as easy as aggregating peptide abundances. It is widely known, although not often discussed, that peptides originating from the same protein can give vastly different abundances in LC-MS, varying by orders of magnitude. Moreover, it is far from certain that the LC-MS signals of all peptides scale linearly, and with the same slope, with protein abundance variation. As a result, it becomes problematic to reliably estimate relative protein abundance by simply averaging or aggregating abundances of the multitude of peptides attributed to that protein. This is because the abundance variances of a few intense peptides may significantly affect the result, suppressing the signals from other peptides. Therefore, wrong identity assignment of intense chromatographic features poses higher risks for quantification than that of less intense ones. For this reason, an identification-centered approach has a stringent requirement of identity correctness (46), but a quantification-centered method should be able to cope with possible misidentification using quantitative information (35).
As an example, Fig. 7 shows the quantitative behavior of peptides (top-10 and bottom-10, respectively) from the spiked-in protein bovine serum albumin (sp|P02769|ALBU_BOVIN). Although peptides were reproducibly quantified across the runs, some peptides (mostly with low abundance) did not reflect the known difference of protein concentrations (11: 0.6: 10: 500). In particular, the seventh highest-abundance peptide (LGEYGFQNALIVR) showed a strong deviation from both the known protein concentrations and the behavior of other peptides. As a rule, low-abundance peptides showed a smaller dynamic range of abundance differences compared with known values and with more-abundant peptides. While the exact nature of such behavior remains unclear and should be thoroughly investigated, it was more likely due to instrumental effects than to inaccurate data processing.
Considering this example, it is clear that any robust protein quantification algorithm should be able to cope with a fraction of incorrect identifications, as well as with differences in peptide signal responses to the protein abundance changes. One approach would be to design such an algorithm based on specifics (not thoroughly known yet) of the peptide responses in LC-MS. Another, perhaps more pragmatic, approach would be to borrow an existing tried-and-tested algorithm from a related research area. Since transcriptomics is at least a decade older than proteomics, the problem of inconsistencies of probe signals was addressed in microarray analysis workflows some time ago (47). With more than nine peptides quantified per protein group on average and with the benefit of solving the missing value problem for a great majority of peptides, our data mimicked a "protein microarray," with peptide abundances posing as probe signal intensities.
Applying factor analysis and analysis of variance to select proteins with varying abundances between the samples, we discovered all six spiked-in proteins with high certainty. In contrast, the same procedure applied to quantification results from three other PIP-based algorithms missed one or more proteins (Fig. 8A and Supplemental Table 1), while also (for OpenMS) yielding more false positives. After protein summarization, the sample-to-sample protein ratios showed a linear correlation with expected values up to an abundance difference of more than ninefold (Fig. 8B).
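The analysis-of-variance half of this selection step can be sketched as follows. This is only an illustration under stated assumptions: the log2 protein abundances are hypothetical, the layout of four samples with three replicate runs each is assumed, and the factor-analysis summarization that precedes the test is not reproduced. The sketch computes the F statistic by hand and compares it with an approximate critical value; in practice a p value would be taken from the F distribution, for example with scipy.stats.f_oneway.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    # Between-group sum of squares: group means versus the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: replicate scatter around each group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical log2 protein abundances, 4 samples x 3 replicate runs (assumed layout)
spiked = [[28.3, 28.4, 28.2], [24.1, 24.3, 24.0],
          [28.1, 28.2, 28.3], [33.8, 33.7, 33.9]]
background = [[30.1, 30.0, 30.2], [30.2, 30.1, 30.0],
              [30.0, 30.2, 30.1], [30.1, 30.1, 30.0]]

F_CRIT = 4.07  # approximate critical value of F(3, 8) at alpha = 0.05

for name, groups in [("spiked", spiked), ("background", background)]:
    f_stat, df1, df2 = one_way_anova_f(groups)
    print(f"{name}: F({df1}, {df2}) = {f_stat:.2f}, differential = {f_stat > F_CRIT}")
```

With these made-up values the spiked protein is called differential and the background protein is not, which mirrors the p < 0.05 criterion applied to the real data.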
One could hypothesize that sensitive and reliable identity propagation may eliminate the need for redundancy in MS/MS identifications. We tested this hypothesis by simulating a sparse dataset, keeping MS/MS from only one of the 12 runs and eliminating all redundant identifications. Despite a certain reduction in the total number of quantified proteins due to DeMix-Q quality control, all six spiked proteins were still correctly recalled (Fig. 8A). This result, more than any previously discussed one, demonstrates the possibility of a paradigm change from identification-centered to quantification-centered proteomics.

DISCUSSION

We demonstrated in this study a new quantification-centered workflow for label-free proteomics that practically solves the DDA-induced missing value problem that has haunted shotgun proteomics for years. By taking advantage of the high quality of modern LC-MS/MS, we integrated two types of PIP methods into the workflow. This enabled unbiased and reproducible quantification across many runs even when only a single set of MS/MS identifications was available. This workflow is particularly suitable for DDA, where quantified features compose a much larger population than MS/MS-identified species.

FIG. 6. Comparison of CV distributions. DeMix-Q provided a distinctively lower median CV than other methods, while quantifying the largest number of peptides across all runs. This is achieved primarily due to the introduction of scoring and FDR filtering that prevented large variances from false extractions, and the RT-dependent abundance correction that reduced systematic variability.
The demonstration of the "redundancy of redundant MS/MS peptide identifications" highlighted the potential of PIP in reliable quantification. This result may herald a new strategy in label-free shotgun proteomics, emphasizing the acquisition of higher-quality MS1 spectra and deeper proteome coverage rather than a larger number of redundant MS/MS spectra. One way of exploring this new strategy is by segmenting the mass range for precursor selection (48); another is by applying an exclusion list aggregated from previous runs to subsequent runs.
In some respects, DeMix-Q is similar to a sequential window acquisition of all theoretical fragment-ion spectra-MS processing workflow but propagates MS1 [...] MS1-only approach for whole-proteome analysis, in which the peptide identities are purely inferred from a predicted accurate mass and time tag library. Regardless of the similarities and differences between these approaches and DeMix-Q, they are all based on the three fundamental quality factors essential for reliable analysis: stable retention time, high resolution, and high mass accuracy.
Contrary to common belief, proteome analysis applying DIA (e.g. sequential window acquisition of all theoretical fragment-ion spectra) is not free from the missing value problem. A recent study (29) quantified 80% of the 18,600 yeast peptides belonging to 2333 proteins (including one-hit wonders) in four sequential window acquisition of all theoretical fragment-ion spectra-MS replicate runs. Compared with the 86% of 26,753 peptides we quantified here in 12 runs, the DIA study did not demonstrate any advantage. In our case, DDA required less experimental time and produced both deeper proteome analysis and fewer missing values. Theoretically speaking, this is not too surprising given the lower burden on reliable peptide identification and chromatographic feature extraction with DDA. Compared with the typical 2 Th isolation window in DDA, the highly multiplexed 20 Th window used in DIA penalizes identification of low-abundance peptides that give fewer fragment peaks (2,9). The intrinsic advantage of DDA has, however, not yet been fully realized in practice, mainly because of the missing value problem. Releasing the full power of MS1 in a quantification-centered data processing workflow should lead to deeper and more accurate label-free proteomics. In principle, DeMix-Q should not be limited to DDA data and could also be adapted to solve the missing value problem in DIA analysis, e.g. by propagating identities of peptide fragments instead of precursors. However, due to the increased spectral complexity and the loss of precursor selectivity, identity propagation in DIA might require a more sophisticated scoring scheme for controlling false discoveries.

FIG. 7. Responses from peptide abundances to actual protein concentration differences. The abundances of the 10 most-abundant peptides from spiked-in bovine serum albumin (left) correlated well with the known protein concentrations, except for the outlier peptide (blue). However, the 10 least-abundant peptides (right) turned out to be less responsive to the actual protein concentration changes. The expected abundances (dark gray lines) were calculated as follows: for sample 4 (which has the highest spike-in amount, 500 fmol), as the average log2 peptide abundances (33.8 for top-10 and 28.5 for bottom-10 peptides); for samples 1, 2, and 3, from the corresponding log2 fold-changes compared with sample 4 (-5.5, -9.7, and -5.64, respectively).
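The expected values in Fig. 7 follow from the spike-in ratio by simple arithmetic, which the short sketch below reproduces. It assumes that the ratio 11 : 0.6 : 10 : 500 quoted in the text applies to samples 1 to 4 in that order; the function and variable names are illustrative only.

```python
import math

# Relative spike-in amounts for samples 1-4, as quoted in the text (11 : 0.6 : 10 : 500)
amounts = {1: 11.0, 2: 0.6, 3: 10.0, 4: 500.0}

def expected_log2(anchor_log2, amount, anchor_amount=500.0):
    """Expected log2 abundance, anchored at the mean log2 abundance in sample 4."""
    return anchor_log2 + math.log2(amount / anchor_amount)

# Anchors are the average log2 abundances in sample 4 given in the Fig. 7 legend
for label, anchor in [("top-10", 33.8), ("bottom-10", 28.5)]:
    expected = {s: round(expected_log2(anchor, a), 2) for s, a in amounts.items()}
    print(label, expected)

# log2 fold-changes relative to sample 4, for comparison with the figure legend
for sample in (1, 2, 3):
    print(sample, round(math.log2(amounts[sample] / amounts[4]), 2))
```

The computed fold-changes round to the values stated in the legend (-5.5, -9.7, and -5.64), so each expected line is the sample 4 average shifted by the corresponding log2 fold-change.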
Finally, we strongly believe that the time has come to put quantification at the center of the comparative proteomics workflow. Unlike traditional identification-centered proteomics, which is very strict in terms of peptide identification but has a somewhat lax attitude to peptide quantification, new protein quantification algorithms should be able to tolerate incorrect peptide identifications because further abundance-based filtering will eliminate wrong assignments. This may allow for the revision of certain dogmas of identification-centered proteomics, such as the 1% FDR for peptide-spectrum matches. A deeper understanding of error propagation in mass spectrometry experiments might allow for more flexible treatment of error rates (49), which could have a significant positive effect on the depth of quantitative proteome analysis. More advanced quantification algorithms that take into account the variance and covariance of peptide and protein abundances in multiple experiments urgently need to be developed. Given the data similarity between shotgun proteomics and gene expression microarrays, we can learn much from the more mature area of microarray-based transcriptomics. But of course, sooner or later, superior proteomics-specific algorithms will be developed.

FIG. 8. (A) Signal-to-noise ratios of calling differential proteins by factor analysis. DeMix-Q showed high sensitivity and specificity for identifying the six spiked marker proteins (signal-to-noise ratio > 1 and one-way analysis of variance p value < 0.05). (*) Using the nonredundant MS/MS dataset, all six proteins were identified by DeMix-Q as differential. (B) Estimation of protein abundance ratios. Abundances of the six proteins were pairwise compared between the four samples (color coded). The quantification accuracy of DeMix-Q is reflected in the good agreement between the estimated and expected fold-changes ranging from 0.14 to 9.7.