Direct and Absolute Quantification of over 1800 Yeast Proteins via Selected Reaction Monitoring*

Defining intracellular protein concentration is critical in molecular systems biology. Although strategies for determining relative protein changes are available, defining robust absolute values in copies per cell has proven significantly more challenging. Here we present a reference data set quantifying over 1800 Saccharomyces cerevisiae proteins by direct means using protein-specific stable-isotope labeled internal standards and selected reaction monitoring (SRM) mass spectrometry, far exceeding any previous study. This was achieved by careful design of over 100 QconCAT recombinant proteins as standards, defining 1167 proteins in terms of copies per cell and upper limits on a further 668, with robust CVs routinely less than 20%. The selected reaction monitoring-derived proteome is compared with existing quantitative data sets, highlighting the disparities between methodologies. Coupled with a quantification of the transcriptome by RNA-seq taken from the same cells, these data support revised estimates of several fundamental molecular parameters: a total protein count of ∼100 million molecules-per-cell, a median of ∼1000 proteins-per-transcript, and a linear model of protein translation explaining 70% of the variance in translation rate. This work contributes a “gold-standard” reference yeast proteome (including 532 values based on high quality, dual peptide quantification) that can be widely used in systems models and for other comparative studies.


SID-SRM determination of APC subunits
The complete list of APC/C core and regulatory subunit abundance estimates from our study and other proteomic studies is given in Table S1. In addition to abundance estimates for Apc1 and Cdc23, we were also able to estimate an upper limit for 6 core subunits, including Apc9 at 130 cpc. Apc9 is believed to be present at two copies per complex given its role in stabilisation of the two molecules of Cdc27 found in each APC/C (1). Our quantitative data suggest substoichiometric association of Apc9 in the complex. This raises the intriguing possibility that Apc9 may be the limiting factor in the formation of active APC/C complexes in vivo, where it is known to be essential for APC/C stability and catalytic E3 ligase activity (2); such a scenario would suggest that there are fewer than 130 functional APC/C complexes per cell. An alternative, but less likely, possibility based on current evidence (3) is that Apc9 might function to promote the formation of the more active APC/C dimer recently described (4).
We also determined abundances for the two APC/C regulators Cdc20 and Cdh1, at <130 cpc and 998 ± 156 cpc respectively. The marked difference in the relative amounts of these two co-activator proteins can be explained by their known stability and regulation: although Cdh1 levels remain stable throughout the cell cycle, Cdc20 levels are known to fluctuate, consistent with our measurements deriving from an asynchronous cell population (5-7). Our data highlight the diversity of estimated stoichiometry for this complex defined by absolute quantitative proteomics methods (Table S1).

Protein extraction
Accuracy in protein quantification relies not only on the analytical methodology and data analysis pipeline employed but also, critically, on the strategy for protein extraction and sample handling. To minimise potential errors arising from protein loss associated with manipulation of the biological sample, the chemostat-grown yeast were lysed following an optimised protocol (8,9) and the total non-clarified extract was used for subsequent protein quantification. No particulate material was removed prior to proteolysis, and protein loss (reproducible or otherwise) was minimised.

QconCAT validation and transition
Each tryptic digest was analysed by LC-MS using a nanoAcquity UPLC™ system (Waters Ltd., Elstree, UK) coupled to a Synapt™ G2 quadrupole-time-of-flight mass spectrometer (Waters Ltd., Elstree, UK) to verify completeness of digestion and to determine the quantity of QconCAT present in the sample. One µL of the digest (corresponding to approximately the protein equivalent of 100,000 cells) was loaded onto a Symmetry C18 trapping column (5 µm packing material, 180 µm × 20 mm) (Waters Ltd., Elstree, UK) using partial loop injection for 3 min in 0.1% formic acid, 0.1% acetonitrile at 5 µL min⁻¹. The sample was then resolved on a HSS T3 nanoAcquity C18 analytical column (1.8 µm packing material, 75 µm × 150 mm) (Waters Ltd., Elstree, UK) using a gradient of 97% A (0.1% formic acid): 3% B (0.1% formic acid in acetonitrile) to 60% A: 40% B over 60 min at a flow rate of 300 nL min⁻¹. The column was then washed by increasing the percentage of B to 95% over 2 min and holding at 95% B for 2.5 min. The column was then re-equilibrated to starting conditions. The column oven temperature was 35 °C and the autosampler temperature was 7 °C. A lock mass solution of 500 fmol µL⁻¹ of glu-fibrinopeptide B in 0.1% formic acid in water:acetonitrile [50:50] was infused into the nanoelectrospray ionisation (ESI) source from an auxiliary pump at a flow rate of 300 nL min⁻¹. All solvents were LC-MS grade.
The column effluent was introduced into a nano-ESI source operated in positive polarity and fitted with a PicoTip emitter (New Objective, Woburn, MA, USA). The mass spectrometer was calibrated immediately prior to sample analysis using the product ion spectrum of glu-fibrinopeptide B (500 fmol µL⁻¹ of glu-fibrinopeptide B in 0.1% formic acid in water:acetonitrile [50:50]) and operated using a data independent (MSᴱ) acquisition program with the instrument in V mode. A 1 sec 'low energy' survey scan was performed between m/z 50-2000 with a trap cell collision energy of 6 eV. The 'elevated energy' product ion scan was acquired using the same conditions except that the trap cell collision energy was ramped between 15 and 40 eV over the course of the acquisition. The transfer cell collision energy was 4 eV for both scans, and the lock mass was recorded every 30 sec. The data were processed and database searched using ProteinLynx Global Server v2.5.2 (Waters Ltd., Elstree, UK). The data were processed using the following settings: low energy threshold, 100; elevated energy threshold, 20; intensity threshold, 750; lock mass, m/z 785.8426. The processed spectra were searched against an in-house generated database containing the amino acid sequences of all the QconCATs used in the study. The following settings were applied: automatic settings for precursor and product ion mass tolerance; minimum fragment ion matches per peptide, 8; minimum fragment ion matches per protein, 15; minimum peptide matches per protein, 1; fixed modifications, carbamidomethyl Cys, ¹³C₆ Arg and Lys; variable modifications, oxidised Met; number of missed cleavages, 1; false positive rate, 1%.

Informatic comparison of Q-peptide and endogenous peptide signal response
Our QconCAT design process involved several steps to minimise the likelihood of confounding issues that could degrade the signal from both Q-peptides and the selected endogenous peptides in the yeast proteins. Specifically, we selected surrogate peptides that were not known to be post-translationally modified and used a prediction tool, MC:Pred (10), to prioritise peptides with good cleavage contexts to avoid partial cleavage. In addition, our digestion protocol described here and in the Methods was tested and evaluated over several months to optimise conditions for complete digestion. Similarly, as noted above, we routinely examined every QconCAT digest on the Synapt instrument for completeness.
In addition to these steps, we also compared the SRM XIC values from 532 matched pairs of Class A Q-peptides for concordance. The reasoning behind this is illustrated in Supplementary Figure S11 panel A: where a pair of peptides derives from the same yeast protein, we expect the two yeast peptides to give equal SRM responses relative to their Q-peptide equivalents. This is indeed what we observe in Supplementary Figure S11 panel B, with ~70% of data points lying within a 2-fold difference. The figure also indicates the expected changes when signal is lost from a given SRM value, which could be caused by incomplete digestion, sub-stoichiometric post-translational modifications, or chromatographic inconsistencies and poor peak selection. In the cases where large deviations from expectation are observed (red dots, labelled Single) we always selected the larger of the two peptide SRM median values for protein quantification, reasoning that loss of signal from a target peptide is considerably more likely than for QconCAT peptides since these were independently validated and checked.
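The selection rule described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name `select_protein_value`, the 2-fold default threshold for calling a pair discordant, and the example values are all assumptions made here for illustration.

```python
import math


def select_protein_value(cpc_a, cpc_b, max_fold=2.0):
    """Compare two sibling peptide estimates (copies per cell) for one protein.

    Returns (label, log2_ratio, selected). For discordant ("Single") pairs,
    selected is the larger estimate. For concordant ("Dual") pairs it is None,
    because how concordant pairs are combined is not specified here.
    max_fold is an assumed threshold, not a value taken from the study.
    """
    if cpc_a <= 0 or cpc_b <= 0:
        raise ValueError("peptide abundances must be positive")
    log2_ratio = math.log2(cpc_a / cpc_b)
    if abs(log2_ratio) <= math.log2(max_fold):
        return "Dual", log2_ratio, None
    # Signal loss from the endogenous peptide is taken to be more likely
    # than from the Q-peptide, so the larger estimate is kept.
    return "Single", log2_ratio, max(cpc_a, cpc_b)


# Hypothetical sibling peptide medians (cpc) for two proteins.
print(select_protein_value(1200.0, 1000.0))
print(select_protein_value(5000.0, 900.0))
```

Keeping the threshold as a parameter makes it easy to see how many proteins move between the Dual and Single groups as the tolerance changes.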

Label-free absolute quantification
The dataset called "Q-Exactive" was based on a single one-dimensional reversed-phase separation of the previously described yeast digest using an Ultimate 3000
The "SAX" dataset used the on-tip pre-fractionation approach described by Wisniewski and co-workers (12) to increase proteome coverage. Initially, 130 µg of whole cell lysate yeast preparation was brought to 250 µL in a solution containing 4% SDS, 100 mM Tris/HCl pH 7.6, 0.1 M DTT and incubated at 95 °C for 5 min. Any DNA in the sample was then sheared by sonication (3 × 10 s pulses) to reduce the sample viscosity, and the lysate was then clarified by centrifugation at 17,136 × g for 5 min. The sample volume was brought to 400 µL with 100 mM Tris/HCl pH 7.6, the sample was cooled on ice, and 100 µL of 100% TCA was added to precipitate protein.
Precipitation was carried out overnight on ice after which the precipitate was
Further 0.1 M Tris-HCl (pH 8.5) was added, to dilute the urea to 6 M, after which endoproteinase LysC was added at 1:50 enzyme:protein ratio, and the mixture was incubated for 4 h at 37 °C. Following this, the urea concentration was further diluted to 1.5 M with 25 mM ammonium bicarbonate before adding trypsin (1:50 enzyme:protein ratio) for an overnight incubation at 37 °C. The peptide digest was desalted using a Harvard Apparatus Macro spin column, eluted from the C18 matrix using 70% acetonitrile, and dried by vacuum centrifugation prior to preparation for anion exchange-based peptide fractionation as described previously (12). Briefly, peptides were separated on a pipette-based anion exchanger 'column', assembled by stacking 10 layers of a 3M Empore Anion Exchange disk (consisting of a polystyrene-divinylbenzene copolymer, modified with quaternary ammonium groups) into a 200 µL micropipette tip. Peptides were initially dissolved and loaded on the column in 200 µL of Britton and Robinson buffer composed of 20 mM CH3COOH, 20 mM H3PO4, 20 mM H3BO3, and NaOH, at pH 11. For elution of fractions, Britton and Robinson buffer solutions titrated to pH 8, 6, 5, 4 and 2 using NaOH were used. Eluted peptides were also desalted using C18 StageTip plugs as described by Rappsilber and co-workers (13). Each elution was then dried by vacuum centrifugation and re-suspended in 0.05% TFA prior to LC-MS/MS. LC-MS/MS was performed using an Ultimate 3000 RSLC™ nano system (Thermo Scientific, Hemel Hempstead, UK) coupled to a Q-Exactive™ mass spectrometer (Thermo Scientific, Hemel Hempstead, UK) using the previously described method.
The SAX dataset was processed with MaxQuant (v. 1.3.0.5) (14) using the Andromeda search engine (15). The sequence database was the reference proteome set of S. cerevisiae from UniProt (6560 proteins). A fixed carbamidomethyl modification for cysteine and a variable oxidation modification for methionine were specified. A precursor mass tolerance of 10 ppm and a fragment ion mass tolerance of 20 mmu were applied, with a 1% FDR threshold at both peptide and protein level.
All other parameters were left at default settings. Label-free absolute quantification was performed following the "Top3" methodology described by Silva and co-workers (11) but, instead of using an internal standard for calibration, the total protein content of a yeast cell was assumed to be 60 million molecules.
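The scaling step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the processing code used for these datasets: the function name `top3_copies_per_cell`, the input layout, the decision to skip proteins with fewer than three peptides, and the example intensities are all hypothetical.

```python
def top3_copies_per_cell(peptide_intensities, total_molecules=60e6):
    """Convert peptide intensities to copies per cell with a Top3-style scaling.

    peptide_intensities maps a protein id to a list of peptide intensities.
    Each protein's signal is the mean of its three most intense peptides;
    the signals are then scaled so that they sum to total_molecules.
    Proteins with fewer than three peptides are skipped (an assumption here).
    """
    top3 = {}
    for protein, intensities in peptide_intensities.items():
        if len(intensities) < 3:
            continue
        best = sorted(intensities, reverse=True)[:3]
        top3[protein] = sum(best) / 3.0
    total_signal = sum(top3.values())
    if total_signal == 0:
        raise ValueError("no quantifiable signal")
    return {p: s / total_signal * total_molecules for p, s in top3.items()}


# Hypothetical intensities for three proteins; P3 has too few peptides.
example = {
    "P1": [9e6, 6e6, 3e6, 1e5],
    "P2": [3e6, 2e6, 1e6],
    "P3": [5e5, 4e5],
}
print(top3_copies_per_cell(example))
```

Because the estimates are forced to sum to the assumed cell total, any error in that total propagates proportionally to every protein value.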

Analysis of peptide features in sibling pairs
We examined the peptides in sibling pairs from the same parent protein, where both had been classified as Type A, for enrichment in selected features which could explain systematic differences in the log2 ratio of X/Y abundances. In all cases, the peptides were denoted X and Y so that the X value was always the greater of the two. We noted a statistically significant enrichment in certain features of the X and Y peptides in pairs with log2 ratios above and below the median, most notably increased missed cleavage potential in the native protein context of the lower abundance Y peptides (10), as shown in Supplementary Figure 3. Other features considered included amino acid content and predicted post-translational modifications. As noted, the most prominent feature that explains the data is an enrichment for missed cleavage potential in the native protein context of the lower abundance Y peptides according to our predictor MC:Pred (10), as well as the incidence of dibasic sites (16) and the presence of tryptophan in the X peptide. We are aware of no precedent for this latter observation but observed a modestly significant distinction between X/Y ratios conditioned on this feature (Supplementary Figure 3). However, we elected not to use this directly when selecting peptides for protein level quantification, owing to the borderline significance and the lack of a clear experimental rationale.
(27). We adopted an iterative, sequential approach to the multivariate linear regression modeling using features listed in Supplementary Table 3.
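The X/Y labelling and median split used in the sibling pair analysis can be sketched as below. This is a minimal sketch only: the function name `label_sibling_pairs`, the input layout and the example values are hypothetical, and the statistical testing of features between the two groups is not shown.

```python
import math
from statistics import median


def label_sibling_pairs(pairs):
    """Label sibling peptides as X (larger) and Y (smaller) for each protein.

    pairs maps a protein id to (cpc_1, cpc_2) for two Type A sibling peptides.
    Returns protein id -> dict holding X, Y, log2(X/Y), and a flag saying
    whether that ratio lies above the median ratio across all pairs.
    """
    labelled = {}
    for protein, (a, b) in pairs.items():
        x, y = max(a, b), min(a, b)
        labelled[protein] = {"X": x, "Y": y, "log2_ratio": math.log2(x / y)}
    cutoff = median(v["log2_ratio"] for v in labelled.values())
    for v in labelled.values():
        v["above_median"] = v["log2_ratio"] > cutoff
    return labelled


# Hypothetical copies-per-cell values for three proteins.
example = {"P1": (1000.0, 900.0), "P2": (400.0, 1600.0), "P3": (2000.0, 1000.0)}
for protein, info in label_sibling_pairs(example).items():
    print(protein, info)
```

Because X is always the larger value, every log2 ratio is non-negative, so the median split separates well-agreeing pairs from poorly agreeing ones.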

Linear regression model building
We also analysed our transcriptome and proteome data following the Ranged Major Axis (RMA) regression modelling of Csardi and colleagues (28), comparing this to the Ordinary Least Squares (OLS) modelling parameters, using the R package lmodel2. The RMA approach models noise in both the dependent and independent variables (in our case, the proteome and transcriptome respectively) and produces a symmetric model. The best-fit lines are shown in Supplementary Figure S10. The increased slope derived from the RMA modelling is consistent with the Csardi modelling study, which considers a variety of datasets, though few with matched transcriptome and proteome from the same yeast. This result suggests that the so-called potentiation model of gene expression in yeast is prevalent, where the translational control step follows and amplifies the transcriptional level. We note, however, that even against such a model considerable gene-level variation must exist. This is self-evident from Figure 5B, where a huge variation in protein:mRNA ratios is observed. To properly reconcile these data requires an accurate, comprehensive measurement of both protein degradation and synthesis rates. Although the inclusion of protein turnover improves the quality of the basic model, considerable variance remains unexplained, suggesting that better data with better coverage are still needed.
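To illustrate why a symmetric fit gives a steeper slope than OLS when both variables are noisy, the sketch below implements both fits in plain Python. It is not the lmodel2 implementation used in the study: the function names are hypothetical, the ranging assumes simple min-max scaling of both variables, and the example values are invented log-scale numbers.

```python
from statistics import mean


def _moments(x, y):
    """Return means and centred sums of squares and cross-products."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return mx, my, sxx, syy, sxy


def ols_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept)."""
    mx, my, sxx, _, sxy = _moments(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


def ranged_major_axis_fit(x, y):
    """Ranged major axis fit; returns (slope, intercept).

    Both variables are scaled to [0, 1] (assumed min-max ranging), the major
    axis slope is computed, and the slope is back-transformed to the original
    units. Undefined when x and y are uncorrelated or either is constant.
    """
    x_min, y_min = min(x), min(y)
    x_range, y_range = max(x) - x_min, max(y) - y_min
    xr = [(a - x_min) / x_range for a in x]
    yr = [(b - y_min) / y_range for b in y]
    _, _, sxx, syy, sxy = _moments(xr, yr)
    d = syy - sxx
    slope_ranged = (d + (d * d + 4.0 * sxy * sxy) ** 0.5) / (2.0 * sxy)
    slope = slope_ranged * y_range / x_range
    return slope, mean(y) - slope * mean(x)


# Invented log-scale transcript (x) and protein (y) abundances.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.4, 3.6, 5.3]
print("OLS:", ols_fit(x, y))
print("RMA:", ranged_major_axis_fit(x, y))
```

OLS attributes all scatter to the dependent variable, which attenuates its slope, whereas the major axis slope lies between the y-on-x and x-on-y regression lines.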
A: An S-curve scatter plot showing peptide quantification values in copies per cell (cpc) ranked in ascending order. The A-type peptides are shown in purple and the upper quantification limits for B-type peptides are shown in orange. The error bars on the A-type peptides show the robust standard deviation (rSD) across the biological replicates. B: The distributions of robust coefficient of variation values (rCVs) are shown as histograms at the peptide level, and in C: at the protein level. The median rCVs are 11.4% and 12.6% for peptides and proteins respectively.
Boxplots showing distinguishing features between sibling peptides for proteins quantified by two A-type peptides. A: The predicted missed-cleavage propensity scores (MC:Pred) are shown for sibling peptides for proteins where sibling A-types agree (log ratio < median log ratio) and those that do not agree (log ratio > median log ratio). Sibling peptides are labelled as either X or Y, so that peptide X quantification > peptide Y in all cases. A significant difference in MC:Pred score is observed for Y peptides in proteins where sibling peptides do not agree versus proteins where sibling peptides do agree (Wilcoxon Rank Test, p<0.001). B: The log ratio of sibling peptide quantifications (using the same XY nomenclature) is shown for proteins quantified by two peptides. Proteins were classified by the presence of tryptophan in the Q-peptides, where none, one or both peptides contain tryptophan. There is a small but modestly significant difference in log ratio between proteins where peptide X (only) contains a tryptophan and proteins where neither peptide contains tryptophan (Wilcoxon Rank Test, p<0.05). We highlight this test to illustrate the process we carried out to determine whether there were any biases in amino acid composition leading to signal loss or disparity. This was the only such case, and owing to the modest significance we elected not to use it in the final calculations.

Supplementary Figure S5.
A: Peptide "detectability" prediction scores. The boxplot shows the detectability score predicted by CONSeQuence for the three observed Q-peptide categories, where a higher score equates to a higher predicted detectability. Peptides were classified by quantification type as before: type A where reliable data for both the reference and analyte peptide are observed, type B where only data for the reference Q-peptide are available, and type C where we do not see any data for either reference or analyte peptides. Type C peptides score significantly lower than types A and B. B: Transcript abundance boxplots for protein quantification categories. Proteins were classified based on their peptide categories. Type A proteins have at least one type A peptide, type B proteins have no type A but at least one type B peptide, and type C proteins have no type A or type B peptide. The boxplots show the respective transcript levels of the three protein categories, plotting transcript copies per cell values as described in the Methods on a log scale. Notably, type B proteins have significantly lower transcript levels compared to type A and C, suggesting they are also likely to be low abundance at the protein level. C: Bar charts showing the proportion of proteins quantified (green) and not quantified (red) across the three protein classifications, broken down for each quantification dataset obtained from PaxDb as well as our two label-free acquisitions (Q-Exactive and SAX, see Supplementary Methods). Of note, every other method missed some proteins that were successfully quantified as COPY A-type proteins. Similarly, the relative fraction of COPY B-type quantifications that were unreported by other methods is larger than for the A-type data, consistent with the fact that they are generally low abundance and hard to quantify by any method.
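The peptide and protein categories defined in this caption can be expressed as a short classification sketch. The function names are hypothetical, and the case of an analyte signal without a reference signal is not described in the text, so it is treated here as an error by assumption.

```python
def classify_peptide(reference_observed, analyte_observed):
    """Type A: both signals observed; B: reference Q-peptide only; C: neither.

    The analyte-only case is not covered by the scheme described in the text,
    so this sketch raises an error for it (an assumption made here).
    """
    if reference_observed and analyte_observed:
        return "A"
    if reference_observed:
        return "B"
    if not analyte_observed:
        return "C"
    raise ValueError("analyte observed without reference: not covered")


def classify_protein(peptide_types):
    """Type A if any peptide is type A, else B if any is type B, else C."""
    if "A" in peptide_types:
        return "A"
    if "B" in peptide_types:
        return "B"
    return "C"


# Hypothetical (reference, analyte) observations for one protein's two peptides.
types = [classify_peptide(True, False), classify_peptide(True, True)]
print(types, classify_protein(types))
```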