A “Proteomic Ruler” for Protein Copy Number and Concentration Estimation without Spike-in Standards

Absolute protein quantification using mass spectrometry (MS)-based proteomics delivers protein concentrations or copy numbers per cell. Existing methodologies typically require a combination of isotope-labeled spike-in references, cell counting, and protein concentration measurements. Here we present a novel method that delivers similar quantitative results directly from deep eukaryotic proteome datasets without any additional experimental steps. We show that the MS signal of histones can be used as a “proteomic ruler” because it is proportional to the amount of DNA in the sample, which in turn depends on the number of cells. As a result, our proteomic ruler approach adds an absolute scale to the MS readout and allows estimation of the copy numbers of individual proteins per cell. We compare our protein quantifications with values derived via the use of stable isotope labeling by amino acids in cell culture and protein epitope signature tags in a method that combines spike-in protein fragment standards with precise isotope label quantification. The proteomic ruler approach yields quantitative readouts that are in remarkably good agreement with results from the precision method. We attribute this surprising result to the fact that the proteomic ruler approach omits error-prone steps such as cell counting or protein concentration measurements. The proteomic ruler approach is readily applicable to any deep eukaryotic proteome dataset—even in retrospective analysis—and we demonstrate its usefulness with a series of mouse organ proteomes.

teins have been reported in single mammalian cell types (1). In the past decade, MS-based proteomics has gone from sole identification to the quantification of proteins, which has typically meant relative quantification between samples (2)(3)(4). Apart from the presence of a protein and its relative fold changes between different conditions (5), it is often desirable to estimate absolute quantities such as molar concentrations or copy numbers per cell, which can be compared for different proteins (6). For instance, in systems biology, even a rough estimate of the copy number can help to establish initial parameters for simulation (7). Likewise, clinical protein measurements are typically done in absolute terms of titers, such as milligrams per deciliter. For this purpose various approaches have been utilized, including correlating total MS signals to visualized structures in the cell (8) and extrapolating from spiked-in reference protein mixtures (9) or from endogenous proteins quantified via accurately characterized, isotopically labeled peptide (10) or protein fragment standards (11). Absolute quantification is then achieved through quantification relative to a known reference. In all cases, results scale with the amount of input material or amount of spiked-in standard. Accurate protein concentration measurements are thus an essential and often limiting factor for overall accuracy. Commonly used dye-based protein determination methods rely on the reactivity of a few amino acid residues, mainly tryptophan and tyrosine (12), in the case of the Lowry and BCA assays, or on a hydrophilic/hydrophobic balance of the proteins in the case of Bradford reagent (13). Systematic errors of up to a factor of 2 may therefore arise from the selection of a non-optimal protein standard (14). An additional, often ignored source of errors is the cross-reactivity of the reagents with non-proteinaceous cell components such as thiols, nucleic acids, and phospholipids.
To convert protein quantities to copies per cell, all methods require knowledge of the number of cells used for the analysis. This can be obtained directly via cell counting or indirectly through knowledge of the total protein amount per cell, which in turn is a function of cell volume and total protein concentration. However, cells are not necessarily uniform; therefore scaling by cell numbers may be inaccurate, as a 25% variation of the diameter of a sphere-shaped cell corresponds to a 2-fold change in cell volume. In tissues, not only are cell sizes variable, but visual counting of cells is also problematic. For instance, up to 5-fold differences in calculated cell volumes have been reported for enterocytes of the intestinal mucosa (15).
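The sensitivity of cell volume to diameter can be checked with a short worked equation. This is a sketch that assumes an idealized spherical cell, so that the volume V scales with the cube of the diameter d; the symbols V and d are introduced here for illustration only.

```latex
\frac{V_2}{V_1} = \left(\frac{d_2}{d_1}\right)^{3} = 1.25^{3} \approx 1.95
```

A 25% larger diameter therefore corresponds to roughly a 2-fold larger volume.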
Any deviations in protein determination or cell counts will inevitably carry over to the final readout, even when very precise MS methods are used. This limits the overall accuracy, without showing up as a decrease in the precision of the quantification, as measured by standard deviations or coefficients of variation.
In the course of studying the colon cancer proteome, we recently devised a method for estimating absolute amounts of individual proteins or protein classes based on the proportion of their MS signals to the total MS signal (16). We termed the method the Total Protein Approach, because we relate this proportion to a total protein mass. To obtain copy numbers, we specifically used the total protein mass per cell, which needs to be determined or estimated separately.
In this study, we expanded the method by a concept we call the "proteomic ruler" to further allow correct absolute scaling of the readout without additional steps. We made use of the defined amount of genetic information in each cell, encoded in a known amount of DNA. We show that an accurate determination of the DNA content in a proteomic sample helps to directly determine the number of cells. We then demonstrate that the MS signal derived from histones, around which DNA is wrapped in a defined ratio, can be used as a natural standard in a whole proteome dataset. It serves as a proteomic ruler that allows the estimation of total protein amounts per cell. Thereby the quantitative readout can be absolutely scaled to copies per cell without the need for cell counting or protein concentration determination.

EXPERIMENTAL PROCEDURES
Plasma Lysate-The author's blood was capillary-collected via skin puncture of the middle finger. It was immediately supplemented with 0.05 M EDTA and centrifuged at 5000 × g for 1 min to separate blood cells from plasma. Plasma was diluted 10-fold with lysis buffer containing 0.1 M Tris-HCl, pH 8.0, 0.1 M DTT, and 2% SDS, and the mixture was incubated at 70°C for 5 min.
Whole Cell and Tissue Lysates-U87-MG, A549, PC-3, and Hep-G2 cells were grown in DMEM supplemented with 10% FBS and 1% streptomycin. The cells were harvested at 70% confluence and dissolved in lysis buffer at 100°C for 5 min. After being chilled to room temperature, the lysates were briefly sonicated to reduce the viscosity of the sample. Frozen mouse tissues (Pel-Freez, Rogers, AR) were homogenized with T10 basics Ultra-Turrax dispenser in the lysis buffer at a tissue-to-buffer ratio of 1:10. The homogenates were incubated at 100°C for 5 min. Finally, the cell and tissue lysates were clarified by centrifugation at 16,000 × g for 10 min.
Protein Determination-Protein content was determined using a Cary Eclipse Fluorescence Spectrometer (Varian, Palo Alto, CA) as described previously (17). Briefly, aliquots of 1 to 3 µl of whole cell lysates were mixed with 2 ml of 8 M urea in 10 mM Tris-HCl, pH 8.5. The fluorescence was measured at 295 nm for excitation and 350 nm for emission. The slits were set to 5 nm and 20 nm for excitation and emission, respectively. Tryptophan was used as a standard. The protein content was calculated from the following relationship: the fluorescence of 0.1 µg of tryptophan equals that of 9 µg of total protein, which reflects an average 1.1% weight content of tryptophan in whole lysates of human cells.
Cell Counting-Tissue cultures were trypsinized at 37°C for 2 min, and the released cells were washed with PBS and collected at 1000 × g for 1 min. Then the pellets were suspended in PBS and the cells were stained with 0.2% Trypan Blue (Invitrogen). Cell counting was carried out on an automated cell counter (Countess, Invitrogen).
FASP-based Protein Processing-Aliquots of lysates containing 100 µg of total protein were processed according to the multi-enzyme digestion FASP protocol (18). Briefly, protein lysates were depleted of detergent using 8 M urea in 0.1 M Tris/HCl, pH 8.5, thiols were alkylated with iodoacetamide, and proteins were consecutively digested with endoproteinase LysC and trypsin. Digests of plasma fractions were fractionated using a pipette tip strong anion exchange method into four and two fractions as described previously (19).
FASP-based Cleavage and Determination of RNA and DNA-After collection of the peptides released by trypsin, the material remaining in the filter was washed once with TE buffer (10 mM Tris-HCl, pH 8.0) and then was digested with 0.5 µl (0.5 U) of RiboShredder (Epicenter, Madison, WI) in 60 µl of TE buffer at 37°C for 1 h to digest RNA. The released ribonucleotides were collected via centrifugation at 14,000 × g. Next the material on filters was washed twice with 80 µl of TE buffer, and then it was cleaved with 6 µg of DNAse (DN25, Sigma, St. Louis, MO) in 60 µl of 10 mM Tris-HCl, pH 7.8, containing 2.5 mM MgCl2 and 0.5 mM CaCl2 at 37°C for 1 h. The obtained deoxynucleotides were collected via centrifugation. The RNA and DNA contents were determined by means of UV spectrometry using extinction coefficients of 0.025 and 0.030 (µg/ml)^-1 cm^-1 at 260 nm, respectively. The ratio of the spectral densities at 260 nm to 280 nm was ~2, indicating an absence of protein contamination that could contribute to the A260 measurement.
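The conversion from an absorbance reading to a nucleic acid amount can be sketched as follows. This is a minimal illustration that assumes the extinction coefficients quoted above are expressed per µg/ml and per cm of path length; the function name, the 1-cm path length, and the example absorbance are hypothetical and not taken from the study.

```python
# Extinction coefficients at 260 nm quoted in the text, assumed to be in (ug/ml)^-1 cm^-1.
EPSILON_RNA = 0.025
EPSILON_DNA = 0.030


def nucleic_acid_ug(a260, epsilon, volume_ul, path_cm=1.0):
    """Mass (ug) of released nucleotides in an eluate, from its absorbance at 260 nm."""
    # Beer-Lambert law: A = epsilon * concentration * path length
    concentration_ug_per_ml = a260 / (epsilon * path_cm)
    # Convert the eluate volume from microliters to milliliters
    return concentration_ug_per_ml * volume_ul / 1000.0


# Hypothetical reading: A260 = 0.30 for a 60-ul DNase eluate in a 1-cm cuvette.
dna_ug = nucleic_acid_ug(0.30, EPSILON_DNA, volume_ul=60)
print(f"DNA recovered: {dna_ug:.2f} ug")
```

The same function applies to the RNase eluate when `EPSILON_RNA` is passed instead.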
LC-MS/MS and Data Analysis-Peptides were quantified by tryptophan fluorescence as described above, with the exception that the measurements were performed directly in 0.2 ml of 0.05 M Tris/HCl, pH 8.5, in 5 mm × 5 mm quartz cells. 4-µg aliquots of total peptide were loaded onto C18 reverse phase columns (20 cm long, 75 µm inner diameter, in-house packed with ReproSil-Pur C18-AQ 1.8-µm resin (Dr. Maisch GmbH, Ammerbuch-Entringen, Germany)) with buffer A (0.5% acetic acid). Peptides were eluted with a linear gradient of 5% to 30% buffer B (80% acetonitrile and 0.5% acetic acid) at a flow rate of 250 nl/min over 195 min. This was followed by 10 min from 30% to 60% buffer B, a washout of 95% buffer B, and reequilibration with buffer A. Peptides were electrosprayed and analyzed on Q Exactive mass spectrometers using a data-dependent top-10 method with higher energy collisional dissociation fragmentation. Mouse organ samples were loaded onto a 15-cm reverse-phase column packed with 3-µm resin, separated over 320 min of gradient time, and analyzed on an LTQ Orbitrap mass spectrometer using collision-induced dissociation fragmentation. MS data were analyzed using the MaxQuant software environment (20), version 1.3.10.18, and its built-in Andromeda search engine (21). Proteins were identified by searching MS and MS/MS data against the human and mouse complete proteome sequences from UniProtKB (May 2013 version containing 88,820 and 50,807 sequences, respectively). Carbamidomethylation of cysteines was set as a fixed modification. N-terminal acetylation and oxidation of methionines were set as variable modifications. Up to two missed cleavages were allowed. The initial allowed mass deviation of the precursor ion was up to 6 ppm, and for the fragment masses it was up to 20 ppm (higher energy collisional dissociation, Orbitrap readout) and 0.5 Da (collision-induced dissociation, ion trap readout).
The mass accuracy of the precursor ions was improved by time-dependent recalibration algorithms of MaxQuant. The "match between runs" option was enabled to match identifications across samples within a time window of 30 s of the aligned retention times. The maximum false peptide and protein discovery rates were set to 0.01. Proteins matching the reverse database and proteins identified only with modified peptides were filtered out. Protein abundances and copy numbers were calculated on the basis of summed peptide intensities of unique and "razor" peptides as reported by MaxQuant using the Perseus plugin described in this study. Finally, we removed all protein groups with fewer than two unique peptides (with the exception of two isoforms of creatine kinase in our plasma analysis), as they were less likely to yield highly accurate copy numbers.
Software Availability-The proteomic ruler Perseus plugin is available as a source code and as compiled binary from the Perseus website.

RESULTS
The Total Protein Approach Gives Accurate Estimates of Protein Concentrations-Using our Total Protein Approach, we previously demonstrated that a protein's abundance within the cell as a fraction of the total protein is reflected by the proportion of its MS signal to the total MS signal (16):

$$\frac{\text{Protein mass}}{\text{Total protein mass}} \approx \frac{\text{Protein MS signal}}{\text{Total MS signal}} \qquad \text{(Eq. 1)}$$

This proportion can easily be extracted from any MS-based proteomics measurement, and its accuracy will improve with the depth of measurement. The value has to be scaled by a total protein mass, which can conceptually be the entire protein amount of a cell, the protein amount in a given volume of body fluid, or even a fixed unit such as 1 g. In that way we obtain the absolute amount of the protein or protein class per cell, per unit of volume, or per 1 g of total protein. To show that this principle is universally applicable, beyond the cell line and cancer tissue cases that we investigated before (16), we used it to estimate the concentrations of different diagnostically relevant proteins or protein classes in blood plasma after digesting plasma proteins using the FASP method (18). The total protein concentration in plasma varies around a typical value of 70 g/l within a narrow margin (22), so we scaled the MS readout by a total amount of 70 g to obtain grams per liter. We were able to quantify proteins within their expected physiological ranges over at least 5 orders of magnitude ( Fig. 1, supplemental Table S1).
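The scaling step described above can be sketched in a few lines. This is a minimal illustration, assuming that summed peptide intensities per protein are already available; the function name, the protein names, and the intensity values are hypothetical placeholders, and only the 70 g/l scaling value comes from the text.

```python
PLASMA_TOTAL_PROTEIN_G_PER_L = 70.0  # typical plasma value quoted in the text


def tpa_concentrations(ms_signals, total_protein=PLASMA_TOTAL_PROTEIN_G_PER_L):
    """Scale each protein's share of the total MS signal by a total protein amount."""
    total_signal = sum(ms_signals.values())
    return {
        name: signal / total_signal * total_protein
        for name, signal in ms_signals.items()
    }


# Hypothetical summed peptide intensities (arbitrary units), for illustration only.
example_signals = {
    "protein_A": 6.0e11,
    "protein_B": 3.0e10,
    "all_other_proteins": 3.7e11,
}
concentrations = tpa_concentrations(example_signals)
for name, grams_per_liter in concentrations.items():
    print(f"{name}: {grams_per_liter:.2f} g/l")
```

Passing a total protein mass per cell instead of 70 g/l yields protein mass per cell rather than a concentration.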
Nucleic Acid Quantification and Cell Counting via FASP-based Sample Preparation-In the case of a body fluid such as plasma, the total protein concentration is a readily accessible scaling parameter, and protein concentrations are meaningful and relevant. In the case of a cellular proteome, however, many applications require quantities of copies per cell, which necessitates cell counting. We wondered whether cell counting could be replaced by accurate DNA quantification when the genome size and ploidy were known. DNA concentration was shown to be proportional to the cell count and was successfully used to normalize enzyme activities, transcript and protein amounts, and metabolome data (23)(24)(25). We hypothesized that DNA quantities could be measured directly from the proteomic sample, provided that the chromatin fraction was retained during sample preparation. In contrast to in-solution or in-gel approaches, the FASP method is reactor based (26) and allows sequential processing of the sample and separation of reaction products. Detergents are washed out at the beginning of the FASP procedure, and RNA and DNA, the major components remaining after protease digestion, can be cleanly released from the filter via RNase or DNase digestion (Fig. 2A). To test the feasibility of nucleic acid determination in the FASP format after digestion of proteins and elution of peptides, we consecutively digested the material retained on the filter with RNase and DNase. After each cleavage we collected the digestion products and determined their content based on UV absorbance at 260 nm. We observed a linear correlation between the amount of the eluted nucleotides and the amount of the sample. In parallel, we processed samples supplemented with defined amounts of purified calf thymus RNA and DNA. Yields were greater than 95% and were independent of the protein content (Fig. 2B), indicating that post-FASP digestion of a sample with DNase and RNase is a suitable method for determination of the RNA and DNA content in a proteomic sample that does not require additional preparative steps.
Next, we processed aliquots of total lysates prepared from counted numbers of four different human cell lines using two-step LysC/trypsin digestion of the proteins (multi-enzyme digestion FASP) (27). Both the starting protein amounts and the generated peptides were quantified. We then quantified the ribonucleotides and deoxyribonucleotides eluted after RNase and DNase treatment, respectively. The tryptic and LysC peptides obtained in the multi-enzyme digestion FASP-processed cell lysates (above) were analyzed in 4-h LC-MS/MS runs (supplemental Table S1). The human genome contains around 3.2 × 10^9 base pairs (28). Multiplying this number by the average mass of a base pair (615.9 Da) and by the ploidy of the respective cell type yields an expected amount of cellular DNA. We used a value of 6.5 pg for a diploid human cell to calculate cell numbers. Dividing the total amount of protein input by these cell numbers, we obtained a protein mass per cell that was very similar to that obtained by dividing the total protein input amount by the counted cell numbers (supplemental Table S2).
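The arithmetic behind the DNA-based cell count can be sketched as follows. The genome size and base pair mass are the values quoted above; the function names and the example DNA and protein amounts are hypothetical and serve only to illustrate the unit conversions.

```python
AVOGADRO = 6.02214076e23      # molecules per mole
BASE_PAIR_MASS_DA = 615.9     # average mass of a base pair (from the text)
HUMAN_GENOME_BP = 3.2e9       # haploid human genome size (from the text)


def dna_mass_per_cell_pg(genome_bp=HUMAN_GENOME_BP, ploidy=2):
    """Expected DNA mass per cell in picograms."""
    grams = genome_bp * BASE_PAIR_MASS_DA * ploidy / AVOGADRO
    return grams * 1e12  # g -> pg


def cells_from_dna(dna_ug, dna_pg_per_cell):
    """Number of cells represented by a measured DNA amount."""
    return dna_ug * 1e6 / dna_pg_per_cell  # ug -> pg


def protein_per_cell_pg(protein_ug, n_cells):
    """Protein mass per cell in picograms."""
    return protein_ug * 1e6 / n_cells


diploid_dna_pg = dna_mass_per_cell_pg()
print(f"Diploid human DNA content: {diploid_dna_pg:.2f} pg")

# Hypothetical sample: 1.3 ug of DNA recovered from 100 ug of protein input.
n_cells = cells_from_dna(1.3, 6.5)
print(f"Cells: {n_cells:.0f}, protein per cell: {protein_per_cell_pg(100, n_cells):.0f} pg")
```

The computed diploid value of about 6.5 pg agrees with the figure used in the text.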
Histones Serve as a "Proteomic Ruler" for Absolute Scaling of Proteomic Data-In eukaryotic cells, DNA is packaged in chromatin by histones, and the mass of the DNA is about equal to the combined mass of histones (29). We therefore wondered whether the summed intensity of histones in a deep, eukaryotic proteome could serve as a proxy for the amount of DNA and therefore for the cell number. There are five major histone types, which are expressed in many isoforms and variants that are relevant for many aspects of chromatin biology. For our approach, however, we employed the summed MS signal of all histone-derived peptides, irrespective of which histone they mapped to or how they were assembled in protein groups. This value reflects the cumulative histone mass. In this way, we used the MS signal of an entire class of proteins as a proteomic ruler and related it to a quantity that is not directly amenable to mass spectrometry. Our hypothesis of the histone proteomic ruler predicts the following relationship (Fig. 3A):

$$\frac{\text{DNA mass}}{\text{Total protein mass}} \approx \frac{\text{Histone mass}}{\text{Total protein mass}} \approx \frac{\text{Histone MS signal}}{\text{Total MS signal}} \qquad \text{(Eq. 2)}$$

In our four-cell-line dataset, the histone MS signal amounted to 2.07% to 4.03% of the total MS signal. Equating this fraction with 6.5 pg as the DNA mass of diploid human cells, we obtained cellular protein masses within a factor of 1.24 ± 0.29 compared with the value obtained via cell counting (Fig. 3B; supplemental Table S2). This is close to the hypothesized value of 1 and implies that the ratio of histone MS signal to total MS signal allows the estimation of the total cellular protein mass without any additional measurements.
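The histone-based scaling can be illustrated with a short sketch. It assumes, as stated above, that the histone mass per cell approximately equals the DNA mass per cell; the function name is hypothetical, and the two example fractions are the extremes of the range reported for the four cell lines.

```python
def protein_mass_per_cell_pg(histone_signal, total_signal, dna_pg_per_cell=6.5):
    """Total protein mass per cell, using the histone MS signal as the ruler.

    Assumes histone mass per cell is about equal to DNA mass per cell,
    so total protein mass = DNA mass / (histone share of the MS signal).
    """
    histone_fraction = histone_signal / total_signal
    return dna_pg_per_cell / histone_fraction


# Histone signal fractions of 2.07% and 4.03%, expressed relative to a total of 1.0.
low_fraction_mass = protein_mass_per_cell_pg(0.0207, 1.0)
high_fraction_mass = protein_mass_per_cell_pg(0.0403, 1.0)
print(f"{low_fraction_mass:.0f} pg and {high_fraction_mass:.0f} pg of protein per cell")
```

A smaller histone share therefore implies a larger cell in terms of protein mass, here roughly 314 pg versus 161 pg.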
Fig. 2. A, the proteomic workflow. Cells were counted and lysed in a buffer containing SDS. Protein concentrations in the whole lysates were determined, and 100-µg aliquots of the whole lysates were successively processed in the proteomic reactor (FASP) format. After detergent removal, proteins were consecutively cleaved with endoproteinase LysC and trypsin. The released LysC and tryptic peptides were subjected to proteomic analysis. Next, RNA and DNA were digested, and the released ribo- and deoxyribonucleotides were spectrophotometrically quantified at 260 nm. Protein contents per single cell were calculated from the cell numbers and the protein concentrations. Alternatively, values of protein mass of single cells were obtained from DNA contents and the protein concentrations. B, determination of the efficiency and yield of RNase and DNase cleavages. Aliquots of mouse liver lysates were processed with the FASP method, and the residual high-molecular-weight material was sequentially cleaved with RNase and DNase (labeled "samples digested with DNase and RNase"). The released ribo- and deoxyribonucleotides were quantified spectrophotometrically at 260 nm. To demonstrate the completeness of digestion over the analyzed range, samples were supplemented with constant amounts of 2 µg of purified DNA or RNA prior to sample processing (labeled "samples + 2 µg RNA/DNA digested with DNase/RNase"). To demonstrate the specificity of the initial RNase digestion, samples were supplemented with DNA and digested with RNase (labeled "samples + 2 µg DNA digested with RNase").
The error of the histone MS signal fraction depends on how accurately the histone MS signal and the total MS signal can be determined. For histones, a large number of various posttranslational modifications (PTMs) have been identified, lysine acetylation, serine and threonine phosphorylation, and lysine methylation being the most frequent.
In most standard proteomics workflows, these modifications are not routinely included in the database search, and we were wondering whether this affects the ratio of histone MS signal to total MS signal, which is critical for our scaling approach. To address this question, we searched the data again with combinations of acetylation, phosphorylation, and methylation set as variable modifications. Although individual histones had changes in their relative abundances, in particular histone H3 (Figs. 4A-4C), the fraction of the cumulative histone to total MS signal changed only by 5% to 10% (Fig. 4D). This indicates that, with the exception of histone H3, the fraction of the MS signal derived from histone peptides that have PTMs is low and can be neglected in the overall data scaling process.
The accuracy of the total MS signal depends on the depth of the proteomic analysis. To estimate the required depth for a robust readout, we ranked all peptides by intensity and calculated the histone-MS fraction as a function of the number of identified peptides (Fig. 4E). Because peptide intensities span many orders of magnitude, the most intense peptides contribute a large part of the total intensity. Within the first few thousand peptides, the histone fraction is overestimated because histones contribute some of the most intense peptides. From a depth of around 12,000 or more peptides, however, the histone fraction stabilizes within tight margins. This depth of analysis is easily attainable with minimal sample fractionation and also with single run analyses on latest-generation machines (30).
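The depth analysis described above can be sketched as a running ratio over intensity-ranked peptides. This is an illustrative reimplementation, not the code used in the study; the function name and the toy peptide list are hypothetical, and all intensities are assumed to be positive.

```python
def histone_fraction_by_depth(peptides):
    """Histone share of the cumulative MS signal as peptides are added by rank.

    peptides: iterable of (intensity, is_histone) pairs with positive intensities.
    Returns a list whose k-th entry is the histone fraction computed from the
    k + 1 most intense peptides.
    """
    ranked = sorted(peptides, key=lambda p: p[0], reverse=True)
    total = 0.0
    histone = 0.0
    fractions = []
    for intensity, is_histone in ranked:
        total += intensity
        if is_histone:
            histone += intensity
        fractions.append(histone / total)
    return fractions


# Toy data: one very intense histone peptide followed by less intense peptides.
toy_peptides = [(50.0, False), (100.0, True), (10.0, True), (50.0, False)]
fractions = histone_fraction_by_depth(toy_peptides)
print([round(f, 3) for f in fractions])
```

In the toy data the fraction starts at 1.0 and falls as less intense non-histone peptides are included, mirroring the overestimation at shallow depth noted in the text.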
For each protein in the measured proteome, we can now estimate its mass per cell solely from its MS signal as the product of its MS signal fraction and the cellular protein mass. This value easily converts to copies per cell:

$$\text{Protein copies per cell} = \text{Protein MS signal} \times \frac{N_A}{M} \times \frac{\text{DNA mass}}{\text{Histone MS signal}} \qquad \text{(Eq. 3)}$$

where N_A is the Avogadro constant and M is the molar mass of the protein.
Ribosomal Proteins as a Proteomic Ruler for Cellular RNA-Next, we investigated whether the proteomic ruler concept is also applicable to cellular RNA. Ribosomal RNA typically represents about 80% of total RNA (31), and in eukaryotic ribosomes there is a ratio of about 1:1 between RNA and protein (32). The summed MS signal for all ribosomal proteins amounted to values between 3.61% and 5.27% of the total MS signal across the cell lines. We compared this result with the biochemical quantification of the total RNA content using the FASP method in relation to the total protein input (supplemental Table S2). Our results were within a factor of 1.01 ± 0.13 of the biochemical measurements, indicating that the MS signal of ribosomal proteins can indeed be used as a proteomic ruler to estimate cellular RNA amounts.
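One plausible way to turn the ribosomal protein signal into an RNA estimate is sketched below. The text does not spell out the exact arithmetic, so this is an assumption-laden illustration: it takes the two literature ratios quoted above at face value, and the function and constant names are hypothetical.

```python
RRNA_SHARE_OF_TOTAL_RNA = 0.80      # rRNA is about 80% of total RNA (from the text)
RNA_TO_PROTEIN_IN_RIBOSOME = 1.0    # about 1:1 mass ratio in eukaryotic ribosomes (from the text)


def rna_to_protein_ratio(ribosomal_signal, total_signal):
    """Estimated total RNA mass per unit of total protein mass."""
    # Share of the total protein mass contributed by ribosomal proteins
    ribosomal_protein_fraction = ribosomal_signal / total_signal
    # Ribosomal RNA mass relative to total protein mass
    rrna_fraction = ribosomal_protein_fraction * RNA_TO_PROTEIN_IN_RIBOSOME
    # Scale up from ribosomal RNA to total RNA
    return rrna_fraction / RRNA_SHARE_OF_TOTAL_RNA


# Ribosomal signal fractions at the two ends of the reported range.
low_ratio = rna_to_protein_ratio(0.0361, 1.0)
high_ratio = rna_to_protein_ratio(0.0527, 1.0)
print(f"RNA per protein mass: {low_ratio:.3f} to {high_ratio:.3f}")
```

Multiplying the ratio by a total protein amount gives an RNA amount that can be compared with the biochemical measurement.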
Histone Proteomic Ruler Provides Estimates of Cell Sizes in Tissues-Counting cells in tissue samples is not trivial. However, determining the DNA and RNA content using our proteomic reactor format is as straightforward as for cell lines. We prepared lysates from mouse brain, liver, and thymus; measured protein, RNA, and DNA contents; and performed proteomic analysis. There was excellent agreement between the total cellular protein mass values derived from the DNA-based method and our histone proteomic ruler approach (Fig. 3C; supplemental Table S3). This demonstrates that the histone proteomic ruler serves as a good proxy for estimating cellular protein masses in tissues.
The total cellular protein concentration typically lies within a range of 20% to 30% (w/v) (i.e. 200 to 300 g/l) in many cell types and organisms (33). This constraint can be used to convert between cellular protein mass and cell volume. Hepatocytes, the predominant cell type in liver, are roughly cubical cells with a 15-µm edge length (34). Assuming a total protein concentration of 200 g/l, this translates to 675 pg of protein per cell. This compares to our estimate of 464 ± 35 pg total protein per average liver cell, which is reasonable given that non-hepatocytes contribute the same amount of DNA or histones but less overall protein mass. Thymocytes are at the other end of the size scale with an average volume of 250 µm^3 (35). This translates to 50 pg of protein, as compared with our estimate of 59 ± 31 pg.
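The volume-to-mass conversion used in this comparison can be reproduced with a few lines. The sketch assumes the 200 g/l protein concentration stated above; the function name is hypothetical.

```python
def protein_pg_from_volume(volume_um3, conc_g_per_l=200.0):
    """Protein mass (pg) contained in a cell of the given volume.

    1 um^3 equals 1e-15 l and 1 g equals 1e12 pg, so (g/l) * um^3 = 1e-3 pg.
    """
    return volume_um3 * conc_g_per_l * 1e-3


hepatocyte_pg = protein_pg_from_volume(15.0 ** 3)  # cube with a 15-um edge
thymocyte_pg = protein_pg_from_volume(250.0)       # average volume of 250 um^3
print(f"Hepatocyte: {hepatocyte_pg:.0f} pg, thymocyte: {thymocyte_pg:.0f} pg")
```

Both values match the 675 pg and 50 pg figures given in the text.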
To test the applicability of the histone proteomic ruler to the retrospective analysis of existing datasets, we reevaluated whole-proteome measurements of murine dendritic cell populations published by our group in 2010 (36). Samples had been prepared via one-dimensional SDS gel electrophoresis followed by in-gel digestion, an approach distinct from our FASP-based method and incompatible with direct DNA quantification from the proteomic sample. Mature dendritic cells have diameters between 10 and 15 µm (37). We compared these cell sizes to our proteomic ruler estimates that ranged between 64 ± 14 and 95 ± 25 pg total protein per cell for the different dendritic cell subtypes (Fig. 3D). These values translated to diameters of 8.5 to 9.7 µm for spherical cell shapes, which is expected to be slightly smaller than observed cell sizes, given the numerous dendrites projecting from the cell surfaces. Interestingly, our observed similarities in cell sizes correlate with overall patterns of proteomic similarity on the level of individual proteins that were observed in the original study (36).
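The conversion from protein mass to an equivalent spherical diameter can be sketched as follows. The text does not state which protein concentration was used for the dendritic cells; the sketch assumes 200 g/l, which reproduces the reported diameters. The function name is hypothetical.

```python
import math


def diameter_um_from_protein(protein_pg, conc_g_per_l=200.0):
    """Diameter (um) of a sphere holding the given protein mass.

    Assumes a uniform protein concentration; (g/l) * um^3 = 1e-3 pg.
    """
    volume_um3 = protein_pg / (conc_g_per_l * 1e-3)
    # Sphere volume V = pi * d^3 / 6, solved for d
    return (6.0 * volume_um3 / math.pi) ** (1.0 / 3.0)


small_cell_d = diameter_um_from_protein(64.0)
large_cell_d = diameter_um_from_protein(95.0)
print(f"{small_cell_d:.1f} um to {large_cell_d:.1f} um")
```

Because the diameter scales with the cube root of the mass, even the sizable uncertainties on the protein masses translate into modest uncertainties on the diameters.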
Label-free Copy Number Estimations Are Strikingly Close to Precise Spike-in Quantifications-We previously employed spiked-in protein epitope signature tags (PrESTs) of known quantities in combination with isotopic labeling, cell counting, and total protein concentration determination to obtain highly reliable copy number values of selected proteins (11). To assess the accuracy of our proteomic-ruler-derived protein copy numbers, we reanalyzed the same dataset used in the original PrEST-SILAC study and applied our calculations on the "heavy" labeled proteome without considering the ratio information from the "light" PrEST peptides. We recapitulated not only the correct scaling of the total protein mass, but also the copy numbers of the individual PrEST-quantified proteins within an average deviation of 1.5-fold ( Fig. 5A; supplemental Table S4) and comparable precisions judged by the standard deviations from three replicates. We attribute the surprisingly good performance of the proteomic ruler quantifications to the fact that our label-free quantification on average made use of 19.4 peptides along the entire length of the proteins, whereas the PrEST-SILAC quantification used 4.7 peptides on average. This might compensate for some of the principal limitations of the label-free approach. Looking at the deviations of individual quantifications, we saw that the minority of larger deviations occurred exclusively with PrEST-SILAC quantifications based on two or fewer peptides or label-free quantifications based on 11 or fewer peptides (Fig. 5B). This observation underlines the benefits of approaches that rely on multiple independent quantifications instead of single peptide ratios, as commonly used, for example, with AQUA peptides. We conclude that for those proteins quantified with more than a few peptides, the proteomic ruler approach could offer a surprisingly high level of accuracy, making it an attractive alternative to label-based methods.
In addition to the comparison with spike-in quantification data, macromolecular complexes offer another option for validating protein copy numbers. Many obligate protein complexes are well characterized in terms of their composition and stoichiometry with subunits expressed at equimolar levels. Fig. 5C shows that our histone proteomic-ruler-derived copy numbers of members of the pyruvate dehydrogenase complex and the TRiC chaperone closely match the expected 1:1 stoichiometry among subunits.
The Muscle Proteome Is Quantitatively Dominated by Large, Abundant Proteins-As a practical example of the usefulness of "easy" absolute protein quantification, we determined cell sizes and cellular copy numbers of proteins in a panel of other mouse organs (Fig. 6A). Ovaries consist predominantly of small follicular cells and showed the least protein per cell (42 pg). Leg muscle cells, in contrast, had around 675 pg of protein per nucleus. Considering that muscle fibers are syncytial, multi-nucleated cells, the histone proteomic ruler delivered protein amounts per nucleus and not per cell in this particular case. Despite the huge differences in cellular protein amounts, we observed much less variation in the dependence of the abundance of a protein and its molecular mass, irrespective of the tissue of origin. This is reflected in the average molecular mass of a protein, which is calculated as the ratio of the total protein mass per cell to the total number of protein molecules (Fig. 6B). This number is rather similar across tissues, with the notable exception of muscle tissues. The reason for this becomes apparent when we look at the distribution of protein sizes across the dynamic range of the individual proteins (Figs. 6C and 6D). Independent of the tissue of origin, low-abundant proteins had an average molecular mass of around 100 kDa, and this value decreased with increasing cellular abundance of the proteins to around 40 kDa for the most abundant proteins. This dependence was observed in earlier studies and is thought to reflect the evolutionary advantage of decreasing the size of abundant proteins for reasons of biosynthetic cost (38). As a consequence of this trend, the average molecular mass of a protein in a cell is much smaller than the nominal average of the sizes of all proteins when their abundances are not taken into account. 
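The difference between the abundance-weighted and the nominal average molecular mass can be illustrated with a toy calculation. The function names and the three-protein example are hypothetical and chosen only to show the direction of the effect.

```python
def average_molecular_mass_kda(proteins):
    """Total protein mass per cell divided by the total number of molecules.

    proteins: list of (copies_per_cell, molecular_mass_kda) pairs.
    """
    total_copies = sum(copies for copies, _ in proteins)
    total_mass = sum(copies * mass for copies, mass in proteins)
    return total_mass / total_copies


def nominal_average_kda(proteins):
    """Plain mean of the protein sizes, ignoring abundances."""
    return sum(mass for _, mass in proteins) / len(proteins)


# Toy proteome: one abundant small protein and two rare large proteins.
toy_proteome = [(1_000_000, 40.0), (10_000, 100.0), (100, 100.0)]
weighted = average_molecular_mass_kda(toy_proteome)
nominal = nominal_average_kda(toy_proteome)
print(f"weighted: {weighted:.1f} kDa, nominal: {nominal:.1f} kDa")
```

Adding a very abundant, very large protein to the toy list, as is the case for titin or myosins in muscle, pulls the weighted average upward.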
In skeletal muscle cells, filaments and motor proteins such as titin and myosins are notable exceptions to the trend of abundant proteins being smaller, as they are both large (>150 kDa) and very abundant (>1 million copies per cell) in this tissue, resulting in a profound increase in the average molecular protein mass in a muscle cell (Fig. 6C, circles).
Plugin for the Perseus Data Analysis Software for Calculation of Absolute Protein Abundances-The calculation of the protein abundances is a simple arithmetic task and can be performed using commonly available table calculation tools. To make the proteomic ruler approach easily usable for a wide community, we have implemented it as a plugin for the Perseus data analysis software. Perseus is part of the freely available MaxQuant suite (20). The proteomic ruler plugin supports all modes of label-free absolute quantification described in this study and takes user-configurable variables such as the ploidy and the total protein concentration. Optionally, it can incorporate an additional level of protein-specific correction: our copy number calculation assumes a direct proportionality between a protein's cumulative mass in the proteomic sample and the MS signals summed over all peptides derived from it (see Eq. 3). Hence the protein's molar mass serves as a protein-specific normalization factor for copy number estimation. Because the combination of the sequence of a protein, the specificity of the protease used for digestion, and the characteristics of the mass spectrometric analysis can introduce protein-specific biases (39), our plugin allows the user to employ alternative normalization factors, such as the number of theoretically expected peptides that is used by some methods (9,40).
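The arithmetic of the copy number calculation can be sketched outside of Perseus. This is an independent illustration under the stated proportionality assumption, not the plugin's source code; the function name and the example signal values are hypothetical, and the default DNA mass is the diploid human value used in this study.

```python
AVOGADRO = 6.02214076e23  # molecules per mole


def copies_per_cell(protein_signal, molar_mass_da, histone_signal, dna_pg_per_cell=6.5):
    """Estimated copies per cell from a protein's summed MS signal.

    The histone MS signal is equated with the DNA mass per cell, which fixes
    the mass represented by one unit of MS signal.
    """
    # Protein mass per cell in grams (pg -> g)
    protein_mass_g = protein_signal / histone_signal * dna_pg_per_cell * 1e-12
    # Mass -> moles -> molecules
    return protein_mass_g / molar_mass_da * AVOGADRO


# Hypothetical 50-kDa protein with 1/30 of the summed histone signal.
example_copies = copies_per_cell(
    protein_signal=1.0e9, molar_mass_da=50_000.0, histone_signal=3.0e10
)
print(f"{example_copies:.3g} copies per cell")
```

Replacing `molar_mass_da` with another protein-specific factor would correspond to the alternative normalizations mentioned above, although the result would then need a different overall scaling.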
In addition, we have implemented auxiliary functionalities. For instance, molecular weights and numbers of theoretical peptides can be calculated from protein I.D.s in combination with the FASTA database. Moreover, the plugin allows the categorization of proteins according to the expected accuracy of absolute quantification: proteins having a high fraction of theoretical peptides per sequence length and a high number of actually identified peptides, most of which are group-unique, are expected to yield better quantification.

DISCUSSION
In this paper, we propose that accurate absolute quantification is possible without the use of spike-in standards through the use of a concept we call the "proteomic ruler." Using the MS signal derived from histones and relating it to a known amount of DNA per cell provides accurate estimates of the total protein amount per cell that can be used as scaling factors for calculating cellular copy numbers of any protein of interest. We note that our approach makes a number of assumptions that allow us to omit any spike-in standards. At the same time, it eliminates several experimental steps such as cell counting and absolute protein concentration determination, which are themselves prone to errors, in particular stemming from issues with protein determination assays.
We found the quantitative results of our proteomic ruler approach to be typically within a factor of 2 of precision measurements or literature values. Importantly, this information comes for free, in that it incorporates absolute quantification into any kind of in-depth proteome dataset, even in retrospective analysis. The only prerequisite is a eukaryotic, whole-cell proteome dataset where the chromatin fraction is not over- or underrepresented as a result of sample handling. The latter is a specific requirement for an accurate estimation of the total protein mass per cell, but all whole proteome datasets should aim at an unbiased representation of all protein classes. A reasonable depth of proteomic analysis is needed to ensure a robust contribution of the histone MS signal, but the necessary depth should be readily attainable with many experimental setups. We expect that in the future, more and more proteomics projects will reach the required depth of proteome coverage and will be able to incorporate absolute quantification via the histone proteomic ruler. Additionally, individual protein copy numbers will become more accurate with increased peptide coverage in deep datasets.
Furthermore, we envision a generalization of the proteomic ruler concept beyond using the histone signal to estimate cellular protein amounts. For instance, using characteristic protein classes such as membrane or mitochondrial proteins, it should be possible to infer insights into subcellular architecture solely from proteomics datasets.