Quantifying homologous proteins and proteoforms

Many proteoforms – arising from alternative splicing, post-translational modifications (PTMs), or paralogous genes – have distinct biological functions, such as histone PTM proteoforms. However, their quantification by existing bottom-up mass-spectrometry (MS) methods is undermined by peptide-specific biases. To avoid these biases, we developed and implemented a first-principles model (HIquant) for quantifying proteoform stoichiometries. We characterized when MS data allow inferring proteoform stoichiometries by HIquant, derived an algorithm for optimal inference, and experimentally demonstrated high accuracy in quantifying fractional PTM occupancy without using external standards, even in the challenging case of the histone modification code. A HIquant server is implemented at: https://web.northeastern.edu/slavov/2014_HIquant/

Understanding such systems demands quantifying proteoform abundances. This demand has motivated the development of external standards that can afford high accuracy even for complex proteoforms (Creech et al., 2015). However, their wider use has been limited by expense, and they have been applied only to special cases that allow chemical modification of cell lysates, e.g., phosphorylation (Wu et al., 2011) and acetylation (Weinert et al., 2014; Baeza et al., 2014). In the absence of external standards, the quantification of complex proteoform stoichiometries remains very challenging because the ratios between proteoform-specific peptides do not necessarily reflect the ratios between the corresponding proteoforms (Olsen et al., 2010); precursor ion areas corresponding to the same phospho-site in the same sample can differ by over 100-fold depending on the choice of protease (Giansanti et al., 2015). This is because a measured peptide level (precursor ion area) depends not only on the abundance of the corresponding protein(s) but also on extraneous factors including protein digestion, peptide ionization efficiency, the presence of other co-eluting peptides, and chromatographic aberrations (Lu et al., 2006; Peng et al., 2012; Giansanti et al., 2015). These extraneous factors break the equivalence between the abundance of a peptide and its precursor ion area and thus make protein quantification much more challenging than DNA quantification by sequencing. This problem is compounded when PTM peptides have been enriched, and thus their intensities are scaled by unknown enrichment-dependent factors.

Model
To infer proteoform stoichiometry, we use a simple model that is illustrated in Fig. 1a with proteoforms of histone H3 and in Supplementary Fig. 1 with paralogous ribosomal proteins and phospho-proteoforms of pyruvate dehydrogenase. HIquant explicitly models peptide levels measured across conditions as a superposition of the levels of the proteins from which the peptides originate (Fig. 1a). In this model, shared peptides serve as indispensable internal standards; they couple the equations for different peptides and thus make it possible to estimate stoichiometries between homologous proteins and proteoforms. The simple example in Fig. 1a generalizes to any number of proteins / proteoforms (K) and any number of conditions greater than 1 (N > 1), as the system in Fig. 1b shows. HIquant solves this system and infers the protein levels (P) independently from the extraneous noise (Z; coming from protein-digestion, peptide-ionization differences, sample loss during enrichment, and even coisolation interference); Z is also inferred as part of the solution and discarded. A related superposition model has been used before with peptides quantified at one condition (Gerster et al., 2014). However, for a single condition, the model cannot quantify the proteins independently from the nuisance Z since all problems described by system 1 are under-determined, i.e., have an infinite number of solutions (Proof 1; Supplemental Information).
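The under-determination for a single condition can be seen from a short counting argument. The sketch below writes the superposition model with symbols assumed here for illustration: following the Fig. 1 caption, K proteoforms, M peptides and N conditions, with a binary matrix S (a label introduced here, not taken from the figure) marking which proteoforms contain each peptide. It is not a substitute for Proof 1.

```latex
x_{in} = z_i \sum_{k=1}^{K} S_{ik}\, p_{kn},
\qquad i = 1,\dots,M, \quad n = 1,\dots,N .

% N = 1: any positive levels p_{k1} are matched exactly by choosing
z_i = \frac{x_{i1}}{\sum_{k} S_{ik}\, p_{k1}} ,
% so the data place no constraint on the stoichiometry.

% N > 1: dividing by z_i gives MN linear equations in the M + KN unknowns
% (1/z_i, p_{kn}); uniqueness up to one scaling constant requires at least
MN \;\ge\; M + KN - 1 .
```

The last inequality is only a necessary counting condition. Whether a given design actually has a unique solution depends on S and on the data, as characterized in the Supplemental Information.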
Thus, for a single condition, the model cannot take advantage of the robust corresponding-ion pairs, i.e., ratios between ions with the same chemical composition. In contrast, HIquant infers ratios across proteins and their PTMs solely from the corresponding-ion ratios. This is possible because when N > 1, the system in Fig. 1b often has a unique solution up to a single scaling constant, even when all peptides are shared, e.g., the problem defined by the design matrix in Supplementary Fig. 1c. We characterize the conditions under which HIquant has a unique solution and derive algorithms that use convex-optimization to find the optimal solution given the data; see Malioutov and Slavov (2014) and Supplemental Information.
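To make the structure of the system concrete, the following is a minimal sketch for the simplest case of one shared and two unique peptides from two paralogs measured in two conditions. It is not the convex-optimization algorithm referenced above: it assumes noise-free data and simply takes the null vector of the linearized system by SVD. The function name `infer_proteoform_levels`, the design matrix `S`, and all numerical values are hypothetical choices made for this illustration.

```python
import numpy as np

def infer_proteoform_levels(X, S):
    """Solve d_i * X[i, n] = sum_k S[i, k] * P[k, n] for d = 1/z and P.

    X: (M, N) peptide levels; S: (M, K) binary peptide-to-proteoform map.
    The solution is defined only up to one common scaling constant, fixed
    here by setting the nuisance of the first peptide to 1.
    """
    M, N = X.shape
    K = S.shape[1]
    # Unknown vector: [d_0 .. d_{M-1}, P[0, :], P[1, :], ...]
    A = np.zeros((M * N, M + K * N))
    for i in range(M):
        for n in range(N):
            row = i * N + n
            A[row, i] = X[i, n]                    # coefficient of d_i
            for k in range(K):
                A[row, M + k * N + n] = -S[i, k]   # coefficient of P[k, n]
    # The last right-singular vector spans the (approximate) null space of A.
    _, _, Vt = np.linalg.svd(A)
    u = Vt[-1] / Vt[-1][0]                         # fix scale and sign: d_0 = 1
    z_hat = 1.0 / u[:M]
    P_hat = u[M:].reshape(K, N)
    return z_hat, P_hat

# Hypothetical example: peptide 0 is shared, peptides 1 and 2 are unique.
S = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
P_true = np.array([[10.0, 4.0], [5.0, 8.0]])       # proteoforms x conditions
z_true = np.array([0.3, 2.0, 0.05])                # peptide-specific nuisances
X = z_true[:, None] * (S @ P_true)                 # simulated peptide levels

z_hat, P_hat = infer_proteoform_levels(X, S)
inferred_ratio = P_hat[0] / P_hat[1]               # stoichiometry per condition
naive_ratio = X[1] / X[2]                          # ratio of unique peptides
```

In this simulated case the inferred ratio matches the true stoichiometry in both conditions, whereas the ratio of the unique peptides is distorted by the ratio of their nuisances. With noisy data the null vector is only approximate, which is why a dedicated optimization and reliability assessment are needed.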

Validating inference of proteoform stoichiometry
Our model (Fig. 1b) aims to make proteoform quantification insensitive to many systematic biases. For example, incomplete cleavage of a peptide, e.g., only 5% of the peptide is released during enzyme digestion, is fully absorbed into the corresponding nuisance and does not affect inferred protein levels as long as the cleavage is 5% for all conditions/samples. Analogously, if coisolation interference compresses the fold-changes of a peptide, the systematic component of the compression is fully absorbed by the nuisances. Unlike systematic biases, random noise in the data is not absorbed by the nuisances; it can degrade the quality of the inference. Thus, HIquant must carefully evaluate the reliability of inferred proteoform levels. The evaluation uses inference features, such as fraction of explained variance, eigenvalue spectrum spacing and noise sensitivity; see Supplemental Information.
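The absorption of a condition-independent bias can be written out in one line. The notation is assumed for illustration: x is a measured peptide level, z its nuisance, p a proteoform level, S a binary peptide-to-proteoform map, and c_i a constant cleavage efficiency for peptide i (0.05 in the example above).

```latex
x'_{in} = c_i\, x_{in}
        = (c_i z_i) \sum_{k} S_{ik}\, p_{kn}
        = z'_i \sum_{k} S_{ik}\, p_{kn},
\qquad z'_i = c_i z_i .
```

The same proteoform levels p therefore solve the system for the biased data, with only the nuisance rescaled. If the efficiency varied across conditions (c_{in} rather than c_i), it could no longer be factored into a single z'_i and would act as noise.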
We sought to experimentally evaluate HIquant's ability to infer the proteoform stoichiometry in samples for which stoichiometry is accurately determined by other methods. The first method included creating and mixing proteoforms. The second method included quantifying histone H3 proteoforms relative to heavy peptide standards with predetermined abundances.
We aimed to create proteoform mixtures with known stoichiometries so that they can be used to assess the accuracy of stoichiometries inferred by HIquant. To this end, the dynamic universal proteomics standard (UPS2) was digested, and the peptides were split into two equal parts, A and B. In part A, cysteines were covalently modified with iodoacetamide, and in part B with vinylpyridine (Fig. 2a). We mixed parts A and B in predefined ratios (n) and spiked each mixing ratio into a yeast sample. All samples were labeled with TMT, and the relative peptide levels were quantified from the reporter ions at the MS2 level.
These alkylated UPS proteoforms have mostly shared peptides (peptides not containing cysteine) and a few unique peptides (peptides containing cysteine). HIquant modeled the relative levels of these peptides as shown in Fig. 1 and solved the model to infer the stoichiometries of the alkylated proteoforms (n̂), which should correspond to the mixing ratios. A comparison of the actual mixing ratios (n) with the inferred ratios (n̂) demonstrates a median error below 10% for ratios inferred by HIquant and a substantially larger error for the ratios between precursor ion areas of unique peptides (Fig. 2b,c); see Supplemental Information.
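For readers who want to reproduce this kind of comparison on their own data, a minimal sketch of a median-error calculation follows. It assumes that error is defined as the absolute relative deviation of the inferred from the actual ratio; the exact definition used for Fig. 2c is given in the Supplemental Information and may differ. The function name and the numbers are made-up placeholders, not measurements from this study.

```python
import numpy as np

def median_relative_error(inferred, actual):
    """Median of |inferred - actual| / actual over all mixing ratios."""
    inferred = np.asarray(inferred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.median(np.abs(inferred - actual) / actual))

# Placeholder values for illustration only (not data from the experiment).
mixing_ratios = [0.25, 0.5, 1.0, 2.0, 4.0]
inferred_ratios = [0.27, 0.48, 1.05, 1.9, 4.2]

err = median_relative_error(inferred_ratios, mixing_ratios)
```

The same function can be applied to ratios of precursor ion areas of unique peptides to compare the two approaches on equal terms.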
Next, we sought to evaluate the ability of HIquant to infer stoichiometries of more complex PTM proteoforms, those of histone H3. This system allows rigorous quantification of endogenous proteoform stoichiometries by previously developed external standards (MasterMix) with known concentrations (Creech et al., 2015). For the test, we used peptides quantified by selected reaction monitoring (SRM) across 7 perturbations. Fractional site occupancies were either estimated based on the external standards or inferred by HIquant only from the relative levels of the endogenous peptides, without using the MasterMix concentrations. The good agreement between these estimates (Fig. 3) validates the ability of HIquant to infer fractional site occupancy even when the same site may be modified by different PTMs. The estimates from the external standards and from HIquant are very close but also show some systematic deviations. Those deviations may arise from incomplete protein digestion that is hard to control for with peptide standards, from measurement noise corrupting the solution inferred by HIquant, or from proteoforms not explicitly included in the model. The abundances of some proteoforms with quantified peptides are over 1000-fold lower than the abundance of the main proteoforms. They and their corresponding peptides were omitted from the HIquant inference since their quantification requires unrealistically high accuracy of relative quantification; see Supplemental Information and Discussion.
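Because HIquant determines proteoform levels only up to a common scaling constant, fractional site occupancy is a natural readout: the constant cancels in the fraction. The sketch below assumes that the modeled proteoforms of a site account for all forms of that site, which, as noted above, may not hold when low-abundance proteoforms are omitted. The function name and the levels are hypothetical.

```python
import numpy as np

def fractional_occupancy(P):
    """Convert proteoform levels to fractional site occupancies.

    P: (K, N) levels of the K modeled proteoforms of one site
       (e.g. unmodified, me1, me2) across N conditions.
    Returns a (K, N) array whose columns each sum to 1.
    """
    P = np.asarray(P, dtype=float)
    return P / P.sum(axis=0, keepdims=True)

# Hypothetical inferred levels for three proteoforms in two conditions.
P_hat = np.array([[80.0, 60.0],
                  [15.0, 25.0],
                  [5.0, 15.0]])

occupancy = fractional_occupancy(P_hat)
# An arbitrary common scale leaves the occupancies unchanged.
rescaled = fractional_occupancy(3.7 * P_hat)
```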

Discussion
The idea of using ratios between chemically identical ions is a cornerstone of quantitative proteomics (Blagoev et al., 2004). It has been used for decades in the context of relative quantification of proteins based on unique peptides (Altelaar et al., 2013) and even applied to the special case of inferring phosphorylation site occupancy (Olsen et al., 2010). Our work expands and generalizes this idea to all peptides, to stoichiometries of complex proteoforms, and to an unlimited number of conditions. Crucially, HIquant allows accurate, efficient, and numerically stable inference resulting in reliability estimates.
HIquant requires and depends upon accurate relative quantification. This limitation is largely and increasingly mitigated by technological developments allowing accurate estimates of corresponding-ion ratios. However, these technological developments on their own do not allow accurate estimates of PTM site occupancy (Giansanti et al., 2015). HIquant's dependence on the accuracy of relative quantification increases with increasing difference in the abundance of proteoforms. If the levels of two proteins differ by more than 3-6 orders of magnitude, this difference is likely better inferred from the precursor ion areas of the unique peptides. The associated noise (due to variability in protein digestion and ionization) is generally below 100-fold (Peng et al., 2012) and thus smaller than the signal. HIquant's utility is particularly relevant when proteins and proteoforms have comparable abundances (within a 10-100 fold difference) but distinct functions (Slavov et al., 2015) and thus accurate quantification is essential for quantifying relatively small differences in abundance. Quantifying proteoforms is an exciting frontier essential for understanding post-transcriptional regulation (Floor and Doudna, 2016; Franks et al., 2017) and defining cell types from single-cell proteomes (Budnik et al., 2017).
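One rough way to see why large abundance differences demand higher accuracy is to ask how much of a shared peptide's signal comes from the less abundant proteoform. The following heuristic sketch, which is not the sensitivity analysis of the Supplemental Information, assumes two proteoforms sharing one peptide; the function name is hypothetical.

```python
def minor_fraction_of_shared_signal(fold_difference):
    """Fraction of a shared peptide's signal that originates from the less
    abundant of two proteoforms differing by `fold_difference` in abundance."""
    return 1.0 / (1.0 + fold_difference)

# Contribution of the minor proteoform at 10-, 100- and 1000-fold differences.
contributions = {fold: minor_fraction_of_shared_signal(fold)
                 for fold in (10, 100, 1000)}
```

At a 1000-fold difference the minor proteoform contributes about 0.1% of the shared signal, so relative quantification would need to resolve changes smaller than that for the shared peptide to constrain its level.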
The general form of HIquant described in Fig. 1c indicates that HIquant is not limited to proteoforms, even broadly defined. Rather, HIquant can be applied to any set of proteins sharing a peptide. Here we emphasize the application to proteoforms because existing bottom-up methods are better suited for quantifying the stoichiometry between proteins with low homology that generate many unique peptides. For proteins with multiple unique peptides, some of the peptide-specific bias (from variation in protein-digestion and peptide-ionization efficiency) is likely to be averaged out and reduced. However, this bias is a more serious problem for proteoforms with only one or only a few unique peptides (Giansanti et al., 2015). For such proteoforms, HIquant can allow estimating stoichiometries accurately using only ratios between chemically identical ions.

Results
Figure 1 | Model for inferring stoichiometries among proteoforms and paralogous proteins independently from peptide-specific biases. (a) One shared (X2) and three unique (X1, X3 and X4) peptides of H3 proteoforms illustrate a very simple case of HIquant. HIquant models the peptide levels measured across conditions (x) as a superposition of the protein levels (p), scaled by unknown peptide-specific biases/nuisances (z). These coupled equations can be written in a matrix form whose solution infers the methylation stoichiometry independently from the nuisances (z). (b) The general form of the model for K proteoforms (or homologous proteins) with M peptides quantified across N conditions can be formulated and solved. In many, albeit not all, cases an optimal and unique solution can be found, even in the absence of unique peptides; see Supplementary …
Figure 2 | (a) We prepared a gold standard of proteoforms from the dynamic universal proteomics standard (UPS2) whose cysteines were covalently modified either with iodoacetamide or with vinylpyridine. Upon digestion, these modified UPS proteins generate many shared peptides (peptides not containing cysteine) and a few unique peptides (peptides containing cysteine). The modified UPS2 proteins were mixed with one another at known ratios (n), mixed with yeast lysate, digested and quantified by MS. The proteoform ratios that HIquant inferred from the MS data (n̂) were compared to the mixing ratios. (b) The ratios across the alkylated isoforms of UPS2 inferred by HIquant (n̂, y-axis) accurately reflect the mixing ratios (n, x-axis). (c) Comparison of the error in proteoform ratios inferred by HIquant and ratios inferred from the precursor ion areas and the reporter ion (RI) ratios.
Figure 3 | HIquant accurately infers stoichiometries and confidence intervals across PTM site occupancies of histone 3. (a) Histone 3 peptides were quantified by SRM across 7 perturbations, and the fractional site occupancies for K4 methylation were estimated by two methods: estimates inferred by HIquant without using external standards are plotted against the corresponding estimates based on MasterMix external standards with known concentrations (Creech et al., 2015). Each marker shape corresponds to the PTM site(s) shown in the legend; methylation is denoted with "me" and acetylation with "ac" followed by the number of methyl/acetyl groups. (b) The validation method from (a) was extended to another set of more complex fractional site occupancies on K9 methylation and K14 acetylation.
Figure S1 | Model for inferring stoichiometries among proteoforms and paralogous proteins independently from peptide-specific biases. (a) One shared (X1) and two unique (X2 and X3) peptides from the two paralogs of ribosomal protein L6 (RPL6) illustrate the simplest case of HIquant. HIquant models the peptide levels measured across two conditions (x) as a superposition of the protein levels (p), scaled by unknown peptide-specific nuisances (z). These coupled equations can be written in a matrix form whose solution infers the P1/P2 stoichiometry independently from the nuisances (z). (b) The shared and unique peptides of proteoforms (as illustrated by PDHA1 phospho-proteoforms) can be modeled as in panel (a). (c) The matrix system from (a) generalizes to K proteoforms (and homologous proteins) with M peptides quantified across N conditions. In many, albeit not all, cases an optimal and unique solution can be found, even in the absence of unique peptides. See Supplemental Information for details.