Quantitative Protein Topography Analysis and High-Resolution Structure Prediction Using Hydroxyl Radical Labeling and Tandem-Ion Mass Spectrometry (MS)*

Hydroxyl radical footprinting-based MS for protein structure assessment aims to characterize ligand-induced conformational changes and macromolecular interactions (for example, protein tertiary and quaternary structure), but the structural resolution provided by typical peptide-level quantification is limiting. In this work, we present experimental strategies using tandem-MS fragmentation to increase the spatial resolution of the technique to the single-residue level, providing a high-precision tool for molecular biophysics research. Overall, we demonstrate an eightfold increase in structural resolution compared with peptide-level assessments. In addition, to provide a quantitative analysis of residue-based solvent accessibility and protein topography as a basis for high-resolution structure prediction, we illustrate strategies of data transformation that use the relative reactivity of side chains for normalization and predict side-chain surface area from the footprinting data. We tested the methods on Ca²⁺-calmodulin, which showed highly significant correlations between the surface area and side-chain contact predictions for individual side chains and the crystal structure. Tandem-ion-based hydroxyl radical footprinting-MS provides quantitative, high-resolution protein topology information in solution that can fill existing gaps in structure determination for large proteins and macromolecular complexes.

Hydroxyl radical footprinting (HRF) is valuable for assessing the structure of macromolecules. Single-nucleotide resolution data, enabled by the similar reactivity of the OH radical with every backbone position, have helped solve important problems in the nucleic acids field, such as understanding RNA folding and ribosome assembly (1-5). Applications of HRF to probe protein structure are a subset of a family of structural MS approaches that use reversible deuterium labeling or irreversible covalent labeling, including labeling with OH radicals (6-13). Hydrogen-deuterium exchange MS (HDX-MS) is particularly suited to measuring secondary and tertiary structure stability through backbone exchange, whereas HRF-MS has been effective at measuring the relative solvent accessibility of specific amino acid side chains mediated by intramolecular tertiary and intermolecular quaternary structure interactions. Hydroxyl radicals can be generated by a variety of methods; in each case the chemistry has been shown to be quite similar, and the radicals react with side chains of surface residues to give well-characterized oxidation products (7,10,11). As up to 18 side chains are potential probes, the overall protein coverage and resolution of the method are theoretically high.
Both HDX-MS and HRF-MS utilize a "bottom-up" proteomics approach in which proteins are digested to peptides after labeling, and mass shifts of the resultant peptides are read out to pinpoint sites of conformational change. Although this usually can provide 90% or more coverage across the entire protein length, the structural resolution is limited to the size of the peptide fragments, because the data report the average behavior of the individual residues across the entire peptide, which is typically in the range of 5-20 residues (14). MS2-based quantification is in principle a general solution to the problem of increasing structural resolution and has been attempted for HDX-MS, but the scrambling of the labels in the gas phase has been difficult to overcome using collision-induced dissociation (15,16). Alternative approaches for HDX-MS site localization, such as electron transfer dissociation to achieve single-residue resolution, have potential promise but are typically limited to larger peptides that can access higher charge states easily (17,18). MS2 strategies to enhance the resolution of covalent labeling experiments have been attempted with some success, as scrambling is not a limitation in covalent labeling experiments (7,19-21). On the other hand, MS1-based strategies that enhance structural resolution for both HDX and covalent labeling approaches using overlapping protease fragments are also a promising route to subpeptide resolution in many cases (7,20-27).
In this work, we present a coupled set of high-throughput experimental and computational approaches to extend previous MS2-based HRF-MS strategies and provide a quantitative topographical structure assessment for proteins at the individual side-chain level. The combined approach permits quantification of modifications by examining a tandem-ion-based ladder of peptide fragments and combining the ion abundances from both MS1 and MS2 quantification. The high-resolution information is transformed using knowledge of the relative reactivity of side chains to predict side-chain surface area for the structurally well-characterized Ca²⁺-bound form of calmodulin (CaM). In addition, we explored a statistical approach using random forest regression to predict solvent accessible surface area at the residue level. Overall, these studies provide a novel route to high-resolution single-residue surface accessibility data, with at least eightfold higher spatial resolution than peptide-based measures, for accurate protein topography predictions.

EXPERIMENTAL PROCEDURES
Sample Preparation and Synchrotron X-ray Radiolysis-Recombinant bovine calmodulin (CaM) was purified as described previously (28). The CaM protein sample was diluted into 10 mM cacodylic acid sodium salt trihydrate, 0.2 mM CaCl₂, pH 7.5, to a final concentration of 6 μM. Radiolysis experiments were performed at beamline X-28C of the National Synchrotron Light Source with a ring energy of 2.8 GeV and a beam current ranging between 190 and 210 mA (29-31). Sample tubes containing 5 μl of CaM protein solution were exposed to x-rays for 0, 8, 10, 15, and 20 ms and were immediately quenched with methionine amide at a final concentration of 10 mM to prevent secondary oxidation of CaM.
Proteolysis and Mass Spectrometry-Irradiated protein samples were digested with sequencing-grade modified trypsin (Promega Co., Madison, WI) at 37°C overnight at an enzyme/protein molar ratio of 1:20. Digestion was terminated by adding formic acid to the digested peptides at a final concentration of 0.1%. Digested CaM samples were loaded onto a 180 μm i.d. × 2 cm trapping column packed with C18 Symmetry, 5 μm, 100 Å (Waters, Taunton, MA) using buffer A (100% water, 0.1% FA) at 5 μl/min to preconcentrate the sample and wash away salts. Proteolytic peptides were eluted from the reverse-phase column (75 μm i.d. × 25 cm, packed with C18 BEH130, 1.7 μm, 130 Å (Waters)) with a gradient of 10 to 50% mobile phase B (0.1% formic acid and acetonitrile (ACN)) and mobile phase A (100% water/0.1% formic acid) over a period of 80 min at ambient temperature and a flow rate of 300 nl/min. Peptides eluting from the column were introduced into the nanospray source with a capillary voltage of 2.5 kV. All MS1 and MS2 spectra were recorded at a resolution R of 60,000 and 30,000, respectively, in the positive ion mode using an Orbitrap Elite instrument from Thermo.
MS Instrument Method-The overall workflow for setting up the instrument method for MS experiments is shown in Fig. 1A. First, a survey MS experiment on a CaM sample exposed to the x-ray beam for 20 ms was carried out to detect a wide distribution of oxidatively modified species. Using these detected m/z values, we generated a "dynamic inclusion" target list containing a total of 62 unique m/z species and their expected retention time values. The method is characterized as "dynamic" because the target m/z values change as a function of chromatography and expected retention time. In the absence of a survey scan, the list could be populated with theoretical m/z values for all expected peptide isoforms based on knowledge of the protein sequence, the protease employed, and the masses of the expected radiolytic products. Although the expected retention time values are not required, adding this dimension to the inclusion list makes the triggering events more efficient. In addition, the retention times of specific modifications are predictable with respect to the retention time of the unmodified peptide. All the parameters (target m/z values, mass accuracy, expected retention time, retention time window) can be defined by the user, and hence tailored to the specific project and instrument being used. The MS instrument method was set up such that the final MS runs triggered MS/MS scans only if a precursor ion was observed within 10 ppm of the target m/z value and eluted within 5 min of the corresponding retention time.
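The triggering rule described above can be illustrated with a minimal sketch. The `Target` class, the function names, and the interpretation of "within 5 min" as a symmetric window of plus or minus 5 min are assumptions made for illustration; this is not the instrument vendor's method editor or the authors' software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """One inclusion-list entry (hypothetical structure for illustration)."""
    mz: float      # target precursor m/z
    rt_min: float  # expected retention time in minutes


def ppm_error(observed_mz, target_mz):
    """Signed mass error of an observed m/z relative to a target, in ppm."""
    return (observed_mz - target_mz) / target_mz * 1e6


def should_trigger(observed_mz, observed_rt_min, targets,
                   ppm_tol=10.0, rt_window_min=5.0):
    """Return True if the precursor matches any target in m/z and RT.

    Assumption: the RT criterion is |observed - expected| <= rt_window_min.
    """
    for target in targets:
        mz_ok = abs(ppm_error(observed_mz, target.mz)) <= ppm_tol
        rt_ok = abs(observed_rt_min - target.rt_min) <= rt_window_min
        if mz_ok and rt_ok:
            return True
    return False
```

For example, with a single illustrative target at m/z 500.0 and 27.0 min, a precursor at m/z 500.004 (8 ppm) eluting at 29 min would trigger an MS/MS scan, whereas one at m/z 500.006 (12 ppm) would not.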
Data Processing-ProtMapMS 2.5 was used to analyze the data from the survey MS experiment to generate the list of m/z values of oxidized forms by searching the spectrum against the trypsin-digested peptides of calmodulin (UniProtKB: P62157) with no missed cleavages (32). The flowchart for data processing is shown in Fig. 1B. The algorithm involves repeating the following steps for the predominant charge state(s) of each peptide:
1. Peptide Level Selected Ion Chromatogram (SIC)-The SIC plots are first constructed by extracting peptide ion intensities at the MS1 level, as in traditional analysis (32). Let RT denote the retention time, and
\[ S_{1i}(t) = \begin{cases} \text{SIC at RT} = t \text{ for the unmodified form}, & i = 1 \\ \text{SIC at RT} = t \text{ for the modified form}, & i = 2 \end{cases} \]
Thus, $S_{1i}(t)$ denotes the SIC value at RT = t, where the first subscript "1" denotes the MS level and the second subscript i denotes the unmodified (i = 1) or modified (i = 2) form.
2. Candidate Fragment Ion Generation-The unoxidized form of the peptide is detected by comparing the sequence based theoretical m/z value against the experimentally observed precursor ion m/z values within 10 ppm. The tandem spectra from the matching precursor ions are compared against their theoretical counterparts within 0.03 Da as described previously (32). A representative tandem spectrum, R, corresponding to the highest cross-correlation coefficient is selected (32). The set of fragment ions, F, identified from R are utilized for generating SICs in the next step.
3. Tandem SIC Plots-The dense sampling of the isoforms of interest among the eluting peptides in the experimental setup allows for the generation of smooth tandem SIC plots, resulting in reliable characterization and analysis. In the case of a modified peptide, the oxidation can be present either on a given fragment ion or on its complement. For each fragment ion j in F, ion abundances from MS2-level experiments are recorded to generate tandem SIC plots for the unmodified and modified fragment forms of the modified peptide. This can be viewed as the "electronic" extraction of multiple transitions in a "pseudo-SRM" experiment (33,34). Let
\[ S_{2ij}(t) = \begin{cases} \text{Tandem MS SIC of unmodified fragment } j \text{ of the modified peptide at RT} = t, & i = 1 \\ \text{Tandem MS SIC of modified fragment } j \text{ of the modified peptide at RT} = t, & i = 2 \end{cases} \]
4. Combo SICs-The corresponding SICs from steps 1 and 3 are multiplied pointwise at each retention time and divided by the sum of the intensities of $S_{2ij}(t)$ to obtain "Combo SIC" plots, as defined by the following equation:
\[ S_{cij}(t) = S_{12}(t) \times \frac{S_{2ij}(t)}{\sum_{k=1}^{2} S_{2kj}(t)} \tag{1} \]
$S_{cij}(t)$ represents the Combo SIC for the unmodified (i = 1) and modified (i = 2) forms of fragment j of the modified peptide at RT = t. An example of Combo SIC plot construction is shown in supplemental Fig. S1. The Combo SIC for a given fragment ion can be viewed as the peptide-level MS1 SIC (the first term in equation (1)) fractionated according to the corresponding tandem ion intensity (the second term in equation (1)). The signals in the uncorrected $S_{2ij}(t)$ plots are susceptible to variation in fragmentation as a function of the side-chain oxidation chemistry (35,36). Combining the MS1- and MS2-level information as in equation (1) helps to minimize such bias.
5. Dose Response (DR) Curves-To draw a DR plot for the jth fragment ion, the peak areas under the Combo SIC curves are calculated for each modified and unmodified form. The fraction of the unmodified fragment ion, j, is computed as follows:
\[ DR_j = \frac{\int S_{11}(t)\,dt + \int S_{c1j}(t)\,dt}{\int S_{11}(t)\,dt + \int S_{c1j}(t)\,dt + \sum_{m=1}^{n} \int S_{c2j}^{(m)}(t)\,dt} \]
where n = number of observed modified forms. In the present study, n = 1 because we are focusing on the most abundant (+16) modification. The first terms in both the numerator and the denominator represent the unmodified form of the peptide, the second terms in each case represent the unmodified form of the jth fragment ion originating from the modified peptide, and the last term in the denominator represents the modified fragment ion.
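Steps 4 and 5 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' Matlab prototype: it assumes that the Combo SIC is the MS1 SIC of the modified peptide split in proportion to the two tandem ion intensities at each retention time, that all traces share one retention time grid, and that peak areas are taken by trapezoidal integration. All function names are hypothetical.

```python
def combo_sic(ms1_modified, ms2_unmod_fragment, ms2_mod_fragment):
    """Split the MS1 SIC of the modified peptide between the unmodified
    and modified forms of fragment j, point by point in retention time."""
    combo_unmod, combo_mod = [], []
    for s12, s21j, s22j in zip(ms1_modified, ms2_unmod_fragment,
                               ms2_mod_fragment):
        total = s21j + s22j
        if total == 0:
            # No tandem signal at this RT: nothing to apportion
            # (assumption: such points contribute zero to both traces).
            combo_unmod.append(0.0)
            combo_mod.append(0.0)
        else:
            combo_unmod.append(s12 * s21j / total)
            combo_mod.append(s12 * s22j / total)
    return combo_unmod, combo_mod


def peak_area(rt, intensity):
    """Trapezoidal area under an SIC sampled at retention times rt."""
    return sum((rt[k + 1] - rt[k]) * (intensity[k] + intensity[k + 1]) / 2.0
               for k in range(len(rt) - 1))


def unmodified_fraction(rt, ms1_unmodified, combo_unmod, combo_mod):
    """Fraction of fragment j that is unmodified (one modified form, n = 1)."""
    a_11 = peak_area(rt, ms1_unmodified)   # unmodified peptide
    a_c1j = peak_area(rt, combo_unmod)     # unmodified fragment, modified peptide
    a_c2j = peak_area(rt, combo_mod)       # modified fragment, modified peptide
    return (a_11 + a_c1j) / (a_11 + a_c1j + a_c2j)
```

By construction the two Combo SIC traces sum to the MS1 SIC of the modified peptide wherever tandem signal is present, which matches the complementary behavior of the product ion pairs described in Results.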
A dose response curve is generated by plotting the DR_j values defined above for each interval of hydroxyl radical exposure. The resulting curve typically obeys pseudo-first-order kinetics as described previously, and the corresponding rate constants (RC) can be calculated for the tandem ion DR plots (32). Residue (or segment) specific rate constants are generated by subtraction of successive tandem ion RC values. A prototype version of the data processing steps was developed using Matlab version R2013a. The prototype software and mass-labeled spectra of the modified peptides are available at the website http://csb.case.edu.
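The rate constant calculation and the subtraction of successive tandem ion values can be sketched as below. The fit shown is a simplification chosen for brevity: a least-squares line through the origin for ln(fraction unmodified) against exposure time, which assumes the fraction is exactly 1 at zero exposure. The published analysis fits an exponential function and may differ in detail; the function names are hypothetical.

```python
import math


def fit_rate_constant(times_s, unmodified_fractions):
    """Estimate k in fraction = exp(-k * t) by least squares on
    ln(fraction) versus t, with the line forced through the origin.

    times_s: exposure times in seconds; fractions must be > 0.
    """
    numerator = sum(t * math.log(f)
                    for t, f in zip(times_s, unmodified_fractions))
    denominator = sum(t * t for t in times_s)
    return -numerator / denominator


def differential_rate_constants(ladder_rc):
    """Residue (or segment) specific RCs from a ladder of tandem ion RCs
    ordered by increasing fragment length (e.g. b3, b4, b5, ...)."""
    return [longer - shorter
            for shorter, longer in zip(ladder_rc, ladder_rc[1:])]
```

As an illustration, a ladder of cumulative rate constants of 0.3 and 1.3 s⁻¹ for two successive b-ions gives a differential of 1.0 s⁻¹ for the residue added between them.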
Structural Prediction Using Random Forest Regression-Random forest regression was performed with 25 decision trees across 40 residues using Python 2.7 (37). The residue-level RC values and their reactivity values (Table I and supplemental Table S1), along with the corresponding fractional solvent accessible surface area (fSASA, the ratio between a residue's observed solvent-accessible area and its standard accessible area) from the crystal structure (PDB ID: 1PRW) as calculated by VADAR, were used as training inputs, and the fSASA values corresponding to the test RC values were the output. The rate constants from the b- and y-ion series were averaged to calculate the rate constant for each residue. Leave-one-out validation was used: the prediction for each residue was made after training on the remaining 39 residues.
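The leave-one-out scheme can be illustrated with a small, dependency-free harness. The regressor used here is a deliberately simple nearest-neighbor stand-in, not a random forest, so that the sketch runs with the standard library alone; the function names are hypothetical. With scikit-learn available, a function that fits `RandomForestRegressor(n_estimators=25)` on the training rows and predicts the held-out row could be passed as `fit_predict` instead.

```python
def leave_one_out(features, targets, fit_predict):
    """Predict each sample after training on all the other samples.

    features: list of feature rows, e.g. [rate_constant, reactivity]
    targets:  list of known values, e.g. fSASA from a crystal structure
    fit_predict(train_x, train_y, query_row) -> predicted value
    """
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = targets[:i] + targets[i + 1:]
        predictions.append(fit_predict(train_x, train_y, features[i]))
    return predictions


def nearest_neighbor_predict(train_x, train_y, query_row):
    """Stand-in regressor (NOT a random forest): return the target of the
    closest training row by squared Euclidean distance."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    best = min(range(len(train_x)),
               key=lambda k: squared_distance(train_x[k], query_row))
    return train_y[best]
```

Each prediction is made without the held-out residue ever entering training, which is what allows the predicted and crystal-derived fSASA values to be compared as an internal validation.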

RESULTS AND DISCUSSION
Complexity of Oxidative Modifications-CaM bound to Ca²⁺ was exposed to x-rays, digested with trypsin, and analyzed by UPLC-MS/MS as described in Methods and Fig. 1A and 1B. Ten peptides were identified for an overall sequence coverage of 93%. Peptide 38-74 was too large to provide the fragmentation required for resolving individual residues. Peptide 22-30 exhibited weak signals and no oxidative modifications were observed; thus, the remaining eight peptides, comprising 102 residues, were analyzed in detail. The sample was originally exposed to hydroxyl radicals for 0 to 20 ms. Peptide-level analysis using ProtMapMS revealed that the trend of oxidation deviated from the expected first-order reaction at the 20 ms exposure. This was likely because of overexposure leading to conformational changes relative to the native form. Hence, the 20 ms data were removed from the analysis so that the data included only points out to 15 ms. Fig. 2A shows SICs extracted from analysis of the doubly charged peptide 1-13; the solid and dotted lines show the unoxidized and +16 oxidized forms, respectively. Examination of peak composition using tandem MS revealed that four out of five major peaks comprised multiple +16 species. The peak observed at ~27 min is particularly complex, arising from an isomeric mixture of six distinct species, including modifications of T5, E6, E7, E11, F12, and K13. Note that the extent of modification varies from residue to residue, reflecting the intrinsic reactivity and accessibility of each. The data are of extraordinary richness, in that modifications of almost all side chains across the peptide are reliably detected. However, an additional dimension for separating (and hence individually quantifying) these species is required in order to extract high-resolution structural information. This motivated us to perform targeted analysis of ions of potential interest.
This was judged to be particularly beneficial for examining minimally abundant oxidative products that may elude traditional data-dependent analysis. The target inclusion list was derived from examination of a survey MS1 scan (Fig. 1A) and contained a total of 62 discrete m/z and expected retention time value pairs, including 16 m/z values for unmodified peptide forms with different charge states. In the follow-up MS experimental run using the inclusion list, the number of tandem MS scans of the targeted species of interest rose fourfold compared with data-dependent scanning. Although multiple modified forms were targeted initially, we focused on the +16 modifications here (10).
Combo SIC and Complementary Product Ion Pairs-Fig. 2B-2C show Combo (MS2/MS1) SIC plots of two complementary product ion pairs, namely y4/b9 and y5/b8, from peptide 1-13 of the 15 ms exposed sample of CaM. The SIC plots are generated by fractionating the MS1-level SIC using the precursor/fragment ion pair intensities generated from examination of the parent and tandem ions (see Methods). The blue plot shows the parent unmodified form, whereas the red and green plots are Combo SIC plots from the +16 parent ion. For these tandem ion fragments, the oxidation can be present either on the fragment or on its complement. The y4 + 16 SIC (top, Fig. 2B, red curve) shows multiple oxidized y4 + 16 species that have potential oxidations on K, F, E, or A (e.g. residues 10-13). The complementary y4 SIC (top, Fig. 2B, green) is quite distinct, and is derived from parent peptides that have oxidations elsewhere (e.g. residues 1-9). The complementary b9 + 16 and b9 SICs are mirror images of the y4 and y4 + 16 SICs, respectively, with the red curves resembling the green and vice versa (top, Fig. 2B-2C). A similar trend is seen for the y5-b8 product ion pair. The symmetrical patterns in the complementary product ions indicate that this approach allows us to observe the same variable (extent of oxidation) from two different perspectives, and show that we have sufficient ion intensity to support residue-specific quantification for both the y- and b-ion series. The sum of the complementary chromatograms is equal to the MS1 chromatogram seen in Fig. 2A.
Resolving Sites of Oxidation through Combo SIC-Fig. 2B reveals the appearance and disappearance of specific peaks while moving through the SIC plots of successive ions. For example, the first three green peaks (at 21.8, 22.8, and 23.2 min) for the y4 ion (top, Fig. 2B) disappear and are partially transformed to red for the y5 + 16 ion (bottom, Fig. 2B), consistent with these modified peptides having I9-based oxidations. This is also confirmed by the trend of the same three peaks changing from red to green in the comparison of the b9 + 16 to the b8 ion, indicating loss of I9 oxidation (Fig. 2C). We speculate that the varying chromatographic elution times of the three peaks represent different isoforms of the I9 + 16 peptide with modifications at various positions of the side chain. We have previously observed such chromatographic isoforms in the oxidation of Phe (32). As in the case of SRM experiments, the data illustrate that the dynamic inclusion workflow generates tandem ion signatures of adequate specificity and signal to noise to quantitate individual residue contributions that are masked when using only SIC extraction at the MS1 level (38,39). A complete set of tandem MS SICs for both the b- and y-ion series for peptide 1-13 is shown in supplemental Fig. S2, indicating the reproducibility of the SICs across nine unique fragment ion pairs.
Estimating Extent of Oxidation for Individual Fragment Ions Using Combo SIC Plots-The tandem MS signals can be affected by variation in fragmentation as a function of the side-chain oxidation chemistry (35,36). A combination of signals from both MS1 and MS2 levels in the form of Combo plots helps to minimize such bias. Supplemental Fig. S1 shows the construction of a Combo SIC plot and its effect in offsetting the fragmentation bias. The individual Combo SIC plots for each tandem ion enable the calculation of the corresponding level of oxidation from the areas under the curves. Based on this approach we show (Fig. 2D) the percentage of total oxidation as a function of the y-ions (solid line) or b-ions (dotted line) for peptide 1-13 of the 15 ms hydroxyl radical exposed sample of CaM. The complementary product ion pairs are plotted at the same point on the x-axis. Note that the two curves are mirror images, one indicating the gain of oxidation (y-ions from left to right) and the other the corresponding loss (b-ions from right to left). This view illustrates the division of the overall oxidation of the intact peptide into the contributions from individual residues as one moves across the sequence. For example, based upon the integrated peak areas of the y2 and y2 + 16 isoforms from the Combo SICs, 1.5% of the overall ions were observed to be oxidized for y2. By comparison, y3 exhibits ~2.5% oxidation, indicating that the addition of E in the y2 to y3 comparison adds 1% to the total. Examination of the b-ion series shows agreement with this view, where the percentage oxidation drops by 1% in the comparison of b11 to b10. Thus, the tandem ion ladder reveals single-residue contributions to the oxidation process.
Tandem Ion Dose Response Plots Provide Residue Specific Rate Constants-A dose response (DR) curve illustrates the decreased fraction of the unmodified peptide as a function of its exposure time to x-ray radiation and thus OH-radical dose. The DR curve serves two functions: first, it provides improved statistics for measuring the oxidation process through multiple measures of the oxidation extent; second, adherence of the DR plots to pseudo-first-order behavior gives confidence that the correct overall OH radical dose has been selected such that the biological integrity of the sample is maintained (29,40,41). The effect of overexposure was seen for the 20 ms time points and thus these were excluded from the analysis. Using the Combo SIC method to determine tandem ion intensity, we calculated the ion intensities corresponding to 0, 5, 10, and 15 ms exposures in order to develop the specific tandem ion DR plots. The unmodified fraction for each tandem ion is calculated and the data are fit to an exponential function to provide the rate constant (RC) of oxidation (29,40,41). Fig. 3 shows examples of tandem ion DR plots for two CaM peptides, including DR plots for both the y- and b-ion series. The curve marked T shows the DR plot as calculated by the traditional MS1 analysis. The rate of oxidation increases with increasing tandem ion fragment length, and the DR plots approach the DR of the intact peptide for the longest observed tandem ions. The DR plots for peptides 31-37, 78-86, 95-106, 107-115, and 127-148 are shown in supplemental Fig. S3.
Subtraction of successive tandem ion DR plot RC values is used to provide residue or segment specific RC values. Table I shows 63 independent structure measures from 95 independent differential RC values (from both b- and y-ion DR plots) calculated for eight CaM peptides (102 total residues). This represents an eightfold increase in structural resolution versus peptide-level rate assessments for the eight peptides quantitated, and a residue-level coverage of 62% for those eight peptides (43% relative to the 148 residues of the entire protein). The use of additional enzymes could clearly provide more information for peptide 38-74, increasing the overall coverage dramatically.
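As a quick check of the figures quoted above, and assuming that the resolution gain is the number of independent structure measures divided by the number of peptides quantitated (an interpretation, not stated explicitly in the text), the arithmetic works out as:

```latex
\[
\frac{63\ \text{measures}}{8\ \text{peptides}} \approx 7.9 \ (\text{about eightfold}),
\qquad
\frac{63}{102} \approx 0.62,
\qquad
\frac{63}{148} \approx 0.43
\]
```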
The residue-specific RC data are a function of each residue's reactivity, solvent accessibility, ionization efficiency, and the relative fragmentation efficiencies of the oxidized and unoxidized forms. The highest RC values are seen for M124 (4.2 and 4.2 for b-ion and y-ion data, respectively; all values in s⁻¹), Y99 (4.0 and 4.7), M144 and M145 (3.7 and 4.0, both for y-ions), Y138 (3.5 for b-ion), and M36 (2.6 and 2.6), reflecting the idea that sulfur-containing and aromatic residues are the most reactive. As the range of OH radical reactivity for the 20 amino acids spans three orders of magnitude, with free Cys over 1000 times more likely to suffer hydroxyl radical attack than Gly (10,42), reactivity will dominate these RC data. The lowest rate values correspond to the residues with low reactivity and/or the lowest solvent accessibility.
An examination of peptide 1-13 in detail reveals the important trends in the data. The first three residues, A1 (SASA = 51 Å²), D2 (114 Å²), and Q3 (70 Å²) (Table I, Fig. 3A-3B), represented by the b3 tandem ion DR, are of low reactivity, consistent with a cumulative RC value of 0.3 s⁻¹. Symmetrical increases are seen comparing b3 to b4 and y9 to y10 (differential RCs = 1.0 s⁻¹ and 0.9 s⁻¹, respectively), indicating significant oxidation on L4 (2.5 Å²). However, L4 exhibits low solvent accessibility and has modest reactivity. Such a discrepancy could indicate either (1) differences between the solution state and the crystal structure, or (2) oxidation-induced variation in side-chain chemistry as discussed in the following section. Continuing down the ladder of ions shows a significant shift from b11 to b12 (change in RC = 1.0), which includes the reactive F12 (5 Å²) residue. Note that the value of the fraction of unoxidized peptide (~0.94) for both b12 and y11 at the 15 ms exposure time is consistent with the overall ~6% oxidation observed in Fig. 2D. Fig. 3C and 3D show the DR plots for the b- and y-ion series, respectively, for CaM peptide 14-21. Although small in absolute terms, the most significant relative shifts in the DR curves are symmetrically experienced by the DR plots that include the F16 (b3-b2/y6-y5 ions), L18 (b5-b4/y4-y3 ions), and F19 (b6-b5/y3-y2 ions) residues. This is consistent with the moderately reactive nature and modest SASA of these residues as shown in Table I. Low oxidation of the first two residues, E14 (32 Å²) and A15 (18 Å²), which have relatively low reactivity, is also seen.
Sensitivity of Data to Side Chain Chemistry and Reproducibility of Results-There can be variation in the fragmentation chemistry of oxidized versus unoxidized peptide isomers during collision-induced dissociation, which can potentially confound the results in absolute terms (35,36,43,44). Such variations can also lead to negative bias; some example cases are shown by the negative RC values in Table I. Such bias is expected to be systematic and specific to individual residue side-chain chemistry and is not reflective of irreproducibility of the data, which show median standard deviations of 3% of the RC values in triplicate experiments (supplemental Fig. S4). Fragmentation bias is significantly offset using the Combo SIC plots, where MS1-based corrections make the quantitation less sensitive to variations in the fragmentation pattern. When the method is applied to the comparison of two forms of the same protein (e.g. ± ligand), the bias will be identical across the comparison, and the data can be correctly interpreted as a relative change in solvent accessibility because the other variables are held constant. Also, the random forest regression method employed below for structure prediction uses a training approach to include such biases in its predictions.
Applications to Structure Prediction-Improved techniques for detecting and quantifying labeling at the residue level have the potential to drive high-resolution structural modeling (26,29,45). A clear drawback of using the data for such modeling is the widely varying reactivity of individual residues. To explore the usefulness of these residue-level data in structure prediction, we first incorporate the concept of protection factors to normalize the data. After normalization, we examined both a biophysical approach based on first principles and a regression learning approach based on statistical evaluation of the data (random forest regression) to predict structure from the data, and then compared and contrasted the methods.
Recently, we introduced the concept of the protection factor (PF), where reactivity measures at the peptide level from MS1 based data were first normalized based on known reactivity data from the literature, such that the rate constants across different peptides could be compared on an absolute scale (46). The corrected rates were used to predict the surface or interior locations of peptides using gelsolin as an example. Taking this approach further, we can convert our single residue rates into single residue PFs. The PFs are calculated using the normalization factors from supplemental Table S1 to convert apparent rates to normalized rates.
Specifically, we derive a PF value for each residue by the following equation:
\[ PF_i = \frac{R_i}{k_{fp}} \]
where $R_i$ is the relative chemical reactivity of residue type i toward solution-generated hydroxyl radicals, using proline as the internal reference, and $k_{fp}$ is the measured rate constant for the residue (46). Table I shows the calculated PFs using both y- and b-ion rate data from CaM. We exclude zero or negative rate constants from the PF calculation. In order to assess the correlations of these single-residue data with CaM structure, we calculated both the SASA of each CaM residue and the number of residue-specific structural contacts. In addition, because the PF is essentially still a (corrected) rate, we take the natural log of the PF, which correlates with the relative free energy of the conformational and chemical barrier to oxidation, and in supplemental Fig. S5A-S5B show a comparison of structural data versus log PF. These PF values (derived from 36 residues relating to 26 b-ion and 32 y-ion RC values) have a Pearson's correlation coefficient with SASA of -0.58 (p value = 9.0 × 10⁻⁷), whereas the correlation coefficient with structure factor is 0.63 (p value = 3.8 × 10⁻⁸). These results are visualized in supplemental Fig. S5C-S5D, where the PF is color coded and represented on the CaM structure, with the red colored residues (higher PF) clearly oriented toward the interior of the protein and the blue colored residues (lower PF) preferentially outside.
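The protection factor calculation and the correlation analysis can be sketched as follows. The sketch assumes that the PF is the relative reactivity divided by the measured rate constant, so that slowly oxidized residues receive a high PF, in line with the interior residues being described as having higher PF. The function names and the dictionary layout are hypothetical, and any reactivity values passed in would need to come from supplemental Table S1.

```python
import math


def protection_factors(rate_constants, reactivities):
    """Compute PF and ln(PF) per residue.

    rate_constants: dict of residue label -> measured rate constant (s^-1)
    reactivities:   dict of residue label -> relative reactivity R_i
    Residues with zero or negative rate constants are excluded.
    """
    result = {}
    for residue, k_fp in rate_constants.items():
        if k_fp <= 0:
            continue  # excluded, as in the text
        pf = reactivities[residue] / k_fp  # assumption: PF = R_i / k_fp
        result[residue] = (pf, math.log(pf))
    return result


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)
```

Taking ln(PF) before correlating with SASA or contact number keeps the comparison on a scale that tracks the relative free energy barrier, as described in the text.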
This PF analysis clearly shows the potential power of the residue-specific structure data to predict structure de novo; however, the exclusion of zero and negative values of the rate constants is a limitation. Thus, we explored an alternative statistics-based approach using multivariate random forest regression (37) and used all single-residue data (40 total residues with data from 30 b-ions and 35 y-ions) except those from D, N, Q, T, and S, which have minimal or no observable +16 modified species (10). The regressor was trained using RC and reactivity values as input and fSASA values from the crystal structure as the output (see "Methods"). Leave-one-out validation was used to perform fSASA predictions for each residue. Fig. 4A plots the fSASA values determined from the crystal structure on the x-axis against the predicted values on the y-axis. The data exhibit a Pearson's correlation coefficient of 0.77 (p value = 6.8 × 10⁻⁹), suggesting that the method provides accurate internal predictions. Hydrophobic residues (M, F, W, I, L, and V), which are both the most frequent targets and preferentially in the protein interior, are clustered near (0, 0), whereas the charged residues (R, K, and E) are further from the origin. Fig. 4B visualizes the data on the crystal structure. The side chains of residues with predicted fSASA < 0.13 are shown in red, those with 0.13 ≤ fSASA ≤ 0.21 in purple, and those with fSASA > 0.21 in blue. The residues pointing to the interior are colored in red, whereas the more solvent accessible regions for residues along the same helix are colored in purple/blue at multiple locations. Note in particular helices H1 and H2, where the color alternates between red and purple/blue around the helical turns for these exterior helices, clearly showing the ability of the method to provide residue-based resolution.
CONCLUSIONS
Overall, our approach is very promising for high-resolution protein structure prediction.
The demonstrated computational and experimental workflows efficiently quantitate tandem ion based oxidation products from HRF-MS. Dynamic inclusion allowed for enhanced ability to detect and accurately quantify low abundant modifications. The eightfold gain in spatial resolution from peptide to residue level going from MS1 to MS2 quantification and the accuracy and precision of the method make it well suited to providing residue level side-chain surface accessibility information for structure modeling. The PF analysis is valuable for de novo structure prediction and has an interesting potential correlation with the free energy barrier to hydroxyl radical attack for a specific site. The random forest regression, which is purely a statistics based approach, is likely more flexible toward some of the bias in the data discussed above. It can clearly be applied to comparisons of structurally related protein forms when some crystallographic or NMR information is available, but may be even more general if data from one protein can be used to predict others. In both cases, the correlations of modeling predictions with structural data are very promising. Further developments would include examination of all modified species and experimenting with alternative fragmentation strategies to further enhance resolution and accuracy. Overall, the approach here is well suited to addressing gaps in protein structure assessment for flexible conformations of proteins and large macromolecular complexes, which are some of the most challenging and interesting problems in structural biology today.

FIG. 4. A, Relation between fractional SASA based on crystal structure and prediction using Random Forest regression. Rate constants from b- and y-ions showed a Pearson's correlation coefficient of 0.995, and were averaged to make predictions for each residue. The diagonal line is shown in blue, indicating the ideal behavior. The Pearson's correlation coefficient between the two values across 40 residues is 0.79 (p value = 2.3 × 10⁻⁹). B, Two views of the crystal structure of calmodulin mapped with predicted solvent accessibility. High protection is shown in blue regions, purple regions show medium protection, whereas red regions show high solvent accessibility. Ca²⁺ ions are shown by white spheres.