Visualisation tools for dependent peptide searches to support the exploration of in vitro protein modifications

Dependent peptide searching is a method for discovering covalently-modified peptides–and therefore proteins–in mass-spectrometry-based proteomics experiments. Being more permissive than standard search methods, it has the potential to discover novel modifications (e.g., post-translational modifications occurring in vivo, or modifications introduced in vitro). However, few studies have explored dependent peptide search results in an untargeted way. In the present study, we sought to evaluate dependent peptide searching as a means of characterising proteins that have been modified in vitro. We generated a model data set by analysing N-ethylmaleimide-treated bovine serum albumin, and performed dependent peptide searches using the popular MaxQuant software. To facilitate interpretation of the search results (hundreds of dependent peptides), we developed a series of visualisation tools (R scripts). We used the tools to assess the diversity of putative modifications in the albumin, and to pinpoint hypothesised modifications. We went on to explore the tools’ generality via analyses of public data from studies of rat and human proteomes. Of 19 expected sites of modification (one in rat cofilin-1 and 18 across six different human plasma proteins), eight were found and correctly localised. Apparently, some sites went undetected because chemical enrichment had depleted necessary analytes (potential ‘base’ peptides). Our results demonstrate (i) the ability of the tools to provide accurate and informative visualisations, and (ii) the usefulness of dependent peptide searching for characterising in vitro protein modifications. Our model data are available via PRIDE/ProteomeXchange (accession number PXD013040).


Chemicals
Bovine serum albumin (BSA) was purchased from ThermoFisher Scientific (Massachusetts, USA) as a 2 mg mL⁻¹ aqueous solution (product number 23209) also containing 0.9% (w/v) sodium chloride and 0.05% (w/v) sodium azide [caution: sodium azide is toxic and reactive]. The solution was lyophilised and stored at −20 °C prior to use. UHPLC-grade solvents (acetonitrile and water) were also from ThermoFisher Scientific. N-Ethylmaleimide (NEM) was purchased from Alfa Aesar (Massachusetts, USA). Iodoacetamide was purchased from Bio-Rad (California, USA). Acetone (HPLC grade) was purchased from Merck (New Jersey, USA). Modified porcine trypsin (sequencing grade) was purchased from Promega (Wisconsin, USA). DL-1,4-Dithiothreitol (DTT) was purchased from Sigma-Aldrich (Missouri, USA). RapiGest SF surfactant was purchased from Waters (Massachusetts, USA). General laboratory reagents were purchased from reputable suppliers and were of sufficient purity for analytical work.

Preparation of BSA adducts
A solution of NEM (50 nmol) in potassium phosphate buffer (100 mM, pH 7.4) (25 µL) [or phosphate buffer alone] was added to a solution of BSA (50 µg) in the same buffer (25 µL). The BSA solution also contained sodium chloride (154 mM) and sodium azide (8 mM) (see above). The mixture was vortexed and left to stand at ambient temperature for 160 min. A solution of DTT (50 nmol) in the phosphate buffer (5 µL) [or phosphate buffer alone] was added to scavenge unreacted NEM, and the mixture was vortexed and left to stand at ambient temperature for 1 h. The reaction medium was exchanged for fresh phosphate buffer (Amicon Ultra-0.5 centrifugal filter units, 10 kDa cut-off; four tenfold concentration-dilution cycles). The phosphate buffer was then exchanged for a 50 mM aqueous solution of ammonium bicarbonate (three tenfold concentration-dilution cycles). The solution was filtered again to achieve a final protein concentration of 0.8 µg µL⁻¹.
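The effect of the buffer-exchange steps above can be estimated with simple arithmetic. The following Python sketch is only an idealised illustration: it assumes complete mixing in every cycle and that small solutes pass freely through the 10 kDa membrane, neither of which is stated in the text. The function name is hypothetical.

```python
def residual_fraction(cycles: int, fold: float = 10.0) -> float:
    """Fraction of a freely filterable solute remaining after repeated
    concentration-dilution cycles (idealised: complete mixing, no retention
    by the membrane and no binding to the protein)."""
    return fold ** -cycles


# Four tenfold cycles against fresh phosphate buffer
after_phosphate_exchange = residual_fraction(4)

# Three tenfold cycles against ammonium bicarbonate
after_bicarbonate_exchange = residual_fraction(3)

# Protein recovered in 50 µL of retentate at 0.8 µg/µL (values from the text)
protein_in_retentate_ug = 0.8 * 50

print(after_phosphate_exchange, after_bicarbonate_exchange, protein_in_retentate_ug)
```

Under these assumptions, the first exchange would leave roughly one part in ten thousand of the original small-molecule content, and the second would leave roughly one part in a thousand of the phosphate buffer.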

Sample preparation for mass spectrometry
To 50 µL of filter retentate (40 µg of protein) was added a solution of DTT (280 nmol) in 50 mM aqueous ammonium bicarbonate (5 µL). The mixture was incubated at 56 °C for 1 h. After the mixture had cooled to ambient temperature, a solution of iodoacetamide (630 nmol) in 50 mM aqueous ammonium bicarbonate (5 µL) was added. The mixture was left to stand at ambient temperature, in the dark, for 1 h. Cold acetone (240 µL) was added, and the mixture was left to stand at −20 °C for 16 h. The resulting suspension was centrifuged (8000 × g, 10 min, 4 °C) and the supernatant was removed. The pellet was allowed to air-dry for 10 min, before being dissolved in 39 µL of 50 mM aqueous ammonium bicarbonate containing 0.1% (w/v) RapiGest SF surfactant. A solution of trypsin (0.8 µg) in 50 mM aqueous ammonium bicarbonate (1 µL) was added, and the mixture was incubated at 37 °C for 20 h. Ten microlitres of a 5% (v/v) aqueous solution of trifluoroacetic acid were added to precipitate the surfactant, and the mixture was returned to the incubator for a further 45 min. The resulting suspension was centrifuged (15,000 × g, 7 min, 20 °C). A portion of the supernatant was diluted 40-fold with a 0.1% (v/v) aqueous solution of formic acid, and a portion of this diluted material was transferred to an autosampler vial in preparation for nano liquid chromatography and mass spectrometry.
The mass spectrometer was operated in positive ion mode. The spray voltage and the ion transfer tube temperature were set at 2300 V and 300 °C, respectively. Mass spectra were acquired in the 'top speed' data-dependent mode. Master scans were done using the Orbitrap (resolution = 120,000, scan range = m/z 400–1500, automatic gain control target = 4 × 10⁵) and data were acquired as the profile type. The most intense precursor ions of charge number two to seven were selected for fragmentation. Precursor ions were isolated in the quadrupole (isolation window = 1.6 m/z units) and fragmented by collision-induced dissociation (normalised collision energy = 35%). Fragment ions were analysed in the linear ion trap (automatic gain control target = 1 × 10⁴) and data were acquired as the centroid type. Dynamic exclusion was used to exclude ions, including isotopologues, for which spectra had already been acquired ('if occurs within' = 30 s, exclusion duration = 60 s).

Enumeration of tryptic peptides
The sequence of mature BSA was digested in silico using PeptideMass [1]. The parameters were equivalent to those used for the corresponding Andromeda search. Methionine oxidation was added as a variable modification in R (version 3.6.0) [2]. The resulting table of peptides was filtered (length ≥ seven amino acid residues, molecular weight ≤ 4600 Da).
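The enumeration and filtering steps can be illustrated with a short Python sketch. This is not the PeptideMass or R workflow used in the study. It assumes cleavage after K or R except before P, zero missed cleavages by default, and monoisotopic masses, none of which is specified in the text. All function names are hypothetical and the test sequence is arbitrary.

```python
import re

# Monoisotopic residue masses (Da)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
OXIDATION = 15.994915  # one added oxygen atom


def tryptic_peptides(sequence, missed_cleavages=0):
    """Cleave after K/R unless followed by P (assumed rule)."""
    pieces = re.findall(r".*?[KR](?!P)|.+$", sequence)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


def peptide_mass(peptide, oxidised_met=0):
    """Monoisotopic mass of the neutral peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER + oxidised_met * OXIDATION


def enumerate_peptides(sequence, missed_cleavages=0, min_length=7, max_mass=4600.0):
    """Table of (sequence, oxidised Met count, mass) rows after filtering."""
    rows = []
    for peptide in tryptic_peptides(sequence, missed_cleavages):
        if len(peptide) < min_length:
            continue
        # Methionine oxidation as a variable modification: 0..n oxidised Met
        for n_ox in range(peptide.count("M") + 1):
            mass = peptide_mass(peptide, n_ox)
            if mass <= max_mass:
                rows.append((peptide, n_ox, round(mass, 4)))
    return rows


for row in enumerate_peptides("MAGICPEPTIDEKSHRTK"):
    print(row)
```

Only the first fragment of the arbitrary test sequence is long enough to pass the length filter, and it appears twice in the table: once unmodified and once with oxidised methionine.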

Calculation of maximum peptide mass
For each protein of interest, a table of theoretical peptides was prepared as described for BSA. Peptides of length ≥ seven amino acid residues were ordered by mass, and the 95th percentile was computed. The maximum peptide mass was either the 95th percentile or Andromeda's default value (4600 Da), whichever was the greater.
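The rule above can be written as a few lines of Python. The text does not say which percentile definition was used. This sketch assumes linear interpolation between order statistics (the default of R's quantile function), and the function names are hypothetical.

```python
import math

ANDROMEDA_DEFAULT_MAX_MASS = 4600.0  # Da, default value quoted in the text


def percentile(values, p):
    """Percentile by linear interpolation between order statistics
    (assumed definition; the text does not specify one)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no peptide masses supplied")
    rank = (len(ordered) - 1) * p / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (rank - lower) * (ordered[upper] - ordered[lower])


def maximum_peptide_mass(masses):
    """Greater of the 95th percentile and Andromeda's default.

    `masses` should already be restricted to peptides of length >= 7.
    """
    return max(percentile(masses, 95), ANDROMEDA_DEFAULT_MAX_MASS)
```

For a protein whose theoretical peptides are mostly small, the default of 4600 Da is retained. The limit rises only when the 95th percentile exceeds it.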

Contaminant databases
Where necessary, MaxQuant's database of potential contaminants (contaminants.fasta) [3] was edited in R (version 3.4.0 or later) using the 'seqinR' package [4]. For the BSA study, all potential contaminants except BSA were included (244 proteins). For the rat cofilin-1 study, all potential contaminants were included (245 proteins). For the human plasma-protein study, 82 selected proteins were included (streptavidin, porcine trypsinogen; and 80 human proteins including keratins, dermokines, filaggrin and hornerin).
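Editing a contaminant database amounts to filtering FASTA records by header. The study did this in R with 'seqinR'. The Python sketch below only illustrates the idea, and the record identifiers, header text and keyword are invented for the example rather than taken from contaminants.fasta.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def drop_matching(records, keywords):
    """Remove records whose header contains any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [(h, s) for h, s in records if not any(k in h.lower() for k in lowered)]


def to_fasta(records, width=60):
    """Serialise records back to FASTA text with wrapped sequence lines."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines) + "\n"


# Hypothetical two-entry database; headers are illustrative only
EXAMPLE = """>EX1 hypothetical serum albumin, bovine
DTHK
SEIAHR
>EX2 hypothetical keratin
MSRQ
"""

kept = drop_matching(read_fasta(EXAMPLE), ["serum albumin"])
print(to_fasta(kept))
```

The same pattern, with an inclusion list instead of an exclusion list, would give a reduced database such as the 82-protein set described for the plasma-protein study.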

Comparison of dependent-peptide and variable-modification search results
A new modification, 'Hydrolysed NESyl (CDEHKNQRSTY)', was configured in Andromeda [3,5]. The composition of the modification was 'H(9) O(3) C(6) N', the position was 'anywhere' and the type was 'standard'. The specificities were amino acids with known or putative side-chain reactivity towards NEM (known according to Brewer and Riehm [6], or putative according to […]

[…] were added to link annotations to data; borders were removed; line widths and styles were changed; tick-mark lengths were changed; an x-axis title was added. All changes were made in Inkscape (version 0.91 or later). The final image was processed as described for Figures S2 and S4-15.

Figures 1, 3 and S3. These figures combine results for multiple data sets or modes of visualisation. Relevant raw graphics were combined in Inkscape. The following changes were made: panels were lettered (Figures 1 and 3); signs were added to annotations (Figures 1B, 1D, 1F and 3B); selected annotations were removed for clarity (Figures 1B, 1D, 1F and 3B); line colours were changed (Figure S3); histograms were rotated and labelled (Figure S3); other formatting was applied as described for Figure 2. Final images were processed as described for Figures S2 and S4-15.

−500.5 Da ≤ 'DP Mass Difference' < +500.5 Da
Shapes the Δm distribution for compatibility with the R function hist [2]. The interval includes all values of Δm that round to an integer in the range −500 to +500 (boundary values rounded to the more positive integer). These integer values will become the midpoints of the frequency histogram's cells.

'DP Score' > 80
Limits the number of incorrect identifications. The 'DP Score' is a measure of how well the fragment-ion spectrum of a potential DP matches the corresponding theoretical spectrum [7]. The threshold score of 80 is from the method of Lassak et al. [8].

'DP PEP' < 0.01
Limits the number of incorrect identifications. The 'DP PEP' is assumed to be a posterior error probability akin to the one described by Tyanova et al. [3].

'DP Proteins' contains the identifier of the protein of interest (e.g., '4F5S')
Excludes features attributed unambiguously to contaminants.

'DP Modification' ≠ 'Carbamidomethyl'
Excludes DPs with the 'Carbamidomethyl' modification, on the basis that carbamidomethylation probably occurred during sample preparation.

'DP Modification' does not contain 'Cation'
Excludes metal cation adducts, on the basis that they were probably formed during sample preparation or analysis.

'DP Modification' ≠ 'Loss of ammonia'
Excludes DPs with the modification 'Loss of ammonia', on the basis that ammonia was probably lost during mass spectrometry.

'DP Modification' ≠ 'Loss of water'
Excludes DPs with the modification 'Loss of water', on the basis that water was probably lost during mass spectrometry.

'DP Peptide Length Difference' ≥ 0
Excludes truncated peptides.

Sequence coverage (all scripts)
A non-redundant list of observed peptide sequences is compiled from all.peptides or a variant thereof. Wherever possible, sequences from analyses of untreated protein are used, because these should provide the best sequence coverage (chemical treatments have the potential to 'delete' segments of the sequence). If two analyses of untreated protein are available, then only sequences that appear in both analyses are mapped. Solid 'peptides' (line segments) are mapped onto a dashed 'backbone'.

Mapping of DPs' Δm values to protein segments (Scripts I, II and III)
Each dependent peptide is mapped onto the backbone as a rectangle with a partially-transparent border. The sliding window passes a set of coordinates to a matrix (= vertices), which is used by polygon to draw the rectangle. The x-coordinates are backbone loci. The y-coordinates are 0, 0, Δm and Δm.
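The rectangle construction described for the mapping of Δm values can be sketched as follows. The published scripts are written in R and use polygon. This Python version only shows how the four vertices might be derived, assuming 1-based residue positions and an exact match between the DP's base sequence and the protein sequence. The function name, the example sequence and the example Δm are illustrative.

```python
def rectangle_vertices(protein_sequence, base_sequence, delta_mass):
    """Slide a window along the protein and return one rectangle
    (four (x, y) vertices) for every exact match of the base sequence.

    x-coordinates are 1-based backbone positions (assumed convention);
    y-coordinates are 0, 0, delta_mass and delta_mass.
    """
    width = len(base_sequence)
    rectangles = []
    for start in range(len(protein_sequence) - width + 1):
        if protein_sequence[start:start + width] == base_sequence:
            first, last = start + 1, start + width
            rectangles.append([
                (first, 0.0),
                (last, 0.0),
                (last, delta_mass),
                (first, delta_mass),
            ])
    return rectangles


# Arbitrary example: a four-residue match at positions 3-6
print(rectangle_vertices("MKWVTFISLLK", "WVTF", 125.05))
```

Each list of vertices could then be handed to any polygon-drawing routine, so that the height of the rectangle encodes Δm and its horizontal extent encodes the protein segment.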

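The filtering criteria given earlier for dependent peptides, together with the rounding rule for histogram cells, can be expressed compactly. In the Python sketch below, the column names are those quoted in the text, whereas the representation of each DP as a dictionary and the function names are assumptions made for illustration.

```python
import math

EXCLUDED_MODIFICATIONS = {"Carbamidomethyl", "Loss of ammonia", "Loss of water"}


def passes_filters(dp, protein_id, min_score=80.0, max_pep=0.01):
    """Apply the listed criteria to one DP record (a dict keyed by column name)."""
    modification = dp.get("DP Modification") or ""
    return (
        -500.5 <= dp["DP Mass Difference"] < 500.5
        and dp["DP Score"] > min_score
        and dp["DP PEP"] < max_pep
        and protein_id in dp["DP Proteins"]
        and modification not in EXCLUDED_MODIFICATIONS
        and "Cation" not in modification
        and dp["DP Peptide Length Difference"] >= 0
    )


def histogram_cell(delta_mass):
    """Integer midpoint of the histogram cell for a given mass difference.
    Boundary values (x.5) go to the more positive integer."""
    return math.floor(delta_mass + 0.5)


# Illustrative record; the values are invented
example = {
    "DP Mass Difference": 125.0477,
    "DP Score": 120.0,
    "DP PEP": 1e-5,
    "DP Proteins": "4F5S",
    "DP Modification": "",
    "DP Peptide Length Difference": 0,
}
print(passes_filters(example, "4F5S"), histogram_cell(example["DP Mass Difference"]))
```

Because the accepted interval is closed at −500.5 and open at +500.5, every retained value falls into a cell whose midpoint lies between −500 and +500 inclusive.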
Axis limits (all scripts)
The x-axis of the localisation plot auto-adjusts to accommodate the protein.sequence. The y-axis of the localisation plot has fixed limits (delta.mass.axis.limits), and the x-axis of the frequency histogram has these same fixed limits. The y-axis of the frequency histogram auto-adjusts to accommodate the tallest cell, plus an annotation.

To ensure that each DP is allocated its own discrete strip, elements within or adjacent to already-allocated strips are 'reserved' in a separate matrix (= reservations). The script keeps attempting to allocate a strip until it finds one that is fully 'unreserved'.
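The reservation idea can be illustrated with a simplified Python sketch. It is not a port of the R scripts: it assumes that rows are tried in order from the first to the last, that a DP is described by 1-based start and end positions, and that at most five rows exist. The function name and the example coordinates are hypothetical.

```python
def allocate_strips(dps, sequence_length, max_rows=5):
    """Assign each DP (start, end) to the first row in which its whole span
    is unreserved, then reserve the span plus one position either side so
    that overlapping or contiguous DPs end up on different rows.

    Returns a list of row indices (0-based), or None where no row was free.
    """
    # One boolean per position, padded at both ends for the 'adjacent' cells
    reservations = [[False] * (sequence_length + 2) for _ in range(max_rows)]
    allocation = []
    for start, end in dps:
        chosen = None
        for row in range(max_rows):
            if not any(reservations[row][start:end + 1]):
                chosen = row
                for position in range(start - 1, end + 2):
                    reservations[row][position] = True
                break
        allocation.append(chosen)
    return allocation


# The third DP is contiguous with the first, so it cannot share its row
print(allocate_strips([(1, 10), (5, 15), (11, 20), (12, 18)], 30))
```

A return value of None in this sketch marks a DP that could not be placed, which is the situation in which a plot would silently omit data unless the underlying table were checked.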
An image is prepared from protein.probabilities. A second image, prepared from transparency, is overlaid onto the first. In the resulting composite image, potential sites of modification (grey squares) will be visible within each DP (partially transparent strip). As with the other scripts, solid 'peptides' (line segments) are mapped onto a dashed 'backbone'. All potential sites of modification are annotated (one-letter amino acid symbol and number).
The image will normally need to be displayed across multiple panels. The number of panels (= panel.count) is calculated based on the number of characters in the protein.sequence (100 amino acid residues per panel). The different panels are effectively different views of the same image.
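The panel calculation is a simple ceiling division. The Python sketch below assumes that a partly filled final panel is still drawn (that is, that the count is rounded up), which the text implies but does not state. The function names are hypothetical.

```python
import math


def panel_count(protein_sequence, residues_per_panel=100):
    """Number of panels needed to show the whole sequence (rounded up)."""
    return math.ceil(len(protein_sequence) / residues_per_panel)


def panel_ranges(protein_sequence, residues_per_panel=100):
    """1-based (first, last) residue positions covered by each panel."""
    length = len(protein_sequence)
    return [
        (first, min(first + residues_per_panel - 1, length))
        for first in range(1, length + 1, residues_per_panel)
    ]


sequence = "A" * 250  # placeholder sequence of arbitrary length
print(panel_count(sequence), panel_ranges(sequence))
```

Each range corresponds to one view of the same underlying image, so changing the number of residues per panel alters only how the image is sliced.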
Script flexibility/limitations (Scripts I, II and III)
In theory, there is no limit to the number of DPs that Scripts I, II and III can visualise. In practice, the highest number for which visualisation was attempted was 584. In this case, visualisation of the 584th DP was confirmed.

Script flexibility/limitations (Scripts IV and V)
Scripts IV and V arrange DPs in rows, up to a maximum of five rows. Overlapping or contiguous DPs are placed on different rows, which can lead to rows being skipped. Consequently, the number of available rows in a given region of the plot may be fewer than five. If any DPs appear on the bottom row, we would advise checking the relevant data frame (treated_constantly.conjoined.DPs or dependent.peptides) to make sure that all DPs have been visualised.