Assessment and Prediction of Human Proteotypic Peptide Stability for Proteomics Quantification

Mass spectrometry coupled to liquid chromatography is one of the most powerful technologies for proteome quantification in biomedical samples. In peptide-centric workflows, protein mixtures are enzymatically digested to peptides prior their analysis. However, proteome-wide quantification studies rarely identify all potential peptides for any given protein, and targeted proteomics experiments focus on a set of peptides for the proteins of interest. Consequently, proteomics relies on the use of a limited subset of all possible peptides as proxies for protein quantitation. In this work, we evaluated the stability of the human proteotypic peptides during 21 days and trained a deep learning model to predict peptide stability directly from tryptic sequences, which together constitute a resource of broad interest to prioritize and select peptides in proteome quantification experiments.


Experimental peptide stability assessment
The human proteotypic peptide set from the ProteomeTools collection 1 was build using information from prior mass-spectrometric evidence available in ProteomicsDB, 1,2 resulting in a peptide collection that contained 124,875 peptides covering 15,855 human Uniprot/SwissProt annotated genes.These synthetic peptides were mixed in pools of approximately 1000 peptides using an estimated amount of 5 pmol of each peptide plus five quality control peptides that were synthesized along with every peptide pool.Moreover, each pool was spiked with two retention time (RT) standards, 5 pmol from an extended PROCAL (JPT iRT) 3 standard mix consisting of 66 peptides with non-naturally occurring peptides, and 2.5 pmol from a set of 15 13 C, 15 N-labelled peptide standards (Pierce, Thermo Fisher Scientific) (Table ST1B).Each peptide pool was dissolved in water + 0.1% formic acid and kept at 4 ºC in polypropylene vials (C4011-13 11mm PP VIAL CRIMP/SNAP 250 µL, Thermo Fisher Scientific) during the entire experiment.Vial caps were replaced for new ones after every injection (CC-C4011-540OR Cap, Snap, 11mm, Orange, PTFE/Silicone, 0.04", Thermo Fisher Scientific).Each pool was analyzed once every 3.5 days, with a total of 6 time points within a period of 21 days.
Samples were analyzed by shotgun data-dependent acquisition LC-MSMS with an EASY-nLC 300 (Proxeon) and a LTQ-Orbitrap XL (Thermo Fisher Scientific) mass spectrometer.Peptides pools were loaded onto the 2-cm Nano Trap column with an inner diameter of 100 μm packed with C18 particles of 5 μm particle size (Thermo Fisher Scientific) and were separated by reversed-phase chromatography using a 12-cm column with an inner diameter of 75 μm, packed with 3 μm C18 particles (Nikkyo Technos Co., Ltd.Japan).Chromatographic gradients started at 97% buffer A and 3% buffer B with a flow rate of 300 nl/min for 4 minutes and gradually increased to 35% buffer B and 65% buffer A for 30 min.The column was washed with 10 % buffer A and 90% buffer B for 10 min after each analysis.Buffer A: 0.1% formic acid in water.Buffer B: 0.1% formic acid in acetonitrile.
The mass spectrometer was operated in positive ionization mode with nanospray voltage set at 2.25 kV and source temperature at 200°C.Ultramark 1621 for the was used for external calibration of the FT mass analyzer prior the analyses, and an internal calibration was performed using the background polysiloxane ion signal at m/z 445.1200.The acquisition was performed in data-dependent acquisition (DDA) mode and full MS scans with 1 micro scans at resolution of 60,000 were used over a mass range of m/z 350-1500 with detection in the Orbitrap mass analyzer.Auto gain control (AGC) was set to 2E5, dynamic exclusion (60 seconds) and charge state filtering disqualifying singly charged peptides was activated.In each cycle of DDA analysis, following each survey scan, the top ten most intense ions with multiple charged ions above a threshold ion count of 5000 were selected for fragmentation.Fragment ion spectra were produced via collision-induced dissociation (CID) at normalized collision energy of 35% and they were acquired in the ion trap mass analyzer.For MS2 spectra AGC was set to 1E4, isolation window of 2.0 m/z, activation time of 30 ms and maximum injection time of 100 ms was used.All data were acquired with Xcalibur software v2.1.
Acquired data were analyzed using MaxQuant v1.6.0, 4,5with match between runs among the six independent injections of each peptide pool.A custom fasta file was created for containing the all the peptide sequences within the human proteotypic peptide set plus the retention time and quality control reference peptides aforementioned.For peptide identification a precursor ion mass tolerance of 20 ppm was used for the first search and 4.5 ppm was used for the main search was used for MS1 level.Trypsin was chosen as enzyme and up to two missed cleavages were allowed.The fragment ion mass tolerance was set to 0.5 Da for MS2 spectra.Oxidation of methionine was used as variable modifications whereas carbamidomethylation on cysteines was set as a fixed modification.False discovery rate (FDR) in peptide identification was set to a maximum of 5%.Peptides were identified at FDR<5%, and peptide areas were extracted for peptide relative quantification.
The mass spectrometry proteomics data have been deposited at the ProteomeXChange Consortium via the PRIDE repository with identifier PXD035198. 6

Peptide stability clustering model with TensorFlow
Peptide intensities were log-transformed within each pool.Profiles with less than three quantitative points were discarded, and the remaining missing values in each profile were imputed as the average of their neighboring temporal quantitative values.The intensity values of each peptide were centered on the intensity of day 0, which was assigned as the starting intensity for each temporal profile.Duplicated peptide sequences were removed by prioritizing profiles with a smaller number of missing values, more MS/MS identification evidence, and higher intensity values.We used the elbow in the k-means algorithm to estimate the number of clusters, where 2 and 3 are shown to be optimal, 3 clusters are chosen in this article. 7The full experimental data defining the profile of each peptide was fed into a Deep Embedded Clustering (DEC) model. 8We implemented the architecture following the same network architecture of the initial paper, the dimension of the input data is the input size.The encoder and decoder are sequential models, all linear layers except the last ones used sigmoid activation.All layers are densely connected.The initialization of the encoder and decoder's weights was done using a uniform distribution.Given that input is the input data dimension, the autoencoder network dimensions were set to (input, 10,10,6), the model used sigmoid as an activation function, Adam optimizer, 9 and a learning rate of 0.0001.The encoder layers are then connected to the clustering layer, 10 it computes a soft assignment between the embedded points and the cluster centroids, and then assigns an estimated class to each of the data samples in such a way that it can be refined iteratively.In the DEC training, Stochastic Gradient Descent (SGD) 11 is used to simultaneously optimize the cluster centers and the encoder parameters (feature representations) with a momentum of 0.4, while minimizing the mean square error loss between the DEC output and the auxiliary target distribution.The deep autoencoder is finetuned manually for 5 epochs.The batch size is set to 32, with a constant learning rate of 0.0001.We used k-means with 20 restarts, to initialize centroids in the same way described in the original study.The training is stopped when less than the tolerance % of points change cluster assignment between two consecutive epochs.The convergence threshold is set to tolerance = 0.0001.Furthermore, the pre-trained autoencoder and K-means model are used to define a new model that takes the preprocessed dataset as input and outputs both the predicted clustering classes and decoded input data records.
The implementation was done using Python TensorFlow, Keras, and Scikit-learn.

Model Architecture
Given a peptide sequence, the amino acids are replaced with predefined unique numbers ranging from 1 to 21.
Since the peptides are of different lengths, if the peptide length is less than the maximum value, the sequence is padded by adding zeros to the vectors.An embedding layer that maps the amino acids to dense vectors of 128 dimensions was then implemented.The peptide stability prediction model contains two Bidirectional Gated Recurrent Unit (Bi-GRU) 12 layers and an attention layer based on the Hierarchical attention networks, 13 all with dropouts and a spacial dropout. 14The recurrent layers were used to extract important position information from peptide sequences, with a TanH 15 activation function, a dimension of 126, and L2 regularization. 16The attention layer is used to calculate dynamic adaptive weights based on probability distribution.Finally, a sigmoid activation function was implemented into a dense layer.We used a binary cross-entropy loss, an Adam optimizer, and a binary accuracy to compile the model.Implementation was done with Python, Keras, and TensorFlow, prediction model is also available at https://github.com/proteomicsunitcrg/peptide-stability.
Table ST1: A) List of human proteotypic peptides from the ProteomeTools collection measured in this study; B) List of quality control peptides included in each peptide pool of the ProteomeTools collection.Table ST2: Identification and quantification of peptide stability profiles within 21 days (3.5 days per time point).

Figure S1 :
Figure S1: A) Heatmap representing the number of peptides identified in each pool; B) Number of time points with valid identifications; C) Peptide areas distribution before and after median normalization; D) Peptide abundance profiles along time (T1-T6) for protein MICAL3; E) Peptide abundance profiles along time (T1-T6) for protein LMF2; F-G) Number of amino acid containing peptides that are stable, unstable or noisy, either as total counts (F) or relative frequencies (G).

Figure S1 :
Figure S1: A) Heatmap representing the number of peptides identified in each pool; B) Number of time points with valid identifications; C) Peptide areas distribution before and after median normalization; D) Peptide abundance profiles along time (T1-T6) for protein MICAL3; E) Peptide abundance profiles along time (T1-T6) for protein LMF2; F-G) Number of amino acid containing peptides that are stable, unstable or noisy, either as total counts (F) or relative frequencies (G).

Scheme 1 :
Scheme 1: DEC with K-means architecture, the model consists of 3 decode layers, 3 encoder layers, a clustering layer, and K-means clustering.