TopFD: A Proteoform Feature Detection Tool for Top–Down Proteomics

Top-down liquid chromatography-mass spectrometry (LC-MS) analyzes intact proteoforms and generates mass spectra containing peaks of proteoforms with various isotopic compositions, charge states, and retention times. An essential step in top-down MS data analysis is proteoform feature detection, which aims to group these peaks into peak sets (features), each containing all peaks of a proteoform. Accurate protein feature detection enhances the accuracy in MS-based proteoform identification and quantification. Here, we present TopFD, a software tool for top-down MS feature detection that integrates algorithms for proteoform feature detection, feature boundary refinement, and machine learning models for proteoform feature evaluation. We performed extensive benchmarking of TopFD, ProMex, FlashDeconv, and Xtract using seven top-down MS data sets and demonstrated that TopFD outperforms other tools in feature accuracy, reproducibility, and feature abundance reproducibility.


S1. Data Sets
Seven top-down MS data sets were used in the experiments. The first two data sets were generated from SW480 and SW620 colorectal cancer (CRC) cells 1 and are available at MassIVE (ID: MSV000090488). Proteoforms extracted from the sample were analyzed using a Thermo Q-Exactive HF mass spectrometer coupled with a 105-minute nanoRPLC separation system. The top 5 precursor ions in each MS1 spectrum were selected from an isolation window of 4 m/z for MS/MS analysis using higher-energy collisional dissociation (HCD). Both MS1 and MS/MS spectra were acquired at a resolution of 120,000 (at 200 m/z). Three technical replicates were obtained for each cell line.
The third data set was generated from ovarian cancer (OC) samples 2 and downloaded from MassIVE (ID: MSV000080257). In the experiment, five OC patient samples were pooled, and the extracted proteoforms were analyzed using a Thermo Velos Orbitrap Elite mass spectrometer coupled with a 180-minute LC separation system. The top 4 precursor ions in each MS1 spectrum were selected separately with an isolation window of 10 m/z for MS analysis using collisioninduced dissociation (CID). MS1 and MS/MS spectra were acquired at a resolution of 240,000 and 120,000 (at 400 m/z), respectively. A total of 10 MS experiment replicates were obtained.
The fourth and fifth data sets were generated from two patient-derived mouse xenografts derived from basal-like and luminal-B human breast cancer samples 3 , which were downloaded from the CPTAC data portal (https://cptac-data-portal.georgetown.edu/study-summary/S028).
The GELFrEE method 4 was performed for each sample to obtain a fraction containing proteoforms of size up to 30 kDa. Subsequently, six technical replicates were generated for each sample. The samples were analyzed using a Thermo Orbitrap Elite mass spectrometer coupled with an LC system with 90-minute separation. The top two precursor ions in each MS1 spectrum were selected from an isolation window of 15 m/z for MS/MS analysis using HCD. MS1 and MS/MS spectra were acquired at a resolution of 120,000 and 60,000 (at 400 m/z), respectively.
The sixth data set was generated from two human semen samples 5 (PRIDE repository ID: PXD024405). Protamine proteoforms were extracted from the samples and analyzed using a 60minute HPLC separation system coupled with a Thermo Orbitrap Fusion Lumos mass spectrometer. The most intense precursor ions in each MS1 spectrum were selected from an isolation window of 0.7 m/z for MS/MS analysis using electron transfer dissociation (ETD) with a cycle time of 3 seconds. Both MS1 and MS/MS spectra were acquired at a resolution of 120,000 (at 200 m/z). Two technical replicates were acquired for each of the two samples.

S4
The seventh data set was generated using a mixture of five proteins containing bovine carbonic anhydrase (Sigma C2624), equine myoglobin (Sigma M5696), bovine trypsinogen (Sigma T1143), bovine ubiquitin (Sigma U6253), and bovine superoxide dismutase, in which bovine superoxide dismutase was present as a contaminant in bovine carbonic anhydrase 6 . The samples were analyzed using a 50-minute LC separation system coupled with a Thermo Velos Orbitrap Elite mass spectrometer. The top precursor ion in each MS1 spectrum was selected from an isolation window of 15 m/z for MS/MS analysis using HCD. The MS1 and MS/MS spectra were collected at a resolution of 120,000 and 60,000 (at 400 m/z), respectively.

S2.1 Data preprocessing
MsConvert 7 was used to convert raw files into centroided mzML files. Two methods were used to filter out noise peaks in MS1 spectra to speed up proteoform feature detection. The first filtering method was based on peak intensities, in which peaks with intensity lower than a cutoff intensity were removed because most of them do not provide valuable information for feature detection.
To obtain the cutoff intensity, a histogram of the intensities of all MS1 peaks in the data file was generated, and the noise intensity level, denoted by h, was set to the middle value of the bin with the highest frequency 8 , and the cutoff intensity was set to 3h. The second filtering method was based on the number of consecutive spectra in which a peak is observed. Peaks that appear in only one MS1 spectrum, not several consecutive MS1 spectra, tend to be noise ones. Therefore, a peak in an MS1 spectrum was removed if it was not observed in its neighboring MS1 scans within an m/z error tolerance of 0.01.

S2.2 Seed envelope identification
We obtained isotopic envelopes of proteoforms from single spectra and then used them as seeds to find envelope sets and envelope collections. Experimental isotopic envelopes in single MS1 spectra were identified based on the methods in MS-Deconv 8, 9 with eight steps. (1) A peak in the spectrum is selected as the base peak of the envelope. (2) A theoretical isotopic distribution is computed using the Averagine model 10 with a given charge state so that the m/z value of the highest intensity peak in the envelope equals the m/z value of the base peak. theoretical intensities are lower than a cutoff intensity, which is set to the intensity of 3h. (7) After all candidate envelopes are generated from the spectrum, a dynamic programming method is used to report a group of theoretical and experimental envelope pairs that fit the spectrum. (8) The envelope pairs are further filtered using a cutoff value (0.5 in the experiments) for the Pearson correlation coefficient (PCC) between the peak intensities of theoretical and experimental envelopes.

S3. Feature detection in TopFD
We ranked the experimental and theoretical envelope pairs reported from all MS1 spectra in an LC-MS run in the decreasing order of the total peak intensity, which is the sum of the peak intensities of the theoretical envelope. The theoretical envelope with the highest intensity was selected as the first seed envelope, which was then extended to neighboring scans to obtain an envelope set. Theoretical envelopes were used for the extension because they tend to have fewer errors in m/z values and peak intensities than experimental envelopes.

S3.1 Extending a seed envelope to an envelope set
To obtain an envelope set, a seed envelope E of a proteoform was matched to experimental peaks in its neighboring spectra to extend the RT range of the proteoform. Let S 1 , … , S i-1 , S i , S i+1 , …, S n be all MS1 spectra in the increasing order of RT, in which the seed spectrum S i contained the seed envelope E. We first checked if the spectrum S i-1 contained a matched experimental envelope of E. The isotopic peaks in E were matched to the experimental peaks in the spectrum to obtain an experimental envelope with an m/z error tolerance of 0.008. If two or more experimental peaks were matched to one theoretical peak, the one with the highest intensity was selected. Peaks in E were scaled to fit the peak intensities of the experimental peaks using the method in Section S2.2. The scaled peaks in E with an intensity lower than the cutoff intensity of 3h (see Section S2.1) were removed from the envelope along with the corresponding matched experimental peaks.
An experimental envelope was matched to the theoretical one if at least two of the three highest theoretical peaks matched experimental peaks. We searched for matched experimental envelopes in the neighboring spectra S i-1 , … , S 1 until we found two continuous spectra without a matched experimental envelope. The extension was also performed for the other direction in the neighboring spectra S i+1 , … , S n .
The RTs of the first and last spectra reported by the extension method are called the initial start and end RTs of the proteoform, respectively. If a spectrum in the initial RT range contains a matched envelope, the corresponding trace intensity value is the sum of the intensities of the top S6 three highest scaled theoretical peaks and 0 otherwise. The trace intensities of all MS1 spectra in the initial RT range are called the extracted envelope chromatogram (XEC) of the seed envelope.

S3.2 Adjusting RT boundaries
XECs of envelope sets were smoothed using a moving average filter with a window size of 2.
Let t c be the RT of a seed scan and t s be the start RT of an envelope set. To adjust the start RT, we found all local minima in the XEC between t s and t c and ranked them in increasing order of intensity. Let t min be the RT with the lowest XEC value i min . If there was a local maximum with RT t max and trace intensity i max such that i max > 2.5i min and t min was between t max and t c (t s < t max < t min < t c ), then the start RT t s was set to t min (Supplementary Fig. S14). The process was repeated until all the local minima had been checked. The process was performed to adjust the start and end RTs of reported envelope sets. This allowed us to fix errors in RT boundaries when the extended envelope set contained peaks from two or more neighboring envelope sets. All matched experimental envelopes in the adjusted RT range were reported as an envelope set of the proteoform.

S3.3. Correcting charge states
Because of noise peaks, some seed envelopes reported from single spectra had an incorrect charge state. To correct charge states, we summed up peak signals from several scans in an envelope set to obtain a better signal-to-noise ratio of peaks. For each peak in a seed envelope, the corresponding aggregate envelope peak was obtained by summing up the intensities of matched experimental peaks across all spectra within the RT range of the envelope set.
We used aggregated envelopes to fix one common type of error in charge states, in which a charge state c is mistakenly reported as charge state 2c. This type of error is called a double charge error. The main reason for double charge errors is that some noise peaks are randomly matched to theoretical peaks with charge state 2c.
The peaks in the aggregated envelope were ranked in the increasing order of their m/z values.
The sums of even and odd index peaks were obtained for both theoretical and aggregate experimental envelopes. In an envelope with a double charge error, the peaks with odd or even indices are usually caused by noise peaks, which are characterized by their low intensities. Let of the seed envelope was halved, and the new seed envelope was used to obtain an envelope set.

S3.4 Extending an envelope set to an envelope collection
After an envelope set with charge state c was reported, an envelope collection was obtained by exploring the neighboring charge states to find isotopic envelopes with the same monoisotopic mass. To find an envelope set with charge state c-1, the theoretical envelope E c-1 for charge state c-1 was obtained using the seed theoretical envelope of the envelope set with charge state c.
Next, we extended E c-1 to obtain the start and end RTs using the methods in the previous section.
If we failed to find at least two matched experimental peaks for the top three highest theoretical peaks in E c-1 in spectra S i-1 , S i , and S i+1 , then the envelope set for charge state c-1 was set to empty. We searched for envelope sets with charge states c-1, c-2, …,1 until two continuous empty envelope sets were found. Similarly, envelope sets were searched for charge states c+1, c+2 … until two continuous empty envelope sets were found. All identified non-empty envelope sets were added to the envelope collection.

S3.5 Removing envelope collections from experimental data
To identify overlapping peaks shared by multiple envelope collections, we scaled peaks in a seed envelope to fit the peak intensities of its matched experimental envelope (see Section S2.2).
If the intensity of an experimental peak was at least 4 times higher than that of the corresponding scaled theoretical peak, the peak was considered an overlapping one; otherwise, non-overlapping.
To remove an envelope collection, the intensity of an overlapping experimental peak was reduced by the intensity of its matched theoretical peak, and non-overlapping experimental peaks were removed directly.

S4.1 Refining monoisotopic masses of envelope collections
For an experimental peak p in an envelope collection, the m/z error between p and its matched theoretical peak is represented by e(p) and the intensity of its matched theoretical peak is represented by h(p). The weighted average m/z error of all peaks p in the envelope collection is , and the weighted average mass error of the envelope collection is the product of the ∑ ( )ℎ( ) ∑ ℎ( ) average m/z error and the charge state of the seed envelope. The refined monoisotopic mass of an envelope collection was obtained by subtracting the weighted average error mass from its original monoisotopic mass.

S8
Once envelope collection F was reported, we checked if it could be merged with another envelope collection. Two envelope collections were merged if (1)  was an error tolerance of 10 ppm and 1.00235 Da is an estimate of the mass difference of neighboring isotopic peaks in an envelope 11 , (2) their RT ranges overlap was more than 80% of F, and (3) their change states ranges were not separated by more than 2 charge states.

S4.3. The neural network model for ECScore
The neural network model for ECScore takes eight attributes of an envelope collection as the input (Supplementary Table S11). The neural network model consists of four hidden layers (200 neurons in each layer) and an output layer. The activation function is the Leaky Rectified Linear Unit with a negative slope coefficient of 0.05 for the hidden layers and the sigmoid function for the output layer. L1 kernel regularization is applied to hidden layers with a regularization factor of 1×10 -6 . The neural network model was implemented using TensorFlow (version 2.7.0). In model training, the loss function was binary cross-entropy and the Adam optimizer 12 with a learning rate of 1×10 -5 was used. The training process was stopped if the validation loss did not improve for 30 epochs, and the model with the smallest validation loss was reported. To deal with the class imbalance problem in training data, class weighting by the inverse class frequency was used.

S5. Determining artifact masses
For a mass x and a maximum shifted mass of 10 neutrons, the set X of shifted and unshifted masses of x consists of 21 masses x+1.00235d for d=-10, -9, …,10, where 1.00235 Da is an estimated mass difference between two isotopologues introduced by a neutron 11 . A mass y is an isotopologue of mass x if y matches a mass in X with an error tolerance of 10 ppm. And y is a low Supplementary Figure S14. An illustration of adjusting the RT boundaries of an envelope set. The XEC of the envelope set contains peaks from two neighboring envelope sets. The XEC is smoothed by employing a moving average filter with a window of 2. t C is the RT of the spectrum with the seed envelope, and t S is the original start RT of the envelope set. A local minimum is located at t min and a local maximum is located at t max . The intensity ratio of XEC at t max and t min is greater than 2.5, so the start RT is adjusted to t min . Supplementary Table S1. Summary of bottom-up feature detection tools Tool Description msInspect 13 msInspect identifies candidate peaks for feature detection by locating local maxima in each scan. Peaks that elute over several spectra in an LC-MS map are extracted. Using the peptide isotopic distribution, the co-eluting peaks are grouped to report a peptide feature. Kullback-Leibler divergence is used to evaluate the similarity between observed and experimental isotopic distributions.

Supplementary Tables
centWave 14 centWave uses each peak in experimental data as a candidate seed peak. If a peak in a spectrum is observed in neighboring spectra within a certain m/z range, these peaks are grouped to obtain a mass trace. A continuous wavelet transform is applied to each mass trace to determine its retention time boundaries.

MaxQuant 15
In MaxQuant, candidate seed peaks are obtained by finding local intensity maxima in each spectrum. A Gaussian distribution is fitted to the seed peak and other peaks with similar m/z values in the scan to obtain an m/z-intensity curve. Afterward, these m/z-intensity curves are connected in neighboring scans based on their central positions to get the 3-dimensional retention time profile of the feature. Finally, deconvolution is performed based on 3dimensional retention time profiles extracted from the LC-MS map.
Dinosaur 16 Dinosaur collects centroided peaks with similar m/z values in consecutive spectra to build mass traces. Subsequently, deconvolution is performed to obtain peptide features.

DeepIso 17
DeepIso is comprised of two deep-learning-based modules. The first module scans the experimental LC-MS map along the retention time axis and determines the charge state and retention time range of a feature. The second module scans the experimental LC-MS map along the m/z axis to group peaks that belong to the same isotopic distribution to report peptide features.

MSTracer 18
MSTracer extracts mass traces and then groups mass traces whose local maxima are located at similar RT and whose intensities match the isotopic distribution of a peptide. A support vector regression model is employed to evaluate overlapping mass trace groups and report the best-scoring one. A deep-learning model is used to report the quality score for each peptide feature. S17