Cumulative Neutral Loss Model for Fragment Deconvolution in Electrospray Ionization High-Resolution Mass Spectrometry Data

Clean high-resolution mass spectra (HRMS) are essential for the successful structural elucidation of an unknown feature in nontarget analysis (NTA) workflows. Spectral cleanup is particularly crucial for spectra generated during data-independent acquisition or direct infusion experiments. The most commonly available tools use only the time domain for spectral cleanup. Here, we present an algorithm that combines time domain and mass domain information to perform spectral deconvolution. The algorithm employs a probability-based cumulative neutral loss (CNL) model for fragment deconvolution. The optimized model, with a mass tolerance of 0.005 Da and a scoreCNL threshold of 0.00, achieved a true positive rate (TPr) of 95.0%, a false discovery rate (FDr) of 20.6%, and a reduction rate of 35.4%. Additionally, the CNL model was extensively tested on real samples containing predominantly pesticides at different concentration levels and with matrix effects. Overall, the model obtained a TPr above 88.8%, with FDr values between 33 and 79% and reduction rates between 9 and 45%. Finally, the CNL model was compared with the retention time difference method and peak shape correlation analysis. A combination of correlation analysis and the CNL model was the most effective for fragment deconvolution, obtaining a TPr of 84.7%, an FDr of 54.4%, and a reduction rate of 51.0%.
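The three performance figures quoted above can be illustrated with a minimal sketch. The definitions used here are assumptions, since the text does not spell them out: TPr is taken as the fraction of true fragments that survive cleanup, FDr as the fraction of retained signals that are not true fragments, and the reduction rate as the fraction of all signals removed. The function name and its arguments are hypothetical.

```python
def deconvolution_metrics(is_true_fragment, is_kept):
    """Return (TPr, FDr, reduction rate) for one cleaned spectrum.

    Assumed definitions (not taken from the paper):
      TPr       = TP / (TP + FN)
      FDr       = FP / (TP + FP)
      reduction = removed signals / all signals
    """
    if len(is_true_fragment) != len(is_kept):
        raise ValueError("label and decision lists must have equal length")

    pairs = list(zip(is_true_fragment, is_kept))
    tp = sum(1 for true, kept in pairs if true and kept)
    fp = sum(1 for true, kept in pairs if not true and kept)
    fn = sum(1 for true, kept in pairs if true and not kept)

    n_total = len(pairs)
    n_kept = tp + fp
    nan = float("nan")

    tpr = tp / (tp + fn) if (tp + fn) else nan
    fdr = fp / n_kept if n_kept else nan
    reduction = (n_total - n_kept) / n_total if n_total else nan
    return tpr, fdr, reduction


# Toy spectrum: 8 signals, 3 of which are true fragments.
labels = [True, True, True, False, False, False, False, False]
kept = [True, True, False, True, False, False, False, False]
tpr, fdr, reduction = deconvolution_metrics(labels, kept)
```

In this toy case two of the three true fragments are retained along with one false signal, so a stricter threshold would trade a lower FDr against a lower TPr.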


S1 Sample composition
* Compounds that were not included in the suspect list due to the absence of reference spectra in the databases.

S2 CNL Model
Figure S1: The total number of TP and TN counts for each CNL bin (i.e., 0.001 Da).
Figure S2: The total number of TP fragment m/z counts for each bin from 0 to 1000 m/z with a step size of 0.001 Da. A) shows the full distribution and B) shows the distribution from 0 to 100 Da.
Figure S3: Receiver operator curve for the TP and FP rates of the CNL model for the database fragments, using different mass tolerances and scoreCNL thresholds represented by different shapes and colors, respectively. Additionally, the black line represents the 1:1 ratio between TPr and FPr.
Figure S4: Receiver operator curve for the TP and FD rates of the CNL model for the database fragments, using different mass tolerances and scoreCNL thresholds represented by different shapes and colors, respectively. Additionally, the black line represents the 1:1 ratio between TPr and FDr.
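The binning and mass-tolerance matching referred to in these captions can be sketched as follows. This is only an illustration under stated assumptions: the cumulative neutral loss is taken as the precursor m/z minus the fragment m/z, bins are 0.001 Da wide, and a fragment is kept when its CNL lies within the mass tolerance of any entry in a reference list. The probability-based scoreCNL itself is not reproduced here; the function names and the reference list are hypothetical.

```python
import bisect


def cnl_bin(cnl, width=0.001):
    """Index of the bin (default width 0.001 Da) that a CNL falls into."""
    return int(round(cnl / width))


def filter_by_cnl(precursor_mz, fragment_mzs, reference_cnls, tol=0.005):
    """Keep fragments whose CNL is within `tol` Da of a reference CNL.

    Simplified stand-in for the CNL model: plain tolerance matching,
    not the probability-based score described in the paper.
    """
    ref = sorted(reference_cnls)
    kept = []
    for mz in fragment_mzs:
        cnl = precursor_mz - mz  # assumed definition of the CNL
        # First reference value that is >= cnl - tol ...
        i = bisect.bisect_left(ref, cnl - tol)
        # ... matches if it is also <= cnl + tol.
        if i < len(ref) and ref[i] <= cnl + tol:
            kept.append(mz)
    return kept


# Hypothetical values: 18.0106 Da (H2O) and 43.9898 Da (CO2) are common
# neutral losses; 100.0 Da is an arbitrary placeholder.
reference = [18.0106, 43.9898, 100.0]
kept = filter_by_cnl(300.0, [281.9894, 250.0, 200.003], reference)
```

With these inputs, the fragments at 281.9894 and 200.003 are retained because their losses fall within 0.005 Da of a reference value, while the fragment at 250.0 is removed.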

S3 Real samples
Figure S5: Receiver operator curve for the TP and FD rates of the CNL model for the real samples, using different mass tolerances and scoreCNL thresholds represented by different shapes and colors, respectively.
Figure S6: A) shows all detected signals within a single measurement for buprofezin, B) shows the signals that were removed according to the CNL model with a mass tolerance of 0.005 Da and a score threshold of 0.00, and C) shows the cleaned spectrum.
Figure S11: Receiver operator curve for the TP and FD rates of the CNL model for real samples with a varying scoreCNL threshold. Here the difference between results with low spiked concentration (circles) and high concentration (squares) can be seen. Additionally, the black line represents the 1:1 ratio between TPr and FDr.

S3.2 Matrix Influence
Figure S12: Receiver operator curves for the TP and FD rates of the CNL model for real samples with a varying scoreCNL threshold. Here the difference between results with no added matrix (A), 100 times diluted tea (B), and 10 times diluted tea (C) can be seen. Each subplot contains multiple ROC curves, each for a different concentration of added standards. Additionally, the black line represents the 1:1 ratio between TPr and FDr.

S3.3 CNL Range Influence
Figure S13: Receiver operator curve for the TP and FD rates of the CNL model for real samples with a varying scoreCNL threshold. Here the performance for 4 CNL ranges can be seen. Additionally, the black line represents the 1:1 ratio between TPr and FDr.

S3.4 Collision Energy Influence
Figure S14: Receiver operator curve for the TP and FD rates of the CNL model for real samples with a varying scoreCNL threshold. Here the difference between results with low collision energy (circles) and high collision energy (squares) can be seen. Additionally, the black line represents the 1:1 ratio between TPr and FDr.

S4 Comparison with Conventional Method
Figure S15: Receiver operator curve for the TP and FD rates of the apex retention time difference method for real samples with a varying maximum retention time difference. Additionally, the black line represents the 1:1 ratio between TPr and FDr.
Figure S16: Receiver operator curve for the TP and FD rates of the correlation method for real samples with a varying minimum correlation. Additionally, the black line represents the 1:1 ratio between TPr and FDr.
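The two time-domain filters compared in Figures S15 and S16 can be sketched as below. This is a minimal illustration under assumptions, not the implementation used in the study: the precursor and fragment chromatograms are assumed to be sampled on the same retention time grid, the apex is taken as the most intense point, and the correlation is a plain Pearson coefficient. All function names are hypothetical.

```python
from math import sqrt


def apex_index(intensities):
    """Index of the most intense point of a chromatographic trace."""
    return max(range(len(intensities)), key=intensities.__getitem__)


def passes_rt_filter(rt, precursor_xic, fragment_xic, max_rt_diff):
    """Apex retention time difference method (assumed form)."""
    diff = abs(rt[apex_index(precursor_xic)] - rt[apex_index(fragment_xic)])
    return diff <= max_rt_diff


def pearson(x, y):
    """Pearson correlation of two equally long traces (0.0 if one is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def passes_correlation_filter(precursor_xic, fragment_xic, min_corr):
    """Peak shape correlation method (assumed form)."""
    return pearson(precursor_xic, fragment_xic) >= min_corr


# Toy traces on a shared retention time axis (minutes).
rt = [0.0, 0.1, 0.2, 0.3, 0.4]
precursor = [1.0, 5.0, 10.0, 5.0, 1.0]
coeluting = [2.0, 10.0, 20.0, 10.0, 2.0]  # same shape, same apex
shifted = [10.0, 5.0, 1.0, 0.0, 0.0]      # apex 0.2 min earlier
```

Sweeping `max_rt_diff` or `min_corr` over a range of values and recording TPr and FDr at each setting would trace out curves of the kind shown in the two figures.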