OrbiFragsNets. A tool for automatic annotation of Orbitrap MS2 spectra using networks grade as selection criteria


Abstract
We introduce OrbiFragsNets, a tool for automatic annotation of MS2 spectra generated by Orbitrap instruments, as well as the concepts of chemical consistency and fragments networks. OrbiFragsNets takes advantage of the specific confidence interval for each peak in every MS2 spectrum, a concept that remains unclear across the high-resolution mass spectrometry literature. The spectrum annotations are expressed as fragments networks, a set of networks with the possible combinations of annotations for the fragments. The model behind OrbiFragsNets is briefly described here and explained in detail in the constantly updated manual available in the GitHub repository. This new approach to de novo automatic annotation of MS2 spectra proved to perform as well as well-established tools such as RMassBank and SIRIUS.
• A new approach to the automatic annotation of Orbitrap MS2 spectra is introduced.
• Possible spectrum annotations are described as independent consistent networks, with annotations for each fragment as nodes, and annotations for the mass difference between fragments as edges.
• The annotation process is described as the selection of the most connected fragments network.

Introduction
Strong interest in new chemicals is shown by constantly growing public mass spectra databases, such as the MassBank [ 1 ] and the Global Natural Products Social Molecular Networking (GNPS) [ 2 ], as well as all the effort being spent on metabolomics [ 3 ] and natural products research [ 4 ]. The scientific community is now uncovering the novel biological and environmental effects caused by those unknown substances.
High-resolution mass spectrometry (HRMS), coupled with liquid chromatography (LC) or gas chromatography (GC), is a cutting-edge technology able to detect thousands of substances within a couple of minutes [ 5 ]. It is now essential for many scientific disciplines, from organic synthesis [ 5 ] to environmental sciences [ 6 , 7 ].
Instruments such as the Fourier-transform ion cyclotron resonance mass spectrometer (FTMS) [ 5 ] and Orbitrap [ 8 ] use the image current generated by ions inside an ion trap to determine a highly accurate mass-to-charge ratio (m/z) for MS2 product ions. Those fragmentation patterns might be informative enough to discover new molecules. However, the complex relationship between the fragments and the chemical structure requires more research [ 9 ].
The current benchmark for elucidating the structure of small molecules (< 1000 Da) is nuclear magnetic resonance (NMR) spectroscopy [10][11][12]. Nevertheless, NMR spectroscopy is extremely sensitive to impurities [ 12 ]. The required purification makes the analysis with NMR spectroscopy challenging, expensive, time-consuming, and almost impossible to perform with complex mixtures of trace organic chemicals [ 13 ]. LC-HRMS, unlike NMR spectroscopy, detects individual molecules in complex mixtures using minor preliminary treatment steps, but requires a deeper data analysis.
Schymanski et al. [ 14 ] defined five confidence levels in identifying small molecules with HRMS. Level 1 is a confirmed structure. It requires the MS2 spectrum and retention time to match with standards in the same laboratory. Level 2 refers to a probable structure. It needs an MS2 spectrum match with a database. In level 3, the MS2 spectrum is ambiguous regarding isomers. The same holds in level 4, where the confidence is just sufficient to define the molecular formula. Finally, reporting the accurate mass is enough for the lowest level of confidence, level 5.
The confidence in this classification system relies on data analysis and additional experiments after the HRMS analysis. With HRMS, data analysis carries an identification from level 5 to level 2, from cleaning the spectrum to comparing with databases. HRMS data is complicated to process, and different software alternatives disagree on crucial steps such as feature detection [ 15 ]. Annotation links the estimated m/z with a chemical formula. It increases the spectrum quality, facilitates interpretation [ 16 ], and makes comparing with databases more accessible [ 17 ].
There are several software alternatives for the automatic annotation of MS2 spectra. RMassBank [ 16 ] and SIRIUS [ 18 ] are among the most popular. While the MassBank requires the RMassBank formatting for uploading spectra [ 1,16 ], users prefer SIRIUS [ 18 ] because of its compatibility with other tools such as MZmine [ 19 ], OpenMS [ 20 ], and GNPS [ 2 ].
RMassBank ranks the possible annotations for the peaks by their frequency of appearance across different spectra and different adducts. In contrast, machine learning and comparison with databases define the annotations in SIRIUS. Both alternatives rely on data outside the target spectrum and reject possible annotations based on a mass error. The user defines the mass error as a parameter, forcing a certain accuracy on the m/z estimation, but the statistical treatment of experimental data would yield a better result [ 7,21 ].
According to Brenton and Godfrey [ 21 ], the excessive reporting of long numbers as significant digits in the HRMS literature reflects an inappropriate use of concepts such as accurate mass, exact mass, and uncertainty. The massive amount of data generated in a single HRMS run demands automatic approaches and clarity in those concepts [ 22,23 ].
Brenton and Godfrey [ 21 ] and Marty [ 23 ] recommend statistical evaluation of the m/z estimations across multiple measurements. However, every m/z peak in the Orbitrap data implies dynamic measurements [ 8 ]. The rawest data in an Orbitrap instrument is the image current generated by several ions in the ion trap over a certain period of time [ 8 ]. A fast Fourier transform then converts this signal into a distribution of frequencies related to the m/z of every ion [ 8 ]. This distribution of m/z resembles the peaks in the MS2 spectrum (Fig. 1) and is subject to statistical treatment by itself.
Unfortunately, this data distribution is lost during centroiding, which is the first step in any workflow for HRMS data processing. Centroiding aims to make the data analysis more manageable, but it sacrifices the uncertainty that is key for later tasks such as annotation (Fig. 1).
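To make the idea of a per-peak uncertainty concrete, the following minimal sketch estimates a mean m/z and a confidence interval from the profile-mode data points of a single peak. It is an illustration only and not the OrbiFragsNets implementation: the function name, the example readings, and the use of a normal quantile for the interval of the mean are assumptions made here.

```python
from statistics import NormalDist, mean, stdev


def peak_confidence_interval(mz_points, confidence=0.95):
    """Return (mean m/z, half-width in Da, half-width in ppm) for one peak.

    Hypothetical helper: every signal counts as one unweighted data point.
    """
    n = len(mz_points)
    if n < 2:
        raise ValueError("need at least two data points to estimate uncertainty")
    center = mean(mz_points)
    # Standard error of the mean from the sample standard deviation.
    sem = stdev(mz_points) / n ** 0.5
    # Assumption: a normal quantile is used for simplicity; a Student's t
    # quantile would give a wider interval for small n.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sem
    return center, half_width, half_width / center * 1e6


# Made-up profile-mode m/z readings belonging to a single peak.
points = [156.0112, 156.0114, 156.0115, 156.0116, 156.0118]
center, half_da, half_ppm = peak_confidence_interval(points)
print(f"{center:.4f} Da, interval half-width {half_ppm:.2f} ppm")
```

Once the peak is centroided to a single value, the spread used above is no longer available, which is why the interval has to be computed on profile-mode data.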
This article describes the basic data flow and usage of OrbiFragsNets, our software for the automatic annotation of DDMS2 (data-dependent MS2) Orbitrap data. We expanded the concept of Brenton and Godfrey [ 21 ] by estimating confidence intervals on the raw data from only one spectrum. Then, we used the specific uncertainty of every peak to generate and filter the chemical space [ 22 ] of possible annotations. We employed network theory to represent the different annotations of the MS2 spectrum as networks. Finally, we introduced the concept of chemical consistency, expressed it as the edges of the fragments networks, and used it to rank and select the best annotation. OrbiFragsNets aims to be an offline method for the automatic annotation of MS2 spectra generated by Orbitrap instruments.

Model description
OrbiFragsNets is written in Python3 and can be executed from any Linux terminal or a Jupyter notebook. All the functions and the recommended Jupyter notebook are available in the GitHub repository ( https://github.com/EdwinChingate/OrbiFragsNets ). The following paragraphs and Fig. 2 describe the model and data flow inside OrbiFragsNets, but the reader can find a detailed description of all the functions in the repository manual.

Model assumptions and considerations
i. Every signal in the raw data is treated as a single data point.
ii. The m/z for each peak follows a normal distribution. OrbiFragsNets confirms it with a Shapiro-Wilk test.
iii. The fragmentation process is always a decomposition reaction, from an ion into another ion and a neutral molecule (equation 1). The MS2 spectrum contains the signals from both ions, and the m/z difference between them corresponds to the neutral molecule. OrbiFragsNets describes it as a vector equation (equation 2). In the following sections this assumption is referred to as "chemical consistency".
iv. Each peak in the MS1 spectrum and the MS2 spectrum corresponds to a unique fragment and has a unique annotation. This assumption is easy to satisfy with DDA, as it requires filtering the precursor ions in the MS1 prior to fragmentation.
v. Only ions with a single charge are considered in the chemical space for annotation.
vi. Only the product ions smaller than the precursor ion are considered part of the MS2 spectrum.
vii. OrbiFragsNets annotates using only the most abundant isotopes of the elements potassium, sodium, carbon, chlorine, sulfur, phosphorus, fluorine, oxygen, nitrogen, and hydrogen, plus the 13 and 34 isotopes of carbon and sulfur, respectively.

Tables 3 and 4 contain 16 and 18 fragments, respectively. The maximum number of edges in an n-node network is n * (n-1)/2, reached when all fragments in the network are consistent with each other. OrbiFragsNets generates as many networks as there are combinations of annotations for the individual fragments. For example, five fragments with five candidate annotations each will generate up to 3125 networks.
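The network construction and selection described above can be sketched in a few lines. This is a simplified illustration and not the code in the repository: the function names and the candidate formulas are hypothetical, and two annotations are assumed to be chemically consistent when one formula contains the other, so that their element-wise difference is a valid neutral loss.

```python
from itertools import combinations, product
from math import prod


def is_consistent(formula_a, formula_b):
    """Assumed consistency rule: the element-wise difference of the two
    formulas has a single sign, i.e. it could be a neutral loss."""
    elements = set(formula_a) | set(formula_b)
    diff = [formula_a.get(e, 0) - formula_b.get(e, 0) for e in elements]
    if not any(diff):
        return False  # identical formulas: no neutral loss between them
    return all(d >= 0 for d in diff) or all(d <= 0 for d in diff)


def network_grade(annotations):
    """Number of edges: consistent pairs among the chosen annotations."""
    return sum(is_consistent(a, b) for a, b in combinations(annotations, 2))


def max_edges(n):
    """Upper bound on edges in an n-node network."""
    return n * (n - 1) // 2


def count_networks(candidates_per_fragment):
    """One network per combination of candidate annotations."""
    return prod(len(c) for c in candidates_per_fragment)


def best_network(candidates_per_fragment):
    """Enumerate every network and keep the most connected one."""
    return max(product(*candidates_per_fragment), key=network_grade)


# Made-up candidate annotations (element counts) for three peaks.
precursor = {"C": 10, "H": 12, "N": 3, "O": 3, "S": 1}
a1 = {"C": 6, "H": 6, "N": 1, "O": 2, "S": 1}
a2 = {"C": 5, "H": 4, "N": 2, "O": 4}
b1 = {"C": 6, "H": 6, "N": 1}
b2 = {"C": 4, "H": 2, "N": 2, "O": 1}
candidates = [[precursor], [a1, a2], [b1, b2]]

best = best_network(candidates)
print(count_networks(candidates), network_grade(best), max_edges(len(best)))
```

Exhaustive enumeration is used here only because it mirrors the description; the number of networks grows as the product of the candidate counts, so filtering the candidates for each peak beforehand keeps the search tractable.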

Fig. 2.
Internal workflow of the automatic annotation of MS2 spectra by OrbiFragsNets. OrbiFragsNets uses information from the MS1 spectrum to suggest annotations (colorful boxes) for the target m/z and to define a chemical space of annotations for the peaks in the MS2 spectrum. Then all the possible sets of annotations are expressed as fragments networks, where the nodes correspond to the annotations (colorful boxes) and the edges (broken lines connecting colorful boxes) to the chemical consistency. Finally, the most chemically consistent fragments network is selected as the annotation for the whole MS2 spectrum, as well as the best molecular formula for the precursor ion.

Usage
The process in Fig. 2 is fully automatic. The user just needs to prepare the data in the right format (.mzML), load it into the program, and execute the function OrbiFragsNets. Beyond the following tutorial for the annotation of an MS2 spectrum with a known m/z, all functions in OrbiFragsNets can be used independently and adapted to scan a full data set of unknown substances.
i. Get your raw data (.raw) from your experiment onto your data analysis computer.
ii. Use ProteoWizard msConvert to convert your files into the .mzML format. Adusumilli and Mallick [ 24 ] already gave a detailed description of this step. Keep your data as raw as possible, in profile mode; avoid using filters in msConvert. Here is an example of the conversion from the Linux terminal: ' msconvert "$file" --mzML '
iii. Move your .mzML files to the folder 'Data' inside your working folder.
iv. Open your Jupyter notebook OrbiFragsNets_Notebook.ipynb.
v. Execute the cell "Charge libraries".
vi. In the cell "Charge your data file", assign the name of the file that you want to analyse (.mzML) to the string variable 'DataSetName' and execute the function ChargeDataSet. Your file will be stored in the variable DataSet.
vii. Open the file MaxAtomicSubscripts.csv, inside the folder Parameters, with a text editor such as gedit in Ubuntu, or any spreadsheet software such as Gnumeric or Calc from OpenOffice, making sure to keep the extension as .csv. Define the maximum number of atoms you expect in your molecule for every element, or use high values (> 50) at the cost of a few extra seconds in the execution of MoleculesCand.
viii. Open the file ParametersTable.csv, the same way as MaxAtomicSubscripts.csv, and define your desired parameters. The repository manual provides a detailed description of each parameter, but the most relevant is NoiseTresInt. Data from Orbitrap instruments is robust regarding noise signals after the clustering and the statistical test, but NoiseTresInt strongly affects the execution time of the function NumpyMSPeaksIdentification.
ix. Define the m/z for your target molecule as the protonated adduct (+H+), and execute the function OrbiFragsNets. Alternatively, define it as an anion and change the value of the parameter 'ionization' in ParametersTable.csv from '+' to '-'.
x. Use the following cell "ShowDF" to visualize your results table.
xi. Export your table as an .xlsx or .csv file in the cell "Export results".

The columns 'Predicted m/z (Da)' and 'Relative intensity (%)' can be used to search for similar spectra in the MassBank, and the column 'Molecular formula' assists their interpretation. Table 1 displays the automatic annotation by OrbiFragsNets of our own sulfamethoxazole Orbitrap analysis. The estimated exact m/z for protonated sulfamethoxazole (C10SO3N3H12+) is 254.0599374277 Da. Table 1 contains the conventional information for the MS2 sulfamethoxazole spectrum in the columns "Meassured m/z (Da)" and "Relative intensity (%)", but also the uncertainty of each m/z in the "Confidence interval (ppm)" column, and the number of signals clustered in each peak in the "#Data points" column. The annotation for each fragment is in the column "Molecular formula", the corresponding m/z for that annotation in "Predicted m/z (Da)", and the mass error between the predicted m/z and the measured m/z in the column "Mass error (ppm)".
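The relationship between the mass error and the confidence interval columns can be checked with a short calculation. The sketch below is illustrative only: the rows are made-up values rather than entries of Table 1, the dictionary keys are shortened stand-ins for the column names, and the confidence interval is assumed to be a symmetric half-width in ppm.

```python
def mass_error_ppm(measured_mz, predicted_mz):
    """Signed mass error of a measurement relative to the predicted m/z."""
    return (measured_mz - predicted_mz) / predicted_mz * 1e6


def within_interval(row):
    """True when the prediction lies inside the peak's own interval.

    Assumption: 'ci_ppm' is a symmetric half-width around the measured m/z.
    """
    error = mass_error_ppm(row["measured"], row["predicted"])
    return abs(error) <= row["ci_ppm"]


# Hypothetical rows mimicking the measured m/z, predicted m/z, and
# confidence interval columns of the results table.
rows = [
    {"measured": 156.0115, "predicted": 156.0114, "ci_ppm": 1.3},
    {"measured": 108.0450, "predicted": 108.0444, "ci_ppm": 2.0},
]

errors = [mass_error_ppm(r["measured"], r["predicted"]) for r in rows]
flags = [within_interval(r) for r in rows]
for r, e, ok in zip(rows, errors, flags):
    print(f"{r['measured']:.4f}  error {e:+.2f} ppm  inside interval: {ok}")
```

Comparing each error with the interval of its own peak, instead of with one tolerance for the whole spectrum, is the distinction the "Confidence interval (ppm)" column is meant to support.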

Testing OrbiFragsNets
Tables 2, 3, and 4 include the m/z values taken from spectra for carbamazepine, sulfamethoxazole, and atenolol, respectively, as well as their corresponding annotations by OrbiFragsNets, RMassBank [ 16 ] and SIRIUS [ 18 ]. To make a fair comparison between the three software tools, we took the already annotated spectra from the MassBank and annotated them with OrbiFragsNets and SIRIUS [ 18 ]. For the annotation we only considered the fragments whose contribution to the total intensity was higher than 1%.
It is important to mention that using already centroided data undervalues two features of OrbiFragsNets: the lack of pre-processing and the estimation of the confidence interval for specific fragments in specific matrices. OrbiFragsNets estimates the confidence interval for each peak.
Tables 2-4 show that our offline tool OrbiFragsNets can perform as well as SIRIUS. The few fragments annotated only by RMassBank are not part of the fragmentation process according to SIRIUS and, similarly, are not consistent with the other fragments according to OrbiFragsNets. The GitHub repository is constantly updated and will provide more examples of annotation and comparison with other tools. The next step in developing this methodology is a mathematical formulation of the chemical consistency as a hypothesis and a statistical analysis to connect the grade of our fragments networks with the accuracy of the predictions.
The annotation of the MS2 spectra could become the first step in the data processing for the identification of unknown compounds. Automatic annotation of MS2 spectra increases the confidence in the alignment of features when comparing samples, as there is already certainty about the identity of the feature. Furthermore, the confidence in approaches such as molecular networking would increase, as no mass error is needed when the spectra can be aligned directly.
Spectra annotation as a first step in data processing could be an advantage over conventional workflows, and our approach offers the basics for a de novo workflow. Chemical consistency is independent of any external data and therefore is valid for unknown substances, too. Self-consistent annotated spectra also offer an alternative input for structural elucidation using machine learning.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.