Fitting Structure-Data Files (.SDF) Libraries to Progenesis QI Identification Searches

Progenesis QI (PQI) is a multiplatform bioinformatics tool that facilitates the identification workflow for metabolomics experiments. PQI uses fragmentation data provided by MassBank of North America (MoNA) libraries, among others, for metabolite annotation. However, PQI does not officially support MoNA libraries and other libraries based on structure-data files (.sdf). This paper describes the development and application of software named MoNA to Progenesis QI Library Converter, which makes MoNA libraries usable in PQI by correcting the fragmentation data of the library for Progenesis readability. We evaluated several public experimental datasets, including human plasma, plant extracts, cultured cells, bacteria, rat serum, and rat hippocampus. The results showed that each library file must be converted before PQI can access fragmentation information from .msp (main spectra profile) files. This step is highly recommended to improve the identification level of the metabolites.


Introduction
Metabolite identification is the most challenging and important part of an untargeted metabolomic investigation. 1 This step is critical for turning instrumental data into meaningful biological information 2 and remains the main bottleneck in the field. 1 Fragmentation information from tandem mass spectrometry (MS/MS) experiments is crucial in metabolomics, improving the Metabolomics Standards Initiative (MSI) identification level from merely putatively characterized compound classes to putatively annotated compounds, allowing more confident metabolite annotation. 3-6 Unlike the predictable fragmentation patterns of peptides, the large structural heterogeneity of metabolites makes their MS/MS fragmentation patterns difficult to predict. To this day, these patterns are not well established and are mostly unknown. 7 Progenesis™ QI (PQI, Waters Corporation™ © Nonlinear Dynamics) 5 is a data processing tool for high-resolution mass spectrometry (MS). It deals with full scan, data-dependent analysis (DDA), and data-independent analysis (DIA) through peak alignment, peak picking, and mining.
PQI is commercial proprietary software, but it is not limited to instruments and data from Waters Corporation. Furthermore, it applies a search-based approach for experimental feature annotation by matching the physicochemical properties and spectral similarity of each feature against public/commercial spectral libraries. 3,5,6 Using this identification method, PQI compares experimental data with the data downloaded from the libraries and evaluates the identification quality using up to five similarity parameters: (i) mass similarity (in ppm, parts per million), (ii) isotope similarity (in ppm; 0-100%), (iii) retention time similarity (in minutes; 0-100%), (iv) collision cross-section (CCS) similarity (in percentage or Å²; 0-100%), and (v) fragmentation score (by the cos(θ) similarity method; 8,9 0-100%). 10 These parameters consider methodological and instrumental information. If a given parameter is unavailable or disabled, or the external library does not include it, the parameter assumes a value of 0% and is not considered for matching. Each parameter represents 20% of the final score calculation. For example, if only mass similarity, isotope similarity, and fragmentation score are used, the maximum achievable score will be 60%. Yet, this approach relies on libraries in the structure-data (.sdf) and main spectra profile (.msp) file formats, which are used to annotate the mass of the precursor and adduct and to check fragmentation patterns, respectively.
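The equal weighting described above can be illustrated with a minimal sketch. It assumes only what the text states (five parameters, 20% each, a missing parameter counting as 0%); the function name and dictionary keys are hypothetical labels, not part of PQI.

```python
from typing import Dict, Optional

# The five similarity parameters named in the text (keys are illustrative labels).
PARAMETERS = ("mass", "isotope", "retention_time", "ccs", "fragmentation")


def overall_score(similarities: Dict[str, Optional[float]]) -> float:
    """Average the five similarities (each 0-100), i.e. 20% weight per parameter.

    A parameter that is absent or None contributes 0, as described in the text.
    """
    total = 0.0
    for name in PARAMETERS:
        value = similarities.get(name)
        if value is None:
            continue  # unavailable or disabled parameter counts as 0%
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} similarity must be within 0-100, got {value}")
        total += value
    return total / len(PARAMETERS)


# Only mass, isotope, and fragmentation are available, all with perfect similarity.
best_case = overall_score({"mass": 100.0, "isotope": 100.0, "fragmentation": 100.0})
print(f"Maximum achievable score with three parameters: {best_case:.0f}%")
```

With three of the five parameters in use, the sketch reproduces the 60% ceiling mentioned in the text.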
The MassBank of North America (MoNA) platform 11 is a well-known curated, centralized, and collaborative public source of experimental and in silico fragmentation spectra, which provides compounds in both .sdf and .msp formats. Through it, it is possible to find public records for compounds from the Human Metabolome Database (HMDB), 12 LipidBlast, 13 Global Natural Products Social Molecular Networking (GNPS), 14 and others. MoNA users can also submit their novel spectra for broad sharing, and the corresponding curated spectra can eventually be downloaded as well. 15 Metabolite annotation processed by PQI cannot, by default, access .msp files downloaded from the MoNA database. An incompatibility between PQI and external .sdf libraries, such as MoNA, was noticed, implying the loss of fragmentation data. Therefore, this work aimed to develop and apply a computational tool named SDF to Progenesis QI Library Converter to correct these libraries and make them compatible with PQI annotation searches. The application and code are publicly available in a GitHub repository. 16

Datasets assessed
The list of studies assessed in this work refers to six liquid chromatography-mass spectrometry (LC-MS) metabolomics experiments using Data Dependent Acquisition (DDA) and Data Independent Acquisition (DIA) datasets, comprising human plasma, plant extracts, rat hippocampus and serum, chicken cells, and bacteria. All studies have already been published in peer-reviewed journals and have their .raw data, study information, and the list of identified metabolites made publicly available in the MetaboLights data repository. 17 The chosen datasets have the following identifiers: (i) MTBLS1584, 18 (ii) MTBLS1783, 19 (iii) MTBLS1115, 20 (iv) MTBLS496, 21 and (v) MTBLS952. 22

Data processing, tool development, and application
The SDF to Progenesis QI Library Converter (SDF2PQI) was developed as a console application in the Code::Blocks 13.12 (open source) integrated development environment, in the C programming language, using the Minimalist GNU for Windows (MinGW) implementation of the GNU Compiler Collection (GCC), and it is publicly available on GitHub. 16 LC-MS raw data were processed using the Progenesis MetaScope search engine. For molecular feature annotation, the following libraries were used: HMDB, LipidBlast, Fatty Acid ester of Hydroxyl Fatty Acid (FAHFA), Oxidized Phospholipids, Vaniya/Fiehn Natural Products Library, Plant Specialized Metabolome Annotation (RIKEN PlaSMA), ReSpect Bruker Sumner, MetaboBASE Plant Library (MetaboBASE), Lipid Maps, and GNPS.
For the PQI MetaScope identification process, quality control (QC) samples were chosen when available, with a tolerance of 15 ppm for both precursor and fragments; otherwise, all the samples of the study were used, with a tolerance of 100 ppm for precursor and fragments.
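As a worked illustration of what these tolerances mean, the sketch below computes the mass error in ppm with the usual definition, (observed - reference) / reference × 10^6. The function names and the example m/z values are hypothetical and only show how a 15 ppm and a 100 ppm window behave.

```python
def ppm_error(observed_mz: float, reference_mz: float) -> float:
    """Signed mass error of an observed m/z relative to a reference m/z, in ppm."""
    return (observed_mz - reference_mz) / reference_mz * 1e6


def within_tolerance(observed_mz: float, reference_mz: float, tolerance_ppm: float) -> bool:
    """True when the absolute ppm error does not exceed the tolerance."""
    return abs(ppm_error(observed_mz, reference_mz)) <= tolerance_ppm


# Hypothetical library m/z of 500.0000 and two hypothetical measurements.
reference = 500.0000
for observed in (500.0050, 500.0100):
    error = ppm_error(observed, reference)
    print(
        f"m/z {observed:.4f}: {error:.1f} ppm, "
        f"15 ppm window: {within_tolerance(observed, reference, 15.0)}, "
        f"100 ppm window: {within_tolerance(observed, reference, 100.0)}"
    )
```

At m/z 500, a 15 ppm window corresponds to ±0.0075 and a 100 ppm window to ±0.05, which is why the wider setting admits far more candidate matches.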
All the computational processing was carried out on a processing station equipped with an Intel® Core™ i9-9900K CPU @ 3.60 GHz, 64 GB of RAM, and the Windows 10 Enterprise 64-bit operating system.

Results and Discussion
PQI and MoNA have a similar data structure in .sdf files, organized into distinct fields, as presented in Figure 1. Each field plays a different role in the annotation processing for every compound. Fields 1 and 2 indicate the name of the molecule and the information about the software/instrument used to generate the record. Field 3, called the "count line", describes the number of atoms and bonds for a given compound. The following field, number 4, is the "atoms block" and provides the coordinates of each atom on the x, y, and z axes and the atom symbol to be used (e.g., C for the carbon atom). It is also possible to include information about the charge of the molecule in the same field.
Field 5, the "bonds block", describes the bonds among atoms, designating the position of the atoms and the bond type. Field 6, called the "terminator", indicates the end of the given compound record. The last, field 7, is called the "additional data field" and, similarly to an XML file, it contains a header; in this case, the header must begin with "> <ID_Info>", followed by an identification code, shown as "00001" for PQI and as "MMS553002" for MoNA (Figure 1c).
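The field layout described above can be made concrete with a short sketch that splits one record into its parts. It assumes the standard V2000 molfile layout (three header lines, with a comment line before the counts line, fixed-width atom and bond counts, "M  END" with two spaces, and "$$$$" closing the record); the sample record, the function name, and the dictionary keys are hypothetical and are not taken from Figure 1.

```python
import re
from typing import Dict, List

# Hypothetical toy record (hydrogens omitted); not a real library entry.
SAMPLE_RECORD = """\
Methanol
  ExampleSoftware

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0
    1.4000    0.0000    0.0000 O   0  0
  1  2  1  0
M  END
> <ID>
00001

$$$$
"""


def split_sdf_record(record: str) -> Dict[str, object]:
    """Split one SDF record into header, counts, atoms, bonds, and data fields."""
    lines: List[str] = record.splitlines()
    counts_line = lines[3]
    n_atoms = int(counts_line[0:3])  # columns 1-3: number of atoms
    n_bonds = int(counts_line[3:6])  # columns 4-6: number of bonds
    atoms = lines[4:4 + n_atoms]
    bonds = lines[4 + n_atoms:4 + n_atoms + n_bonds]

    # Locate the terminator that closes the structure part of the record.
    terminator = None
    for index in range(4 + n_atoms + n_bonds, len(lines)):
        if lines[index].startswith("M  END"):
            terminator = index
            break
    if terminator is None:
        raise ValueError("record has no 'M  END' terminator")

    # Collect the additional data fields: a '> <TAG>' header, then its value lines.
    data_fields: Dict[str, str] = {}
    current_tag = None
    for line in lines[terminator + 1:]:
        if line.strip() == "$$$$":
            break
        header = re.match(r"^>\s*<([^>]+)>", line)
        if header:
            current_tag = header.group(1)
            data_fields[current_tag] = ""
        elif line.strip() and current_tag is not None:
            data_fields[current_tag] = (data_fields[current_tag] + "\n" + line).strip()

    return {
        "name": lines[0].strip(),
        "program": lines[1].strip(),
        "n_atoms": n_atoms,
        "n_bonds": n_bonds,
        "atom_symbols": [atom.split()[3] for atom in atoms],
        "bonds": bonds,
        "data_fields": data_fields,
    }


parsed = split_sdf_record(SAMPLE_RECORD)
print(parsed["name"], parsed["atom_symbols"], parsed["data_fields"])
```

The identification code stored under the data header is the piece that later has to agree with the .msp file, so it is returned separately from the structural blocks.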
The comparison of Progenesis and MoNA data structures (Figures 1a and 1b) reveals that the record of the .sdf file from MoNA follows the requirements of Progenesis QI for fields 1 to 6. For the seventh field, highlighted in red, the number of "space characters" differs between Figure 1a (n = 1 space character) and Figure 1b (n = 2 space characters), which is one of the causes of the incompatibility.
According to the Dassault Systèmes® documentation on .sdf files, this extra character should not exist, and the "M END" must be followed by a blank line. 23 Through the analysis of an in-house MS/MS library obtained from one hypothetical molecular feature (i.e., extracted ion chromatogram peak), the expected template for .msp fragmentation files was deduced. We observed that the fields "Name", "Precursor_type", and "Formula" were not considered in the identification process and were therefore left blank (Figure 2).
To enable PQI to correctly verify the correspondence between the experimental MS/MS spectra and the external library, the "DATABASE_ID" field of the .msp file (Figure 2a) must match the seventh field in the .sdf file, namely "> <ID>" (Figure 1a, field 7). If there is no match, the field "DATABASE_ID" is displayed as "DB#" in the .msp files (Figure 2b) and might be ignored by PQI in the annotation searches.
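The linkage requirement can be checked with a small sketch that reports which .msp entries carry a "DATABASE_ID" value that also exists among the .sdf identification codes. The sample text, the "Num Peaks" lines, and the function names are hypothetical illustrations; only the field names "Name", "Precursor_type", "Formula", "DATABASE_ID", and "DB#" come from the description above.

```python
import re
from typing import List, Optional, Set

# Hypothetical two-entry .msp text: the first uses DATABASE_ID, the second DB#.
SAMPLE_MSP = """\
Name:
Precursor_type:
Formula:
DATABASE_ID: 00001
Num Peaks: 2
100.0 50
120.0 100

Name: Example compound
DB#: MMS553002
Num Peaks: 1
85.0 100
"""


def split_msp_entries(msp_text: str) -> List[str]:
    """Split .msp text into entries, assuming blank lines separate them."""
    return [block for block in re.split(r"\n\s*\n", msp_text.strip()) if block.strip()]


def entry_field(entry: str, key: str) -> Optional[str]:
    """Return the value of a 'key: value' line in one entry, or None if absent."""
    for line in entry.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip() == key:
            return value.strip()
    return None


def linked_to_sdf(msp_text: str, sdf_ids: Set[str]) -> List[bool]:
    """For each entry, True when its DATABASE_ID exists among the .sdf IDs."""
    results = []
    for entry in split_msp_entries(msp_text):
        database_id = entry_field(entry, "DATABASE_ID")
        results.append(database_id is not None and database_id in sdf_ids)
    return results


# IDs as they would be read from the data field of the .sdf records.
print(linked_to_sdf(SAMPLE_MSP, {"00001", "MMS553002"}))
```

In this sketch the second entry is reported as unlinked even though its code exists in the .sdf set, because the code sits under "DB#" rather than "DATABASE_ID".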
The mismatched fields force PQI to skip them, which causes the error in the matching process between the experimental and library MS/MS spectra, as shown in Figure 3.
After establishing the PQI requirements for the downloaded .sdf and .msp files and identifying the data format and patterns, we developed a console application that iterates over each line while searching for specific predefined patterns and replaces them to match the required SDF syntax, allowing files from the MoNA library to be correctly used by PQI.
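The line-by-line approach can be sketched as follows. This is not the published C implementation. It is a minimal Python illustration under two assumptions: that the extra space character sits between ">" and "<" in the .sdf data header, and that the .msp "DB#" field should be renamed to "DATABASE_ID". All function and pattern names are hypothetical.

```python
import io
import re
from typing import Callable, TextIO

# Assumed patterns; the real converter may search for different ones.
SDF_HEADER = re.compile(r"^>\s+<([^>]+)>\s*$")
MSP_DB_FIELD = re.compile(r"^DB#:\s*(.*)$")


def convert_sdf_line(line: str) -> str:
    """Collapse the whitespace in a data header to a single space: '> <TAG>'."""
    match = SDF_HEADER.match(line)
    return f"> <{match.group(1)}>" if match else line


def convert_msp_line(line: str) -> str:
    """Rename a 'DB#' field to 'DATABASE_ID', keeping its value."""
    match = MSP_DB_FIELD.match(line)
    return f"DATABASE_ID: {match.group(1)}" if match else line


def convert_stream(source: TextIO, target: TextIO, rule: Callable[[str], str]) -> int:
    """Apply a rule to every line of source, write to target, count changed lines."""
    changed = 0
    for raw_line in source:
        line = raw_line.rstrip("\r\n")
        converted = rule(line)
        if converted != line:
            changed += 1
        target.write(converted + "\n")
    return changed


# In-memory demonstration; real files would be opened with open(path, "r") instead.
sdf_in = io.StringIO("M  END\n>  <ID>\nMMS553002\n\n$$$$\n")
sdf_out = io.StringIO()
print(convert_stream(sdf_in, sdf_out, convert_sdf_line), "line(s) changed")
print(sdf_out.getvalue())
```

Streaming one line at a time keeps memory use independent of library size, which matters for .msp files with millions of lines.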
To exemplify the utility of our tool, we selected six datasets available at the MetaboLights platform 17 containing public MS data. These data files were submitted to the PQI identification process using both unconverted and converted libraries. Table 1 displays the overall results for positive and negative ion modes using different libraries for different datasets, evaluating the results in terms of the number of MS/MS library-matched molecular features. Figure 4 shows the annotation results of selected molecular features for different datasets before and after library correction.
Figure 4 reveals that converting the library enables the correspondence between the experimental and external library spectra (mirror plot). Moreover, the identification quality parameters (namely Score and Fragmentation Score, respectively) increased, indicating an improvement in identification quality. The processing time depends on the chosen library. To verify the average processing time, we used the MassBank library as an example, one of the most comprehensive ones available on the MoNA repository, comprising about 72,439 mass spectra at the time of the experiment (2022-10-21). To estimate time consumption over the file size, we used a downloaded .zip file of ca. 29.5 MB, with an unpacked .msp file of 197.2 MB, containing around 4,287,601 lines (depending on the number of indexed molecules). This file was fully converted with our tool to a compatible one in about 2:30 min, and the following identification process took 1:10 min, resulting in 591 matches for 4,169 searched features.
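For orientation, the reported figures can be turned into approximate rates with a few lines of arithmetic. The sketch uses only the numbers quoted above and treats the rounded times as exact, so the results are indicative only: roughly 28,600 lines per second converted and about 14% of searched features matched.

```python
# Figures quoted in the text for the MassBank library example.
lines_converted = 4_287_601
msp_size_mb = 197.2
conversion_seconds = 2 * 60 + 30      # "about 2:30 min"
identification_seconds = 1 * 60 + 10  # "1:10 min"
matches = 591
searched_features = 4_169

lines_per_second = lines_converted / conversion_seconds
megabytes_per_second = msp_size_mb / conversion_seconds
match_rate_percent = 100 * matches / searched_features

print(f"Conversion throughput: {lines_per_second:,.0f} lines/s "
      f"({megabytes_per_second:.2f} MB/s)")
print(f"Total time (conversion + identification): "
      f"{conversion_seconds + identification_seconds} s")
print(f"Library-matched features: {match_rate_percent:.1f}% of those searched")
```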

Conclusions
This work described the successful development and application of a tool that corrects SDF library formats for PQI annotation searches. This tool, as an additional identification resource, enabled a significant increase in the number of MS/MS library-matched molecular features. Identifying unknown molecules based on MS/MS library search remains a burden to overcome. Despite all the improvements in instrumentation, the number of identified compounds remains limited and is highly dependent on instrumental settings (i.e., duty cycle, MS/MS acquisition speed, precursor ion isolation width, accumulation time per single MS/MS spectrum, intensity threshold, collision energy, activation mode, instrumental design of tandem mass spectrometer), which should be optimized according to the requirements of each experiment. Moreover, it heavily depends on the availability of public mass spectral data and authentic analytical standards. 15 Thus, the presented tool is not intended to overcome this barrier.
The strongest feature of the tool we present here is that it offers a simple way for users of Progenesis QI to access any library contained in MoNA, as well as other .sdf-based libraries, such as the ones from the HMDB and Lipid Maps platforms. 24 With this tool, any PQI user will have all MoNA libraries available to increase the quality and quantity of feature annotation. Therefore, the novelty lies not in computational issues but in the application, and this is precisely where the scientific merit of our work is.

Figure 1. Templates from Progenesis QI (a) and MoNA (b) platforms for the compound records in .sdf files. (c) The seventh field, named "data field", highlighted in red, was zoomed in to indicate the different number of "space characters" causing the observed failure in the matching process.

Figure 2. LipidBlast single compound template for .msp files obtained from (a) Progenesis QI and (b) MoNA. The key issue for Progenesis QI annotation processing to match with the MoNA library is highlighted in red.

Figure 4. Fragmentation mass spectra of selected library-matched molecular features from different datasets. The left panels refer to identification processing results using raw library files, and the right panels refer to processing using converted library files. Library-matched fragments are highlighted in red.

Table 1. The total number of MS/MS library-matched molecular features for positive and negative ion modes using raw and converted libraries for different datasets.
a Number of MS/MS-matched features before (raw) and after (converted) the library conversion; b HMDB: Human Metabolome Database; c LipidBlast; d FAHFA: Fatty Acid ester of Hydroxyl Fatty Acid; e Oxidized Phospholipids; f GNPS: Global Natural Products Social Molecular Network Library; g Vaniya/Fiehn Natural Products Library; h ReSpect: RIKEN MSn Spectral Database for Phytochemicals; i MetaboBASE; j RIKEN PlaSMA: Plant Specialized Metabolome Annotation; k Pathogen Box. DIA: data independent analysis; DDA: data dependent analysis.