Mass Spectrometry Data Repository Enhances Novel Metabolite Discoveries with Advances in Computational Metabolomics

Mass spectrometry raw data repositories, including Metabolomics Workbench and MetaboLights, have contributed to increased transparency in metabolomics studies and to the discovery of novel biological insights through reanalysis with updated computational metabolomics tools. Herein, we reanalyzed previously published lipidomics data from nine algal species, resulting in the annotation of 1437 lipids, a 40% increase over the previous results. Specifically, diacylglyceryl-carboxyhydroxy-methylcholine (DGCC) in Pavlova lutheri and Pleurochrysis carterae, glucuronosyldiacylglycerol (GlcADG) in Euglena gracilis and P. carterae, phosphatidylmethanol (PMeOH) in E. gracilis, and several oxidized phospholipids (oxidized phosphatidylcholine, OxPC; phosphatidylethanolamine, OxPE; phosphatidylglycerol, OxPG; phosphatidylinositol, OxPI) in Chlorella variabilis were newly characterized with the enriched lipid spectral databases. Moreover, we integrated the data from untargeted and targeted analyses of data-independent tandem mass spectrometry (DIA-MS/MS) acquisition, specifically the sequential window acquisition of all theoretical fragment-ion MS/MS (SWATH-MS/MS) spectra, to increase the lipidomic annotation coverage. After the creation of a global library of precursor and diagnostic ions of lipids by the MS-DIAL untargeted analysis, the co-eluted DIA-MS/MS spectra were resolved in the MRMPROBS targeted analysis by tracing the specific product ions involved in acyl chain compositions. Our results indicated that the metabolite quantifications based on DIA-MS/MS chromatograms were somewhat inferior to the MS1-centric quantifications, while the annotation coverage outperformed that of the untargeted analysis of the data-dependent and DIA-MS/MS data. Consequently, integrated analyses of untargeted and targeted approaches are necessary to extract the maximum amount of metabolome information, and our results showcase the value of data repositories for the discovery of novel insights in lipid biology.

This study highlights the importance of mass spectrometry data repositories to deepen our understanding of lipids in algal species.

Overview of Data Analysis Workflow
Since the MetaboLights database and repository was launched in 2012 by the European Bioinformatics Institute (EMBL-EBI) as the first repository for metabolomics data, data submission has continuously increased (~2.5 TB of data was available in June 2019), and accessibility and awareness have been enhanced through the efforts of MetabolomeXchange (http://www.metabolomexchange.org) and OmicsDI [16]. RIKEN DropMet (http://prime.psc.riken.jp/menta.cgi/prime/drop_index) was also launched in 2009 to share MS-based metabolomics data from RIKEN; ~300 GB of data from 29 studies are currently available, and a part of this repository, i.e., the algae lipidomics data, was used in this study.
On the other hand, data processing tools like MS-DIAL [6,9], XCMS [13], and MZmine 2 [14] have continuously been updated with database curations like Metlin [17] in XCMS. Since the LipidBlast library was released in 2013 as the first public in-silico library for lipids [18], the fork libraries for quadrupole/time-of-flight mass spectrometry (QTOF-MS) with collision-induced dissociation (CID) and orbital ion trap MS with higher-energy collisional dissociation (HCD) data have been developed in MS-DIAL [6,7] owing to the continuous effort. The annotation described in this study can be executed in the MS-DIAL version 3.66 or higher. All programs, i.e., MS-DIAL, MRMPROBS, and the related spectral libraries are available on the RIKEN PRIMe website (http://prime.psc.riken.jp/).

Mass Spectrometry Data
The DDA and DIA lipidomics data obtained in positive and negative ion modes of nine algal species were downloaded from the RIKEN DropMet website (http://prime.psc.riken.jp/menta.cgi/prime/drop_index; ID, DM0022). Briefly, the extraction of algal lipids was performed using a biphasic solvent system of cold methanol, methyl tert-butyl ether (MTBE), and water, followed by lipid separation via reversed-phase liquid chromatography. Both DDA and DIA data were acquired using a QTOF mass spectrometer (TripleTOF 5600+, SCIEX). For DIA (SWATH-MS/MS), a 21 Da isolation window was shifted over the m/z 100-1250 range to select precursor ions. Further details are provided in the previous study [6].
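To illustrate why lipids with similar precursor m/z share one DIA-MS/MS spectrum, the following minimal sketch maps a precursor m/z to its isolation window. It assumes contiguous, non-overlapping 21 Da windows starting at m/z 100; the actual window layout of the acquisition method (e.g., any overlap between adjacent windows) is not given here, and the function name `swath_window` is hypothetical.

```python
import math

# Assumed layout: contiguous 21 Da windows tiling m/z 100-1250 (not confirmed by the source).
START_MZ, END_MZ, WIDTH_DA = 100.0, 1250.0, 21.0

# Number of windows needed to cover the range under this assumption.
N_WINDOWS = math.ceil((END_MZ - START_MZ) / WIDTH_DA)


def swath_window(precursor_mz):
    """Return (index, lower m/z, upper m/z) of the window containing precursor_mz."""
    if not (START_MZ <= precursor_mz < END_MZ):
        raise ValueError("precursor m/z is outside the acquisition range")
    index = int((precursor_mz - START_MZ) // WIDTH_DA)
    low = START_MZ + index * WIDTH_DA
    return index, low, min(low + WIDTH_DA, END_MZ)


# Illustrative precursor value only; all precursors that fall into the same
# window are fragmented together, which produces mixed MS/MS spectra.
print(N_WINDOWS, swath_window(760.585))
```

Under this assumption, any two precursors assigned to the same window index contribute product ions to the same MS/MS chromatogram traces, which is the situation the deconvolution and targeted steps described later have to resolve.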

Software Programs
MS-DIAL version 3.06 and MRMPROBS version 2.44 were used herein. All programs including the latest version are freely available on the RIKEN PRIMe website (http://prime.psc.riken.jp/).
The same parameters in MS-DIAL were used for DDA and DIA data analyses: retention time begin, 0 min; retention time end, 100 min; mass range begin, 0 Da; mass range end, 5000 Da; accurate mass tolerance (MS1), 0.01 Da; MS2 tolerance, 0.025 Da; maximum charge number, 2; smoothing method, linear weighted moving average; smoothing level, 3; minimum peak width, 5 scan; minimum peak height, 1000; mass slice width, 0.1 Da; sigma window value, 0.5; MS2Dec amplitude cut-off, 0; exclude after precursor, true; keep isotope until, 0.5 Da; keep original precursor isotopes, false; retention time tolerance for identification, 4 min; accurate mass tolerance (MS1) for identification, 0.01 Da; accurate mass tolerance (MS2) for identification, 0.05 Da; identification score cut-off, 70%; using retention time for scoring, true; relative abundance cut off, 0; top candidate report, true; retention time tolerance for alignment, 0.05 min; MS1 tolerance for alignment, 0.015 Da; peak count filter, 0; remove feature based on peak height fold-change, true; sample max/blank average, 5; keep identified and annotated metabolites, true; keep removable features and assign the tag for checking, true; replace true zero values with 1/10 of the minimum peak height over all samples, false. Lipid annotation was performed automatically using the in-silico MS/MS spectral library described below, and the result was manually curated by confirming the characteristic product ions and neutral losses to reduce false-positive annotations.
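As a rough illustration of how the identification tolerances above act as search windows, the sketch below filters library entries by the MS1 tolerance (0.01 Da) and the retention time tolerance (4 min). This is not the MS-DIAL implementation, which additionally scores MS/MS similarity; the class `LibraryEntry`, the function `candidates_for_peak`, and all numeric entries are hypothetical placeholders.

```python
from dataclasses import dataclass

MS1_TOL_DA = 0.01   # MS1 tolerance for identification (from the parameter list)
RT_TOL_MIN = 4.0    # retention time tolerance for identification (from the parameter list)


@dataclass
class LibraryEntry:
    name: str
    precursor_mz: float
    rt_min: float


def candidates_for_peak(peak_mz, peak_rt, library):
    """Return library entries inside both the m/z and retention time windows."""
    return [
        entry for entry in library
        if abs(entry.precursor_mz - peak_mz) <= MS1_TOL_DA
        and abs(entry.rt_min - peak_rt) <= RT_TOL_MIN
    ]


# Placeholder library; the values are invented for illustration only.
library = [
    LibraryEntry("candidate_A", 700.500, 10.0),
    LibraryEntry("candidate_B", 700.508, 12.0),
    LibraryEntry("candidate_C", 700.520, 10.0),  # outside the m/z window
    LibraryEntry("candidate_D", 700.505, 20.0),  # outside the retention time window
]
matched = [e.name for e in candidates_for_peak(700.502, 11.0, library)]
print(matched)
```

Because several entries can pass both windows, the subsequent MS/MS score cut-off and the manual curation described above remain necessary to select among them.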
The parameters in MRMPROBS were set as follows: MS1 tolerance, 0.01 Da; MS2 tolerance, 0.025 Da; smoothing method, linear weighted moving average; smoothing level, 1; minimum peak width, 5 scan; minimum peak height, 200; retention time tolerance for identification, 0.1 min; amplitude tolerance for identification, 15%; minimum posterior, 70%; the abundance ratios in reference library were automatically generated by MS-DIAL. The results of metabolite annotation and peak picking were manually curated using the graphical user interface of MRMPROBS.

In-Silico MS/MS Spectral Libraries
The diagnostic ions used to characterize lipid classes were determined using authentic standards, experimental MS/MS spectra of biological samples, or MS/MS spectral information reported in the literature. The MS/MS spectra of PMeOH, LPS, and LPG were confirmed using the standard compounds PMeOH 16:0-16:0, LPG 18:1, and LPS 18:1 (Avanti Polar Lipids, Inc., Alabaster, AL, USA). The DGCC and LDGCC spectra were examined in the DDA-MS/MS data of Pavlova lutheri because these lipids were previously discovered in P. lutheri [19], and the corresponding literature's MS/MS spectrum was utilized to create an in-silico MS/MS library [20]; here, "the literature's MS/MS" refers to MS/MS spectra that have been described in a peer-reviewed journal but are not recorded in publicly or commercially available spectral databases such as MassBank and NIST. The in-silico MS/MS spectral libraries for oxidized phospholipids were developed considering our previously published data [21]. The library creation for GlcADG was based on the literature's MS/MS spectrum [22]. Information regarding ion abundances in the MS/MS spectral libraries was based on our LC-MS/MS experimental conditions, and the detailed analytical conditions were described in a previous study [7]. Briefly, the MS data were acquired in information-dependent acquisition (IDA) mode, i.e., DDA, using SCIEX TripleTOF 5600+ or 6600 systems. The mass range, collision energy, and collision energy spread were set to m/z 70-1250, 45 V, and 15 V, respectively.

Novel Lipid Characterizations in Algae with Enriched In-Silico Spectral Libraries
The global lipid profiling of nine algal species was achieved in 2015, and 15 lipid classes were characterized [6]. These classes include free fatty acid (FFA), di- and triacylglycerols (DAG and TAG), seven phospholipid classes (phosphatidylcholine, PC; phosphatidylethanolamine, PE; phosphatidylglycerol, PG; phosphatidylinositol, PI; phosphatidylserine, PS; lysophosphatidylcholine, LPC; and lysophosphatidylethanolamine, LPE), mono- and digalactosyldiacylglycerol (MGDG and DGDG), sulfoquinovosyldiacylglycerol (SQDG), diacylglyceryltrimethylhomoserine (DGTS), and its lyso-type form (LDGTS). Of these, the most common lipid classes in the photosynthetic membranes of plants, cyanobacteria, and algae, which include PG, SQDG, MGDG, and DGDG, have been characterized in all algal species. In contrast, N,N,N-trimethylammonium cation-containing lipids, i.e., PC and DGTS, were characterized as species-specific lipid classes. For example, Chlamydomonas reinhardtii only contains DGTS, while Chlorella species only contain PC as their characteristic positively charged membrane lipids. Since the specificity of lipid metabolism is highly influenced by genetics, evolution, and the environment of living organisms, increasing lipidomics coverage is an emerging requirement in biology.
Although further investigation of these discovered lipids is required to determine whether the lipid class is endogenously biosynthesized in a specific algal species [26], our results indicated that the reanalysis of the published data with the updated annotation workflow could provide new insights and hypotheses not previously reported. For example, DGCC and LDGCC lipids were only characterized in Pavlova lutheri and Pleurochrysis carterae. DGCC is well-known as a major betaine lipid of non-plastid membranes in P. lutheri [19], while this lipid class has never been reported in P. carterae. Thus, further investigation in P. carterae is required to define its exact stereochemistry. The MS-based lipidomics platform does not resolve the stereochemistry of acyl chains, and sometimes lipid classes cannot be uniquely assigned, although a large variety of lipid molecules can be covered by tracing the lipid class-specific product ions and neutral losses. For example, DGTS and diacylglyceryl hydroxymethyl-N,N,N-trimethyl-β-alanine (DGTA), the major lipid class in P. lutheri, are characterized by the same diagnostic ions (m/z 144.102 and m/z 236.149) [23] under our experimental conditions, so the annotation must be determined by considering the genetic background in mass spectrometry-based metabolite annotations [9]. GlcADG, also known as diacylglyceryl glucuronide (DGGA), was observed in Euglena gracilis, P. lutheri, and P. carterae. GlcADG is known to accumulate in response to phosphorus starvation in Arabidopsis thaliana and Oryza sativa [24], and this lipid class is commonly observed in several algal species. A previous study has also reported its existence in P. lutheri. Furthermore, the rare phospholipid PMeOH class was characterized in E. gracilis, although it could also be detected as an artifact of extraction [25]. Finally, several oxidized fatty acid-containing phospholipids, including OxPC, OxPE, OxPG, and OxPI, were characterized in Chlorella variabilis.

Strategy to Link Untargeted and Targeted Analyses for Increasing Lipid Coverage
We further demonstrated the increased lipid profiling coverage by integrating untargeted and targeted analysis approaches (Figure 2). Although MS-DIAL includes a deconvolution algorithm, MS2Dec, to process the DIA-MS/MS data, MS2Dec requires the peak tops of co-eluted peaks to differ by at least two data points for chromatogram deconvolution [6,12]. Moreover, MS-DIAL uses MS1 chromatogram traces for metabolite quantification, while the MS/MS chromatogram can effectively be used to annotate and quantify the target metabolite in DIA-MS/MS data [15]. Therefore, to compensate for these shortcomings of the MS-DIAL program, we used the MRMPROBS program, which offers a user-friendly GUI for data curation in addition to its favorable algorithm aspects [10,11]. First, both DDA- and DIA-MS/MS data were analyzed using MS-DIAL. Second, all candidates that spectrometrically 'matched' an MS/MS spectrum were obtained using a function newly developed for this purpose. In this study, we utilized the DDA-MS/MS spectral data to examine the co-eluted metabolite profile because their spectrum quality was better than that of the DIA-MS/MS spectra. Third, the reference format file, which contains (1) metabolite name, (2) retention time, (3) precursor ion m/z and product ion m/z list, and (4) ion abundance ratios, was generated to cover all matched candidates. Finally, the DIA-MS/MS data were analyzed by MRMPROBS using the reference library, where the left and right edges of the metabolite peak in each MS/MS chromatogram trace were manually refined.
Figure caption (beginning not recovered). The upper and bottom spectra show the experimental and in-silico MS/MS spectra, respectively. The string characters indicate the abbreviations of fatty acids, and the ester links of the fatty acids are also described by string characters. NL refers to neutral loss. The algal species in which the classified lipids were observed are also highlighted.
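The third step of the workflow, generating reference entries from matched candidates, can be sketched as follows. This is a minimal illustration of the listed fields (metabolite name, retention time, precursor m/z, product m/z, and ion abundance ratio) and not the actual MRMPROBS file format; the dictionary keys, the function `build_reference_rows`, and the numeric values are hypothetical. The sketch also assumes that the most intense product ion is scaled to a ratio of 100.

```python
def build_reference_rows(name, rt_min, precursor_mz, product_ions):
    """Build one reference row per product ion.

    product_ions maps product m/z -> intensity (arbitrary units).
    Ratios are scaled so that the most intense ion equals 100 (an assumption
    made for this sketch).
    """
    base = max(product_ions.values())
    rows = []
    # Sort by decreasing intensity so the ratio-100 trace comes first.
    for mz, intensity in sorted(product_ions.items(), key=lambda kv: -kv[1]):
        rows.append({
            "name": name,
            "rt_min": rt_min,
            "precursor_mz": precursor_mz,
            "product_mz": mz,
            "ratio": round(100.0 * intensity / base, 1),
        })
    return rows


# Placeholder candidate; the m/z and intensity values are invented.
rows = build_reference_rows(
    "candidate_lipid", 12.3, 900.6,
    {300.1: 500.0, 250.2: 1000.0, 150.3: 250.0},
)
for row in rows:
    print(row)
```

One row set is produced per matched candidate, so co-eluted isomers that share a precursor m/z each receive their own product ion traces for the targeted analysis.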

Figure 2. Integrated strategy of untargeted and targeted analyses to increase the coverage of annotated lipids. First, the data-dependent acquisition (DDA) and data-independent acquisition (DIA) data are analyzed using the untargeted analysis pipeline, where peak-picking, MS/MS assignment, and peak alignment are performed. Second, all potential lipid candidates that exceed the cut-off of mass spectral similarity are obtained. Third, a reference library containing the target metabolite name, retention time, precursor ion m/z, product ion m/z list, and ion abundance ratios is automatically generated. In our study, the ratio "100" indicates that the trace is used to quantify the metabolite, and the diagnostic ion for characterizing the metabolite is described by "Q". Finally, the MS/MS chromatograms are analyzed by the targeted analysis pipeline, where the chromatogram traces are evaluated with the reference libraries combined with manual curation.
Importantly, smoothing level 1 was used in the MRMPROBS program, whereas it was set to 3 in the MS-DIAL program. A higher smoothing level reduces the noise level and thereby allows the left and right peak edges to be determined in the automated data analysis pipeline; therefore, the higher smoothing value was used in the untargeted analysis. Conversely, with a higher smoothing level, co-eluted peaks are often merged into a single peak. Because all chromatogram peaks could be manually checked and modified in the GUI, the lower smoothing value was used in the targeted analysis software.
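The trade-off between smoothing levels 1 and 3 can be reproduced on a toy chromatogram. The sketch below assumes that the linear weighted moving average uses a symmetric triangular kernel whose half-width equals the smoothing level; the exact kernel and edge handling in MS-DIAL and MRMPROBS may differ. The trace values and the function names `lwma` and `count_local_maxima` are hypothetical.

```python
def lwma(signal, level):
    """Linear weighted moving average with a triangular kernel of half-width `level`."""
    offsets = range(-level, level + 1)
    weights = [level + 1 - abs(k) for k in offsets]  # e.g. level 1 -> [1, 2, 1]
    smoothed = []
    for i in range(len(signal)):
        num = den = 0.0
        for k, w in zip(offsets, weights):
            j = i + k
            if 0 <= j < len(signal):  # renormalize the kernel at the edges
                num += w * signal[j]
                den += w
        smoothed.append(num / den)
    return smoothed


def count_local_maxima(signal):
    """Count interior points that are strictly higher than both neighbors."""
    return sum(
        1 for i in range(1, len(signal) - 1)
        if signal[i - 1] < signal[i] > signal[i + 1]
    )


# Toy trace with two co-eluted peaks whose tops are three data points apart.
trace = [0, 0, 0, 0, 10, 1, 1, 7, 0, 0, 0, 0]
peaks_level_1 = count_local_maxima(lwma(trace, 1))
peaks_level_3 = count_local_maxima(lwma(trace, 3))
print(peaks_level_1, peaks_level_3)
```

In this toy case the two peak tops survive at level 1 but merge into a single maximum at level 3, which mirrors the reason given above for using the lower level where peaks are curated manually.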

Showcase of Newly Resolved Lipid Profiles by MS/MS-Centric Data Analysis
We showcased the methodology by profiling three co-eluted lipid molecules, DGDG 16:2-18:1, DGDG 16:1-18:2, and DGDG 16:0-18:3, with the same precursor m/z and similar retention times (Figure 3a). In the analysis of E. gracilis, only DGDG 16:2-18:1 was annotated correctly in MS-DIAL based on the spectral match score, while the MS/MS spectrum was partially interpreted as DGDG 16:1-18:2 and 16:0-18:3. By design, MS-DIAL annotates the spectrum of co-eluted metabolites with the representative metabolite that has the highest spectral matching score, although the annotation can be manually modified. Therefore, MS-DIAL generated the MRMPROBS reference format for the three lipid candidates, and the lipids were quantified using the MRMPROBS program. As a result, DGDG 16:0-18:3 was determined to be the major component of the co-eluted lipids in Chlamydomonas reinhardtii, Auxenochlorella protothecoides, C. sorokiniana, C. variabilis, and Dunaliella salina, DGDG 16:2-18:1 the major component in E. gracilis, and DGDG 16:1-18:2 in Nannochloropsis oculata and P. carterae (Figure 3b), where the lipid differences clearly reflected the differences in the phylum. Importantly, these differences could not be resolved using MS-DIAL untargeted analysis (Figure 3c) because the program uses MS1-centric peak quantification, i.e., the red traces in the chromatograms shown in Figure 3b. This indicated that the DIA-MS/MS data enabled the increased coverage of metabolic profiling, as described elsewhere [27], and the two program suites MS-DIAL and MRMPROBS provided a solution to fully utilize the information-rich MS/MS spectral data for comprehensive metabolome analyses.
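The reason the three DGDG molecules share one precursor m/z is that their acyl chains sum to the same total composition. A minimal sketch of this bookkeeping is given below; the function `sum_composition` is hypothetical and assumes names of the form "CLASS c1:d1-c2:d2".

```python
def sum_composition(lipid_name):
    """Collapse acyl chains into a total carbon:double-bond composition."""
    lipid_class, chains = lipid_name.split(" ", 1)
    carbons = double_bonds = 0
    for chain in chains.split("-"):
        c, d = chain.split(":")
        carbons += int(c)
        double_bonds += int(d)
    return f"{lipid_class} {carbons}:{double_bonds}"


isomers = ["DGDG 16:2-18:1", "DGDG 16:1-18:2", "DGDG 16:0-18:3"]
compositions = {name: sum_composition(name) for name in isomers}
print(compositions)
```

All three names collapse to DGDG 34:3, so they cannot be distinguished at the MS1 level and must be separated by their acyl chain-specific product ions.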

Comparison of Untargeted and Targeted Analysis Results
We characterized 1437 molecules in nine algal species, and the total count of annotated lipids was 40% higher than that (1023) of the lipids annotated using the previously developed methodology (Figure 4a, Table 1, Supplementary Data 1, 2, and 3). Moreover, we examined correlations among the quantification methods, including MS1-centric peak height in DDA and DIA-MS/MS data, and MS/MS-centric peak height and area in the DIA-MS/MS data. As expected, the correlation between peak height and area in DIA-MS/MS data was high (Figure 4b, right-bottom), but the correlation between peak heights in DDA and DIA-MS/MS data could be affected by the mass spectrometry settings [6], where the total scan cycle times were different in DDA (650 ms) and DIA-MS/MS (730 ms) data acquisition (Figure 4b, top-left). These differences were also caused by different LC-MS analysis days, where the MS sensitivities differed for each lipid class. Surprisingly, our results indicated that the correlations between MS1 and MS/MS chromatograms were highly dependent on the lipid classes. This indicates that the sensitivity of the product ions is different for each lipid class, and the correlation value is high when abundances are compared within each lipid class. The dynamic range using MS/MS chromatograms was narrower than that of the MS1 chromatograms, and saturation behavior was observed in the correlation plots, especially for TAG lipids, between MS1- and MS/MS-centric quantifications. In fact, the SWATH-MS/MS channels of the TripleTOF 5600+ instrument have a lower linear dynamic range compared to MS1. These results suggest that MS1-centric metabolite quantification is slightly superior to MS/MS-centric quantification, while the annotation coverage in MS/MS-centric analyses outperformed the untargeted analysis pipeline.
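Two of the statements above can be checked with simple arithmetic. The sketch below first reproduces the 40% gain from the reported counts (1437 versus 1023) and then shows, on synthetic data, why a class-dependent product ion response gives a high correlation within each lipid class but a low correlation when classes are pooled. The abundance values and class labels are invented for illustration and are not taken from the study.

```python
from math import sqrt


def percent_increase(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


gain = percent_increase(1437, 1023)  # counts reported in the text

# Synthetic abundances: the MS/MS response per unit of MS1 signal differs by class.
ms1 = {"class_A": [1.0, 2.0, 3.0], "class_B": [1.0, 2.0, 3.0]}
msms = {"class_A": [10.0, 20.0, 30.0], "class_B": [1.0, 2.0, 3.0]}

within = {c: pearson(ms1[c], msms[c]) for c in ms1}
pooled = pearson(ms1["class_A"] + ms1["class_B"],
                 msms["class_A"] + msms["class_B"])
print(round(gain, 1), within, round(pooled, 3))
```

The within-class coefficients are perfect in this constructed case while the pooled coefficient drops sharply, which is consistent with comparing abundances within each lipid class as described above.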

Finally, a detailed investigation of fatty acid properties revealed the uniqueness of the acyl chains in each algal species (Figure 4c). Importantly, we used the fatty acid counts included in the lipid classes instead of the ion abundances because lipid ionization efficiency is highly dependent on the lipid class and retention time. In Plantae, 16:0, 16:1, 16:2, 16:3, 18:0, 18:1, 18:2, and 18:3 are known to be major acyl chains [28], while 16:4 and 18:4 are highly distributed in Chlorophyceae, including C. reinhardtii and D. salina, compared to Trebouxiophyceae, including Chlorella species. Moreover, the 18:0 acyl chain is not often observed in glycerolipids and glycerophospholipids in the nine algal species. In Chromista and Protozoa, polyunsaturated fatty acids (PUFAs) are enriched in addition to the common 16 and 18 carbon length series. These results show that various PUFAs, such as 20:4, 20:5, 22:5, and 22:6, were observed in E. gracilis, while 20:5 and 22:6 were enriched in P. lutheri, and 20:4 and 20:5 in the lipids of N. oculata. These observations were achieved using the MS/MS chromatogram traces for lipid quantification, and the approach is effective for investigating lipid profiles in living organisms to deepen our understanding of lipid metabolism and its connection with gene expression and enzyme activities.
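Counting acyl chains rather than summing ion abundances can be sketched in a few lines. The example assumes annotated names of the form "CLASS c1:d1-c2:d2"; the function `acyl_chain_counts` is hypothetical, and the short annotation list is an illustrative placeholder rather than a result from this study.

```python
from collections import Counter


def acyl_chain_counts(lipid_names):
    """Count how often each acyl chain occurs across annotated lipid names."""
    counts = Counter()
    for name in lipid_names:
        _, chains = name.split(" ", 1)   # drop the lipid class prefix
        counts.update(chains.split("-"))
    return counts


# Placeholder annotation list for one species.
annotated = ["DGDG 16:0-18:3", "MGDG 16:4-18:3", "PG 16:0-18:1"]
counts = acyl_chain_counts(annotated)
print(counts.most_common())
```

Because each annotated molecule contributes equally regardless of its signal intensity, the resulting profile is not biased by class-dependent ionization efficiency.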
In general, untargeted analysis searches for as many metabolites as possible to generate hypotheses in biology, though the rate of false-positive annotations becomes higher than that in targeted analysis; therefore, data analysts must devote considerable time and effort to curating annotation results. On the other hand, targeted analysis focuses on a limited number of metabolites with a lower false-positive rate, though curation is still needed to correct the peak-picking results. Although automated pipelines with estimation of the false discovery rate (FDR) in annotations have also been developed in metabolomics [29-31] for large-scale datasets such as cohort studies, data analysts should recognize the pitfall that false-positive annotations may be carried over into the metabolome and lipidome data sheets used for statistical analyses. Therefore, the integrated analysis by untargeted and targeted techniques is important to reduce misleading results in biological studies. In fact, the metabolites of interest obtained from the integrated results must be validated using authentic standard compounds and further biological experiments to compensate for the limitations of current MS instruments and informatics techniques, which provide limited stereochemical and isomer information.

Conclusions
In this study, we demonstrated the reanalysis of published data, in which 17 lipid classes were newly characterized in nine algal species in addition to the 15 lipid classes annotated previously. In effect, the coverage of lipid classes was doubled by updating the computational mass spectrometry techniques and mass spectral libraries, and our reanalysis indicates the value of MS data repositories, where the raw data can be utilized as a benchmark for new software programs and for data-driven hypothesis generation. The lipidomics workflow is also executable with hydrophilic interaction chromatography (HILIC) [32] or supercritical fluid chromatography (SFC) [33], in which the molecules can be separated based on the specific chemical properties of each lipid class, enabling efficient exclusion of false-positive annotations from incorrect lipid classes. Ion mobility MS provides another diagnostic criterion, viz. collision cross-section (CCS), to increase the confidence in lipid annotation [34]. Although we only showcased the increase in lipid profiling coverage, this strategy could also be applied to more diverse metabolites with experimentally acquired spectral libraries. There are three types of spectral databases: (1) completely open-access, i.e., all records are browsable and downloadable (e.g., MassBank [35], PlaSMA [9], Fiehnlib [36], and GNPS [4]), (2) limited access, i.e., browsable but not downloadable (e.g., Metlin [17] and mzCloud (https://www.mzcloud.org/)), and (3) licensed databases such as NIST and Wiley. Together, the integrated databases cover the MS/MS spectra of approximately 12,000 unique metabolites [9,26]. Importantly, all these databases have increasingly been updated by the continuous effort of mass spectrometrists; therefore, success similar to that obtained herein is achievable by reanalyzing public data using the upgraded databases.
In conclusion, the science of metabolomics and lipidomics is now entering a new era owing to state-of-the-art analytical techniques and informatics platforms in which metabolic profiling is semi-automatically executable [37]. Therefore, MS data repositories will become increasingly important for reaching the 'standard' of the genomics and transcriptomics data sciences. Our computational workflow could be used as a pipeline for metabolomics and lipidomics data processing, and advances in computational metabolomics will deepen the understanding of metabolism.