Addressing big data challenges in mass spectrometry-based metabolomics

Advancements in computer science and software engineering have greatly facilitated mass spectrometry (MS)-based untargeted metabolomics. Nowadays, gigabytes of metabolomics data are routinely generated from MS platforms, containing condensed structural and quantitative information from thousands of metabolites. Manual data processing is almost impossible due to the large data size. Therefore, in the “omics” era, we are faced with new challenges, the big data challenges of how to accurately and efficiently process the raw data, extract the biological information, and visualize the results from the gigantic amount of collected data. Although important, proposing solutions to address these big data challenges requires broad interdisciplinary knowledge, which can be challenging for many metabolomics practitioners. Our laboratory in the Department of Chemistry at the University of British Columbia is committed to combining analytical chemistry, computer science, and statistics to develop bioinformatics tools that address these big data challenges. In this Feature Article, we elaborate on the major big data challenges in metabolomics, including data acquisition, feature extraction, quantitative measurements, statistical analysis, and metabolite annotation. We also introduce our recently developed bioinformatics solutions for these challenges. Notably, all of the bioinformatics tools and source code are freely available on GitHub (https://www.github.com/HuanLab), along with revised and regularly updated content.


Introduction
Small molecule metabolites play critical roles in numerous cellular activities and provide both direct and indirect readouts of various phenotypes. [1][2][3] The study of the entire collection of metabolites in a given biological system is termed metabolomics. Over the past few decades, metabolomics has been developed as a powerful and indispensable biotechnology in the postgenomic era of biology. In particular, metabolomics has been in demand across many research disciplines to gain a global view of metabolic changes for biomarker discovery and mechanistic understandings. [4][5][6][7] Given the wide chemical coverage, metabolomics has also been demonstrated as an important tool in exposome research to study the concurrent exposure to a wide variety of xenobiotics and understand their combined toxic effects in health and disease. [8][9][10] Among the various analytical instruments used to perform untargeted metabolomics, mass spectrometry (MS) is the most prominent choice. In particular, liquid chromatography (LC) coupled to high-resolution MS systems can routinely detect and quantify thousands of metabolic features from as little as 10 mg of tissue, 50 μL of urine, and half a million cells. 7,11,12 The high sensitivity and throughput of MS also generate a large amount of data. For instance, a typical LC-MS-based metabolomics study can generate over 10 GB of data in a 30-sample analysis. The gigantic amount of metabolomic information cannot be manually processed and interpreted as we usually do in traditional analytical chemistry. Furthermore, the large amount of data makes it challenging to perform metabolomics data acquisition, feature extraction, quantification, statistical analysis, metabolite annotation, data sharing, meta-analysis, and others. Fig. 1 summarizes the common big data challenges in mass spectrometry-based untargeted metabolomics.
Addressing these big data challenges is vital to improving metabolomics data quality, obtaining confident biological insights, and minimizing biased biological claims. This Feature Article focuses on four major aspects of big data challenges in LC-MS-based metabolomics, including (1) data acquisition, (2) feature extraction, (3) quantitative and statistical analysis, and (4) metabolite annotation. It also reviews our recent developments and publications addressing these big data challenges from the past two years (2020-2022) (Table 1).

Huaxu Yu
Huaxu Yu received his BSc in Chemistry at Zhejiang University in 2018. With a passion for mass spectrometry and metabolomics, he joined Prof. Tao Huan's group in 2019 to pursue a PhD in analytical chemistry. His research focuses on quantitative metabolomics. By developing novel bioanalytical workflows, bioinformatics software, and statistical methods, he seeks to facilitate fundamental and applied aspects of mammalian, plant, and microbial research. He is the author of 14 publications and the developer of two R packages.

Data acquisition
There are three common strategies to collect metabolomics data from MS: full-scan, data-dependent acquisition (DDA), and data-independent acquisition (DIA). [27][28][29] Full-scan mode acquires MS spectra composed of mass-to-charge ratios (m/z) and signal intensities of ions. Since the entire acquisition time is assigned to obtaining MS1 data, full-scan mode provides the detailed chromatographic peak shape and is suitable for quantification. However, full-scan cannot generate tandem MS (MS/MS) data, thus making confident metabolite annotation impossible. In this regard, DDA and DIA modes were developed to obtain the MS/MS spectra after each MS1 spectrum. Particularly, DDA and DIA modes differ in ion isolation windows, which select ions for fragmentation. DDA opens narrow m/z windows (usually 1-5 Da) to acquire the MS/MS spectra for the most intense metabolic ions (i.e., precursor ions), producing a pure MS/MS spectrum for each selected precursor ion, but often lacks the MS/MS acquisition of low-abundance metabolic features. In addition, MS/MS spectra collection takes time away from MS1 data collection and reduces the number of MS1 data points for constructing chromatographic peaks. Moreover, in the analysis of complex biological samples, the MS/MS acquisition speed is insufficient for comprehensive MS/MS coverage of all detected metabolic features. Many strategies have been developed to improve the efficiency of MS/MS collection, such as iterative MS/MS, gas-phase fractionation, and data-set-dependent acquisition, among others. [30][31][32][33] In comparison to DDA, DIA opens wide m/z windows (usually >100 Da) to obtain the fragments of multiple precursor ions, covering all metabolic ions, but requires sophisticated spectrum deconvolution tools to assign fragments to their parent ions.
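The difference between the two isolation schemes can be illustrated with a minimal Python sketch. It is not instrument control code: the function names, the top-3 setting, the 1 Da and 100 Da window widths, and the MS1 scan are all hypothetical values chosen for illustration.

```python
def dda_isolation_windows(ms1_peaks, top_n=3, width=1.0):
    """Return narrow (low, high) m/z windows around the top_n most intense ions."""
    ranked = sorted(ms1_peaks, key=lambda peak: peak[1], reverse=True)
    return [(mz - width / 2, mz + width / 2) for mz, _ in ranked[:top_n]]


def dia_isolation_windows(mz_start, mz_end, width=100.0):
    """Tile the scan range with consecutive wide windows."""
    windows = []
    low = mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows


def ions_in_windows(ms1_peaks, windows):
    """List, per window, the precursor m/z values that would be fragmented together."""
    return [[mz for mz, _ in ms1_peaks if low <= mz < high] for low, high in windows]


# Hypothetical MS1 scan given as (m/z, intensity) pairs.
scan = [(150.05, 1e6), (152.07, 5e3), (301.14, 8e5), (455.29, 2e4), (460.31, 3e5)]

# DDA: one ion per window, but the two low-abundance ions are never selected.
print(ions_in_windows(scan, dda_isolation_windows(scan, top_n=3)))

# DIA: every ion falls in some window, but a window can hold several precursors.
print(ions_in_windows(scan, dia_isolation_windows(100.0, 500.0, width=100.0)))
```

In this toy scan, the DIA windows that contain two precursors are the ones whose mixed MS/MS spectra would need deconvolution.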
Although the pros and cons of each acquisition mode are intuitively known in metabolomics, there are few systematic comparisons to guide metabolomics practitioners to determine which mode fits a given study the best. 34,35 To address the knowledge gap, we systematically compared the three abovementioned data acquisition modes, focusing on their performance in metabolomics profiling 36 and ability to identify statistically significant features. 37 In the comparison of metabolomics profiling, we assessed the number of features, MS/MS spectra coverage and quality, quantitative precision, and data processing convenience (Fig. 2). Our results show that full-scan data yield the most metabolic features, 53.7% and 64.8% more than DIA and DDA, respectively. In terms of MS/MS spectra, DDA generates higher quality MS/MS spectra that match MS/MS libraries better, whereas DIA has higher MS/MS spectral coverage. Regarding the quantitative precision, no significant difference was observed among these three acquisition modes. In the comparison of significant features discovered across these acquisition modes, we concluded that the consistently discovered ones are mostly true positive features (i.e., real metabolic features). 37 They have a strong correlation in abundance among all three modes and present similar statistical performance. On the other hand, many uniquely discovered significant features are false positive features from background noise and system contamination. 37 Although DDA slightly underperforms full-scan and DIA in significant metabolic feature discovery, it is the most convenient method to obtain high-quality MS/MS spectra for metabolite annotation.
Following the comparison of these data acquisition modes, we believe that a better data acquisition strategy that integrates the advantages of the existing methods is essential for advancing LC-MS-based metabolomics. We thus developed data-dependent assisted data-independent acquisition (DaDIA). 13 The DaDIA workflow performs DIA analyses of biological samples and DDA analyses of the pooled quality control (QC) samples analysed at regular intervals between biological samples throughout the analytical sequence (Fig. 2). The DIA analyses provide high coverage of metabolic features and MS/MS spectra, and the DDA analyses generate high-quality MS/MS spectra to improve the overall confidence of metabolite annotation. We further developed an R package, DaDIA.R, to automate the data processing and metabolite annotation of DaDIA data.
Since DaDIA takes full advantage of DDA and DIA, it achieves a much higher coverage of metabolic features with better spectral quality. DaDIA was applied to a study comparing the metabolic alteration in the plasma of leukemia patients before and after receiving chemotherapy. Our results demonstrated that the DaDIA workflow can efficiently detect and annotate approximately four times more significantly altered metabolites than the conventional DDA workflow.

Table 1
Feature extraction
  Paramounter 14: Directly measuring the optimal feature extraction parameters
  Integrated Feature Extraction, 15 JPA 16: Extracting metabolic features of both high and low confidence
  EVA 17: Evaluating feature fidelity using chromatographic peak shapes
  ISFrag 18: De novo annotation of false positive metabolic features generated from in-source fragmentation
Quantitative comparison and statistical analysis
  MRC 19: Correcting fold change compression and inflation in MS-based metabolomics
  MAFFIN 20: Post-acquisition sample normalization
  PHPA_precision 21: Correcting computational variation caused by peak height or peak area-based quantification
  PowerU 22: Improving the statistical power of MS-based metabolomics
  ABC Transformation 23: Improving data normality with feature-specific data transformation
Metabolite annotation
  HNL, CSS, and McSearch 24: Concept, algorithm, and web platform to perform spectral similarity analysis and molecular networking
  MS2Purifier 25: Recognizing and removing contamination fragment ions in experimental MS/MS spectra
  SteroidXtract 26: Extracting steroid-like metabolic features based on their unique MS/MS patterns

Feature extraction
Extracting metabolic features from raw LC-MS data is a longstanding challenge in untargeted metabolomics. The key is to accurately recognize the chromatographic peaks of real metabolic features and also efficiently clean up the false positive metabolic features coming from the background noise and artificial contaminants. 38 Our lab developed a suite of bioinformatics tools to make this process convenient and intuitive (Fig. 3). Over the past few decades, various metabolic feature extraction algorithms, including centWave, 39 GridMass, 40 and others, 41,42 have been developed to automatically recognize metabolic features in raw LC-MS data. These algorithms have also been implemented in commonly used open-source data processing software, such as XCMS, MS-DIAL, MZmine 2, OpenMS, El-MAVEN, and others. [43][44][45][46][47][48] Although the feature extraction process is automated, properly setting over a dozen different feature extraction parameters is difficult. Conventionally, the strategy of design of experiments (DOE) has been implemented to optimize feature extraction parameters. However, DOE-based optimization requires the testing of many parameter combinations, which is time-consuming and ineffective. [49][50][51][52][53][54] After reviewing the well-established metabolomics data processing software, we concluded that four universal chromatographic peak attributes are critical to feature extraction: mass tolerance, peak height, peak width, and instrumental shift. By measuring these peak attributes directly from the raw LC-MS data, it is possible to attain optimal peak picking parameters defined as universal parameters. This is facilitated by the development of the novel concepts of rank-based intensity sorting, zone of interest, and many others.
These concepts were then implemented into Paramounter, an R program that automatically and accurately extracts the distributions of these universal parameters from the raw LC-MS-based metabolomics data before feature extraction. 14 Our results showed that Paramounter-based direct measurement of feature extraction parameters performs better than conventional DOE-based approaches. It is also more efficient and convenient to use. The proposed universal parameters and the development of Paramounter address a critical need in metabolomics data processing. It is important to note that this work can potentially extend to optimizing parameters for gas chromatography-mass spectrometry (GC-MS)-based metabolomics data.
Another notable challenge in feature extraction is that conventional peak picking algorithms are unable to completely extract features with low abundance or poor chromatographic peak shapes. In particular, many real metabolic features have valid MS/MS spectra but cannot be extracted by conventional peak picking algorithms. In this regard, we designed a peak picking algorithm that can directly extract metabolic features based on their available MS/MS spectra in the raw DDA-based LC-MS data. We also combined this MS/MS spectra-based peak picking algorithm with conventional peak shape-based peak picking to build an integrated workflow for a more comprehensive extraction of metabolic features. 15 The proposed integrated feature extraction algorithm extracted 25% more metabolic features from a human urine sample than the conventional centWave-based feature extraction algorithm with the same parameter settings. Furthermore, we created a targeted feature extraction algorithm for use with a targeted list of metabolites with known m/z and retention time. Combining the peak shape-, MS/MS spectra- and targeted list-based peak picking strategies, we constructed JPA (short for joint metabolic feature extraction and automated metabolite annotation), an R package that not only extracts the metabolic features using the integrated strategy but also performs the automated metabolite annotation. When the three algorithms were applied together on a mixture of 134 endogenous metabolite standards, JPA demonstrated superior feature detection sensitivity by reaching a limit of detection (LOD) thousands of times lower than the conventional centWave peak picking algorithm. Moreover, JPA also surpassed the conventional centWave algorithm by detecting 2.3-fold more exposure chemicals from a standard mixture containing 505 drugs and pesticides. 16
On the other hand, enhancing feature extraction sensitivity usually comes with an increase in false positive features. False positive metabolic features can reduce the confidence of downstream statistical and biological interpretations. 38 A common practice to find false positive features is to manually check the peak shapes of the extracted ion chromatograms (EICs) of metabolic features. Typically, real metabolites are more likely to have Gaussian-shaped chromatographic peaks. Manual checking of EICs is very effective in filtering out false positive noise peaks, as an experienced analytical chemist can easily differentiate between true and false positive features by simply looking at their peak shapes. However, metabolomics data contains thousands of metabolic features, and manual inspection of their EICs is extremely labour-intensive and time-consuming. Previous work developed a strategy to send the data to a smartphone application so that users can manually check the peak shapes, but the time spent on manual checking was not clearly reduced. 55 To replace the tedious process with a labour-free task, we developed an artificial intelligence-based program using a convolutional neural network (CNN) model, well-known for its efficient performance in image classification. 56 CNN is a type of deep learning algorithm, along with artificial neural networks (ANN), recurrent neural networks (RNN), and many others. 57 Compared to traditional machine learning (e.g., Support Vector Machine and Random Forest), the biggest advantage of these deep learning algorithms is that they can learn high-level features from data without manual feature extraction and are very efficient with large-scale data sets. 58 Notably, the metabolomics community has recognized the potential of deep learning in metabolomics. 58,59 Previous research efforts have incorporated deep learning in metabolic feature extraction, metabolite annotation, and the predictions of retention time and collision cross section. [60][61][62][63][64][65][66][67][68][69] Regarding classifying good and bad chromatographic peak shapes, a previous work applied CNN to the classification of GC-MS chromatographic peaks. 70 In another study, CNN was used to determine the chromatographic peak shapes in LC-MS-based metabolomics data. 71 However, that workflow involves multiple R and Python scripts that users have to run individually. Due to their small training data size, users would also need to retrain the model for different LC-MS conditions, such as different spectra acquisition rates.

Fig. 3 The pipeline of metabolic feature extraction. A suite of bioinformatics tools has been developed in our lab to address the challenges of feature extraction, including optimizing feature extraction parameters, extracting low-quality metabolic features, evaluating feature quality, and removing in-source fragment features.
In our work, we aimed to develop a robust and easy-to-use CNN model for chromatographic peak recognition. To minimize data overfitting and ensure the robustness of the model, we trained our CNN model with over 25 000 manually inspected plots of true and false chromatographic peaks generated from 22 different LC-MS-based metabolomics studies. Furthermore, we created a Windows application named EVA (short for evaluation of chromatographic peak shapes) for the convenience of metabolomics researchers with limited programming experience. 17 Evaluated using metabolomics data from different MS instruments and acquisition rates, EVA achieved over 90% classification accuracy when referenced against manual checking results. Notably, another work was published later following a similar CNN strategy. Future work is needed to make performance comparisons. 72
Removing false positive features with poor peak shapes is not the last step, as metabolic features with good EIC peak shapes still might not be real metabolites. Another common type of false positive feature originates from in-source fragmentation (ISF). In LC-MS analysis, ions generated during electrospray ionization (ESI) are always accompanied by ion fragmentation, which leads to ISF ions. ISF is a naturally occurring and inevitable phenomenon that is independent of ionization voltage. 73 Annotating ISF features as real metabolites by mistake is detrimental to the downstream biological interpretation. Previous works have developed strategies to recognize ISF via manual efforts, stable isotopes, or reference standards. [74][75][76][77][78] However, many metabolic features have diverse ISF patterns and might not have a standard MS/MS spectrum available for such manual checking.
To provide an automated workflow for de novo recognition of ISF features, we developed the MS/MS library-free R package ISFrag to seek ISF features based on three patterns: (1) ISF ions coelute with their precursor ion, (2) the m/z of ISF ions appear in the MS/MS spectrum of their precursor ion, and (3) ISF ions and their precursor ion are similar in fragmentation patterns and thus have highly correlated MS/MS spectra. 18 Notably, ISFrag can be used on LC-MS data generated from full-scan, DIA, and DDA modes as long as at least one DDA analysis is performed to provide the MS/MS spectra required by ISFrag. Our results show that ISFrag achieves 100% accuracy in recognizing the ISF features that fit all three abovementioned patterns from the data of a standard mixture containing 125 endogenous metabolites. ISFrag allowed us to successfully recognize falsely annotated metabolites in a human urine dataset, determining them to be in-source fragments.
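The three ISF patterns listed above can be sketched as a simple rule check. The Python sketch below is not the ISFrag implementation. Several things in it are assumptions made for illustration: the feature dictionaries, the retention time tolerance used as a stand-in for coelution, the greedy cosine score, and all thresholds.

```python
import math


def cosine_score(spec_a, spec_b, mz_tol=0.01):
    """Cosine similarity of two MS/MS spectra given as (m/z, intensity) lists.

    Fragments are paired greedily within mz_tol; each peak is used at most once.
    """
    used = set()
    dot = 0.0
    for mz_a, int_a in spec_a:
        for j, (mz_b, int_b) in enumerate(spec_b):
            if j not in used and abs(mz_a - mz_b) <= mz_tol:
                used.add(j)
                dot += int_a * int_b
                break
    norm = math.sqrt(sum(i * i for _, i in spec_a)) * math.sqrt(sum(i * i for _, i in spec_b))
    return dot / norm if norm else 0.0


def looks_like_isf(parent, candidate, rt_tol=1.0, mz_tol=0.01, min_score=0.7):
    """Apply the three patterns; all tolerances are arbitrary illustration values."""
    # (1) The candidate coelutes with the putative precursor (retention time in seconds).
    coelutes = abs(parent["rt"] - candidate["rt"]) <= rt_tol
    # (2) The candidate m/z appears as a fragment in the precursor MS/MS spectrum.
    in_parent_ms2 = any(abs(mz - candidate["mz"]) <= mz_tol for mz, _ in parent["ms2"])
    # (3) The two MS/MS spectra are highly similar.
    similar_ms2 = cosine_score(parent["ms2"], candidate["ms2"], mz_tol) >= min_score
    return coelutes and in_parent_ms2 and similar_ms2


# Hypothetical features: a precursor and a coeluting feature at one of its fragment m/z values.
parent = {"mz": 300.10, "rt": 120.0,
          "ms2": [(282.09, 40.0), (150.05, 100.0), (105.03, 50.0)]}
fragment_feature = {"mz": 282.09, "rt": 120.4,
                    "ms2": [(150.05, 100.0), (105.03, 55.0)]}
print(looks_like_isf(parent, fragment_feature))
```

A feature that fails any one of the three checks, for example one with the same m/z but a different retention time, is not flagged.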

Quantitative measurement and statistical analysis
Besides the number of detected metabolic features, quantitative accuracy and precision are additional key drivers for delivering successful metabolomics analyses. Our lab has developed a set of bioinformatics tools to improve quantitative accuracy, precision, and statistical performance (Fig. 4). In untargeted metabolomics, the absolute quantification of every detected metabolic feature is not possible. Instead, the relative MS signal ratio, or signal fold change, between biological sample groups (e.g., normal vs. diseased) is widely used for quantitative comparison. In this case, quantitative accuracy in untargeted metabolomics is more about whether the concentration ratio can be accurately determined. Our recent works discovered that the MS signal intensity ratio can have clear quantitative biases. 19,79 Particularly for the ESI-MS analytical platform, the measured MS signal intensity ratio can be significantly lower or higher than the concentration ratio, effects termed fold change compression and inflation, respectively (Fig. 4a). Mechanistically, fold change compression and inflation are caused by the non-negligible intercept values of the linear calibration curves. Our urine metabolomics study showed that 72% of metabolic features have compressed fold changes and 16% of features have inflated fold changes. Surprisingly, only 12% of features possess unbiased MS signal intensity ratios with 10% relative error or lower. Even worse, these biased ratios exist in the linear range of ESI responses and cannot be corrected even after careful injection volume optimization. In this respect, we developed the metabolic ratio correction (MRC) workflow, an integrated analytical workflow with automated data processing tools to correct the biased MS signal intensity ratios. In addition to the routine analytical sequence, the MRC workflow analyses serially diluted QC samples to construct calibration curves for each metabolic feature (Fig. 4a). The measured MS signal intensities in biological samples are then converted to QC loading amounts for downstream quantitative comparison and statistical analysis.
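To see why a non-zero intercept biases the ratio, consider a linear response, signal = slope × amount + intercept. With a positive intercept, the signal ratio of two samples is pulled toward 1 (compression). With a negative intercept, it is pushed away from 1 (inflation). The Python sketch below illustrates the correction idea with a least-squares line fitted to a hypothetical QC dilution series. The numbers and function names are assumptions for illustration, not the MRC implementation.

```python
def fit_line(x, y):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def to_qc_amount(signal, slope, intercept):
    """Invert the calibration line to express a signal as a QC loading amount."""
    return (signal - intercept) / slope


# Hypothetical serially diluted QC series for one feature: signal = 1000 * amount + 500.
qc_amounts = [0.25, 0.5, 1.0, 2.0]
qc_signals = [750.0, 1000.0, 1500.0, 2500.0]
slope, intercept = fit_line(qc_amounts, qc_signals)

# Two samples whose true amounts differ 4-fold (2.0 vs. 0.5).
signal_a, signal_b = 2500.0, 1000.0
raw_ratio = signal_a / signal_b  # 2.5, compressed by the positive intercept
corrected_ratio = to_qc_amount(signal_a, slope, intercept) / to_qc_amount(signal_b, slope, intercept)
print(raw_ratio, corrected_ratio)  # the corrected ratio recovers about 4
```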
The fair comparison of samples with equal amounts or concentrations is also critical to quantitative accuracy, which is achieved by sample normalization. Sample normalization is especially important for biological samples with significant biological dilution effects, such as urine, saliva, and feces. [80][81][82][83] In general, sample normalization can be applied either before or after data acquisition. Pre-acquisition sample normalization measures a certain quantity that reflects the total sample amount or metabolite concentration. The samples are then reconstituted to appropriate final volumes based on the measured quantities to make the total concentration consistent between the samples. For example, creatinine level generally reflects the urine concentration and is commonly used for normalizing urine samples. 80,84,85 However, many biological sample types lack reliable quantities that represent the total sample amount for normalization. 84 Post-acquisition sample normalization is an alternate strategy that is data-driven and does not require reliable quantities for different sample types. Certain assumptions are made about the metabolomics data structure, and then normalization factors are calculated for adjusting the measured signal intensities. In essence, the accuracy of the assumption determines the post-acquisition sample normalization performance. For instance, the mass spectrum total useful signal (MSTUS) algorithm assumes equal total signal intensities among samples. However, the MS signal intensities of different metabolic features can vary by several orders of magnitude owing to different concentrations and ionization efficiencies. Therefore, the MSTUS algorithm can be dominated by high-intensity metabolic features and fail to reflect the change in the total metabolome. Probabilistic quotient normalization (PQN) addresses the issue of drastically different signal intensities and thus can be more useful.
In the PQN-based workflow, quotients of all metabolic features are calculated against the reference sample, and the median of the calculated quotients is used as a normalization factor. 86 After quotient calculation, all metabolic features are equally considered for normalization regardless of their original MS intensities. However, the median quotient only correctly represents the normalization factor when the numbers of up- and down-regulated metabolic features are equal. 20 In practice, it is quite common to see different numbers of up- and down-regulated metabolic features in metabolomics. In addition, given the abovementioned issue of signal ratio bias, calculating quotients directly using MS signals might not represent metabolic concentration changes accurately. Even so, the detected metabolic features are not pre-processed before normalization in conventional post-acquisition normalization algorithms, which causes significant bias.
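The median-quotient step, and its sensitivity to unbalanced regulation, can be illustrated with a short Python sketch. The intensities are hypothetical, and the reference sample is simply passed in by the caller.

```python
from statistics import median


def pqn_factors(samples, reference):
    """Median quotient of each sample against the reference sample."""
    factors = []
    for sample in samples:
        quotients = [x / r for x, r in zip(sample, reference) if x > 0 and r > 0]
        factors.append(median(quotients))
    return factors


reference = [100.0, 200.0, 300.0, 400.0, 500.0]

# Both samples are diluted 2-fold, so the true normalization factor is 0.5.
balanced = [50.0, 100.0, 150.0, 400.0, 125.0]    # one feature up, one feature down
unbalanced = [200.0, 400.0, 600.0, 200.0, 250.0]  # three of five features up 4-fold

print(pqn_factors([balanced, unbalanced], reference))  # [0.5, 2.0]
```

In the balanced case the median quotient recovers the dilution factor of 0.5. In the unbalanced case it returns 2.0, because most quotients belong to up-regulated features.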
In this regard, we developed MAFFIN (short for maximal density fold change normalization with high-quality metabolic features and corrected signal intensities), an accurate and robust post-acquisition sample normalization workflow for MS-generated metabolomics data that is independent of sample type (Fig. 4b). 20 MAFFIN first selects high-quality metabolic features by evaluating multiple orthogonal quantification criteria and then corrects their MS signal intensities for normalization. Then, we created an efficient method to calculate normalization factors, which is based on the maximal density fold change (MDFC) computed by a kernel density approach. 87 Unlike the PQN algorithm, which relies on balanced up- and down-regulated metabolic features, MDFC normalization assumes that the unchanged metabolic features dominate the fold change frequency. Hence, it is not influenced by the balance of up- and down-regulated metabolic features. Using simulated data, we show that as long as the percentage of unchanged metabolic features is larger than 25%, MDFC is a good representation of the true normalization factor. 20 Using twenty publicly available and two in-house metabolomics data sets, we confirmed that MAFFIN outperforms four commonly used post-acquisition normalization methods, including total intensity, median intensity, PQN, and quantile normalizations, in terms of reducing intragroup variations. The biological application of MAFFIN on a human saliva metabolomics study reduces the unwanted variation introduced by the biological dilution effect, leading to better data separation in principal component analysis (PCA) and more significantly altered metabolic features.
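The idea of taking the fold change at the highest density can be sketched with a plain Gaussian kernel density estimate on the log2 scale. This is a minimal illustration rather than the MAFFIN code. The fixed bandwidth, the grid search, and the example fold changes are assumptions.

```python
import math
from statistics import median


def mdfc_factor(fold_changes, bandwidth=0.2, grid_size=512):
    """Return the fold change at the highest kernel density (Gaussian kernel, log2 scale)."""
    logs = [math.log2(fc) for fc in fold_changes if fc > 0]
    low = min(logs) - 3 * bandwidth
    high = max(logs) + 3 * bandwidth
    step = (high - low) / (grid_size - 1)
    best_x, best_density = low, -1.0
    for i in range(grid_size):
        x = low + i * step
        # Unnormalized density is enough because only the location of the maximum matters.
        density = sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in logs)
        if density > best_density:
            best_x, best_density = x, density
    return 2 ** best_x


# Hypothetical sample diluted 2-fold: 4 unchanged features (fold change 0.5)
# and 6 up-regulated features with scattered fold changes.
fold_changes = [0.5, 0.5, 0.5, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0]

print(median(fold_changes))       # 0.875, biased by the up-regulated features
print(mdfc_factor(fold_changes))  # close to 0.5, the true dilution factor
```

Here only 40% of the features are unchanged, yet they share the same fold change and therefore dominate the density, while the median is pulled upward.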
Quantitative precision is another key factor for a successful metabolomics study. Our recent work recognizes that besides the well-recognized analytical and biological variations, untargeted metabolomics encounters additional quantitative variation, termed computational variation. 21 The computational variation is caused by automated computational data processing steps, where the software cannot accurately determine chromatographic peak heights/areas for metabolic features with poor chromatographic peak shapes (Fig. 4c). 16 Using various biological sample types, we systematically investigated how sample concentration, LC separation conditions, and data processing software contribute to computational variation. Our results suggest that the computational variation is largely determined by the data processing software. In addition, the magnitude of the computational variation is consistent across different samples when their metabolic concentrations are similar. We further developed PHPA_precision, a tool to minimize the computational variation in metabolomics studies by properly selecting between peak height or area for the peak intensity calculation method. This bioinformatics solution helped reduce the computational variation of 71% (652/915) of metabolic features, and over 31% (206/652) of the corrected features showed distinctly changed statistical significance.
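One simple way to choose between peak height and peak area for a feature is to compare their relative standard deviations (RSD) across replicate QC injections and keep the more precise one. The Python sketch below uses that criterion as an assumption. The text does not state the selection rule used by PHPA_precision, and the replicate values are hypothetical.

```python
from statistics import mean, stdev


def rsd(values):
    """Relative standard deviation of replicate measurements."""
    return stdev(values) / mean(values)


def choose_quant_method(qc_heights, qc_areas):
    """Pick the measure with the lower RSD across replicate QC injections."""
    return "height" if rsd(qc_heights) <= rsd(qc_areas) else "area"


# Hypothetical feature with a poorly integrated peak: areas scatter more than heights.
qc_heights = [100.0, 102.0, 98.0, 101.0]
qc_areas = [500.0, 650.0, 420.0, 560.0]
print(choose_quant_method(qc_heights, qc_areas))  # height
```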
Following the quantitative comparison, our lab also attempted to understand metabolomics data distributions in order to improve the performance of statistical analyses. Currently, parametric statistical models, such as Student's t-test, are widely used to extract the significantly changed metabolites. However, the requirements on data normality for these statistical analyses are often violated due to the nonlinear ESI responses in MS-based metabolomics. As a result, the statistical power can be reduced and some significantly changed metabolites are thus missed. Although nonlinear ESI response has been well-known for decades, its impact on data distribution and statistical analysis has not been systematically studied. To address this knowledge gap, we used both Monte Carlo simulations and real metabolomics data sets to quantitatively assess the diminished statistical power caused by nonlinear ESI responses (Fig. 4d). Our urine metabolomics data demonstrated that over 80% of metabolic features present nonlinear ESI response patterns, causing either left-skewed or right-skewed MS signal distributions. 22 In addition, clear relationships between the degree of reduced statistical power and sample size/effect size were observed. To address this issue, we developed PowerU, a data processing tool to minimize the non-normality induced by nonlinear ESI response. 22 Applying PowerU to a metabolomics study of mouse gut microbiome led to 105 extra metabolic features being discovered as significant, which largely reduces the chance of missing important biomarkers.
Besides nonlinear signal response, many other factors contribute to the overall non-normal metabolomics data distribution, including intrinsically non-normally distributed concentration data, sample collection, and sample preparation. As a result, the metabolomics data distributions are often diverse and complicated. However, despite the thousands of metabolomics publications every year, the study of metabolomics data distribution is limited. Additionally, in routine metabolomics practice, data transformation is commonly used to shape the various non-normal data distributions for statistical analysis. However, the most popular transformation approaches, log and square root transformations, 88 do not consider the data structure and treat all the metabolic features equally. Therefore, there is no guarantee that the data normality can be improved after applying those transformations. Recently, our work explored and modeled the metabolic feature intensity distributions using three large and publicly available data sets, which confirmed that the non-normal distribution is common and varied in untargeted metabolomics research. The metabolomics data were modeled into nine types of beta distributions, among which two low-normality types are particularly common. Given the diverse data distributions, we proposed adaptive Box-Cox (ABC) transformation, a feature-specific data transformation approach for improving data normality (Fig. 4e). 23 A power parameter, lambda, is tuned based on the data structure of each metabolic feature to ensure improved data normality after transformation. Tested on a series of Monte Carlo simulations, ABC transformation outperforms the two abovementioned conventional data transformation methods for both positively and negatively skewed data distributions. However, it is important to recognize that any nonlinear data transformation will change feature-to-feature relationships.
For the correlation analysis of a metabolic feature pair, it is recommended to use the original quantitative data rather than the transformed data. Additionally, data transformation methods can alter the overall data distribution pattern. Especially in our feature-specific data transformation workflow, different features can be subjected to different transformation functions. Consequently, the visualization of the overall metabolic changes (e.g., principal component analysis) might be distorted. 23
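A feature-specific power transformation of the kind described above can be sketched as follows. This is a simplified illustration, not the published ABC implementation: the candidate lambda grid is arbitrary, and absolute skewness is used here as a stand-in normality criterion, which is an assumption of this sketch.

```python
import math
import statistics

# Illustrative grid of power parameters; lambda = 1 leaves the shape unchanged.
CANDIDATE_LAMBDAS = [-2, -1, -0.5, 0, 0.5, 1, 2]


def box_cox(values, lam):
    """Box-Cox power transform; all values must be strictly positive."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]


def skewness(values):
    """Population skewness (third standardized moment)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def pick_lambda(values):
    """Choose, for one feature, the candidate lambda giving the least skewed result."""
    return min(CANDIDATE_LAMBDAS,
               key=lambda lam: abs(skewness(box_cox(values, lam))))


def transform_features(feature_table):
    """Transform each feature (name -> intensities) with its own lambda."""
    result = {}
    for name, intensities in feature_table.items():
        lam = pick_lambda(intensities)
        result[name] = (lam, box_cox(intensities, lam))
    return result
```

Because lambda = 1 is always among the candidates, the selected transformation in this sketch never yields a more skewed feature than the untransformed data. The fact that each feature may receive a different lambda is also what makes the caveat about distorted feature-to-feature relationships concrete.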

Metabolite annotation
Metabolite annotation is the last step before sending biological researchers a list of significantly changed metabolites for biological interpretation. According to the metabolite annotation confidence levels proposed by the Metabolomics Standards Initiative (MSI), level 1 identifications refer to metabolite structures confirmed by chemical standards with MS1, MS/MS, and retention time matched. 89 However, due to the limited chemical standards available, the majority of metabolic features remain unannotated in MS-based metabolomics studies, forming the ''dark matter'' in untargeted metabolomics. 90,91 To properly annotate unrecognized metabolites, there are generally three complementary strategies, including (1) known-to-unknown-based propagated annotation, 92 (2) de novo annotation, 93 and (3) in silico fragmentation-based annotation. 94 Our lab recognizes the limitations of each strategy and develops bioinformatics solutions to address them (Fig. 5).
First of all, annotation of unrecognized metabolites often relies on searching for known metabolites with similar chemical structures. The structural similarity can be reflected by MS/MS spectral similarity, which is the key in known-to-unknown-based metabolite annotation. Therefore, the development of a proper algorithm to compute spectral similarity is of great importance. Several spectral similarity algorithms have been proposed, including those used in Global Natural Products Social Molecular Networking (GNPS) and the NIST Hybrid Similarity Search (HSS), among others. [95][96][97] These algorithms consider the matching of both the m/z of fragment ions and the m/z differences between fragment ions and their precursors (i.e., neutral losses). They can reflect a certain degree of spectral similarity between metabolites and their one-step reaction biotransformed derivatives. However, these conventional algorithms show limited capability in capturing the common core structural component embedded in the metabolites, as the core structural information cannot be captured using fragment ions or neutral losses.
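To make the two matching criteria concrete, the sketch below counts shared fragment ions and shared neutral losses between two spectra. It is a deliberately simplified illustration and not the GNPS or NIST HSS scoring: intensities are ignored, the greedy matching and the 0.01 tolerance are assumptions, and the m/z values in the example are made up.

```python
def neutral_losses(precursor_mz, fragment_mzs):
    """Mass differences between the precursor and each fragment ion."""
    return [precursor_mz - mz for mz in fragment_mzs]


def count_matches(values_a, values_b, tol=0.01):
    """Greedy one-to-one matching of values within an absolute tolerance."""
    unused = sorted(values_b)
    matched = 0
    for a in sorted(values_a):
        for i, b in enumerate(unused):
            if abs(a - b) <= tol:
                matched += 1
                del unused[i]  # each value in B can be matched only once
                break
    return matched


def shared_peaks(prec_a, frags_a, prec_b, frags_b, tol=0.01):
    """Return (shared fragment ions, shared neutral losses) for two spectra."""
    shared_fragments = count_matches(frags_a, frags_b, tol)
    shared_losses = count_matches(neutral_losses(prec_a, frags_a),
                                  neutral_losses(prec_b, frags_b), tol)
    return shared_fragments, shared_losses


# Made-up example: spectrum B resembles spectrum A shifted by 16 mass units
# on the precursor and on two of its fragments.
spectrum_a = (200.0, [50.0, 120.0, 182.0])
spectrum_b = (216.0, [50.0, 136.0, 198.0])
```

In this toy pair, only the fragment at 50.0 matches directly, whereas the neutral losses of 80.0 and 18.0 are shared, which is how neutral loss matching links a metabolite to a modified derivative.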
To create a spectral similarity algorithm that considers the core structural information, we proposed the concept of hypothetical neutral loss (HNL), which is defined as the mass difference between a pair of fragment ions in an MS/MS spectrum (Fig. 5a). 24 These mass differences are hypothetical as (1) some HNL values of an experimental spectrum may not represent real metabolite substructures but are merely arbitrary values; and (2) some HNLs are not even generated during the fragmentation process. We demonstrated that HNL values contain core structural information that can improve access to shared structural units between two MS/MS spectra. We thus developed the Core Structure-based Search (CSS) algorithm, which considers conventional fragment ions, neutral losses, and more importantly, HNL values. Compared to existing spectral comparison algorithms, CSS shows a significantly improved correlation between spectral and structural similarities, paving the way for more accurate and informative molecular networking analysis. Furthermore, by combining the CSS algorithm, an HNL library, and a biotransformation database, we developed Metabolite core structure-based Search (McSearch), a web-based platform to facilitate the annotation of unknown metabolites by referencing the MS/MS spectra of their structural analogs.
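The definition of HNL given above, the mass difference between a pair of fragment ions in one MS/MS spectrum, translates directly into code. The sketch below only enumerates HNL values and finds those shared by two spectra; it does not reproduce the CSS scoring, and the tolerance and example m/z values are assumptions for illustration.

```python
from itertools import combinations


def hypothetical_neutral_losses(fragment_mzs):
    """All pairwise mass differences between fragment ions, sorted ascending."""
    return sorted(abs(a - b) for a, b in combinations(fragment_mzs, 2))


def shared_hnls(fragments_a, fragments_b, tol=0.01):
    """HNL values of spectrum A that also occur in spectrum B within a tolerance."""
    hnl_b = hypothetical_neutral_losses(fragments_b)
    return [x for x in hypothetical_neutral_losses(fragments_a)
            if any(abs(x - y) <= tol for y in hnl_b)]


# Made-up example: two fragments of A (120.0 and 182.0) appear in B shifted
# by 16 mass units (136.0 and 198.0), so neither matches as a fragment ion.
fragments_a = [50.0, 120.0, 182.0]
fragments_b = [50.0, 136.0, 198.0]
print(shared_hnls(fragments_a, fragments_b))
```

Here the shifted fragment pair no longer matches by m/z, but the spacing between the two fragments (62.0) is preserved in both spectra, which is the kind of shared structural unit that HNL values are meant to expose.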
During spectral similarity analysis, as well as de novo spectra interpretation, the spectral quality of experimental MS/MS matters. However, MS/MS data collected from LC-MS analyses are often contaminated because the selection of precursor ions is based on a low-resolution quadrupole mass filter. A consequence of the wide m/z isolation window is that precursor ions of other chemicals with similar m/z values can also get through the mass filter into the collision cell for fragmentation. The fragmentation of unwanted precursor ions generates contamination fragmentation ions (CFIs), which show up with true fragmentation ions (TFIs) from the targeted precursor ions, leading to ''chimeric'' MS/MS spectra. This issue has been recognized in metabolomics with the development of RAMSY. 98 To recognize and remove CFIs in experimental MS/MS spectra, we proposed a peak correlation-based approach (Fig. 5b). 25 The primary premise is that TFIs should coelute with their parent ions with highly correlated LC chromatographic patterns, but CFIs do not necessarily follow the patterns. On top of that, we developed MS2Purifier, a machine  99 These two approaches use complementary algorithms to remove contamination fragment ions, and the combined usage may lead to better spectra purification.
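The coelution premise can be illustrated with a short sketch that correlates each fragment's chromatographic profile with that of its precursor. This is a simplified illustration of the peak correlation idea only, not the MS2Purifier implementation: the function names, the 0.8 correlation cutoff, and the intensity traces are all hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length intensity traces."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trace carries no elution information
    return sxy / math.sqrt(sxx * syy)


def flag_contaminants(precursor_eic, fragment_eics, min_r=0.8):
    """Return fragment m/z values whose elution profile does not follow the precursor."""
    return [mz for mz, eic in fragment_eics.items()
            if pearson(precursor_eic, eic) < min_r]


# Hypothetical extracted ion chromatograms sampled at the same scans.
precursor_eic = [0, 2, 8, 10, 8, 2, 0]
fragment_eics = {
    85.0: [0, 1, 4, 5, 4, 1, 0],   # rises and falls with the precursor
    97.0: [5, 4, 3, 2, 1, 0, 0],   # tail of a different, earlier-eluting compound
}
print(flag_contaminants(precursor_eic, fragment_eics))
```

In this toy data the fragment at 85.0 tracks the precursor perfectly, while the fragment at 97.0 declines throughout the peak and is flagged as a likely contaminant.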
On the other hand, in silico fragmentation is a powerful solution that generates predicted MS/MS spectra for a broad range of chemicals without reference standards. 94,100-106 Particularly, combining in silico structural databases with machine learning approaches further enhances the confidence of unknown identification. 93,107,108 To achieve in silico MS/MS prediction, fragmentation rules are usually implemented, of which an important one is the even-electron rule. It states that even-electron precursor ions should follow heterolytic cleavages and predominately generate even-electron fragment ions with very few radical fragment ions (RFIs). 109 However, our study of over one million low-energy collision-induced dissociation (CID) MS/MS spectra for 27,613 unique chemical compounds in the NIST20 MS/MS spectral library shows that over 60% of MS/MS spectra of even-electron precursors contain at least 10% RFIs by ion-count (total number of ions) in positive and negative ESI modes (Fig. 5c). 110 This work indicates that the even-electron rule is widely disobeyed, and strictly following the even-electron rule may lead to the non-comprehensive prediction of MS/MS spectra.
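Whether an annotated fragment ion is a radical can be decided from its electron count: an ion with an odd total number of electrons is an odd-electron (radical) ion. The sketch below applies this parity check and computes the fraction of RFIs by ion count for one spectrum. It assumes that each fragment already has an assigned elemental composition and charge; the function names and the example fragments are hypothetical, and this is not the analysis pipeline of the cited study.

```python
# Atomic numbers of elements common in metabolites.
ATOMIC_NUMBER = {"C": 6, "H": 1, "N": 7, "O": 8, "P": 15, "S": 16}


def is_radical_ion(composition, charge):
    """True if the ion has an odd number of electrons.

    composition maps element symbols to atom counts; charge is signed
    (+1 for a singly charged cation, -1 for a singly charged anion).
    """
    electrons = sum(ATOMIC_NUMBER[el] * n for el, n in composition.items()) - charge
    return electrons % 2 == 1


def radical_fraction(fragments):
    """Fraction of fragment ions, by ion count, that are radical ions."""
    radicals = sum(is_radical_ion(comp, charge) for comp, charge in fragments)
    return radicals / len(fragments)


# Hypothetical annotated fragments from one positive-mode spectrum.
example_fragments = [
    ({"C": 6, "H": 7}, +1),           # C6H7+, even-electron
    ({"C": 6, "H": 6}, +1),           # C6H6+., radical cation
    ({"C": 2, "H": 3, "O": 1}, +1),   # C2H3O+, even-electron
    ({"C": 4, "H": 9}, +1),           # C4H9+, even-electron
]
print(radical_fraction(example_fragments))
```

A spectrum such as this toy example, with one radical among four fragments, would exceed the 10% RFI threshold by ion count mentioned above.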
Last but not least, in many metabolomics studies, biological researchers are interested in not the entire metabolome but specific classes of chemicals that are essential to the biological process. For instance, steroids are a class of molecules that play a critical role in many physiological systems and diseases, yet many steroids are unrecognized and unreported in the literature. The ability to unbiasedly and accurately detect and quantify both known and unknown steroids is of great significance. However, the recognition of unknown steroids is a big challenge. To address this question, our lab proposed a biology-driven solution. 26 In that work, we developed a CNN-based bioinformatics tool, SteroidXtract, to recognize steroid molecules in MS-based untargeted metabolomics using their unique MS/MS spectral patterns (Fig. 5d). Our results demonstrate that SteroidXtract can confidently identify a broad range of both known and unknown steroids in biological samples, greatly accelerating a variety of steroid-focused life science research. Compared to conventional statistics-driven untargeted metabolomics data interpretation, our work offers a novel automated biology-driven approach that prioritizes biologically significant molecules with high throughput and sensitivity.
In general, the prediction of chemical classes directly from MS/MS data alone does not work well for all chemical classes. This is mainly due to the limited reference spectra available, which leads to the problem of compound class imbalance. Our SteroidXtract work addressed this issue by data augmentation, the creation of artificial training data from the existing steroid MS/MS spectra. 26 However, achieving a system-level chemical classification using data augmentation has not been tested. Moreover, many chemical classes do not have clear or specific MS/MS spectral patterns, lowering the prediction sensitivity and specificity. As such, achieving generic chemical class prediction requires other structural and spectral information. The recent publication of CANOPUS (class assignment and ontology prediction using mass spectrometry) makes it possible to perform system-level compound class predictions directly from molecular fingerprints. 111 CANOPUS was trained using support vector machine and deep learning algorithms to build the connections between fragmentation patterns, molecular fingerprints, and chemical classes. A key advantage of this design is that it separates the prediction of fingerprints using MS/MS spectra and the prediction of chemical classes using fingerprints. Therefore, these two models can be trained using separate datasets. This allows the prediction of chemical class using fingerprints, not limited to the data that have available reference MS/MS spectra, and it can utilize the entire chemical database for training. For the application of CANOPUS, an inputted MS/MS spectrum is processed to generate a fragmentation tree and predicted molecular fingerprints that are then used to predict the hierarchical compound class of the represented metabolite. There are also other structural classification approaches that rely on MS/MS clustering or chemical database searching. 112,113 Future research may go towards in-depth global metabolite annotation and structural analog discovery with the aid of compound class-enhanced molecular networking. Additionally, comparative metabolomics on the compound class level may also provide a more comprehensive and intuitive mechanistic insight behind biological questions. 111,114

Conclusion and future perspectives
In this Feature Article, we elaborated on our bioinformatics solutions that address the major big data challenges regarding data acquisition, feature extraction, quantitative and statistical analysis, and metabolite annotation. A successful metabolomics study depends on the careful consideration of all these challenges given a data acquisition platform. Therefore, it is important to have a metabolomics data processing workflow that reasonably combines the newly developed computational tools. In addition, more advanced data acquisition techniques come with new data challenges. For instance, ion mobility-mass spectrometry (IM-MS) adds collision cross section (CCS) as another dimension of data complexity, imaging MS generates metabolomics data with spatial distribution in tissue samples (i.e., spatial omics), and so on. Hence, new bioinformatics tools are needed to reveal the meaningful biological information hidden in those high-dimensional data. Finally, tools for multi-omics data integration and visualization are still greatly needed. We hope that this paper can help researchers become more aware of big data challenges in MS-based metabolomics and encourage the metabolomics community to further develop bioinformatics solutions to address them.

Author contributions
T. Huan, J. Guo, H. Yu, and S. Xing wrote the manuscript jointly.

Conflicts of interest
There are no conflicts to declare.