University of Birmingham How close are we to complete annotation of metabolomes?

The metabolome describes the full complement of the tens to hundreds of thousands of low molecular weight metabolites present within a biological system. Identification of the metabolome is critical for discovering the maximum amount of biochemical knowledge from metabolomics datasets. Yet no exhaustive experimental characterisation of any organismal metabolome has been reported to date, dramatically contrasting with the genome sequencing of thousands of plants, animals and microbes. Here, we review the status of metabolome annotation and describe advances in the analytical methodologies being applied. In part through new international coordination, we conclude that we are now entering a new era of metabolome annotation.

Mark R Viant 1,2, Irwin J Kurland 3, Martin R Jones 1 and Warwick B Dunn 1,2

Introduction
Metabolomics is the multidisciplinary field of research concerned with the study of metabolomes, the complement of naturally-occurring and exogenous (e.g. environmental pollutants), low-molecular-weight (typically <1500 Da) metabolites present within biological systems [1]. Comprising precursors, intermediates and products of biochemical pathways, metabolites constitute some of the terminal products of higher cellular processes and collectively provide a 'fingerprint' of the complex interplay between genome and environment. From an analytical perspective, both the measurement and identification of whole metabolomes present a considerable challenge, not least due to the vast structural heterogeneity of metabolites, their large number (e.g. an estimated 200 000 structurally-distinct secondary metabolites across the plant kingdom [2]) and their wide concentration ranges (estimated to span 12 orders of magnitude [3]). For clarity, the formal definitions of metabolite annotation and identification, as developed by the Chemical Analysis Working Group of the Metabolomics Standards Initiative (MSI) [4], are shown in Table 1. The categorical scoring system defines 'identification' as the most rigorous category (level 1), while 'annotation' does not require such exhaustive analytical validation (levels 2 and 3). Currently there are no complete lists of experimentally-derived metabolites that describe the metabolome of any model organism, not even as putative annotations.
Meaningful biological inferences may only be drawn from metabolomics datasets where peaks can be structurally identified as named metabolites. Only when we can move beyond discussing unidentified peaks to rigorously identified metabolites can we fully engage in describing metabolic pathways and integrate metabolism with other levels of biological hierarchy. For over a decade, molecular identification has remained the principal technical bottleneck in metabolomics [5,6]. Hence, for metabolomics to deliver its full potential in fields from medicine to ecology, innovations in analytical workflows are urgently required. Yet the literature makes it readily apparent that the core metabolomics workflow has changed little over the past 15 years. It typically comprises sampling, measurement of metabolites by mass spectrometry (MS) and/or nuclear magnetic resonance (NMR) spectroscopy, data processing and statistical analyses, with a view to discovering peaks of biological importance [7,8]. Those peaks are typically searched against databases, providing limited putative annotation. Rarely do investigators undertake the challenging and time-consuming step of identifying peaks that are not present in databases [9], using methods that are common to natural products chemistry such as fractionation, high resolution accurate mass MS, and 1D and 2D NMR for structure determination. Typically, a significant proportion of detected peaks are not annotated or identified, with the proportion depending on the analytical platform used and the sample type. Hence it is appropriate to conclude that all experimental metabolomics studies to date would have generated additional biological insights were metabolite identification a more tractable process, which in turn may have allowed for more complete metabolome network derivation. Metabolite identification remains a colossal challenge and a step change is needed.
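The database search step described above can be illustrated with a minimal sketch of accurate-mass matching. This is not a workflow taken from any of the cited studies: the tiny library, the function name `annotate_peak`, the restriction to [M+H]+ ions and the 5 ppm tolerance are all assumptions chosen for illustration. The monoisotopic masses are standard values rounded to five decimal places.

```python
PROTON_MASS = 1.007276  # Da; added to a neutral mass to give the [M+H]+ m/z

# Hypothetical miniature library: neutral monoisotopic masses in Da.
LIBRARY = {
    "alanine": 89.04768,
    "leucine": 131.09463,
    "isoleucine": 131.09463,
    "glucose": 180.06339,
    "citric acid": 192.02700,
}


def annotate_peak(mz, tolerance_ppm=5.0, library=LIBRARY):
    """Return the sorted names of library entries whose [M+H]+ m/z
    lies within tolerance_ppm of the measured m/z."""
    hits = []
    for name, neutral_mass in library.items():
        expected_mz = neutral_mass + PROTON_MASS
        error_ppm = abs(mz - expected_mz) / expected_mz * 1e6
        if error_ppm <= tolerance_ppm:
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    print(annotate_peak(181.0707))  # one candidate in this library
    print(annotate_peak(132.1019))  # two isomers share the same mass
    print(annotate_peak(150.0000))  # no candidate: the peak stays unannotated
```

The second query returns both leucine and isoleucine, which shows why a mass match alone yields only a putative annotation: isomers cannot be separated without further evidence such as retention time, MS/MS or NMR data.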
Here, we review the status of metabolome annotation, introduce the important role of model organisms, and describe the analytical methodologies being applied.

Can model organisms help metabolome annotation?
A critical question is how such a transformative change will occur in metabolomics, to address a problem that has persisted for more than a decade. We believe a combination of approaches is required, including new analytical strategies, computational algorithms and database resources, and also a concerted effort by the metabolomics community to solve this bottleneck. This latter point has recently been recognised with the formation, in 2015, of a scientific task group of the international Metabolomics Society to progress the characterisation of metabolomes by initially focusing on a few model organisms [10]. The value of model organisms across biology and medicine is huge [11]. Although these organisms are seemingly disparate, research into bacteria, yeast, insects, worms, fish, rodents and plants has shown that the core biochemical operating principles have been conserved across all living organisms. Hence findings derived from non-mammalian model animals, for example, can shed light on biological processes in humans (Table 2).
The Model Organism Metabolomes (MOM) task group's philosophy is to leverage the critical mass of research activity and knowledge that exists for model organisms. That is, it encourages the community to focus metabolite identification efforts on systems that we already know the most about (i.e. that have species-specific metabolite databases [12-14]), that have sequenced genomes (and hence allow genome-wide metabolic reconstructions to predict metabolism [15]), and whose metabolomes, once successfully identified, will be of greatest value to the community [10]. The two primary aims of the MOM task group are to integrate disparate model organism-focused research groups into an interactive community, and to share, discuss and develop the analytical and bioinformatics strategies to progress the identification of model organism metabolomes, resulting in best practice documents (Figure 1). Ultimately, this task group has set a grand challenge: to identify and map all 'system' metabolites onto metabolic pathways, to develop quantitative metabolic models for model organisms, and to relate organism metabolic pathways within the context of evolutionary metabolomics, that is, phylometabolomics [10]. Efforts have begun to optimise analytical methods for metabolome identification, for example in Escherichia coli [16], Saccharomyces cerevisiae [17] and Caenorhabditis elegans [18], as well as to mine the literature for existing knowledge, for example in S. cerevisiae [19]. An atlas of tissue-specific metabolomes has also been initiated for Drosophila melanogaster, including both polar and lipophilic metabolites [20].

Extending our analytical strategies to progress metabolome identification
With the ambition to characterise the complete metabolomes of model organisms more deeply, what recent developments in analytical chemistry have been applied? Unlike genomics, where disruptive technologies are relatively common [21], the analytical methods used in metabolomics have changed relatively little over the last decade. What has occurred recently is a considerable growth of targeted methods for studying swathes of metabolism, likely driven by frustration at the limited peak annotation in non-targeted metabolomics, as discussed above. For example, a number of targeted LC-MS/MS assays have been developed to profile from a few tens to a couple of hundred metabolites in rice [22-24] and mammalian samples [25,26]. While these assays benefit from yielding metabolic data that are identified and often quantitative, all of these studies only scratch the surface of the thousands of metabolites estimated to comprise a metabolome. Hence, to an extent, this shift to targeted assays is a distraction (except for cases where [...]).
Table 1. Summary of levels of confidence in metabolite 'identification', as defined by the Chemical Analysis Working Group of the Metabolomics Standards Initiative [4].
So what is the current status of non-targeted metabolomics for fully characterising metabolomes? Both gas chromatography (GC) and liquid chromatography (LC) methods continue to be developed, including multidimensional chromatography and the application of multiple columns for the separation of different classes of metabolites. For example, a 'broad spectrum' GC-MS method has been developed to measure non-volatile metabolites in tropical fruits [27]; a total of 92 peaks were detected, of which the authors identified only 45. Utilising a comprehensive GC×GC approach, coupled with headspace solid phase microextraction (HS SPME) and time of flight (ToF) MS, Alves et al. reported the identification of 257 volatile metabolites from S. cerevisiae distributed over more than a dozen chemical families [28]. For a recent review of advanced multi-dimensional separations in mass spectrometry, see Ref. [29]. The inherent requirements of GC-MS for thermal stability and volatility of the analytes, or derivatives thereof, mean that this technique alone cannot deliver comprehensive metabolome annotation. Instead, the majority of non-targeted metabolomics studies continue to employ LC, and there is an increasing trend towards the application of several column types in a given study; for example, Tufi et al. used not only two LC columns (C18 and HILIC) but also GC-MS to study the freshwater snail Lymnaea stagnalis [30]. This more comprehensive analytical approach was applied specifically to obtain a broader picture of the hydrophilic and lipophilic metabolome.
The application of multiple analytical platforms, as applied to L. stagnalis [30], is an emerging trend in metabolomics. Geier et al. [31] applied three different platforms to analyse C. elegans: 1D ¹H NMR spectroscopy, GC/MS and UPLC-MS. The deeper integration of NMR and MS data in automated metabolite identification pipelines is an emerging topic [32]. An even broader range of metabolites was measured in 31 varieties of rice using HS SPME GC-MS, primary polar metabolites by GC-ToF-MS, both polar and semi-polar compounds by ¹H NMR and direct infusion MS, and multi-elemental analysis using ICP-MS [33]. While more time intensive and costly, deeper characterisation of organismal metabolomes currently requires such a multi-platform strategy. Fortunately, as long as the metabolic knowledge is captured in relevant open access databases, such as MetaboLights [34], this strategy only needs to be conducted once. A related project to deeply annotate the metabolome of the NIH model species Daphnia magna (waterflea) is underway in the primary authors' laboratory, applying multiple extraction methods, LC-MS/MS and MSⁿ methods, GC-Orbitrap MS, and 1D and 2D NMR spectroscopy. Progress has also been reported in the integration of several platforms to enable metabolite identification by UHPLC-SPE-NMR-MS [35]. In addition, approaches such as ion mobility mass spectrometry [36] and ultrahigh resolution mass spectrometry [37] hold considerable promise for contributing to metabolome annotation projects.
Another methodology that has considerable potential for aiding metabolite identification is stable isotope labelling, for example to probe the sulfur metabolome of Arabidopsis [38,39]. Isotopic ratio outlier analysis (IROA) is another isotope labelling technology, designed to generate specific ¹³C isotopic patterns in metabolites for both high resolution LC-MS and GC-MS [17,40-43]. Whereas other stable isotope labelling methods use natural abundance and 98-99% enrichment for the control and experimental populations, respectively [44-48], IROA uses enrichment levels of 95% and 5% ¹³C. This leads to more observable isotopic peaks in the mass spectra, in predictable and diagnostic patterns. Recent studies have demonstrated the promise of IROA for metabolic phenotyping in model organisms, including prototrophic S. cerevisiae [17,49] and C. elegans [43]; the latter was grown in liquid culture with ¹³C-labelled E. coli that was first grown in M9 minimal media on either 95% or 5% ¹³C glucose, creating labelled C. elegans. When samples from the 95% ¹³C and 5% ¹³C glucose labelling experiments are extracted and combined, the IROA patterns distinguish four classes of signal: ¹²C-derived molecules, ¹³C-derived molecules, artifacts (which lack IROA patterns) and derivatives of exogenously applied compounds. Only metabolites of biological origin will have mirrored ¹²C and ¹³C metabolite peaks at the same retention time. Furthermore, the abundance of the heavy isotopologues in the 5% ¹³C samples (M + 1, M + 2, etc.; the ¹²C envelope) or of the light isotopologues in the 95% ¹³C samples (M − 1, M − 2, etc.; the ¹³C envelope) follows the binomial distribution for ¹³C in metabolite products based on the initial substrate enrichment. The mass difference between the ¹²C monoisotopic peak and the ¹³C monoisotopic peak indicates the number of carbons in the metabolite.
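The binomial isotopologue pattern and the carbon-counting rule described above can be sketched in a few lines. This is an illustrative calculation only, not the IROA software: the function names are hypothetical, every carbon is assumed to carry the substrate enrichment independently, and the ¹³C to ¹²C mass difference is taken as 1.003355 Da.

```python
from math import comb

C13_C12_MASS_DIFF = 1.003355  # Da; mass difference between 13C and 12C


def isotopologue_distribution(n_carbons, c13_fraction):
    """Binomial probability of finding k 13C atoms (k = 0..n_carbons) in a
    metabolite, assuming each carbon is 13C with probability c13_fraction."""
    p = c13_fraction
    return [
        comb(n_carbons, k) * p**k * (1 - p) ** (n_carbons - k)
        for k in range(n_carbons + 1)
    ]


def carbon_count(mz_12c_mono, mz_13c_mono, charge=1):
    """Estimate the number of carbons from the spacing between the 12C and
    13C monoisotopic peaks of a mirrored IROA pair."""
    return round((mz_13c_mono - mz_12c_mono) * charge / C13_C12_MASS_DIFF)


if __name__ == "__main__":
    low = isotopologue_distribution(6, 0.05)   # 12C envelope: M, M + 1, M + 2, ...
    high = isotopologue_distribution(6, 0.95)  # 13C envelope, listed by 13C count
    print([round(x, 4) for x in low])
    print([round(x, 4) for x in reversed(high)])  # mirror image of the line above
    print(carbon_count(181.0707, 187.0908))
```

For a six-carbon metabolite the two printed envelopes are identical once the 95% pattern is reversed, which is the mirror symmetry that separates biological signals from artifacts. The final call uses illustrative m/z values for a six-carbon [M+H]+ ion and its fully ¹³C-labelled counterpart.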
Uniquely, the accurate mass IROA-GC/MS protocol, which uses both chemical ionization (CI) and electron ionization (EI), extends the information acquired from the isotopic peak patterns to molecular formula generation. The process has been formulated as an algorithm in which the numbers of carbons, methoximations and silylations are used as search constraints, and an accurate mass CI IROA library with retention times based on the Fiehn protocol has been published [17]. The combination of CI and EI IROA protocols affords a metabolite identification procedure that can identify co-eluting metabolites [17]. In summary, non-targeted stable isotope metabolite profiling using IROA reduces the complexity of global stable isotope metabolite identification [50], and can extend metabolome analysis by identifying 'known unknowns' with an IROA mirror pattern and by providing the number of carbons in the unknown metabolite.
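To make the idea of using carbon, methoximation and silylation counts as search constraints more concrete, the following is a hypothetical sketch and not the published algorithm of Ref. [17]. It assumes standard GC-MS derivatisation chemistry (each trimethylsilylation adds a net C3H8Si and each methoximation a net CH3N), that the reagent carbons are unlabelled so the IROA carbon count refers to the underivatised metabolite, and that candidate formulae for the derivatised molecule are already available. All function names are invented for illustration.

```python
import re

# Assumed net elemental gain per derivatisation step:
# trimethylsilylation replaces one H with Si(CH3)3; methoximation turns C=O into C=N-OCH3.
TMS_GAIN = {"C": 3, "H": 8, "Si": 1}
MEOX_GAIN = {"C": 1, "H": 3, "N": 1}


def parse_formula(formula):
    """Parse a simple formula such as 'C22H55NO6Si5' into {element: count}."""
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def underivatised(formula, n_tms, n_meox):
    """Remove the derivatisation groups from a formula; return None if the
    formula is too small to contain them."""
    counts = parse_formula(formula)
    for gain, n_groups in ((TMS_GAIN, n_tms), (MEOX_GAIN, n_meox)):
        for element, per_group in gain.items():
            counts[element] = counts.get(element, 0) - per_group * n_groups
    if any(value < 0 for value in counts.values()):
        return None
    return {element: value for element, value in counts.items() if value > 0}


def filter_candidates(candidates, n_carbons, n_tms, n_meox):
    """Keep candidates whose underivatised carbon count matches n_carbons."""
    kept = []
    for formula in candidates:
        core = underivatised(formula, n_tms, n_meox)
        if core is not None and core.get("C", 0) == n_carbons:
            kept.append(formula)
    return kept


if __name__ == "__main__":
    candidates = ["C22H55NO6Si5", "C21H51NO6Si5", "C23H55NO5Si5"]
    print(filter_candidates(candidates, n_carbons=6, n_tms=5, n_meox=1))
    print(underivatised("C22H55NO6Si5", n_tms=5, n_meox=1))
```

In this example the first candidate corresponds to glucose carrying one methoxime and five trimethylsilyl groups; stripping those groups leaves C6H12O6, so a carbon count of six retains it and rejects the other two candidates.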

Conclusions
The hugely beneficial impact of the Human Genome Project on 21st century science is undeniable [51,52]. No comparable large-scale experimental characterisations of organism metabolomes have been reported, and many of the studies published to date describe only a fraction of the estimated size of a metabolome. That said, efforts are now underway, from text mining to novel experimental approaches, that promise to accelerate this process, and coordination of some of these activities is being achieved through the Metabolomics Society's task group. Activity in metabolome identification is therefore expected to increase over the next couple of years, with significant returns on this investment within 5-10 years. Looking further ahead, challenges will include developing analytical strategies to quantify several thousand known metabolites simultaneously, as well as determining the spatial localisation of these compounds, for example using MS imaging [53,54]. Ultimately, a better understanding of the parts list is going to facilitate growth of several fields, including phylometabolomics, the study of organism metabolic pathways in the context of evolution.

17. Qiu Y, Moir R, Willis I, Beecher C, Tsai YH, Garrett TJ, Yost RA, Kurland IJ: Isotopic ratio outlier analysis of the S. cerevisiae metabolome using accurate mass gas chromatography/time-of-flight mass spectrometry: a new method for discovery. Anal Chem 2016, 88:2747-2754. First report using IROA technology in combination with accurate mass GC/TOF-MS, used to examine the S. cerevisiae metabolome. An accurate mass CI IROA library containing 126 metabolites with retention times based on the Fiehn protocol was established. The combination of CI and EI IROA protocols identifies co-eluting metabolites and differentiates metabolite spectra from artifacts, and 'known unknown' metabolites were easily separated, with the number of carbons identified from the IROA patterns.

24. Schafer M, Brutting C, Baldwin IT, Kallenbach M: High-throughput quantification of more than 100 primary- and secondary-metabolites, and phytohormones by a single solid-phase extraction based sample preparation with analysis by UHPLC-HESI-MS/MS. Plant Methods 2016, 12. Excellent demonstration of the quantitative capabilities of targeted metabolite analysis, here assaying more than 100 primary and secondary metabolites simultaneously following a single extraction of plant material. Such an approach is highly recommended when the metabolite 'targets' are already known, and hence non-targeted metabolomics is not required.