Mining the Unknown: A Systems Approach to Metabolite Identification Combining Genetic and Metabolic Information

Recent genome-wide association studies (GWAS) with metabolomics data linked genetic variation in the human genome to differences in individual metabolite levels. A strong relevance of this metabolic individuality for biomedical and pharmaceutical research has been reported. However, a considerable number of the molecules currently quantified by modern metabolomics techniques are chemically unidentified. The identification of these “unknown metabolites” is still a demanding and intricate task, limiting their usability as functional markers of metabolic processes. As a consequence, previous GWAS largely ignored unknown metabolites as metabolic traits for the analysis. Here we present a systems-level approach that combines genome-wide association analysis and Gaussian graphical modeling with metabolomics to predict the identity of the unknown metabolites. We apply our method to original data of 517 metabolic traits, of which 225 are unknowns, and genotyping information on 655,658 genetic variants, measured in 1,768 human blood samples. We report previously undescribed genotype–metabotype associations for six distinct gene loci (SLC22A2, COMT, CYP3A5, CYP2C18, GBA3, UGT3A1) and one locus not related to any known gene (rs12413935). Overlaying the inferred genetic associations, metabolic networks, and knowledge-based pathway information, we derive testable hypotheses on the biochemical identities of 106 unknown metabolites. As a proof of principle, we experimentally confirm nine concrete predictions. We demonstrate the benefit of our method for the functional interpretation of previous metabolomics biomarker studies on liver detoxification, hypertension, and insulin resistance. Our approach is generic in nature and can be directly transferred to metabolomics data from different experimental platforms.


Scenario CARNITINE - X-11421 and X-13431 represent medium-chain acylcarnitines
In the CARNITINE scenario, we investigated two specific unknowns that, on the one hand, display associations with fatty acid derivatives, in particular with acylcarnitines, and, on the other hand, associate with enzymes of the acyl-coenzyme A dehydrogenase (ACAD) class. Acylcarnitines represent a transport form of fatty acids tagged for mitochondrial transport and subsequent β-oxidation [1]. In previous studies, we demonstrated (a) strong GGM edges between carnitine species with a carbon atom difference of two [2], and (b) genetic associations between various acylcarnitines and loci encoding β-oxidation-related ACAD enzymes.
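A GGM edge corresponds to a partial correlation, that is, the correlation that remains between two metabolites after conditioning on the others. The following minimal sketch illustrates this on synthetic data with three variables, where conditioning on the single remaining variable is enough. The variable names `c6`, `c8`, and `c10`, the chain-like dependency, and the noise model are assumptions chosen for illustration and are not measurements from this study.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y after conditioning on the single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Synthetic chain c6 -> c8 -> c10 (toy data, NOT from the study):
# each variable is the previous one plus independent Gaussian noise.
rng = random.Random(0)
n = 5000
c6 = [rng.gauss(0, 1) for _ in range(n)]
c8 = [v + rng.gauss(0, 1) for v in c6]
c10 = [v + rng.gauss(0, 1) for v in c8]

plain_c6_c10 = pearson(c6, c10)             # clearly non-zero: indirect link via c8
partial_c6_c10 = partial_corr(c6, c10, c8)  # close to zero: no direct edge
partial_c6_c8 = partial_corr(c6, c8, c10)   # stays clearly non-zero: direct edge
```

In this toy chain, the ordinary correlation between `c6` and `c10` is substantial, yet their partial correlation nearly vanishes, so only the neighboring pairs (a carbon difference of two in the naming used here) would be retained as edges.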
The first unknown metabolite, X-11421, shares significant GGM edges with C8 and C6 carnitines and further associates with ACADM, the ACAD enzyme for medium-chain length fatty acyl residues. In the context of our previous findings and considering the mass peak of X-11421 (314.2 m/z, positive mode), we therefore hypothesized that X-11421 is a medium-chain length carnitine with 10 carbon atoms.
Matching our computational prediction, this unknown has indeed been experimentally identified as cis-4-decenoyl-carnitine, a carnitine with 10 carbon atoms and an ω-6 double bond, by testing the pure compound. Note that carnitines shift elution times dramatically in relation to their RI markers on the analytical platform used in this work. The cis-4 form was confirmed in a spiking experiment in a well-characterized human plasma sample, which was run together with the original samples.
The second unknown metabolite, X-13431, is linked to a C11 free fatty acid in the GGM and associates with the ACADL locus. In a previous study, this locus has been shown to associate with C9 carnitine levels [3]. This observation, together with the mass peak detected for the unknown (302.3 m/z, positive mode), makes C9 carnitine a good candidate for X-13431.
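The mass argument used for both unknowns can be sketched as a small candidate search. The example below assumes that an acylcarnitine with an acyl chain of n carbons and d double bonds has the elemental composition C(7+n)H(13+2n-2d)NO4 and is detected as [M+H]+ in positive mode. The function names, the chain-length range, and the tolerance of 0.3 m/z (chosen because the peaks are reported to one decimal) are assumptions for illustration only.

```python
# Monoisotopic element masses and the proton mass (standard values, in Da).
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}
PROTON = 1.00727646

def acylcarnitine_mz(n_carbons, double_bonds=0):
    """[M+H]+ m/z for the assumed composition C(7+n)H(13+2n-2d)NO4."""
    counts = {
        "C": 7 + n_carbons,
        "H": 13 + 2 * n_carbons - 2 * double_bonds,
        "N": 1,
        "O": 4,
    }
    neutral = sum(MONO[element] * k for element, k in counts.items())
    return neutral + PROTON

def candidates(observed_mz, tol=0.3):
    """List (chain length, double bonds) pairs whose [M+H]+ lies within tol."""
    hits = []
    for n in range(2, 19):   # hypothetical search range C2..C18
        for d in (0, 1):     # saturated or one double bond
            if abs(acylcarnitine_mz(n, d) - observed_mz) <= tol:
                hits.append((n, d))
    return hits

hits_x11421 = candidates(314.2)  # peak reported for X-11421
hits_x13431 = candidates(302.3)  # peak reported for X-13431
```

Under these assumptions, the search returns a 10-carbon chain with one double bond for 314.2 m/z and a saturated 9-carbon chain for 302.3 m/z. Such a search only narrows the candidates: it does not exclude isobaric compounds outside the assumed acylcarnitine family, which is why the genetic and GGM evidence is needed in addition.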
Our prediction is experimentally confirmed by the fragmentation of X-13431, as the molecule produces fragments shared by mid- and long-chain acylcarnitines and several neutral losses (loss of 59 m/z and 161 m/z) that are highly diagnostic of carnitines. With respect to chromatography, X-13431 elutes between C8 and C10 carnitines, thus further supporting the hypothesized C9 carnitine. The accurate mass of 301.22476 ± 0.0015 Da determined for X-13431 corresponds to the molecular formula C16H31NO4 and, thus, also matches C9 carnitine. Due to the lack of a commercial source for pure C9 carnitine, the final confirmation of the predicted chemical identity by testing the pure compound is still pending.
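The agreement between the measured accurate mass and the proposed formula can be checked with a few lines of code. This is a minimal sketch that assumes standard monoisotopic element masses and handles only simple formulas without brackets or charges. The helper names `monoisotopic_mass` and `matches` are hypothetical.

```python
import re

# Standard monoisotopic masses (Da) for the elements needed here.
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}

def monoisotopic_mass(formula):
    """Sum monoisotopic masses for a simple formula such as 'C16H31NO4'."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC[element] * (int(count) if count else 1)
    return total

def matches(formula, measured, tolerance):
    """True if the theoretical mass lies within the measurement tolerance."""
    return abs(monoisotopic_mass(formula) - measured) <= tolerance

theoretical = monoisotopic_mass("C16H31NO4")        # about 301.2253 Da
within = matches("C16H31NO4", 301.22476, 0.0015)    # values taken from the text
```

With the values reported above, the theoretical mass deviates from the measured one by roughly 0.0005 Da, which is inside the stated tolerance of 0.0015 Da.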

Scenario BILIRUBIN - X-11793 represents an oxidized bilirubin variant
In the BILIRUBIN scenario, we focused on the unknown metabolite X-11793 (601.1 m/z, positive mode), which shares a GGM edge with a specific bilirubin stereoisomer (EE) and associates with the UGT1A locus encoding the enzyme UDP glucuronosyltransferase 1 family, polypeptide A. The bilirubin stereoisomers and biliverdin, which shows close proximity in the GGM, are degradation products of heme, the oxygen-carrying prosthetic group contained in hemoglobin [4]. For further metabolization and excretion, the very insoluble bilirubin must be transformed into soluble derivatives. Glucuronidation of bilirubin represents the main mechanism for this transformation in human metabolism. The transformation is mainly catalyzed by an enzyme encoded at the UGT1A1 locus [5,6], matching the observations in our data that X-11793 and three of the four degradation products display genetic associations with the UGT1A locus.
Since X-11793 is embedded in the biochemical and genetic network of bilirubin derivatives and also shares their association with the UGT1A locus, we assumed that X-11793 represents a further bilirubin derivative. Moreover, the mass difference between bilirubin and X-11793 is 15.9, which might correspond to the addition of oxygen. We therefore predicted X-11793 to be an (ep)oxidized bilirubin as a possible result of bilirubin oxidation mediated by cytochrome P450. Such oxidation processes have previously been suggested as alternative routes for the metabolization of bilirubin besides glucuronidation [7,8,9].
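The step from an observed mass difference to a candidate modification can be sketched as a lookup against a table of known mass shifts. The table below is a deliberately short, illustrative selection and not the list used in this work. The tolerance of 0.2 is an assumption motivated by peaks being reported to one decimal, and `annotate_shift` is a hypothetical helper name.

```python
# Illustrative monoisotopic mass shifts (Da) of a few common modifications.
SHIFTS = {
    "methylation (+CH2)": 14.01565,
    "oxidation (+O)": 15.99491,
    "hydration (+H2O)": 18.01056,
}

def annotate_shift(delta, tol=0.2):
    """Return the names of all table entries within tol of the observed difference."""
    return [name for name, mass in SHIFTS.items() if abs(mass - delta) <= tol]

bilirubin_shift = annotate_shift(15.9)  # difference between bilirubin and X-11793
```

With this table, a difference of 15.9 is annotated as the addition of one oxygen atom. A real table would contain many more entries, some of them with similar nominal masses, so a hit of this kind is a hypothesis to be tested against accurate mass and fragmentation data.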
From an experimental perspective, the neutral accurate mass of X-11793 of 600.25859 Da, corresponding to C33H36N4O7, perfectly matches the formula for the predicted (ep)oxidized bilirubin variant. The fragmentation pattern produced by the unknown molecule further supports the hypothesis: bilirubin generates fragments with 285 m/z and 299 m/z, corresponding to a cleavage of the central C-C bond of the molecule. If the hypothesized (ep)oxidized bilirubin broke at the same position, it would produce fragments with 301 m/z and 299 m/z accordingly, which both occur in the fragmentation spectrum of X-11793. The final confirmation of the prediction by running pure epoxidized bilirubin is still pending due to the lack of commercial sources of the pure substances.
[Figure: X-11793 (601.1 m/z, RI = 3634, positive mode), predicted as epoxidized bilirubin]
X-11793, identified as (ep)oxidized bilirubin, might represent an interesting additional biomarker for the efficacy of heme degradation processes, which play an important role in various diseases. Serum concentrations of bilirubin as well as the UGT1A locus, which encodes the enzyme mainly responsible for the degradation of bilirubin, are not only associated with bilirubin turnover-related syndromes such as jaundice but also with different cancer variants and coronary heart disease (CHD) [5,10,11,12,13]. While jaundice is caused by high bilirubin concentrations, bilirubin has proven to be an effective antioxidant [14], which might explain the association of higher bilirubin concentrations with a reduced risk of CHD and various forms of cancer.
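Returning to the fragmentation argument for X-11793, the reasoning can be written down as a small check: shift one of the two bilirubin fragments by the nominal mass of oxygen and test which of the resulting fragment pairs is present in the spectrum. The sketch uses nominal masses only, and the set of observed peaks contains just the two peaks mentioned in the text; a real spectrum has many more. The function names are hypothetical.

```python
OXYGEN = 16  # nominal mass of one oxygen atom

def shifted_fragment_pairs(parent_fragments, shift=OXYGEN):
    """For each fragment, give the pair expected if that fragment alone carries the shift."""
    pairs = {}
    for i, carrier in enumerate(parent_fragments):
        pair = list(parent_fragments)
        pair[i] = carrier + shift
        pairs[carrier] = tuple(pair)
    return pairs

def supported(pairs, observed_peaks):
    """Keep only those hypotheses whose complete fragment pair was observed."""
    observed = set(observed_peaks)
    return {carrier: pair for carrier, pair in pairs.items() if set(pair) <= observed}

pairs = shifted_fragment_pairs((285, 299))  # bilirubin fragments from the text
hits = supported(pairs, {299, 301})         # peaks reported for X-11793
```

Here the only supported hypothesis is the one in which the 285 m/z half of bilirubin carries the additional oxygen, giving the pair 301 m/z and 299 m/z described above.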

Scenario ASCORBATE - X-11593 represents O-methylascorbate
In the ASCORBATE scenario, we investigated the unknown X-11593 (189.2 m/z, negative mode), which is close to threonate, ascorbate and related substances in the GGM. These metabolites are tightly interconnected in the ascorbate (vitamin C) pathway. Furthermore, we found significant associations of X-11593 with SNPs in the gene encoding catechol-O-methyltransferase (COMT), an enzyme relevant for [...]. Since, according to the GWAS, X-11593 is probably a substrate or a product of O-methylation, we determined the mass differences to the known metabolites neighboring X-11593, namely ascorbate and threonate. While the mass difference between X-11593 and threonate is 54, X-11593 and ascorbate show a mass difference of 14, which corresponds to the addition of a methyl moiety. Moreover, in ascorbate, the double bond within the five-membered ring with its two hydroxyl moieties could "mimic" the corresponding planar substructure in catechol, on which COMT usually acts. Finally, the methylation of ascorbate through the catalysis of COMT has already been shown experimentally [15]. These observations make O-methylated ascorbate derivatives (most probably 2-O-methylascorbate) good candidates for X-11593.
From an experimental perspective, our hypothesis is supported by the accurate neutral mass of 190.04787 Da determined for X-11593. Based on the accurate mass, the molecular formula for X-11593 is C7H10O6, matching our prediction. The retention time of X-11593 shows a slight shift compared to the time for ascorbate. This shift matches the shift expected for adding a methyl group. Moreover, X-11593's primary fragment loss is 60 m/z, which is the same as for ascorbate. The loss of 15 m/z, also seen for X-11593, is typical for phenols substituted with an -OH and an -OCH3 group. Due to the lack of a commercial source for 2-O-methylascorbate, the confirmation through the spectrum of the pure substance is still pending.
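The link between the formula of X-11593 and methylation can be made explicit by subtracting the elemental composition of ascorbate from that of the unknown. In the sketch below, the formula C6H8O6 for neutral ascorbic acid is not stated in the text and is supplied here as a standard value. The helper name `formula_difference` is hypothetical.

```python
from collections import Counter

X_11593 = Counter(C=7, H=10, O=6)   # formula derived from the accurate mass
ASCORBATE = Counter(C=6, H=8, O=6)  # neutral ascorbic acid (standard value, not from the text)

def formula_difference(product, substrate):
    """Element-wise difference product minus substrate, without zero entries."""
    diff = Counter(product)
    diff.subtract(substrate)
    return {element: n for element, n in sorted(diff.items()) if n != 0}

delta = formula_difference(X_11593, ASCORBATE)
```

The resulting difference of one carbon and two hydrogen atoms is the net change expected when a hydrogen atom is replaced by a methyl group, in line with the nominal mass difference of 14 noted for this scenario.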