A Metabolome- and Metagenome-Wide Association Network Reveals Microbial Natural Products and Microbial Biotransformation Products from the Human Microbiota

Experimental advances have enabled the acquisition of tandem mass spectrometry and metagenomics sequencing data from tens of thousands of environmental/host-oriented microbial communities. Each of these communities contains hundreds of microbial features (corresponding to microbial species) and thousands of molecular features (corresponding to microbial natural products). However, with the current technology, it is very difficult to identify the microbial species responsible for the production/biotransformation of each molecular feature. Here, we develop association networks, a new approach for identifying the microbial producer/biotransformer of natural products through cooccurrence analysis of metagenomics and mass spectrometry data collected on multiple microbiomes.

Metabolomics studies have shown that among all the molecules in the human metabolome, microbial metabolites are the ones most altered in metabolic and inflammatory disorders (3). These molecules include the biosynthetic products of microbiota (microbial natural products) and the microbial modifications of host, dietary, and drug molecules (microbial biotransformation products) (4).
Currently, the majority of known microbial products and biotransformation products are discovered through the targeted analysis of specific molecules, such as short-chain fatty acids, secondary bile acids, and oral drugs in model systems (e.g., mice with a controlled diet and environment) (5)(6)(7). However, these methods do not generalize to complex communities like the human microbiome, where it is impossible to control environmental factors. Moreover, targeted metabolomics analysis cannot detect novel microbial metabolites.
Recent large-scale microbiome data sets, such as the Integrative Human Microbiome Project (iHMP) (8) and the American Gut Project (AGP) (9), collect microbial and molecular abundance profiles over thousands of human microbiota samples, providing us with an unprecedented opportunity to explore the interactions between microorganisms, enzymes, and molecules in complex communities. In these projects, the abundances of tens of thousands of microbial strains/species are measured using microbial marker gene amplicon sequencing and whole-metagenome or metatranscriptome shotgun sequencing (10), and the abundances of tens of thousands of molecules are measured using untargeted liquid chromatography-mass spectrometry (LC-MS) (11). Recently, new methods have been proposed for finding associations between microbial and molecular features through the correlations of their abundance profiles across multiple microbiome samples (12,13). However, these methods fail to extend to thousands of microbiome samples. In addition, there is no consensus on how to extract features from LC-MS data or what association test should be used.
In this study, we develop an efficient pipeline to discover potential microbial metabolites and microbial biotransformations by building a cooccurrence network of microbes and metabolites using high-throughput LC-MS data and metagenomics data collected over thousands of microbiota samples. Using this strategy, we identify several microbial products and microbial biotransformation products from the human microbiome. Moreover, we develop a new method for computing the false discovery rates (FDR) of the associations and using them to benchmark various metabolomics feature extraction methods and association tests. Furthermore, we develop a new method to detect clade-specific metabolites based on the cooccurrence network and the analysis of a microbial phylogenetic tree.

RESULTS
Outline of the pipeline. Our pipeline ( Fig. 1) includes the following: (a) extracting microbial features, which could be either operational taxonomic units (OTUs) or biosynthetic gene cluster (BGC) families, (b) extracting molecular features, which could be either mass spectrometry (MS) features or tandem mass spectrometry (MS/MS) features, (c) searching for pairs of associated features and computing false discovery rates, (d) constructing the association network, and (e) assigning molecular features to phylogenetic clades.
Data sets. The AGP data set consists of LC-MS/MS and 16S rRNA data collected from the human gut microbiomes of 2,125 subjects. For a subset of these samples, shotgun metagenomics data are also available. Optimus extracted 29,567 molecular features from the LC-MS data (MinIntensity ϭ 1,000), and MS-Clustering extracted 74,913 molecular features from the LC-MS/MS data (cosine similarity threshold [] ϭ 0.4). We further applied deduplication using an m/z threshold of 0.01 and a Fisher's exact test P value threshold of 10 Ϫ5 . This decreased the number of molecular features from 29,567 to 18,940 for Optimus and from 74,913 to 73,275 for MS-Clustering. We additionally annotated the extracted molecular features using spectral library search (14) and Dereplicatorϩ (15). Using the Greengenes Database (16) as the reference, QIIME extracted 11,265 unique OTUs from the AGP data set (MinCount ϭ 0).
The data set for human microbiome isolates from cystic fibrosis patients (HUMAN-CF) consists of tandem mass spectrometry and metagenomics data collected from 243 microbial isolates from cultures of sputum samples from cystic fibrosis patients (Global Natural Product Social Molecular Networking [GNPS] data set MSV000080251). Each sample contains one or a mixture of a few (from 1 to 11) different bacteria. Based on the metagenomics data of HUMAN-CF, Quinn et al. (17) analyzed the association between microbial species and discovered that Pseudomonas and Staphylococcus aureus are anticorrelated with Gram-positive anaerobes. In this study, we obtained 23,176 molecular features from LC-MS/MS data (see Materials and Methods for details). We further applied SPAdes (18), antiSMASH (19), and BiG-SCAPE (20) to the shotgun metagenomics data and extracted 18 nonribosomal-peptide BGC families which are present in at least 10 samples.
Microbial products and biotransformation products. Microbial natural products can be detected as positive correlations between the occurrence of the microbial species and the molecules in the association network (Fig. 2a). In addition to the microbial products, the association network also reveals many microbial biotransformation products. Microbial biotransformation products are distinguished by a strong negative correlation between the occurrences of the microbial species and the precursor molecules, along with strong positive correlations between the microbial species and the product molecules (Fig. 2b).
We applied the association network pipeline to the AGP data set and found 18,623 and 8,178 associations with a P value threshold (P Threshold ) of 10 Ϫ10 for the molecular features obtained by Optimus and MS-Clustering, respectively. To explore the power of the association network (Fig. 3) in detecting microbial products and biotransformation products, we further searched the mass spectra against AntiMarin (21), the Dictionary of Natural Products database (22), and the Human Metabolome Database (23) using Dereplicatorϩ and analyzed the densely connected modules of this network that contained the molecules annotated by Dereplicatorϩ (Fig. 3).  Correlating mass spectral data to 16S rRNA data. At a P Threshold of 10 Ϫ10 , microbial features from the Pseudomonas genus are positively associated with phenazine-1-carboxylic acid (m/z 225.07), five rhamnolipids, and five pseudomonas quinolone signals (PQS) (Fig. 3b). Among the 42 rhamnolipids with unique masses produced by Pseudomonas (24), 8 are included in the GNPS spectral library. A spectral library search found four of the rhamnolipids in the AGP data set, and two (m/z 673. 40 and m/z 701.41) are significantly associated with Pseudomonas. With molecular networking (14,25), two more rhamnolipids were identified (m/z 553. 25  All of these molecules are known to be produced by Pseudomonas aeruginosa bacteria, playing roles in quorum sensing and virulence (27)(28)(29). We further mapped shotgun metagenomics data collected on samples with PQS present against PQS BGC, and we identified 2,472 out of 2,488,704 reads mapped to PQS BGC.
A Corynebacterium kutscheri OTU feature (Greengenes number 13393) is positively correlated with a molecule at m/z 495.4 (P ϭ 3 · 10 Ϫ5 ). Dereplicatorϩ annotated this molecule as corynomycolenic acid (Fig. 4). The BGC for corynomycolic acid, which is a close variant of corynomycolenic acid, has previously been discovered in Corynebacterium diphtheria strain NCTC 13129 (30). The reference genome with a feature closest to this C. kutscheri feature is that of C. kutscheri strain DSM 20755 (31) (99% identical 16S rRNA over 100% coverage), which contains a BGC with high similarity to the corynomycolic acid BGC reported in C. diphtheriae NCTC 13129 (72% identical over 52% coverage).
We also observed a positive correlation between Desulfovibrio species and cholic acid (P ϭ 10 Ϫ13 ), which is a human bile acid (Fig. 3c). This is explained by the fact that the Desulfovibrio species feed on the sulfur released by deconjugation of taurocholic acids to cholic acid (32). As sulfur is below the dynamic range of mass spectrometers, the association network fails to correlate sulfur with Desulfovibrio species. This example shows that some of the detected associations are noncausal.
Several species within the Enterobacteriaceae showed a negative correlation with cholic acid (m/z 409.29 [P ϭ 2eϪ26]) and a positive correlation with 7-oxodeoxycholate (m/z 407.28 [P ϭ 4eϪ10]), confirming the evidence that Enterobacteriaceae play a role in dehydrogenation of bile acids (35,36).
We also observed a strong correlation between Bacillus species and a steroid hormone with m/z 285.18 (P ϭ 9 · 10 Ϫ24 ). Bacillus species are known to biotransform steroids (37).
In addition, we observed a negative correlation between Oxalobacteraceae and phenylalanine (m/z 165.08 [P ϭ 6 · 10 Ϫ11 ]) and n-acetylphenylalanine (m/z 207.12 [P ϭ 3 · 10 Ϫ13 ]). In fact, phenylalanine and n-acetylphenylalanine were not detectable in any of the subjects where Oxalobacteraceae were present. Oxalobacteraceae species are shown to be capable of consuming phenylalanine as a carbon source (38).
Correlating mass spectral data to BGC families. In the HUMAN-CF data set, we correlated BGC families with molecular features and discovered an interesting BGC family containing two adenylation domains, two thiolation domains, one condensation domain, and one NAD binding domain (Fig. 5a) that was positively correlated with two molecular features (m/z 229.135 [P ϭ 4.05 · 10 Ϫ16 ] and m/z 245.125 [P ϭ 1.98 · 10 Ϫ9 ]). Dereplicatorϩ annotated these two features as phevalin (score of 4) and tyrvalin (score of 7). These annotations matched the adenylation specificities of the corresponding domains (Fig. 5). BLAST results suggest that this BGC family contains the aureusimine nonribosomal peptide synthetase from Staphylococcus aureus (100% coverage and 99.46% identity), which is known for the synthesis of phevalin and tyrvalin (40). 16S rRNA sequencing results show that Staphylococcus aureus is widely present in the HUMAN-CF data set (17).
Discovering a corynomycolenic acid BGC. We further investigated the genes responsible for the production of corynomycolenic acid in the human microbiota. Corynomycolenic acid is a member of the mycolic acid family with immunomodulatory activities that is produced by Corynebacterium and Mycobacterium species (41)(42)(43)(44). These molecules are ligands of human T cells, prompting specific immune responses. Mining the genome of C. kutscheri DSM 20755 revealed a BGC that contains all the necessary biosynthetic enzymes for the production of corynomycolenic acid (Table 1, Fig. 6). Moreover, we highlight the different genes of the two BGCs which are potentially responsible for the structural difference between the molecules from the two species (Table 2).
Assigning molecular features to the corresponding phylogenetic clades. We assigned the molecular features to the clades in the phylogenetic tree with which they were significantly associated. For this analysis, we used the Greengenes phylogenetic tree, which was pruned to keep only the OTUs that were associated with at least one metabolite. At a P value threshold of 10 Ϫ10 , 550 of the MS-Clustering features were mapped to 872 OTUs in the phylogenetic tree. Figure 7 demonstrates molecular features assigned to different clades at a P value threshold of 10 Ϫ20 .
Benchmarking. We benchmarked various feature extraction methods with various parameters by comparing the numbers of identifications at different false discovery rates. Moreover, we benchmarked four different techniques for estimating the associations between molecular and microbial features. These techniques include Fisher's exact test (for binary data), Pearson's correlation test, Spearman's correlation test, and the mutual information criterion. Our results show that Optimus and Spearman's correlation are the best feature extraction and association methods ( Fig. 8 and 9).

DISCUSSION
Recent experimental advances have enabled the acquisition of tandem mass spectrometry and shotgun metagenomics data from tens of thousands of environmental/ host-oriented microbial communities through large-scale projects, including the American Gut Project and the Integrative Human Microbiome Project. Metagenome-mining studies have revealed thousands of biosynthetic enzymes with uncharacterized substrates/products from these data sets. Moreover, metabolomics studies have revealed signals for hundreds of thousands of bioactive small molecules in the mass spectral data sets.
While these data sets represent a gold mine for discovering small molecules associated with the microbiota, manual analysis of billions of mass spectra in these data sets is infeasible, and new computational approaches are needed to integrate the large-scale metagenomics and tandem mass spectrometry data for systematic discovery of the unknown small-molecule products of the biosynthetic enzymes. In this regard, the following three questions need to be addressed. (i) Is the molecular feature associated with the microbiota? If so, which microbial species is it associated with? (ii) Which biosynthetic enzyme within the microbial species is it associated with? (iii) What is the chemical structure of the molecule? In this article, we developed a method for addressing the first question. Our method detects microbial natural products and microbial biotransformation products through  Association Network Reveals Microbial Natural Products a comparative analysis of the molecular and microbial features across multiple microbiomes. In the case of corynomycolenic acid, we further used genome mining to assign the molecule to its BGC within the genome of its microbial producer. While identification of the biosynthetic enzymes responsible for corynomycolenic acid production provides a proof of concept, novel computational methods are needed for systematic characterization of the products of the microbial biosynthetic enzymes through the association network approach. The association network detects pairwise interactions between the molecular and microbial features across thousands of microbiomes. While this method is capable of discovering microbial natural products and microbial biotransformation products, interactions that involve multiple sequential biotransformations/complex pathways cannot be handled. Moreover, many of the interactions retrieved by this method are noncausal correlations. For example, the association network finds correlating features that are caused by a confounding factor. While this results in a denser network with noncausal edges, in some scenarios, these noncausal edges can lead to the discovery of causal interactions that were missed by the network.
Currently, the association network approach is based on the use of Fisher's exact test P values, which assumes different samples are independent. While the independence assumption is natural for data sets such as that of the American Gut Project, collected from distinct individuals, confounders like health status could increase the false discovery rate. The association network approach is the first step toward detecting the complex interactions between microbial and molecular features through the comparative analysis of thousands of microbiome samples.
In addition to linking BGCs to molecules, other potential applications of associ- ation networks include detection of carbon sources depleted by microorganisms, identifying biomarkers for drug metabolism, linking microbial enzymes to xenobiotic metabolism, and identifying the role of microbial metabolites in disease. Association networks provide an untargeted approach for generating/testing vari-

FIG 7
Assigning the molecular features that are positively associated with the microbial features at a P value threshold of 10 Ϫ20 to the phylogenetic tree. The tree is trimmed to the taxonomic-order level. Numbers in boldface show the counts of molecules assigned to the corresponding clades. Heatmap shows Ϫ log 10 ͑P͒, where P is the minimal P value between the molecule and an OTU within the clade. Dereplicatorϩ molecular annotations for the known molecules are shown. The molecular features were extracted by MSClustering based on tandem mass spectral data and annotated by spectral library search and Dereplicatorϩ (level 2 and 4 metabolite identification) (46).

FIG 8
Benchmarking various feature extraction methods and association tests. Different methods are compared based on the number of associations discovered (a) and the number of unique metagenomic features associated with a molecular feature (b) at different false discovery rate thresholds. Here, we benchmark MS-Clustering and Optimus (binarized abundance with thresholds 10, 10 2 , . . ., 10 6 ) with Fisher's exact test association and Optimus (continuous abundance) with Pearson's correlation test association, Spearman's rank correlation test association, and mutual information criterion. In the case of Pearson's correlation, no association was discovered at a false discovery rate of 0.01. ous hypotheses about the causal relationships between the molecules and microbes in complex communities.

MATERIALS AND METHODS
Definitions. Consider a set of microbial community samples (Samples), a set of molecular features (Molecules), and a set of microbial features (Microbes). Here, each molecular feature is the abundance of a specific molecule (binary or continuous), and each microbial feature is the abundance of a specific microbe. Every feature X is characterized by a subset of samples that X is present in, as follows: Samples X ϭ ͕SʦSamples|X is present in S͖. Here, S represents a sample.
Inputs. The inputs to our pipeline are the untargeted mass spectrometry data and metagenomics data collected on a set of microbiome samples.
Main pipeline. The association network pipeline consists of the following steps.
(i) For microbial feature extraction, QIIME (47) is used to extract and quantify the operational taxonomic units (OTUs) from the 16S rRNA sequencing data. The QIIME output is the OTUCount matrix, where OTUCount(A, S) is the number of times an OTU A is observed in a sample S. For each OTU A, we define Samples A ϭ ͕S|OTUCount͑A,S͒ϾMinCount͖ for a threshold MinCount.
When shotgun metagenomics data are available, we can quantify BGC families on top of OTUs. First, we apply SPAdes (18) to metagenomics data to obtain genome assemblies. Second, we apply antiSMASH (19) to the genome assemblies to extract putative BGCs. Third, we use BiG-SCAPE (20) to cluster similar BGCs into BGC families, resulting in an absence-presence table of the BGC families in each sample. We exclude from analysis rare BGC families that are present in less than 10 samples.
(ii) For molecular feature extraction, molecular features from the liquid chromatography-mass spectrometry (LC-MS) data are first extracted and quantified using the feature extraction algorithm Optimus (48). Optimus outputs the FeatureIntensity matrix, where FeatureIntensity(X, S) is the intensity of a feature X in a sample S. We then select a threshold MinIntensity, and for every feature X, we define Samples X ϭ ͕S|FeatureIntensity͑X,S͒ϾMinIntensity͖. We further remove molecular features that are present in less than two samples. When LC-MS/MS data are available, we extract molecular features using the MS-Clustering algorithm (49). Since the LC-MS/MS data are more suitable for molecular-feature annotation, we use MS-Clustering as the molecular-feature extraction method when analyzing the AGP and HUMAN-CF data sets.
We also construct a set of decoy molecular features, DecoyMolecules (Fig. 1b). These decoy molecules are used to estimate the FDR. The set DecoyMolecules is created as follows: for every feature XʦMolecules, we construct a decoy feature X d , with Samples X d being a randomly chosen subset of Samples with size |Samples X |.
(iii) To perform the association test, we then search for pairwise associations between Molecules and Microbes (Fig. 1c). More specifically, we look for pairs (X, A) consisting of a molecular feature X and a microbial feature A that have a statistically significant correlation in their patterns of occurrence.

FIG 9
Benchmarking MS-Clustering and Optimus (binarized abundance with thresholds 10, 10 2 , . . ., 10 6 ) with Fisher's exact test association. Data on the x axis represent false discovery rates estimated by the Benjamini-Hochberg procedure. Data on the y axis represent the numbers of metabolite-microbe associations discovered.
Given two features X and A, to detect whether X and A are cooccurring, we consider the null hypothesis that the events "X is present in a sample" and "A is present in a sample" are independent. A statistically significant correlation in the patterns of occurrence of X and A is detected if the P value of Fisher's exact test, denoted P Value (X, Y), is lower than the selected threshold P Threshold , and the null hypothesis is rejected.
While there are other techniques for computing the associations between the molecular and microbial features, including Pearson's correlation, Spearman's correlation, and mutual information criterion, in this section, we focus on the Fisher's exact test method.
For the multiple-hypothesis testing, we compute the FDR using the target-decoy approach (TDA) (50). We first search for the associations between DecoyMolecules and Microbes and then estimate the FDR as |DecoyAssociations| ⁄ |RealAssociations|, where DecoyAssociations and RealAssociations are the sets of association pairs found in decoy and target data sets. We also use the Benjamini-Hochberg (BH) procedure for estimating the FDR.
(iv) To build the associations network, we further construct a bipartite network where the vertices are the molecular and microbial features and there is an edge between two vertices if the corresponding features are associated (Fig. 1d).
(v) We also report the associations between the molecular features and the groups of related microbial features by assigning molecular features to the clades in the phylogenetic tree that are potentially responsible for their production/biotransformation (Fig. 1e). Note that here, assignment of a molecule to a phylogenetic clade does not necessarily mean that the molecule is produced by those species. For example, those species might play a role in biotransformation of the molecule.
Given a phylogenetic tree T and a molecular feature X, we first mark all the microbial features that are positively correlated to X and count the number of marked features in every clade. Then, we select the minimal clade that has at least P percent (P ϭ 80) of features marked. If the selected clade is a proper subset of the whole tree, we assign X to this clade. We perform the steps described for every molecular feature, and for each clade, we report the set of molecular features that are assigned to it.
Deduplication of molecular features. Feature extraction methods usually report redundant features, i.e., each single molecule is reported as multiple features with similar m/z values. Such features are called "duplicates." The process of finding all groups of duplicate features and merging them into unique features is called "deduplication." We apply deduplication to remove the redundancy in the molecular features.
We consider a pair of molecular features to be duplicates if they have similar m/z values and a statistically significant correlation in their patterns of occurrence. Then, we build a graph in which molecular features are nodes and every putative pair of duplicates is connected by an edge. The connected components of the resulting graph are the groups of duplicate features. For the i-th group DuplicatesGroup i , a new consensus feature Y i is constructed with the m/z being the average m/z of all the features in DuplicatesGroup i , and Samples Y i is defined as the union ഫ XʦDuplicatesGroup i Samples X .
Benchmarking. Molecular-feature extraction consists of identification and quantification of the peaks across multiple LC-MS runs and is a fundamental step in proteomics and metabolomics. Although many tools for molecular-feature extraction have been proposed, it is not clear which one is more accurate. Moreover, it is not clear how to adjust the parameters in various feature extraction methods.
Here, we describe an approach to compare the various feature extraction methods in the microbiome-wide correlation studies. Given a set of microbial features and several feature extraction methods with various sets of molecular features, we apply the pairwise association pipeline to these sets to identify the method and the parameter settings that result in the highest number of pairs of cooccurring features discovered at a certain FDR level. To avoid bias toward methods that report higher numbers of molecular features, we also compare the numbers of discovered microbial features in these pairs. The FDR is estimated by the target-decoy approach (TDA) and the Benjamini-Hochberg procedure. Four different association tests are benchmarked, including Fisher's exact test, Pearson's correlation test, Spearman's rank correlation test, and the mutual information criterion.