Unraveling the functional dark matter through global metagenomics

Pavlopoulos, Georgios A.; Baltoumas, Fotis A.; Liu, Sirui; Selvitopi, Oguz; Camargo, Antonio Pedro; Nayfach, Stephen; Azad, Ariful; Roux, Simon; Call, Lee; Ivanova, Natalia N.; Chen, I. Min; Paez-Espino, David; Karatzas, Evangelos; Iliopoulos, Ioannis; Konstantinidis, Konstantinos; Tiedje, James M.; Pett-Ridge, Jennifer; Baker, David; Visel, Axel; Ouzounis, Christos A.; Ovchinnikov, Sergey; Buluç, Aydin; Kyrpides, Nikos C.

doi:10.1038/s41586-023-06583-7

Download PDF

Article
Open access
Published: 11 October 2023

Unraveling the functional dark matter through global metagenomics

Nature volume 622, pages 594–602 (2023)Cite this article

46k Accesses
12 Citations
350 Altmetric
Metrics details

Subjects

Abstract

Metagenomes encode an enormous diversity of proteins, reflecting a multiplicity of functions and activities^1,2. Exploration of this vast sequence space has been limited to a comparative analysis against reference microbial genomes and protein families derived from those genomes. Here, to examine the scale of yet untapped functional diversity beyond what is currently possible through the lens of reference genomes, we develop a computational approach to generate reference-free protein families from the sequence space in metagenomes. We analyse 26,931 metagenomes and identify 1.17 billion protein sequences longer than 35 amino acids with no similarity to any sequences from 102,491 reference genomes or the Pfam database³. Using massively parallel graph-based clustering, we group these proteins into 106,198 novel sequence clusters with more than 100 members, doubling the number of protein families obtained from the reference genomes clustered using the same approach. We annotate these families on the basis of their taxonomic, habitat, geographical and gene neighbourhood distributions and, where sufficient sequence diversity is available, predict protein three-dimensional models, revealing novel structures. Overall, our results uncover an enormously diverse functional space, highlighting the importance of further exploring the microbial functional dark matter.

Towards the biogeography of prokaryotic genes

Article 15 December 2021

A genomic catalog of Earth’s microbiomes

Article Open access 09 November 2020

Functional and evolutionary significance of unknown genes from uncultivated taxa

Article Open access 18 December 2023

Main

Metagenome shotgun sequencing has become the method of choice for studying and classifying microorganisms from various biomes¹. With the latest advances in whole-genome sequencing technologies and the constant improvements in quality and cost efficiency, large-scale sequencing has become increasingly easier, faster and more affordable. This has led to a considerable increase in metagenomic sequencing data over the past few years, therefore making them an indispensable resource for investigating the microbial dark matter².

To elucidate the genetic composition of a metagenomic sample, two major approaches are typically used, each with distinct advantages and disadvantages. In the first, sequencing reads are accurately mapped to a known, annotated set of reference genome sequences to provide a quick overview of the presence of known organisms, genes and potential functions. MG-RAST⁴ is one system that excels in this type of analysis. In the second approach, massive de novo assembly of the reads into contigs/scaffolds can provide invaluable insights into the presence of previously undescribed organisms and their genetic makeup. Recent technological advancements in assembly and binning tools⁵ have led to a significant increase in the assembled fraction of the average metagenome, coupled with a parallel exponential increase in the number of metagenome-assembled genomes (MAGs). Data management and comparative analysis systems supporting this type of data include Integrated Microbial Genomes & Microbiomes (IMG/M)⁶ and MGnify⁷.

However, both approaches share the same major limitation with respect to gene functional annotation, which relies on predicting function by homology searching against reference protein databases, such as COG⁸, Pfam³ and KEGG Orthology⁹. As a result, any genes predicted in assembled metagenomic data that do not map to reference protein families are typically ignored and dropped from subsequent comparative analysis. To eliminate this reliance on reference datasets and to estimate the breadth of unexplored functional diversity, referred to as the functional dark matter, an all-versus-all metagenomic comparison is required. Such a task requires considerable computational resources, yet reaching such levels of scalability remains technically challenging. Although some excellent efforts to address this issue have been recently reported^10,11,12, metagenomes have not yet been comprehensively surveyed to uncover the functional dark matter.

Here we present a scalable computational approach for identifying and characterizing functional dark matter found in metagenomes. First, we identified the novel protein space present in 26,931 metagenomic datasets from IMG/M, after removing all genes with matches to the IMG database of over 100,000 reference genomes or Pfam. We next clustered the remaining sequences into protein families and explored their taxonomic and biome distributions and, where possible, predicted their tertiary (three-dimensional (3D)) structures.

The novel protein sequence space

We initially collected all protein sequences (longer than 35 amino acid residues) from all public reference genomes and assembled metagenomes and metatranscriptomes hosted in the IMG/M platform⁶. In total, we extracted all protein sequences from 89,412 bacterial, 9,202 viral, 3,073 archaeal and 804 eukaryal genomes, resulting in a final dataset of 94,672,003 sequences. The reference genomes included in this study consisted solely of isolate genomes, not MAGs or single-amplified genomes. Similarly, for unbinned metagenomes, we extracted all predicted protein sequences from scaffolds of at least 500 bp and with lengths of at least 35 amino acids from 26,931 datasets (20,759 metagenomes and 6,172 metatranscriptomes), hereafter referred to as the environmental dataset (ED). This resulted in a non-redundant set of 8,364,611,943 predicted proteins or protein fragments. To identify the functional dark-matter component of this dataset, we first discarded any protein sequence with hits to Pfam⁸ or to any sequences from the reference genome set. The final non-redundant catalogue representing the unexplored metagenomic protein space consisted of 1,171,974,849 protein sequences (14% of the total).

Novel protein families

We next clustered the 1.1 billion ED proteins using a graph-based approach. For comparative purposes, we followed the same approach for the 94 million proteins from reference genomes. First, an all-versus-all similarity matrix was built for each of the two gene catalogues (that is, proteins from reference genomes and those from the ED) by calculating all significant pairwise sequence similarities. Each of the two graphs was then analysed to identify sequence-similarity-based protein clusters. For this purpose, we used HipMCL¹³, a massively parallel implementation of the original MCL algorithm¹⁴ that was previously developed for this scale of data and that can run on distributed-memory computers. The whole process from data retrieval to cluster generation is shown in Fig. 1a.

**Fig. 1: Sequence clustering overview.**

Although most clusters with at least 50 members (and possibly even those with at least 25 members) probably represent potentially functionally important clusters, we restricted the subsequent analysis to the larger families with at least 100 members to focus on higher-quality data as well as better candidates for predicting structures (Table 1). In total, we identified 106,198 families with at least 100 members that will be referred to as novel metagenome protein families (NMPFs) (Table 1 (right column)). For comparison, we identified 92,909 protein clusters in the corresponding set of protein clusters with at least 100 members from reference genomes. By directly comparing the two clustered sets (reference versus ED protein clusters), we observed an increase in the ED protein clusters by greater than 14-fold for clusters with at least 3 members, greater than 3-fold for clusters with at least 25 members, around a 2-fold increase for clusters with at least 50 and 75 members as well as an increase for clusters with at least 100 members (Table 1). Although the metagenome sequence space is intrinsically more fragmented compared with the reference genomes (Supplementary Methods and Supplementary Fig. 1), and a higher percentage of genes would be erroneous or incomplete (which is also one of the reasons we decided to focus all further analysis on the larger clusters), these results also suggest that much of protein sequence space remains to be explored. This is also supported by rarefaction curves generated from the ≥100-member clusters (Fig. 1b). These curves show that, as more samples become available, the cluster number increases linearly for reference genomes but exponentially, without reaching a plateau, for metagenomes.

Table 1 HipMCL clustering of proteins from reference genomes and metagenomes and their corresponding clusters of different protein family sizes (cumulative)

Full size table

Biome distribution

To determine the biome distribution landscape of the NMPFs, the corresponding metadata were collected for each sample from IMG/M⁶ using the GOLD database¹⁵ ecosystem classification scheme¹⁶ (Supplementary Table 1). The biome distribution of the NMPFs is shown in Fig. 2a,b and Extended Data Fig. 1. Here the three main GOLD ecosystems (environmental, host-associated and engineered) are further divided into eight more specific ecosystem types: freshwater, marine, soil, plants, human, non-human mammals, other host-associated and engineered. Examining the network topology, we observed minimum gene sharing within each NMPF across the three broad ecosystems, in accordance with recent observations of protein families from 13,174 metagenomes¹⁷, with the exception of soil/plant associations (see below). However, 7,692 NMPFs (7%) were found to have members across all of the eight ecosystem types. The properties of the top NMPFs distributed across all ecosystem types are shown in Supplementary Table 2, while the properties of the top NMPFs of each distinct ecosystem type are shown in Supplementary Tables 3 and 4. In addition to the analysis presented above, each ecosystem was further divided into subcategories for finer analysis (Extended Data Figs. 2–5 and Supplementary Fig. 2).

**Fig. 2: Ecosystem analysis of NMPFs.**

The largest number of NMPFs was shared between soil and plant environments (62% of the soil and 96% of plant-associated families), as would be expected due to the strong overlap of the sampling in these ecosystems (that is, most of the plant samples are from the rhizosphere) (Fig. 2a and Extended Data Fig. 3). This was followed by NMPFs shared between soil and freshwater, which could be primarily due to the assignment of wetland and sediment samples under the freshwater ecosystem classification. For the same reason, we observed a notable overlap between plants and freshwater NMPFs as well as between soil, freshwater and plant NMPFs. Conversely, only 37% of freshwater and 46% of marine NMPFs were shared with each other. Even fewer protein families were shared between ecosystem types such as human, non-human mammals and host-associated. On the other hand, a rather substantial overlap in NMPFs between human and engineered environments was observed (Fig. 2). This is not surprising, considering that engineered environments largely contain samples from human-waste-related ecosystems (such as solid waste and wastewater). Similarly, an overlap exists between freshwater and engineered environments, as well as between freshwater and host-associated types (human, non-human mammals and other host-associated), as shown in Extended Data Fig. 1. These overlaps could be indicative of phenomena such as faecal contamination of freshwater environments, leading to the co-occurrence of the same NMPFs—and, therefore, the same microbial communities—in different ecosystem types.

The percentage of ecosystem-specific NMPFs varied significantly across each of the eight ecosystem types, with the highest percentage observed for host-associated (non-human mammals) (85.6%) and host-associated (other) samples (79.2%), followed by marine (48.4%) and then soil (14.2%) samples (Fig. 2c). This is explained by the unique characteristics of the environments contained in these ecosystem categories, for example, oceanic environments of marine samples, and even more so in the case of the host-associated category, which contains a diverse array of microbiome hosts with significant biological differences (for example, arthropods and annelids) (Extended Data Fig. 4). In contrast to marine samples, freshwater samples had a very small percentage of ecosystem-type-specific families, mostly due to a large number of wetland and sediment samples with strong associations with soil, as did the plant/rhizosphere-related samples with soil samples (Fig. 2a).

Finally, to investigate the ecosystem distribution of NMPFs, the ecosystem prevalence of the most-abundant NMPFs of each ecosystem type was evaluated. The prevalence of each NMPF in an ecosystem (for example, freshwater) was calculated as the number of family ecosystem-associated datasets over the total number of ecosystem-associated datasets in the study (Supplementary Table 3). Despite the existence of NMPFs strongly associated with a particular ecosystem type (for example, >80% of NMPFs), their prevalence in the overall datasets associated with said ecosystems was rather low, with most NMPFs distributed across 5–20% of the samples associated with each ecosystem type. The only exception was the non-human-mammal-associated NMPFs, for which prevalence reached up to around 45% of the total non-human mammalian datasets.

Taxonomic distribution

Taxonomic assignment of NMPFs was performed on the basis of the available taxonomy information of the corresponding scaffolds in IMG, for each member of the clusters¹⁸. In cases in which no such annotation was available, we used a combination of additional approaches to computationally infer the taxonomy of the scaffolds (Methods). Of the total 17,280,119 IMG/M scaffolds containing the NMPF members, 8,049,154 were classified as Bacteria, 382,761 as Archaea, 1,184,393 as Eukaryota and 1,406,588 as viruses, leaving 6,257,223 as unclassified.

The taxonomic distribution of the NMPFs, on the basis of their corresponding scaffold taxonomic assignment, is shown in Fig. 3a and Extended Data Fig. 6. The majority of protein families included sequences with multiple taxonomic assignments (such as bacterial and unclassified, or bacterial and viral). The largest category consisted of families with bacterial/unclassified sequences, followed by viruses/unclassified and bacteria/viruses. A much smaller group of families was assigned to Eukarya and even fewer families to Archaea. Finally, 7,253 clusters had no taxonomic information at all.

**Fig. 3: Taxonomic composition and occurrence of NMPFs in bacterial and archaeal MAGs.**

As no reliable de novo eukaryotic gene predictor exists for unbinned metagenomes¹⁹, a lot of sequences may come from eukaryotic scaffolds that may contain translation errors (such as mistranslated introns). However, analysing the contents of these clusters (Supplementary Methods) showed that their majority include proteins from Bacteria and Archaea alongside Eukarya, with very few NMPFs containing only eukaryotic sequences (Supplementary Data 6). Moreover, more than half of these clusters are validated by metatranscriptomic data. These two observations supported the quality of the eukaryotic-containing NMPFs.

Subsequently, we evaluated whether any of the NMPF proteins (and their corresponding families) were found in any of the recently identified MAGs from the Genomes from Earth’s Microbiomes (GEM) catalogue²⁰. Specifically, we examined whether any of the scaffolds containing genes of the NMPFs were binned in any of the 52,515 MAGs of the GEM catalogue. This revealed that only 17,953 genes, coming from 7,937 NMPFs (7.4% of total) (Fig. 3b,c), were found within the GEM catalogue, of which the vast majority (93%) was from uncultured species. For those families that were present in two or more MAGs, we noticed a strong narrow taxonomic distribution, with more than two-thirds being restricted to a single species or genus, and only a very small number found across multiple families, classes or phyla (Fig. 3d). NMPFs were found to be statistically enriched in several phyla common in soil environments (for example, Gemmatimonadota, Acidobacteriota, Crenarchaeota and Myxococcota) and statistically depleted from several phyla found in humans and other host-associated environments (Firmicutes, Proteobacteria and Bacteroidota; Fig. 3e). Taken together, these results reveal that a significant fraction of functional diversity remains taxonomically orphan despite improvements in sequencing throughout and large-scale MAG reconstructions.

Metadata distribution

We next examined the geographical distribution of NMPFs (Extended Data Fig. 7). A very small number of families (1,372; 1.3%) was found to have limited geographical distribution (within 1 km), and this number only moderately increased (4,330; 4%) when we allowed for a maximum distance of 1,000 km. Most of these families were found in plant, soil and freshwater ecosystems. A very small number of these families included members found in marine ecosystems or human samples as expected from the higher microbial dispersal in these ecosystems (Extended Data Fig. 7f,g).

The majority of NMPFs (64,186 or 60.44%) comprised a mixture of proteins from both metagenomes and metatranscriptomes, further validating their existence, whereas 38,292 (36.06%) of NMPFs contained proteins found exclusively in metagenomes and 3,720 (3.50%) of NMPFs contained proteins found only in metatranscriptomes (Supplementary Table 5). The percentage of families containing members from both metagenomes and metatranscriptomes steadily decreased along with the number of members per family. NMPFs found in both metagenomes and metatranscriptomes also had the widest sample distribution, that is, the clusters were found in the largest numbers of samples (Supplementary Table 6). The majority of these clusters was classified in environmental ecosystems (soil and, to a lesser extent, marine and freshwater samples) and primarily contained bacterial and unclassified sequences.

To estimate the distribution of novel protein clusters among the environmental sequencing data, we compared the number of novel proteins, extracted from each scaffold and used in this study, against the total number of genes/proteins in the respective scaffold. Most analysed scaffolds (13,407,728 or 77.59%) contained both novel and known genes (top 20 scaffolds; Supplementary Tables 7–11). A comparison of novel versus the total number of genes in these scaffolds revealed that the size of the scaffold or total number of genes per scaffold was not correlated with the number of novel genes. The largest scaffold in our study (5,123,848 bp, 4,302 genes) contained only one novel sequence. Generally, the largest scaffolds in the study contained only a limited number of novel sequences and originated from bacterial or unclassified metagenomic samples (Supplementary Table 9). Conversely, the scaffolds with the most novel sequences were of variable length (and gene count) and mostly originated from viruses (Supplementary Table 10).

As the majority of novel proteins (14,185,414 sequences) was located next to known genes, we investigated the co-occurrence of NMPFs with neighbouring genes assigned to the same Pfam family. Additional annotation was obtained by mapping each NMPF’s co-occurring Pfam domains to their corresponding COG functional categories; this can be used to provide further information on each family’s gene neighbourhood. The distribution of NMPFs across functional categories is given in Supplementary Figs. 3–5. Conserved gene neighbourhoods suggest functional coupling²¹ and can therefore be used to provide additional lines of information for putative function prediction. Accordingly, family F004468 was found to co-occur with ribosomal proteins in 78% of scaffolds (that is, in 118 scaffolds out of the 151 scaffolds in which it was encoded), suggesting that it has a translation-related function. Similarly, family F021307 was found within a probable chloroplast ribosomal protein operon in 67% of encoding scaffolds. In total, 7,625 NMPFs were found to have greater than 50% co-occurrence with specific Pfams, while 585 families had greater than 90% co-occurrence with a Pfam family (Supplementary Data 1). These associations can also be used to predict a functional role for NMPFs; a few examples are given in Extended Data Figs. 8 and 9, in which the gene neighbourhoods of selected NMPFs are presented as association networks, combined with functional annotation from COG.

Structural distribution

Recent breakthroughs in protein structure prediction²² have enabled fast and accurate structural characterization of protein sequences. Metagenomic sequences have been shown to represent a particularly rich source for the discovery of novel structures^23,24. Here we ran AlphaFold2²² on NMPFs with at least 16 diverse sequences, or where TrRosetta²⁵ predicted a well-structured protein (Methods). The results are summarized in Fig. 4a. Out of the 81,345 NMPFs that met the above criteria, 80,585 3D models were predicted, with 13,096 NMPFs having a high confidence (predicted TM (pTM) score > 0.700) prediction. The pTM-score integrates both the predicted confidence per position and the predicted alignment error (pAE) for every pair of positions, indicating the confidence of domain–domain orientations.

**Fig. 4: Structural characterization of the NMPFs.**

On the basis of structural clustering, these high-confidence predictions represented 4,361 unique structures. To examine the novelty or functions of these structures, we compared them to experimentally determined structures from SCOP-Extended (SCOPe)²⁶ and assemblies from the Protein Data Bank (PDB)²⁷. In total, 3,808 structures (12,253 NMPFs) had a significant structural overlap with at least 1 SCOPe domain (TM-score > 0.5). Of these, 2,718 (7,769 NMPFs) had a non-trivial hit, indicating that 62.3% of high-quality predictions had some similarity to at least one SCOPe domain or PDB assembly.

These novel assignments, based on structural similarity, can now be used for functional prediction of the corresponding sequences. A few examples are shown in Fig. 4c. For example, family F034396 had no hits to the PDB using HHsearch (top hit of e-value = 12), yet a strong hit to the PDB using a structural search of the SCOPe domain d3cmba1 (TM-score = 0.69), with the function of acetoacetate decarboxylase. Other examples with no HHsearch hits (e-value > 10), yet strong structural hits included F010804-d1z45a1 (TM-score = 0.73, galactose mutarotase) and F097565-d1xkra_ (TM-score = 0.73, chemotaxis). We stress that these cases should be treated as informed predictions that require experimental validation, as the same fold does not always correspond to the same function. A full list is provided in Supplementary Data 2. However, some validation and additional functional annotation can be performed by combining these novel assignments with other NMPF metadata, such as gene co-occurrence. A few examples are given in Extended Data Fig. 8.

To confirm that the remaining 553 proteins with no SCOPe hit were novel folds, a more thorough search was performed against all PDB biological assemblies, including all possible chain permutations. In total, 345 models had a hit to at least one PDB entry, of which 305 represented additional novel assignments. The remaining 208 were processed for further filtering, removing predictions of which 50% of the structure matched a SCOPe domain. Finally, 162 folds and/or domain–domain orientations from 223 NMPFs were identified as novel (Fig. 4b). A complete list of these folds is provided in Supplementary Data 3.

Although the absence of any significant structural homology precludes the reliable functional annotation for these novel folds, some hints towards their potential function can be gleaned from their associated metadata. Characteristic examples are given in Extended Data Fig. 9, showcasing the gene neighbourhood and ecosystem metadata of three NMPFs with novel structural folds.

Discussion

Arguably, the best approach for estimating and exploring microbial functional diversity is through systematically cataloguing and exhaustively characterizing sequence-diversity space. Over the past three decades, genome sequencing of hundreds of thousands of cultured microbial strains has enabled unprecedented growth and characterization of this sequence space, revealing that sequencing efforts targeted to maximize phylogenetic diversity can lead to further discoveries and growth of currently known protein family diversity²⁸. Although the exploration of the corresponding sequence-encoded functional diversity is lagging substantially²⁹, the explosion in the number of identified novel protein families has, to a great extent, been accompanied by an increase in targeted functional characterization of some of those families, particularly in areas of important biotechnological applications such as the discovery of new CRISPR–Cas genes and systems³⁰. The advent of metagenomics has further fuelled the rush to discover new enzymatic activities by unearthing a hidden treasure trove of untapped sequence information. Yet, aside from generating for-the-first-time important habitat-specific environmental gene catalogues³¹, most explorations of exponentially growing metagenomic sequence space have focused on expanding the diversity and characterization of previously known protein families³².

To alleviate this limitation and pioneer global insights into the extent of novel sequence space and, by effect, functional diversity across the realm of sequenced biomes, we have amassed close to 27,000 publicly available assembled metagenomic and metatranscriptomic datasets. From these datasets, we generated the NMPF catalogue, consisting of 106,198 metagenome protein families of 100 members or more with no sequence similarity to genes from reference microbial genomes or Pfam entries. Although these families represented a mere duplication over the number of families generated from more than 100,000 reference genomes integrated into IMG from all domains of life, far greater increases were observed in the families containing more than 25, 50 or 75 members, strongly suggesting that extensive sequence and functional diversity remains untapped. We anticipate that this diversity in unexplored microbial protein space will continue to increase over the next several years as more novel environmental samples are sequenced.

Although a much smaller number of metatranscriptomes was available for this analysis (4,739; 17.6%), we observed that the majority of NMPFs (60%) comprised proteins encoded by genes identified in both metagenomes and metatranscriptomes, indicating that most of those genes are actively expressed, further supporting the validity of those clusters. The clustering quality was also supported by the observation that 92% of clusters had members spanning 50 samples or more, while 50% of the clusters were from proteins distributed across 100 samples or more (Fig. 2d).

The identification of 7.5% of the NMPFs on the recently reconstructed MAGs of the GEM catalogue indicates that, as we continue to access the genetic content of uncultured microbial diversity, an increasing number of taxonomically orphan novel protein families will become taxonomically assigned, an important step towards their functional and ecological characterization.

There are several limitations underlying the metagenomic data and methodology used in this study. One limiting factor to consider is the short size (shorter than 5 kb) of the majority of scaffolds used in this study. However, note that, due to the required alignment coverage of at least 80%, potentially truncated sequences have to be sufficiently complete to cluster with full-length sequences (defined as located in the middle of longer scaffolds). This requirement has largely precluded the enrichment of NMPFs with fragmented proteins. However, even in the case of NMPFs with a high percentage of these suspect sequences, the clusters are found to produce stable 3D models (often with high structural quality, as evidenced by pLDDT and pTM scores), many of which have structural homologues to SCOPe domains. As a result, families containing such sequences could potentially represent protein fragments or protein domains that form parts of multi-domain sequences, or components in multimeric complexes. An additional potential limitation is the inclusion of eukaryotic sequences in the sequence dataset, which may introduce errors in the analysis. Yet, as shown (taxonomic distribution; Supplementary Methods), the contributions from eukaryotic scaffolds are relatively small, and the majority of the associated NMPFs also contain data from metatranscriptomes and/or prokaryotic taxa in sequence alignments, supporting their validity. However, until reliable eukaryotic gene predictors for metagenomes become available, eukaryotic, as well as unclassified NMPFs and sequences should be handled with care.

Overall, as more metagenomic data become available, an increasing diversity of sequences will be incorporated into NMPFs, which will then enable the generation of a much higher number of high-confidence structures, and therefore further increase the numbers of assignments to known structures as well as uncover novel folds. The identification of NMPFs opens new paths for structural genomics and challenges for fold recognition and the exploration of microbial dark matter.

Methods

Data collection and filtering

All publicly available metagenomes, metatranscriptomes and reference genomes were retrieved from the IMG/M database⁶ (database release, July 2019). Low-complexity regions were removed with the use of the tantan application³³. In total, we extracted all protein sequences from 89,412 bacterial, 9,202 viral, 3,073 archaeal and 804 eukaryal genomes. This corresponded to 87,084,214 bacterial, 221,027 viral, 2,464,569 archaeal and 4,902,193 eukaryotic non-redundant proteins, resulting in a final dataset of 94,672,003 sequences. Pfam hits (v.31) were detected with the use of the hmmsearch tool (HMMER v.3.1 package)³⁴ using the default trusted cut-off. Hits to proteins from reference genomes were calculated using LAST³⁵. We considered a hit to be any aligned sequence at >30% identity over 70% of its length (bidirectionally between the query and the subject). A detailed workflow of the sequence selection, filtering and analysis procedure is provided the Supplementary Methods. Α full summary of the sequences contained in each metagenome and metatranscriptome dataset, including hits to Pfam and reference genomes, sequences used in clustering (see below) and the remaining unannotated sequences, is provided in Supplementary Data 4.

Sequence clustering and analysis

Sequence clustering was performed using the HipMCL algorithm with inflation parameter 2.0 using identity scores as the input. HipMCL was chosen over other clustering solutions owing to its scalability and parallelization capabilities, as well as its ability to efficiently cluster very large datasets (Supplementary Methods). Before clustering, the all-versus-all pairwise alignments were calculated using LAST (70% sequence identity, 80% alignment coverage). The reference genome graph consisted of 71,312,220 nodes (proteins) and 5,313,956,680 edges (pairwise similarities). The graph for the ED proteins consisted of 570,198,677 nodes and 5,196,499,560 edges. Notably, during the similarity matrix construction, 23,359,783 (~24.67%) out of the 94,672,003 reference proteins and 601,776,172 (~51.34%) out of the 1,171,974,849 ED proteins remained as singletons. Using 2,500 compute nodes (170,000 compute cores) of the NERSC Cori supercomputer (Intel KNL partition), HipMCL clustered the reference protein graph in 24 min and the ED protein graph in 3 h and 20 min.

Notably, for this task, we explored several graph- and sequence-based methods such as CD-HIT³⁶, UCLUST³⁷, kClust³⁸ or Louvain³⁹ that could not perform at this scale as well as MMSeq2-linclust⁴⁰ and SPICi⁴¹, which were previously proposed to scale at this data size but none performed sufficiently well for the scope of this work. A comparison among these different clustering strategies is provided in the Supplementary Methods. With regard to sequence identity, we set the cut-off at 70% to achieve a compromise between sensitivity and specificity. Although other resources have used more stringent parameters (such as MGnify⁶, in which sequence clustering is performed with a 90% sequence identity cut-off), focusing on specificity, we have chosen a more sensitivity-based approach, yet not overly sensitive, to avoid artifacts produced by noise, multi-domain effects and false positives. Regarding alignment coverage, the choice of a high-coverage threshold (80%) ensures generating better-quality sequence alignments and avoiding potential artifacts, such as partial hits involving significantly truncated/incomplete genes, pseudogenes and chimeric sequences. The overall quality of the final dataset is further improved by considering MCL clusters with 100 members or more, ensuring the preclusion of spurious gene products. An analysis of how the clustering cut-off values (specifically sequence identity) may influence the quality of the resulting clusters is also provided in the Supplementary Methods.

For each cluster, a multiple-sequence alignment (MSA) was calculated using Clustal Omega⁴². The resulting MSAs were then filtered to produce seed (non-redundant) alignments using a script written in Python and the ProDy/Evol and Biopython modules⁴³ (90% sequence identity, 75% alignment coverage) (Supplementary Methods). All of the subsequent analysis steps, including the calculation of length distributions, HMM profile training and 3D structure predictions, were performed using the generated seed MSAs. HMM profiles were generated using the hmmalign utility of HMMER v.3.1 suite³⁴. The representative consensus sequences for each cluster were calculated with Biopython (Supplementary Methods). Consensus sequences were searched against the whole Pfam-A⁸ database (v.33) HMM profiles with HMMER (inclusion thresholds: 7.0 total, 5.0 domain), as well as against the reference genomes using BLAST⁴⁴ (30% identity, 50% coverage bidirectionally). Initial NMPFs of ≥100 members (113,752 clusters from 19,473 datasets with 20,211,137 scaffolds and 21,260,914 proteins) were processed for additional filtering with more stringent criteria to remove sequences with even weak similarities to Pfam-A models or genes from reference genomes, based on their calculated consensus profile sequences. This resulted in a more-stringent set of 106,198 clusters. Clusters of which the consensus sequence was found to have a hit to either Pfam-A or a reference genome were removed. A complete summary of the clustering procedure for each ED dataset is provided in Supplementary Data 3. Plot distributions were computed using R and the R/ggplot2⁴⁵ package. For all calculations regarding NMPF sequence length, the average sequence length of each cluster’s seed MSA was calculated and used.

Verification of protein family novelty

Additional searches were performed against the reference proteomes of RefSeq (November 2021 release) using the NMPF clusters’ consensus sequences and HMM profiles with LAST (matrix: BLOSUM62, gap open: 11, gap extend: 2) and HMMER (inclusion thresholds: 7.0 total, 5.0 domain), respectively, and applying a 70% alignment coverage cut-off. These searches showed that 5,111 clusters (around 5% of the total) obtained positive hits against 134,273 RefSeq sequences. Taxonomic annotation of these sequences showed that 75,215 (56.01%) of the positive hits were eukaryotic, 46,793 (34.85%) were bacterial, 9,296 (6.92%) were archaeal and 2,905 (2.16%) were viral, with 64 sequences having no taxonomic annotation. About half (70,554 or 52.54%) of the hits were published from 2020 onwards, with the majority of those matching genes from MAGs. Cross-examination of the RefSeq hits with UniProt (release 2021_04) records showed that 31,242 (23.27%) of the RefSeq sequences were mapped to 33,628 UniProtKB entries (453 SwissProt and 33,175 TrEMBL); these sequences correspond to only 32 NMPF clusters. The rest of the RefSeq sequences (103,031) were contained in the UniParc archive of UniProt and had no annotation evidence (either manual or automatically generated). The NMPF clusters were searched against Pfam-B, a non-annotated, computationally generated dataset of alignments for sequences not covered by Pfam-A. This resulted in 8,313 unique clusters with positive hits against 5,310 Pfam-B. Finally, the positive hits against the searches of each dataset were compiled and compared. In total, 12,846 clusters had a positive hit to RefSeq, Pfam-B or both, while the rest of the clusters (93,352) had no hits. Finally, all NMPF sequences were searched against AntiFam⁴⁶ v.6.0, a collection of HMM profiles designed to detect potential spurious protein sequences, pseudogenes and false protein translations. Only 43 sequences were identified, with low-score hits and low alignment coverage (<50%) to two AntiFam profiles.

Ecosystem and taxonomic annotation of protein clusters

NMPF clusters were annotated with available environmental and taxonomy metadata through their associated environmental datasets. In the case of environmental metadata, the GOLD¹⁵ ecosystem classification scheme¹⁴ was used to organize datasets into ecosystem groups (such as freshwater, marine, soil, host-associated); each protein cluster was then assigned to one or more ecosystems on the basis of the ecosystem information of the ED samples in IMG on which the protein sequences were found. In addition to GOLD, the Environment Ontology (ENVO)⁴⁷ and Earth Microbiome Project Ontology (EMPO)⁴⁸ were considered as potential alternative classification systems. Mapping of ED samples and NMPFs to ENVO and EMPO was performed using the metadata of the GOLD biosample project associated with each ED sample. The ecosystem assignments for all three classification systems are presented in Supplementary Data 5. Ultimately, the GOLD classification was used, as it was found to offer the most diverse options and classified all ED samples and NMPFs. From the 19,326 environmental samples that included the NMPFs, 14,540 (75.24%) were environmental (for example, soil, freshwater), 3,867 (20.01%) were host-associated (for example, human, plants) and 919 (4.76%) came from engineered environments (for example, wastewater, industrial wastes).

In a similar manner, the initial taxonomic annotation for the clusters was performed using the NCBI taxonomy information of the scaffolds contained in each IMG/M dataset, where available. Note that the majority of the scaffolds used in this study were too short and therefore remained unclassified taxonomically. Furthermore, there is very little information on the taxonomy of viral scaffolds. To alleviate these issues, annotations for scaffolds >5 kb that had previously been identified as viral and included in version 3.0 of IMG/VR⁴⁹ were used. Moreover, scaffolds 1–5 kb in length were analysed using DeepVirFinder (v.1.0)⁵⁰, and the generated P values were subsequently converted to q values using the R package qvalue⁵¹ to obtain estimates of the false-discovery rate. Scaffolds with q ≤ 0.001 were retained as putative viral scaffolds. Unclassified scaffolds were further analysed using two eukaryotic sequence detection tools, Whokaryote⁵² and EukRep⁵³. Furthermore, the NMPF clusters were searched against the Tara Oceans collection of eukaryotic MAGs⁵⁴. Finally, all remaining unclassified scaffolds were taxonomically assigned using the MMseqs2 taxonomy tool^40,55, performing six-frame translation searches against UniRef50⁵⁶ and assigning each analysed scaffold to the lowest common ancestor of the best hits for each frame. The taxonomic assignments of the NMPF clusters were based on the source scaffolds and are given in Supplementary Data 6. A detailed description of the taxonomic annotation and analysis is given in the Supplementary Methods.

Distribution analysis of the protein clusters across ecosystems and NCBI taxa was performed by creating and visualizing networks with Gephi⁵⁷ using the Yifan Hu algorithm⁵⁸ to generate the layout. As the resulting networks, when taking into account all clusters, were very dense, an association threshold was used to filter the data for better clarity, keeping only the edges where at least 2% of the members of each cluster were assigned to a certain ecosystem or phylogeny. Additional analysis was performed by creating circos plots and distribution matrices, produced using the R/chorddiag⁵⁹ and Processing/P5 libraries, respectively. Bar plots were also created in R, using the R/ggplot2⁴⁵ and R/plotly⁶⁰ libraries. Visualizations of geographical distribution were created using maps from the Natural Earth public domain repository (https://www.naturalearthdata.com/).

Sequence quality control

The quality of the predicted protein sequences used in the analysis was evaluated by taking into account the predicted gene coordinates and gene density of the source ED scaffolds. In particular, evaluation was performed for scaffolds identified as eukaryotic from the IMG/M pipeline, as well as scaffolds with no taxonomic annotation and low density, as the latter is often indicative of potential eukaryotic contigs, or contigs featuring alternative genetic codes. Furthermore, the distance of each NMPF sequence from its respective scaffold ends was evaluated to detect potentially shortened/truncated genes. Based on the above, a number of metrics have been established to assess the quality of NMPF clusters. Details on the analysis are provided in the Supplementary Methods. The quality assessment of NMPFs is presented in Supplementary Data 7.

Protein cluster co-occurrence with Pfams

The co-occurrence of NMPFs with known protein domains was determined by performing searches for the existence of Pfam protein domains in the analysed scaffolds containing both novel and known protein-coding genes. The translated sequences of the known genes for each scaffold were searched against the HMM profiles of Pfam using HMMER and the HMM profiles’ default trusted cut-off. All positive hits were assigned to their respective scaffolds and, in turn, to NMPFs containing novel sequences from these scaffolds, as potential co-occurring domains. The co-occurrence frequency percentage of each Pfam domain for each NMPF was calculated, defined as the number of scaffolds containing this domain over the total number of scaffolds associated with the NMPF. The Pfam domains were subsequently mapped to COG⁷ domains and their functional categories. No associations to Pfam were observed for 7,885 NMPFs; for the rest of the clusters, the top five Pfam and COG hits based on their frequency are reported in Supplementary Data 1. Moreover, the gene neighbourhoods of selected NMPFs were visualized as association networks, built with NORMA (v.2.0)⁶¹, and using COG functional categories to provide annotations.

Protein fold prediction

Multiple-sequence alignment

The query sequence for the MSA was determined by taking the central or pivot sequence of the seed MSA. Query sequences were defined in each MSA by performing pairwise distance calculations, creating an all-against-all distance matrix, and selecting the sequence with the minimum Hamming distance. The MSAs were then recalculated using the central sequence as a guide, filtering to remove sequences poorly aligned to the query (cut-offs set at 90% for sequence identity and 80% for alignment coverage), as well as poorly aligned positions (low column occupancy). The final MSAs were considered for further analysis if they had more than 16 effective sequences. Calculations were performed using Python and the TensorFlow2⁶², SciKit⁶³ and Biopython libraries.

TrRosetta for initial screening

For the initial pass, each of the putative protein families was analysed using TrRosetta²⁵ to obtain a distogram. Notably, a distogram is a tensor that contains the predicted distance distribution for every pair of residues. By summing the distances of less than 8 Å, a distogram can be converted into a contact map, indicating the probability of contact. The mean of the top probabilities has been shown to be highly correlated with structure accuracy²⁶ and can be used to filter for proteins that are probably well structured. This metric is very fast to compute and enables us to quickly scan through 106,198 examples. MSAs with at least 0.5 average probability were selected for AlphaFold prediction, alongside the MSAs with enough effective sequences.

AlphaFold for final prediction

As none of the NMPFs match known protein families (Pfam) or structures in the PDB, AlphaFold2²² was run with no template input. Five models were generated per run and the model with the best pLDDT average was selected for downstream evaluation.

Searching for structural homologues

Before any structural search, regions with low predicted confidence (pLDDT < 0.7) were removed. To test whether there is any structural similarity to experimentally determined structures, a TMalign⁶⁴ search was performed against every domain in SCOPe²⁷ (21 October 2021, v.2.0.8), an annotated version of the SCOP⁶⁵ database of protein domains. To test if hits (TM-score > 0.5) were novel assignments (not easily inferred from distant sequence homologues), the TMalign score was also computed for the structure of the top HHsearch hit. Finally, the TM-score could be low owing to the length difference between the target and query. This can happen because the query is multi-domain; to account for this, an additional search was performed against the entire PDB²⁸ (accessed on 17 December 2021) using MMalign⁶⁶. To further confirm that any hits to PDB were non-trivial, we compared the predicted structure to the PDB structure of the top hit from an HMM–HMM alignment search, using HHsearch⁶⁷. The predicted structures with non-trivial hits to SCOPe that were supported by HHsearch results are referred to as novel assignments.

Clustering

For clustering, the all-versus-all TMalign score was computed for all predicted structures with pTM > 0.7 trimmed to regions with pLDDT > 0.7. Clusters are defined as the connected components of a network, where edges are defined by TM-score > 0.6 for both target-to-query and query-to-target alignment.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All of the analysed datasets along with their corresponding sequences are available from the IMG system (http://img.jgi.doe.gov/). A list of the datasets used in this study is provided in Supplementary Data 8. All data from the protein clusters, including sequences, multiple alignments, HMM profiles, 3D structure models, and taxonomic and ecosystem annotation, are available through NMPFamsDB, publicly accessible at www.nmpfamsdb.org. The 3D models are also available at ModelArchive under accession code ma-nmpfamsdb.

Code availability

Sequence analysis was performed using Tantan (https://gitlab.com/mcfrith/tantan), BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi), LAST (https://gitlab.com/mcfrith/last), HMMER (http://hmmer.org/) and HH-suite3 (https://github.com/soedinglab/hh-suite). Clustering was performed using HipMCL (https://bitbucket.org/azadcse/hipmcl/src/master/). Additional taxonomic annotation was performed using Whokaryote (https://github.com/LottePronk/whokaryote), EukRep (https://github.com/patrickwest/EukRep), DeepVirFinder (https://github.com/jessieren/DeepVirFinder) and MMseqs2 (https://github.com/soedinglab/MMseqs2). 3D modelling was performed using AlphaFold2 (https://github.com/deepmind/alphafold) and TrRosetta2 (https://github.com/RosettaCommons/trRosetta2). Structural alignments were performed using TMalign (https://zhanggroup.org/TM-align/) and MMalign (https://zhanggroup.org/MM-align/). All custom scripts used for the generation and analysis of the data are available at Zenodo (https://doi.org/10.5281/zenodo.8097349).

References

New, F. N. & Brito, I. L. What is metagenomics teaching us, and what is missed? Annu. Rev. Microbiol. 74, 117–135 (2020).
Article PubMed CAS Google Scholar
Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
Article ADS PubMed CAS Google Scholar
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Article PubMed CAS Google Scholar
Meyer, F. et al. MG-RAST version 4—lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis. Brief. Bioinform. 20, 1151–1159 (2019).
Article PubMed Google Scholar
Ayling, M., Clark, M. D. & Leggett, R. M. New approaches for metagenome assembly with short reads. Brief. Bioinform. 21, 584–594 (2020).
Article PubMed CAS Google Scholar
Chen, I.-M. A. et al. The IMG/M data management and analysis system v.6.0: new tools and advanced capabilities. Nucleic Acids Res. 49, D751–D763 (2021).
Article PubMed CAS Google Scholar
Mitchell, A. L. et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 48, D570–D578 (2019).
Galperin, M. Y. et al. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 49, D274–D281 (2021).
Article PubMed CAS Google Scholar
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
Article PubMed CAS Google Scholar
Vanni, C. et al. AGNOSTOS-DB: a resource to unlock the uncharted regions of the coding sequence space. Preprint at bioRxiv https://doi.org/10.1101/2021.06.07.447314 (2021).
Rodríguez del Río, Á. et al. Functional and evolutionary significance of unknown genes from uncultivated taxa. Preprint at bioRxiv https://doi.org/10.1101/2022.01.26.477801 (2022).
Modha, S., Robertson, D. L., Hughes, J. & Orton, R. J. Quantifying and cataloguing unknown sequences within human microbiomes. mSystems https://doi.org/10.1128/msystems.01468-21 (2022).
Azad, A., Pavlopoulos, G. A., Ouzounis, C. A., Kyrpides, N. C. & Buluç, A. HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Res. 46, e33 (2018).
Article PubMed PubMed Central CAS Google Scholar
Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002).
Article PubMed PubMed Central CAS Google Scholar
Mukherjee, S. et al. Genomes OnLine Database (GOLD) v.8: overview and updates. Nucleic Acids Res. 49, D723–D733 (2021).
Article PubMed CAS Google Scholar
Ivanova, N. et al. A call for standardized classification of metagenome projects. Environ. Microbiol. 12, 1803–1805 (2010).
Article PubMed Google Scholar
Coelho, L. P. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).
Article ADS PubMed CAS Google Scholar
Clum, A. et al. DOE JGI Metagenome Workflow. mSystems 6, e00804-20 (2021).
Article PubMed PubMed Central Google Scholar
Baltoumas, F. A. et al. Exploring microbial functional biodiversity at the protein family level-From metagenomic sequence reads to annotated protein clusters. Front. Bioinform. 3, 1157956 (2023).
Article PubMed PubMed Central Google Scholar
Nayfach, S. et al. A genomic catalog of Earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
Article PubMed CAS Google Scholar
Overbeek, R., Fonstein, M., D’Souza, M., Pusch, G. D. & Maltsev, N. The use of gene clusters to infer functional coupling. Proc. Natl Acad. Sci. USA 96, 2896–2901 (1999).
Article ADS PubMed PubMed Central CAS Google Scholar
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Article ADS PubMed PubMed Central CAS Google Scholar
Ovchinnikov, S. et al. Protein structure determination using metagenome sequence data. Science 355, 294–298 (2017).
Article ADS PubMed PubMed Central CAS Google Scholar
Hou, Q. et al. Using metagenomic data to boost protein structure prediction and discovery. Comput. Struct. Biotechnol. J. 20, 434–442 (2022).
Article PubMed PubMed Central CAS Google Scholar
Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Chandonia, J.-M. et al. SCOPe: improvements to the structural classification of proteins—extended database to facilitate variant interpretation and machine learning. Nucleic Acids Res. 50, D553–D559 (2022).
Article PubMed CAS Google Scholar
Berman, H., Henrick, K. & Nakamura, H. Announcing the worldwide Protein Data Bank. Nat. Struct. Biol. 10, 980 (2003).
Article PubMed CAS Google Scholar
Mukherjee, S. et al. 1,003 reference genomes of bacterial and archaeal isolates expand coverage of the tree of life. Nat. Biotechnol. 35, 676–683 (2017).
Article PubMed CAS Google Scholar
Roberts, R. J. et al. COMBREX: a project to accelerate the functional annotation of prokaryotic genomes. Nucleic Acids Res. 39, D11–D14 (2011).
Article PubMed CAS Google Scholar
Koonin, E. V. & Makarova, K. S. Evolutionary plasticity and functional versatility of CRISPR systems. PLoS Biol. 20, e3001481 (2022).
Article PubMed PubMed Central CAS Google Scholar
Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).
Article PubMed PubMed Central CAS Google Scholar
Wyman, S. K., Avila-Herrera, A., Nayfach, S. & Pollard, K. S. A most wanted list of conserved microbial protein families with no known domains. PLoS ONE 13, e0205749 (2018).
Article PubMed PubMed Central Google Scholar
Frith, M. C. A new repeat-masking method enables specific detection of homologous sequences. Nucleic Acids Res. 39, e23 (2011).
Article PubMed Google Scholar
Eddy, S. R. Accelerated Profile HMM Searches. PLoS Comput. Biol. 7, e1002195 (2011).
Article ADS MathSciNet PubMed PubMed Central CAS Google Scholar
Kiełbasa, S. M., Wan, R., Sato, K., Horton, P. & Frith, M. C. Adaptive seeds tame genomic sequence comparison. Genome Res. 21, 487–493 (2011).
Article PubMed PubMed Central Google Scholar
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Article PubMed CAS Google Scholar
Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
Article PubMed CAS Google Scholar
Hauser, M., Mayer, C. E. & Söding, J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinform. 14, 248 (2013).
Article Google Scholar
Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
Article MATH Google Scholar
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Article PubMed CAS Google Scholar
Jiang, P. & Singh, M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26, 1105–1111 (2010).
Article PubMed PubMed Central Google Scholar
Sievers, F. & Higgins, D. G. Clustal Omega for making accurate alignments of many protein sequences. Protein Sci. 27, 135–145 (2018).
Article PubMed CAS Google Scholar
Cock, P. J. A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).
Article PubMed PubMed Central CAS Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article PubMed CAS Google Scholar
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
Eberhardt, R. Y. et al. AntiFam: a tool to help identify spurious ORFs in protein annotation. Database 2012, bas003 (2012).
Article PubMed PubMed Central Google Scholar
Buttigieg, P. L. et al. The environment ontology in 2016: bridging domains with increased scope, semantic density, and interoperation. J. Biomed. Semantics 7, 57 (2016).
Article PubMed PubMed Central Google Scholar
Thompson, L. R. et al. A communal catalogue reveals Earth’s multiscale microbial diversity. Nature 551, 457–463 (2017).
Article ADS PubMed PubMed Central CAS Google Scholar
Roux, S. et al. IMG/VR v3: an integrated ecological and evolutionary framework for interrogating genomes of uncultivated viruses. Nucleic Acids Res. 49, D764–D775 (2021).
Article PubMed CAS Google Scholar
Ren, J. et al. Identifying viruses from metagenomic data using deep learning. Quant. Biol. 8, 64–77 (2020).
Article PubMed PubMed Central CAS Google Scholar
Storey, J. D., Bass, A. J., Dabney, A. & Robinson, D. qvalue: Q-value estimation for false discovery rate control. R package version 2.32.0 http://github.com/jdstorey/qvalue (2023).
Pronk, L. J. U. & Medema, M. H. Whokaryote: distinguishing eukaryotic and prokaryotic contigs in metagenomes based on gene structure. Microb. Genomics 8, mgen000823 (2022).
Article Google Scholar
West, P. T., Probst, A. J., Grigoriev, I. V., Thomas, B. C. & Banfield, J. F. Genome-reconstruction for eukaryotes from complex natural microbial communities. Genome Res. 28, 569–580 (2018).
Article PubMed PubMed Central CAS Google Scholar
Delmont, T. O. et al. Functional repertoire convergence of distantly related eukaryotic plankton lineages abundant in the sunlit ocean. Cell Genomics 2, 100123 (2022).
Article PubMed PubMed Central CAS Google Scholar
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Levy Karin, E. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).
Article PubMed PubMed Central CAS Google Scholar
Suzek, B. E., Wang, Y., Huang, H., McGarvey, P. B. & Wu, C. H. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
Article PubMed CAS Google Scholar
Bastian, M., Heymann, S. & Jacomy, M. Gephi: an open source software for exploring and manipulating networks. In Proc. International AAAI Conference on Web and Social Media Vol. 3, 361–362 (AAAI, 2009).
Hu, Y. in Combinatorial Scientific Computing (eds Naumann, U. & Schenk, O.) 525–549 (CRC Press, 2010).
Flajolet, P. & Noy, M. in Formal Power Series and Algebraic Combinatorics (eds Krob, D. et al.) 191–201 (Springer, 2000); https://doi.org/10.1007/978-3-662-04166-6_17.
Sievert, C. Interactive Web-Based Data Visualization with R, plotly, and shiny (Chapman and Hall/CRC, 2020).
Karatzas, E. et al. The network makeup artist (NORMA-2.0): distinguishing annotated groups in a network using innovative layout strategies. Bioinform. Adv. 2, vbac036 (2022).
Article PubMed PubMed Central Google Scholar
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Preprint at https://doi.org/10.48550/arXiv.1603.04467 (2015).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
MathSciNet MATH Google Scholar
Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res. 33, 2302–2309 (2005).
Article PubMed PubMed Central CAS Google Scholar
Andreeva, A., Kulesha, E., Gough, J. & Murzin, A. G. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 48, D376–D382 (2020).
Article PubMed CAS Google Scholar
Mukherjee, S. & Zhang, Y. MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res. 37, e83 (2009).
Article PubMed PubMed Central Google Scholar
Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinform. 20, 473 (2019).
Article Google Scholar

Download references

Acknowledgements

We thank H. Maughan for reading the paper; and all of the colleagues who contributed to the many facets of metagenomics, from sample collection to sequencing and annotation that made this work possible. The list of the JGI Proposal Award DOIs is available in Supplementary Table 13. This work used resources of the National Energy Research Scientific Computing Center (NERSC), supported by the Office of Science of the US Department of Energy (DOE). Additional computations were performed with the use of the Greek Research and Technology Network (GRNET) Aris High Processing Computing (HPC) infrastructure (project code: PR009008-BOLOGNA). This work was supported in part by the US DOE Joint Genome Institute (DE-AC02–05CH11231, in part), a DOE Office of Science User Facility; the Applied Mathematics program of the DOE Office of Advanced Scientific Computing Research (DE-AC02–05CH11231, in part), Office of Science of the US DOE; Exascale Computing Project (17-SC-20-SC), a collaborative effort of the US DOE Office of Science and the National Nuclear Security Administration; DOE grant DE-SC0022098. G.A.P., F.A.B. and E.K. were supported by Fondation Santé and the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the ‘First Call for H.F.R.I. Research Projects to support faculty members and researchers and the procurement of high-cost research equipment grant’ (grant ID HFRI-17-1855-BOLOGNA). G.A.P. also acknowledges the Marie Skłodowska-Curie Individual Fellowships (MSCA-IF-EF-CAR, grant ID 838018, H2020-MSCA-IF-2018) and ‘The Greek Research Infrastructure for Personalized Medicine (pMedGR)’ (MIS 5002802), which is implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program ‘Competitiveness, Entrepreneurship and Innovation’ (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). C.A.O. and I.I. acknowledge support by the project Elixir-GR (MIS 5002780), implemented under the Action ‘Reinforcement of the Research and Innovation Infrastructure’, funded by the Operational Program Competitiveness, Entrepreneurship and Innovation (NSRF 2014-2020) and co-financed by Greece and the European Union (European Regional Development Fund). S.O. and S.L. are supported by NIH grant DP5OD026389 and the Moore–Simons Project on the Origin of the Eukaryotic Cell, Simons Foundation 735929LPI (https://doi.org/10.46714/735929LPI). J.P.-R. was supported by the US DOE Genomic Sciences Program, award SCW1632; and work conducted at the LLNL was conducted under the auspices of the US DOE under Contract DE-AC52-07NA27344. Work from the consortium was supported by NSF grants OIA-1826734, DEB-1441717 and OCE-1232982; NSF 1921429; CONACyT grants A1-S-9889 and CB-2010-01-151007; US DOE, Office of Science, Office of Biological and Environmental Research (BER), Great Lakes Bioenergy Research Center (DOE BER DE-SC0018409 and DE-FC02-07ER64494); NSF grant OCE-082546, US DOE, Office of Science, Facilities Integrating Collaborations for User Science (FICUS) program, Office of Workforce Development for Teachers and Scientists, Office of Science Graduate Student Research (SCGSR) program; New Zealand Foundation for Research, Science and Technology grant CO1X0306 and National Science Foundation grant 1745341; NSF Division of Chemical, Bioengineering, Environmental and Transport Systems grants 1438092 and 1643486; NSF OCE-1559179, NSF OCE-1537951, NSF OCE-1459200, Gordon & Betty Moore Foundation Investigator Award 3789; the G. Unger Vetlesen and Ambrose Monell Foundations; the Natural Sciences and Engineering Research Council of Canada; Genome Canada and Genome British Columbia; the PR-INBRE BiRC program (NIH/NIGMS- award number P20 GM103475); Great Lakes Bioenergy Research Center, US DOE, Office of Science, Office of Biological and Environmental Research under award numbers DE-SC0018409 and DE-FC02-07ER64494; the Agriculture and Food Research Initiative, competitive grant 2009-447 35319-05186 from the US Department of Agriculture, National Institute of Food and Agriculture; Sol Leshin Foundation and the Shanbrom Family Fund; Towards Sustainability Foundation, Cornell Sigma Xi, NSERC PGS-D, NSF-BREAD (IOS-0965336), Cornell Biogeochemistry Program, Cornell Crop and Soil Science Department, USDA-NIFA Carbon Cycle (2014-6700322069) and the Cornell Atkinson Center for a Sustainable Future; Office of Science (BER), US DOE (DE-SC0014395); NSF grant OCE 0424602; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; Australian Research Council: DP150100244; NSF-OPP 1641019; NSF 1754756; NSF 1442231; NSF award OCE-173723; USDA National Institute of Food and Agriculture Foundational Program (award 2017-67019-26396); USDA NIFA award 2011-67019-30178; BER grant DE-SC0014395; National Science Foundation grant DEB-1927155; US DOE, Office of Science, Office of Biological and Environmental Research, Environmental System Science (ESS) Program; River Corridor Scientific Focus Area (SFA) project at Pacific Northwest National Laboratory (PNNL); grant NNX16AJ62G from NASA Exobiology; NASA Exobiology awards 80NSSC19K1633 and NNX17AK85G; NSF award DEB-1146149; US NSF (DEB 1912525); US DOE Office of Biological and Environmental Research (DE-SC0020382); NSF EAR-1820658; DE-FG02-94ER20137 from the Photosynthetic Systems Program, Division of Chemical Sciences, Geosciences and Biosciences (CSGB), Office of Basic Energy Sciences of the US DOE; Max Planck Society and the BioEnergy Science Center (BESC), a US DOE Bioenergy Research Center supported by the Office of Biological and Environmental Research in the DOE Office of Science; and US DOE, Office of Science, Biological and Environmental Research, as part of the Plant Microbe Interfaces Scientific Focus Area at Oak Ridge National Laboratory.

Author information

Authors and Affiliations

Institute for Fundamental Biomedical Research, Biomedical Science Research Center Alexander Fleming, Vari, Greece
Georgios A. Pavlopoulos, Fotis A. Baltoumas & Evangelos Karatzas
DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Georgios A. Pavlopoulos, Antonio Pedro Camargo, Stephen Nayfach, Simon Roux, Lee Call, Natalia N. Ivanova, I. Min Chen, David Paez-Espino, Emiley Eloe-Fadrosh, Marcel Huntemann, T. B. K. Reddy, Rekha Seshadri, Susannah Tringe, Tanja Woyke, Axel Visel, Christos A. Ouzounis & Nikos C. Kyrpides
Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece
Georgios A. Pavlopoulos
John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, USA
Sirui Liu & Sergey Ovchinnikov
Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Oguz Selvitopi & Aydin Buluç
Luddy School of Informatics, Computing and Engineering, Indiana University Bloomington, Bloomington, IN, USA
Ariful Azad
Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
Ioannis Iliopoulos
School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
Konstantinos Konstantinidis
Center for Microbial Ecology, Michigan State University, East Lansing, MI, USA
Aaron Garoutte, Jiarong Guo & James M. Tiedje
Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, CA, USA
Jennifer Pett-Ridge
Department of Biochemistry, University of Washington, Seattle, WA, USA
David Baker
Institute for Protein Design, University of Washington, Seattle, WA, USA
David Baker
Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
David Baker
Biological Computation & Process Laboratory, Chemical Process & Energy Resources Institute, Centre for Research & Technology Hellas, Thessalonica, Greece
Christos A. Ouzounis
Biological Computation & Computational Biology Group, Artificial Intelligence & Information Analysis Lab, School of Informatics, Aristotle University of Thessalonica, Thessalonica, Greece
Christos A. Ouzounis
Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
Aydin Buluç
Department of Marine Biology and Oceanography, Institut de Ciències del Mar, Barcelona, Spain
Silvia G. Acinas
Biology Department, Clark University, Worcester, MA, USA
Nathan Ahlgren
AgResearch, Grasslands Research Centre, Palmerston North, New Zealand
Graeme Attwood
Laboratory of Environmental Microbiology, Institute of Microbiology of the Czech Academy of Sciences, Prague, Czech Republic
Petr Baldrian
Department of Soil Science, University of Wisconsin-Madison, Madison, WI, USA
Timothy Berry & Thea Whitman
Department of Biology, Boston University, Boston, MA, USA
Jennifer M. Bhatnagar
Carnegie Institution for Science, Stanford, CA, USA
Devaki Bhaya
Department of Marine and Coastal Sciences, Rutgers University, New Brunswick, NJ, USA
Kay D. Bidle & Kimberlee Thamatrakoln
Department of Biology, University of Massachusetts, Amherst, MA, USA
Jeffrey L. Blanchard
Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
Eric S. Boyd
Marine Science Center, Department of Marine and Environmental Sciences, Northeastern University, Nahant, MA, USA
Jennifer L. Bowen
Integrative Oceanography Division, Scripps Institution of Oceanography, UC San Diego, La Jolla, CA, USA
Jeff Bowman
School of Marine Sciences, University of Maine, Orono, ME, USA
Susan H. Brawley
Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
Eoin L. Brodie
Research Group Insect Microbiology and Symbiosis, Max Planck Institute for Terrestrial Microbiology, Marburg, Germany
Andreas Brune
Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, USA
Donald A. Bryant
Department of Microbiology, The University of Tennessee, Knoxville, Knoxville, TN, USA
Alison Buchan
School of Life Sciences, Arizona State University, Tempe, AZ, USA
Hinsby Cadillo-Quiroz
Department of Biological Sciences, Clemson University, Clemson, SC, USA
Barbara J. Campbell
School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Sydney, New South Wales, Australia
Ricardo Cavicchioli
Center for Ecosystem Science and Society, Northern Arizona University, Flagstaff, AZ, USA
Peter F. Chuckran & Paul Dijkstra
Department of the Geophysical Sciences, University of Chicago, Chicago, IL, USA
Maureen Coleman
Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, Canada
Sean Crowe
Department of Biology, San Diego State University, San Diego, CA, USA
Daniel R. Colman & Nathalie Delherbe
Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
Cameron R. Currie
Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jeff Dangl & Marina Kalyuzhnaya
Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI, USA
Vincent J. Denef & Timothy Y. James
Ocean Genome Legacy, Marine Science Center, Northeastern University, Nahant, MA, USA
Daniel D. Distel
Department of Biological Sciences, California State University, Los Angeles, CA, USA
Kirsten Fisher
Department of Earth System Science, Stanford University, Stanford, CA, USA
Christopher Francis
Department of Plant Sciences, University of California, Davis, Davis, CA, USA
Amelie Gaudin
Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, USA
Lena Gerwick
Department of Microbiology and Medical Zoology, School of Medicine, University of Puerto Rico, San Juan, PR, USA
Filipa Godoy-Vitorino
Lynker, Albuquerque, NM, USA
Peter Guerra
Department of Crops and Soil Sciences, University of Georgia, Griffin, GA, USA
Mussie Y. Habteselassie
Department of Microbiology & Immunology, University of British Columbia, Vancouver, British Columbia, Canada
Steven J. Hallam
Department of Chemistry and Biochemistry, Montana State University, Bozeman, MT, USA
Roland Hatzenpichler, Mackenzie Lynes & Nicholas J. Reichart
RD3 Marine Symbioses, GEOMAR Helmholtz Centre for Ocean Research Kiel, Kiel, Germany
Ute Hentschel
Systems Microbiology and Natural Products Laboratory, University of California, Davis, Davis, CA, USA
Matthias Hess
Department of Molecular, Cell & Developmental Biology, University of California, Los Angeles (UCLA), Los Angeles, CA, USA
Ann M. Hirsch
Department of Biology, University of Waterloo, Waterloo, Ontario, Canada
Laura A. Hug & Josh D. Neufeld
Department of Microbiology, University of Helsinki, Helsinki, Finland
Jenni Hultman
Marine Laboratory, Duke University, Beaufort, NC, USA
Dana E. Hunt
Department of Land Resources & Environmental Sciences, Montana State University, Bozeman, MT, USA
William P. Inskeep & David M. Ward
Earth and Biological Sciences Directorate, Pacific Northwest National Lab, Richland, WA, USA
Janet Jansson
Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Eric R. Johnston & Dale Pelletier
Division of Forestry and Natural Resources, West Virginia University, Morgantown, WV, USA
Charlene N. Kelly
Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA
Robert M. Kelly
Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA
Jonathan L. Klassen
Department of Microbiology, University of Massachusetts Amherst, Amherst, MA, USA
Klaus Nüsslein
School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
Joel E. Kostka
Department of Plant and Microbial Biology, University of California, Berkeley, CA, USA
Steven Lindow
USDA Forest Service, Northern Research Station, Houghton, MI, USA
Erik Lilleskov
Department of Biology, California State University Northridge, Northridge, CA, USA
Rachel Mackelprang
Université de Lorraine, INRAe, UMR 1136 Interactions Arbres/Microorganismes, INRAe-Grand Est-Nancy, Champenoux, France
Francis M. Martin
Department of Earth, Ocean and Atmospheric Science, Florida State University, Tallahassee, FL, USA
Olivia U. Mason
Great Lakes Institute for Environmental Research, University of Windsor, Windsor, Ontario, Canada
R. Michael McKay
Departments of Civil and Environmental Engineering and Bacteriology, University of Wisconsin, Madison, WI, USA
Katherine McMahon
Varigen Biosciences Corporation, Madison, WI, USA
David A. Mead
Biology Department, The Pennsylvania State University, University Park, PA, USA
Monica Medina
School of Natural Resources and the Environment, University of Arizona, Tucson, AZ, USA
Laura K. Meredith
BIO5 Institute, University of Arizona, Tucson, AZ, USA
Laura K. Meredith
School of Environmental Sciences, University of East Anglia, Norwich, UK
Thomas Mock
Department of Microbiology & Immunology, Life Sciences Institute, The University of British Columbia, Vancouver, British Columbia, Canada
William W. Mohn & Rachel Simister
Department of Marine Sciences, University of Georgia, Athens, GA, USA
Mary Ann Moran
Division of Earth and Ecosystem Science, Desert Research Institute, Reno, NV, USA
Alison Murray
Department of Civil & Environmental Engineering, University of Washington, Seattle, WA, USA
Rebecca Neumann & Nicholas B. Waldo
Department of Plants, Soils and Climate, Utah State University, Logan, UT, USA
Jeanette M. Norton
Unidad Irapuato, Centro de Investigación y de Estudios Avanzados del IPN (Cinvestav), Irapuato, Mexico
Laila P. Partida-Martinez
Plant and Environmental Sciences Department, New Mexico State University, Las Cruces, NM, USA
Nicole Pietrasiak
University of South Alabama, Mobile, AL, USA
Brandi Kiel Reese
New Mexico Institute of Mining and Technology, Socorro, NM, USA
Rebecca Reiss
Marine Chemistry and Geochemistry Department, Woods Hole Oceanographic Institution, Woods Hole, MA, USA
Mak A. Saito
Department of Agronomy and Horticulture and Center for Plant Science Innovation, University of Nebraska–Lincoln, Lincoln, NE, USA
Daniel P. Schachtman
Department of Microbiology and Molecular Genetics, Michigan State University, East Lansing, MI, USA
Ashley Shade
Life Sciences Institute, University of Michigan, Ann Arbor, MI, USA
David Sherman
Departments of Medicinal Chemistry, Chemistry, and Microbiology and Immunology, University of Michigan, Ann Arbor, MI, USA
David Sherman
Division of Environmental and Biomolecular Systems, Institute of Environmental Health, Oregon Health & Science University, Portland, OR, USA
Holly Simon
Animal Microbiome Analytics, Oakland, CA, USA
Holly Simon
Pacific Northwest National Laboratory, Richland, WA, USA
James Stegen
Bigelow Laboratory for Ocean Sciences, East Boothbay, ME, USA
Ramunas Stepanauskas
Departments of Microbiology and Civil, Environmental and Geodetic Engineering, Ohio State University, Columbus, OH, USA
Matthew Sullivan
Department of Earth and Planetary Sciences, University of California Davis, Davis, CA, USA
Dawn Y. Sumner
Max-Planck-Institute for Marine Microbiology, Bremen, Germany
Hanno Teeling
Department of Ecology and Evolutionary Biology, University of California Irvine, Irvine, CA, USA
Kathleen Treseder
Space Biosciences Division, NASA Ames Research Center, Moffett Field, CA, USA
Parag Vaishampayan
Department of Earth Science and Marine Science Institute, University of California, Santa Barbara, CA, USA
David L. Valentine
Geology, Minerals, Energy and Geophysics Science Center, Menlo Park, CA, USA
Mark P. Waldrop
Department of Biology, Concordia University, Montreal, Quebec, Canada
David A. Walsh
Department of Soil & Crop Sciences, Colorado State University, Fort Collins, CO, USA
Michael Wilkins
Department of Forest and Rangeland Stewardship, Colorado State University, Fort Collins, CO, USA
Jamie Woolet

Authors

Georgios A. Pavlopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Fotis A. Baltoumas
View author publications
You can also search for this author in PubMed Google Scholar
Sirui Liu
View author publications
You can also search for this author in PubMed Google Scholar
Oguz Selvitopi
View author publications
You can also search for this author in PubMed Google Scholar
Antonio Pedro Camargo
View author publications
You can also search for this author in PubMed Google Scholar
Stephen Nayfach
View author publications
You can also search for this author in PubMed Google Scholar
Ariful Azad
View author publications
You can also search for this author in PubMed Google Scholar
Simon Roux
View author publications
You can also search for this author in PubMed Google Scholar
Lee Call
View author publications
You can also search for this author in PubMed Google Scholar
Natalia N. Ivanova
View author publications
You can also search for this author in PubMed Google Scholar
I. Min Chen
View author publications
You can also search for this author in PubMed Google Scholar
David Paez-Espino
View author publications
You can also search for this author in PubMed Google Scholar
Evangelos Karatzas
View author publications
You can also search for this author in PubMed Google Scholar
Ioannis Iliopoulos
View author publications
You can also search for this author in PubMed Google Scholar
Konstantinos Konstantinidis
View author publications
You can also search for this author in PubMed Google Scholar
James M. Tiedje
View author publications
You can also search for this author in PubMed Google Scholar
Jennifer Pett-Ridge
View author publications
You can also search for this author in PubMed Google Scholar
David Baker
View author publications
You can also search for this author in PubMed Google Scholar
Axel Visel
View author publications
You can also search for this author in PubMed Google Scholar
Christos A. Ouzounis
View author publications
You can also search for this author in PubMed Google Scholar
Sergey Ovchinnikov
View author publications
You can also search for this author in PubMed Google Scholar
Aydin Buluç
View author publications
You can also search for this author in PubMed Google Scholar
Nikos C. Kyrpides
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

Novel Metagenome Protein Families Consortium

Silvia G. Acinas
, Nathan Ahlgren
, Graeme Attwood
, Petr Baldrian
, Timothy Berry
, Jennifer M. Bhatnagar
, Devaki Bhaya
, Kay D. Bidle
, Jeffrey L. Blanchard
, Eric S. Boyd
, Jennifer L. Bowen
, Jeff Bowman
, Susan H. Brawley
, Eoin L. Brodie
, Andreas Brune
, Donald A. Bryant
, Alison Buchan
, Hinsby Cadillo-Quiroz
, Barbara J. Campbell
, Ricardo Cavicchioli
, Peter F. Chuckran
, Maureen Coleman
, Sean Crowe
, Daniel R. Colman
, Cameron R. Currie
, Jeff Dangl
, Nathalie Delherbe
, Vincent J. Denef
, Paul Dijkstra
, Daniel D. Distel
, Emiley Eloe-Fadrosh
, Kirsten Fisher
, Christopher Francis
, Aaron Garoutte
, Amelie Gaudin
, Lena Gerwick
, Filipa Godoy-Vitorino
, Peter Guerra
, Jiarong Guo
, Mussie Y. Habteselassie
, Steven J. Hallam
, Roland Hatzenpichler
, Ute Hentschel
, Matthias Hess
, Ann M. Hirsch
, Laura A. Hug
, Jenni Hultman
, Dana E. Hunt
, Marcel Huntemann
, William P. Inskeep
, Timothy Y. James
, Janet Jansson
, Eric R. Johnston
, Marina Kalyuzhnaya
, Charlene N. Kelly
, Robert M. Kelly
, Jonathan L. Klassen
, Klaus Nüsslein
, Joel E. Kostka
, Steven Lindow
, Erik Lilleskov
, Mackenzie Lynes
, Rachel Mackelprang
, Francis M. Martin
, Olivia U. Mason
, R. Michael McKay
, Katherine McMahon
, David A. Mead
, Monica Medina
, Laura K. Meredith
, Thomas Mock
, William W. Mohn
, Mary Ann Moran
, Alison Murray
, Josh D. Neufeld
, Rebecca Neumann
, Jeanette M. Norton
, Laila P. Partida-Martinez
, Nicole Pietrasiak
, Dale Pelletier
, T. B. K. Reddy
, Brandi Kiel Reese
, Nicholas J. Reichart
, Rebecca Reiss
, Mak A. Saito
, Daniel P. Schachtman
, Rekha Seshadri
, Ashley Shade
, David Sherman
, Rachel Simister
, Holly Simon
, James Stegen
, Ramunas Stepanauskas
, Matthew Sullivan
, Dawn Y. Sumner
, Hanno Teeling
, Kimberlee Thamatrakoln
, Kathleen Treseder
, Susannah Tringe
, Parag Vaishampayan
, David L. Valentine
, Nicholas B. Waldo
, Mark P. Waldrop
, David A. Walsh
, David M. Ward
, Michael Wilkins
, Thea Whitman
, Jamie Woolet
& Tanja Woyke

Contributions

N.C.K. conceived the project and performed part of the analysis. G.A.P. performed data collection as well as the analysis. F.A.B. assisted in data analysis and implemented the NMPFamsDB database. I.I., N.N.I., A.V. and C.A.O. provided feedback about the strategy and edited parts of the manuscript. E.K. helped with scripting, data analysis and the implementation of the database. A.A., O.S. and A.B. implemented the HipMCL algorithm and ran the clustering. S.O., D.B. and S.L. performed the protein structure predictions and analysis. D.P.-E., A.P.C., S.N. and L.C. helped with the taxonomic classification of scaffolds. J.M.T. and the Metagenome Protein Families Consortium provided data. G.A.P., S.O. and N.C.K. wrote the paper, with feedback from all of the authors. All of the authors read and approved the manuscript.

Corresponding authors

Correspondence to Georgios A. Pavlopoulos or Nikos C. Kyrpides.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Ami Bhatt, Alexander Probst and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Distribution of NMPF clusters across the eight ecosystem types.

(a) Circos Plot. The distribution of the ecosystems is presented in a chord-like circular diagram. The rim of the diagram represents the total size of the ecosystem types (i.e. number of NMPFs in each ecosystem), with the numbers outside the rim indicating the size scale. The intersections of categories are represented by arcs drawn between them. The size of the arc is proportional to the importance of the flow. (b) 8×8 matrix. Each cell in the matrix presents the common NMPFs in a binary combination of two ecosystems (e.g. 17,442 NMPFs are common among Marine and Freshwater ecosystems). The diagonal of the matrix displays the ecosystem-specific NMPFs. Each ecosystem column is coloured using the same colour code as Fig. 2, with the brightness of each cell being proportional to the NMPF number (brighter colour = less NMPFs).

Extended Data Fig. 2 Distribution of NMPF clusters across the sub-categories of the Freshwater (top) and Marine (bottom) aquatic ecosystems.

Data are shown as circos plots (a,d), colour-coded matrices (b,e) and UpSet plots (c,f).

Extended Data Fig. 3 Distribution of NMPF clusters across the sub-categories of the Soil (top) and Plant (bottom) ecosystems.

Data are shown as circos plots (a,d), colour-coded matrices (b,e) and UpSet plots (c,f).

Extended Data Fig. 4 Distribution of NMPF clusters across the sub-categories of the Non-human mammal (top) and Other Host-associated (bottom) ecosystems.

Data are shown as circos plots (a,d), colour-coded matrices (b,e) and UpSet plots (c,f).

Extended Data Fig. 5 Distribution of NMPF clusters across the sub-categories of the Human tissue (top) and Engineered (bottom) ecosystems.

Data are shown as circos plots (a,d), colour-coded matrices (b,e) and UpSet plots (c,f).

Extended Data Fig. 6 Distribution of NMPF clusters across different taxa (bacteria, archaea, eukarya, viruses, and unclassified).

(a) Venn Diagram, displaying the intersections among the different taxonomy categories. (b) Network representation of the protein clusters and their taxonomic assignments. The taxa are represented by central, coloured nodes (hubs) whereas the grey peripheral nodes represent the protein clusters.

Extended Data Fig. 7 Geographical distribution of the ED samples and NMPFs.

(a) Locations for all ED samples in the study with available geo-location metadata (Longitude and Latitude). (b-f) Distribution of geographically-isolated NMPF clusters, based on a cut-off distance of 1, 10, 100, 500, and 1000 Km. In all cases, dots are coloured based on the ecosystem type (blue: marine, cyan: freshwater, brown: soil, purple: other environmental, green: plants, red: human, magenta: non-human mammals, salmon pink: other host-associated, grey: engineered). (g) UpSet plot showing the distribution of the geographically-isolated NMPF clusters, based on a cut-off distance of 1000 Km (as shown in panel f). Map panels were created using data from the Natural Earth dataset (www.naturalearthdata.com).

Extended Data Fig. 8 Functional annotation of NMPFs with remote structural homologues.

Five example NMPFs (a-e) are shown. Annotation is performed using using structural information (left), gene co-occurrence analysis (middle), and ecosystem distribution (right). Each of the NMPFs has a high-quality 3D model with at least one remote structural homologue to SCOPe. The NMPFs’ 3D models, produced with AlphaFold, and the structures of the SCOPe domains are rendered in the same orientation and coloured based on their per-residue structure confidence (pLDDT for AlphaFold models and inverse B-factor for experimental structures). The gene neighbourhood of each NMPF is presented in the form of an association network; with nodes representing gene products (the NMPFs and their adjacent genes that encode Pfam domains) and edges representing co-occurrence in the same sequencing scaffold. Pfam domains are further grouped using their associated COG functional categories as annotation. Finally, the NMPFs’ associated ecosystems are presented in pie charts. Ecosystems with a <1% presence in the NMPFs are summed into the category “Other ecosystems”.

Extended Data Fig. 9 Putative functional annotation of NMPFs with potential novel structural folds.

Three example NMPFs (a-c) are shown. The produced AlphaFold 3D model (left), gene co-occurrence analysis (middle) and ecosystem distribution (right) are given. 3D models are coloured based on their per-residue structure confidence (pLDDT). The gene neighbourhood of each NMPF is presented in the form of an association network; with nodes representing gene products (the NMPFs and their adjacent genes that encode Pfam domains) and edges representing co-occurrence in the same sequencing scaffold. Pfam domains are further grouped using their associated COG functional categories as annotation. Finally, the NMPFs’ associated ecosystems are presented in pie charts. Ecosystems with a <1% presence in the NMPFs are summed into the category “Other ecosystems”.

Supplementary information

Supplementary Information

Supplementary Methods, Supplementary Results, Supplementary Figs. 1–13 and Supplementary Tables 1–13.

Reporting Summary

Supplementary Data 1

Top co-occurring Pfam domains for NMPF clusters.

Supplementary Data 2

NMPF structural models and structural/functional assignments, based on searches with TMalign, MMalign and HHsearch against SCOPe and PDB.

Supplementary Data 3

Potential novel structural folds. Sheet 1 displays the list of novel folds, including their statistics (number of effective sequences, TrRosetta and AlphaFold scores) and TMalign/MMalign comparison results to SCOPe and PDB. Sheet 2 shows the top five co-occurring Pfam domains of the NMPFs associated with these folds.

Supplementary Data 4

Contribution of NMPFs in the annotation of metagenomes and example recruitment for future updates. Sheet 1 presents a full summary of all the IMG/M datasets used in this study. Each dataset’s genomic content is presented, including the numbers of previously annotated protein-coding genes (hits to Pfam or to isolate genomes); genes clustered to NMPFs in the present study and remaining unannotated genes. Sheet 2 contains a breakdown of the MCL clustering, displaying which genes were kept in NMPFs and which were removed. Sheets 3 and 4 present the same results, grouped for each ecosystem type. Sheets 5 and 6 present the results of an example NMPF enrichment by recruiting sequences from 360 metagenomic/metatranscriptomic samples per NMPF (sheet 5) and dataset (sheet 6).

Supplementary Data 5

Ecosystem classification in the GOLD, EMPO and ENVO classification schemes. Sheet 1 displays the ecosystem assignments of each ED sample used in the analysis. Sheet 2 displays the top ecosystem assignment of each NMPF cluster in the three schemes. In cases in which an ED sample or NMPF has no assignment to ENVO or EMPO, it was marked as unclassified.

Supplementary Data 6

Taxonomic annotation of NMPF clusters.

Supplementary Data 7

Quality assessment of NMPF clusters.

Supplementary Data 8

IMG datasets

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Pavlopoulos, G.A., Baltoumas, F.A., Liu, S. et al. Unraveling the functional dark matter through global metagenomics. Nature 622, 594–602 (2023). https://doi.org/10.1038/s41586-023-06583-7

Download citation

Received: 18 March 2022
Accepted: 30 August 2023
Published: 11 October 2023
Issue Date: 19 October 2023
DOI: https://doi.org/10.1038/s41586-023-06583-7

This article is cited by

The journey to understand previously unknown microbial genes
- Jakob Wirbel
- Ami S. Bhatt
- Alexander J. Probst
Nature (2024)
Machine learning sheds light on microbial dark proteins
- Aeron Tynes Hammack
- Crysten E. Blaby-Haas
Nature Reviews Microbiology (2024)
Unveiling the expanding protein universe of life
- Hajk-Georg Drost
Nature Reviews Genetics (2024)
Revealing viral diversity in the Napahai plateau wetland based on metagenomics
- Lingling Xiong
- Yanmei Li
- Xiuling Ji
Antonie van Leeuwenhoek (2024)
Indoles and the advances in their biotechnological production for industrial applications
- Lenny Ferrer
- Melanie Mindt
- Katarina Cankar
Systems Microbiology and Biomanufacturing (2024)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Main

The novel protein sequence space

Novel protein families

Biome distribution

Taxonomic distribution

Metadata distribution

Structural distribution

Discussion

Methods

Data collection and filtering

Sequence clustering and analysis

Verification of protein family novelty

Ecosystem and taxonomic annotation of protein clusters

Sequence quality control

Protein cluster co-occurrence with Pfams

Protein fold prediction

Multiple-sequence alignment

TrRosetta for initial screening

AlphaFold for final prediction

Searching for structural homologues

Clustering

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Consortia

Novel Metagenome Protein Families Consortium

Contributions

Corresponding authors

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Extended data figures and tables

Supplementary information

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links