Structure-guided metagenome mining to tap microbial functional diversity

Scientists now have access to millions of accurate three-dimensional (3D) models of protein structures. How do we leverage 3D structural models to learn about microbial functions encoded in metagenomes? Here, we review recent developments using protein structural features to mine metagenomes from diverse environments ranging from the human gut to soil and ocean viromes. We compare 3D protein structural methods to characterize antibiotic resistance phenotypes, nutrient cycling


Introduction: the 'microbial metaverse'
The term 'metaverse', a network of three-dimensional (3D) realms, has become a controversial social media buzzword. Controversies aside, interactive exploration of virtual protein realms in 3D is making news in the microbiome field as well. Landmark advances such as AlphaFold2 (AF2), which achieved an accuracy score on par with experimental approaches [1], redefined what is possible for 3D protein modeling at the scale of microbiomes. AF2 was critically enabled by decades of experimental work, the availability of large-scale sequencing databases, and advances in bioinformatics, graphics processing unit (GPU) technology, and deep-learning algorithms. Collectively, these factors brought us to the current state of the art in protein structure prediction. AF2 models must nonetheless be interpreted with caution because they have key limitations, such as the lack of cofactors, metal ions, and ligands, which tools such as AlphaFill [2] can attempt to supplement if homologous structures are available in the Protein Data Bank (PDB). Keeping these limitations in mind, we can now explore accurate 3D protein structural models for > 200 million proteins in the AlphaFold database [3] and an expanded set of > 600 million protein models from metagenomes through the ESM Metagenomic Atlas [4]. Scrolling seamlessly from the human body to rhizospheres to hot springs, we can navigate through hundreds of 3D-modeled metagenomic protein worlds instantaneously (Figure 1).
What does this mean for microbiologists? Specifically, how can 3D structural models provide insights into protein functions encoded in microbiomes? Until recently, querying metagenomes was largely limited to sequence-based approaches. In this review, we focus on how 3D protein modeling can enhance the detection of functional biochemical parts and pathways from microbiomes in a process known as 'metagenome mining' [5,6]. This review is organized across scales (Figure 2), starting with global 3D structural investigations of microbiomes with a case study on viromes. We then zoom in to examine features, including distance to ligand (DTL), active site residues, and structural fragments, for their potential to mine metagenomic functional diversity.

Gut feelings: the human microbiome in three dimensions
A decade before AF2, Turnbaugh et al. wrote of "viewing the human microbiome through three-dimensional glasses" [7], urging the integration of structural and metagenomic datasets for carbohydrate-active enzymes (CAZymes). AF2 has now enabled broader-scale structural comparisons across gut microbial CAZymes [8] and other biochemical parts and pathways [9]. For example, Rho and colleagues mined the structural diversity of bacterial sphingomyelinases as virulence factors and host immune system modulators [8].
The authors pinpointed key residues in a beta-hairpin apex loop that directly modulate substrate affinity and coupled structural alignments with new measurements of the hemolytic activity of metagenomic sphingomyelinases [8]. Similar structural metagenomic pipelines [10] proved useful to analyze gut microbial enzymes, including bile salt hydrolases [11], sulfatases [12], and β-glucuronidases [13]. These enzymes play crucial roles in catalyzing the reactivation of drugs and endobiotics [12]. Importantly, the application of 3D structural approaches guided the design of specific inhibitors targeting microbial β-glucuronidases with the aim of reducing the toxicity of the cancer drug irinotecan [14]. This example demonstrates how 3D structures can directly advance the development of new gut microbiome-targeted therapeutics.
Another pioneering study before AF2 targeted the gut microbiome to catalog antibiotic resistance determinants (ARDs) using structural homology modeling and pairwise structural alignments [15]. In this work, Ruppé et al. mined > 6000 ARDs from 3.9 million gut microbial proteins. Structure-informed algorithms were able to accurately differentiate between 'true' and 'false' ARDs, even though some ARDs in anaerobic gut bacteria are only distantly related in sequence to reference ARDs. Ruppé et al. validated their computational predictions for gut ARDs with new experiments and external datasets, including soil metagenomes. Remarkably, 3D-informed models showed 'read across' efficacy from the gut to the soil for the prediction of ARDs. Other work has highlighted the value of microbiome 'read across' and of using niche-specific protein families from different biomes (human gut, lakes, fermenters, and soil) for improved structural and functional prediction [16].

Going viral: structure-based queries of viromes
Metagenomics has also revealed stunning viral diversity, from the soil [17] to the sea [18]. Sequence-homology-based approaches face limitations for viruses, which 1) lack many conserved marker genes for universal comparisons and 2) have faster mutation rates than other microbes [19]. Viral sequences thus rapidly diverge and fall into the 'twilight zone', sharing 30% amino acid identity or lower with characterized reference sequences [20]. Since protein structures are three to ten times more conserved than nucleotide sequences [21], structural approaches may offer a sensitive solution for viral protein function prediction.

Figure 1. The microbial 'metaverse': a view of one million clustered 3D protein models from sequences of metagenomic origin made available through the ESM Metagenomic Atlas [4]. Brighter colors correspond to proteins that are 'unknown', whereas darker colors correspond to more 'known' proteins based on the nearest-neighbor score to proteins in the PDB and UniRef90. Proteins are clustered by structural similarity, as described by Lin et al. [4]. Data available under a CC BY 4.0 license for academic and commercial use. Copyright (c) Meta Platforms, Inc. All Rights Reserved.
Structural modeling has enabled the identification and annotation of entire new clades of viruses and their encoded protein families [22][23][24]. For example, Delmont and coworkers recently combined genome-resolved metagenomic analysis with AF2 to uncover a new candidate phylum of DNA viruses: the 'Mirusviricota' [24]. Structural searches were critical to detect the 'Mirusviricota' major capsid fold proteins that were difficult to find using sequence-based approaches due to their low sequence identity. Notably, AF2 revealed 'Mirusviricota' capsid models with an intermediate 'tower' height between the heights of bacteriophage and human-infecting virus major capsid proteins (Figure 3), hinting at a potential evolutionary trajectory for eukaryotic DNA viruses.
Many viral genomes also contain auxiliary metabolic genes (AMGs) [25], which can augment host metabolic abilities, for example, for nutrient acquisition [26]. Guided by structural models of AMGs from soil viromes, Jansson and coworkers discovered and expressed active virally encoded chitosanases bearing an unusual domain of unknown function [26]. This domain is now a useful structural marker to annotate viral AMGs involved in chitosan degradation for host nutrient acquisition.

A deeper dive: distance to ligand and catalytic residues
The case studies described above primarily relied on pairwise or multiple alignments of full-length protein structures. While highly effective, this type of global analysis may overwhelm the signal from local (sub)structural features. In this next section, we highlight alternative descriptors such as DTL, relative solvent accessibility, and catalytic residues for the detection of finer-grained biological signals.
Kiefl et al. led a pioneering study on 'structure-informed population genomics' [27]. The authors analyzed how the distribution of single-codon variants in the abundant SAR11 1a.3.V subclade of marine bacteria from Tara Oceans metagenomes varied with respect to DTL and relative solvent accessibility. In ocean sampling sites with low nitrate concentrations, the authors found fewer nonsynonymous mutations in metagenomic reads mapping to low-DTL regions of glutamine synthetase (GS) structures (Figure 4). Because GS is critical for nitrogen assimilation, Kiefl et al. suggested that this pattern supports the 'use it or lose it' evolutionary hypothesis: under low nitrate concentrations, GS cannot tolerate as many deleterious mutations near the ligand-binding site. Thanks to advances in 3D modeling, this study breaks new ground in using structural features to understand the evolution of nutrient cycling enzymes in a major marine bacterial clade at the population level [27].
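To make the DTL descriptor concrete, the following minimal Python sketch computes a per-residue distance to ligand and compares variant counts near and far from the ligand. It assumes one simple definition of DTL (the minimum atom-to-atom distance between a residue and the ligand); the cited study may define and compute DTL differently. The function names, the 10 Å cutoff, and the toy coordinates and variant counts are hypothetical and serve only as illustration.

```python
import math

def distance_to_ligand(residue_coords, ligand_coords):
    """Per residue, the minimum distance (in angstroms) from any of its
    atoms to any ligand atom. This is one simple, assumed definition of DTL."""
    return {
        res_id: min(math.dist(a, l) for a in atoms for l in ligand_coords)
        for res_id, atoms in residue_coords.items()
    }

def mean_variants_near_and_far(dtl, nonsyn_counts, cutoff=10.0):
    """Mean nonsynonymous variant count for residues at or below the DTL
    cutoff versus residues above it. The cutoff is a hypothetical choice."""
    near = [nonsyn_counts.get(r, 0) for r, d in dtl.items() if d <= cutoff]
    far = [nonsyn_counts.get(r, 0) for r, d in dtl.items() if d > cutoff]

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(near), mean(far)

# Toy data (hypothetical): residue id -> list of atom coordinates.
residues = {1: [(0, 0, 0)], 2: [(3, 0, 0), (4, 0, 0)], 3: [(20, 0, 0)]}
ligand = [(0, 0, 1), (1, 0, 0)]
variants = {1: 0, 2: 1, 3: 5}  # nonsynonymous variant counts per residue

dtl = distance_to_ligand(residues, ligand)
near_mean, far_mean = mean_variants_near_and_far(dtl, variants)
print(dtl, near_mean, far_mean)
```

In this toy case, residues close to the ligand carry fewer nonsynonymous variants than the distant residue, which is the kind of pattern the study reports for GS at low-nitrate sites.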
Catalytic residues can act as functional anchors to guide metagenome mining studies. Interestingly, distant proteins with essentially no sequence homology may share the same catalytic residues [28]. Structural comparisons can thus reveal convergent evolution for active site configurations among unrelated protein families [29]. This approach can also be reversed to identify novel protein functions based on unusual catalytic residues. For example, an activity-based screen of a soil-derived metagenomic library revealed a highly active and acid-stable endoglucanase with an atypical catalytic triad [30,31]. Future studies that target unusual catalytic residue configurations in metagenomes, so-called 'diversity-guided' approaches, have the potential to yield many inactive enzymes but also reveal new protein functions.

Figure 3. Surface models of major capsid proteins highlight evolutionary relationships inferred based on 'tower' heights [24], as shown for the human-infecting cytomegalovirus (Herpesvirales) in green, marine plankton-infecting virus ('Mirusviricota') in purple, and Escherichia-infecting phage HK97 (Caudoviricetes) major capsid proteins in blue.
Catalytic residue alignments helped Eiamthong et al. mine putative polyethylene terephthalate hydrolases (PETases) from marine and human saliva metagenomes [32]. Electrostatic potential mapping was further used to group proteins with patches of acidic, basic, or neutral residues near the active site in order to select, express, and characterize new metagenomic PETases. The authors then engineered the most active PETase to covalently bind PET by installing a nonproteinogenic 2,3-diaminopropionic acid residue in the substrate-binding pocket. This work combined active site conservation with innovation through expansion of the genetic code to yield a plastic-binding enzyme with extended biotechnological applications. PETase active site features identified here could also be applied in the future for the improved prediction of plastic degradation potential from metagenomes. For example, the incorporation of structural features would deepen the analysis of an existing sequence-based study reporting a correlation between pollution levels and plastic degradation enzyme abundances in environmental microbiomes [33].
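The idea of grouping candidates by the charge character of the region around the active site can be illustrated with a deliberately crude proxy. The sketch below does not perform electrostatic potential mapping as used in the cited study; it only counts formally charged residues within a radius of an assumed active site position. The function name, the 10 Å radius, the residue sets (histidine is ignored), and the toy coordinates are all hypothetical.

```python
import math

ACIDIC = {"ASP", "GLU"}  # formally negative side chains
BASIC = {"LYS", "ARG"}   # formally positive side chains (histidine ignored here)

def classify_active_site_patch(residues, site_center, radius=10.0):
    """Label the neighborhood of an active site as 'acidic', 'basic', or
    'neutral' from the net formal charge of residues within `radius`
    angstroms. `residues` is a list of (three-letter name, (x, y, z))."""
    net_charge = 0
    for name, coord in residues:
        if math.dist(coord, site_center) <= radius:
            if name in ACIDIC:
                net_charge -= 1
            elif name in BASIC:
                net_charge += 1
    if net_charge < 0:
        return "acidic"
    if net_charge > 0:
        return "basic"
    return "neutral"

# Toy example (hypothetical coordinates): the distant ARG is outside the radius.
toy_residues = [
    ("ASP", (1, 0, 0)),
    ("GLU", (0, 2, 0)),
    ("LYS", (3, 0, 0)),
    ("ARG", (50, 0, 0)),
]
patch_label = classify_active_site_patch(toy_residues, (0, 0, 0))
print(patch_label)
```

A real analysis would compute the electrostatic potential on the protein surface with dedicated software; the residue-counting proxy is only meant to show how candidates could be binned into groups before selection for expression.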

Beyond static structures: dynamic descriptors
Until this point, this review has focused on static structural features. However, proteins are dynamic entities typically occupying a range of different states. Taking conformational dynamics into account can advance predictions of enzyme function and substrate specificity [34]. For example, an activity-guided screen of soil bacteria unexpectedly revealed defluorination activity in some haloacid dehalogenase family proteins [35]. Sequence-based analysis and comparison of static 3D structural features failed to yield clear distinctions between enzymes with and without defluorination activity [36]. Through molecular dynamics (MD) simulations, Chan and coworkers found that the active sites of defluorinating enzymes sampled more closed configurations than those without defluorination activity. Given the smaller atomic radius of fluorine relative to other halogens, these results suggest conformational dynamics are particularly useful in cases such as this, where substrates selectively bind and proteins undergo conformational changes [36].
Protein-small-molecule docking and quantum mechanical (QM) calculations can also yield useful descriptors, as recently demonstrated in a study using machine learning to predict the substrate specificity of nitrilases, primarily from rhizosphere-associated bacteria [37]. Parks and colleagues used a molecular docking approach that takes into account the flexibility of the ligand and protein side chains. The authors calculated a suite of other descriptors, including atomic partial charges, molecular dipole moments, and HOMO-LUMO gaps. Interestingly, the attractive portion of the Lennard-Jones potential calculated based on protein-ligand docking was the most important feature for model performance. Most static QM descriptors did not significantly improve prediction accuracy, at least for this study. Overall, QM descriptors should be included with caution: they can decrease the ability to extrapolate to new systems but may improve functional predictions [38].
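For readers unfamiliar with the descriptor, the standard 12-6 Lennard-Jones potential is V(r) = 4ε[(σ/r)^12 − (σ/r)^6], where the r^-6 term is the attractive part. The sketch below separates the two terms and sums the attractive term over protein-ligand atom pairs. It is an illustration only: docking programs use their own scoring functions and atom-type-specific parameters, and the single ε and σ values, the function names, and the toy coordinates here are hypothetical.

```python
import math

def lj_terms(r, epsilon=0.2, sigma=3.5):
    """Return (repulsive, attractive) terms of the 12-6 Lennard-Jones
    potential at distance r. epsilon and sigma are hypothetical values."""
    sr6 = (sigma / r) ** 6
    repulsive = 4 * epsilon * sr6 ** 2
    attractive = -4 * epsilon * sr6
    return repulsive, attractive

def attractive_lj_energy(protein_atoms, ligand_atoms, epsilon=0.2, sigma=3.5):
    """Sum the attractive (r^-6) term over all protein-ligand atom pairs,
    using one shared parameter pair for simplicity."""
    total = 0.0
    for p in protein_atoms:
        for l in ligand_atoms:
            total += lj_terms(math.dist(p, l), epsilon, sigma)[1]
    return total

# Toy pose (hypothetical coordinates in angstroms).
protein = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
ligand = [(2.0, 3.0, 0.0)]
print(attractive_lj_energy(protein, ligand))
```

The resulting sum is a single number per docked pose, which is the form in which such a descriptor could be passed to a machine-learning model alongside other features.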
Both of the above studies were limited to tens of proteins. At the moment, it is not reasonable to run long MD simulations for metagenomic libraries consisting of thousands of large biomolecules. However, methods to speed up or estimate protein dynamics are rapidly developing [39]. For example, machine learning-based approaches to estimate molecular dynamic fingerprints [40] are currently applied to small-molecule libraries but offer promise for future metagenome mining efforts.

Conclusions: the final [functional] frontier
This review has covered structure-guided metagenome mining for proteins inhabiting known areas of sequence space. We have so far neglected all the 'unknowns', including proteins that are not detected or lack functional annotation. Pipelines such as AGNOSTOS [41] are useful to group homologous unknown genes into clusters and integrate contextual data (for example, ecological and phylogenetic data) to infer functions. AGNOSTOS and other workflows are primarily sequence-based, but structural integration is advancing rapidly. Structural features are already incorporated into metagenomic workflows such as anvi'o structure [27], which will augment the analysis of unknowns from microbiomes.
Beyond sequence-based features, we encourage microbiologists to consider alternative representations of proteins using language models or 'structural alphabets'. Steinegger and coworkers recently pioneered a new type of structural alphabet capturing 3D tertiary interactions between residues. The implementation of this alphabet into Foldseek [42,43] and Foldseek cluster yielded structural comparison algorithms with massive gains in speed while maintaining the accuracy of state-of-the-art methods. Durairaj et al. developed a different approach, parsing protein structures as geometric shapes broken down into smaller, rotation-invariant structural units known as 'shape-mers' [44]. Shape-mer comparisons enabled identification of structural outliers in the AlphaFold database through comparison with the entire PDB [45]. This study identified proteins with rare and novel features or entire new folds, such as the β-flower fold, as targets for future experiments [45], thus focusing the vast search space.
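Once a structure has been encoded as a string of discrete structural letters, string-style comparisons become possible. The sketch below shows one generic way this could work, using k-mer sets and Jaccard similarity to flag a query that resembles nothing in a reference set. This is not the algorithm implemented in Foldseek, nor the shape-mer method of Durairaj et al.; the encoded strings, function names, and the choice of k are hypothetical.

```python
def kmers(encoded, k=3):
    """Set of overlapping k-mers from a structure-encoded string."""
    return {encoded[i:i + k] for i in range(len(encoded) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity between the k-mer sets of two encoded strings.
    Returns 0.0 when neither string is long enough to contain a k-mer."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0

def max_similarity_to_references(query, references, k=3):
    """Highest similarity of a query to any reference; a low value marks
    the query as a candidate structural outlier."""
    return max((jaccard(query, ref, k) for ref in references), default=0.0)

# Toy strings (hypothetical): each letter stands for one structural state.
references = ["ABCDEFG", "ABCDXYZ"]
familiar = "ABCDEFH"
outlier = "QRSTUVW"
print(max_similarity_to_references(familiar, references))
print(max_similarity_to_references(outlier, references))
```

Set-based comparisons of this kind avoid explicit 3D superposition, which gives some intuition for how discretized structural representations can make large-scale searches tractable.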
Finally, this review has focused on 'classical' protein-coding sequences. Yet, much of functional sequence space also falls outside of this narrow definition, including nucleic acid enzymes (i.e., ribozymes) [46], intrinsically disordered proteins [47], and small proteins [48]. Many proteins adopt multiple 3D states and additionally can form amyloids or biomolecular condensates that 3D modeling approaches such as AF2 do not (yet) adequately predict [49,50]. The 3D distribution and assembly of different molecular players within microbiomes is not well understood, nor is this process fully predictable to date. We have not yet arrived at holistic, dynamic, and immersive integration of data across biological scales from molecules to cells to hosts and their environments. Like its virtual counterpart, the 'microbial metaverse' still awaits full realization.

Figure 1

Figure 2

Figure 3

Figure 4