VB12Path for Accurate Metagenomic Profiling of Microbially Driven Cobalamin Synthesis Pathways

ABSTRACT Cobalamin (vitamin B12; VB12) is an indispensable nutrient for all living entities in the Earth’s biosphere and plays a vital role in both natural and host environments. Currently in the metagenomic era, gene families of interest are extracted and analyzed based on functional profiles by searching shotgun metagenomes against public databases. However, critical issues exist in applying public databases for specific processes such as VB12 biosynthesis pathways. We developed a curated functional gene database termed VB12Path for accurate metagenomic profiling of VB12 biosynthesis gene families of microbial communities in complex environments. VB12Path contains a total of 60 VB12 synthesis gene families, 287,731 sequences, and 21,154 homology groups, and it aims to provide accurate functional and taxonomic profiles of VB12 synthesis pathways for shotgun metagenomes and minimize false-positive assignments. VB12Path was applied to characterize cobalamin biosynthesis gene families in human intestines and marine environments. The results demonstrated that ocean and human intestine had dramatically different VB12 synthesis processes and that gene families belonging to salvage and remodeling pathway dominated human intestine but were lowest in the ocean ecosystem. VB12Path is expected to be a useful tool to study cobalamin biosynthesis processes via shotgun metagenome sequencing in both environmental and human microbiome research. IMPORTANCE Vitamin B12 (VB12) is an indispensable nutrient for all living entities in the world but can only be synthesized by a small subset of prokaryotes. Therefore, this small subset of prokaryotes controls ecosystem stability and host health to some extent. However, critical accuracy and comprehensiveness issues exist in applying public databases to profile VB12 synthetic gene families and taxonomic groups in complex metagenomes. In this study, we developed a curated functional gene database termed VB12Path for accurate metagenomic profiling of VB12 communities in complex environments. VB12Path is expected to serve as a valuable tool to uncover the hidden microbial communities producing this precious nutrient on Earth.

synthesized by a small subset of prokaryotic groups (6), while other microorganisms need exogenous supplies. Exogeneous cobalamin can be obtained mainly through two approaches: direct interactions with producers and decomposition of cells containing cobalamin (7). The production, transformation, and circulation of cobalamin play a fundamental role in structuring microbial community diversity and activity as well as strengthening interactions between microorganisms. For example, while phytoplankton accounts for about half of the global primary production, most of the eukaryotic phytoplankton in the surface ocean are VB 12 auxotrophs (8), and several studies have confirmed that the availability of cobalamin can limit the growth and activity of phytoplankton and thus primary productivity (7). Cobalamin also has potential major effects on biogeochemical cycling driven by microorganisms and a wide variety of general microbial processes involved in gene expression regulation (9), CO 2 fixation, DNA replication and repair, and synthesis of amino acids (10). Recent research on Nitrospira spp. has shown that there are close links between cobalamin production and aerobic nitrogen cycle (11). Also, in human beings, cobalamin functions as the "anti-pernicious anemia factor" and holds a certain significance of resisting the development of pernicious anemia in animals, shaping the composition of human gut microbial communities (4). Over 83% of sequenced human gut microbial strains possess enzymes that are dependent on VB 12 (3). The human gut microbiome is also expected to synthesize different forms of cobalamin as the essential micronutrients (1). In addition, vitamin B 12 deficiency could cause humans to produce methylmalonic aciduria and homocysteinemia that eventually lead to devastating diseases and affect the human nervous system (5). Therefore, revealing the taxonomic and functional composition of microbial communities involved in cobalamin biosynthesis and transformation in various ecosystems is of crucial importance to understand the cobalamin-based microbial interdependencies that sustain the ecosystem's multifunctioning, stability, and human/host health.
Over the past years, efforts and progress have been made to expand our understanding and knowledge of the microbial processes that are responsible for cobalamin biosynthesis and transformation. For example, advances in methodology allow direct measurements of cobalamin in various environmental samples (12). The employment of analytical chemistry techniques helps identify genes involved in cobalamin circulating processes (13). It is known that cobalamin is synthesized exclusively by a subset of bacteria and archaea (10) that requires more than 30 enzymatic steps involving a total of 60 gene families, including the BtuBFCD transporter (3) and ATP-binding cassette (ABC) transport system. The synthesis of cobalamin can be divided into two categories, namely, de novo pathway and salvage and remodeling pathway. The de novo pathway can be further divided into aerobic pathway and anerobic pathway according to the timing of cobalt insertion and the oxygen requirements (11). Most reactions of aerobic and anaerobic pathways share similar metabolic steps, making it difficult to distinguish corresponding gene families due to their high homology. For instance, most methyltransferases involved in peripheral methylation reactions have been confirmed with high sequence similarity (14). Genes belonging to cysG and cobA are also highly homologous due to similar functional domains (15). Traditionally, corresponding gene families belonging to aerobic and anaerobic pathways were respectively named using different prefixes as cob and cbi from different organisms (16). Notably, the corresponding cbi and cob gene families involved in aerobic and anaerobic pathways, respectively, are usually mixed together as the same orthologous group in public orthology databases.
Recent advances in high-throughput metagenome sequencing have made community-level functional and taxonomic profiling of complex microbial communities a routine application in microbial ecology studies (17). Microbial species encoding proteins of VB 12 biosynthesis pathways and their relationships with other organisms in complex natural ecosystems and host environments can also be studied in an elegant manner, such as via the combination of metagenomic and chemical characterization of cobalamin production in soils (11). To do so, metagenome sequences are searched against public ortholog databases, e.g., KEGG (Kyoto Encyclopedia of Genes and Genome) (18), eggNOG (evolutionary genealogy of genes: Nonsupervised Orthologous Groups) (19), and COG (Clusters of Orthologous Groups of proteins) (20), based on which gene families of interest could be extracted and subjected to further detailed analyses. However, as we reported previously (21), there are many challenges for accurate profiling of functional gene families with public ortholog databases. First, orthologous groups in most orthology databases are in fact sequence clusters based solely on sequence identities and are therefore usually mixtures of multiple different gene families. Second, our most recent knowledge and understanding of these specific gene families and pathways are usually not reflected in these public ortholog databases. Third, searching against large-scale comprehensive databases is computationally expensive and timeconsuming, while searching against customized databases containing extracted gene families can easily lead to high false-positive assignments due to small database issues. Taking VB 12 synthesis gene families for example, gene families belonging to aerobic and anaerobic processes are usually mixed together in the same orthologous group and thus can hardly be separated in database searching. These issues generally lead to inaccurate and less-useful profiling of functional gene families at the cost of extensive computational resources and time.
In this study, we aimed to develop a curated functional gene database as well as tool sets (VB 12 Path) for metagenomic profiling of gene families involved in VB 12 biosynthesis pathways. VB 12 Path held multiple advantages compared to public ortholog databases, such as accurate definition of gene families and sequences, minimum false-positive assignment by comprehensively including homologous gene families, clear separation of homologous gene families in aerobic and anaerobic processes, and both functional and taxonomic profiling tool sets for VB 12 gene families. VB 12 Path was applied to profile both functional gene families and taxonomic compositions of VB 12 biosynthesis pathways in human gut and ocean environments. The results demonstrated that VB 12 Path is a useful tool for analyzing VB 12 biosynthesis pathways of microbial communities from different environments via metagenome sequencing.

RESULTS
Summary of gene families involved in VB 12 synthesis pathways. For VB 12 Path, a total of 60 key gene families involved in VB 12 biosynthesis were selected, targeting metabolic processes of VB 12 biosynthesis pathways in different microorganisms, including de novo biosynthesis pathways (aerobic and anaerobic) and salvage and remodeling pathways (Table S1). Since multiple gene families were shared by these different pathways, we further divided the complex VB 12 pathways into five different processes, including precorrin-2 synthesis, aerobic, anaerobic, salvage and remodeling, and post-AdoCbi-P processes (Fig. S1). A total of 4,079, 168,630, and 287,731 sequences were collected for the seed database, the core database, and the full database, respectively. In addition to targeted gene families, 21,154 homologous groups were also identified and included in VB 12 Path, with the purpose to minimize false-positive assignments. Among these homologous groups, 650 were from arCOG database, 770 from COG database, 17,546 from eggNOG database, and 2,188 from KEGG database (Table 1). To illustrate the extent that "small database" may affect metagenomic profiling, five randomly selected TARA Oceans metagenomes were profiled against the core and full database of VB 12 Path. The results suggested that about 52% of sequences targeted by VB 12 gene families in the core database could be better assigned to their homologs in the full database (Fig. S2), suggesting that small databases not considering their homologs could be a critical issue for functional assignment in metagenomics.
Precorrin-2 synthesis processes. Eight gene families are known to be responsible for d -aminolevulinate (ALA) formation and precorrin-2 synthesis from ALA. These gene families are shared by both aerobic and anaerobic biosynthesis pathways. Three gene families, including hemA, hemL, and gltX, are responsible for ALA synthesis via C4 and C5 pathways. Conversion of ALA to uroporphyrinogen-III is carried out by enzymes encoded by hemC, hemB, and hemD. Precorrin-2 is then synthesized via methylation of uroporphyrinogen-III at positions C2 and C7. This process is catalyzed by enzymes  NA, not available. VB 12 PATH for Metagenomic Profiling encoded by cobA and cysG. A total of 154,108 sequences and 8,652 homologous groups were present for these eight gene families in VB 12 Path. Aerobic pathway. Gene families encoding enzymes that convert precorrin-2 to adenosylcobinamide phosphate were selected for aerobic biosynthesis pathway. A total of 18 gene families were recruited, including cobI, cobG, cobJ, cobM, cobF, cobK, cobL, cobH, cobB, cobN, cobS-co, cobT-co, cobR, cobO, cobQ, cobC-beta, cobD, and cobP. Among them, conversion of hydrogenobyrinic acid a,c-diamide to cob(II)yrinic acid a,cdiamide through cobalt chelation is carried out by enzymes encoded by cobN, cobS, and cobT. The aerobic and anaerobic pathways converge after coby(II)rinic acid a,c-diamide. Adenosylcobyric acid is then formed via adenylation and amidation of cob(I) yrinic acid a,c-diamide. This process is catalyzed by enzymes encoded by cobR, cobO, and cobQ, and (R)-1-amino-2-propanol is directly attached at the f position of the carboxyl group of adenosylcobyric acid and converted to adenosylcobinamide phosphate (AdoCbi-P). This process is catalyzed by proteins a and b encoded by cobC-beta and cobD. A total of 52,038 sequences and 2,910 homologous groups were present for these 18 gene families in VB 12 Path.
Salvage and remodeling pathway. The salvage pathway is a cost-effective way for bacteria and archaea to obtain cobalamin without consuming much energy. Gene families encoding enzymes that synthesize adenosylcobinamide phosphate (AdoCbl-P) from cobinamide were screened for salvage and remodeling pathway. A total of 13 gene families were recruited, including btuB, btuF, btuC, btuD, cobA, cobO, eutT, pduO, cobP, cobU-ade, cobY, cbiB, and cbiZ. Among these, the BtuBFCD transporter system in bacteria has received much attention from biologists. Cobinamide is transported into the cell via BtuBFCD transporter systems, which are encoded by btuB, btuF, btuC, and btuD. Cobinamide is adenosylated by ATP:co(I)rrinoid adenosyltransferases and converted to adenosylcobinamide (AdoCbi). This process is catalyzed by enzymes encoded by cobA, eutT, pduO, and cobO, and then AdoCbi is converted to AdoCbi-P by enzymes encoded by cobU-ade, cobP, cbiB, cbiZ, and cobY. A total of 30,175 sequences and 1,691 homologous groups were present for these 13 gene families in VB 12 Path.
Post-AdoCbi-P pathway. Twelve gene families were responsible for adenosylcobalamin (AdoCbl) synthesis from AdoCbi-P. These gene families were shared by both de novo and salvage and remodeling biosynthesis pathways. Conversion of AdoCbi-P to adenosylcobalamin 59-phosphate (AdoCbl 59-P) is carried out by enzymes encoded by cobU-ade, cobP, and cobS-gdp. Another alternative step about the synthesis of AdoCbl 59-P is through 5,6-dimethylbenzimidazole (DMB). Three gene families, including bluB, fre, and ubiB, are responsible for DMB synthesis. Then, DMB is converted to AdoCbl 59-P by enzymes encoded by cobU-alpha, cobT-alpha, and cobV. Finally, AdoCbl is synthesized via dephosphorylation of AdoCbl 59-P. This process is catalyzed by enzymes encoded by cobC-ado. A total of 53,191 sequences and 7,303 homologous groups were present for these 12 gene families in VB 12 Path.
Comparative analysis of VB 12 gene families in VB 12 Path and public orthology databases. VB 12 Path was compared with other major public orthology databases in terms of coverage, accuracy, and specificity. Mapping information between VB 12 gene families and orthologous groups was obtained during the full database construction process. First, the coverage of VB 12 gene (sub)families in VB 12 Path and other databases was compared to evaluate the integrity of VB 12 biosynthesis pathways in different databases. The result showed that a total of 60 gene (sub)families were included in VB 12 Path, while the number of orthologous groups involved in VB 12 biosynthesis found in arCOG, COG, eggNOG, and KEGG orthology databases was 27, 34, 39, and 46, respectively (Fig. 1A). Gene families such as pduO, pduS, and ubiB that were also of great importance were missing in these four orthology databases. Second, the accuracy of VB 12 gene families in public orthology databases was analyzed (Fig. 1B). Surprisingly, only a few gene families, including cobQ, hemA, hemB, hemC, and hemL, were represented with high accuracy in these public databases, i.e., the corresponding orthologous groups were composed by sequences belonging to these gene families. Such inaccuracy in gene family definition and representative sequences would eventually result in inaccurate functional profiles of VB 12 gene families when annotating shotgun metagenomes using these orthology databases. Third, homologous gene families were rarely distinguished in these public orthology databases, while they usually played different functions in VB 12 biosynthesis pathways. One such example was cysG and cobA, which were always mixed together as one orthologous group in almost all orthology databases due to the high sequence similarity shared by these genes. Other homologous gene families with similar issues included cobR and cbiH, cobH and cbiC, cobM and cbiF, cobJ and cbiH, etc. These results suggested that VB 12 Path was a comprehensive, accurate, and discriminative functional gene database of VB 12 biosynthesis gene families.
The accuracy of VB 12 Path was further illustrated using metagenomic data and an artificial data set. The results were compared against one of the best current alternatives, the KEGG database. In the metagenomic data set, the relative abundances of VB 12 gene families identified by KEGG were generally higher than those identified by VB 12 Path (Fig. S3). This is because, in addition to VB 12 biosynthesis genes, their homologs were always included in the KEGG orthologous groups, thereby leading to higher relative abundances of VB 12 gene families. In addition, a high number of gene families were mixed in a single orthologous group, leading to ambiguous interpretation of the results. In the artificial data set, a total of 270 protein sequences were retrieved from the NCBI GenBank database, of which 143 belonged to VB 12 biosynthesis gene families, 57 were their homologs, and the remaining 70 were unrelated sequences. As a result, all VB 12 biosynthesis genes were correctly assigned to their targeting gene families in VB 12 Path, while their homologs and unrelated sequences were not assigned. For KEGG, however, both false-positive (23%) and false-negative (3%) observations were found for VB 12 biosynthesis gene families (Table S2). A high rate of false positives (22%) was also observed for unrelated sequences, though they are not related to VB 12 biosynthesis. Such accuracy issues in public orthology databases (e.g., KEGG) should be due to the mixed status of targeted gene families and their homologs in the same orthologous group. Notably, considering the extremely high complexity of metagenome sequences, false positives cannot be completely avoided by any methodologies but can only be minimized with our best current knowledge.
VB 12 biosynthesis pathways in sequenced microbial genomes. Sequenced microbial genomes from the NCBI RefSeq can be mapped to the VB 12 Path to investigate how VB 12 biosynthesis gene families are distributed across different microbial lineages. Overall, cobalamin biosynthesis pathways were distributed mainly in bacterial genomes, which accounted for approximately 98.07% of the total number of sequences ( Fig. 2A). A total of 58 phyla, 95 classes, 224 orders, 494 families, and 2,488 genera were detected to have VB 12 biosynthesis-related gene families in their genomes. Among these, in the bacterial domain, Proteobacteria (57.24%), Actinobacteria (18.16%), and Firmicutes (11.01%) were the most abundant phyla with VB 12 biosynthesis gene families (Fig. 2B). At the family level, Enterobacteriaceae (8.04%) was the most abundant family, followed by Pseudomonadaceae (7.64%) and Streptomycetaceae (6.90%). At the genus level, Pseudomonas (7.57%), Streptomyces (6.69%), and Vibrio (3.37%) were the most abundant genera that had VB 12 biosynthesis genes. A relatively small number of archaea species were found with VB 12 biosynthesis gene families ( Fig. 2A and C). Among these, Euryarchaeota (85.33%) had the highest abundance, followed by Crenarchaeota (6.20%) and Thaumarchaeota (4.31%). At the pathway level, precorrin-2 synthesis process was the most abundant process, with 207,401 sequences, followed by aerobic pathway (138,130 sequences), post-AdoCbi-P pathway (95,934 sequences), salvage and remodeling pathway (91,337 sequences), and anaerobic pathway (81,878 sequences). Notably, the taxonomic composition of these five pathways was generally similar at the phylum level with slight differences for aerobic pathways.
Application of VB 12 Path to human gut and ocean metagenome analysis. To show how VB 12 Path can be used to characterize VB 12 biosynthesis pathways in complex metagenomic data, shotgun metagenome data sets from human gut and ocean samples were searched against VB 12 Path to profile the taxonomic composition and functional potential involved in the synthesis of cobalamin. As expected, PCoA clustering of taxonomic and functional gene profiles suggested that the VB 12 biosynthesis microbial assemblages in human intestine and ocean had dramatically different taxonomic and functional gene compositions (Fig. 3). A total of 45 phyla, 84 classes, 187 orders, 416 families, 1,524 genera, and 5,826 species involved in VB 12 biosynthesis were detected in marine samples. In human intestinal samples, a total of 28 phyla, 70 classes, 154 orders, 336 families, 985 genera, and 2,350 species were detected. The taxonomic composition of ocean and human intestine microbial communities associated with VB 12 biosynthesis was dramatically different even at the phylum level. Cyanobacteria (40.01%), Proteobacteria (38.09% relative abundance), and Bacteroidetes (8.59%) were predominant groups in marine samples, while Bacteroidetes (81.19%), Firmicutes (12.10%), and Proteobacteria (0.95%) were predominant in intestinal samples. At the family level, the predominant families recovered from ocean were Prochloraceae (39.18%), Pelagibacteraceae (12.23%), and Alteromonadaceae (7.56%) (Fig. 4A), while in the human intestine, Bacteroidaceae (64.53%) was the most abundant group, followed by Lachnospiraceae (9.91%) and Ruminococcaceae (6.15%) (Fig. 4C).
Functional gene profiles for VB 12 biosynthesis pathways were also dramatically different between ocean and human intestine samples. In ocean, gene families belonging to precorrin-2 synthesis pathway and aerobic pathway accounted for about 59% and 22% of the total sequences, respectively. Gene families belonging to salvage and remodeling pathway had the fewest sequences in ocean samples, which accounted for about 5% of the total sequences identified by VB 12 Path. In the human intestine, however, gene families involved in salvage and remodeling pathway had the most sequences, which accounted for about 60% of the identified sequences. Gene families belonging to aerobic pathway were found with the fewest assigned sequences in human intestine samples, which accounted for about 7% of the sequences. Among all VB 12 biosynthesis gene families, hemA, hemB, hemC, hemL, and gltX were abundant in both marine and intestinal samples (Fig. 4B and D), while gene families including hemD, ubiB, cobV, cbiP, and cbiZ were rarely detected in both marine and intestine samples. Importantly, it was also noticed that microbial taxonomic groups carrying VB 12 gene families varied highly in both human gut and ocean ecosystems, but the functional gene families were relatively stable, as observed in different angles such as ordination analysis, relative abundance, and Bray-Curtis dissimilarity indices (Fig. 3, Fig. 4, and Fig. S4). The results suggest that VB 12 Path can be used as a useful tool to explore both functional gene and taxonomic profiling of VB 12 biosynthesis microbial communities from different environments.

DISCUSSION
Cobalamin (VB 12 ), hailed as "nature's most beautiful cofactor" (22), is of essential importance to natural ecosystems and human health. As a micronutrient that can be synthesized by only a selective group of microbes, VB 12 plays important roles in structuring the composition of microbial community in various ecosystems (4,11,23,24), modulating various biogeochemical cycles (9,25), and controlling the life and death of human intestinal microbes (1). A series of processes, such as ecosystem stability, microbial growth and development (5), human cardiovascular health, cognitive function  (26), and intestinal microbial ecology (4,25), are affected by the availability of VB 12 . In recent years, many studies have been conducted in marine, terrestrial, and human gut systems owing to growing interests in VB 12 in different ecosystems. These studies focused on the quantification and classification of genes encoding cobalamin biosynthesis proteins, identification of multiple cobalamin transporters, and the combined effects of cobalamin and several other factors on the microbiota and environment (3,5,24,27). Analysis of terrestrial metagenomes indicated that less than 10% of the total microbial community had a complete VB 12 biosynthesis pathway (11), while most organisms require cobalamin to complete a series of necessary biological processes. Surprisingly, more than 98% of the total cobalamin demand of human gut microbiota requires exogenous supply to maintain the structure and function of gut microbial communities (3,4). As elaborate processes are required to produce this cofactor, it has been shown that only a subset of bacteria and archaea can complete this process, emerging the "corrinoid economy" within the gut (1). Therefore, VB 12 has been considered a critical limiting factor that affects the growth of other organisms and their interactions (2). It has been speculated that the producer of cobalamin serves a "keystone function" (11), akin to the notion of keystone species, and may greatly affect the biogeochemical functioning of microbial communities in various ecosystems. In addition, the VB 12 biosynthesis capabilities could affect the behavioral characteristics of organisms, as well as physiological and developmental functions, such as enhancing the predatory feeding behavior of Pristionchus nematodes, resulting in surplus killing (28), and affecting the volume, growth rate, and life span of the organism (5). Most importantly, some neuronal circuits may be directly or indirectly affected by VB 12 (5). Therefore, exploring the cobalamin biosynthesis processes and the taxonomic composition of the cobalamin producers is of great significance to clarify the stability and versatility of natural ecosystems and gut microbial ecology.
The developed VB 12 Path is a useful tool for studying cobalamin biosynthesis pathways because it provides accurate functional and taxonomic profiles of the pathway gene (sub)family of microorganisms in complex environments. Accuracy is a central issue for functional gene databases like VB 12 Path, which ensures high accuracy for subsequently generated functional profiles. The accuracy of VB 12 Path is guaranteed at two different levels. The first one is at the gene family level. An accurate and complete gene catalog in the database can usually be used as a reliable reference to help compile large-scale shotgun sequencing tasks (29,30). VB 12 Path covers 60 key gene families involved in cobalamin biosynthesis, allowing researchers to obtain comprehensive functional gene profiles for microbial communities responsible for cobalamin biosynthesis in complex environments. In addition to commonly known gene families, VB 12 Path also covered a series of gene families that are currently missing in public databases, such as ubiB for DMB synthesis, pduO for cobinamide adenylation (31,32), and cobR for reduction of cob(II)byrinic acid a,c-diamide (33). In addition, VB 12 Path also distinguished gene families that share high sequence identity and are usually merged as one orthologous group in public databases, although these gene families play different functions, such as cobC, cobS, cobT, and cobU. The second level of accuracy lies at the sequence level. Accurate functional and taxonomic profiling relies not only on accurate definition of gene families but also on accurate recruitment of sequences for these gene families. Such high accuracy in sequence databases is of critical importance for reliable postprocessing metagenomic data analysis, such as inferring metabolic potential and characterizing functional and taxonomic compositions (34,35). To ensure the accuracy of retrieved sequences, multiple actions were taken by VB 12 Path. First, seed sequences were downloaded from Swiss-Prot database, in which all sequences were manually curated by experts in the field (36). All seed sequences were subject to a second manual checking before downloading. Second, sequences downloaded from TrEMBL database were first checked against seed sequences based on sequence identity and then assigned to corresponding gene families based on majority rules if homologs exist. Third, sequences from public orthologous databases were also recruited, which were double checked based on sequence identity and majority rules. Meanwhile, their corresponding homologous orthologous groups were also collected to eliminate false-positive assignments in metagenomic analyses. In addition, we also merged NCBI RefSeq database with VB 12 Path. All these steps guaranteed a comprehensive and accurate coverage of highquality sequences in VB 12 Path, providing accurate functional gene assignment for shotgun metagenome sequencing analysis.
In the metagenomic era, a large amount of data is generated with the development of next-generation sequencing technology (17), posing huge challenges in processing and analyzing such large data sets. To expedite this procedure, development of customized databases has emerged (21,37,38). As an essential micronutrient required by almost all organisms in both natural and artificial ecosystems, cobalamin is one of the most frequently analyzed cofactors in the environments, and its biosynthesis pathways have attracted more and more attention (5,11,23). By referencing the framework for knowledge-based database construction in NCycDB (21), VB 12 Path was developed to expedite data mining of gene families responsible for VB 12 biosynthesis in complex metagenomes. VB 12 Path not only inherited critical characteristics of knowledge-based small databases such as comprehensive coverage of gene families, high accuracy, fast database searching, and low false-positive assignments but also provided solutions for both functional and taxonomic profiling of microbial communities involved in VB 12 biosynthesis processes, thereby achieving in-depth insights into microbial VB 12 biosynthesis pathways at both functional gene and taxonomic levels.
We applied VB 12 Path to identify and explore VB 12 biosynthesis gene families in the ocean and human intestinal environments. The taxonomic profiles are consistent with previous reports that Cyanobacteria (39)(40)(41), the most representative producer of pseudocobalamin, is the predominant group of cobalamin producers in ocean environment. Proteobacteria presents as the second most abundant group (23). Studies have shown that the production of pseudocobalamin is likely an ancient relic (27), but the reason for the production and replacement of cobalamin is unknown. Some researchers have suggested that due to impaired DMB synthesis, adenine is used to replace DMB in certain organisms to produce pseudocobalamin (42). Although most sequences are assigned to Cyanobacteria, considering the cobalamin synthesis potential, in addition to the genomic potential, cobalamin cell quotas, abundance, and environmental factors must also be considered (27). At the class level, microbial genera belonging to Gammaproteobacteria and Alphaproteobacteria appear to be the most abundant groups, consistent with previous inferences (8,23), as these groups contain many genes involved in corrin biosynthesis and DMB synthesis and activation. In the human intestinal environment, dramatically different microbial functional gene and taxonomic compositions of VB 12 synthesis pathways were observed compared to those observed in the ocean environment. Such distinction could be due to the huge differences in environmental conditions between ocean and human intestine. Interestingly, for both ocean and human intestine, it could be observed that the compositions of functional gene families were more similar among different samples than were those of taxonomic groups, even at the phylum level. This is generally consistent with a previous study in human gut that found that different individuals hold a core microbiome at the gene level rather than at the lineage level (43). Similar patterns were also observed in the TARA Oceans microbiome study (44). In addition, according to the taxonomic profile, it was noticed that fewer than 5% of the sequences were assigned to archaea. This may be partly due to the reason that a relatively small number of archaea were sequenced, but another reason may be the fact that bacteria dominate VB 12 biosynthesis processes in ocean and human intestine environments.
Taken together, VB 12 Path is a curated functional gene database that can be used as a toolkit for metagenomic profiling of VB 12 biosynthesis gene families in various complex environments. VB 12 Path integrates multiple homology databases with manual verification to improve the coverage and accuracy of gene families and sequences. By applying VB 12 Path to marine and human intestine metagenomic samples, microbial gene (sub)families responsible for VB 12 biosynthesis in ocean and human intestines were characterized. The profiles show that, while the taxonomic groups involved in VB 12 biosynthesis vary highly, VB 12 synthetic gene families are highly consistent among different samples. In conclusion, VB 12 Path provides a platform for shotgun metagenome data analysis, revealing accurate taxonomic and functional profiles for VB 12 biosynthesis gene families of microbial communities from different environments.

MATERIALS AND METHODS
Framework development. Based on the framework previously shown in NCycDB (21), a pipeline was developed for VB 12 Path with moderate modifications (Fig. 5). Overall, the pipeline encompassed four major steps, including seed database construction, core database construction, full database construction, and metagenomic profiling.
Seed database construction. A seed database was first constructed for VB 12 Path. The seed database needs to contain highly accurate reference sequences for each gene family involved in VB 12 biosynthesis, thereby ensuring the accuracy of VB 12 Path at the very beginning. Here, curated reference sequences from the Swiss-Prot database (45) were retrieved. To do so, we first determined the specific processes of VB 12 biosynthesis by referring to multiple literatures (11,23,31) and the VB 12 synthesis pathways in KEGG database (18). Key gene families involved in the VB 12 biosynthesis pathways were then identified. For each gene family, keywords were determined based on their gene name and corresponding protein name (Table S1). The keywords were further refined by manually checking the searching results. In some cases, logical operators were used in order to include wanted or exclude unwanted sequences so that the gene family was most appropriately described (e.g., gene:hemA name:"glutamyl-trna reductase" NOT name:"multifunctional siroheme biosynthesis protein hemA"). Notably, some genes may have multiple protein names and play different roles in different processes in VB 12 synthetic pathways. One such example is cobC, which is the gene name for threonine-phosphate decarboxylase and adenosylcobalamin phosphatase. Of them, threoninephosphate decarboxylase plays a role in the direct attachment of (R)-1-amino-2-propanol to the corrinoid ring, while the adenosylcobalamin phosphatase catalyzes dephosphorylation of adenosylcobalamin (31). The retrieved sequences from Swiss-Prot database were then manually inspected to ensure their accuracy. Verified sequences were downloaded and stored in the seed database.
Core database construction. In addition to Swiss-Prot, sequences were also collected from the TrEMBL database, forming the core database. Sequences belonging to targeted gene families were downloaded from the TrEMBL database using the refined keywords. Since tens of thousands of sequences could be retrieved, these sequences were briefly inspected for their accuracy. For each gene family, these less-accurate sequences downloaded from TrEMBL database were then searched against their seed sequences by USEARCH (46) at a global identity cutoff of 30%. If homologs were found among gene families (e.g., cysG and cobA) involved in VB 12 biosynthesis pathways, seed sequences of all homologous gene families were used in the USEARCH procedure. Sequences retrieved from TrEMBL database were assigned to the gene family with best hit.
Full database construction. In order to comprehend VB 12 Path and minimize false-positive effects due to "small database issue," we also integrated multiple public databases into VB 12 Path. In addition to integrating targeted gene families in public orthology databases, we also identified and included homologous gene families. To do so, public orthology databases, including arCOG (Archaeal Clusters of Orthologous Genes) (47), COG (20), KEGG (18), and eggNOG (19), were searched against the core database. Orthologous groups belonging to targeted gene families as well as homologous orthologous groups were identified based on semimanual inspection efforts. Sequences were extracted and merged with known targeted gene families or included as homologous groups, forming a "pre-full" database. After that, both archaeal and bacterial RefSeq databases in NCBI were downloaded and searched against the pre-full and public orthology databases. Sequences best targeting VB 12 gene families and their homologous groups were then respectively extracted and merged, forming the full database. The CD-HIT program (48) was then used to generate a set of nonredundant representative sequences at 100% identity cutoff.
Metagenomic profiling of VB 12 gene families. Included in VB 12 Path, we also provided a set of PERL scripts for database searching and metagenomic profiling of VB 12 gene families. Both functional gene and taxonomic profiles for targeted VB 12 gene families were generated. For functional profiling, database searching tools including DIAMOND (49), USEARCH (46), and BLAST (50) were supported. Other database searching software can also be integrated easily. For taxonomic profiling, sequences targeting VB 12 gene families were extracted based on database searching report files by the seqtk program (51). The tool Kraken2 (52) was then called for taxonomic assignment of these extracted sequences. Taxonomic profiles at different taxonomic levels were then generated. Further statistical analyses could be carried out for both functional and taxonomic profiles at users' own choices.
Case study procedures. In order to evaluate the performance of VB 12 Path in profiling VB 12 gene families and their taxonomic information in complex metagenomes, we collected 25 representative  12 Path. Multiple steps were included in the framework, including seed database construction, core database construction, database merging, and full database construction. A set of PERL scripts are also provided to generate taxonomic and functional gene profiles from shotgun metagenomes. VB 12 PATH for Metagenomic Profiling shotgun metagenome data sets, among which 10 came from the Human Microbiome Project and 15 came from the TARA Oceans expedition. The human gut metagenomes were downloaded from https:// www.hmpdacc.org. The TARA Oceans metagenome data sets were downloaded from http://ocean -microbiome.embl.de/companion.html. Forward and reverse reads were merged into longer sequences by the PEAR program (version 0.9.6, -q 30) (53). The merged sequences were then searched against VB 12 Path using the DIAMOND program (-k 1 -e 0.0001 -id 0.3) (49). Functional profiles were generated based on the DIAMOND output files. Metagenomic sequences mapped to VB 12 biosynthesis gene families were extracted by the seqtk program (version 1.3) (51). The Kraken2 program (52) was employed for taxonomic assignment for the extracted reads. A standard database was built for Kraken2 which covered complete genomes in RefSeq for the bacterial, archaeal, and viral domains, along with the human genome and a collection of known vectors (UniVec_Core). Taxonomic profiles of VB 12 biosynthesis communities were generated at different taxonomic levels. The generated functional gene and taxonomic profiles were then used for further statistical analyses.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.