Comparative Functional Genome Analysis Reveals the Habitat Adaptation and Biocontrol Characteristics of Plant Growth-Promoting Bacteria in NCBI Databases

ABSTRACT Plant growth-promoting bacteria (PGPB) are a group of beneficial microorganisms that include 60 bacterial genera, such as Bacillus, Pseudomonas, and Burkholderia, which widely colonize plant leaves and soil, promote plant growth, and/or inhibit pathogen infection. However, the genetic factors underpinning adaptation of PGPB to plant leaves and soil remain poorly understood. In this study, we performed a comparative functional genome analysis approach to investigate the functional genes of 195 leaf-associated (LA) and 283 soil-associated (SA) PGPB strains and their roles in adapting to their environment, using 95 strains from other-associated (OA) environmental habitats with growth-promoting or antimicrobial functions as negative controls. Comparison analysis of the enrichment of nonredundant (NR) protein sequence databases showed that cytochrome P450, DNA repair, and motor chemotaxis genes were significantly enriched in LA PGPB strains related to environmental adaptation, while cell wall-degrading enzymes, TetR transcriptional regulatory factors, and sporulation-related genes were highly enriched in SA PGPB strains. Additionally, analysis of carbohydrate-active enzymes demonstrated that glycosyltransferases (GTs) and glycoside hydrolases (GHs) were abundant families in all PGPB strains, which is in favor of plant growth, and enriched in SA PGPB strains. Except for most Bacillus strains, SA PGPB genomes contained significantly more secondary metabolism clusters than LA PGPB. Most LA PGPB contained hormone biosynthesis genes, which may contribute to plant growth promotion, while SA PGPB harbored numerous carbohydrate and antibiotic metabolism genes. In summary, this study further deepens our understanding of the habitat adaptation and biocontrol characteristics of LA and SA PGPB strains. IMPORTANCE Plant growth-promoting bacteria (PGPB) are essential for the effectiveness of biocontrol agents in plant phyllosphere and rhizosphere. However, little is known about the ecological adaptation of PGPB to different habitats. In this study, comparative functional genome analysis of leaf-associated (LA), soil-associated (SA), and other-associated (OA) PGPB strains was performed. We found that genes related to the metabolism of hormones were enriched in LA PGPB. Carbohydrate and antibiotic metabolism genes were enriched in SA PGPB, which likely facilitated their adaptation to the plant growth environment. Our findings provide genetic insights on LA and SA PGPB strains’ ecological adaptation and biocontrol characteristics.

and widely used in the development of plant growth regulators (5). In contrast, soil-associated (SA) PGPB are generally rich in carbohydrates and secondary metabolite biosynthesis genes and widely used in the development of biopesticides (6,7). However, the unique genetic characteristics and environmental adaptation mechanisms of these LA and SA PGPB are still unclear.
PGPB have multiple mechanisms to enhance plant growth and induce disease resistance, including the use of plant hormones, antioxidant enzymes, carbohydrate enzymes (CAZymes), and antibiotics (8,9). The CAZymes are enzymes that break down the cell walls of plant pathogens, leading to pathogen death. PGPB can also produce bioactive secondary metabolites, such as antibiotics (10), including nonribosomal peptides (NRPs) and polyketides (PKs) synthesized by nonribosomal peptide synthetase (NRPS) (11). Nevertheless, the genes and metabolic pathways of PGPB from plant leaves and soil, including those related to plant hormones, carbohydrates, and secondary metabolite clusters, are still not fully understood.
Comparative functional genome analysis is an effective approach for revealing the evolution and habitat adaptation mechanisms of microorganisms (12)(13)(14) and the diverse metabolic capabilities of strains within the same species isolated from different habitats (15). The aim of this study was to use comparative functional genome analysis to uncover the genetic characteristics and metabolic differences of PGPB residing in leaves and root zone soils, providing insights into the biological and ecological functions of PGPB and their applications in agriculture.

RESULTS
Collection of strains, genomic features, and phylogenetic analysis. We conducted a search for bacterial strains using "plant growth-promoting" and/or "antibacterial" as keywords in the Web of Science platform, which yielded a total of 2,607 strains. Out of these, 1,042 strains had their genomes sequenced and were stored in the NCBI genome and/or protein database. We then employed PYANI, GBDP, and CheckM tools to remove duplicates and obtained a data set of 573 strains (60 genera) with high-quality genome and protein sequences (see Table S1 and Fig. S1 in the supplemental material and method 1 included the redundancy analysis and so on). Among these strains, 478 were plant growth-promoting bacteria (PGPB) that colonized either leaves (leaf associated [LA]; 195 strains) or rhizospheric soil (soil associated [SA]; 283 strains), while the remaining 95 strains were isolated from nonplant environments (other associated [OA]) such as food, air, and the human body, serving as control strains for this study. Notably, these strains were distributed worldwide (Fig. 1A). On analyzing the genome data, we observed that the average genome size of LA PGPB strains was significantly smaller (4.67 Mb) than those of SA PGPB strains (6.08 Mb) and OA strains (5.97 Mb) (t test, P , 0.05) (Fig. 1B). Additionally, the average GC content of LA PGPB strains was significantly higher (63.27%) than those of SA PGPB strains (57.12%) and OA strains (56.95%) (t test, P , 0.05) (Fig. S2).
To investigate the evolutionary relationships among 573 bacterial strains from various habitats, a maximum likelihood phylogenetic tree was generated based on the amino acid sequences of 14 single-copy core genes (Fig. 1C)  The middle layer of phylogenetic tree represents the clustering information of bacterial strains. Due to the substantial variations in genome size and GC content among strains originating from distinct habitats, we employed the k-medoids clustering algorithm to further group the strains based on their similarity of lineage distances, which were converted from the phylogenetic tree of 573 bacteria (method 2). By maximizing the silhouette coefficient of 0.44 (Fig. S3), we identified the optimal k value, which resulted in four distinct clusters: K1, K2, K3, and K4. Group K1 consisted of 285 strains from 60 different genera, with 174 LA PGPB strains accounting for 61.05% of the total in K1, 95 SA PGPB strains accounting for Double and triple asterisks indicate a significant (P , 0.01) and an extremely significant (P , 0.001) difference between two habitats as inferred from Student's t test, respectively. (C) Maximum likelihood phylogenetic tree of 573 high-quality and nonredundant bacterial genomes, based on the protein sequences of 14 single-copy genes. The outer ring shows bacteria of genera, the central ring shows the habitats, and the inner ring shows the taxonomic group. It is worth noting that subsequent studies have adopted the same color scheme as used here. All data are available in the supplemental material. LA, leaf-associated; SA, soil-associated; OA, other-associated.  Fig. S4). The innermost circle of the tree represents habitat information of strains collected in the literature and NCBI databases (Fig. 1C). Given the extensive data and various sources, it is prudent to ascertain the statistical significance of the classification of PGPB compartments in the literature and databases. We applied a hierarchical clustering method with complete linkage to cluster 573 bacterial strains based on the number of CAZyme and secondary metabolite cluster genes per megabase of genome. We found that a total of 168 LA PGPB strains (29.32% of all strains), 251 SA PGPB strains (43.8% of all strains), and 81 OA strains (14.14% of all strains) clustered together (method 2).
Comparison of core genome and functions of PGPB in different habitats. To gain a deeper understanding of the genetic underpinnings and biocontrol properties of PGPB across different environments, we conducted a genome analysis of 12 units from four taxa across three habitats, resulting in a pan-genome. Our findings revealed the identification of 59,187 pan-genome gene families, of which only 163 (0.28% of the total) were considered core gene families that were present in all strains and conserved. The remaining 59,024 (99.72% of the total) represented the accessory genome (37,784 [63.83%]) and strain-specific genes (21,240 [35.89%]) ( Fig. 2A). Meanwhile, phylogenetic principal-component analysis (phylo-PCA) of the 163 core genes in three habitats showed that the gene abundance in all strains was correlated with different habitat characteristics ( Fig. S4) (Adonis test, P = 0.001), suggesting that the abundance of these genes may play an important role in adapting to the environment. Therefore, we randomly selected strains with the closest average phylogenetic distance from each of the three habitats and conducted a comparative analysis of their 163 core gene families (method 3). Analysis of the NR database revealed that LA PGPB strains had higher levels of eight core genes than SA PGPB and OA strains, which are related to environmental adaptation, such as DNA repair, cytochrome P450s, AraC family transcriptional regulators, and motility chemotaxis ( Fig. 2B and Fig. S5). In SA PGPB strains, 16 core genes were significantly more abundant than in LA PGPB and OA strains, including genes encoding proteins with helix-turn-helix structures, acylamidases, TetR family transcriptional regulators, and YceI family proteins and genes for spore adaptation to the environment ( Fig. 2B and Fig. S5). Moreover, 15 core genes were more abundant in LA or SA PGPB strains than in OA strains, including genes for ABC transport proteins, a/b hydrolases, ArsC, and flavin adenine dinucleotide (FAD)-binding oxidoreductases. In contrast, OA strains had an additional 34 abundant core genes, which may be associated with their broad habitat range, including transglutaminase, aspartic acid hemaldehyde dehydrogenase, and alanine racemate, among others (Fig. S5).
CAZyme profiling. To gain a more profound understanding of the potential function of CAZymes in environmental adaptation in each genome, we employed an internal script to determine the number and types of these enzymes in the strain genome to explore the genomic potential of all strains in utilizing carbohydrate enzymes (method 4). Our analysis revealed the presence of numerous genes encoding crucial carbohydrate enzymes in PGPB strains thriving in the SA habitat ( Fig. 2B and Fig. S6). Our findings revealed that 59 CAZyme families were present in all strains, including 32 glycoside hydrolases (GHs), 25 glycosyltransferases (GTs), and 2 carbohydrate-binding module (CBM) families. GHs and GTs were the most abundant CAZyme families in all strains. Nevertheless, there were significant variations in the number of genes encoding CAZymes among the strains, with a range of 82 to 417 (Fig. 3A). This indicates that PGPB strains possess significant carbohydrate degradation capabilities, and GHs and GTs may be the pivotal CAZyme genes necessary for these strains' adaptation and survival in diverse environments. Further analysis has uncovered significant differences in the total number of CAZyme genes among strains of bacteria growing in different habitats. Specifically, strains growing in the SA habitat showed a significantly higher number of CAZyme genes than those growing in the LA habitat, for both K1 and K4 taxa. In contrast, in the LA habitat, the K4 taxon exhibited the highest average number of CAZyme genes, while in the SA habitat, the K2 taxon had the highest average number of CAZyme genes (t test, P , 0.05) (Fig. 3A). Phylo-PCA analysis further revealed that the CAZyme genes in the K1, K2, and K4 taxa could be separated into two axes, PC1 and PC2. Interestingly, there were significant differences in CAZyme genes among strains from different habitats along these two axes (Adonis test, P , 0.05) (Fig. 3B). However, in the K3 taxon, which mainly consisted of Bacillus strains, there was no significant difference in CAZyme genes among strains growing in LA, SA, PGPB, and OA strains (t test, P , 0.05) ( Fig. 3B and Table S2). This suggests that the CAZyme genes of most PGPB are related to the habitat of the strains, whereas the K3 taxon appears to exhibit stable CAZyme levels across different environments. To identify specific variations in the types of CAZyme genes among strains in diverse environments, we utilized Rstudio for additional statistical analysis (method 6). Our results showed that strains from LA and SA habitats in the K1 taxon exhibited significant differences in five types of CAZyme genes (i.e., genes for auxiliary activities [AAs], carbohydrate esterases [CEs], GHs, GTs, and polysaccharide lyases [PLs]). In contrast, for the K2 and K4 taxa, only CEs demonstrated significant differences between LA and SA habitats. Notably, we did not observe any significant differences in specific CAZyme genes between different habitats in the K3 taxon (t test, P , 0.05) (Fig. 4A). Further analysis revealed that in the LA habitat, all five types of CAZyme genes (genes for AAs, CEs, GHs, GTs, and CBMs) showed significant differences among the four clusters. In the SA habitat, six types of CAZyme genes in all four clusters exhibited significant differences, with higher CAZyme gene richness than in the LA habitat (t test, P , 0.05) (Fig. 4B). These findings lead us to conclude that variations in the types and richness of CAZyme genes in different habitats may assist microorganisms in adapting to their ecological environments.
Analysis of the secondary metabolic cluster. To evaluate the potential production of antimicrobial compounds, we utilized the antiSMASH software to predict the quantities and types of secondary metabolite clusters present in all strains (method 5). Our findings indicated that PGPB strains contained both NRPS and bacteriocin genes, with various numbers of secondary metabolite clusters observed across different habitats. Specifically, we noted a significant difference in the total number of secondary metabolite clusters among strains of taxa K1 and K2 in LA and SA habitats (Fig. 5A). Furthermore, phylo-PCA highlighted significant differences in the secondary metabolite clusters of K1, K2, and K4 strains across different habitats on the PC1 and PC2 axes (Fig. 5B), suggesting the potential of PGPB secondary metabolite clusters to play a crucial role in environmental adaptation. Although there was no significant difference in secondary metabolite clusters between strains in different habitats within the K3 taxon, they had the highest level of secondary metabolite clusters (Fig. 5A). Notably, the K3 taxon comprised primarily of Bacillus strains, supporting the notion that Bacillus could serve as a reliable and effective biocontrol agent. Overall, these findings provide deeper insight into the distribution and function of secondary metabolite clusters of PGPB strains across different habitats, offering a theoretical foundation for using PGPB as a biocontrol agent.
We conducted a comparative analysis of the 17 most abundant secondary metabolite clusters between LA and SA PGPB strains across all four taxa (Fig. 6A). Our results revealed significant differences in nine secondary metabolite clusters between the LA and SA habitats of taxon K1, including those for several gene clusters associated with the biosynthesis of secondary metabolites such as arylpolyene, bacteriocin, hserlactone, N-acetyl glutaminyl glutamine amide (NAGGN), NRPS, polyketide synthase (PKS)-like, terpene production, transAT-PKS, and transAT-PKS-like.
Further analysis revealed that taxa K1, K2, and K4 had a significantly higher number of enriched secondary metabolite clusters in the SA habitat than the LA habitat. On the other hand, the members of taxon K3 exhibited more enriched secondary metabolite clusters in the LA habitat. Furthermore, taxon K3 showed the highest number of secondary metabolite clusters in both LA and SA habitats, highlighting its potential as an effective biocontrol agent in both habitats. Notably, taxon K3 mainly comprised Bacillus PGPB strains.

DISCUSSION
The microorganisms in a habitat are selected by environmental factors, which results in distinct microbial genome variations among different habitats (16,17). PGPB strains have been observed in various habitats, as documented in the NCBI database (18). PGPB isolated Adaptation of PGPB to Plant LA and SA Habitats Microbiology Spectrum from LA and SA habitats are crucial components of agroecosystems and contribute significantly to healthy crops (19). In this study, a comparative functional genome analysis was conducted on 573 strains to gain insights into the genomic features of PGPB. Bacteria inhabiting complex ecosystems typically have larger genomes, often containing functionally redundant genes (20,21). A larger number of function-related genes may aid bacteria in adapting to complex environments and improving their biocontrol characteristics, leading to enhanced adaptability, survival, and growth (22). Our study found that SA PGPB strains had significantly larger average genome sizes than LA PGPB strains, which may be attributed to the intricacies of the soil environment. Soil structure, physicochemical factors, and temperature fluctuations can impact bacterial growth more extensively than above-ground conditions (23,24).
A prior study recognized the significant role of core genes in controlling essential bacterial functions (25), and many plant-related functions were found to be conserved across diverse bacterial taxa (4). Genomic functional analysis of PGPB strains from the LA habitat revealed significant abundance of core genes that were related to DNA repair and contributed to exceptional resistance and response to environmental stress (26). These strains were also enriched in cytochrome P450 genes, which could suppress leaf senescence, prolong leaf life span, and boost plant adaptability to the environment (27). Additionally, these genes were involved in the biosynthesis and metabolism of several plant hormones, such as salicylic acid, jasmonic acid, ethylene, and abscisic acid (28). Furthermore, AraC family transcriptional regulators were identified as key to modulation and management of cellular metabolism, thereby allowing the bacteria to survive under various environmental conditions (29). Bacterial chemotaxis could aid in responding to chemical concentration gradients of substances and enabling the bacteria to prefer beneficial stimuli and avoid harmful ones (30) ( Fig. 2A and Fig. S4).
Our analysis revealed that PGPB inhabiting the SA habitat possess a distinct gene repertoire compared to those in the LA and OA habitats. Specifically, they exhibited enrichment in genes related to carbohydrate and amino acid metabolism, suggesting their ability to utilize these nutrients. Interestingly, we also observed an abundance of genes encoding helix-turn-helix domains, which can help alleviate inhibition of virulence gene expression and promote biofilm formation (31). Additionally, the SA PGPB strains were found to have a high number of acylamidase genes, which may allow indirect hydrolysis of immunoregulatory peptidoglycans (32), and transcription regulator genes of the TetR and YceI families, which aid in stress tolerance (33)(34)(35). Finally, the abundance of sporulation genes observed in these strains may contribute to their ability to withstand environmental stressors (36).
Compared to OA strains, the LA and SA PGPB strains were found to contain many genes encoding ABC transporter proteins, a/b hydrolase, and ArsC reductases, which might contribute to the degradation of environmental pollutants and improve environmental adaptability of the strains (37). Moreover, the PGPB strains were also found to be rich in FAD-binding oxidoreductase, which might protect the strains against hypoxic and oxidative stress (38). Additionally, these strains have the ability to produce exopolysaccharides and antibiotics that help PGPB colonize plants and protect the host from pathogens (39,40).
CAZymes and secondary metabolite clusters are crucial factors in determining whether bacteria exhibit biocontrol functions (7,41). The results of this study demonstrate that PGPB generally contain a large quantity of CAZymes, which enhance their ability to colonize plants (42). Among the LA PGPB strains, Pseudomonas strains exhibited a higher abundance of CAZymes than did K1, K2, and K3 PGPB strains, indicating that this genus could be developed into an ideal interfoliar biocontrol agent (43). Conversely, among the SA PGPB strains, Burkholderia strains possessed a higher concentration of genes encoding carbohydrate metabolism enzymes, suggesting that they have diverse mechanisms for carbohydrate utilization (44).
GHs are the key resources for bacteria to produce degraded polysaccharides and promoted their growth (2,45). GTs can synthesize extracellular polysaccharides, which were found to be conducive to the formation of biofilms and resistance to environmental pressures (39,46,47). CBMs are noncatalytic domains, which can be folded into a specific three-dimensional spatial structure and have the function of binding carbohydrate (48). PLs were required for plant growth and leaf senescence, which had no differ significance in four taxa (49). The genus Burkholderia has demonstrated significant differences in CEs (50). Pseudomonas belongs to the lipase-producing bacteria and has been found only in the endosymbiotic flora, while Pseudomonas also shows a greater abundance inside the cell (51). Follow justly, Pseudomonas exhibited a significant difference in the number of CEs between LA and SA habitats.
The antiSMASH program is commonly used to predict secondary metabolite gene clusters that may confer the ability to produce antimicrobial compounds (52). Bacteria may have a wider range of functional diversity if they possess a larger number of genes in their genomes (53). The greater the number of secondary metabolic gene clusters in bacteria, the stronger their ability to perform biological defense (54). Bacillus strains are dominant biocontrol agents in the market (55,56). The majority of Bacillus strains of PGPB possess more abundant secondary metabolic clusters than other taxonomic groups in the LA and SA habitats, with the LA habitat accounting for more. It has been demonstrated that Bacillus subtilis applied in a slow-release manner by foliar spraying or petiole inoculation significantly improves the effectiveness of antipathogenic bacteria (57). All PGPB strains were found to have NRPS, which can synthesize a series of low-molecular-weight peptide-like secondary metabolites with medicinal value through the nonribosomal pathway with NRPs (58). These secondary metabolites include antibiotics, iron carriers, toxins, and pigments and have antibacterial, antifungal, anticancer, antiviral, and immunosuppressive activities (59). Bacteriocins, a class of peptide toxins with antibacterial activity that act as probiotics (60), are also present in all PGPB strains.
In conclusion, we investigated the genetic basis and biocontrol characteristics of 478 strains across different habitats. Our findings revealed that LA PGPB strains possess genes related to environmental adaptation, such as genes for DNA repair, cytochrome P450s, AraC transcriptional regulators, and motor chemotaxis, while SA PGPB strains possess genes related to helix-turn-helix domain-containing proteins, amidases, TetR transcriptional regulators, YceI family proteins, and sporulation. Additionally, LA PGPB strains were enriched in genes related to hormone production, while SA PGPB strains were enriched in genes related to carbohydrate and antibiotic metabolism. These genes contributed to the domination of their complex environment and enhanced their biocontrol efficacy. The genera Bacillus and Paenibacillus, found in the LA and SA habitats, produced a higher number of secondary metabolite clusters, making them suitable for both leaf and soil environments. Burkholderia spp. produced numerous secondary metabolite clusters in the OA habitat, indicating their significant presence in nonplant environments and their potential as promising PGPB. Developing agrochemicals specific to different habitats would help improve the growth promotion and disease prevention effects of PGPB.

MATERIALS AND METHODS
Compilation of data sets for constructing a high-quality genome. In this study, relevant literature on plant promotion and/or biocontrol was searched for using Web of Science (https://www.webofscience .com/wos/alldb/basic-search) as the primary source, followed by manual proofreading. The corresponding whole-genome sequence of the strains was then downloaded from the NCBI website (https://www.ncbi.nlm .nih.gov/). Additionally, a previously established method for constructing a high-quality genome set was referenced (4, 61) (see Table S1 in the supplemental material). Briefly, the average nucleotide identity (ANI) and coverage of each strain were calculated using PYANI. Two genomes were considered functionally redundant if their ANI was at least 99.9% and their coverage was over 95%, in which case one was randomly removed. CheckM (62) was used to evaluate genome completeness (63) and contamination (64), with only genomes that were at least 95% complete and had no more than 5% contamination being used. In addition, the digital DNA-DNA hybridization (dDDH) value was calculated for 573 bacterial strains using the Genome BLAST Distance Phylogeny (GBDP) method (https://ggdc.dsmz.de/ggdc.php) (65). Finally, metadata collection was conducted to obtain information on isolation, genome size, and compartment from NCBI, IMG (https://img .jgi.doe.gov/), and manually collated references.
We compiled a data set of 573 bacterial strains with plant growth-promoting and/or disease-suppressing activity from the literature (Table S1). These strains originated from 36 countries and regions on all seven continents, with 195 strains isolated from plant leaves as LA PGPB, 283 isolated from soil as SA PGPB, and 95 isolated from other habitats unrelated to plants as OA PGPB (Fig. 1A and Fig. S1). Our collection Adaptation of PGPB to Plant LA and SA Habitats Microbiology Spectrum included strains from midfield soils in Hokkaido, Japan (66), bean leaf strains from tropical forests in Brazil with antimicrobial activity (67), and many strains previously characterized as promising biocontrol agents. Phylogenetic analysis. The amino acid sequences corresponding to 14 single-copy core genes were extracted from the OrthoFinder v2.2.7 output files (68). A maximum likelihood phylogenetic tree was constructed based on the single-copy core genome using FastTree v2.1.9 (69) and visualized with iTOL (https://itol.embl.de/) (70). We converted the phylogenetic tree into a distance matrix using the cophenetic function in the ape package in R. Next, we performed k-medoid clustering analysis and maximum profile coefficient determination on the 573 strains, which were clustered into one taxon considered to have a similar genetic background (4). Subsequently, we applied the PAM algorithm from the R package fpc to cluster the 573 genomes into four groups using k-medoid clustering. The k-medoid algorithm clusters a data set of n objects into k predefined clusters. To determine the optimal k value, we compared silhouette coefficients for k values ranging from 1 to 20. We selected a k value of 4, as it resulted in the highest average silhouette coefficient (0.44).
We used the R package pvclust for hierarchical clustering to demonstrate that all strains originating from the same habitat exhibit convergent evolution in their adaptation to the environment and their capacity to function as biocontrol agents (71)(72)(73).
Core genome analysis. We randomly selected 95 genomes from a pool of 195 LA PGPB genomes, 95 genomes from a pool of 283 SA PGPB genomes, and 95 genomes from all OA strain genomes. The mean phylogenetic distances between the LA/SA PGPB genomes and the OA strain genomes were calculated, and the genome with the smallest deviation from the average phylogenetic distance of OA strain genomes was selected. The same method was used to calculate the phylogenetic distances between the 195 LA PGPB genomes and the 195 SA PGPB genomes, and the genome with minimal deviation was selected. Finally, these genomes were used for comparative analysis of the core genome function (4).
Orthologous groups of protein families of core genome were delimited using OrthoFinder v2.2.7 software with the Diamond method (68,74). The resulting output files (Orthogroup_Sequences folder) were used to extract core genome families (genes shared among all strains) and strain-specific genes (genes found only in the same taxa, the same habitat strains) ( Fig. 2A).
Carbohydrate enzyme analysis. Carbohydrase distribution in 573 strains was predicted using the CAZy database. Accession numbers of known carbohydrases were obtained from CAZy, and corresponding genome sequences were obtained from NCBI Batch Entrez (https://www.ncbi.nlm.nih.gov/sites/ batchentrez). Data were formatted using Diamond and sequences with more than 40% identity were kept. Finally, the accession numbers and CAZy database carbohydrase names were compared to match them with each strain for predicting the presence of carbohydrases.
Secondary metabolic cluster analysis. To predict secondary metabolic clusters in the 573 strains, we utilized the antiSMASH localization database. Initially, we created an environment to install and activate the antiSMASH software, downloaded the database, and tested the genome of one strain to obtain the web version of antiSMASH (https://antismash.secondarymetabolites.org/#!/start) with consistent results. Subsequently, we processed the data in batches to obtain the number of secondary metabolic clusters for all 573 strains by collecting and analyzing the corresponding result files.
Statistical analysis. Extensive use of Rstudio (https://www.rstudio.com/) and the R package ggplot2 were employed for data visualization. Latitude and longitude coordinates were obtained in batches, and hierarchical clustering was used to confirm the distribution criteria of habitats. k-Medoid clustering was used to analyze the phylogenetic relationships of similar taxa. Core gene families were displayed using Venn diagrams, and the number of carbohydrases and secondary metabolic clusters were shown using violin plots. To analyze the functional abundance of each taxonomic hierarchy, dimensionality reduction analysis was performed using principal-component analysis (PCA) (62), and statistical analysis of functional differences among all taxa was conducted using analysis of multiple differences among all taxa (ANOVA) and t test.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 2 MB.