Natural product biosynthetic potential reflects macroevolutionary diversification within a widely distributed bacterial taxon

ABSTRACT Flavobacteriaceae spp. are key players in global biogeochemical cycling and are known for their versatile carbohydrate and peptide catabolism. However, it is currently unknown whether secondary metabolism traits underlie their broad range of occurrence across the Earth’s biomes. We examined 2,680 genomes to unveil an unprecedented phylogenetic signal dictating natural product biosynthesis diversification within the Flavobacteriaceae family. The distribution of secondary metabolite biosynthetic gene clusters (BGCs) across genomes usually follows macroevolutionary, genus-specific patterns. Noticeably, 88.6% of the observed BGCs were inferred to lead to the biosynthesis of likely novel natural products. We found an unanticipated, large diversity of taxon-specific BGCs encoding carotenoid and flexirubin pigments, the vast majority of which await formal description. In particular, Aquimarina and Kordia spp. possessed large genomes, versatile catabolic traits, and a repertoire of BGCs possibly encoding drug-inspiring polyketides, non-ribosomal peptides, or post-translationally modified peptides. Using a machine learning approach (feature selection), we reveal that marine and non-marine Flavobacteriaceae genomes are differentially enriched in CAZymes and peptidases with distinct functionalities and molecular targets.

IMPORTANCE This is the most comprehensive study performed thus far on the biosynthetic potential within the Flavobacteriaceae family. Our findings reveal intertwined taxonomic and natural product biosynthesis diversification within the family. We posit that the carbohydrate, peptide, and secondary metabolism triad synergistically shaped the evolution of this keystone bacterial taxon, acting as major forces underpinning the broad host range and opportunistic-to-pathogenic behavior encompassed by species in the family. This study further breaks new ground for future research on select Flavobacteriaceae spp. as reservoirs of novel drug leads.


Data analysis design and workflow
The approach implemented in this study comprises phylogenomic assessments and functional annotations of isolate genomes and metagenome-assembled genomes (MAGs) encompassing 175 classifiable and non-classifiable (that is, lacking a formal nomenclature) Flavobacteriaceae and Weeksellaceae genera. Briefly, to investigate the putative secondary metabolism of both families, we combined automated BGC detection and classification using antiSMASH v5.0 (1) with network analysis using BiG-SCAPE (2) to map the distribution of BGCs across the examined genomes. We uncover BGCs specifically involved in the production of pigments, siderophores, and novel drug-like candidates, establish their phylogenetic relationships and patterns of distribution across different taxa and environmental settings (marine vs. non-marine) using network analyses, and critically examine the potential of Flavobacteriaceae spp. as renewable sources of novel drugs. Moreover, we use the large dataset assembled in this study to shed light on peptide and carbohydrate catalytic potential as adaptive features of marine and non-marine Flavobacteriaceae using a machine learning approach, namely Feature Selection (FS). Finally, peptidase:CAZyme ratios were used as proxies to examine the catalytic profile of Flavobacteriaceae genera across marine and non-marine biomes.

Representativeness of metagenome-assembled genomes (MAGs) in the dataset
MAGs represented 25% of all genomes analysed, and 17.16% of all biosynthetic gene clusters (BGCs) were identified in MAGs. This proportion varied with BGC type, however. For example, 26.3% and 23% of the BGCs coding for terpenes and type III polyketide synthases (PKSs), respectively, were present in MAGs. Although several MAGs were found to represent so-far uncultured lineages and likely novel genera within the Flavobacteriaceae and Weeksellaceae families, several known, cultivatable genera were also well represented by a high percentage of MAGs. While the cultivatable genus containing the highest number of MAGs was Flavobacterium (n = 74), the relative contribution of MAGs to the total number of analysed genomes per genus was much greater (> 60%) for less studied, cultured groups such as Aequorivita, Euzebyella, Leeuwenhoekiella, Marinirhabdus, Marixanthomonas, Muriicola, Zunongwangia and Empedobacter, to name a few (Figure 1A, Supplementary Figure S3).
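The per-genus MAG contribution described above is a simple proportion, sketched below. This is an illustration only: the function name, the record layout and the toy genus labels are assumptions and do not come from the study's data or code.

```python
from collections import Counter

def mag_share_per_genus(genomes):
    """Return {genus: fraction of that genus's genomes that are MAGs}.

    `genomes` is an iterable of (genus, is_mag) pairs (hypothetical layout).
    """
    totals = Counter()
    mags = Counter()
    for genus, is_mag in genomes:
        totals[genus] += 1
        if is_mag:
            mags[genus] += 1
    return {genus: mags[genus] / totals[genus] for genus in totals}

# Toy records with made-up genus labels, not values from the dataset.
toy = [
    ("GenusA", True), ("GenusA", True), ("GenusA", False),
    ("GenusB", True), ("GenusB", False), ("GenusB", False), ("GenusB", False),
]
shares = mag_share_per_genus(toy)

# Genera in which MAGs make up more than 60% of the analysed genomes.
mag_rich = sorted(genus for genus, share in shares.items() if share > 0.60)
print(mag_rich)
```

A genus with many MAGs in absolute terms can still have a low share under this measure, which is the contrast drawn above for Flavobacterium.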

Marine and non-marine Flavobacteriaceae genomes are differentially enriched in CAZymes and peptidases
The Random Forest (RF) classifier used in this study achieved high performance in identifying differentiating genome features among genomes of marine and non-marine origin, as evidenced by F1-measure and accuracy scores of 83.0-86.36% and 81.75-86.5%, respectively, in both the evaluation and testing phases (see Supplementary Table S7 for details). These results demonstrate the consistency and robustness of our feature selection pipeline and RF classifier. Among the differentiating carbohydrate-degrading enzyme (CAZyme) features, we found a coding potential for GH19 domains not associated with carbohydrate-binding modules (CBMs) more frequently in non-marine genomes. In contrast, GH19 enzymes coupled to CBM5, a chitin-specific binding module, were a typical coding feature of marine genomes. GH19 enzymes are glycoside hydrolases which may underpin lysozyme and/or chitinase activities.
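For reference, the two performance measures quoted for the classifier can be computed from true and predicted habitat labels as in the minimal sketch below. It assumes a binary F1 computed for one class (here "marine"). The text does not state which averaging scheme was used, so this is not necessarily how the reported scores were obtained, and the labels are toy data.

```python
def accuracy_and_f1(y_true, y_pred, positive="marine"):
    """Return (accuracy, F1 for the `positive` class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    # F1 is the harmonic mean of precision and recall.
    return accuracy, 2 * precision * recall / (precision + recall)

# Toy labels (hypothetical): 4 marine and 4 non-marine genomes.
y_true = ["marine"] * 4 + ["non-marine"] * 4
y_pred = ["marine", "marine", "marine", "non-marine",
          "non-marine", "non-marine", "non-marine", "marine"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(acc, f1)
```

Reporting both measures is useful when the two habitat classes differ in size, because accuracy alone can look high for a classifier that favours the larger class.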
The peptidase:CAZyme ratio may be a proxy for habitat conditions and microbial lifestyles (3). For example, peptidase:CAZyme ratios > 1 seem more prevalent in bacteria often associated with hosts or isolated from oligotrophic habitats (4). The Flavobacteriaceae genus with the highest ratio was Myroides, an opportunistic human pathogen from contaminated environmental sources such as soil and water. The second highest peptidase:CAZyme ratio was found in MED-G11, an uncharacterized genus composed only of MAGs. The genus with the third highest ratio was Tenacibaculum, well known to contain fish pathogenic species such as Tenacibaculum maritimum (5). A high ratio was also recorded for the genus Riemerella, which contains pathogenic members like Riemerella anatipestifer, an avian pathogen with a worldwide economic impact on the duck industry (6). Conversely, four marine genera (Algibacter, Jejuia, Zunongwangia, Leeuwenhoekiella) were found to have a ratio below 1. These genera have seemingly nonpathogenic lifestyles and are found in both free-living and host-associated settings. Algibacter, in particular, is often found in association with seaweeds, which are polysaccharide-rich settings that may favour a higher number of CAZymes.
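The ratio discussed above can be illustrated with a short sketch. It assumes the ratio is computed per genome as the number of peptidase-coding genes divided by the number of CAZyme-coding genes and then averaged per genus. The text does not specify the exact aggregation, and the gene counts and genus labels below are hypothetical.

```python
from collections import defaultdict

def peptidase_cazyme_ratio(n_peptidases, n_cazymes):
    """Ratio of peptidase to CAZyme gene counts for one genome."""
    if n_cazymes == 0:
        raise ValueError("ratio is undefined for a genome without CAZymes")
    return n_peptidases / n_cazymes

def mean_ratio_by_genus(records):
    """records: iterable of (genus, n_peptidases, n_cazymes) tuples."""
    ratios = defaultdict(list)
    for genus, n_pep, n_caz in records:
        ratios[genus].append(peptidase_cazyme_ratio(n_pep, n_caz))
    return {genus: sum(vals) / len(vals) for genus, vals in ratios.items()}

# Toy counts, not taken from the study.
toy = [
    ("GenusA", 150, 100), ("GenusA", 125, 100),  # peptidase-rich genomes
    ("GenusB", 90, 120), ("GenusB", 80, 160),    # CAZyme-rich genomes
]
means = mean_ratio_by_genus(toy)
above_one = sorted(genus for genus, ratio in means.items() if ratio > 1)
print(means, above_one)
```

Under this convention a value above 1 means a genome encodes more peptidases than CAZymes, which is the threshold used in the comparison above.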

Genome size and catabolic potential are intertwined features dictating niche differentiation among Flavobacteriaceae species
Some Flavobacteriaceae species have been repeatedly referred to as pathogens in the marine environment. For instance, Tenacibaculum maritimum is a known fish pathogen (5), Kordia algicida is, as the name suggests, responsible for algal mortalities (7) and Aquimarina species have been reported as pathogens of lobsters (8) and algae (9). Here we show that Aquimarina and Kordia genera possess a higher number of genes dedicated to secondary metabolism and peptide degradation and larger genome sizes when compared to other Flavobacteriaceae genera. We hypothesize that these traits may be related to an opportunistic lifestyle adapted to free-living and host-associated phases.
In contrast, besides the vast diversity of MAGs representing novel, uncultured genera with reduced genomes, we also found four formally described genera with cultured representatives (isolate genomes) in the Flavobacteriaceae family possessing a mean genome size lower than 3 Mb: the non-marine Capnocytophaga and the marine genera Cellulophaga, Muriicola and Psychroflexus.
Capnocytophaga includes species that are part of the oral microbiome of humans and other mammals. C. canimorsus and C. cynodegmi are two commensal species of the oral microbiomes of dogs and cats which can cause rare but severe infections in humans (10). The host specificity displayed by species in this genus points to the possibility of genome reduction due to niche specialization. The remaining three marine genera have been mainly isolated from seawater (11-13), thus pointing towards a free-living, planktonic lifestyle. Genome reduction in free-living marine bacteria is also a well-documented phenomenon (14). Our data support the recent hypothesis by Xue et al. (15) that marine, pelagic Flavobacteriaceae genera such as Nonlabens possess smaller genomes (average 3.2 Mb) and fewer CAZymes than their closest relatives such as Leeuwenhoekiella (average genome size: 4.0 Mb), which preferentially exploit polysaccharide-rich environments such as macroalgal hosts. In this case, genome reduction seems to be related to oligotrophic environments where nutrient scarcity favours cells with a reduced replication burden, and where increased UV exposure leads to higher mutation rates and a higher frequency of gene transfer and gene loss (16).

Only 103 BGCs in the entire dataset shared 100% homology with validated BGCs present in the MIBiG database. Of these, 68 BGCs coding for bisucaberin B were present on 50 Tenacibaculum genomes, 17 Aquimarina genomes and one genome of the unclassified lineage GCA-2733415. The remainder were classified as non-ribosomal peptide synthetase (NRPS) BGCs (n = 30) encoding the biosynthesis of diverse peptides such as the cyclic hexapeptide anabaenopeptin NZ857/nostamide A (8 BGCs from Flavobacterium and Tenacibaculum genomes), the cyclodepsipeptide xenematide (7 BGCs from Chryseobacterium, Flavobacterium and Nonlabens genomes and from one MAG), the two-tailed lipocyclopeptide antibiotic icosalide A/B (6 BGCs from Flavobacterium and Chryseobacterium genomes), rhizomides (6 BGCs from Chryseobacterium, Flavobacterium, and Tenacibaculum genomes) and the β-lactam antibiotic monobactam (2 BGCs from Flavobacterium genomes), among others. In contrast with the minority of BGCs showing 100% homologous genes with validated BGCs in the MIBiG database (n = 103), a wealth of BGCs annotated by antiSMASH (n = 8,866) possessed < 60% homologous genes with BGCs in the MIBiG database, encompassing 4,987, 689, 2,430 and 160 BGCs presenting 0%, 1 to 20%, 21 to 40% and 41 to 59% homologous genes with MIBiG entries, respectively. Among these, we found a large diversity of BGCs exclusive (or almost exclusive) to Flavobacteriaceae genomes. These BGCs show low to moderately low proportions of homologous genes with MIBiG BGCs, yet they showcase the potential for biosynthesis of drug-like molecules within the family, as many of them have been estimated to code for natural products in the polyketide, non-ribosomal peptide, and ribosomally synthesized and post-translationally modified peptide (RiPP) compound classes.
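The homology classes used above can be expressed as a small binning function. This is a sketch that assumes each BGC carries a single integer percentage of homologous genes relative to its closest MIBiG entry. The function name, bin labels and input values are illustrative and are not results from the study.

```python
from collections import Counter

def homology_bin(pct):
    """Assign an integer percentage of MIBiG-homologous genes to a class."""
    if not 0 <= pct <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if pct == 0:
        return "0%"
    if pct <= 20:
        return "1-20%"
    if pct <= 40:
        return "21-40%"
    if pct < 60:
        return "41-59%"
    return ">=60%"

# Hypothetical per-BGC percentages.
toy_percentages = [0, 0, 13, 25, 40, 57, 60, 100]
counts = Counter(homology_bin(p) for p in toy_percentages)

# BGCs below the 60% threshold used above to flag likely novel clusters.
n_below_60 = sum(n for label, n in counts.items() if label != ">=60%")
print(counts, n_below_60)
```

Keeping the 0% class separate from the 1 to 20% class mirrors the breakdown above, where BGCs without any MIBiG-homologous genes form the largest group.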