On the limits of 16S rRNA gene-based metagenome prediction and functional profiling

Molecular profiling techniques such as metagenomics, metatranscriptomics or metabolomics offer important insights into the functional diversity of the microbiome. In contrast, 16S rRNA gene sequencing, a widespread and cost-effective technique to measure microbial diversity, only allows for indirect estimation of microbial function. To mitigate this, tools such as PICRUSt2, Tax4Fun2, PanFP and MetGEM infer functional profiles from 16S rRNA gene sequencing data using different algorithms. Prior studies have cast doubts on the quality of these predictions, motivating us to systematically evaluate these tools using matched 16S rRNA gene sequencing, metagenomic datasets, and simulated data. Our contribution is threefold: (i) using simulated data, we investigate if technical biases could explain the discordance between inferred and expected results; (ii) considering human cohorts for type two diabetes, colorectal cancer and obesity, we test if health-related differential abundance measures of functional categories are concordant between 16S rRNA gene-inferred and metagenome-derived profiles and; (iii) since 16S rRNA gene copy number is an important confounder in functional profiles inference, we investigate if a customised copy number normalisation with the rrnDB database could improve the results. Our results show that 16S rRNA gene-based functional inference tools generally do not have the necessary sensitivity to delineate health-related functional changes in the microbiome and should thus be used with care. Furthermore, we outline important differences in the individual tools tested and offer recommendations for tool selection.


INTRODUCTION
Microbial profiling through next-generation sequencing techniques has linked microbial composition to diseases [1][2][3].A common microbiome profiling approach is 16S rRNA gene sequencing which captures information on taxonomy and microbial diversity [4,5].While accessible and cost-effective, 16S rRNA gene sequencing is limited due to short read lengths, sequencing errors [6], differences caused by the use of different variable regions [7][8][9] and difficulties in clustering sequences into operational taxonomic units (OTUs) [10].Furthermore, using a single marker gene to evaluate species diversity is challenging, due to the prevalence of horizontal gene transfer, the difficulty of classifying bacterial species [11], and limited resolution between closely related species when looking only at 16S rRNA gene sequences.Also, differences in the diversity of microbes were observed between 16S rRNA gene and metagenome sequencing [12,13].Additionally, 16S rRNA profiling does not provide direct information about the functional capabilities or the functional genes and pathways that bacteria possess [14,15].Microbial species can have different functional capabilities, even if they are taxonomically similar.For example, different strains or species of bacteria may have different metabolic pathways, allowing them to utilize different nutrients or tolerate changing environmental conditions [16,17].Therefore, predicting functional profiles from 16S rRNA gene data relies on statistical inference and machine learning algorithms that use taxonomic information as a proxy for functional potential.These insights into microbial genes and alterations in pathways are crucial to understanding the role of the microbiome in health and disease [18].
In contrast to 16S rRNA gene sequencing, shotgun metagenome sequencing (MGS) provides rich information on functional genes, which can be translated to the pathway level using tools such as the HMP Unified Metabolic Analysis Network (verison 3) (HUMAnN3) [19], Kraken [20], or MGS-FAST [21].Especially, de novo assembly of genomes can lead to more confident gene prediction and analysis of genomic context than is attainable from unassembled data [22].However, the considerably higher cost of MGS hinders its application in clinical research, particularly in large cohort studies, where adequate statistical power is essential for robustly detecting meaningful differences [23,24].Additionally, MGS can be challenging for low biomass samples or samples that are dominated by non-microbial DNA [25,26].
To overcome the lack of functional information in 16S rRNA gene profiles, tools such as Phylogenetic Investigation of Communities by Reconstruction of Unobserved States (PICRUSt2) [27], Tax4Fun2 [28], Pangenome-based Functional Profiles (PanFP) [29], Piphilin [30], COWPI [31] and metagenome-scale models (MetGEM) [32] attempt to predict abundances of functional genes based on recorded genomic information available in the Kyoto Encyclopaedia of Genes and Genomes (KEGG) database [33] or based on genomic models [34][35][36].PICRUSt2 is a widely used prediction tool employing a hidden state prediction algorithm to infer functions from 16S rRNA gene phylotypes [27].In contrast, Tax4Fun2 uses sequences within a defined similarity cutoff of reference sequences [28].PanFP is based on a functionally annotated pan-genome reconstruction which is then weighted with the microbial abundance observed in a given sample [29].Even though the accuracy of these tools has been validated, they are generally limited by the quality of the reference genomes and annotation, which suffer from ambiguous or missing coding regions [28,31].To overcome this limitation, MetGEM introduced a generalised genome-scale model, where metagenome-scale networks are constructed using the assembly of gut organisms through reconstruction and analysis (AGORA) collections [34] and the Human Microbiome Project (HMP) [37].
Another bias in amplicon sequencing is the number of 16S rRNA gene copies which varies considerably, confounding the abundance prediction [38,39].Several tools have recently been developed for predicting genomic copy numbers using phylogenetic methods [27] and sequenced genomes [40].The Ribosomal RNA Operon Copy Number Database (rrnDB) [41] offers accurate and well-annotated information on rRNA operon copy numbers among prokaryotes.Each entry (organism) in rrnDB contains detailed information linked directly to external websites, including the Ribosomal Database Project [42], GenBank [43], PubMed, and several culture collections.

Impact Statement
Despite the frequent use of metagenomics sequencing data, 16S rRNA sequencing still finds widespread use as a cost-effective and robust alternative in large-scale microbiome studies.Mainly used for taxonomic profiling, 16S rRNA profiling data is also frequently employed for functional analysis.Several popular computational tools such as PICRUSt2 have been developed to allow researchers to glimpse into the functional activity of the microbiome.The ability to robustly predict a hypothetical gene repertoire through phylogenetic analysis and metagenome inference has already been assessed in prior studies, but thus far it has not been investigated if tools such as PICRUSt2 have the necessary resolution to identify subtle differences in functional activity related to human health.In this study, we use simulation and real-world data of sample-matched 16S rRNA and metagenomics profiling data to chart the limits of functional profiling in 16S data.Our results suggest that caution is advised in this type of analysis, as health-related differences can not be captured accurately.
For an accurate analysis and interpretation of functional profiles from 16S rRNA gene profiles, it is crucial to understand the limitations of these tools compared to more comprehensive profiling techniques such as MGS.Furthermore, it is essential to offer potential users guidelines on tool selection.To date, there are only a few reviews evaluating the performance of functional inference tools in different environments [44][45][46][47].For example, Djemiel et al. [45] used a text-mining approach to review 100 published articles on functional profiles.They discussed limitations, including the lack of reference genomes, especially for soil ecosystems.
Sun et al. [46] investigated the performance of three commonly used metagenome inference tools (PICRUSt, PICRUSt2, and Tax4Fun) across seven different datasets for which matched 16S rRNA gene and MGS data was available.Even though MGS data is not a truly gold standard of functional activity in the microbiome, it can serve as a gold standard for benchmarking prediction methods that infer functionality from 16S rRNA gene profiling to approximate MGS data.A striking finding of Sun et al. [46] was that inferred abundances showed a high Spearman correlation between 16S-inferred and MGS-derived gene abundances even when sample labels were permuted.This highlights that functional profiles do not differ as much as changes in taxonomic composition suggest and that correlation is not a suitable performance measure to assess the performance of functional inference tools.The authors further compared two distinct geographical conditions, namely urban and rural subjects in China, as well as subjects from within and outside the United States.A Wilcoxon rank-sum test yielded p-values that were moderately correlated between 16S-inferred and MGS-derived genes and very lowly correlated after sample permutation.In their study, the authors concluded that functional inference tools worked well for humans rather than for other organisms and environmental samples and that core functions are better represented than niche-specific functions.However, the contrasts that were investigated were related to geography, where microbial composition and function are known to differ considerably [48].
It remains an open question if functional inference tools are also suited for more subtle contrasts related to human health.Sun et al. [46] could not detect any performance differences between the tested methods, suggesting that a more comprehensive benchmark is needed to recommend tool selection guidelines and establish the limits of metagenome inference tools in human disease research.Hence, we consider the most widely used metagenome inference tools PICRUSt2 [27], PanFP [29], Tax4Fun2 [28], and MetGEM [32] in a systematic benchmark.We omit Piphillin [30] as it is only available as a web server (the command line version is in the testing stage).We test the reliability of the inference tools using matched 16S rRNA gene sequencing and MGS human datasets.We employ simulated dataset from the Critical Assessment of Metagenome Interpretation Simulated Metagenome (CAMISIM) simulator and data from several cohorts, namely Cooperative Health Research in the Augsburg Region (KORA) (type two diabetes) [49], The Food Chain Plus (FoCus) [50], PopGen (obesity) [51], and colorectal cancer (CRC) [52].We use these data to examine the concordance of health-related functional changes in the microbiome between 16S-inferred and metagenome-derived profiles across commonly used software tools.The number of copies of the 16S rRNA gene varies widely across different bacteria.Hence, normalizing for 16S copy number is a crucial step for obtaining accurate estimates of microbial abundances [53].Without normalization, differences in 16S copy numbers between taxa can lead to biassed abundance estimates, potentially misrepresenting the true composition of microbial communities [39].We further explore the potential impact of 16S copy number as a confounding factor in functional profiles and investigate if a customised copy number normalisation using the rrnDB database could improve the results.In summary, our results suggest that current metagenome inference tools lack the sensitivity to identify health-related functional changes in the microbiome.

Simulated dataset
We evaluated the performance of the functional inference tools using simulated metagenomic samples downloaded from the second Critical Assessment of Metagenome Interpretation (CAMI) Challenge [54,55].CAMI includes a set of simulated metagenomic datasets with known ground truth, allowing for objective comparison of tool performance.We retrieved 40 simulated metagenomic datasets from CAMI, which represent typical microbiomes from four human body sites such as gastrointestinal tract (GI) (n=10), skin (n=10), oral cavity (n=10), and airways (n=10).

Population-based cohorts
We also used real datasets from four population / patient cohorts.Naturally, these suffer from the usual technical bias and variability due to differences in sequencing platforms, protocols, and sample preparation methods.This allows us to assess the impact of those in comparison to well-controlled simulated data.We selected four cohorts with paired 16S rRNA gene sequencing and MGS data for functional profiling analysis.The CRC dataset was downloaded from the Sequence Read Archive (SRA) [56] under project number PRJEB6070.Datasets of FoCus, KORA, and PopGen are under controlled access due to the informed consent given by the cohort study participants and are available upon request from https://epi.helmholtz-muenchen.de/ and https://portal.popgen.de,respectively.For the prospective KORA cohort (2018), sample preparation and sequencing of the V3-V4 region in paired-end mode on an Illumina MiSeq was performed by the ZIEL -Core Facility Microbiome in Freising.For the FoCus cohort and PopGen cohort, a detailed overview is given in Table 1.The overall workflow is represented in Fig. 1.
First, paired WGS-16S rRNA gene datasets are selected for comparison.Here, simulated data (a) or real datasets (b) are considered.In the next step, we retrieve the functional profiles using HUMAnN3 for the MGS datasets and four different functional inference tools (PICRUSt2, Tax4Fun2, PanFP, and MetGEM) for 16S rRNA gene sequencing datasets.A Wilcoxon rank-sum test was performed to compare the KO terms resulting from the two methods.

Functional profiling of metagenomics data
For public datasets, sequence quality control and removal of human reads (GRCh38) were performed using KneadData 0.7.4.[57].Bacterial gene abundances were calculated with HUMAnN3 (version 3.9) [19] using nucleotide-based alignment against the ChocoPhlAn database (201 901b) [58].Gene abundance tables were then grouped using the uniref90_ko command and relative abundance was calculated using the humann_renorm_table command in the HUMAnN3 pipeline.We followed the same strategy for the simulation datasets w.r.t.metagenome functional profiling, whereas functional profiling of 16S rRNA gene was done in two steps: first, 16S rRNA full length reads were filtered using silva (v138) [59] as a reference database using FilterReads [60].Once the 16S rRNA gene sequences were obtained, functional profiles using all four tools PICRUSt2, Tax4Fun2, PanFP, and MetGEM, were inferred as described below.We then compared the functional profiling from 16S rRNA gene analysis with the metagenome profiles and tested their agreement using functional diversity as mentioned in parallel-meta suite [61,62].

Functional inference of 16S rRNA gene sequencing data
Raw 16S rRNA gene reads were processed using DADA2 (1.26) [63] to obtain an amplicon sequence variants (ASV) abundance table.The representative sequences were then used as input for functional inference tools.We screened the literature for commonly used 16S rRNA gene-based functional inference tools and selected PICRUSt2 (v2.5.1),Tax4Fun2, PanFP and MetGEM for the benchmark analysis.PICRUSt2 and Tax4Fun2 are currently the most frequently used tools followed by PanFP.MetGEM represents a more recent development and has thus far not been benchmarked.The workflow for each tool is described in the corresponding methods subsection.

PICRUSt2
ASV abundance tables with representative sequences were given as input for PICRUSt2 [27].PICRUSt2 was executed with the default cutoff for the Nearest Sequenced Taxon Index (NSTI) of 2.0.The values in the copy-normalized ASV table produced by  PICRUSt2 were multiplied by the expected 16S rRNA gene copies per taxon (also output by PICRUSt2) to obtain an untransformed ASV table of contributing sequences.The totals of each column were then recorded as the total number of sequences used by PICRUSt2 in each sample.

Tax4Fun2
Tax4Fun2 was run with the default pre-calculated files and a parameter of 99 % similarity in its BLASTn command.Tax4Fun2 outputs relative abundances by default, so no transformation was required after producing the prediction profile.An adjustment was made to the Tax4Fun2 script make_FunctionalPredictions() to output the ratio of used sequences instead of the default ratio of unused sequences.Note that at the time of manuscript submission, Tax4Fun2 was no longer available at https://github.com/bwemheu/Tax4Fun2.

MetGEM
The MetGEM is based on the genome-scale metabolic models (GEMs) for inferring the metagenomic content from 16S rRNA gene sequences with an overall focus on annotating the metabolic function of the human gut microbiome.ASV abundance tables along with corresponding taxonomic groups were given as input.Default models such as k_core and e_core were chosen to predict KEGG orthologs (KO) and Enzyme commission (EC) number abundances, respectively, which were previously shown to provide good estimation in typical situations [32].Since MetGEM does not provide pathway abundances, the output from MetGEM was subjected to the PICRUSt2 pathway prediction step.

PanFP
PanFP [29] is based on the pangenome reconstruction of a 16S rRNA gene OTUs from known genes and genomes pooled from the OTU's taxonomic lineage.PanFP includes complete genomes of prokaryotes with their full taxonomic lineages obtained from the National Centre for Biotechnology Information (NCBI) [64] and mapped KO terms to proteins using the cross-reference ID mapping between NCBI RefSeq [65] and UniProt provided by UniProt KnowledgeBase (UniProtKB) [66].For each OTU, PanFP builds a pangenome by making a superset of all genes present in organisms pooled from the dataset of prokaryote genomes in the given OTU's taxonomic lineage.Therefore, OTUs with the same taxonomic lineages have the same pangenome.Then, PanFP derives a functional profile of the pangenome by accumulating functional compositions in the superset.An OTU-sample table is converted into a lineage-sample table by summing the frequencies of OTUs with the same lineage in a sample.Finally, the function-sample table is derived by combining functional profiles of lineages with weights corresponding to the lineage abundance in the sample.

Customized normalization using rrnDB
To test the effect of copy number normalization on the functional profiling, we repeated the entire workflow described above, replacing the gene copy number normalization step by obtaining the copy numbers from rrnDB.We processed rrnDB with the PICRUSt2 place_ seq.py command (see https://github.com/mruehlemann/16s_cnv_correction_databases for details) and used the resulting abundance tables as input for other tools by skipping their built-in copy number normalization steps.

Validating inference tools with shotgun metagenomic sequencing data
We used two methods to evaluate the consistency and accuracy of functional inference tools.It should be noted that a direct comparison of functional profiles inferred with all four tools is virtually impossible due to several changes in the KEGG Orthology since PICRUSt2, Tax4Fun2, PanFP, and MetGEM were developed.Hence, inferred functional profiles as well as those obtained by metagenomic shotgun sequencing were converted to relative abundances prior to comparison.For metagenomics, the counts were converted into relative abundances using the renorm table function in HUMAnN 3.0.We analysed the Spearman correlation between inferred gene composition and those from metagenome sequencing.Only functions present in the metagenomic profile and in the inferred profile were considered in each comparison.PCA was computed using the dudi.pca() function from the ade4 package [67].The PCA results were used to assess and compare the predictive performance of four distinct tools in estimating the functional potential of a microbial community, relying on 16S rRNA gene sequencing data.

Differential abundance analysis between cases
As proposed by Sun et al. [46], differential analysis is better suited than correlation analysis to assess whether metagenome inference tools are able to detect biological variation between samples.We thus evaluated the performance of inference tools in predicting biological functions using the differential abundance testing method.To do so, we removed KO terms with low prevalence (<5 % of samples) and retained overlapping KO and pathway terms between the prediction tool and MGS for differential analysis.We performed a Wilcoxon rank-sum test of the two groups (disease versus control) in each dataset for both default and customized normalization.
Using simulated datasets, we compared the KO abundance between GI versus skin and also GI versus airways to test the accuracy of the tools.Similarly, using the CRC dataset, we compared the KO abundance between cancer (n=41) and healthy tissues (n=50).Using the PopGen and FoCus datasets, we compared the differentially abundant KO terms between obese and healthy samples.Patients with a body mass index (BMI) >30 kg m − ² were considered obese.For the KORA dataset, differentially abundant KO terms were tested between type two diabetes and healthy controls.The Wilcoxon rank-sum test was applied to each cohort to test the difference between groups for MGS and prediction results.The inferred KO abundance and significant KO terms with an p-value<0.05were extracted (without correcting for multiple testing).Overlapping significant KO terms between MGS and inference tools were extracted and compared to evaluate the performance of inference tools using the F1 score, recall, and precision metrics.

16S-based functional inference can uncover functional differences between diverse environments in simulated data
Simulated datasets provide a controlled environment for testing the performance of tools and can be constructed free of bias and variability due to differences in sequencing platforms, protocols, and sample preparation methods.Hence, we used this setup to show how close functional inference tools can be expected to approximate metagenome-derived functional profiles under optimal conditions.PCA was used to compare the performance of four different tools in predicting the functional potential of a microbial community based on 16S rRNA gene sequencing data (Fig. 2).Based on the results of the PCA analysis, it was observed that PICRUSt2 and PanFP, using either default or customized normalization, were embedded more closely to the metagenomic profiles than other tools.This suggests that these tools were more accurate in predicting the functional potential of the microbial community, based on the 16S rRNA gene sequencing data.In contrast, Tax4Fun2 and MetGEM showed a high discrepancy from their corresponding metagenomes.This indicates that these tools may not be as accurate as the other tools in predicting the functional potential of the microbial community based on 16S rRNA gene data.
We further conducted differential abundance tests using KO terms detected in simulated MGS and 16S rRNA gene sequencing data.We focused on the comparisons of GI and oral cavity as well as GI and airways.We expected to find considerable variation in the functional profiles due to the distinct microbial communities between these environments.
We considered significantly differentially abundant KO terms detected in MGS as ground truth and compared this list against significantly differentially abundant KO terms reported by functional inference tools, allowing us to calculate F1 score, recall and precision.In the GI vs. oral cavity comparison, PICRUSt2 and Tax4Fun2 displayed a moderately higher F1 score (ranging from 0.5 to 0.8) and recall compared to low-performing tools PanFP and MetGEM.In the GI vs airways comparison, PICRUSt2 again performed well with higher recall and F1 scores followed by MetGEM, Tax4Fun2, and PanFP.Notably, all tools displayed low precision (Fig. 3).This suggests that even in the simulated datasets, functional inference tools tend to produce a large number of false positive results.
Fig. 2. Principal Component Analysis (PCA) was conducted for the functional profiling of simulated datasets using both functional inference tools and metagenome analysis.The left panel displays PCs 1 and 2. Since MetGEM appears as an outlier, we also plotted PCs 2 and 3 in the right panel, which shows differences between the tools more clearly.The term 'custom' indicates that 16S rRNA gene copy number normalization was carried out using the rrnDB database.When using simulated datasets from various environments, inference tools such as PICRUSt2 and PanFP generated functional profiles that were overall similar to the ground truth metagenome functional profiling of HUMAnN3.In contrast, functional profiles from Tax4Fun2 and the MetGEM were clustered far from the ground truth.

Functional inference tools show good performance in predicting the presence of KO terms in real data
While simulated microbial datasets generated from 16S rRNA gene sequencing can be useful for benchmarking, they do not fully capture the complexity and variability of real-world microbial communities.One major limitation is the lack of amplification bias that occurs during PCR amplification [7,47,68] of the 16S rRNA gene.This bias can result in uneven amplification of different bacterial taxa, leading to overrepresentation or underrepresentation of certain microbial species in the resulting amplicon library and functional profile variation among phylogenetically related genomes.As a consequence, microbiome functional profiles inferred from the 16S rRNA gene can deviate from MGS-derived predictions.To evaluate the performance of analysis tools more accurately, we used paired MGS and 16S rRNA gene datasets from four different population cohorts (CRC, FoCus, PopGen, and KORA).
We confirmed the findings of an earlier study by Sun et al. [46], who reported that Spearman correlation values are not affected by label permutation and are thus not suited to robustly assess the performance of functional inference tools (Fig. S1, available in the online version of this article).As an alternative, we consider here which KO terms are reported and compare them to those identified in MGS as ground truth.KO terms that are uniquely identified by 16S-based inference tools represent false positives, whereas missing KO terms in 16S functional inference tools that are reported by MGS represent false negatives, allowing us to compute F1 score, recall, and precision.Overall, Tax4Fun2, PICRUSt2, and PanFP showed similar performance in terms of F1 score and recall in contrast to MetGEM, which had already shown poor performance in the simulation study which is confirmed here (Fig. 4).

Functional inference tools show poor performance in predicting differentially abundant KO terms in real data
An interesting question for biomedical research is whether these methods are accurate enough to pick up on health-related differences that may be subtle on a functional level.Hence, we performed a Wilcoxon rank sum test to identify differential pathways between healthy and disease groups, where we found that PICRUSt2, Tax4Fun2, and PanFP had varying degrees of overlap with MGS results.In the CRC cohort (Fig. 5), PICRUSt2 shows the largest overlap of significant KO terms (n=654, 61.5 %) with MGS results followed by Tax4Fun2 (56.4 %) and PanFP (54.6 %).The number of overlaps of significant KO terms dropped significantly in other cohorts including KORA, FoCus, and PopGen (Figs S2-S5 including for simulation dataset).In the KORA cohort, Tax4Fun2 has the largest overlap of significant KO terms (n=112) followed by PICRUSt2 with custom normalization (n=108) and PICRUSt2 (n=66).PanFP with custom normalization (n=94) showed higher overlap than the PanFP with default normalization (n=13).In the PopGen and FoCus cohorts, the size of the overlaps dropped considerably, as only a few overlapping KO terms were found.PICRUSt2 (~4-5 %), Tax4Fun2 (2-3.9 %) and PanFP with customized normalization (0.8-3 %) showed a somewhat good overlap of significant KO terms in the PopGen and FoCus cohorts.and MetGEM showed poor overlap (0.8-1 %) of KO terms across all cohorts.we calculated the F1 score, recall, and precision considering MGS as the ground truth for differentially abundant KO terms (Fig. 6).Comparing the healthy and disease groups we obtained overall low F1 scores for all functional inference tools with markedly better performance in the CRC cohort due to a higher recall.The F1 scores of PICRUSt2, Tax4Fun, and PanFP were similar and could not be improved by customized copy number normalization, while MetGEM showed a slightly improved F1 score after copy number correction.
We performed label randomization to assess the significance and robustness of the F1 score, recall, and precision values obtained from the original data (Fig. 7).Randomization was performed only on the inferred functional profiles (not on the MGS data) at the sample levels and differential abundance testing was performed using the Wilcoxon rank sum test.Subsequently, the inferred p-values were compared with the p-values obtained from differential abundance testing performed on MGS.After label randomization, the groups we compare are random; hence, the differentially abundant pathways are not meaningful, and we obtain a baseline for our comparisons.Compared to the random baseline, we observed considerably higher recall in the CRC cohort and slightly higher precision in the KORA and PopGen cohorts.In contrast, we observed equal or even worse performance in the FoCus cohort.

Many disease-relevant KO terms were differentially abundant in a targeted analysis
We next focused on known disease-specific KO terms to assess if functional inference tools can detect differences in them.For the CRC cohort, we focused on KO terms classified under pathways in glycans biosynthesis and metabolism, lipid metabolism, carbohydrate and amino acid metabolism.These KO terms are reported to be enriched in CRC patients [69][70][71][72][73].For the KORA cohort, we focused on KO terms classified under pathways in carbohydrate [74] and amino acid metabolism [75,76].In diabetes, particularly in type two diabetes, impaired carbohydrate metabolism is a hallmark feature.The body's ability to regulate blood glucose levels becomes compromised, leading to hyperglycaemia.Insulin resistance and impaired insulin secretion contribute to this dysregulation [77].Also, amino acids play essential roles in cellular metabolism and have implications for glucose homeostasis and insulin action.Perturbations in amino acid metabolism have been observed in diabetes [78,79].Since the FoCus and PopGen cohort studies include both healthy and obese groups, we focused on KEGG functional categories such as carbohydrate metabolism, amino acid metabolic pathways, and sugar transport which have been reported as enriched in obesity [80,81].We selected significant KO terms (unadjusted p-value<0.05) in the MGS data under the above-mentioned categories and compared them with KO terms obtained from the functional inference tools (Fig. 8).We also performed the same analysis within each KEGG higher level category and the results are discussed in Fig. S6.PICRUSt2, Tax4Fun2 and PanFP performed comparatively than MetGEM in the CRC cohorts.For the biosynthesis and metabolism of the glycans category, PICRUSt2, Tax4Fun2 and PanFP (default and customized normalization) showed only 15 % overlapped significant KO terms with MGS.There were only 3 % overlapping significant KO terms between MGS and MetGEM.Similarly, the study compared KO terms in lipid metabolism and found that PICRUSt2 (default and custom normalization) showed a bit high number of overlaps (~7 %) with MGS, followed by PanFP (~6 %) and Tax4Fun2 (~5 %) (default and customized normalization).
In other cohorts, the performance of these tools, including PICRUSt2 and Tax4Fun2, was poor.For example, in the KORA cohort and the pathway of carbohydrate metabolism, only PICRUSt2 (with default and custom normalization) showed a few overlapping significant KO terms with MGS.This indicates that the performance of other inference tools may be limited in predicting disease-specific KO terms related to carbohydrate metabolism in the context of the KORA cohort.Similarly, FoCus and PopGen cohorts showed very poor overlaps with MGS (individual abundance differences of KO terms for these functional categories were shown in Figs S7-S10).In summary, these results show that functional inference tools have limited success in identifying disease-related functional activities.

DISCUSSION
Metagenome analysis is often the first choice for inferring functional relationships within the microbiome and between microbiomes and their ecosystem [82,83].However, due to the lower costs, 16S rRNA gene profiling is still popular for studying microbial abundances and taxonomic profiles [84][85][86].PICRUSt2, Tax4Fun2, PanFP, and MetGEM were developed to predict microbial functions from 16S rRNA gene sequencing datasets.They achieve this by utilizing reference genome databases such as KEGG and ancestral state reconstruction methods [27][28][29].Each tool follows a different algorithmic approach and relies on different references, leading to a considerable prediction this study, we sought to understand the limitations of these tools in human disease research.
We initially considered a simulated data set in which we generated full-length 16S profiles from simulated MGS data without the influence of technical and platform-related differences, e.g.related to PCR bias.In a global PCA analysis, PICRUSt2 and PanFP results closely matched the MGS-derived pathway abundances.Tax4Fun2 showed a shift from the ground truth but maintained the overall differences of the sample groups, whereas MetGEM showed poor performance.All inference tools except MetGEM showed overall performance in predicting the presence of KO terms in real data.PICRUSt2 performed slightly better than other tools, possibly because the hidden state prediction increases the sensitivity to detect ubiquitous functions [47].Another explanation could be that PICRUSt2 uses the most recent reference, i.e. the Integrated Microbial Genomes and Microbiomes (IMG/M) database, for predicting the functional content of microbial communities.On the other hand, MetGEM showed the lowest precision.One reason might be that the AGORA collections used as reference contain 818 genome-scale models from the human gut microbiome, which covered only 1470 KO terms, 983 EC numbers across 226 genera, and 690 species in total [32,34].
Prior studies established a high Spearman correlation between inferred functional profiles and MGS-derived profiles that occurs even after label randomization [46].In this study, we focused on the differential abundance of functional terms as a more challenging scenario.We observed poor performance across all functional inference tools, with PICRUSt2, PanFP and Tax4Fun2 showing comparable performance, whereas MetGEM showed very poor performance.All tools showed better performance in the CRC cohort compared to the other cohorts.We hypothesize that cancer-associated dysbiosis functional changes may be more pronounced than those in diabetes and obesity.Prior research suggests a close association between CRC and the gut microbiome, with modifications in the microbiome contributing to CRC development through factors such as inflammatory disorders, microbial metabolites, or virulence factors [87][88][89][90].In contrast, other conditions like diabetes or obesity are suspected to be multi-factorial and subject to a complex interplay of genetic and environmental factors [91,92].
Rather than looking at the differential abundances across the entire functional profile, one can also focus specifically on KO terms previously associated with the disease.Here, we observed significant differences in many of the KO terms also identified in MGS.This suggests that in contrast to a global analysis which is limited by false positives, a more targeted 16S functional profiling can pick up health-related differences to a certain extent.
A clear finding of this study is that functional profiles from 16S rRNA gene data are generally not suited to systematically pick up differences related to changes in human health in typical settings such as CRC, obesity or type two diabetes.Limitations in the performance of functional inference tools can be explained by incomplete reference genomes [93][94][95].However, even with vastly improved reference genome databases, limitations of the 16S rRNA with respect to taxonomic granularity remain that cannot be overcome.We also note that the effect of improved copy number normalization with rrnDB was mostly negligible.While functional inference tools can pick up major differences, e.g. between ecological niches, they should not be used as a replacement for MGS when studying the association of microbial function and human health.
Among the available tools, we recommend using PICRUSt2, which appeared to be the most robust, followed by PanFP, which could be improved by introducing a customized copy number database.If researchers intend to produce functional profiles from 16S rRNA gene data for hypothesis generation they should be aware of these limitations and implement control strategies such as sample label randomization.We using more advanced techniques, i.e. third-generation sequencing or (shallow) metagenomics [96,97] for more accurate analysis.

Fig. 1 .
Fig.1.Overall benchmarking workflow to evaluate the performance of 16S rRNA-based functional inference tools (workflow in grey) against metagenomics (workflow in blue) on simulated and real datasets.First, paired WGS-16S rRNA gene datasets are selected for comparison.Here, simulated data (a) or real datasets (b) are considered.In the next step, we retrieve the functional profiles using HUMAnN3 for the MGS datasets and four different functional inference tools (PICRUSt2, Tax4Fun2, PanFP, and MetGEM) for 16S rRNA gene sequencing datasets.A Wilcoxon rank-sum test was performed to compare the KO terms resulting from the two methods.

Fig. 3 .
Fig. 3. Comparison of significantly differentially abundant KOs between inferred and MGS results using evaluation metrics.The left panel indicates the results from the differential abundance between airways versus GI, while the right panel indicates the results from GI versus oral.Default 16S rRNA gene copy number normalization is represented by solid colour bars, whereas the striped colour bars indicate that customized 16S rRNA gene copy number normalization was performed using rrnDB.F1 score, recall, and precision scores are reported for each category compared to the MGS data, both for default and customized normalization.

Fig. 4 .
Fig. 4. Comparison of detected KEGG terms between inferred and MGS.F1 score, recall and precision are reported for each category compared to the MGS data.PanFP, PICRUSt2, and Tax4Fun2 showed comparable and relatively consistent performance across datasets while MetGEM shows poor recall.

Fig. 5 .
Fig. 5.The overlap of significant KO terms between different functional inference tools and MGS results in the CRC cohort.The number of overlapping significant KO terms can be used as a quantitative indicator of the accuracy of the predictions.

Fig. 6 .
Fig. 6.Comparison of significantly differentially abundant KO terms between inferred and MGS.F1 score, recall, and precision are reported for each category compared to the MGS data.

Fig. 7 .
Fig. 7. Comparison of significantly differentially abundant KO terms between inferred metagenomes and MGS was conducted before (original) and after 20 rounds of randomization.The randomization process was repeated 20 times to assess the significance and robustness of the F1 score, recall, and precision.A dumbbell plot was generated, depicting the mean value of the 20 randomizations against the original (ground truth) score.

Fig. 8 .
Fig. 8. Heatmap representing the number of significant KO terms under each KEGG metabolic function between functional inference tools and MGS.Each cell in the heatmap corresponds to a specific combination of a disease-specific KO term and a tool.The colour of the cell indicates the number of overlapping KO terms detected by both MGS and the functional inference tools.Darker colour represents a higher number of overlapping KO terms, while paler colour indicates a lower number of overlaps.

Table 1 .
Basic characteristics of population cohorts used in the benchmark study