PhyloPlus: a Universal Tool for Phylogenetic Interrogation of Metagenomic Communities

ABSTRACT Phylogeny is a powerful tool that can be incorporated into quantitative descriptions of community diversity, yet its use has been limited largely due to the difficulty in constructing phylogenies which incorporate the wide genomic diversity of microbial communities. Here, we describe the development of a web portal, PhyloPlus, which enables users to generate customized phylogenies that may be applied to any bacterial or archaeal communities. We demonstrate the power of phylogeny by comparing metrics that employ phylogeny with those that do not when applied to data sets from two metagenomic studies (fermented food, n = 58; human microbiome, n = 60). This example shows how inclusion of all bacterial species identified by taxonomic classifiers (Kraken2 and Kaiju) made the phylogeny perfectly congruent to the corresponding classification outputs. Our phylogeny-based approach also enabled the construction of more constrained null models which (i) shed light into community structure and (ii) minimize potential inflation of type I errors. Construction of such null models allowed for the observation of under-dispersion in 44 (75.86%) food samples, with the metacommunity defined as bacteria that were found in different food matrices. We also observed that closely related species with high abundance and uneven distribution across different sites could potentially exaggerate the dissimilarity between phylogenetically similar communities if they were measured using traditional species-based metrics (Padj. = 0.003), whereas this effect was mitigated by incorporating phylogeny (Padj. = 1). In summary, our tool can provide additional insights into microbial communities of interest and facilitate the use of phylogeny-based approaches in metagenomic analyses.


Full Lineage Information Extraction.
Each of the non-redundant taxIDs retrieved above was searched against the NCBI Taxonomy database to fetch its full lineage information. As for the retrieved full lineage information, taxIDs that corresponded to different taxonomic ranks were collected, including superkingdom, phylum, class, order, family, genus, species group, and species. The lineage information was recorded in three individual text files, representing full lineage for each of the tips present in the original phylogeny, for each of the non-redundant species-level taxIDs identified in the original phylogeny, and for each of the non-redundant species-level taxIDs provided via the user input, respectively. All searches against NCBI databases were processed in batch using the Biopython package [2].

Appending Species to the Original Phylogeny.
For each of the unique species identified in the user input and GTDB phylogeny, the sequence of taxIDs at different taxonomic ranks was searched against full lineage information appended to the tip labels, starting from the lowest rank (species) to the highest (superkingdom). This determined the lowest taxonomic rank possible where at least one of the genome assemblies in the original GTDB phylogeny shared with the query species.
Upon determination of the taxonomic rank to map the query species, all tips within the original phylogeny that shared the same taxID at the corresponding rank were extracted (e.g., all tips sharing genus-level taxID 226 for locating species Alteromonas sp. 76-1, where the exact species was not found in the original phylogeny, therefore its location was inferred using its congeneric taxa), their most recent common ancestor (MRCA) node was identified using getMRCA function from the ape R package [3]. A subtree rooting at the MRCA node was extracted, and the average distance of all its children tips that shared the same taxID at the determined taxonomic rank to the MRCA node was calculated. Then the species label (taxID or scientific name) was appended to the original phylogeny using add.tips function from the phangorn R package with the MRCA node being the place to bind the tip and the computed average distance being the branch length [4].

Removal of Potential Outliers and Tree Pruning.
Taxonomic misclassification can result in extreme outlier tips that could cause the computed MRCA node to reside close to the base root of the phylogeny, leading to abnormally long branch length assigned to the query species. Therefore, removal of potential outlier reference tips was done during the process of determining the MRCA node and computing the branch length.
For each member in the group of reference tips that were used to map a particular query species, its average distance to other group members (davg) was calculated. Then the mean and standard deviation were calculated for all davg values within the reference group. Potential outliers were defined as tips whose davg value exceeds mean plus n times standard deviations (n = 1, 2, 3, 4, and 5, respectively). For species mapped at different taxonomic ranks, different values of n were applied where the finalized threshold values were determined by considering how much improvement has been made in terms of branch length distributions and how much phylogenetic information was retained after removal of potential outlier tips ( Fig. 2 and Table 1).
For groups containing only two reference tips where detection of potential outlier tips based on mean and standard deviation was impractical, the fraction ℎ ℎ for the more distant tip was used to indicate if an outlier was present in the reference group (threshold used in this case: fraction ≥ 0.75). The potential outlier tip was determined based on the lowest taxonomic rank it shared with its neighboring tips, where two nodes back were taken to extract the corresponding subtree. The one with a comparatively higher-level taxonomic rank shared with the neighboring tips was determined to be the outlier.
Lastly, the phylogeny was pruned by removing all original tip labels which represented a single genome assembly, so that all inter-tip distances within the modified output phylogeny represented interspecific distances under NCBI taxonomic system.

Supplementary Notes
The taxa Isorropodon fossajaponicum symbiont (taxID 883811) and Abyssogena phaseoliformis symbiont (taxID 596095) were identified in the Kraken2 standard bacterial library and were assigned taxonomic rank species by NCBI, but were not added to the modified phylogeny, as they can only be mapped at the superkingdom level where the inference of their location using the entire bacterial phylogeny was computationally infeasible and biologically meaningless.
Updates and merging of NCBI taxIDs are continuous, for example, during the time of writing this manuscript, taxIDs 147467 and 861208 that were retrieved from the Kraken2 source have been merged into taxIDs 1296 and 1183401, respectively. These updates can cause fatal errors for phylogenetic tree parsers when the corresponding species scientific names were used as tip labels, due to the fact that multiple tips share identical labels.
To handle the above two cases, some of the taxIDs uploaded by the user to the web portal may be (1) deleted prior to processing if they belong to the wrong superkingdom (e.g., adding an archaeal species to the bacterial phylogeny) or can only be mapped at the superkingdom level; or (2) modified or deleted if they are updated or merged into existing taxIDs, respectively. Any changes to the user input are recorded in the note.txt file, and the user may need to modify the taxIDs present in the metagenomic classification reports accordingly to allow full compatibility between the classification outputs and the phylogeny.
During insertion of each of the non-redundant species-level taxIDs identified in the original phylogeny, not all species names are mapped back into the original phylogeny, as some of them may represent "pseudo" species-level names (e.g., Enterobacteriaceae bacterium, taxID: 1849603, indeed represents a family but this taxID is assigned the rank species by NCBI and was included in the Kraken2 standard bacterial library). A filter for at least having a recorded speciesand genus-level taxID in the full lineage was applied and only the filtered species were inserted back to the original phylogeny. Fig. 2. (A) Changes in branch length distribution after removal of potential outlier tips defined using different thresholds. The figure is faceted by the taxonomic rank at which a query species could be mapped to the original GTDB phylogeny. For the first facet, species that were mapped with a branch length of 0 (i.e., inferred from only one reference tip) were excluded from the plot for better evaluation of the effects of removing outliers. (B) Overall comparison of branch length distribution after removal of all outliers. Final threshold used in this study are: ≥1sd for species level; ≥2sd for species group level; and ≥2sd for genus level and above.