Methods for automatic reference trees and multilevel phylogenetic placement

Abstract Motivation In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or Evolutionary) Placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do however face certain limitations: The manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; the number of taxa in the reference phylogeny should be small enough to allow for visually inspecting the results. Results We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. Availability and implementation Freely available under GPLv3 at http://github.com/lczech/gappa. Supplementary information Supplementary data are available at Bioinformatics online.


Details about the Phylogenetic Automatic (Reference) Tree Evaluation
To test the Phylogenetic Automatic (Reference) Tree (PhAT) method, we used the "SSU Ref NR 99" sequences of the Silva database (Quast et al., 2013) version 123.1 and the corresponding taxonomic framework (Yilmaz et al., 2014), which are available at http://www.arb-silva.de. The database contains 598 470 aligned sequences from all three domains of life, classified into 11 860 distinct taxonomic labels, and mainly contains bacterial sequences. In detail, there are • 22 913 sequences with 347 taxonomic labels for the Archaea, • 62 436 sequences with 7441 taxonomic labels for the Eukaryota, and • 513 121 sequences with 4072 taxonomic labels for the Bacteria.
The overall number of taxonomic labels is counted here, that is, it includes higher level labels. As explained in the main text, we constructed four sets of consensus sequences from these data using our PhAT method: a General set ("all of life"), as well as separate sets for the domains Archaea, Bacteria, and Eukaryota. The assembly of the four data sets with our method required in total about 30 min and 10 GB of memory on a standard laptop computer. This includes counting alignment characters, calculating entropies and constructing consensus sequences. The resulting data set sizes and the fraction of sequences from each domain the PhATs contain are shown in Table S1.
Our implementation of the method furthermore contains some details that are worth mentioning for reproducibility: It is possible to constrain the maximal size of clades in order to not build a consensus sequence for an overly large clade, which might not be a good representative of that clade. For the same reason, it is possible to first expand the highest ranks of the taxonomy into separate candidates. We used conservative values for these two constraints (a maximal clade size of 2000 and an expansion of only the first two taxonomic ranks), in order to give more weight to the sequence entropy. Lastly, some clades contain only one sub-clade. Those were immediately expanded, as they do not change the length of the candidate list during the algorithm.
We then inferred unconstrained and constrained maximum likelihood trees on the sequences, running 50 independent tree searches for each tree and selecting the best-scoring tree. Unconstrained trees were inferred using RAxML 8.2.8 (Stamatakis, 2014). Constrained trees were inferred with Sativa 0.9-55 (Kozlov et al., 2016), which internally again relies on RAxML, and offers a convenient way to transform a taxonomy into a constraint tree.
The relative Robinson-Foulds distances (Robinson and Foulds, 1981) between the four pairs of trees (constrained versus unconstrained) are between 45.8% and 49.7%. The differences between the trees however mostly concern inner branches. As query sequences (QSs) generally tend to be placed more towards the terminal branches of the tree, the differences in the inner branches thus are acceptable for our evaluation purposes. Furthermore, we performed significance tests comparing the constrained trees to the unconstrained ones, as shown in Table S2. The tests show that in all cases, the unconstrained trees fit the data significantly better.
Given these eight PhATs, the evaluation was conducted as explained in the main text. As the sequences in Silva are already aligned to each other, no alignment step was necessary. As they contain no phylogenetic signal, we removed sites consisting entirely of gaps from the alignment, in order to reduce the memory footprint of downstream steps. Phylogenetic placement was conducted using EPA-ng (Barbera et al., 2018), which is substantially faster and more scalable than RAxML-EPA  and pplacer (Matsen et al., 2010).
We evaluated the accuracy of the placements using the taxonomic labels of the sequences in Silva as an indicator of the expected branch of each sequence. Thus, we have to assume the taxonomic label of each sequence to be correct. However, errors are expected due to incongruity between the taxonomy and the phylogeny (Moreira and Philippe, 2000), as well as due to taxonomically mislabeled sequences (Kozlov et al., 2016). For example, Sativa (Kozlov et al., 2016), found 9934 mislabeled sequences in the Silva database. Furthermore, 17 452 sequences contain one of "incertae", "unclassified" or "unknown" in their name, indicating that those sequences might not be reliable. In total, there are 25 910 (or 4.3%) such dubious sequences in version 123.1 of the Silva database. Not all sequences are hence expected to be placed on their expected branches. We also evaluated how these dubious sequences affect tree accuracy. To this end, we used the same four trees as before (i.e., they were inferred on all sequences, including the dubious ones), but for the evaluation step we then excluded the dubious sequences. That is, those sequences were not placed on the trees, and their distance to the expected branch was not used for the evaluation. In most cases, this improved the results slightly (data not shown). Therefore, we decided to only report the unfiltered results.

Bacterial Vaginosis
For testing the accuracy of our unconstrained Bacteria tree on real data, we used a vaginal microbiome dataset of 220 women (Srinivasan et al., 2012), which was provided by Sujatha Srinivasan. See Figure S5 for the respective results. This small dataset with a total of 426 612 query sequences, thereof 15 060 unique, was already analyzed with phylogenetic placement methods in the original publication. We used it as an example of a well-designed study to asses our results using a PhAT as reference tree. In addition to the Bacteria tree, we re-inferred the reference tree of the original publication using their alignment, again using RAxML 8.2.8 (Stamatakis, 2014). The query sequences of the dataset were then aligned to our reference tree and alignment, as well as to the reference alignment of the original publication and our re-inferred tree. For aligning, we used a custom MPI wrapper of PaPaRa 2.0 (Berger andStamatakis, 2011, 2012), which is available at (Berger and Czech, 2016). Finally, the query sequences were placed on these trees using EPA-ng (Barbera et al., 2018), and the analyses were subsequently performed as explained in Figure S5.

Human Microbiome Project
We used the Human Microbiome Project (HMP) dataset (Huttenhower et al., 2012;Methé et al., 2012) for further testing our methods (see Figure 4). In particular, we used the "HM16STR" data of their initial phase "HMP1", which are available from http://www.hmpdacc.org/hmp/. The dataset consists of trimmed 16S rRNA sequences of the V1V3, V3V5, and V6V9 regions. Each sample of the dataset is labeled with the human body site where it was obtained. See Table S4 for an overview of those labels. The data are further divided into a "by sample" set and a "healthy" set, which we merged in order to obtain one large dataset, with a total of 9811 samples. We then removed 10 samples that were larger than 70 MB as well as 605 samples that had fewer than 1500 sequences, because we considered them as outliers, and finally 2 samples that did not have a valid body site label assigned to them. This resulted in a set of 9192 samples containing a total of 118 702 967 sequences with an average length of 413 bps. From these samples, sequences with a length less than 150 bps as well as sequences longer than 540 bps were removed. No further preprocessing (e.g., chimera detection) was applied. This resulted in a total of 116 520 289 sequences, of which 63 221 538 were unique. These were split into 1265 chunks of size 50 000 each, which were subsequently aligned to and placed on the unconstrained Bacteria tree with 2059 tips using the steps explained above. The chunk placements were then transformed again into per-sample placement files, before finally running the steps explained in Figure 4.

Pipeline and Implementation
An implementation of the methods described here is freely available in our tool gappa, which is published under GPLv3 at http://github.com/lczech/gappa. gappa is based on our C++ library genesis, which offers functionality concerning phylogenies and phylogenetic placement data, but also has functions to work with sequences, taxonomies and many other data types. genesis is also published under GPLv3 and is available at http://github.com/lczech/genesis. gappa offers a command line interface for typical tasks of working with phylogenetic placements. The methods described in this paper are implemented via four sub-commands: • phat: Phylogenetic Automatic (Reference) Tree method. The command expects a taxonomy file and a sequence file of a sequence database, e.g., Silva (Quast et al., 2013;Yilmaz et al., 2014), as well as the target number of consensus sequences to be generated for the intended phylogeny. The result is a fasta file with consensus sequences representing taxonomic clades. The command can be further customized, e.g., by changing the consensus sequence method, using only a specified subclade of the taxonomy for running the algorithm, as well as several detail settings for the method. It can also output additional info files that allow to inspect details of the calculations, like the number of sequences and their entropy per clade.
• extract: Extract/collect placements in specific sub-clades of the tree. The command performs the main step of the multilevel placement approach. Its input is a set of jplace files containing placements on the backbone tree, as well as a file listing the clade name that each taxon of the backbone tree belongs to. For each clade, it then writes a new jplace file, containing all queries that were placed in that clade with more than a customizable threshold of their placement mass. Furthermore, if provided with the sequence files that were used to make the input jplace files, the corresponding sequence of each query are also written to fasta files per clade. Thus, a per-clade collection of sequences is created, where each result file contains the sequences that were placed in this clade of the backbone tree. These can then be used for the second level placement on separate clade-specific trees.
• chunkify: Split a set of fasta files into chunks of equal size, and write abundance maps. The command re-names the sequences using a configurable hash function (MD5, SHA1 or SHA256), and de-duplicates across all input sequences. Its output are chunk files of sequences, as well as an abundance map file for each input sequences file. The sequence chunk files can then be used to perform phylogenetic placement to obtain per-chunk jplace files.
• unchunkify: Take the per-chunk jplace files as well as the abundance map files, and generate a jplace for each original sequence file, including the correct abundances. This command is the second step of the chunkify command, and reverts its effect, so that the resulting jplace files are as if they were created using the original sequence files.
• assign: Perform taxonomic assignment using phylogenetic placements. While this is not the main focus of this work, we briefly introduce this method here. The command uses a taxonomic labeling of the tips of the reference tree to annotate all inner branches of the tree with the longest common taxonomic label for the induced subtree of the inner branch, in analogy to Sativa (Kozlov et al., 2016). Then, each query sequence in the provided jplace files is taxonomically assigned according to the labels of the branches where it does have placement mass. This can subsequently either be used for taxonomic assignment of the query sequences themselves, or to obtain a taxonomic profile of one or more samples.
These are the commands of gappa relevant for this paper, but it also offers more commands that are useful when working with phylogenetic placements. For details on the commands, and additional potentially useful commands, see the gappa documentation at https://github.com/lczech/gappa/wiki. At the time of writing, it is under active development, and more functions are planned for the near future. Furthermore, we provide prototype implementations, scripts, data and other tools used for the tests and figures in this paper at http://github.com/lczech/placement-methods-paper.

Taxonomic Assignment and Profiling
Taxonomic profiling is one potential application of phylogenetic placement. Thus, we also tested how well our PhATs are suited for this purpose. The CAMI Challenge (Sczyrba et al., 2017) is a community-driven effort to assess taxonomic profiling methods using a common set of benchmark data sets. To assess the feasibility of using trees generated with PhAT to perform taxonomic profiling of microbiome data, we utilized the mouse gut data set of the 2nd CAMI Challenge (Bremges and McHardy, 2018). For this data set, we performed read preprocessing, phylogenetic placement against our constrained and unconstrained Bacterial trees, and subsequent taxonomic assignment.
More specifically, we used the unpaired HiSeq reads of the mouse gut data set from CAMI, which comprises 64 samples of simulated reads. The preprocessing involved read de-interleaving following Watson-Haigh (2012), paired-end read merging using PEAR (Zhang et al., 2014), as well as quality filtering and conversion to fasta using VSEARCH2 (Rognes et al., 2016). This yielded a total of 800 341 409 reads. As our trees are based on small ribosomal subunit sequences, we also performed read filtering to obtain reads from the 16S rDNA locus. This filtering was performed using the protocol of Logares et al. (2014), which relies on HMMER (Eddy, 1998(Eddy, , 2009, and respective profiles for the 16S rDNA locus. We performed a global identity based de-replication step on the resulting reads that yielded 616 405 query sequences. We aligned these query sequences to our Bacteria reference alignment using PaPaRa 2.0 (Berger andStamatakis, 2011, 2012). We then performed phylogenetic placement of the aligned query sequences onto the unconstrained and constrained reference trees, respectively, using EPA-ng (Barbera et al., 2018). We performed de-dereplication to obtain per-sample data again, resulting in 64 jplace files (one per original sample) with placements of the 16S rDNA sequences, for each of the two trees.
Finally, we performed taxonomic assignment and taxonomic profiling of the per-sample results using the assign command implemented in gappa, which works analogously to the method used in Sativa (Kozlov et al., 2016). Its basic steps are described in Section 3. The CAMI challenge requires taxonomic assignments that conform with the NCBI taxonomy (Sayers et al., 2009;Benson et al., 2009). As our reference tree is however based on the Silva taxonomy (Yilmaz et al., 2014), we developed a dedicated mapping procedure to, in a best effort approach, map our results to NCBI taxonomic names and IDs. The mapping is based on the loose mapping procedure by Balvočiūtė and Huson (2017). More specifically, we try to map taxonomic paths to their name, rank, and ID in the NCBI taxonomy, if we find a name-based match between the two. When this fails, the phylogenetic placement mass assigned to a taxonomic path by our approach is instead added to the last successful mapping further up in the taxonomic hierarchy. By initiating this procedure for each taxonomic path from its root downwards, we ensure that all placement masses are taken into account.
This mapping is a major disadvantage of our approach when using a Silva-based reference, as the Silva and NCBI taxonomies are far from being congruent (Balvočiūtė and Huson, 2017). Also note that our reference is limited to 16S rDNA locus. This substantially reduces the volume of data we can evaluate. In this particular test, only ≈ 0.08% of the total data was identified as belonging to 16S. This means that our taxonomic profiling only uses a small fraction of the available data.
Finally, we used the CAMI evaluation tool for taxonomic profilers, OPAL (Sczyrba et al., 2017), to compare our approach to competing software and the "gold standard" result for the data set. Despite the aforementioned caveats of taxonomic profiling using phylogenetic placements in the CAMI setting, we find that the performance of our approach is "middle of the field". Note that this is a comparison to tools that are dedicated to taxonomic profiling, which also typically can assign more of the available reads.
We show the most important OPAL results in Supplementary Figure S7 as well as in Supplementary Table S5. Furthermore, the full OPAL output can be accessed at http://github.com/lczech/placement-methods-paper. Table S1: Taxonomic composition of the four PhATs. The table lists the four trees and their sizes (in number of tips), as well as how many of these tips originate from each of the three domains of life. The target size of the General tree was 2000 taxa, while the Bacteria and Eukaryota tree were targeting 1800 domain-specific taxa, which is approximately reached, but not exactly (underlined values). This is because the sizes of sub-clades in the taxonomy vary. Because each tip of the tree is a consensus sequence that represents the respective lowest taxonomic level, the number of available taxa is smaller than the total number of taxonomic labels in the Silva database. For example, the Archaea have a total of 347 taxonomic labels across all ranks, but only 248 labels at Genus level. Thus, the Archaea tree shown here represents the Archaea taxonomy resolved at the Genus level. In the three domain specific trees, we furthermore included consensus sequences at the Phylum level of the respective two remaining domains, in order to make sure that the evaluation also works well if such "outgroups" are included.   Nguyen et al., 2015) under the GTR+G model and 10 000 resamplings using the RELL method (Kishino et al., 1990). The table shows that the unconstrained trees fit the data significantly better in all four cases and in all tests. Columns are as follows. logL and deltaL: log likelihood and difference between constrained and unconstrained tree. bp: bootstrap proportion using RELL method (Kishino et al., 1990). p-(W)KH: p-value of the one sided and the weighted Kishino-Hasegawa test (Kishino and Hasegawa, 1989). p-(W)SH: p-value of the (weighted) Shimodaira-Hasegawa test (Shimodaira and Hasegawa, 1999). c-ELW: Expected Likelihood Weight (Strimmer and Rambaut, 2002). p-AU: p-value of approximately unbiased (AU) test (Shimodaira, 2002).  Table S3: Overview of the PhATs and their evaluation statistics. Details of four unconstrained (U) and four constrained (C) trees are shown. "Size" is the number of leaves of the tree, that is, the number of consensus sequences that the tree was inferred from. "% Seqs." the percentage of sequences from Silva placed on it. The General tree does not cover all sequences, because there are some sequences labels in the database that could not be mapped to the taxonomy. "∅ Br. Len." is the average branch length in the tree. The evaluation results are reported in the remaining columns: Average distances of the sequences to their respective expected branch are listed in numbers of branches (Discrete) and in branch length units (Continuous), as explained in the text. Furthermore, "Exp. Br. Hits" shows how often the most probable placement was placed exactly on the expected branch. Lastly, the average expected distance between placement locations (EDPL) is shown. The EDPL is the sum of the distances between the placements of a sequences weighted by their probability (Matsen et al., 2010).   (Huttenhower et al., 2012;Methé et al., 2012). We used this dataset to evaluate the applicability of typical analysis methods for phylogenetic placement using our PhATs, see Supplementary Section 2 and Figure 4 for details. In order to simplify the visualization in Figure 4, we summarized some of the labels into eight location regions, as shown in the second column. The last column lists how many samples from each body site were used in our evaluation.   Figure S1: Accuracy comparison between unconstrained and constrained trees. Here, we compare how a taxonomic constraint changes the weighted distances to correct edges when placing our evaluation sequences on PhATs. The evaluation method is explained in the main text. Subfigures (a) and (c) are identical to Figure 3 of the main text and included here for ease of comparison. Subfigures (b) and (d) show the results when using the Silva taxonomy as constraint for the tree inference. The relative Robinson-Foulds distances (Robinson and Foulds, 1981) between the four pairs of trees range between 45.8% and 49.7%. The differences probably occur because our trees span diverse clades, whose ancient branches are hard to resolve. Also, single gene data might not be sufficient to resolve these clades. See Supplementary Section 1 and Table S2 for a more detailed comparison of the constrained and unconstrained trees. However, the differences mostly concern inner branches. Most of the placements are however expected to be near the terminal branches, which are more stable across the trees. This is confirmed by the fact that, overall, the results are similar between the unconstrained and constrained trees. A slight improvement can be observed for the constrained General tree (blue), which performs better according to both distance measures. However, when considering only the distance of the most likely placement (highest LWR) to its correct edge instead of using average distances weighted by the LWR per QS, the constrained trees consistently yield better results (data not shown). For example, the most significant change is observed for the Eukaryota tree, with 84% correct placements for the unconstrained tree, but 89% for the constrained one. We suspect that this is an artifact of our evaluation process, as we consider a sequence to be correctly placed if the placement branch belongs to the consensus sequence to which the sequence contributed. As the selection of sequences for each consensus sequence is guided by the taxonomy, using the same taxonomy as constraint for the tree thus might also improve the placement accuracy.  Figure S2: Effect of different consensus sequence methods on placement accuracy. In the main evaluation of our PhAT method, we used a reference tree and alignment based on majority rule consensus sequences (May, 1952;Day and McMorris, 1992a) of the Silva database sequences. Here, we evaluate the effect of using other consensus sequence methods on phylogenetic placement accuracy. The evaluation method is described in the main text. In addition to (a) majority rule consensus, we tested (b) Cavener's method (Cavener, 1987;Cavener and Ray, 1991), as well as threshold consensus sequences (Day and McMorris, 1992a,b) using thresholds of 50%, 60%, 70%, 80%, and 90%, of which two are shown in (c) and (d). The three remaining threshold methods exhibit accuracies almost exactly in between the shown plots, that is, accuracy decreases with increasing thresholds. For comparison, we also included Figure 3(a) of the main text again, here as Subfigure (a), using the same y-axis scaling as the other plots. We only show distances measured in number of branches here, because this is more relevant in the context of our methods (e.g., Multilevel Placement). By using alternative consensus methods, the consensus sequences and thus the sites in the alignments change. Hence, the obtained reference trees (not shown) differ substantially from each other. Across the corresponding trees of the consensus methods tested here, we observed an average relative RF distance of 49.5%. This is similar to our findings depicted in in Figure S1. Here too, the accuracy of the constrained variants of these trees (data not shown) does not change much compared to the accuracy obtained for the unconstrained trees shown here. Thus, the differences in accuracy seen here are most likely due to the interplay of alignment and placed sequences (which is what we are interested in), and not due to differences in the trees (which are not of interest here). The first three plots (a)-(c) exhibit similar accuracies. On average, majority rule, Cavener's, and low threshold (≤ 70%) consensus methods place 82-83% of the sequences on the expected branch. As a general trend, the Archaea, being the smallest tree, tend to have the highest accuracy. On the other hand, the Bacteria, having the most sequences in Silva, score worst. This changes for high consensus thresholds. At high thresholds, many sites contain ambiguity characters, thus blurring the phylogenetic signal. The General tree, representing the highest diversity, is most affected by this.

Majorities
Single Sequences Figure S3: Effect of using actual sequences (instead of consensus sequences) on placement accuracy. For most of our evaluation of the PhAT method, we used some form of consensus sequence representation for the clades of the taxonomy, see, for instance, Figures S1 and S2. However, we also tested how the method behaves when using actual sequences instead, thus avoiding a potentially fuzzy phylogenetic signal, and related potential drawbacks of consensus sequences. As manually selecting representative sequences from the database was not practical, we used the following automated approach. First, we took the 90% threshold consensus sequences of the PhAT method that were already evaluated in Figure S2(d). By using a high threshold, most of the diversity of a clade is represented. Then, for each such consensus sequence, we calculated a score for all sequences from the database that were used to construct this consensus sequence. This score is the number of different nucleotides between the consensus sequence and the database sequence. The sequence with the lowest score (that is, with most matching nucleotides) was then used as clade representative. Thereby, the taxonomic clades are represented by actual sequences from the database. However, as these sequences are close to the respective consensus sequence, they are still good representatives of the diversity of the clade. Using this set of sequences, we again inferred a tree and conducted the evaluation procedure by placing all sequences of the database on that tree, as described in the main text. Subfigures (a) and (c) are identical to Figure 3 of the main text and included here for ease of comparison, however with the y-axis scaled to fit the remaining subfigures. That is, they show the evaluation of the majority rule consensus sequences. Subfigures (b) and (d) show the evaluation of the approach as described here. The resulting accuracy is worse in all cases. That is, on average, the sequences were placed further from their respective expected branch. We suspect that this is because single sequences do not capture the diversity of their clade as well as consensus sequences, and because they do not incorporate as much biological information (e.g., in form of ambiguity characters).

Firmicutes
Bacteroidetes Cyanobacteria Actinobacteria Others Figure S4: Unconstrained Bacteria tree with five bacterial sub-clades. This tree is the result of our PhAT method applied to the Bacteria sequences in Silva. The tree contains a total of 1914 taxa. Colorized are the five Phylum level sub-clades that we used for testing multilevel placement: Proteobacteria (505 taxa), Bacteroidetes (362 taxa), Firmicutes (360 taxa), Cyanobacteria (39 taxa) and Actinobacteria (53 taxa). The incongruence between taxonomy and phylogeny is visible here as non-monophyletic colored branches. We thus here define a clade to consist of all branches that are part of a monophyletic split of the tree with respect to the taxa in the clade. In other words, all branches on one side of a split are considered to belong to a clade, if that side of the split only contains taxa from that clade. These branches then receive the same color here. Then, for multilevel placement, a sequence is considered to be part of a clade if its most probable placement falls into that clade. For example, a sequence that is placed onto one of the orange branches on this tree is subsequently placed in the Cyanobacteria tree for the second level placement. Each of the five sub-clades is represented by multiple branches here, which we call the "overlap" with the Bacteria tree. Here, we test the behavior of the unconstrained Bacteria tree obtained via our PhAT method when used for standard post-analyses of phylogenetic placement results. For this, we used a sequence dataset of the vaginal microbiome of 220 women (Srinivasan et al., 2012). For details on the processing, see Supplementary Section 2. The original study showed associations between the presence of certain bacterial species and the diagnosis of Bacterial Vaginosis (BV), a condition caused by changes in the vaginal microbiome. In the study, the Nugent score (Nugent et al., 1991) was used as a clinical diagnostic criterion for BV, which ranges from 0 (healthy) to 10 (severe illness). We placed the sequences of the dataset on their original tree and on our PhAT, and reproduce some of the results from the original study to assess differences induced by using distinct references trees. Squash Clustering is a hierarchical clustering method for phylogenetic placement data (Matsen and Evans, 2011) that uses the phylogenetic Kantorovich-Rubinstein (KR) distance (Evans and Matsen, 2012) to measure cluster similarity. The result of Squash Clustering is a cluster tree of samples, where samples that are similar according to the KR distance are closer to each other, with branch lengths according to that distance. The left side of the figure compares the cluster trees resulting from using (a) our tree and (b) the original reference tree. Subfigure (b) is a recalculation of Figure 1(A) of (Srinivasan et al., 2012). The tips, which correspond to samples, are colored by the Nugent score of each sample, and thus indicate which women are affected by BV. The general features of the two cluster trees are comparable, indicating that our tree is able to distinguish between healthy and sick patients. However, there is a major difference in the lower half of the trees: While (b) shows some small branch lengths and even a separated sub-clade of samples with low Nugent score, these branches have a length of virtually zero in (a). As shown in (Srinivasan et al., 2012), the healthy patients are divided into two classes, based on the presence of two species of Lactobacillus. The original reference tree contains sequences of those species, and can thus distinguish between them. Our broad Bacteria tree however does not have this degree of species-level resolution and thus treats them the same, yielding a negligible KR distance between the samples. Although this finding is expected, it serves as an example for the limits of our method. This can also be seen in the right side of the figure, where we compare Edge PCA using (c) our PhAT and (d) the original reference tree. Edge PCA (Matsen and Evans, 2011) is a dimensionality reduction method for phylogenetic placement data, which can visualize differences between samples. Subfigure (d) is a recalculation of Figure 3(A) of (Srinivasan et al., 2012). While our tree is able to separate samples by Nugent score, it does not exhibit the further separation of the two Lactobacillus species that is apparent when using the original tree. Thus, the samples with low Nugent score form one blob in (c). This limitation can be overcome by either resolving the PhAT down to species level, or by using multilevel placement with a refined tree that, for example, contains species sequences of the relevant Lactobacillus clades. Note that analogous issues of deficient resolution or missing species can potentially also arise when hand-selecting reference sequences. In the end, it is the responsibility of the researcher to ensure that the selection of reference sequences is suitable for the dataset to be placed.
Unconstrained Constrained Figure S6: Accuracy of the PhAT of five bacterial sub-clades. We used five sub-clades of the Bacteria in Silva, which were already scrutinized in (Kozlov et al., 2016), to test how our PhAT method works for less diverse sets of sequences. These five clades are also highlighted in Figure S4; see there for a description of the clades. The evaluation was conducted as explained in the main text; in short, we placed the Silva sequences of the clades on their respective tree, and measured how far each of them is away from the branch of the consensus sequence it is represented by. The placement accuracy on these sub-clades is slightly worse compared to the broad Bacteria tree, which can be seen by comparison to Figure S1. On average, 73.4% of the sequences were placed exactly on their expected branch, dominated by Proteobacteria and Firmicutes, which combined make up 75% of the sequences in the five clades, and have an accuracy of 71%. The Actinobacteria have the highest accuracy, with 82% of their sequences placed on the expected branch. The two smallest clades, Actinobacteria and Cyanobacteria, exhibit the shortest distances in branch length units. In fact, the longest distance of any sequence from its expected branch in the Cyanobacteria clade is around 0.4, which is indicated by the end of the red line in the lower two plots. On the other hand, the Firmicutes generally have the lowest accuracy here. In Figure S4, which shows the unconstrained Bacteria tree, the Firmicutes clade exhibits many paraphyletic branches, which is a known issue (Parks et al., 2018). This indicates that there is a high incongruence between the Firmicutes taxonomy and phylogeny in Silva, which might explain why the Firmicutes score worst here. These results are likely due to the inability of 16S SSU sequences to properly resolve lower taxonomic levels (Mignard and Flandrois, 2006;Petti, 2007;Janda and Abbott, 2007). For example, Table 2 of (Janda and Abbott, 2007) lists 10 bacterial genera that are known to be hard to identify using 16S sequences. These genera account for 7.9% of the 2846 taxa that are represented by the five bacterial trees tested here. Furthermore, 95 553 of the 450 313 sequences that were placed on those trees (21.2%) belong to one of these genera. This might explain the worse scores of these clade trees. Lastly, the consensus sequences at the tips of the trees represent the Genus level. Thus, these have short branches, which increases the probability of misplacements.  Figure S7: CAMI: Profiling Results. The figure shows the profiling results of different tools evaluated with data from, and following the protocol of, the 2nd CAMI challenge (Sczyrba et al., 2017;Bremges and McHardy, 2018). Here, we compare the taxonomic assignment and profiling conducted with our assign command to the tools that took part in the 2nd CAMI challenge. For this, we used the unconstrained and constrained Bacterial tree, which are abbreviated here as "PhAT (U)" and "PhAT (C)", respectively. See Supplementary Section 4 for details on our evaluation, and see Sczyrba et al. (2017) for details on CAMI and their metrics. Subfigure (a) shows the relative performance of the tools for different taxonomic ranks by displaying the error metrics used by CAMI: Weighted Unifrac error, L1 norm error, recall (completeness), precision (purity) and false positives. The error metrics were divided by their respective maximal value to allow viewing them on the same scale and to allow relative performance comparisons. Subfigure (b) shows the absolute recall (completeness) and precision (purity) for each tool across the taxonomic ranks. In both subfigures, the red text for our PhAT evaluations indicates that no predictions at the corresponding taxonomic rank were returned. This is because our Silva-based tree does not have species resolution and does hence not allow for taxonomic profiling at this level.