Phylogenetic factorization of compositional data yields lineage-level associations in microbiome datasets

Marker gene sequencing of microbial communities has generated big datasets of microbial relative abundances varying across environmental conditions, sample sites and treatments. These data often come with putative phylogenies, providing unique opportunities to investigate how shared evolutionary history affects microbial abundance patterns. Here, we present a method to identify the phylogenetic factors driving patterns in microbial community composition. We use the method, “phylofactorization,” to re-analyze datasets from the human body and soil microbial communities, demonstrating how phylofactorization is a dimensionality-reducing tool, an ordination-visualization tool, and an inferential tool for identifying edges in the phylogeny along which putative functional ecological traits may have arisen.


Methods
Zeros OTU tables are lled with zeros, especially OTU tables from disparate sites.
In our tables,~90% of the entries were zero, and the multiplicative method [3] with a high δ k would cause taxa present with low sequence counts to have abundances lower than δ k , and a low δ k would cause zeros to have an outsized impact on our dataset. For simplicity, we used Aitchison's additive method, setting the remaining zeros to δ k = 0.65 of a sequence count. Further discussion of how the various methods for dealing with zeros aect phylofactorization is left for future research.
Isometric Log-Ratio Transform To introduce terminology and notation, we'll summarize the isometric log-ratio (ilr) transform, though see [4] for a more thorough treatment. At its heart, the ilr transform [2] is a change of basis. A traditional compositional vector of D parts, such as the relative abundance of D species in a site, lies in the D − 1 dimensional simplex, y ∈ ∆ D , where y i represents the coordinates of the corresponding elementary basis, The ilr transform projects the composition onto a D − 1 element basis that can be seen as representing sequential binary partitions of the composition [1] (see gure S1). The coordinates of an ILR transform, referred to as balances, represent the balance of relative abundances on each side of a sequential binary partition. For an arbitrary split separating the compositions in a group, R with r elements from the compositions in a group, S with s elements, from the rest, the coordinate, which we'll refer to as x * R/S , are x * R/S = rs r + s log g(y R ) g(y S ) (2) where g(y R ) is the geometric mean of all y i for i ∈ R. This transform can be inverted, allowing one to analyze the coordinates, x, and obtain estimates of relative abundances, y. Changing variables from y to x can be represented by Aitchison projection, < y, b R/S > a , on a unit-norm balancing element b R/S = (0, ..., 0, b r , ..., b r , b s , ..., b s , 0, ..., 0) where b r = s /r(r+s) and b s = r /s(r+s). In other words,

Interpretation and Intuition
To build intuition about the isometric log-ratio transform, it's helpful to see the balances as a ratio of relative abundance on each side of the partition -if the taxa in R are more abundant, this ratio will be positive. If the absolute abundance of all taxa in R increase geometrically by a factor, α R , while the abundances of taxa in S remain constant, the coordinate x * R/S will change by the addition of (rs)/(r + s) log(α R ). The ILR serves as a measure of contrast between the two taxa -when the taxa on each side of a partition have the same relative abundances, the ILR coordinate corresponding to the partition will be 0.
The geometric mean is a natural way to group taxa into an aggregate relative abundance. Unlike the total relative abundance, p R = i∈R y i , the geometric mean produces compatible analysis of the original parts and their amalgamations under Aitchison distances. With the geometric mean as a measure of central tendency and amalgamation, the center of an amalgamation is the amalgamation of the centers and distances between centers of amalgamations are monotonically related to the distances between amalgamations of centers.
To build more intuition, we note that the ILR is a specic measure of contrast between taxa, that the geometric means and log-ratios are particular to the compositional nature of the data, and that the ILR can be extended easily to random variables that are not compositional: for Gaussian random variables, the natural way to measure contrast would be to look at the dierence between the average abundances of taxa on two sides of a partition. A similar phylofactorization can be conducted for any random variables once researchers have an agreeable measure of central tendency and a compatible measure of contrast between two measures of central tendency (all of which are easily identiable by looking for the most useful/informative metric space for a given random variable).
More intuition on the ilr-transform can be obtained from re-writing the balance in equation (2) in several dierent ways: The rst expression of the balance, equation (5), writes the balance as a logratio of geometric means between two groups. The second, equation (6), reveals that the balance is proportional to the dierence between the means the logtransformed abundances of the taxa in the two groups. The third, equation (7), expresses the ratio of geometric means in equation (5) as the geometric mean of all possible ratios between taxa in the dierent groups. Together, (5-7) show that the balances from the ilr transforms indicate how dierent, on average, the taxa on two sides of a split are. When an ilr coordinate has a strong correlation with an independent variable, it means that the taxa split by that coordinate are dierent and that dierence grows, shrinks, or changes sign with the indendent variable.
The nal expression for the ilr coordinate, equation (8), reveals the coordinate to be proportional to an additive amalgamation of the centered-log ratio transformed coordinates, z i = log ( yi /g(y)), of the sub-compositional data from one of the two groups being split. Intuitively, if we had absolute abundance data, one can assess the signicance of a clade by amalgamating (adding) the abundances of members in that clade and then performing regression on the amalgamated abundance. As seen in equation (8), the ilr transform follows this intuition, except the abundances to be amalgamated are the clr-abundances within the sub-composition of all taxa split at a given partition. Combining all of these perspectives, the ilr coordinate can be intuited as a measure of contrast between two groups written as either a dierence between their mean abundance or an amalgamated clr-abundance of one of the two groups. A signicant association of an ilr coordinate can indicate that the ilr coordinate identies groups with, on average, some meaningful distinction. This is precisely what is desired when searching for phylognetic factors in microbiome data: the edges of the phylogeny which meaningfully distinguish taxa with dierent functional traits.
The ilr transform for a phylogeny rooted at the most-recent common ancestor Figure 1: The ilr transform on a rooted phylogeny is well-suited for analyzing eects isolated within clades but poorly suited for more general eects causing geometric increases in taxa. For example, consider the partition illustrated in gure S1. The third balance, x * 3 , corresponds exactly to the ratio of the relative abundance species 1 to the ratio of relative abundance of species2. Suppose the species are sampled in environments E 1 and E 2 and have respective relative abundances y 11 and y 21 in site 1 and abundances y 12 and y 22 in site 2. If the absolute abundance of species 1 went up by a factor of α from E 1 to E 2 , and the abundance of all other species remained unchanged, the balances of the taxa in two environments, then, would be x * 3 (E 2 ) = x * 3 (E 1 ) + log(α). However, all balances along the path from the node splitting species 1 and 2 to the root of the tree will also see an increase. The bases are orthogonal, but their coordinates are correlated and inference based on these coordinates would not elegantly identify species 1 as the species which changed by a factor of α.
The ilr transform of the rooted phylogeny will likely still nd its use, however, in other questions where comparisons of sister taxa are more appropriate.
For instance, if a researcher is interested in nding clades with zero-sum competition over a limiting resource contained within the clade, the ilr transform may be well-suited to identify these clades and isolate their internal, zero-sum competition from the rest. In another sense, the ILR transform of the rooted phylogeny compares sister taxa while controlling for their shared evolutionary history. However, for more general purposes of correctly estimating eects α i for a set of aected taxa, i ∈ A, we use phylofactorization, an approach to constructing an ilr basis which is consistent to the general group-structure of the phylogeny while not rooted at the common ancestor.

PhyloFactor
Phylofactorization of Compositional Data (a) The Challenge: Traits driving community structure or responses to meta-data can be shared within clades, either because they are vertically transmitted or because they can only be horizontally transmitted within a monophyletic group. Here, we present a method to develop an appropriate phylogenetic partition -identifying clades {1,2} and {4,5} in the gure -which yields easily-interpretable inferences on the tree of life and can be a tool for dimensionality reduction by collapsing species into their appropriate clades. Amplicon sequence-count data are inherently compositional -they contain only information about relative and not absolute abundances -and so developing a method for phylogenetic inference of compositional data will allow researchers studying microbiomes to ll in the microbial tree of life with clades with known associations with treatments or other meta-data.
(b) Rooted ILR: The isometric log-ratio transform constructs an orthonormal basis for the sample space of compositional data. The coordinates x * In order to identify aected clades, i ∈ A, that change by some factor α i (t) with some independent variable, t, a more robust approach is to unroot the phylogeny and form a sequential binary partition via a greedy algorithm. At its most basic, the input for this phylofactorization is a dataset, a phylogeny, a formula for regression, and an objective function. Given an OTUTable, Data, a phylogeny, tree, and an independent variable, X, our function, PhyloFactor(Data,tree,X) will perform phylofactorization with a default objective function of minimizing variance in the clr-transformed residuals.
Because the total variance in a compositional dataset is constant [4], and is equivalent to the sum of variance the ILR balances for any given sequential binary partition, we reduce the computational cost by maximizing the nonnormalized explained variance (the dierence between the null deviance and the deviance in the regression on ILR coordinates).
Before going further, we dene a few terms for clarity and convenience: Terms group : a set of taxa.
partition : a split between elements of a group, signied with |, as in factor : a particular partition chosen in phylofactorization. bins : the minimal set of groups not split by a given partition. In the set {g 1 |g 2 , g 3 |g 4 , g 5 , g 6 }, the bins are {g 1 }, {g 2 , g 3 }, and {g 4 , g 5 , g 6 }.
group complement: the complement of a group, g i , within its bin. In the above example, the group complement of g 4 is {g 5 , g 6 }. One could also label {g 5 , g 6 } the group and g 4 its complement.

Algorithm
Phylofactorization iterates through 6 steps, visualized in gure 3 of the main manuscript.
2. Project the data onto the set of putative ilr basis elements, b g , corresponding to the balance each group, g over its compliment.
3. Perform regression on the coordinates from (2) 4. Determine g max , the group which maximizes the objective function. By the symmetry of the log-ratio transform, whether regression is done with based on g max or its complement will not aect the choice of g max for objective functions that unaected by the sign of the ilr coordinate.
5. Add balancing element b gmax to the basis. In eect, this is re-rooting a sub-tree formed by previous partitions along the edge separating g max from its compliment.

Error Types
We anticipate there being four main types of errors. We can't examine all of them in this already-lengthy manuscript, but we list some fo them here to invite

Relation to Factor Analysis
There is an art to dening new terms in science, and we are not artists but have nonetheless deliberately included the word factor in the new term, phylofactor, in an eort to connect our method to factor analysis and matrix factorization. To avoid confusion, we want to clarify what we mean by a factor in phylo-factorization, and be explicit about how our method relates to factor analysis.
For a real-valued vector of D dependent variables in sample j ∈ {1, ..., n}, z j ∈ R D , with a mean across samples µ ∈ R D , factor analysis aims to nd a set of K ≤ D unobserved latent variables or factors, F k ∈ R n , and a set of K loadings L k ∈ R D to minimize the residual varaince, 2 . Dening the standardized observation matrix [Z] .,j = z j − µ factor analysis aims to nd F and L Z = LF + which minimize the o-diagonal elements of the noise covariance matrix, T .
Factor analysis is a type of matrix factorization because when K = D we can completely factor our data matrix into the product of two matrices, L and F.
Phylo-factorization, in its most general form, is the process of sequentially choosing ILR basis elements that correspond to structures edges in the phylogeny and maximize some objective function. In the context of equation (S9) above, the centered log-ratio transformed data are Z and the ILR basis elements from equation (S3) are the loadings L k . For a direct link between phylofactor and factor analysis, and for phylofactorization to be a matrix factorization, the factors are the projections of CLR data onto the loadings / balancing element whose balances maximized the objective function (i.e. the factors are the observed balances, x * ). In this sense, phylofactorization is a change of basis into coordinates which correspond to interesting edges in the phylogeny, where interesting is dened by the objective function.
Throughout the paper, we refer to edges maximizing the objective function as factors, e.g. saying that a factor splits a bin in two or a factor corresponding to an edge. Our use of factor here, which may seem to be more accurately described as the loading that strictly denes the bin-separation and correspond to the edge, is intended to be consistent with the other feature of factors in factor analysis -that they are latent variables. The reason for using the evolutionary tree as a scaolding to constrain the set of possible loadings is to allow inference about latent or unobserved biological features -a trait, perhaps -which could account for dierential abundance patterns across an edge in the tree. Thus, there would be a factor -a latent variable -corresponding to the edge, separating the bin of amniotes into bins of birds and other amniotes, which can later be observed as feathers, wings. and other traits unique to birds.
Thus, phylofactorization chooses loadings and can be interpreted as a constrained factor analysis whose loadings correspond to balancing elements constructed from edges in the phylogeny and whose factors are the balances corresponding to those balancing elements. When a researcher says we identied a phylogenetic factor predicting inammatory bowel disease in a patient, such a statement should be interpreted as: We have found an edge in the phylogeny that can be used to predict the presence of inammatory bowel disease by constructing an ILR balancing element partitioning the taxa on each side of that edge, and one can immediately hypothesize that a latent, vertically transmitted trait(s) explain these predictive, dierential abundances of taxa on each side of the edge.

Computational Benchmarking
The computational costs of phylofactorization for large datasets can be large, The benchmarks are shown below in gure S2. The current algorithm for phylofactorization in the R package phylofactor performs 1-clade factorization for D= 3, 000 and p= 1, 000 in~1,000 seconds on a home desktop without employing parallelization. The phylofactorization of a dataset is highly parallelizable because most of the work is in amalgamating the data for many groups and computing regressions of many ilr-coordinates against independent variables. The function PhyloFactor() accepts an argument ncores which allows user-friendly parallelization of phylofactorization. The speedup from parallelization can be signicant, especially for big datasets as the percent of the code that is parallelizable approaches 1 for large numbers of OTUs, D. The simulation time for the current algorithm appears to increase hyper-exponentially in D and p, however, with increasing D and p the speedup from parallelization increases and the precent of the code which is parallelizable gets close to 1 for large n and p. The current function, PhyloFactor, accepts the argument ncores for ecient parallelization of phylofactorization on servers, especially for large datasets. Unless noted otherwise, the default parameter values used for plotting are D=316, p=316, ncores=1, nfactors=1, and choice='var'.      Successive phylogenetic factors can have competing eects -for example, while amniotes are more likely to live on land and fresh-water due to a basal evolutionary event (leading to eggs preventing desiccation, an adaptation to life on land), more distal lineages (e.g. Cetaceans, Pinnipeds, marine turtles and sea snakes) subsequently evolved to return to the oceans. Factor-based analysis tells us the location of putative traits and evolutionary transitions, and bin-based analysisgrouping the un-split taxa and performing regression on these BPUs -allows the prediction abundances and functional ecology conditioned on the set of traits deemed to be important from phylofactorization.