Analysis of relative abundances with zeros on environmental gradients: a multinomial regression model

Ecologists often analyze relative abundances, which are an example of compositional data. However, they have made surprisingly little use of recent advances in the field of compositional data analysis. Compositions form a vector space in which addition and scalar multiplication are replaced by operations known as perturbation and powering. This algebraic structure makes it easy to understand how relative abundances change along environmental gradients. We illustrate this with an analysis of changes in hard-substrate marine communities along a depth gradient. We fit a quadratic multivariate regression model with multinomial observations to point count data obtained from video transects. As well as being an appropriate observation model in this case, the multinomial deals with the problem of zeros, which often makes compositional data analysis difficult. We show how the algebra of compositions can be used to understand patterns in dissimilarity. We use the calculus of simplex-valued functions to estimate rates of change, and to summarize the structure of the community over a vertical slice. We discuss the benefits of the compositional approach in the interpretation and visualization of relative abundance data.

Euclidean vectors do not satisfy the vector space axioms when applied to relative abundances. 23 For example, let a = (1/3, 1/3, 1/3) T be a relative abundance vector (throughout, we work with 24 column vectors, so T denotes transpose). Then neither a + a nor 2a is a relative abundance 25 vector, so the axiom of closure is not satisfied. 26 There are in fact operations corresponding to addition and scalar multiplication that make two compositions a and b, we can transform a into b by the closure of the unequal scaling b = C b 1 a 1 a 1 , b 2 a 2 a 2 , . . . , b s a s a s = b ⊕ ((−1) a) ⊕ a. 40 We can thus define the compositional difference b a as between adjacent pairs of communities must be constant. From the definition of compositional 57 difference (Equation S1), ρ j+1 ρ j = a, where a = (a 1 , a 2 , . . . , a s ) is a constant perturbation. 58 Then we can write ρ j+1 = a ⊕ ρ j , and ρ j+2 = a ⊕ ρ j+1 , and we require that d(ρ j , ρ j+1 ) = d(a ⊕ 59 ρ j , a ⊕ ρ j+1 ). In general, any meaningful dissimilarity measure d for compositions must satisfy 60 the perturbation invariance property d(ρ 1 , ρ 2 ) = d(a ⊕ ρ 1 , a ⊕ ρ 2 ) for all compositions ρ 1 , ρ 2 , a. 61 Most of the popular measures of community dissimilarity are not perturbation invariant, and are 62 therefore misleading. For example, let ρ 1 = ( 1 6 , 1 3 , 1 2 ) T , ρ 2 = ( 1 2 , 1 3 , 1 6 ) T , a = ( 1 3 , 1 6 , 1 2 ) T . Then using   Table S1: Coefficients β 0 (intercept), β 1 (depth), β 2 (squared depth) from the regression model described in the main text. Values are posterior means and 95% highest posterior density credible intervals in the 8 isometric logratio coordinates corresponding to 9 taxa, with the default basis from the R package compositions. Note that these coordinates have no simple interpretation in terms of the 9 taxa. β 0 β 1 β 2 -1.11 (-1.30, -0.94) -0.01 (-0.14, 0.11) -0.62 ( Figure S2: Posterior distributions (black lines) from 100 simulated data sets of estimated coefficients β 0 (intercept, first column), β 1 (linear term in orthogonal polynomial for depth, second column) and β 2 (quadratic term in orthogonal polynomial for depth, third column). Vertical green lines: true parameter values. Rows are the 8 ilr components corresponding to 9 taxa, with the default basis from the R package compositions. Note that these coordinates have no simple interpretation in terms of the 9 taxa. The long-tailed distributions and poorer performance of credible intervals for parameters associated with components 7 and 8 is probably a consequence of the number of observations 103 for the taxa whose distributions they reflect. Component 7 is proportional to the log of the 104 ratio of Aurelia aurita to the geometric mean of all taxa other than Aurelia aurita and colonial 105 ascidians, and component 8 to the log of the ratio of colonial ascidians to all other taxa. Since 106 Aurelia aurita and colonial ascidians were the least abundant taxa, with non-zero point counts 107 in only 7 and 15 out of 125 stills respectively, it is not surprising that estimation of parameters 108 describing their distributions is more difficult than for other taxa. 110 We compared the performance of the model with quadratic depth effect described in the main 111 text against that of models with linear and cubic depth effects. We fitted the linear and cubic 112 models exactly as described for the quadratic model in the main text, with extra coefficient 113 vector β 3 describing the cubed depth effect in the cubic model, and with higher-order coefficient 114 vectors set to 0 where necessary (e.g. in the quadratic model, β 3 = 0). 115 We evaluated the ability of each model to predict new stills, rather than new points from existing stills. Since each still has its own intercept, it corresponds to a cluster in the language of hierarchical models. We therefore used a leave-one-out cross-validation procedure (e.g. Garthwaite et al., 2002, section 9.4) in which each still was omitted in turn, with loss function the Bayesian leave-one-cluster-out estimate of out-of-sample prediction error (expected log predictive density) elpd loco :

S5 Comparison with linear and cubic depth effects
where m is the number of stills and f (y i |y −i ) is the posterior density of the ith still y i , given 116 the data set y −i in which the ith still is excluded (Vehtari et al., 2017). The required posterior 117 density is given by where θ = {β 0 , β 1 , β 2 , β 3 , Σ} is set of regression coefficients and the covariance matrix Σ, 119 f (y i |θ) is the density of the ith still given parameters θ, and f (θ|y −i ) is the posterior density of 120 θ given the dataset y −i . Because we want to predict a new still, we need to integrate over the 121 distribution of still-specific intercepts ε i . Thus  However, diagnostics suggested that the Pareto smoothing was not performing well enough to 134 be reliable. 135 We computed standard errors of differences in elpd loco among models using the compare() suggesting that a fairly conservative interpretation of differences among models is sensible.  Figure S4: Estimated relationships between relative abundance and depth for bare wall and eight taxa, fitted using a quadratic model as described in the main text (black lines with grey 95% credible bands), a linear model (green lines) and a cubic model (orange lines). Circles are sample estimates of relative abundance from point counts. 144 Constructing a meaningful ilr basis is not too difficult for a fairly small number of taxa whose 145 biology is well understood. In our case, the obvious choice for a first coordinate is to contrast 146 bare wall with macroscopic organisms, since bare wall is likely to behave as a passive component 147 in this system. A natural second coordinate is to contrast algae with animals. Algae are the 148 only photosynthetic organisms in our study, and are thus expected to depend very strongly 149 on light availability, which is likely to decrease rapidly with increasing depth. For the third 150 coordinate, we contrasted predators (Aurelia aurita and Diadumene cincta) with filter-feeding 151 animals. The filter-feeders are likely to rely heavily on phytoplankton, whose abundance is 152 expected to decrease with depth. In contrast, the predators use tentacles to catch relatively large 153 and highly motile zooplankton. For the fourth coordinate, we contrasted the two predators. For 154 the fifth coordinate, we contrasted Mytilus edulis, which is the only mollusc and has a very high 155 filtration rate, with the other filter feeders. For the sixth coordinate, we contrasted sponges with 156 bryozoans and ascidians, and for the seventh coordinate, we contrasted bryozoans with ascidians.

157
These two coordinates were chosen taxonomically, because taxonomic differences are likely to 158 be associated with differences in responses to depth. Finally, we chose the eighth coordinate 159 as the contrast between solitary and colonial ascidians, which have very different morphologies 160 and are thus likely to respond differently to depth.

161
For each coordinate, we code the taxa on one side of the contrast as +1, those on the other side 162 as −1, and those not involved in the contrast as 0. Rescaling the positive and negative elements 163 to give a vector with zero sum, and then rescaling the entire vector to unit length, gives one 164 column of the required basis matrix (Table S5). The columns are orthogonal by construction, 165 because each contrast after the first is among taxa on only one side of the preceding contrasts.   167 We compared the results from our approach with those obtained by fitting overdispersed Poisson

181
We also fitted a multivariate regression model by penalized likelihood, using the R package 182 glmnet (Friedman et al., 2010). We chose the value of the regularization parameter λ by cross-183 validation, and used the type.multinomial = "grouped" option in order to ensure that all 184 elements of a coefficient vector were either included or excluded together.

185
All approaches had similar behaviour for taxa with high relative abundance ( Figure S5a-e).   Figure S5: Estimated relationships between relative abundance and depth for bare wall and eight taxa, fitted using Stan as described in this paper (black lines with grey 95% credible bands), overdispersed Poisson regression in HMSC (Ovaskainen et al., 2017) transformed into compositions (cyan), multivariate linear models fitted to ilr-transformed count data with Laplace (orange), Jeffreys (purple) and Perks (pink) pseudocounts, and a multinomial regularized generalized linear model fitted using glmnet (green). Circles are sample estimates of relative abundance from point counts. Note that y-axis scales differ between panels in order to show detail for taxa with low relative abundances.  Table S5) and depth. Grey bands are 95% HPD credible bands, and black lines are posterior means. Note the difference in y-axis scales among panels.  216 We fitted essentially the same model as that used in the main text to the mite data set in the 217 R package vegan, described in Borcard et al. (1992). The data consist of counts of 35 taxa of 218 mites in 70 Sphagnum moss cores. There are two quantitative non-spatial explanatory variables, 219 substrate density (g l −1 ) and substrate water content (g l −1 ). Additional categorical explanatory 220 variables and spatial coordinates are also available, but for simplicity, we did not use these 221 in the model. For the same reason, we included only linear effects of substrate density and 222 water content. We fitted the model using the same approach as for the marine community data 223 described in the main text (although for speed, we ran only 1000 warmup and 1000 sampling 224 iterations of each chain). We conditioned on the observed total counts, in order to study patterns 225 in relative abundance, and used the same multinomial observation model as for the marine 226 community data. 227 We would expect this to be a challenging data set to model, because it is high-dimensional 228 and contains many rare taxa. For serious applications, it might be appropriate to use a more 229 constrained covariance structure, and to use a hierarchical model for the regression coefficients.

230
Nevertheless, the fitted model appeared plausible. For those taxa that were not rare, relationships 231 with substrate density generally appeared fairly weak ( Figure S9), compared to those for water 232 content ( Figure S10). In these figures, predictions were made with the explanatory variable that 233 is not on the x-axis set to its mean value. In order to give a better impression of the fit of the 234 model to data, Figure S11 shows posterior mean predicted relative abundances against observed Figure S10: Relationships between relative abundance and substrate water content (g l −1 ) for 35 mite taxa in 70 Sphagnum moss cores. Data from the mite data set in R package vegan, originally described in Borcard et al. (1992). Dots: observed relative abundances. Black lines: posterior mean predictions, with substrate density set to its mean value. Grey bands: 95% HPD credible band. Figure S11: Posterior mean against observed relative abundances for 35 mite taxa in 70 Sphagnum moss cores. Data from the mite data set in R package vegan, originally described in Borcard et al. (1992). Each dot represents one core. Lines have slope 1, y-intercept 0.
As above (Section S8), we plotted Bray-Curtis dissimilarities between adjacent predicted 242 relative abundances on a grid of 100 equally-spaced values of each of substrate water content 243 ( Figure S12a) and substrate density ( Figure S12b). These plots give rough estimates of the 244 rate of change in relative abundance, as measured by Bray-Curtis dissimilarity. Under a model 245 with linear effects of each explanatory variable, the true rate of change of composition with 246 respect to each explanatory variable, as measured using a perturbation-invariant distance, is a 247 constant. However, the Bray-Curtis dissimilarity, which is not perturbation-invariant, changed 248 in complicated ways with both water content ( Figure S12a) and substrate density ( Figure S12b).

249
In particular, the apparent local maximum in rate of change at a water content of about 500 g l −1 ,  Figure S12: Bray-Curtis dissimilarities between adjacent predicted relative abundances on a grid of 100 equally-spaced values of (a) substrate water content (g l −1 ) and (b) substrate density (g l −1 ). In each case, the explanatory variable not on the x-axis was set to its mean value. White lines: posterior means. Grey envelopes: 95% HPD credible bands.