Joint Species Distribution Modeling : Dimension Reduction Using Dirichlet Processes

Species distribution models are used to evaluate the variables that affect the distribution and abundance of species and to predict biodiversity. Historically, such models have been fitted to each species independently. While independent models can provide useful information regarding distribution and abundance, they ignore the fact that, after accounting for environmental covariates, residual interspecies dependence persists. With stacking of individual models, misleading behaviors, may arise. In particular, individual models often imply too many species per location. Recently developed joint species distribution models have application to presence–absence, continuous or discrete abundance, abundance with large numbers of zeros, and discrete, ordinal, and compositional data. Here, we deal with the challenge of joint modeling for a large number of species. To appreciate the challenge in the simplest way, with just presence/absence (binary) response and say, S species, we have an S-way contingency table with 2 cell probabilities. Even if S is as small as 100 this is an enormous table, infeasible to work with without some structure to reduce dimension. We develop a computationally feasible approach to accommodate a large number of species (say order 10) that allows us to: 1) assess the dependence structure across species; 2) identify clusters of species that have similar dependence patterns; and 3) jointly predict species distributions. To do so, we build hierarchical models capturing dependence between species at the first or “data” stage rather than at a second or “mean” stage. We employ the Dirichlet process for clustering in a novel way to reduce dimension in the joint covariance structure. This last step makes computation tractable. We use Forest Inventory Analysis (FIA) data in the eastern region of the United States to demonstrate our method. It consists of presence–absence measurements for 112 tree species, observed east of the Mississippi. As a proof of concept for our dimension reduction approach, we also include simulations using continuous and binary data. ∗Postdoctoral Associate, SAMSI/Duke University, Research Triangle Park, NC 27709, taylor-rodriguez@samsi.info †Postdoctoral Researcher, SAMSI/North Carolina State University, Research Triangle Park, NC 27709, kkaufeld@samsi.info ‡Assistant Professor, Department of Statistics, University of Missouri, Columbia, MO 65211, schliepe@missouri.edu §Professor, Nicholas School of the Environment, Department of Statistical Science, Duke University, Durham, NC 27708, jimclark@duke.edu ¶Professor, Department of Statistical Science, Duke University, Durham, NC 27708, alan@stat.duke.edu c © 2017 International Society for Bayesian Analysis DOI: 10.1214/16-BA1031 940 Joint Species Distribution Modeling with Dirichlet Processes


Introduction
Understanding the processes that determine the distribution and abundance of species is a goal of ecological research.Species distribution models are used to evaluate the variables that affect the distribution and abundance of species and to predict biodiversity, including responses to climate change (Midgley et al., 2002;Guisan and Thuiller, 2005;Gelfand et al., 2006;Iverson et al., 2008;Botkin et al., 2007;Fischlin et al., 2007;McMahon et al., 2011;Thuiller et al., 2011).These models are used to infer a species range either in geographic space or in climate space (Midgley et al., 2002), to identify and manage conservation areas (Austin and Meyers, 1996), and to provide evidence of competition among species (Leathwick, 2002).The core objective is interpolation, to predict species response at plots that have not been sampled.Species distribution models (SDMs) are most commonly fitted to presence/absence data (binary) or abundance data (counts, ordinal classifications, or proportions).Occasionally, continuous responses are used such as biomass (Dormann et al., 2012).Prediction of species over space can be accommodated using a spatially explicit specification (Gelfand et al., 2005(Gelfand et al., , 2006;;Latimer et al., 2006).
Within a Bayesian framework, SDMs can be fitted using hierarchical models.Hierarchical models provide a flexible way to include information regarding the distribution of a species as well as the uncertainty (Gelfand et al., 2006).The stages in a hierarchical model can describe latent processes that are ecologically important.These models enable separation of the measurement and biological process models (MacKenzie and Royle, 2005;Latimer et al., 2006;Gelfand et al., 2005).
Customarily, SDMs are fitted independently across a collection of species (Thuiller, 2003;Latimer et al., 2006;Elith and Leathwick, 2009;Chakraborty et al., 2011).To make predictions at the community scale, independent models for individual species are aggregated or stacked (Calabrese et al., 2014).However, collectively, the independent models tend to predict too many species per location (Guisan and Rahbek, 2011), as well as other misleading results (see Clark et al., 2014, for some examples).At least one source of the problem is the omission of the residual dependence between species.Modeling species individually does not allow underlying joint relationships to be exploited (Clark et al., 2011;Ovaskainen and Soininen, 2011).
Joint species distribution models (JSDMs) that incorporate species dependence now include applications to presence-absence (Pollock et al., 2014;Ovaskainen et al., 2010;Ovaskainen and Soininen, 2011), continuous or discrete abundance (Latimer et al., 2009;Thorson et al., 2015), abundance with large number of zeros (Clark et al., 2014) and, recently, discrete, ordinal, andcompositional data (Clark et al., 2016).JSDMs jointly characterize the presence and/or abundance of multiple species at a set of locations, partitioning the drivers into two components, one associated with environmental suitability, the other accounting for species dependence through the residuals, i.e., adjusted for the environment.
Joint distributions are meaningful at all scales, but interpretation should properly reflect scale.Clark et al. (2014) discuss why the covariance matrix does not quantify 'species interactions'.For example, competition is a mutually negative interaction.However, the strongest competitors will typically have a positive correlation -tendency to co-occur is the cause, not the consequence, of competition.Likewise, predation, disease, and parasitism are not quantified by the covariance matrix.Both are asymmetric, whereas, a covariance matrix cannot be.
Although the covariance matrix cannot be interpreted as the species interactions themselves, it does depend on them.The joint distribution is critical in models of forest composition, because the canopy is 'full'.So, regardless of scale, a species cannot increase unless others decline.Accounting for co-dependence is critical for accurate prediction.Although the nature of the covariance matrix changes with aggregation (Simpson's Paradox), the need to accommodate it applies across scales (Clark et al., 2014).
These JSDMs are also specified through hierarchical models, introducing latent multivariate normal structure, capturing dependence through the associated covariance matrices.We envision vectors of data Y i collected at plot i with the jth entry in the vector being the response of the jth species at plot i.We assume the Y i are independent across plots, i.e., that the plots are sufficiently dispersed that a change in composition at one location does not perceptibly influence composition at another.We focus on dependence between species, because, after accounting for the environmental effects at each plot through the mean effect, residual interspecies dependence persists.We introduce this dependence at the first stage, viewing Y i as a componentwise function where V i is a latent multivariate normal.For example, with biomass, This is a customary Tobit specification (Cameron and Trivedi, 2005).With binary response (presence/absence), Y ij = 1 if and only if V ij > 0, i.e., g(V ij ) = 1 for V ij > 0. Note that we are not modeling dependence between the (random) probabilities of presence, e.g., between P (Y ij = 1) and P (Y ij = 1) (Ovaskainen et al., 2015).This would move the dependence to the second modeling stage, i.e., in the mean specification, introducing a probit link.We believe that, with regard to the process, modeling the dependence at the data level is more informative.Similar remarks follow for any of the responses above; we can capture them through suitable latent multivariate normals (see Clark et al., 2016).
JSDMs enhance understanding of the distribution of species, but their applicability has been limited due to computational challenges when there is a large number of species.To appreciate the challenge with even the simplest presence/absence (binary) response and S species, we have an S-way contingency table with 2 S cell probabilities at any given site.With observational data collection over space and time, as in large ecological databases, the number of species is on the order of hundreds to thousands, rendering contingency table analysis infeasible.There is need for strategies to fit joint models in a computationally tractable manner.
To deal with these large datasets it is necessary to consider data dimension reduction techniques.Common approaches include principal component analysis (Artemiou andLi, 2009, 2013), partial least squares (Naik and Tsai, 2000), and sliced inverse regression (Li, 1991).Machine learning approaches include clustering techniques such as the Chinese Restaurant process (Blei et al., 2010), Indian buffet process (Ghahramani and Griffiths, 2005), and latent Dirichlet allocation (Blei et al., 2003).
Here, we propose a Bayesian nonparametric approach, working with Dirichlet processes, to capture dependence among species while reducing the effective dimensionality of the problem.We use a stick-breaking representation based upon MacEachern (1994) which defines a dependent Dirichlet process (see also Dunson and Park (2008); Chung and Dunson (2011)).The attractiveness of stick-breaking constructions is that they are computationally efficient and easy to implement.In our work we employ the Dirichlet process in a novel way to reduce dimension in the dependence structure among species.In joint species modeling, a much different Bayesian nonparametric approach by Arbel et al. (2015) models species dependence across plots through a Gaussian process and a location dependent covariance structure.However, this approach assumes independence among species.
The primary contribution of this article is to develop a computationally feasible approach for a large number of species (say order 10 3 ), that allows us to: (i) assess the residual dependence structure among species, (ii) identify clusters of species that exhibit similar dependence patterns, and (iii) jointly predict species behavior.Again, the proposed hierarchical model captures dependence between species at the first or "data" stage rather than at a second or "mean" stage.
Dependence is introduced in the first-stage level through plot level latent multivariate normal variables whose dimension is the number of species.Dependence is introduced using low dimensional random effects described by a low dimensional covariance matrix, which is then made nonsingular through diagonal dominance.Dimension reduction in the form of clustering of the random effects is done with a Dirichlet process.Here, we employ the Dirichlet process for clustering in a novel way.Rather than the customary clustering of the replications (in our case, this would be plots), we cluster the components of the response (here, species).Instead of replacing Gaussian random effects with Dirichlet process random effects, we use the Dirichlet process to reduce dimension in the joint covariance structure.This last step makes computation tractable and is highly scalable.Altogether, we have a form of latent factor analysis but our implementation differs from that of Bhattacharya and Dunson (2011).
We use the Forest Inventory Analysis (FIA) data in the eastern United States to demonstrate our method.It consists of roughly 100 tree species considered on 1200 hectare sized plots.As a proof of concept for our dimension reduction approach, we also include a simulation using both continuous and binary response data.
Other potential settings for our Dirichlet process dimension reduction approach include: (i) gene expression data, anticipating dependence between some of the genes, where replicates would be expression data for individuals and (ii) stock price data for stocks associated with an index, anticipating dependence between some of the stocks.Replication here would be across time and could be accommodated through independent increments in a dynamic model.This article is structured as follows.We describe the illustrative dataset in Section 2. In Section 3, we introduce our dimension reduction approach for a multivariate collection of species, including details about the clustering step, an adaptation to binary data, and other necessary considerations.Section 5 discusses prediction under our approach.In Section 6 we provide simulation examples.Section 7 presents the Forest Inventory Analysis data example.In Section 8 we conclude with a summary and future work.

Motivating Example: Forest Inventory Analysis Data
A motivating challenge is joint modeling of multiple tree species in the Forest Inventory and Analysis (FIA) Data.It is a collection of annual inventory plots (2003)(2004)(2005)(2006)(2007)(2008)(2009) in 31 states in the eastern US (Woudenberg et al., 2010, FIADB version 4.0.) that includes species, size, health of trees, tree growth, mortality, and removal by harvest.It is a systematic inventory of all US forests (USDA) that is documented online in a national resource report (Smith et al., 2009).In our analysis we use presence/absence of tree species from over 20,000 FIA monitored plots from which there are measurements of roughly 1 million trees spanning more than 200 tree species.
The FIA has a sampling protocol that is applied consistently throughout the country.This protocol consists of a quasi-systematic design that covers all ownerships across the United States, and these sampling efforts yield a national sampling intensity of one plot for every 2,438 hectares (Bechtold and Patterson, 2005).Each FIA plot is made up of four circular subplots with a radius of 7.32 meters (m) each, and the centroids for the subplots are 35.58m apart.The aggregate area of the four subplots is 673.34 m 2 or ≈ 0.067 hectares (ha).
FIA plots are too small to adequately summarize local distribution and abundance.With a plot area equal to 25 m on a side, each supports only a handful of large trees, and most species occur sporadically.Predicting the composition on a plot this small is neither feasible, nor desirable.Instead we focus on the 1-ha scale, an area large enough to accurately represent forest structure; we aggregate plots to model at this scale.Given that FIA plots are located relatively far, nearest neighbors in geographic space need not represent similar environments.As such, geographic aggregation of FIA data can potentially group ridgetops with bottomlands.To avoid this issue, we combine FIA plots to the hectare scale by grouping them according to their covariate values.This translates the problem from a joint species distribution analysis in geographic space to one in covariate space.Prediction at this scale is more relevant to questions in ecology and climate change.The resulting joint distribution describes the tendency of species to respond together to the environment at the (meaningful) ha-scale, beyond what is captured in the mean structure of the model.To avoid ambiguity when discussing this application in what follows, we refer to the plots aggregated in covariate space simply as "plots", and to the original ones as "FIA plots".
To construct hectare-size plots in covariate space we follow the procedure described in Schliep et al. (2016).The authors used a k-means clustering algorithm based on stand age, temperature, and precipitation.Through this method, groups of 16 FIA plots (to make up a 1.07 ha sized plots) are identified, each of which minimizes the distance between the observations in the group and the cluster centroid, providing joint presence/absence of species at the one hectare scale (in covariate space).
Figure 1: Presence/absence status of four randomly selected species at hectare sized plots in covariate space.Each location represents the centroid of the 16 combined FIA plots.
Species that were either too abundant (present in all but 50 plots or fewer) or too sparse (in fewer than 50) were removed such that inference on the regression parameters can be made for the species retained in the model.Data used here consist of presence/absence for S = 112 species.We use standard climate predictors, temperature and hydrothermal deficit, where the latter is the number of cumulative degree hours for months with a negative water balance.The values of these covariates for the plots are the means of the 16 FIA plots.Because the clustering is done in covariate space, the covariate values within the 16 plots are similar.After grouping, we have approximately 1200 hectare sized plots with presence/absence data.Figure 1 displays presence/absence status at the plots for four randomly selected species mapped onto the clustering covariate space.

A Joint Species Distribution Model Using the Dirichlet Process
In this section, we formalize the joint species distribution model.We begin with the model for the latent process V i .We then connect it to presence/absence data.As noted in the introduction, we do not specify Dirichlet process models for the plots, i.e., for the V i across i; we do not replace normal models for errors or random effects with Dirichlet process models.Rather, we use the Dirichlet process to provide a dimension reduction in specifying the dependence structure.
Suppose there are j = 1, • • • , S species at i = 1, • • • , n plots.For plot i, an Sdimensional vector, V i , is modeled as where This model has O(S2 ) parameters, S 2 + S from Σ and, with p predictors in x i , including an intercept, there are pS parameters in B. For example, for S = 100 species and p = 3 predictors, the model contains 5,350 parameters.We propose a dimension reduction approximation to Σ that allows the number of parameters to grow linearly in S.
We approximate Σ with Σ = AA + σ 2 ε I and replace (1) with: where the random vectors w i are i.i.d. with w i ∼ N r (0 r , I r ) and ε i ∼ N S (0 S , σ 2 ε I), and A is an S × r matrix with r < S.
In fact, we think of r << S and A as 'tall and skinny.'Σ has Sr + 1 unknowns clarifying the conversion of an O(S 2 ) problem to an O(S) problem.Evidently, the rank of AA is only r but adding σ 2 ε I yields diagonal dominance and a nonsingular matrix.1We require i.i.d.w i in order to ensure that the V i are conditionally independent given A and σ 2 as they are in (1) given Σ.A further reduction in the number of parameters can be attained by finding common rows in A; a method to achieve this is described in the following section.

Clustering the Rows of A with a Dirichlet Process
Briefly, the Dirichlet process (DP) provides stochastic models for random distributions.It is widely used in the Bayesian nonparametric literature as a prior for distributions rather than using customary parametric distributions adopting priors for the parameters (Escobar and West, 1995;Ishwaran and James, 2001;Papaspiliopoulos and Roberts, 2008).It finds attractive application in hierarchical modeling to provide random distributions for random effects.
A constructive and useful representation of the DP is the stick-breaking formulation of Sethuraman (1994).It is ideally suited for implementation within a Gibbs sampling setting (see Escobar, 1994;Escobar and West, 1995;MacEachern, 1994;Bush and MacEachern, 1996;Neal, 2000) due to a Pólya urn scheme representation which provides easy sampling from full conditional distributions.Under a stick breaking construction, we say the random distribution, G, follows a DP with base measure H and precision parameter α, G ∼ DP(αH), if where , and δ θ l (•) is the Dirac delta function at θ l where θ l ∼ H.Because it is almost surely a discrete distribution, it enables ties; hence, it enables clusters.We make use of this feature to allow some rows of A to be common, which corresponds to clustering species in their dependence behavior.
As shown in Ishwaran and Zarepour (2000), the sum in (3) can be approximated with a truncated version The weights defined in this fashion come from a generalized Dirichlet distribution GD(a, b), which is conjugate with multinomial sampling.
We refer to this finite approximation as DP N (αH), and it is the version that we employ in developing our clustering strategy.The DP N (αH) formulation facilitates sampling with the blocked Gibbs sampler of Ishwaran and James (2001), which avoids marginalizing out the base measure, instead allowing the prior to be involved in the Gibbs sampling scheme.
To specify the hierarchical formulation for this model, let Z = (Z j ) N j=1 (with Z j iid ∼ H) denote the N × r matrix whose rows make up all potential atoms (i.e., vector values that the rows a l of A may take).In this setup, we need a vector of grouping labels , and e k l is the N -dimensional vector with a 1 in position k l and 0's elsewhere.Using this notation, our approximate model is given by where , and δ j (•) represents a point mass at j.
As suggested in Ishwaran and Zarepour (2000), even for a very large number of species (S), a moderate level of truncation should suffice to approximate a DP(αH).In both our simulations and case study we set N = min {150, S}.The specification for the prior of D z follows the noninformative strategy to sample covariance matrices described in Huang and Wand (2013).Further details on implementation of the sampling algorithm for the hierarchical setup in (4) are provided in Supplementary Appendix A (Taylor-Rodríguez et al., 2016).Finally, we note that this approach for generating a reduced rank matrix A, differs considerably from the approach in Bhattacharya and Dunson (2011).There, entries in A arise from inverse gamma scaled normals.

Adaptation for Binary Response (Presence/Absence)
In ecological surveys, by far the most common response is the set of presence/absence indicators for the set of species recorded at each plot.The FIA data described in Section 2 is an example of this.To work with binary responses, we resort to an adaptation of the data-augmentation algorithm proposed by Chib (1998) for multivariate probit regression, which improves the mixing of the MCMC algorithm.This modification is known in the literature as the parameter-expansion data-augmentation (PX-DA) algorithm (Liu and Wu, 1999;Lawrence et al., 2008;Schliep and Hoeting, 2015).It allows us to use the machinery proposed for the continuous case as a latent model.The PX-DA strategy was also considered by (Clark et al., 2016) to model non-continuous responses.In either case, approximation is needed to handle a large collection of species.The augmentation consists of introducing multivariate normal latent random variables, which are used to obtain a full conditional posterior density where the entire covariance matrix can be sampled.The sampled covariance is then re-scaled as a correlation matrix to accommodate the identifiability constraints imposed by the probit link.In our case, the approach is further modified to accommodate the dimension reduction step as described below.
Again, let Σ = AA + σ 2 ε I for some S × r matrix A, and denote by R = D −1/2 Σ D −1/2 , where D is the diagonal matrix containing diag(Σ ).We can use data augmentation with the binary likelihood, assuming that, for plot i, V ∼ N n×S (XB , R, I n ), where X is the n × p matrix of predictors and B is the S × p matrix of regression coefficients.Assume that the matrix of binary responses is given by Y where so that the contribution to the likelihood for species j in plot i is . With y i = (y i1 . . ., y iS ) the binary vector of observed presences indicators at plot i, we have , where B = D 1/2 B. This change of variable doesn't affect the probabilities for Y i (Lawrence et al., 2008).Hence, which in turn implies the expanded likelihood given by As in the continuous response case, we can introduce an n×r matrix of standard normal random variables W, such that where N a×b (M, V col , V row ) represents the a × b-dimensional matrix normal distribution with mean M , and column and row covariance matrix V col and V row , respectively.
Hence, the expanded likelihood can now be expressed as With the expanded likelihood in (6), the sampling algorithm becomes . This proposal density is convenient as it corresponds to drawing from univariate truncated normal random variables.

Draw
ε as in the continuous case (see in Supplementary Appendix A sampling steps for Z, k, p and σ 2 ε ), where the full conditional densities for A and σ 2 ε depend on V , B , and W.
3. Assuming a flat prior on B , the full conditional posterior distribution for B is given by N 4. Finally, obtain the variables on the correlation scale using the transformations

Non-Negative Response (Continuous Abundance)
In many species distribution surveys, the biomass of living organisms is recorded as an indication of species abundance.This type of response is continuous, but takes on 0 values whenever the species is absent from a plot.Functionally, we have a Tobit model defined by where Y ij corresponds to the random variable that measures continuous abundance for species j at plot i, and V ij is an element of the S-dimensional vector V i as specified in (2).
Using the connection between Y ij and V ij provided by ( 7), the likelihood for continuous abundance data is given by This formulation provides for direct Gibbs sampling, with full conditionals for the latent variables given by where β j and a j correspond to the jth row of B and A, respectively.Sampling for B, A and σ 2 ε is performed in the same way as in the continuous case (see Supplementary Appendix A).

Prediction
While explanation through the covariates is usually the objective in species distribution modeling, prediction also plays a key role.Prediction can be envisioned for unobserved plots, for instance, predicting probabilities of presence-absence, or predicting the biomass for the set of species at unobserved plots.With independent plots, this becomes prediction of a latent multivariate vector using the fitted regression.Given that species are correlated within plots, if the response for some of the species is available at a new plot (e.g., presence/absence), one may take advantage of this fact to further inform prediction of the responses at the new plot for the unobserved species, conditionally on values for the observed ones.This is useful if only information on subset of species is available for some new plots and prediction on the remaining species is of interest.Alternatively, this feature can be exploited in formulating hypothetical experiments.If we assume certain species are present at a site, which species are likely to join them; if we assume certain species are absent at a site, which species are likely to replace them.
As mentioned earlier, individual level models when aggregated tend to overestimate richness -the number of distinct species at a plot (Clark et al., 2016;Calabrese et al., 2014).Under joint modeling the data we are fitting are vectors, each of 0's and 1's, such that each vector has an implicit observed richness.As a result, when the joint model is fitted, these observations inform about the number of distinct species at a plot.We will expect to see substitution among species that are equally suited to the available environment at the new plot.However, we will not expect the overall number of species to be overestimated.
We briefly describe how such prediction is implemented for the latent variables and then add the modifications for binary and non-negative data.First we derive the posterior predictive distribution.Recall that our latent fitting model is is the S × N matrix that assigns the rows of Z to the rows of A according to the clustering specified by the vector of groupings k.Also recall that the N × r matrix Z is such that its rows, denoted by Z j , are iid with prior distribution N r (0, D z ), for j = 1, . . ., N. Finally, we have w i iid ∼ N q (0, I r ), ε i iid ∼ N q (0, σ 2 ε I S ) for i = 1, . . ., n, and the labels k l |p iid ∼ N j=1 p j δ j (k l ) for l = 1, . . ., S, with the p j 's drawn from a generalized Dirichlet process.
We sort the columns of V to have in the first S p columns the response for species that will be predicted at the new plots.The remaining S −S p correspond to those species specified as observed at the new plots.The response matrix at the n o new locations is given by and letting X o be the matrix of environmental features at the new plots, the posterior predictive distribution after integrating out the w i 's, is Now, letting where φ k (•|μ, Σ) denotes the k-dimensional normal density with mean vector μ and variance-covariance matrix Σ, and To sample the predicted responses, instead of integrating out the parameters, we can obtain samples from Pr(V o pred |V, V o obs ) by using the MCMC draws of θ from p(θ|V), and then drawing V o pred from p(V o pred |θ, V, V o obs ).
Adaptation to the binary case is clear.The latent variables for the responses of the observed species need to be drawn to predict the latent variables for the unobserved species.Positive latent responses for the unobserved species are set to 1 (presence), and negative latent responses are set to 0 (absence).With presence/absence, Y o is now a matrix of binary responses at new locations, while V o represents the matrix with the associated latent variables.Using analogous notation to that from Section 3.2, we have Prediction with continuous abundance data follows a similar strategy.For the observed species at the new plots, the latent variables at new plots are drawn from normals truncated to be negative for species that are absent, and are set equal to the response for those species that take on positive values of the response.In this case where where μ (hj|obs) = E[V hj |θ, V o h,obs ] and σ 2 (hj|obs) = var(V hj |θ, V o h,obs ).Prediction can also be used for model validation/comparison.That is, a portion of the data can be held out from the fitting and used for validation.In Section 5, using simulated data, we offer a few examples of how this can be done.

Simulations
In this section, we conduct simulations for both the continuous and the binary case.The simulation for the continuous case is merely intended as a proof of concept.That is, on the latent scale, we show how well we can recover the specified multivariate dependence structure.The binary case is intended to demonstrate what we can expect to recover with real presence/absence data.The performance of the algorithm is assessed in terms of out-of-sample prediction.For the continuous case, we generate data for a large number of species using two illustrative simulation mechanisms according to how the variancecovariance matrix is specified.At plot i, the continuous joint response vector (e.g., biomass) for S species is drawn from y i ∼ N S (0 S , Σ).For the binary case, we generate the data using one choice of covariance structure and compare the parameter estimates and predictions with and without dimension reduction, as well as assuming independent models for each species.To fit the models without dimension reduction and compare the results with the reduced dimension approach, we consider a smaller number of species than in the continuous case.
In generating the continuous responses we consider two types of covariance matrices for 1000 species (see Figure 2 for an illustration).For each data generating mechanism, we employ r = 5, 10, 20 to assess the effect of the approximation.In the binary case, we only use a single covariance structure (imposed on the latent scale) and simulate data for 100 species.This is manageable to fit even without implementing a dimension reduction scheme, and allows us to compare the estimates and prediction with and without dimension reduction.With both types of data we also obtain the results from the independent stacked models.Below we describe in detail the simulation setup used in each case.

Continuous Case
The two covariance structures considered for the simulations with continuous responses are given by 1. Expected equicorrelation: Assumes Σ = ΦΦ where the ith row of Φ is given by 2. Clustered covariance: This structure assumes that Σ = A true A true + τ 2 I S , where A true is an S × q matrix (with q ≤ S).Each row of A true is assigned a label randomly from κ = 1, . . ., K true , where K true < S, such that if the lth row is assigned label κ, then We view equicorrelation as essentially a straw man; it is not what we expect in practice.
The second scenario, clustering of species, is more likely in real data and is, in fact, what our DP dimension reduction approach is designed to capture.Below, we will see that, indeed, our approach does work well.
Using these two alternatives, 10 independent data sets were generated in order to examine the ability of the algorithm to recover the true covariance structure and to perform out-of-sample prediction.Assessing out-of-sample predictive performance is done with respect to a hold out sample of species within plots as opposed to using a hold out sample of plots on all species.Predictive performance is assessed by calculating the Euclidean distances between the true values and the conditional predictions, predicting 5%, 10%, 20% and 50% of the species, conditional on the remaining 95%, 90%, 80% and 50% species, respectively.We denote the out-of-sample response matrix by V o .The rows of V o , denoted by V o i , correspond to some permutation of (V o i,pred V o i,obs ), where V o i,obs is the vector of responses for species that are assumed observed and V o i,pred are those considered for prediction.
The prediction vector, V o i,pred is given by (V im1 , . . ., V im Sp ), with S p denoting the number of out of sample species chosen to make prediction for, and m 1 , m 2 , . . ., m Sp denote the set of indices chosen at random for the species to be predicted.Similarly, Vo i,pred denotes the vector of predicted values.Therefore, V o i,obs corresponds to the remaining S − m Sp responses.The criterion used to assess predictive ability of the algorithm is the root mean squared predictive error (RMSPE), given by To test the algorithm we built the A matrix in model ( 2) setting r = 5, 10, 20 columns.The parameters chosen for each of the covariance types are: , where ψ = 2.This choice of parameters yields E[φ i φ i ] = 1 for the diagonal elements, and E[φ i φ j ] = 0.3 for the off-diagonal ones.
For the simulation experiments with continuous response, we assume S = 1000 species on n = 3000 plots.With every covariance structure, 10 datasets were generated at random from y i ∼ N S (0 S , Σ) for i = 1, . . ., n.Using each of these datasets, the algorithm was applied individually for the different values of r.The results extracted from each run of the algorithm are the Euclidean distances between the true out of sample dataset V o pred and their posterior predictive means Vo pred .In Supplementary Appendix B we included Figures 1 and 2 which display, for a single dataset, a comparison between: (i) the true and estimated covariance parameters, and (ii) the observed responses with the out-of-sample predicted values.The figures shown were obtained for A with r = 5 columns, corresponding to a single dataset generated either with the equicorrelation or the clustered covariance.
When the observed data is generated using the expected equicorrelation structure, our methodology attractively recovers two distinctly different groups of parameters, namely, those on the diagonal and those off the diagonal.This is expected, following from the difference in magnitude in their expected values, which are for the diagonal and off-diagonal elements, respectively.
At a first glance, in terms of prediction the method appears to perform poorly under equicorrelation.However, the very nature of the data generating mechanism induces this behavior.To better understand the role that expected equicorrelation plays, consider the conditional expectation at a single plot replacing Σ by its expected (equicorrelation ).The diagonal entries of Λ are given by Sσ 2 v (θ 2 0 +θ 2 1 ) and the off-diagonal elements are Sσ 2 v θ 2 0 , such that the posterior conditional predictive means are where J is an S p × (S − S p ) matrix of ones and Vobs is the average of the entries in V obs .Therefore, the conditional mean for y pred at a given plot is equal for all of its entries.So even though prediction from our method appears to render poor results, it is in fact behaving as if the true covariance matrix was equicorrelated.
With the clustered covariance structure our approach fares quite well, both recovering the true covariance and predicting new observations accurately (see Figures 1 and 2 in Supplementary Appendix B (Taylor-Rodríguez et al., 2016)).
The notable improvement in conditional prediction by accounting for the interspecific dependence is also observed in Figure 3.Under both covariance structures used to generate the true data, modeling the dependence dramatically reduces the error in out-of-sample prediction.In this figure we can also assess the effect of increasing r, the number of columns in A. For the equicorrelated covariance, the results indicate that adding more columns yields an insubstantial improvement.With the clustered covariance structure, the RMSPE slowly decreases as r grows until r = q = 20, the number of columns in A true .Hence, when the true covariance structure is of the same form as that from our approximation, the error is minimized when rank(A) = rank(A true ).Although not shown here, it is also reassuring that a posteriori the number of clusters is highly concentrated about 80, the true number of clusters K true in A true .

Binary Case
For presence-absence data, we consider a single covariance type (imposed on the latent scale), of the form Σ = (ΨΨ ) −1 , where Ψ is an S × S matrix generated from independent standard normal random variables, which yields inverse Wishart draws.In this simulation, we considered only S = 100 species on 1000 different plots in order to be able to compare our results with those from the estimation procedure that considers dependence among species but has no dimension reduction.The mean structure is defined by an intercept and a single predictor.Thus, the true model contains 100 × 2 regression coefficients and 100 2 = 4950 correlation parameters.
Our dimension reduction approach requires a choice of r, the number of columns to be used for the A matrix.Noting the similarity of this formulation with factor analysis models, some guidelines regarding this choice can be extracted from this extensive body of literature.A pervasive notion in this literature is that the relevant number of factors (i.e., columns in A) generally ranges from small to moderate (e.g., see Lopes and West, 2004).
In this spirit, letting K denote the number of unique cluster labels, then rank(A) = min {r, K}.Given that A must be of full column rank for the model to be well identified (Geweke and Singleton, 1980), then K (or an approximation) can act as an upper bound on r.The number of unique clusters is unknown, but in our experience, for a given r, K can be assessed relatively fast by monitoring the number of unique clusters drawn throughout the MCMC algorithm.If this value appears too small relative to r, then r must be reduced.Another indication that r is inadequately large is the appearance of multimodalities in the posterior densities of the elements in A (Aguilar and West, 2010).
A straightforward approach for selecting r is to perform sensitivity analysis varying r, and selecting the value that optimizes some criterion.Here, for the dimension reduced approach we fit models with r = 3, 5, 10, 15, 30, 50, 75, 100.To identify a suitable number of columns we use the Tjur R 2 coefficient of determination (Tjur, 2009), which compares the estimated probabilities of presence between the observed ones and the observed zeros.For species j, this quantity is given by T R j = ( π(1) j − π(0) j ), where π(1) j and π(0) j are the average probabilities of presence for the observed ones and zeros of the jth species, respectively.The larger the T R j , the better the discrimination.An overall criterion is to obtain the average over all species given by T R = 1 S S j=1 T R j , which we employ in this section.We would hope to find a peak in T R as we increase r.If there is little variation in T R over r, then we need not be concerned about the choice of r and simply choose a suitably parsimonious model.
Although the analysis that follows is conducted for a single dataset, to observe the behavior of the Tjur R 2 for the proposed simulation scenario, we generate 50 independent datasets, and for each of them calculate the posterior mean of T R for the different values of r (displayed in Figure 4).Additionally, for these 50 datasets we extracted the posterior mean for K to determine if identifiability issues might arise.
Here we observe that the model fit is similar for the different values of r, with r = 3 being slightly better than the rest.Also, for values r ≥ 50 there might be model identifiability problems given that K ≤ r.
We focus on a single dataset chosen at random from the 50 that were generated, and use the results from the model with r = 3, given the behavior observed in Figure 4.For comparison, we also perform the estimation using (i) the joint model without reducing the dimensionality fitted with the gjam R package (Clark et al., 2016), and (ii) a model where the species are fitted independently and the results are stacked together.
In Figure 5, we compare the parameter estimates obtained from the joint model without reduction (left column), the joint model with dimension reduction (middle column), and the independence model (right column).The top row in Figure 5 compares the ability of the three methods to recover the correlation parameters by contrasting their estimates to the true values (using the means and 95% credible intervals).In this case, the joint method without dimension reduction attempts to estimate all of the 100 2 = 4950 parameters in the correlation matrix; however, it appears to have some difficulties recovering the parameters for the sample size considered.Conversely, the  3, 5, 10, 15, 30, 50, 75, 100. (b) Posterior mean for the unique number of clusters (K) drawn for each of 50 datasets for the value of r considered.dimension reduction approach, by only requiring 301 parameters, identifies an underlying lower dimensional structure, which, in this scenario, yields outstanding recovery of the correlation matrix.The bottom row in the figure compares the fitted regression coefficients to the true ones (means and 95% credible intervals).Here we observe that all three approaches do a reasonably good job, with slightly more accuracy using either the dimension reduced or the independence approach.
In the simulation exercise considered, our method, in addition to fairing well for estimation, outperforms the joint unreduced and independent models in terms of prediction.To assess out-of-sample conditional prediction, for each of 10 species, we obtained the probability of presence at sites where the species was in fact present, and the probability of presence when the species was in fact absent.The conditioning is on the presenceabsence status of 90 (top row) and 60 (bottom row) species (Figure 6), respectively, for each of the three approaches considered.The 10 species being predicted are sorted from most to least prevalent, thus the decreasing trend observed in the probabilities.
For the simulated dataset considered, the joint approach without reduction struggles when attempting to estimate this many parameters for the number of samples available.
For some species, it clearly separates the probabilities of presence at occupied and unoccupied plots, but for other species the distributions of these probabilities appear to be practically indistinguishable.This behavior seems to be affected by the number of species on which the conditioning takes place for the joint unreduced model (compare the top and bottom plots on the left column in Figure 6).With so many parameters and the available sample size, both estimation and prediction appear to breakdown without doing reduction.On the other hand, while prediction using the independence model roughly captures the trend dictated by prevalence for each species, the probabilities of presence at occupied and unoccupied plots (unaffected by conditioning when assuming independence) barely separate from each other.In contrast, our reduced dimension approach assigns visibly higher probabilities to plot/species combinations where the species was in fact present than to those plot/species combinations where the species was absent.

Forest Inventory Data Analysis
We apply our methodology to hectare scale plots obtained by aggregating FIA data in covariate space, since, as clarified in Section 2, FIA plots are too small to allow for meaningful fitting or interpretation.The model yields predictions in covariate space and enables the analysis of species response to changes in the environment.The aggregated data consists of presence-absence data in covariate space for 112 species on 1200 plots.Of these, 1100 are used in model fitting the model, whereas the remaining 100 were held out for out-of-sample prediction and subsequent validation.Also, we only consider temperature and deficit as predictors.In this analysis, we compared the results obtained for the joint dimension reduced approach against the independence model.We focus on comparing (i) goodness of fit, and (ii) predicted species richness.
To assess model goodness-of-fit and compare the independent approach against reduced dimension joint models for r = 3, 5, 10, 15, 30, 50, we obtained the median and 95% credible sets for T R (see Figure 7), as defined in Section 5.2.As expected, the joint approach with dimension reduction yields a remarkably better fit than the independent model for all values of r considered.For the choices of r considered, the median T R values were all similar.We present the analysis for r = 5 since this resulted in the largest median T R. Before moving forward with the analysis, we look at the posterior distribution of the number of unique clusters, K, to establish if the model is well determined.With r = 5, the 95% credible interval for K lies between 42 and 51 with a median value of 47, indicating that we should not run into indeterminacy issues by fixing the A matrix to have r = 5 columns.
Tjur's R 2 can additionally be used to assess out-of-sample predictive performance making use of the dependence between species.In particular, by considering a holdout sample of plots (in covariate space in this application), one may obtain at each holdout plot the posterior predictive conditional expectations for species j given the presence/absence status of species l.With these, we may calculate the conditional Tjur R 2 , which we denote by T R j|Y l =0 and T R j|Y l =1 , if we condition on species l being absent or present, respectively.Being able to condition in this fashion can enhance the quality of the predictions when species j and l show strong dependence.We illustrate this alternative use for Tjur's R 2 at 100 holdout plots by conditioning on the presence-absence state of Fagus grandifolia (FAGR) -present in 66 plots and absent in the remaining 34 -and obtain the posterior probability of presence for Nyssa sylvatica (NYSY) at each pseudo-plot.Using the probabilities of presence, we calculate T R NYSY|YFAGR=1 and T R NYSY|YFAGR=0 under both the joint model with r = 5 and the stacked independent models (Table 1).The posterior 95% credible interval for the correlation parameter (on the latent scale) between FAGR and NYSY lies between 0.2957 and 0.4163, which is relatively high.As such, we expect that conditioning on the presence-absence state of FAGR should improve the quality of the predictions for NYSY.As observed in Table 1, the improved predicted ability of the joint model is evident, when FAGR is either absent or present.
Using the selected model, among the 6,216 correlation parameters in AA + σ 2 ε I, we found that 72.4% of the 95% credible intervals excluded 0 and were positive, 12% excluded 0 and were negative.For the 95% credible intervals of the 112 temperature Figure 8: Posterior distribution for the mean difference between the true and predicted species richness in 100 out-of-sample plots.coefficients, 69% were positive and excluded 0 while 26.8% were negative and excluded 0. For deficit, 25.9% of the credible sets excluded 0 and were positive while 40.2% did not contain 0 and were negative.The signs resulting for the regression coefficients of both predictors are consistent with the environment features the different species are best suited for (see Figure 4 in Supplementary Appendix C).For instance, the temperature coefficient for Quercus laurifolia (QULA3) is large and positive, which is consistent with the higher temperatures found throughout its natural range, i.e., southeastern and south-central US.Conversely, Betula papyrifera (BEPA) is found throughout northern continental US where the temperatures are on average lower and its coefficient is large and negative (see Figure 4).This predictive approach can be applied to scenarios for climate change as a joint response across all dominant species.
To validate the claim made in the literature that predicting richness from individual models overestimates the number of species (see, for example, Guisan and Rahbek, 2011;Clark et al., 2016;Calabrese et al., 2014), we considered a hold out sample of 100 plots.At each of these plots we calculated the expected richness using the MCMC parameter draws.
With the expected richness values estimated at each plot, we calculated the difference between the true richness and the predicted one, and with these derived the posterior distribution of the average differences across all plots (Figure 8).This analysis corroborates the claim made in Guisan and Rahbek (2011).When modeling the species distributions jointly, the mean differences between the true and predicted richness cluster more tightly about 0 than when using the independence model.The latter specification concentrates more mass on large negative values, confirming that it tends to overpredict richness.

Summary and Future Work
We have extended recent activity in joint species distribution modeling.In particular, we have noted that, with many species, perhaps hundreds or thousands, fitting explicit joint species distribution models over several plots will become computationally infeasible.We have offered a first stage latent multivariate normal modeling approach, incorporating a dimension reduction strategy to capture interspecies residual dependence, using Dirichlet processes in a novel way that scales linearly in the number of species.We have shown that this approximation enables estimation of dependence structure, prediction, and clustering of species.The versatility of the framework developed, and the ease with which it can be incorporated into strategies with other types of responses, make it a powerful tool to tackle novel problems.Among others, this approach can be used to address the challenges of microbiome data analysis (where presence-absence of hundreds or thousands of OTUs -Operational Taxonomic Units -is jointly observed), or the massive species distribution datasets currently being collected within the continental scale National Ecological Observatory Network (NEON) project.
Our approach can be applied to other data settings, particularly abundance which can be measured in various ways, e.g., counts, ordinal abundances, proportions of coverage, biomass, basal area, etc., each requiring a different first stage transformation from the latent multivariate normal.To make the methodology readily available for practitioners, it has been fully implemented in the gjam2.0package (Clark et al., 2016) in R for binary, count, and non-negative continuous responses.
At fine spatial scales, accounting for spatial dependence may be appropriate.A straightforward alternative could be to build into the mean component some function of the spatial coordinates -in the simplest case, by including the linear terms and interaction between the coordinates.However, other more sophisticated strategies, where the dependence is modeled directly, merit exploration.In these we will have both within plot and between plot dependence arising across many species and many plots.Lastly, utilizing climate scenarios, we can explore dynamic models to attempt to forecast the evolution of joint species distributions in response to changing climate.

Figure 3 :
Figure 3: Out-of-sample RMSPE for continuous simulations using the (a) unstructured and (b) clustered covariances.Results compared the independence model to the dimension reduced approach (assuming r = 5, 10, 20) when predicting 50, 100, 200 and 500 species (out of 1000) from 10 simulated datasets with n = 3000.

Figure 4 :
Figure 4: (a) Goodness-of-fit measured by the posterior mean of Tjur R 2 for 50 simulated datasets with binary response for 100 species in 1000 plots having r = 3, 5, 10, 15, 30, 50, 75, 100.(b) Posterior mean for the unique number of clusters (K) drawn for each of 50 datasets for the value of r considered.

Figure 5 :
Figure 5: Binary simulation results with 100 species at 1000 plots.Comparison between joint model without reduction (top row), joint model with dimension reduction (middle row), and independent model (bottom row).

Figure 6 :
Figure6: Binary simulation comparison of the predicted conditional probabilities of presence for 10 species given the observed presence status of 90 and 60 species, respectively.

Figure 7 :
Figure7: Goodness of fit measured by the posterior median and 95% credible sets of the Tjur's R 2 coefficient averaged over all species (T R), comparing the independent model to the dimension reduced joint models with r = 3, 5, 10, 15, 30 and 50.
pred|obs ), making the obvious modifications to μ h,pred|obs and Σ pred|obs in the definitions provided for the continuous case to accommodate V o instead of Y o and B instead of B. Similarly, prediction is done by using the MCMC draws from θ.
where V hj and Y hj represent the elements of the vectors V o h,pred and Y o h,pred , respectively.Here,p(V o h,pred |θ, Y, V o h,obs ) = φ Sp (V o h,pred |μ h,pred|obs , Σ

Table 1 :
Tjur R for NYSY conditional on FAGR in a holdout sample at 100 plots.