Integrated species distribution models: A comparison of approaches under different data quality scenarios

Integrated species distribution modelling has emerged as a useful tool for ecologists to exploit the range of information available on where species occur. In particular, the ability to combine large numbers of ad hoc or presence‐only (PO) records with more structured presence–absence (PA) data can allow ecologists to account for biases in PO data which often confound modelling efforts. A range of modelling techniques have been suggested to implement integrated species distribution models (IDMs) including joint likelihood models, including one dataset as a covariate or informative prior, and fitting a correlation structure between datasets. We aim to investigate the performance of different types of integrated models under realistic ecological data scenarios.


| INTRODUC TI ON
Integrated species distribution models (IDMs), which combine multiple data sources to model species distributions, are becoming increasingly common . IDMs allow data of different types, such as structured samples from formal scientific research and opportunistic data from alternative sources like museums and citizen science programmes, to be combined in a single model Miller et al., 2019;Zipkin et al., 2019).
Structured sampling is expensive and thus often spatially restricted while opportunistic data are often abundant and readily available. Combining both types of data can capitalize on the advantages of each data type and provide better predictions of species distributions and their drivers (Miller et al., 2019). Utilizing all available data is useful particularly for developing countries, which can face resource constraints hindering efforts for extensive conservation research (Bowler et al., 2019).
Studies have suggested different approaches to integrated distribution models. Fletcher et al. (2019) outline a broad range of integrated modelling techniques including data pooling, ensemble models, using auxiliary data and weighted joint likelihood. Data pooling is a simple method that is commonly used in practice where data are simply pooled by transforming or simplifying the more sophisticated dataset to accommodate the structure of a simpler second dataset, for example, by downgrading abundance data to presenceabsence data prior to pooling, which ultimately results in some lost information. Data pooling is not always appropriate since it does not account for differences in the data sources like sampling biases, spatial and temporal variation in sampling effort, and differences in collection protocols. Pacifici et al. (2017) outlined three approaches which move beyond data pooling towards true integrated species distribution models (IDMs) termed the joint likelihood, correlation and covariate models. The joint likelihood method simultaneously fits a likelihood to both data sources while accounting for different data types and other observation processes. The covariate method incorporates information from a second dataset via a fixed effect. For correlation modelling, the datasets are indirectly connected through a shared covariance matrix that captures similar patterns present in both data sources. Additional approaches, such as incorporating information from one dataset as an informative prior, have also been suggested (Miller et al., 2019).
In their study, Pacifici et al. (2017) found that integrated models consistently performed better than models fitted with single data sources. The relative performance between the three integrated methods tested depended on the information held by the unstructured data source. The joint likelihood method was found to be more sensitive to the quality of the unstructured data source compared to covariate and correlation modelling, but all performed relatively well. We may expect that joint likelihood models perform best when bias is low and data of both types are plentiful (Fletcher et al., 2019;Simmonds et al., 2020). It is not yet clear whether alternative approaches may outperform joint likelihood models when some data are spatially biased or low in quantity, or when information on spatial biases is lacking.
Opportunistic data are often biased towards areas of high human population density and to areas that are easily accessible for recording (Isaac & Pocock, 2015). In many cases, information on effort in unstructured data is lacking, although covariates, such as human population size or distance to roads, may be available (Fithian et al., 2015). However, in some cases the causes of spatial bias may be unknown or suitable covariates to explain spatial bias in recording may not be available. In these cases, researchers will need to consider whether spatially biased opportunistic data can still be used to model species distributions. Joint likelihood approaches to data integration have been shown to be very sensitive to unknown spatial biases , producing poorer results than single dataset analyses. However, integrated models that include a lower quality dataset via a correlation structure or covariate are hypothesized to be more robust to issues such as unknown spatial bias as the degree of information sharing is less than in a joint likelihood model (Pacifici et al., 2017).
Another challenge for integrating distribution data is lack of overlap in spatial extent of different data sources (Bowler et al., 2019).
High-quality professional survey data are very expensive to collect, and so researchers may often be faced with a case where highquality data are available for only a subset of the total area of interest, for example for a subset of countries across a larger region. Joint likelihood approaches have been shown to perform well when spatial extents are not the same under some conditions (Koshkina et al., 2017), but alternative approaches have not yet been tested.
We might expect the correlation approach in particular to perform poorly under conditions of low spatial overlap as there is less area from which to estimate the correlation between datasets. A key factor determining model performance might be the degree to which a spatially restricted PA survey covers both environmental gradients and any gradients in PO data effort.
To assist ecologists in choosing the most appropriate modelling approach, we need to consider the performance of each method under a range of different data conditions. To do so, we use a virtual ecologist approach (Zurell et al., 2010), whereby we simulate a species distribution and sample from it under different scenarios. The strength of this approach is that we know the true distribution and can compare model performance to this known truth. In our simulations, we consider a scenario where a researcher is in possession of both "high-quality" presence-absence (PA) data and an unstructured citizen science type dataset with presence-only (PO) information and needs to make a decision about whether to integrate the two datasets, and if so which approach to use. We consider the potential impacts on model choice of variation in sample size of PA data, the degree to which the PO dataset is spatially biased (e.g. due to spatial variation in the number of observers) and the overlap in spatial extents between the two datasets. We also consider the importance of knowledge of spatial bias in PO data by either including or excluding a covariate explaining the bias to test whether alternatives to the joint likelihood approach are more robust to unknown biases in data sources. We test three integrated modelling approaches under the restrictive conditions of a single focal species and absence of repeat visits.

| Data generation and sampling
To assess the performance of three integrated SDM methods, we constructed a simulation study. We use the approach for data generation and sampling described in Simmonds et al. (2020), with minor modifications. Briefly, we generate an intensity surface over a 100 by 100 grid, which represents the true spatial patterns of species distributions (where intensity is high we expect to observe more individuals; Figure 1). The intensity surface was generated as a log-Gaussian Cox process. The intensity function assumes that the species distribution is determined by both an environmental effect and a random spatial term, the latter meaning that locations closer together are more likely to have similar numbers of individuals. To simulate the locations of individuals (here assumed to be equivalent to points for simplicity), we produce a separate realization from the log-Gaussian Cox process for each simulated survey. By creating separate realizations, we assume that the underlying intensity is the same across surveys but it is unlikely that the same exact individuals are recorded in both surveys.
Two different observation processes were simulated: a PA dataset simulating a professional survey and a PO dataset simulating unstructured citizen science data. Firstly, to sample the PA data from the generated true data, the whole field with dimension (100 × 100) was divided into 25 (20 × 20) squares. A stratified sampling scheme was simulated, as this is often used to ensure representative coverage in real-world surveys. A preset number of "samples" of size 1 × 1 was divided equally across strata, so that, for example, for 100 total samples, 4 samples would be taken from each stratum. The samples were recorded as presences if they intersected with a point in the point pattern and absences if they did not. We assumed that each location was visited only once and that there was perfect detection in PA samples. This is a strong assumption for most mobile taxa where perfect detection is unlikely. However, for easily identified sessile taxa, for example many flowering plants, the assumption of perfect detection in PA data may be a reasonable approximation.
Although some unstructured data may contain information on absences, the citizen science data simulated here were assumed to be presence-only data since much opportunistic data only hold presence information. PO data often come with some form of bias due to F I G U R E 1 Visualization of the simulation methodology. Yellow indicates higher values, and blue indicates lower values throughout. The true species intensity is determined by an environmental gradient plus a random spatial term. From this intensity, a realization of individuals (small black points) is produced for each survey type. For PA data (top row), sampling locations are determined by a stratified sampling design with equal coverage across the domain. Observed PA data are represented by large black points for presences and large white points for absences. For PO data (bottom row), detection is made to vary across the sampling area with higher detection towards the left. Observed PO data are thinned by the detection probability so that the spatial distribution of PO data is influenced by both the true intensity and the detection surface

Species intensity
Each survey has a different realisaƟon from the point process DetecƟon in PO data varies over space "Observed" PO data "Observed" PA data PA data are sampled using a straƟfied design sampling effort, sampling frequency, coverage area and detectability (Isaac et al., 2014). For example, more sightings are expected near roads and towns where accessibility is high making it convenient for citizens to visit.
To incorporate bias in the PO data in the simulation, the strata were also used to vary detection probability across the domain setting it to be higher towards the left side, assuming higher concentration of opportunistic surveying in that area ( Figure 1). The bias pattern simulated had a pattern perpendicular to the environmental covariate which should allow these processes to be separated (Fithian et al., 2015). Presence-only data generated from a realization of the log-Gaussian Cox process were then thinned using these detection probabilities. The number of PO samples varied for each simulation because both the intensity and thinning were stochastic processes. In some cases, a modeller may be able to explain varying detection probability in PO data with a covariate, for example representing survey effort or observer density. To simulate a covariate the modeller may be available the probabilities were transformed into a smooth pattern that varied from 1 to 0 along the x axis, simulating a good but not perfect covariate for spatial variation in detection.

| Joint likelihood model
Perhaps the most intuitive approach when faced with two datasets that we wish to integrate into a single model is to use a joint likelihood. In our simulation study, models for each dataset can be defined as follows. The PA data, which was recorded as either present or absent, followed the Bernoulli distribution.
where Y i is the response variable (presence or absence status) in sample i, p i is the probability of success (presence), α 1 represents the intercept, β 1 the coefficient of environmental covariate x 1 and f(s) a random spatial term. The random spatial effect is modelled as a latent Gaussian field approximated using stochastic partial differential equation (SPDE) with a Matérn covariance. Integrated nested Laplace approximation (INLA) can be used to estimate the solution of the SPDE (Lindgren et al., 2011). The linear predictor was linked to the success probability by the complementary log-log link (cloglog) function log(−log(1 − p)) (see Kery & Royle, 2016). For the PO model, the dataset was generated from a log-Gaussian Cox process and we can model the expected number of individuals in an area with a Poisson distribution.
where N is the expected number of observations in an area A, λ is the mean intensity function, α 2 the intercept, β 1 the parameter coefficient for environmental covariate x 1 , β 2 the parameter coefficient for bias covariate x 2 and f(s) is the random spatial effect. Note that as in Simmonds et al. (2020) the thinning probability of the PO data is not explicitly estimated, but the covariate x 2 is used instead to explain variation in detection probability in PO data. In some models, covariate x 2 was not available to simulate situations where information on the spatial bias in PO data is missing.
To fit the PO data as a Poisson process, information on covariate values at integration points is also required.

| Informed prior model
Covariate models as defined by Pacifici et al. (2017) share information between datasets by including information from one dataset, usually the lower quality dataset, in the model of a second dataset via a fixed effect. One disadvantage of this approach using PO data as the first dataset is that a decision has to be made about the spatial scale at which PO data should be aggregated to produce a suitable covariate. Here we use a different approach whereby the first dataset influences the second via informative priors rather than a fixed effect (Miller et al., 2019). To do this, we sequentially model one dataset first and use the parameter estimates to derive informative priors for the second dataset model. The first dataset therefore contributes to the estimates in the second model, but the information shared is controlled by the influence of the prior. If the first model produces imprecise estimates of the shared parameters, then the priors will be less informative than if very precise estimates are obtained. We assumed that the PA dataset was a better quality dataset than the PO dataset as it was spatially unbiased and had perfect detection. Therefore, we modelled the PO dataset first and used the estimates to derive informative priors for the PA data model. The informative prior should influence the likelihood towards better inference because some knowledge about the species distribution can be derived from the abundant PO data within the same spatial domain.
Firstly, the PO data were modelled as in Equation 2 with uninformative priors. The posterior distributions of β 1 and f(s) were then extracted from the PO model to include in the PA model as informative priors. To extract parameters describing f(s), the internal parameterization of the variance and spatial scale from the Matérn covariance, θ 1 and θ 2 , were used. Priors on the intercepts in both models were kept as uninformative uniform distributions as intercepts were not shared between models.
To assess the impact of providing a suitable covariate for spatial bias in the PO model, the first model was fit both with and without knowledge of the bias covariate x 2 . Even though there was no bias covariate in the second, PA, model, whether or not it was estimated in the PO model would influence the estimates of the environmental covariate and spatial field and therefore the informative priors used in the second model. These two scenarios reflect the fact that the researcher may or may not have knowledge about the processes that determine spatial bias in the PO data.

| Correlation model
The correlation method assumes that there is some spatial correlation between different data sources which can be estimated (Pacifici et al., 2017). Where a species is abundant in one data source, then, naturally, it should also be abundant in the other data source when both sources cover the same spatial domain.
To fit this model using INLA, we allowed the spatial random effects to be correlated between datasets using the "group" function.
This approach is commonly used in spatio-temporal models, where there is a spatial correlation pattern which may be correlated over time (Blangiardo et al., 2013). Here, we estimated f 1 (s) for the PA data and f 2 (s) for PO data using SPDE (Equations 3 and 4) and estimated the correlation ρ between data sources using an exchangeable correlation structure whereby f 1 (s) = ρf 2 (s) + ε. This relationship between f 1 (s) and f 2 (s) is controlled by the correlation coefficient ρ and also includes some error ε. Note that although the spatial fields were correlated rather than jointly estimated, the environmental covariate was estimated via joint likelihood, that is β 1 is shared across both models. This model therefore assumes that while the spatial pattern in observations may not be identical between datasets, for example due to unknown biases, the effect of the environmental covariate is still shared.
Since this type of integrated species distribution model only assumes a spatial correlation function exists between data sources, the outcome of the model fitting will be two separate spatial fields. So, to make predictions one must choose which spatial field to use. With the same reasoning as before, the perfect detection in PA data made it the better choice to be used to create predictions from the correlation model. If the correlation between the datasets is high, the choice of grouping for predictions would not matter greatly and both groupings would perform similarly. The strength of the correlation indicates how much information is shared across the data sources.
Models were fit with and without covariate x 2 to assess how sensitive models were to information on spatial bias in PO data.
In addition to the three integrated models described above, models were also fit to the PA and PO datasets separately for comparison (Equations 1 and 2).

| Validation
In each set of simulations, predictions were made for an equally spaced sample of 400 locations across the domain. This coarser resolution for prediction is advantageous as it speeds up model fitting.
Three metrics were assessed. Firstly, the accuracy was calculated using the mean absolute eError (MAE) between the predicted log intensity and the true log intensity. This metric was chosen because it is less sensitive to outliers than the mean squared error. Smaller errors indicate that the predicted values are close to the truth and that the model has a good fit. Note that the MAE was not calculated relative to mean predictions (as in Simmonds et al., 2020) so reflects the degree to which the absolute intensity can be returned.
The Pearson correlation between predicted log intensity and actual log intensity was also calculated to capture the similarity in predicted spatial distributions. This metric was included because models with high MAE could still capture relative spatial patterns fairly well, which can be assessed by the correlation metric. Positive correlation means that the predicted log intensity increases with the true value, whereas negative correlation means prediction pattern moves in the opposite direction of the truth. The closer the correlation coefficient is to one, the closer the spatial match between predicted and actual values.
Lastly, bias in the estimate of the environmental covariate coefficient was assessed since the true parameter value was known. A mean estimate close to the true value indicates a well-fitting model and small credible intervals of the posterior distribution represents a precise model.

| Scenarios
We assessed model performance in two simple simulation studies.
For the first study (Table 1), the first scenario tested was to study the effect of the sample size of PA dataset on the performance of integrated models. In reality, PA sampling is expensive, sample size is often restricted and the researcher may need to decide how many samples to collect in future surveys. Hence, the simulation study The second parameter tested was the degree of bias in the PO sample. To analyse the effect of spatial bias, the detection probabilities that vary horizontally across the field as in Figure 1 were log ( (s)) = 2 + 1 x 1,(s) + 2 x 2,(s) + f 2 (s) changed to control the degree of thinning of the point process. Two new sets of probabilities were specified to represent an unbiased situation and a very biased situation. A constant detection probability of 0.2 was set across the whole field dimension to represent the unbiased PO sampling and for the very biased dataset, a set of probabilities with larger range was formed; (0.5, 0.4, 0.1, 0.01, 0.001), where points are more clustered towards the left side of the field compared to the default bias pattern (0.5, 0.3, 0.1, 0.05, 0.01).
The detection probabilities for the unbiased and very biased PO data were chosen carefully to avoid too large of a difference in the total number of PO observations between the scenarios to make them comparable.
A second simulation study was conducted to assess the impact of partial coverage of the domain by one dataset. Due to limited resources, running a research sampling survey sometimes may only cover a limited area while PO sampling often covers a larger proportion of the surrounding area especially where accessibility is high.
Hence, one of the objectives of this simulation study was to understand the effects of the size of PA survey area on the integrated model when combined with a PO sample with larger area of study.
The second objective was to vary the location of the PA survey in relation to the spatial patterns of bias and environmental covariates.
In data generation, the spatial variation in detection and environmental covariate were assumed to be perpendicular to each other ( Figure 1). So, a few aspects were tested that include the overlap of the PA sample with areas where there were large amounts TA B L E 1 Details of the parameters used in the scenarios investigating sample size of PA data and bias in PO data. The probability of observing PO data is given as five values, each associated with a stratum (see Figure 1), forming a gradient from high to low probability. In the Unbiased scenario, the probability of observing PO data is constant The coverage area was also varied, and the number of PA samples differed according to the size of area. The total number of PA samples for the whole field was set to 250 so that 1/5th of the total area had 50 samples and 3/5 th s had 150 samples. The bias in the PO observations was calculated in the same way as the default scenario in Table 1. The models were also fitted using PA data with full coverage area for comparison. All the new scenarios formed for the PA dataset were integrated with the PO data that covered the whole extent and performance was analysed.
For each simulation, all the implemented models were fitted and performance measurements were analysed and compared. Models include the single PA and PO models and the integrated joint likelihood, informed prior and correlation models. Additionally, an extra covariate x 2 was included in the PO, joint, informed prior and correlation models to account for bias in the PO data. Covariate x 2 could be either a variable strongly related to effort (e.g. human population density) or auxiliary information on effort provided alongside PO data. The models fitted without this covariate represent a situation where the data may be spatially biased but there is no information to explain it, that is there is no auxiliary information and other suitable covariates are not available. This is often the situation faced by researchers aiming to use ad hoc observations for modelling where sources of spatial bias can be complex and not always easily approximated by available covariates (Johnston et al., 2020).
All models were implemented in R-INLA (Rue et al., 2009). Code to generate the data and run the simulations is available at https:// github.com/NERC-CEH/IDM_compa risons.

| RE SULTS
Increasing the size of the PA data increased performance, as meas- If a covariate to explain the bias was not available (Figure 2b,d,f), then the degree of bias in PO data had a much larger impact on the joint likelihood model than on the informed prior or correlation models (Figure 2b). The joint likelihood model had lowest MAE of all models in the unbiased scenario and highest in the biased scenario while the informed prior and correlation models were relatively unaffected. However, the informed prior model also showed poor performance in all scenarios in relation to the best performing IDMs, suggesting again that this model provided little improvement over the PA-only model. Surprisingly, the PO-only models had higher error when a bias covariate was available, suggesting that β 2 was poorly estimated when no PA data was available.
All models estimated the environmental coefficient poorly The simulation study demonstrated that the joint model did not always perform better than the single PA model unlike the study by Pacifici et al. (2017) whose integrated models always produced better predictions than its single model when the underlying assumption that the two data sources were related was met. Pacifici et al looked at situations where either detection was constant across space or where spatial variation in effort was very well known. Here we considered that detection in PO data almost always varies in space and there may be no information to provide a suitable covariate for spatial variation in detection or effort. We demonstrate that the joint likelihood model performs poorly when the PO data source is biased and that bias cannot be accounted for by a covariate, also shown by Simmonds et al. (2020). The informed prior model was robust to unknown spatial bias in PO data but provided little benefit over the PA-only models in most scenarios. The correlation model was less sensitive to unknown spatial bias in PO data than the joint likelihood model and performed only slightly worse than the joint likelihood model in the unbiased scenario, suggesting this may be a good choice of model for ecologists faced with data of unknown quality.

F I G U R E 3
Performance of each integrated model and single dataset models when the structured PA data has either partial or full coverage of the total area of interest. Panels (a)  A visualization of the PA data placement is available in Appendix S1 in Supporting Information Both the informed prior and correlation IDMs were less affected by spatially biased PO data, even when no covariate was available to explain the bias. In the informed prior model, the PO data contribute via the prior while in the correlation model the spatial fields are allowed to be correlated but not completely shared. Therefore in both these alternative IDMs, the information contributed by the PO data is lower, reducing the sensitivity of these models to spatial bias in this data source. The informed prior model in particular was unaffected by bias in the PO data; however, this model also provided little benefit beyond modelling PA data only unless the PA data were limited in coverage. This indicates that the prior obtained from the PO data was sufficiently vague to mean that the posterior was largely informed by the PA data.
Only when the spatial coverage of the PA data was limited did the benefit of this IDM become apparent. In agreement with the research done by Koshkina et al. (2017), the performance of integrated models with PA data restricted to a small part of the total area of interest can be higher than using PO data only, suggesting that spatially restricted PA data can still be valuable in estimating species distributions. However, joint likelihood models were influenced by the location of the PA data in relation to gradients of bias. If the PA data did not cover the gradient of bias of the PO data, then joint likelihood models produced poorer results than if the PA data alone had been used unless a covariate was available to explain bias in the PO data. Therefore for PA data to be useful in separating bias from the true spatial distribution, it must cover both areas with high sampling effort or detectability and areas with low effort or detectability. It is likely to be impossible for a researcher to be able to estimate whether PA data cover this gra- Another limitation is the simplistic way in which the environmental covariate and spatial bias were assumed to have perpendicular gradients, so that their effects could be easily separated. If these variables are correlated, then integrated models may perform poorly , although the relative performance of IDM types under scenarios of correlation between environmental suitability and sampling bias has not yet been investigated. Simmonds et al. suggested that models containing two spatial fields, one capturing the spatial bias, could be useful where there is correlation between drivers of occurrence and sampling bias. The robustness of each type of IDM to other potential patterns in sampling such as preferential sampling in areas of high occupancy is also useful topics for future work.
Overall, the study confirms that joint likelihood models provide the best estimates when data are unbiased, or bias is well accounted for, but are very sensitive to unexplained spatial biases. The two alternative IDMs were robust to unknown spatial bias; however, the informed prior model showed little improvement over modelling the PA data alone unless the PA data were spatially constrained. The correlation model performed well under conditions of unexplained bias and provided an improvement over modelling single data sources.
We suggest that the correlation model may be the best choice in many applications where researchers are faced with spatially biased PO data. The cost of using this model when data are unbiased, or when effort can be explained by a covariate, is low, with only a small reduction in performance compared to the joint likelihood model.
The correlation model also performs well when PA data are spatially restricted. Understanding the complex spatial biases in PO data is a real challenge for researchers so providing integrated modelling approaches that are more robust to unknown biases is important to allow researchers to apply integrated approaches to a wider range of datasets.

ACK N OWLED G EM ENTS
The work presented here formed part of a Master's thesis by SSAS

PEER R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1111/ddi.13255.