The unmarked R package: Twelve years of advances in occurrence and abundance modelling in ecology

Species distribution models (SDMs) are widely applied to understand the processes governing spatial and temporal variation in species abundance and distribution but often do not account for measurement errors such as false negatives and false positives. We describe unmarked, a package for the freely available and open‐source R software that provides a complete workflow for modelling species distribution and abundance while explicitly accounting for measurement errors. Here we focus on recent advances in unmarked functionality to support multi‐species, multi‐state, and multi‐season data, as well as support for fitting models with random effects. For illustration, we present an analysis of Acadian Flycatcher Empidonax virescens abundance on Roanoke River National Wildlife Refuge, North Carolina, USA, over 18 years. We found that Acadian Flycatcher abundance was initially greater in hardwood plantation habitat relative to bottomland hardwood forest along river levees but that abundance declined over time in both habitats. We plan for unmarked development to keep pace with advances in hierarchical modelling in ecology, including better handling of continuous‐time data from camera trap and automated recording units and integrated models for multiple data streams.


| INTRODUC TI ON
Understanding the processes governing spatial and temporal variation in species abundance and distribution is fundamental to ecology. Reliable inference requires a combination of appropriate study design and models of the data-generating processes. Studies of species distribution typically involve sampling multiple sites (locations), across a range of environmental conditions, and recording the presence and/or abundance of a given species at each site. A broad class of models called species distribution models (SDMs) have been developed to obtain inference from such data, such as identification of environmental covariates that influence where a species occurs (Elith & Leathwick, 2009).
A very common issue with SDMs is measurement error. Some individuals or species may be recorded as absent when they are present (false negatives, Williams et al., 2002), and others may be recorded as present when they are absent (false positives; Royle & Link, 2006). Ignoring measurement errors can result in biased estimates of distribution and abundance, including relationships with covariates (Gu & Swihart, 2004;Kéry & Schmidt, 2008). A specialized class of hierarchical SDMs has been developed to estimate rates of false negatives and positives (MacKenzie et al., 2002;Royle & Link, 2006). Such models often require repeated samples (also called 'occasions' or 'visits') at some or all sites. Individual organisms may be marked (i.e. uniquely identifiable) so they can be tracked across the repeated samples (Williams et al., 2002), while other models accommodate unmarked organisms. Common examples include occupancy modelling for the estimation of species occurrence (MacKenzie et al., 2002), and capture-recapture (Pollock et al., 1990), N-mixture (Royle, 2004), and distance models (Buckland et al., 2001) for species abundance.
The unmarked R package supports many model types and has been used widely based on citation records. As the name suggests, unmarked focuses on models of occurrence and abundance that do not require individually marked animals, although some of its methods are applicable to data from marked individuals. The package provides a complete analysis toolset, including formatting input data, model fitting, assessing goodness of fit, bootstrapping, model selection and generation of predictions. Package unmarked was first released in 2010 and has been cited more than 2000 times, averaging more than 250 citations per year over 2018-2022 based on a Google Scholar search in December 2022.
Since 2010, many features have been added to unmarked.
First, the available models and associated tools have been greatly expanded to reflect new developments in the field. Second, many models now support the estimation of random effects. Finally, several new auxiliary tools, including power analysis, have been added.
We give an overview of currently available models and associated analysis tools, describe a typical unmarked workflow, and provide an example analysis estimating the abundance of Acadian Flycatchers Empidonax virescens.

| AVAIL AB LE MODEL S
Models in unmarked can be divided into two categories: occupancy models (Table 1) and abundance models (Table 2). Generally, occupancy models require detection/nondetection data as input and abundance models require count data, but there are exceptions which we describe below. For brevity, we do not describe these models mathematically as this information can be found in the references provided in Table S1.

| Occupancy models
Detection/nondetection data distinguish two possible true states for a site, presence and absence, observations of which may be contaminated with false-negative and/or false-positive errors. When a single 'season' (e.g. 1 year) of such data for a single species is available, the single-season, single-species occupancy model can be fit using the occu function to account for false negatives (Table 1) to estimate occupancy using time-to-detection data, that is, the amount of time at a site before a given species is detected. Timeto-detection data is becoming increasingly easy to obtain with automated recording units (ARUs) and associated machine-learning software (Rhinehart et al., 2020). In unmarked, both single-season and dynamic time-to-detection occupancy models can be fit using the function occuTTD.

| Abundance models
Most abundance models in unmarked can be organized as pairwise combinations of sampling method and temporal structure ( Table 2).
The possible sampling methods include (1) repeated counts, (2) distance sampling, (3) double-observer sampling, and (4) removal sampling. The possible temporal structures are (1) single-season data; (2) multiple surveys within a single season to estimate changes in availability due to a form of 'temporary emigration' of individuals (Kendall et al., 1997) and (3) multiple seasons of data to estimate population growth, emigration and similar parameters (Dail & Madsen, 2011).
Repeated-count data allow for the estimation of detection probability and, thus, abundance using variations of the N-mixture model (Royle, 2004). The single-season version of this model can be fit with pcount, the temporary emigration version can be fit with gpcount (the g in this and similar functions stands for generalized), and the dynamic version can be fit with pcountOpen (Table 2). Despite the naming conventions, these functions are intended to be used for Nmixture models generally and not specifically avian point counts.
Distance sampling involves measuring the distance from a point or a transect to observed organisms and using the estimated relationship between distance and detectability to estimate detection probability (Buckland et al., 2001). unmarked implements variations of the hierarchical distance sampling model of Royle et al. (2004), which allows for simultaneous estimation of detection and abundance. The naming conventions for these functions follow a similar logic: the single-season model is distsamp, the temporary emigration model is gdistsamp, and the dynamic model is distsampOpen.
Double-observer models (as their name suggests) require two humans to simultaneously survey a site. Detection probability is estimated using information on how many individuals were observed by both or only one of the two observers. For removal sampling, a single observer divides a survey into several discrete time periods and counts how many individuals are first observed in each time period. These two sampling methods are handled by the multinomPois function for single-season data, the gmultmix function for temporary emigration and the dynamic version is multmixOpen. This family of functions can be customized to accommodate a wider range of related models, including capture-recapture data from marked organisms (Chapter 7 in Kéry & Royle, 2016).
There are three additional abundance models available, which do not readily fit into the organizational structure described above (Table 2). The first is the Royle-Nichols model (occuRN), which exploits the functional relationship between abundance and detection probability to estimate abundance from detection/nondetection data. The second is a model that estimates abundance from time-todetection data (nmixTTD). Finally, unmarked includes an abundance model (gdistremoval) which incorporates distance sampling data (as with distsamp) and removal sampling data (as with multinom-Pois) to estimate detection probability and availability parameters (Amundson et al., 2014).

| T YPIC AL unmarked WORK FLOW
Below we describe a typical unmarked workflow divided into three parts. Additional details are available in the package vignettes (https://cran.r-proje ct.org/web/packa ges/unmar ked/vigne ttes/). A summary of the workflow is found in Table 3.

| Model fitting and diagnostics
Model structure is defined using a formula-based notation similar to the one used widely in the R ecosystem (e.g. by the lm function; R Core Team, 2022). Given their hierarchical structure, unmarked models require two or more such formulas. For example, an occupancy model with a covariate for wind affecting detection and covariates for forest cover and elevation affecting occupancy would be ~wind ~forest+elev. Formulas may include certain functions indicating that covariates should be transformed before analysis. For example, the modified formula ~scale(wind) ~scale(forest)+scale(elev) indicates that each covariate should be standardized using the scale function to have mean 0 and standard deviation 1. Some unmarked functions (currently occu, pcount, multinomPois, distsamp and gdistremoval) also support estimation of random intercepts and slopes using the notation of the lme4 package (Bates et al., 2015). For example, adding a random intercept on occupancy by site ID to the previous formula would be ~wind ~forest+elev+(1|site_id).
The unmarkedFrame and formulas are supplied to the function corresponding to the desired model (Tables 1 and 2 A model selection table showing Akaike information criterion (AIC) scores, number of parameters, and model weights can be generated by applying the modSel function to a list of fitted models. There is limited support for model-averaged prediction with predict from a list of models, and more complete model averaging functionality is available via the AICcmodavg package (Mazerolle, 2020). Model predictive power can also be assessed and compared with crossvalidation via the crossVal function, which supports holdout, K-fold, and leave-one-out cross-validation (Stone, 1977).

| Model selection and inference
Once a top model has been selected, it can be used to generate estimates and 95% confidence intervals of model parameters using the predict function. Overdispersion-corrected prediction intervals and AIC scores for most unmarked analyses can be obtained using functions in AICcmodavg (Mazerolle, 2020). Species distribution maps (e.g. Sillett et al., 2012) can be generated by providing raster data to predict. The model can also be supplied to the ranef function to estimate posterior distributions of the latent parameters (e.g. occupancy state, abundance) at each sampled site using empirical Bayes methods (Carlin & Louis, 1996).

| Study design and data collection
We estimated population abundance from a dataset of Acadian

| Simulation and power analysis
We conducted a power analysis to evaluate our ability to detect differences in abundance between habitat types and over time, to demonstrate this new functionality in unmarked (typically, power analysis should be done before data collection). We first generated a template unmarkedFrame which contained the study design for which we calculated power using the simulate function. The study design consisted of 50 point count locations, equally assigned between two habitat types (designated A and B), each surveyed once annually for 15 years.

TA B L E 3
A typical workflow for an unmarked analysis, and unmarked functions used in each step.  Tables 1 and 2) 6. Model diagnostics parboot(), crossVal(), residuals(), vif(), functions in AICcmodavg package 7. Model selection modSel(), crossVal() 8. Model inference predict(), ranef(), posteriorSamples() Next, we provided the powerAnalysis function with the template study design and parameter values corresponding to the abundance and detection processes. We defined the parameters such that habitats A and B would have initial abundances of about 5 and 6 birds respectively, and abundance would decline about 2% per year. We set the distance-based detection parameter σ at the median distance of our point count radius (25 m) and the removal-based probability of availability at 0.27 (similar to the real data). With this information, the powerAnalysis function generated 100 simulated datasets and fit the removal-distance sampling model to each dataset. Finally, pow-erAnalysis calculated the proportion of datasets for which a given coefficient (e.g. for habitat differences) was statistically significant (with α = 0.05) as a measure of power.

| Model fitting and model selection
We analysed the actual dataset using the removal-distance sam- We compared models using AIC. Code and data to replicate this analysis is available from Zenodo (Kellner et al., 2023).

| Results and discussion
Our power to detect a difference in abundance between habitat A and B (5 vs. 6 birds per point) was 0.71, and power to detect a declining trend in abundance of approximately 2% yearly was 0.81. The top-ranked model for abundance fit to the real data included effects of habitat type and year, as well as their interaction (HAB × YEAR), on abundance (Table 4).
Based on the top-ranked model, Acadian Flycatcher abundance declined in both habitat types (Figures 1 and 2). This is consistent with range-wide declines in the species as documented by the North American Breeding Bird Survey, which indicated a 10% range-wide decline during 1970-2014 (Rosenberg et al., 2016).
Acadian Flycatcher abundance in our case study was initially about 55% greater in hardwood plantation habitat relative to bottomland hardwood forest along river levees. However, abundance declined more slowly in bottomland hardwoods (about 2.7% annually) compared to hardwood plantation habitats (7.0% annually), resulting in similar estimated abundances in both habitats by the end of the survey period ( Figure 2). The initial difference in abundance may relate to more favourable habitat structure for Acadian Flycatchers in the hardwood plantations at the beginning of the survey period, including ample space among plantation rows that promoted a diverse and complex midstory relative to that of bottomland forests (Hazler et al., 2006;J. Richter, pers. comm., 2022

| CON CLUS ION
The unmarked package has been cited widely in studies of abundance and occurrence since 2010 (based on Google Scholar search, December 2022). Notable recent improvements to the package include additional models for multi-species occupancy (Rota et al., 2016), multi-state (MacKenzie et al., 2009, and multiseason data (Dail & Madsen, 2011), power analysis tools, and support for random effects. In addition to new features, we improved the quality and reliability of the underlying code, including more extensive code testing. Such improvements will be crucial for successfully maintaining and expanding unmarked over the next decade.
We plan for unmarked development to continue to keep pace with advances in hierarchical modelling in ecology. This will involve support for new data types and models, along with improved tools for diagnostics and inference. We expect to improve support for large detection datasets collected in continuous time, such as from camera trap arrays and ARUs. Such data are becoming increasingly available due to advances in camera/ARU technology and machine learning (Rhinehart et al., 2020;Tabak et al., 2019). We are also developing models that incorporate data from multiple sampling designs and/or studies (i.e. integrated models). For example, we recently added support for a combined distance sampling/removal model (Amundson et al., 2014) and are implementing a model that can integrate distance sampling, point count and detection/nondetection data .

CO N FLI C T O F I NTE R E S T S TATE M E NT
No conflicts of interest have been declared.

PE E R R E V I E W
The peer review history for this article is available at https:// www. webof scien ce.com/api/gatew ay/wos/peer-revie w/ 10.1111/2041-210X.14123.

DATA AVA I L A B I L I T Y S TAT E M E N T
The unmarked package is available on CRAN (https://cran.r-proje ct.org/packa ge=unmarked). The development version of the package used in the application, and the package source code, can be found at https://github.com/rbcha n/unmarked. The code and data for running the application are available from Zenodo (Kellner et al., 2023).