A hierarchical framework for estimating abundance and population growth from imperfectly observed removals

Estimating abundance and growth of animal populations are central tasks in ecology and natural resource management. Removal models for estimating abundance have a long history in applied ecology, and recent developments provided hierarchical extensions that account for spatially replicated sampling and heterogeneous capture probabilities. Measurement error is common to removal data collected from many broad-scale monitoring programs, however, and a general framework for population assessment using removal data in the presence of measurement error is lacking. We developed a hierarchical framework for estimating abundance and population trends from removal experiments that are replicated in space and time that accommodates measurement error, as well as heterogeneity in capture probability and animal density. We describe the model for variable-effort removal sampling and use it to estimate region-specific abundance and population trends for wild turkeys (Meleagris gallopavo) in Michigan, USA. We used a Bayesian approach for estimation and inference and fit models using daily hunter harvest and effort estimates collected over 5 management regions for 14 annual hunting seasons. Our analyses provide evidence for spatially heterogeneous capture probabilities among regions and turkey densities that were heterogeneous in both space and time, and show that populations increased slightly over the study. Our framework provides a general approach for population assessment using removal data that are collected over broad scales in resource management contexts (e.g., animal harvesting), facilitating formal abundance estimation instead of reliance on unverified indices for tracking populations of managed species. Thus, we provide a useful tool for monitoring programs to assess populations over broad scales, and therefore inform decision makers about population status at spatial scales similar to those for which regulatory decisions are made.


INTRODUCTION
Estimating abundance and population growth are central tasks in ecology, conservation, and natural resource management. Ecology has been referred to as study of the distribution and abundance of organisms (K ery and Schaub 2012), and reliable information about population size is central to understanding a broad array of theoretical and applied concepts, including density dependence (Dennis 2002, Dennis et al. 2006, Lebreton 2009), harvest theory (Ricker 1954, Schaefer 1954, extinction risks (Pimm et al. 1988), and biogeographic patterns and processes (Brown 1984, Brown et al. 1996. Accurate abundance estimates are also vital in the context of conservation and natural resource management, where they play a critical role in making and evaluating the effectiveness of decisions (Williams et al. 2002, Nichols andWilliams 2006). Indeed, accurate information about population size is generally treated as a prerequisite for decision-analytic procedures commonly advocated for state-dependent decision making in conservation and resource management (Johnson et al. 1997, Kendall 2001, Martin et al. 2009). Yet estimates of abundance, rather than just indices that are assumed proportional to abundance, are often challenging for management agencies to generate at broad scales where decisions are made because they are logistically difficult and costly to obtain.
Removal models are a well-established framework for estimating abundance of animal populations (Moran 1951, Zippin 1956, Borchers et al. 2002. Removal models are developed from sampling designs (hereafter referred to as removal experiments) under which individual animals are captured and removed from a population over successive occasions (hereafter referred to as trials), where the population is assumed to be closed to additions or other removals for the duration of the experiment (over all trials). A typical example records the number of animals removed from a population over a series of days; however, removal trials can be defined in other discrete units (e.g., multi-day periods, single electrofishing passes). Sampling intensity is often assumed constant during each successive trial (fixed-effort sampling; Moran 1951, Zippin 1956, Dorazio et al. 2005, but models can be developed with variable sampling effort over trials when data on that effort are available (DeLury 1947, St. Clair et al. 2013. Removal experiments thus measure change in the absolute number of removals or removal rates over time, and as the remaining population is depleted, the expected number of removals for each trial gets smaller. Moreover, changes in both the observed and expected number of removals across trials provide the information needed to estimate abundance at the beginning of the experiment (DeLury 1947, Moran 1951, Zippin 1956. Given the nature of removal sampling, collection of data under such designs is natural for harvested species. Removal sampling is not limited to harvested animals, however, as a removal itself can be broadly defined. Consequently, removal designs are commonly employed to estimate local abundance in applied ecological studies (e.g., electrofishing surveys of stream fishes; Bohlin andSundstrӧm 1977, Wyatt 2002).
Basic removal models have typically used a multinomial or conditional binomial likelihood to model the data generating process, and in their original form assumed a constant capture probability (Moran 1951, Zippin 1956, Borchers et al. 2002. In reality, capture probabilities can be heterogeneous among individual animals, over trials within the same experiment, or among experiments replicated in space and time (Lewis and Farrar 1968, Bohlin and Sundstrӧm 1977, Borchers et al. 2002, Dorazio et al. 2005, M€ antyniemi et al. 2005, St. Clair et al. 2013. Hierarchical statistical models have been used to accommodate many of these heterogeneities, and to explicitly model the distribution of abundance and capture probabilities. For example, M€ antyniemi et al. (2005) developed a model that accommodated heterogeneous capture probabilities among animals by assuming the distribution of capture probabilities among individuals followed a beta distribution. Others explicitly modeled the distribution of capture probabilities and abundance among sites, years, or both for designs where entire experiments were replicated (Dorazio et al. 2005, Royle and Dorazio 2006, Rivot et al. 2008. Despite recent advances, there often remain practical limitations to the implementation of removal models for estimating abundance at broad spatial extents. With some exceptions (e.g., Rivot et al. 2008), there has been little consideration of experiments that are replicated in time, and data from individual years are often modeled separately (St. Clair et al. 2013) despite the possibility of information sharing among temporal replicates. Existing models also assume the number of animals removed and sampling efforts executed on each trial are perfectly observed (Borchers et al. 2002). Perfect observation is not realistic for many broad-scale monitoring programs collecting removal data, particularly for managed species where removals occur through recreational or commercial harvest and monitoring produces estimates of harvest and effort from samples. Measurement error in removal data results in overestimation of abundance ) and limits application of removal models with many existing data sets. Thus, a pragmatic framework for the analysis of spatially and temporally replicated removal experiments in the presence of measurement error is needed. To address this need, we developed a hierarchical modeling framework for estimating abundance and population growth when removal data are not perfectly observed. In addition, we specifically sought to accommodate attributes common to harvest monitoring programs implemented by resource management agencies: variable-effort sampling, heterogeneous abundance and capture probabilities, and sampling error in the measurement of both the number of removals and removal effort. Here, we describe a framework wherein models for fixed-effort or perfectly observed data, as well as models for removal experiments replicated in either space or time, can be considered as special cases. We demonstrate the methods to estimate region-specific abundance and growth of wild turkey (Meleagris gallopavo) populations in southern Michigan, USA, and use simulation to evaluate estimator performance under a variety of scenarios.

Model description
We developed models for replicated removal experiments based on a model for variable-effort sampling replicated over S sites and T years that Boxes indicate observed quantities, whereas circles indicate parameters and latent variables or processes. Subscripts denote animal group (i), removal trial (j), sites (s), and year (t), and p denotes a prior distribution for a random quantity (priors for regression coefficients excluded for visual clarity). Symbol definitions are provided in text, and further details of model structure are provided in Box 1. is extended to accommodate heterogeneity in removal probability and animal density, and also measurement error in removal data ( Fig. 1; Box 1). Several individual components of the model we present were described elsewhere (Borchers et al. 2002, Dorazio et al. 2005, Rivot et al. 2008, St. Clair et al. 2013). However, our goal was to synthesize these components and extend the model to a flexible approach for analysis that accommodated imperfectly observed data from spatially and temporally replicated samples (under so-called metapopulation designs; K ery and Royle 2010).
We started with a conditional binomial likelihood model for removals (Borchers et al. 2002): where N is the vector of abundances at the start of each trial (j), p j;s;t is the probability an individual animal present at the start of trial j in region s and year t is removed during that trial, and h represents a vector of parameters determining the removal probabilities. Thus, N j represents the abundance of animals at the start of trial j and y j is the number of animals removed on that trial, where N j ¼ N jÀ1 À y jÀ1 for j > 1. Under variable-effort sampling, we can model p j;s;t ¼ 1 À 1 À u ð Þ e j;s;t as a function of the magnitude of sampling effort (St. Clair et al. 2013), where u is the per-unit-effort removal probability and e j;s;t is a forcing variable representing the sampling intensity. As such, 1 À u represents the probability of not removing an animal for one unit of effort, and 1 À u ð Þ e j;s;t is the probability of not removing an animal given all e j;s;t units of effort. With this model for p j;s;t , h in equation 1 only has one element, the per-unit-effort removal probability (u).
Eq. 1 is easily extended when heterogeneity in removal probability or animal density exists among animal groups (e.g., age or sex classes; see example below): for I animal groups, where p i;j;s;t ¼ 1 À 1 À u i ð Þ e j;s;t . In this context the primary goal of modeling is to estimate the initial abundances of each animal group for each site (s) and year (t), N i;1;s;t , and the per-unit-effort removal probability for each animal group u i . Total abundance at the start of each removal sequence is the sum of abundance across all groups (N s;t ¼ P i N i;1;s;t ). Lastly, un-modeled heterogeneity in removal probability is widely known to affect performance of closed-population abundance estimators (Moran 1951, Bohlin and Sundstrӧm 1977, Borchers et al. 2002, M€ antyniemi et al. 2005, St. Clair et al. 2013. To address this, additional structure in u i can be modeled explicitly using the logistic function (e.g., for space, time, or triallevel covariates), as is well established in capturerecapture (Pollock et al. 1984, Huggins 1989, Alho 1990, Royle and Link 2002, Royle and Dorazio 2006. In their basic form, Eqs. 1 and 2 assume no specific structure to the number of animals in the population at the start of removal sampling (i.e., N i;1;s;t are unconstrained). Rather than estimating Box 1. Summary of model components for hierarchical removal model fit to wild turkey harvest data from Michigan, USA. Subscripts refer to animal groups (i), trials (j), sites (s), and years (t).
Model structure and distributions Observation models y Ã i;j;s;t $ Poisson y i;j;s;t À Á e Ã j;s;t $ Poisson e j;s;t

À Á
Removal and abundance process models y i;j;s;t $ Binomial N i;j;s;t ; p i;j;s;t À Á p i;j;s;t ¼ 1 À 1 À u i;j;s;t ej;s;t N i;1;s;t $ Poisson d i;s;t A s;t À Á Capture probability and density models logit u i;j;s;t and hyper-priors e j;s;t $ Uniform 0; 8000 ð Þ g t $ Normal 0; r 2 g r 2 g $ Uniform 0; 0:4 ð Þ c t $ Normal 0; r 2 c r 2 c $ Uniform 0; 1 ð Þ b 0 $ Cauchy 10 ð Þ b k $ Cauchy 2:5 ð Þ a 0 $ Normal 0; 1 ð Þ a k $ Normal 0; 1 ð Þ ❖ www.esajournals.org initial abundances for each removal experiment as unconstrained, we modeled the distribution of initial abundance as an inhomogeneous Poisson process (Dorazio et al. 2005, Royle and Dorazio 2006, Rivot et al. 2008: where d i;s;t is the density of animals from group i at the start of sampling at site s, and A s;t represents the area sampled for a given site and year (Rivot et al. 2008, St. Clair et al. 2013). The hierarchical structure of Eqs. 3 and 4 provides a constraint on the values of N i;1;s;t and also provides a flexible framework for modeling spatial-temporal heterogeneity in animal density using log-linear regression Dorazio 2008, St. Clair et al. 2013).
We modeled a stochastic observation process to accommodate imperfect observation of variable-effort removal data. Eqs. 1-4 treat removals and effort as known without error and are applicable when perfect observations are feasible (e.g., stream electrofishing surveys, harvest with mandatory reporting). Yet in many practical applications, especially for managed populations where removals are coming from recreational or commercial harvest, observations of y i;j;s;t and e j;s;t are estimated from samples and therefore depart from their true values due to measurement error. Thus, we augment the basic model with an observation model that treats true values of removal and effort on each trial (y i;j;s;t and e j;s;t ) as imperfectly observed latent variables whose values determine the expectations of the observed data. Under this model, the true value of effort generates the true removal on each trial through a latent first-order Markov process (Eq. 1 or 2), which in turn determines the expectation of the observed removal for trial j. Specifically, we use a Poisson observation model to represent measurement error in the observed values of removal (y Ã i;j;s;t ) and effort (e Ã j;s;t ): y Ã i;j;s;t $ Poisson y i;j;s;t À Á (5) e Ã j;s;t $ Poisson e j;s;t À Á : (6) . Eqs. 5 and 6 can approximate many applications where over-and underestimation of removal and effort are both plausible, and the observation models could also be adapted for specific applications if additional information on the measurement process were available.
Lastly, we employ a Bayesian paradigm for all model fitting and inference in order to facilitate a straightforward and flexible framework for application. The equations and distributional assumptions about the data and parameters ultimately determine the posterior distribution of parameters of interest, upon which inferences will be made (Box 1). Conceptually, the joint posterior distribution, given the observed data, is proportional to the joint likelihood of the data and the latent states multiplied by the relevant prior distributions: where p e ð Þ, p u , and p d ð Þ represent the prior distributions for true sampling efforts, per-uniteffort capture probability parameters, and animal density parameters, respectively. More specifically, Eqs. 5 and 6 specify probability distributions for the observed data given the latent removal and sampling effort (f ðy Ã ; e Ã jy; eÞ). Eqs. 3 and 4 give probability distributions for initial abundance given density (f Nj; d ð Þ), and Eqs. 1 and 2 describe how likely a sequence of latent true removals are, conditioned on the initial abundances, per-unit-effort capture probabilities, and latent values of true sampling effort (f ðyjN; u; eÞ). In practice, the joint distributions and priors of Eq. 7 become functions of additional parameters when structure in u or d is modeled using hierarchical regression (Box 1).

Example using wild turkey harvest data
We fit hierarchical removal models to a wild turkey (hereafter turkey) dataset from southern Michigan, USA (Appendix S1). Turkeys are a popular game species in North America and are managed to provide recreational hunting opportunities (Healy andPowell 2000, Harris 2010). The regulatory framework in Michigan consists of 2 discrete annual hunting seasons, with maleonly harvests during spring and either-sex harvests in the fall of each year (Kurzejeski andVangilder 1992, Healy andPowell 2000). Male-only spring harvest is considerably larger than fall ❖ www.esajournals.org either-sex harvest in Michigan. Estimates of daily harvest and hunter effort across the spring hunting season contain measurement error but enable development of variable-effort removal models (Appendix S1). We developed models using estimates of male-only harvest and hunter effort from recreational spring hunting, and consequently estimated abundance and growth for the male segment of the population. Male-focused monitoring is the current status quo of turkey management (Kurzejeski and Vangilder 1992, Lint et al. 1995, Healy and Powell 2000, as it remains challenging to collect reliable data from females. We fit removal models using data replicated over 5 geographic sites for a period of 14 yr (2002-2015; Appendix S1: Fig. S1). The 5 sites (hereafter regions) are regions used to manage spring harvest, where individual regions range in size from 6739 to 16,148 km 2 (Appendix S1: Table S1). Hunting occurs each spring from mid-April through the end of May, and hunting seasons varied annually in length (range 39-45 d; Appendix S1: Table S2) but were consistent among regions each year. Each hunter is legally allowed to harvest one male turkey each spring, but hunter efforts varied among regions and also by day within each hunting season (Appendix S1: Fig. S2, Table S2). Peaks in effort occurred at the beginning of the season and on weekends thereafter but at a reduced magnitude (Appendix S1: Fig. S2, Table S3). Estimates of daily harvest and effort constitute a spatially and temporally replicated but imperfectly observed removal sample from each region and year, where individual days and the number of hunters per day represented removal trials and sampling efforts (Appendix S1).
We fit multiple models to reflect hypothesized heterogeneity in removal probability and turkey density (Appendix S2: Table S1). The most complicated capture probability model we considered contained age-group effects (juvenile and adult male turkeys) and within-experiment trends over removal trials (u changes linearly over a hunting season), a categorical spatial effect representing differences among regions (removal trial sites), and an annual random intercept to reflect yearly changes in removal probability: where g t $ Normal 0; r 2 g and Age i is a binary variable indicating juvenile or adult turkeys. In addition, we considered models with reduced versions of Eq. 8, without the annual effect (g t ), and without the site effect (both with and without g t ), resulting in 4 total capture probability models. Both heterogeneity in removal probability among individual animals and behavioral responses to initial removal create a pattern of declining removal probability over trials as more easily captured animals are removed, resulting in negatively biased abundance estimates if not accounted for (Moran 1951, Bohlin and Sundstrӧm 1977, Borchers et al. 2002, M€ antyniemi et al. 2005, St. Clair et al. 2013. Our use of the within-experiment time trend effect (b 2 j) was intended to capture such a pattern if it existed, as has been successfully demonstrated elsewhere (Schnute 1983, Hayes et al. 2007, K ery and Royle 2010. We expected a decline in removal probability over trials if individual heterogeneity or behavioral responses to removal were present. We also considered 4 hypothesized models of turkey density, where all density models included effects for age-group and linear time trends describing changes in density over the duration of the study, as well as an interaction term that allowed for age-specific trends in abundance over time. The most complicated density model also included a categorical spatial effect and annual random intercepts to reflect hypothesized spatial-temporal heterogeneity in average turkey density: where c t $ Normal 0; r 2 c . We also considered models with reduced versions of Eq. 9, without the annual effect (c t ), and without the site effect (both with and without c t ). We fit 16 total models (4 removal Â 4 density; Appendix S2: Table S1) and ranked their relative support using the deviance information criteria (DIC; Spiegelhalter et al. 2002).
We used weakly informative Cauchy priors for all logit-scale regression coefficients (Gelman et al. 2008), normal priors for all log-scale coefficients, and uniform priors for all random effect variance parameters (Gelman et al. 2014; Box 1). Because per-unit effort capture probabilities are close to zero for the levels of effort observed in this study, we used a Cauchy prior with scale = 10 for the logit-scale intercept parameter, which produces a u-shaped distribution on the real probability scale that is appropriate when mean probabilities are close to zero or one (Appendix S2: Fig. S1; Gelman et al. 2008). Similarly, we used Cauchy priors with scale = 2.5 for additional logit-scale regression coefficients, which produces a distribution on the real probability scale with more support for intermediate values (Gelman et al. 2008). We used standard normal priors for log-scale fixed effects coefficients because when exponentiated these produce realistic priors for the density of male turkeys (Appendix S2: Fig. S2), which rarely exceeds 5 birds per km 2 in excellent habitats (Vangilder 1992). We used uniform priors for all logit-(Uniform 0; 0:4 ð Þ) and log-scaled (Uniform 0; 1 ð Þ) random-intercept variances to restrict their distributions to plausible values, as variances approaching the upper bounds of these distributions result in unrealistically large changes to removal probabilities and densities, respectively (Appendix S2). Lastly, because true daily efforts are unobserved latent states when measurement error is included, prior distributions are also needed for daily efforts for all sites, trials, and years (note that priors for latent removals are specified indirectly through model structure). We assumed a uniform prior for daily hunter effort that was sufficiently broad so-as to encompass all plausible values (Uniform 0; 8000 ð Þ), given the daily efforts that were observed (e.g., Appendix S1: Fig. S2; Table S3).
We generated Markov Chain Monte Carlo (MCMC) samples to make inferences from posterior distributions of turkey abundance and population trends. We generated MCMC samples using JAGS (version 4.0.1; Plummer 2003) called from within R (version 3.2.0; R Core Team 2015) using the R2jags package (Su and Yajima 2015). For each model, we retained 200,000 MCMC samples from each of 3 chains (600,000 total posterior samples) after an initial burn-in period of 2 million samples and assessed convergence using multivariate Gelman-Rubin statistics (Gelman and Rubin 1992) and trace plots of structural parameters (see Data S1 for an example JAGS model statement). In addition to monitoring structural model parameters and initial abundances for each region and year, we also generated MCMC samples from posterior distributions of finite rates of population change (i.e., k t ¼ N t =N tÀ1 ) and annual harvest rates Lastly, to understand the implications of model structure for inferences, we conducted a sensitivity analysis using alternative parameterizations of the top model. Specifically, we sought to determine whether accommodating measurement error in the form of an explicit observation model was necessary for inferences from these data, and also the effect of assuming that regional (site-level) abundances shared no structured stochastic dependencies (i.e., abundances not constrained by a hierarchical structure). Thus, we re-fit the top model without measurement error, assuming daily harvest and effort were perfectly observed. We also fit models where there was no explicit structure used to constrain the spatial distribution of turkey abundance, but instead only vague a priori information about population size (but still retaining the removal probability model identical to the original model). The unstructured abundance model assumed discrete uniform priors for initial abundance at the start of each removal experiment and thus placed less constraint on estimates of local abundance (Appendix S2: Table S2). The model with unstructured abundance was fit both including and excluding measurement error, and sensitivity of abundance and population growth estimates was assessed among these alternative parameterizations.

Simulating model performance
We simulated data generation, model fitting, and parameter estimation to determine the ability of our hierarchical models to accurately estimate population sizes and trends. We simulated harvest data using observed estimates of regionspecific hunter efforts as true effort, the top model from analyses described above as the data generating model, and posterior mean estimates of parameters from the top model as true values for generating harvest data. We simulated imperfectly observed removal data for each of the 5 management regions under 2 scenarios of temporal replication: (1) one year of sampling, and therefore no temporal replication, and (2) 10 consecutive years of sampling. Performance of removal models is known to be sensitive to fraction of the population removed, where larger removals result in improved accuracy (Zippin 1956, Dorazio et al. 2005, St. Clair et al. 2013. Thus, we also simulated 6 scenarios where cumulative removal rates were altered through changes to the number of removal trials and daily hunter effort. We considered scenarios where the number of removal trials (i.e., days of hunting) was approximately equal to (40 d), one half of (20 d), or one quarter of (10 d) the length of turkey hunting seasons in southern Michigan. We replicated each of these scenarios over 2 scenarios of hunter effort: (1) high effort, where true region-specific daily hunter efforts were those observed in Michigan, or (2) low effort, where true region-specific daily efforts were half of those observed.
To understand performance of simplified and less constrained versions of the model relative to the model structure described above, we replicated model fitting and parameter estimation for all 12 simulation scenarios described above using 4 different model parameterizations: (1) using the true data generating model with a hierarchical Poisson abundance structure and measurement error in harvest and effort data (i.e., model correctly specified), (2) the correct density model with a hierarchical Poisson abundance structure but excluding measurement error (i.e., making an assumption that data were perfectly observed but the model was otherwise correctly specified), (3) the unstructured abundance model including measurement error, and (4) the unstructured abundance model excluding measurement error (Appendix S2: Table S2). Thus, we assessed performance of removal models under 48 combinations of the temporal replication of experiments, number of removal trials, magnitude of hunter efforts on each trial, and estimation model structure (Appendix S3: Table S1). For each fitted model, we assessed convergence and relative error ( estimate À truth ð Þ =truth) of region-specific and total estimates of abundance and the finite rate of population growth over the entire study (only for scenarios with temporal replication; k ¼ N 10 =N 1 ). We used posterior means as point estimates of abundance and population growth, used the average relative error for each scenario to indicate bias, and also determined the fraction of 95% credible intervals that contained the true values of abundance and finite rates of change for each scenario. We replicated data generation and parameter estimation 100 times for each scenario, and additional simulation details are provided in Appendix S3.

RESULTS
The top model estimated that turkey abundance in southern Michigan was stable-to-increasing over the study, despite short-term annual fluctuations (Fig. 2). Total abundance peaked at 52,911 males in 2008, which was followed by a short-term decline and then growth to 49,410 males in 2015 (Fig. 2). The best model included a time effect that shifted average turkey density annually, spatial heterogeneity in density among regions, and age-specific density trends over the study (Appendix S2: Table S1, Figs. S4, S6). This model also estimated a larger removal probability for adult males than juveniles and spatial heterogeneity in removal among management regions, but little change in removal probabilities across trials within a hunting season (Appendix S2: Figs. S5, S7-S8). Estimates of population growth from the top model were synchronous among management regions but varied annually despite the spatial heterogeneity in turkey density (Fig. 3). Estimated annual population growth for the entire study area ranged from a low of 0.81 in 2010-2011 to a high of 1.16 in 2011-2012 (Fig. 3). Moreover, estimated annual harvest rates ranged from a low of 0.27 for juvenile males in region ZA to a high of 0.79 for adult males in region ZB (Appendix S2: Fig. S9). Estimated harvest rates were generally larger for adult males than juveniles, but the magnitude and temporal trends of harvest rates were region specific and affected by patterns of hunter effort (Appendix S1: Table S2).
Sensitivity analyses demonstrated that abundance estimates were generally robust when a model was used to describe spatial-temporal structure in turkey density, whereas ignoring measurement error changed the scale of abundance estimates when an unconstrained abundance model was used (Fig. 4). Estimates of abundance and population growth were similar when the hierarchical density model was used to constrain abundance, irrespective of the inclusion of measurement error in the fitted model (Fig. 4). However, differences in estimates were more pronounced when abundance lacked an explicit structure (i.e., when only constrained through vague uniform priors), and the model lacking both measurement error and an abundance model estimated smaller turkey abundance relative to the other three parameterizations (Fig. 4). Despite differences in the absolute scale of abundance estimates, both models with unconstrained abundance shared similar estimates of population growth over time, irrespective of inclusion of measurement error (Fig. 4b), and these population growth estimates were more variable than the models that included a Poisson abundance model. Regionspecific estimates of abundance and population growth usually showed similar sensitivity to model structure, but the magnitude of differences in estimates among models varied by region (Appendix S2: Figs. S10-S11).
Simulation results demonstrated that when specified correctly our model produced estimates of abundance and population growth that were approximately unbiased at the sample sizes and levels of hunter effort observed in this study (Table 1). Simulations assuming 10 yrs of replicated removal experiments also suggested that the model with an abundance hierarchy but ignoring measurement error produced similar abundance estimates as the true data generating model that included measurement error, with a larger proportion of model fits resulting in convergence (Table 2). However, the model including measurement error had better coverage of credible intervals, which was 95% for all management regions ( Table 1). The corresponding models without measurement error resulted in credible interval coverage less than 95% (range = 0.84-0.89). Reducing hunter effort by half resulted in worse performance for models including an abundance hierarchy, with fewer model fits converging and larger magnitude of bias (Tables 1-2). The model including measurement error still had reasonable credible interval coverage under the reduced effort scenario (0.90-0.95 for models that converged); however, the model ignoring measurement error had degraded coverage (0.63-0.74). Despite the bias of estimated abundance, estimates of population trends from the base hierarchical abundance models under low effort scenarios were approximately unbiased. Moreover, reducing the number of removal trials degraded performance and reliability of the models with an abundance structure, with decreased rates of convergence and increased bias of abundance estimates (Appendix S3: Tables S2-S3).
Simulations also demonstrated that when the unstructured abundance model was assumed in the presence of measurement error, abundance estimates were biased for all scenarios (Table 1). Convergence was achieved for all models fit assuming an unstructured abundance model; however, these models severely overestimated abundance, and abundance estimates more than twice truth were common. Unstructured models that included measurement error resulted in more severe positive bias for abundance estimates than unstructured models that simply ignored measurement error, and these biases were typically made worse by reducing hunter effort or the number of removal trials (Table 1,  Appendix S3: Tables S2-S3). Despite their inability to accurately estimate absolute abundance, unstructured models (both including and excluding measurement error) were able to estimate population trends over time reasonably well when hunter effort was high (i.e., at levels observed in this study). Nonetheless, negative bias emerged for population growth estimates with unstructured models when hunter effort or the number of removal trials was reduced (Appendix S3: Table S3). Lastly, performance was poor for all model structures fit to data without temporal replication (i.e., only one year of data), where failed convergence, biased abundance estimates, or both were observed under all scenarios (Appendix S3: Table S4).

DISCUSSION
We developed a hierarchical framework for estimating animal abundance and population growth from replicated but imperfectly observed removal experiments. This framework synthesizes existing removal methods and extends the models to accommodate measurement error in sampling intensity and the number of removals Fig. 4. Differences in estimates of abundance (N) and of finite rate of population change (k) for turkeys in southern Michigan arising from corresponding models accounting for (ME) and ignoring measurement error (No-ME) in daily hunter effort and harvest data, and using different hierarchical structures for the distribution of abundance (Poisson: Base; no hierarchy: Unstructured). This figure demonstrates sensitivity of estimates to changes in model structure.
per trial via an explicit observation process. Estimates for our example with wild turkeys in Michigan, USA, suggested stable-to-increasing populations despite spatial-temporal fluctuations in turkey density. Sensitivity and simulation analyses demonstrated these estimates were robust and approximately unbiased at the levels of hunter effort observed. Simulations also demonstrated that accuracy of abundance estimates was sensitive to inclusion of a model to explicitly account for heterogeneity in density and abundance, implying weak ability to determine the absolute scale of abundance in the presence of measurement error if heterogeneous density was not modeled directly but instead abundances were unconstrained (e.g., the common approach where removal data from individual years are analyzed separately; St. Clair et al. 2013). Broad-scale estimates of population growth from our model were also robust and relatively insensitive to model parameterization. Moreover, the framework we describe is flexible and additional data or prior information could be used by tailoring the models to attributes of Table 1. Results of simulation study used to assess performance of posterior mean estimates of abundance (N) and finite rate of change (k ¼N t¼10 =N t¼1 ) for removal studies replicated over a ten-year period. Notes: The fraction of simulated model fits that resulted in convergence (Convergence) as well as average relative error of abundance estimates from year 10 are presented by turkey management region (Region), and for all regions combined (Total). Results are presented for simulation scenarios defined by the magnitude of hunter effort (High, Low), the hierarchical structure of the fitted abundance model (Poisson, Unstructured), and the presence of measurement error in fitted model (Measurement error, No measurement error), for the simulation scenario where the number of trials in each removal experiment was equal to 40 d of hunting.
† Mathematical details of each model structure and simulation scenario are described in Appendix S2: Tables S1-S2, and Appendix S3.
specific study systems and monitoring programs.
The hierarchical model we describe provided estimates that were robust and approximately unbiased in the presence of measurement error under sample sizes and removal rates observed in our study. The model used an inhomogeneous Poisson process to explicitly model changes in density and induce stochastic dependence on the distribution of abundances, and provided estimates of abundance and population growth that were approximately unbiased. Surprisingly, this model also produced abundance estimates that were robust to measurement error in the removal data, even when it was ignored by excluding the observation model. The Poisson distribution has been used by numerous authors to provide structure and constraint to local abundance estimates arising from studies under a metapopulation design (Dorazio et al. 2005, Royle and Dorazio 2006, Rivot et al. 2008, K ery and Royle 2010, St. Clair et al. 2013. Our results imply such models can be used to constrain abundance estimates to realistic values and ameliorate the bias Table 2. Results of simulation study used to assess performance of posterior mean estimates of abundance (N) and finite rate of change (k ¼N t¼10 =N t¼1 ) for removal studies replicated over a 10-yr period.

Notes:
The fraction of simulated model fits that resulted in 95% credible intervals that contained the true abundance or finite rate if change (for model fits that converged) are presented by turkey management region (Region), and for all regions combined (Total). Results are presented for simulation scenarios defined by the magnitude of hunter effort (High, Low), the hierarchical structure of the fitted abundance model (Poisson, Unstructured), and the presence of measurement error in fitted model (Measurement error, No measurement error), for the simulation scenario where the number of trials in each removal experiment was equal to 40 d of hunting.
† Mathematical details of each model structure and simulation scenario are described in Appendix S2: Tables S1-S2, and Appendix S3. that typically results from measurement error in removal data. This appears to be a novel finding with respect to removal models, as we are not aware of other studies evaluating the implications of measurement error when a hierarchical model is used to represent deterministic (i.e., covariate effects) and stochastic structure in regional abundance for spatially replicated removal experiments. However, excluding the observation model did result in overestimation of precision in the presence of measurement error. Thus, there remain tangible benefits of using the more complicated model that includes an explicit observation process, as it resulted in a more reliable assessment of uncertainty.
Absent a model inducing structure on the distribution of density and abundance, the implications of measurement error for estimating abundance via removal methods were substantial. Eliminating the model for abundance resulted in greater sensitivity of estimates to observation model structure (i.e., including and excluding an observation process), and also greater variability of estimates over time for turkey populations in Michigan (i.e., no shrinkage; Royle and Link 2002). The effect of measurement error on estimator performance when abundance was effectively unconstrained, except by a vague prior, was to inflate abundance estimates substantially. Similarly,  demonstrated positive bias in abundance estimates when removal data from single experiments were subject to measurement error, and found this directional bias was consistent over an order-of-magnitude change in the true abundance. Collectively, these results imply that estimating the absolute scale of abundance via removal models is more difficult when removals and effort are not perfectly observed, that inappropriate scaling caused by measurement error results in large positive bias in abundance estimates if ignored, and that without the structure induced by appropriately modeling spatial-temporal changes in density, the estimators are suspect.
Our analyses demonstrate that estimation problems induced by measurement error can be ameliorated when information is shared among replicated experiments and the systematic changes to animal density are explicitly modeled. Estimation challenges for closed-population abundance estimators with latent states are well described in the literature (e.g., mixture models in capture-recapture; Coull andAgresti 1999, Dorazio andRoyle 2003). Estimation challenges are commonly used to justify modeling abundance hierarchically under metapopulation designs as we did in this study (Royle and Dorazio 2006, Royle et al. 2007, K ery and Royle 2010, which acts along with prior distributions to regulate parameter estimates derived from posterior distributions under a Bayesian paradigm (Hooten and Hobbs 2015). This effectively allows researchers to mathematically constrain abundance estimates to biologically plausible ranges. This is done via use of hierarchical structures and prior distributions on animal abundance and density, respectively, as we did in this study to prevent unrealistically large densities of turkeys. This approach is defensible for application of complex hierarchical models in ecology, as one is using what is already known to develop constraints on parameters in order that they remain biologically feasible (Hooten and Hobbs 2015). Moreover, comparison of posterior and prior distributions from our example analyses (Appendix S2) clearly shows the data were sufficient for estimating abundance under the model that included both an abundance hierarchy and an explicit observation process. Thus, while the hierarchical model acted to constrain abundance to plausible values, the data were also informative about spatial-temporal changes in turkey abundance within the constraints of the model.
Despite the sensitivity of abundance estimator performance to inclusion of an explicit density model, estimates of population growth were more robust to model structure. Our simulations demonstrated that all of the model parameterizations considered were able to reasonably estimate decadal-scale population trends in the presence of measurement error at the observed sample sizes and removal rates. Even models with unstructured regional abundances that overestimated the scale of abundance for a given year could reliably depict the finite rate of change over a 10-yr period (i.e., positive bias in abundance was consistent over time). That is, simulations showed that the models were capable of reliably estimating long-term lambda even when estimates of absolute abundance were biased. Thus, assessing long-term population trends may be easier than accurately scaling abundance when removal data are imperfectly observed. Few studies have considered models for temporally replicated removal experiments (but see Rivot et al. 2008), and their usefulness for assessing population trends for managed species appears to be underappreciated. Although trends provide less information to decision makers than absolute abundance (Nichols and Williams 2006), we believe the reliable trend estimates produced by our model are more informative and useful for decision making than traditional indices (e.g., catch-per-unit effort) used to monitor species like the wild turkey, as the models can explicitly account for changes to both animal density and per-unit-effort removal probability. Moreover, estimates of population trends are often used by practitioners to inform natural resource management decisions (e.g., when adjusting harvest regulations), as estimates of absolute abundance can lack context to stakeholders and decision makers without comparable historical information on population size.
We demonstrated reasonable performance for hierarchical removal models that accommodate measurement error under conditions similar to those observed for our example, but performance was sensitive to the fraction of the population removed and the availability of replicated data. Reducing the fraction of the population removed, manifested through reduction to either the magnitude of sampling effort or the total number of removal trials, degraded performance considerably. Under these conditions, the models often failed to converge and bias of abundance estimates increased. These issues are widely recognized in the literature on capture-recapture and removal models, where performance is tightly linked to the overall fraction of the population sampled (Zippin 1956, Coull and Agresti 1999, Dorazio and Royle 2003, Dorazio et al. 2005, M€ antyniemi et al. 2005, St. Clair et al. 2013. Our simulations indicate that problems associated with removing only a small fraction of the population will likely manifest themselves through failed model convergence. In this light and given the wide variety of contexts for which removal models may be applied, we suggest it wise to use simulation to assess estimator performance for specific applications when removal data are subject to measurement error. In addition, larger samples of simulated parameter estimates would benefit future studies. We chose to simulate a large number of scenarios (48), which came at the expense of large numbers of iterations of data generation and model fitting per scenario (100) due to computational limitations and demands of model fitting. While we believe this was adequate for assessment of central tendency and bias of estimators, we acknowledge that more simulation replicates would be useful to provide improved assessment of precision and coverage of credible intervals.
While simulations indicated reasonable performance for removal experiments replicated in space and time similar to our example dataset, our models were data hungry and did not perform reliably in the face of measurement error when spatially replicated removal experiments were not also replicated in time. This suggests substantial replication is likely a prerequisite for applying these models, which could limit their application. In many applied settings, spatial replicates are fixed and represent the entire area of interest (e.g., a set of management regions). Additional replication of such experiments can only come in time, and our simulations suggest this replication is necessary. However, it is not entirely clear what amount of temporal replication will be necessary for reliable application of the model under different levels of spatial replication. It seems likely that replicated observations of the removal processes and information sharing among experiments is what is needed, and thus the necessary amount of temporal replication may depend on the number of spatial replicates, and vice versa. In other contexts, the number of spatial replicates could be much larger and require less temporal replication for reliable inferences. Thus, while the reliable performance of models required considerable replication of removal data, we expect the necessary amount of replication to depend on the context of the study. This further supports our suggestion that simulation-based assessment of estimator performance for specific applications is wise when removal data are subject to measurement error.
Although models for removal data affected by measurement error have some limitations, the hierarchical framework described provides a high degree of flexibility that can enable further tailoring of models to specific species and sampling protocols. In our example, we focused on relatively simple model structures to describe spatial-temporal heterogeneity in removal and animal density (fixed effects and iid normal random intercepts), yet the framework is flexible enough to accommodate a variety of model structures. For example, this could include logit-or log-scale multivariate normal effects that induce a correlation structure on realized values of density and removal probability (Royle and Link 2002, Royle and Dorazio 2006, or conditional auto-regressive models to induce spatial correlation among regions (Lichstein et al. 2002, Webster et al. 2008). In addition, alternative observation models could be tailored to the characteristics of specific monitoring programs. For instance, one might be able to use sampling theory (Cochran 1977) to derive observation-error variances for estimated removal data under specific survey designs, instead of assuming the variance-mean relationship of a Poisson distribution. Lastly, given the variety of contexts for which removal models are applied, extensions for sampling specific taxonomic groups may also be possible, where examples include the analysis of removal data from avian point counts and stream electrofishing surveys with measurement error due to species misidentification. Consequently, the framework described here provides a flexible starting point for fine-tuning population assessment for a variety of species using imperfectly observed data arising from removal sampling protocols.
Lastly, our analyses indicated male turkey populations in Michigan were stable-to-increasing, with heterogeneous density among management regions that also varied annually, and heterogeneous per-unit-effort removal probabilities by animal group and among regions. We estimated fluctuations in abundance consistent with known dynamics of turkey populations, where previous authors suggested population fluctuations of up to 40% annually are possible (Kurzejeski andVangilder 1992, Healy andPowell 2000). These large annual fluctuations were synchronous across southern Michigan despite a smaller magnitude of spatial heterogeneity in density, which is again consistent with earlier work suggesting broad-scale synchrony in turkey abundance is common (Fleming and Porter 2007). Higher removal probabilities for adult males than for juvenile males were also anticipated, as multiple studies reported that adult males are more vulnerable to harvest than juveniles (Vangilder 1992, Healy and Powell 2000, Chamberlain et al., 2012. Previous research also suggested harvest rates for males might vary among management regions (Diefenbach et al. 2012), which would be expected even in the absence of spatially heterogeneous capture probabilities given variation of hunter effort in space (Clawsen et al. 2015, Stevens et al. 2020. We suspected a priori that removal probability would decline over the hunting season due to capture heterogeneity among individual animals, behavioral responses to harvest, or some combination of these and other factors, yet we found little evidence of this pattern. It is possible that other parameterizations could better portray systematic shifts to capture heterogeneity over a hunting season. For example, a categorical effect over trials could capture behavioral shifts of animals during the hunting season (i.e., early season vs late season), or linear trends over the season may be modeled as a function of cumulative hunter effort (and therefore cumulative exploitation) up to a given trial, instead of declining at a constant rate over trials. Nonetheless, our estimates appear robust and indicate turkey populations in Michigan are stable. This directly contradicts, at least for our study region, a widespread belief that turkey populations in the eastern USA are in decline (Kobilinsky 2018). Such beliefs are often based on monitoring turkey populations using raw harvest or catch-per-unit effort indices that do not account for spatial-temporal changes in both density and removal probability. Thus, we provide assessment tools that enable a more refined understanding of population trends for harvested species like wild turkeys, which in this case facilitated a shifting perception of the status of populations across the study region. In the context of removal via commercial or recreational harvest, our models also provide the ability to estimate abundance and exploitation rates arising from a given set of management regulations, both of which are useful for simulation-based assessment of sustainable harvest strategies (Quinn and Collie 2005, Deroba and Bence 2008, Punt et al. 2016). Our models therefore provide value beyond the assessment of status and should further enable rigorous assessment of management decisions for target species.