Intrinsic Bayesian Analysis for Occupancy Models

Occupancy models are typically used to determine the probability of a species being present at a given site while accounting for imperfect detection. The survey data underlying these models often include information on several predictors that could potentially characterize habitat suitability and species detectability. Because these variables might not all be relevant, model selection techniques are necessary in this context. In practice, model selection is performed using the Akaike Information Criterion (AIC), as few other alternatives are available. This paper builds an objective Bayesian variable selection framework for occupancy models through the intrinsic prior methodology. The procedure incorporates priors on the model space that account for test multiplicity and respect the polynomial hierarchy of the predictors when higher-order terms are considered. The methodology is implemented using a stochastic search algorithm that is able to thoroughly explore large spaces of occupancy models. The proposed strategy is entirely automatic and provides control of false positives without sacrificing the discovery of truly meaningful covariates. The performance of the method is evaluated and compared to AIC through a simulation study. The method is illustrated on two datasets previously studied in the literature.


Introduction
It is often the case that measurements recorded for a given response are, at best, a noisy version of the variable of interest. A particular case of this issue is known as imperfect detection, and constitutes a pervasive problem. For instance, in biological surveys subject to imperfect detection, "presence/absence" data for a given species actually become "detection/non-detection" data, since a specie may be present at a given site, but may go undetected in a survey. Ignoring imperfect detection may lead to inaccurate measurement of the presences (Guillera-Arroita et al., 2014), which generally results in biased parameter estimates (MacKenzie et al., 2002).
As defined in the ecological literature, occupancy is the proportion of sites where a target species is present, constituting a state variable instrumental to assess the distribution of species (MacKenzie et al., 2002). Over the past decade, site occupancy models have been the main tool used by ecologists to estimate occupancy while accounting for imperfect detection. Occupancy models describe the observed data by linking two processes: presence and detection. Occupancy models adapt the conventional binary regression model to produce separate estimates for presence and detection probabilities (Dorazio and Taylor-Rodriguez, 2012). This separation is possible by surveying repeatedly the sampling locations, which provides additional information to better assess if non-detection of the specie truly corresponds to its absence. Conveniently, these models can be fitted even if the number of surveys is unbalanced across sites. The core of the occupancy model is characterized by the hierarchy where y ij is the binary detection indicator at the ith site (i = 1, . . . , N ) during the jth survey (j = 1, . . . , J i ). The detection probability for event {y ij = 1} is p ij whenever the species is present; and z i is the presence indicator at the ith site with success probability ψ i . Note that the z i are imperfectly observed. At at site i, whenever the vector of detections y i = 0, then we know that z i = 1, but y i = 0 does not imply that z i = 0. To produce estimates of ψ i and p ij , site occupancy surveys collect information on several predictors with the potential to influence habitat suitability (characterizing ψ i ) and species detectability (defining p ij ). Given that some of the collected predictors may be uninformative or redundant, variable selection techniques are instrumental in identifying good models.
In this paper, we propose an objective Bayesian variable selection procedure for occupancy models. Our approach is based on intrinsic objective priors for the model parameters, and uses priors over the model space that simultaneously account for test multiplicity, and, if interactions and/or polynomial terms are considered, enforces the polynomial hierarchical structure among predictors.
Currently, variable selection procedures for occupancy models implemented in statistical software are mainly based on the Akaike Information Criterion (AIC) (Akaike, 1983). As a consequence, these procedures do not allow for valid post-selection inference and uncertainty quantification, and typically require enumerating and fitting every possible model in the space of models under consideration (e.g., Mazerolle and Mazerolle, 2013;Fiske and Chandler, 2011). In practice, this enumeration is feasible only if the model space is small enough, either because substantial knowledge about the underlying ecological processes is available to constrain the model space, or because only a few variables are considered to begin with. Nevertheless, many site occupancy surveys collect large amounts of covariate information about the sampled sites, and since the total number of candidate models grows exponentially in the number of predictors, choosing a reduced set of models based on ecological intuition becomes increasingly difficult.
The AIC is designed to find the model that is the closest to the true (unknown) model with respect to Kullback-Leibler divergence, identifying as good models those with smaller AIC values. It has been shown, however, that the AIC has certain limitations as a model selection criterion. For instance, if nested models are being considered, the AIC will not necessarily select the true model (Wasserman, 2000). In fact, the AIC generally shows a weak signal-to-noise ratio and tends to prefer more complex models, even if the true model is available (Rao and Wu, 2001). Other versions of the AIC address this issue by including a bias correction factor that enhances the signal-to-noise ratio (see Hurvich and Tsai, 1989;McQuarrie et al., 1997); however, these modified versions cannot be used with occupancy models, as they depend on the effective sample size, which is unknown for these models.
In this context, Bayesian methods are more appealing. Under regularity conditions, when the true model is contained in a fixed model space, its posterior probability converges to one as the number of sites and surveys per site both increase. In addition, if the true model is not contained in the model space, the posterior probability of the most parsimonious model closest to the true data generating process tends asymptotically to one. In the finite sample context, Bayesian methods allow for full and faithful error propagation. Furthermore, the Bayesian machinery provides the means to conduct valid inference accounting for model uncertainty.
A Bayesian selection procedure for occupancy models was described in Hooten and Hobbs (2015). However, their implementation uses informative prior distributions on the model parameters, tailored specifically to the example discussed in the paper, which prevents the approach from being applicable to occupancy problems in general. It is often the case that subjective elicitation of parameter and model prior distributions is not possible, since neither the relationship between the response and the predictors, nor the advantages of one model over another, are clearly understood. In addition, the use of seemingly innocuous subjective priors may drastically affect outcomes. This has been a recurring argument in favor of objective Bayesian procedures, which appeal to the use of formal rules to build parameter priors that incorporate the structural information inside the likelihood while utilizing some objective criterion (Kass and Wasserman, 1996).
To the best of our knowledge, the method proposed in this article is the first general Bayesian selection procedure for occupancy models, that 1. bypasses the need for hyper-parameter tuning, 2. uses priors specifically designed for testing, 3. controls for test multiplicity, and 4. accounts for the hierarchical polynomial structure in the predictors.
In building our approach, we first derive intrinsic priors (Berger and Pericchi, 1996;Moreno et al., 1998) for the model parameters in both the presence and detection components of the single-season occupancy model. For the model priors, we consider the ones proposed in Taylor-Rodriguez et al. (2015). These priors, in addition to controlling for test multiplicity, allow restricting the model space to the set of models that respect (weakly or strongly) the polynomial hierarchy among the predictors whenever interactions and higher-order terms are considered. As discussed in Peixoto (1987Peixoto ( , 1990 when covariate interactions and polynomial terms are present, failure to restrict the class of models to those respecting strong heredity may result in incoherent variable selection. This is because the model design matrices are not invariant to linear transformations of order-one predictors (e.g., recentering of the main effect variables). Using the derived intrinsic priors on the parameter space and the multiplicity correction priors on the model space, we build a fast stochastic search algorithm that allows us to thoroughly explore large spaces for the single-season occupancy model framework. This strategy is completely automatic, avoiding the need for both tuning parameters in the sampling algorithm and subjective elicitation of parameter prior distributions. Furthermore, as any other Bayesian approach, it naturally enables parameter and model uncertainty quantification.
The outline of the paper is as follows: in Section 2, we provide background on occupancy models and set notation. In Section 3, we introduce our objective Bayesian model selection method and develop the Gibbs sampler. In Section 4, we present results from a simulation study and a comparison with selection using AIC. In Section 5.2, we illustrate our methodology on two datasets, which have been previously examined in the ecological literature (Kéry et al., 2005;Kéry et al., 2010;Dorazio and Taylor-Rodriguez, 2012). We conclude the paper with a brief discussion. Code for all the tools proposed is available in the R package OccOBayes. A description of the stochastic search algorithm is included in the Appendix.

Inference for a single model
This section briefly describes the estimation procedure for a single model. Assuming the probit link, the occupancy model can be characterized in terms of latent variables, which in turn allows one to relate the detection and occupancy probabilities to predictors. We build an objective prior distribution for the regression coefficients using the expected posterior prior framework (Pérez and Berger, 2002) where we condition on both the observed data as well as the unobserved latent variables (Leon-Novelo et al., 2012).

The Occupancy Model with Probit Link
The occupancy model in (1.1) is completed in two ways. First, the probabilities for detection p ij and for presence ψ i are linked to vectors of predictors q ij and x i , respectively, through appropriate link functions, g p (p ij ) = q ij λ and g ψ (ψ i ) = x i α. We assume that the link function is the inverse standard normal cdf, leading to probit models. Other binary regression models can be fit and lead to slightly more complicated computational algorithms. Second, the parameters of the underlying space, here (α, λ), are given a prior distribution π(α, λ). This paper proposes a prior distribution building on the expected posterior prior method of Leon-Novelo et al. (2012).
Letting X and Q be the matrices whose rows are, respectively, vectors x i and q ij for i = 1, . . . , N and j = 1, . . . , J i , the Bayesian probit occupancy model is specified as where Φ is the standard normal cdf. As it will be made evident in subsequent, we explicitly condition on X and Q since the priors devised for the model parameters depend on these design matrices. Again, note that the z i are not perfectly observed. The sites with y i = 0 provide no detections but this does not necessarily imply a lack of presence. Thus, the model is a zero-inflated binary regression model where both lack of presence and individual instances of detection are predicted with covariates. The observed data vectors for the sites, y 1 , . . . , y n , are independent given (α, λ) and The model can be expanded in the spirit of Albert and Chib (1993) by introducing latent variables at each level. Let v i be the underlying continuous latent variable for presence at site i and w ij be the underlying continuous latent variable for detection during survey j from site i. The hierarchical model in (2.1) becomes When one uses a multivariate normal prior for (α, λ), the model in (2.2) can be fit using a Gibbs sampler. As described in Dorazio and Taylor-Rodriguez (2012), the only complication in using a Gibbs sampler is the fact that the sign of v i determines the value of z i and so the Gibbs sampler has to proceed in two blocks. The first block, which corresponds to a multivariate normal draw, is (α, λ|z, v, w, y, Q, X). The second block is (v, w, z|α, λ, y, Q, X). Each z i is drawn from the distribution [z i |α, λ, y, Q, X], which is a Bernoulli distribution with probability of success and the v i and w ij are sampled independently from their full conditionals. Each v i has a truncated normal distribution with mean x i α and variance 1, restricted to the positive real line when z i = 1 and to the negative real line when z i = 0.
Each w ij has a truncated normal distribution with mean q ij λ and variance 1 that is supported on the positive real line when z i y ij = 1, the negative real line when z i (1 − y ij ) = 1, and the whole real line when z i = 0. The marginal p(y|X, Q) for the observed data can be estimated using the output from the Gibbs sampler (Chib, 1995). In this sampling scheme, one can also perform parameter expansions for both v and w (Liu and Wu, 1999). These dramatically decrease the autocorrelation between successive samples and reduces the asymptotic variance of estimators (Roy and Hobert, 2007).
Alternatively, one can perform inference for the model specified in (2.1) directly using a Metropolis-Hastings algorithm (e.g., an independence chain, a random walk, or Hamiltonian Monte Carlo). The output of the Metropolis Hastings algorithm can be used to estimate the marginal of the observed data using the method outlined in Chib and Jeliazkov (2001). When the sample size is large, an independence chain, using the Laplace approximation to the posterior as a proposal density, provides accurate numerical estimates of the posterior evaluated at its mode in a relatively small number of samples.

An Objective Prior for (α, λ)
Intrinsic priors, as defined by Moreno et al. (1998), are an example of expected posterior priors (Pérez and Berger, 2002). Concisely, an expected posterior prior for parameter θ with prior π M under a model M is given by where D is some imaginary data that is integrated out, p M (θ|D, π M ) is the posterior of θ given data D under the model M with parameter prior π M , and m 0 is a fixed distribution for generating the data D. The properties of the data D are determined by the investigator. For regression problems, this amounts to determining the number of samples in the response and the associated design matrix. The generating model m 0 for the data D is usually taken to be a simple model, for instance an intercept-only model. Thus, the expected posterior prior under model M is calibrated to the distribution m 0 .
Consider the context of multiple models, M 0 , M 1 , . . . , , M K , where M 0 is nested in M k for all k and model M k has parameter θ k with non-informative (often improper) prior π N k . In this context, M 0 is referred to as the base model. The intrinsic prior for each model is computed as where D k is a training sample for model M k and m N 0 is the marginal density for D k under the base model. For the intrinsic prior, D k is taken to be a minimal training sample for model M k under the prior π N M k , which is a dataset of the smallest possible size that provides a proper posterior for p N M k (θ k |D k , π N k ). Of course, the intrinsic prior for the base model is just its original non-informative prior. When the prior for model M k is improper and only defined up to a multiplicative constant c k , the intrinsic prior framework removes the ambiguity of these constants and each intrinsic prior is defined up to a common multiplicative constant c 0 .
An extension of the intrinsic prior framework is to have the datasets D k include both observable and unobservable latent variables. Leon-Novelo et al. (2012) used this approach in computing an objective prior for standard probit regression. There, the authors conditioned on both the observed binary data as well as the unobserved continuous latent variables. Following their development, we form an objective prior for the occupancy model by conditioning on the unobserved latent presence variables (z) as well as the unobserved continuous latent variables for both presence and detection (v, w). We refer to this objective prior as an intrinsic, prior though its derivation differs from that in Moreno et al. (1998) and Berger and Pericchi (1996).
Specifically, let X 0 and Q 0 be design matrices for presence and detection in the model M 0 and let X and Q be design matrices for a model M that nests M 0 . Let (α, λ) and (α 0 , λ 0 ) be the parameters of M and M 0 , respectively. Further, assume that the prior distributions for the parameters under each model are constant, π N M = c M and π N 0 = c 0 . The intrinsic prior for (α, λ) is given by 3) where the "∼" over variables indicates that these correspond to the training sample that is to be integrated out. The formula in (2.3) is greatly simplified by the fact that, under the non-informative prior, (α, λ) are conditionally independent of (z,ỹ) given the continuous latents (ṽ,w). Moreover, the α and λ are conditionally independent of each other given the continuous latents. Thus, where the last equality follows from the assumptions of (2.2) and the prior independence of α and λ under π N M . Both of the integrals in (2.4) are of the form of the integrals in Leon-Novelo et al. (2012). Thus, the intrinsic prior is given by a product of singular normal distributions.
The explication of these priors is greatly aided by the introduction of additional notation. Because M 0 is nested in M , we can write X = (X 0 X A ) and Q = (Q 0 Q A ) and can do the same for the design matrices for the minimal training sample. Similarly, we can write α = (α 0 , α A ) and λ = (λ 0 , λ A ) . The intrinsic prior is given by whereH 0z andH 0y are the hat matrices associated toX 0 andQ 0 , respectively.
Here we include two constants c 0 and d 0 for the reference prior for the base model, where c 0 and d 0 are undefined constants for the flat priors for α 0 and λ 0 , respectively. The only remaining task for this intrinsic prior is to define the design matrices for the minimal training samples. Letting p α = dim(α) and p λ = dim(λ), the minimal training samples for v and w contain p α and p λ samples, respectively. Following Leon-Novelo et al. (2012) and Casella and Moreno (2006), we definẽ X andQ to be any design matrices of dimensions p α ×p α and p λ ×p λ satisfying where N is the number of sites and J • = N i=1 J i is the total number of surveys. Note that the covariance matrices in (2.5) and (2.6) are thus completely determined by X X and Q Q.

The Variable Selection Problem
The hierarchy in Equation (2.2) is given for a specific model with a fixed set of predictors. This section develops the model selection problem for occupancy models. Each model contains two components, one for presence and one for Let K = (K y , K z ), where K y and K z denote the sets of column indices for Q F and X F that are not in Q 0 and X 0 , respectively. The model space can then be represented by the Cartesian product P Thus, the entire model space M is populated by models of the form M A = M Ay , M Az , where M Ay and M Az are the corresponding component models for detection and presence determined by the base covariates as well as covariates with indices in A y and A z , respectively. It follows that for the presence process z, the design matrix for the model M A is of the form X M A = (X 0 X A ), where X 0 is the design matrix of the base model M 0z and X A is the matrix containing the covariates indexed by A z (and similarly It is important to note that this construction using the Cartesian product provides the largest possible model space for the occupancy model given the structures of the base and full models. Investigators may wish to impose additional model space restrictions based upon their (subjective) judgment. One means of achieving this restriction is to form two sets of models, M y for detection and M z for presence. The model space can then be defined by the Cartesian product, M = M y × M z . One particular example of such a restriction arises when higher-order terms are included in the detection or presence models. Heredity conditions (Chipman, 1996) can be imposed on either model space and appropriate priors defined (see Section 3.1).

Priors over the Space of Models
Here we outline the construction of prior distributions over the model space.
To allow for flexible modeling, it is assumed that the sets of covariates can potentially include interaction effects, higher-order polynomial terms, and factor variables. The priors for either the presence or the detection component have the same structure, and the joint prior is the product of marginal priors of the two model components.
The priors placed on the model space for the presence and detection models respect the hierarchy of the terms that could be included in a given model. Aspects of the prior construction are described here and full details on such priors can be found in Taylor-Rodriguez et al. (2015). The full model for either the presence or the detection component is represented as a directed acyclic graph (DAG) with nodes representing polynomial terms (powers or interactions; e.g., x 1 or x 2 1 or x 1 x 2 2 ) and with edges specifying inheritance relationships. For example, x 1 x 2 2 has edges (inherits) from its parent nodes x 1 x 2 and x 2 2 , also x 2 1 inherits from its parent x 1 but not from x 2 . Feasible models, also known as models obeying weak heredity, correspond to a special kind of connected subgraph of the full model DAG. First, they must include the base model DAG. Second, a node η can only be included in a model's subgraph only if there is a directed path from a node in the base model to η. The priors considered here focus on models satisfying a strong heredity (also known as well-formulated models), which amounts to requiring that for each node η in a model's subgraph, all parents of η included in the model's subgraph.
Model prior probabilities are specified recursively via conditional node inclusion probabilities (given the parent DAG) using a type of Markov condition reflected in the principles of conditional independence and immediate inheritance (Chipman, 1996). Conditional node inclusion is identified with a latent Bernoulli random variable and placing a conjugate (beta) prior on the inclusion probabilities. The model space prior is obtained by integrating out the conditional inclusion probabilities. In the simplest case all of conditional inclusion probabilities are assumed to be equal and the prior is called the hierarchical uniform prior (HUP). The amount of penalization of complex models can be adjusted (typically, increased relative to the purely combinatorial penalization of the HUP) using node-specific inclusion probabilities and stronger shrinkage via the beta hyper-priors on the inclusion probabilities; this results in the hierarchical independence (HIP) and order priors (HOP) that group nodes of similar complexity together.

Model Posterior Probabilities
In order to compute the posterior probabilities of interest, we take advantage of the model representation making use of the latent variables introduced for the presence and detection processes. Specifically, a conditional independence argument provides because z is independent of M A once v is known and y is independent of M A once z and w are known. In (3.1), with φ(·|µ, σ 2 ; M ) denoting the normal pdf with mean µ, variance σ 2 conditional on model M , and π IP M A (α, λ|Q,X) as defined in (2.4). Under the intrinsic priors above, the closed-form expression for the marginal m(v|M Az ) is where H ⊥ Ay is the hat matrix associated with (I − H 0y )Q A and J • = N i=1 J i is the total number of surveys. Finally, the marginals for the base model M 0 = M 0y , M 0z are The specification of the model posteriors in Equation (3.1) is completed using the construction of the priors π(M A ) over the model space; see Section 3.1.
The advantage of (3.1) is that the posterior of model M A can be represented as which provides for straightforward ergodic estimation of p(M A |y) if samples can be drawn from f (z, w, v|y). If S such samples are obtained, then (3.7) can be approximated by Such draws can be obtained using reversible jump MCMC (RJMCMC) (Green, 1995), as described in the Appendix. One subtle point of difficulty is the calculation of m(w, v) = M A m(w, v|M A )π(M A ) in the denominator of (3.1) when the space of models is too large to be enumerated (or if the necessary calculations for each model and each draw of (w, v) are too arduous). In such a case, the sum may be approximated by T −1 t m(w, v|M (t) )π(M (t) ), where t indexes a set of T models. For instance, t could index the set of models visited during the RJMCMC sampler or a larger set of models could be used (the posterior of a model M A not in this set can be estimated using (3.8)).

Simulation Experiments
This section considers nine different scenarios where we explore a range of detectability and prevalence regimes to assess the behavior of the proposed algorithm. For each model component, the base model is taken to be the interceptonly model, and the full models considered for the presence and the detection have, respectively, five and three predictors. Therefore, the model space contains 2 5 × 2 3 = 256 candidate models. The assumed true models are M T z = {1, x 1 , x 2 , x 5 } for the presence and M T y = {1, q 2 , q 3 } for the detection, where 1 represents the intercept. This small model space is considered so that comparisons with selection using AIC (which generally requires complete enumeration of the model space) can be made.
The simulation scenarios we consider vary depending on where the distributions for the detection and presence probabilities are centered. That is, we set the average probability for detection and presence to predefined valuesp andψ, respectively. If the detection probabilities are centered near one, a non-detection commonly implies a non-presence since the detection is almost perfect. On the contrary, if the detection probabilities are centered close to zero (as with cryptic species), then the uncertainty surrounding an observed zero is greater, making it more difficult to determine if this also corresponds to a true zero in the presence. Now, combining the different values forp with different values for the center of the distribution for the presence probabilitiesψ, we can account for a variety of possibilities observed in real data, ranging from cryptic but highly prevalent species, to easy to detect but very rare species.
The mean probability values for detection and presence that determine our scenarios correspond to the pairs (p,ψ) ∈ {0.2, 0.5, 0.8} ×2 . To match the target values (p,ψ), 15 independent sets of {X F , Q F } were drawn independently from the standard normal distribution, and for each of them the true model parameters were chosen to solve for α and λ the equationsψ(α) =ψ andp(λ) =p, whereψ For each scenario and dataset combination, we used the best solution from ten runs of a gradient-based (quasi-Newton) algorithm initialized from independent standard normal draws. Finally, having determined the regression parameters corresponding to the different scenarios and conditioning on M T z and M T y , the true presence and detection indicators were drawn from the probit model described by (2.1) for each dataset.
The results are shown in Figure 1, which depicts the average proportion of true positive (TP) and false positive (FP) predictors included in the selected models under each scenario. The TP predictors are those in the true model that are also in the selected model, and the FP predictors correspond to those absent from the true model but included in the chosen model. The selected models are the lowest AIC model and the median probability model (MPM) under the objective Bayes methodology. The MPM is the model that includes all predictors whose marginal posterior inclusion probability (MPIP) is greater than or equal to 0.5, where the MPIP for a given predictor is defined as The TP and FP rates for both detection and presence components lead to the same conclusions. In terms of the TPs, the AIC selects a slightly higher number of true positive terms, especially for the component of the model associated to the presence indicators. Nonetheless, these differences are modest at most. Conversely, the resulting proportions of false positive terms (FP) tend to be strinkingly lower using our method, especially for the presence component in those scenarios where there is poor detection (i.e.,p = 0.2). Remarkably, whenever the species is highly prevalent (ψ = 0.8) and detection ranges between moderate and high (p = 0.5, 0.8), the number of false positive terms under our approach is very close to zero in both model components. Also, with (p,ψ) = (0.2, 0.8) our method substantially outperforms AIC in filtering out the false positive terms both in the presence and detection components.
These results are very encouraging: the proposed method not only reduces the inclusion of false positive terms in comparison to AIC but also has comparable performance finding true predictors.

Case studies
In this section, we analyze two datasets. First, we consider presence-absence data for mallard wild ducks (Anas platyrhynchos), collected as part of the 2002 Swiss breeding bird monitoring program. For our second example, we consider the blue hawker dragonfly data, which had been previously studied using AIC as the variable selection strategy in Kéry et al. (2010). The mallard data is extremely clean, with sufficient sites being surveyed, which for the most part are visited the same number of times. On the other hand, the blue hawker dataset was collected through a large scale citizen science effort. As such, although the number of sites visited is large for this type of data, it displays large asymmetries in the surveying effort, posing a more challenging problem for this type of analysis. Both data sets contain a sufficiently small number of predictors so that enumeration of the entire model space is feasible. Therefore, for these data analyses, we present estimators of posterior probabilities from enumeration (EPE), renormalization (RPE), and visit frequency (FPE). While all estimates exhibit Monte Carlo error, we treat the enumeration estimators as a gold standard estimator because the Monte Carlo error can be easily controlled. We implement the method of Chib and Jeliazkov (2001) for estimation of the marginal and use a relative magnitude stopping rule to determine the length of sampling (Flegal and Gong, 2013). In particular, we require that the 95% confidence interval for the estimator for the log posterior evaluated at its mode be less than 1% of the size of the estimate.
To obtain the EPE for each model, we run the MCMC algorithm defined by (2.2) using the priors given in (2.5)-(2.7). These yield draws of the regression coefficients conditional on each model, which are then used to calculate the marginal density of the response. We calculate the EPEs using the marginals obtained under each model. Once the EPEs are in place, we then compare them to their corresponding MCMC estimates using either FPE or RPE. Expression (3.8) enables direct calculation of the RPEs for a specified set M A of models, which may even include models that were not sampled. Given the moderate size of the model space for these examples, in both cases we set M A to be the entire model space. In contrast, as a general rule the FPEs are only available for the set of the visited models in the RJMCMC. Finally, to compare our results to the traditional approach using AIC, we use the "Akaike weights" (see Burnham and Anderson, 2003;Burnham, 2004, for a definition and further information). These are obtained using functions occu and dredge from the R packages unmarked and MuMin, respectively. The AIC weights allow us to make direct comparison of the results provided by either method, as they can be seen as posterior probabilities obtained from a specific prior on the model parameters. However, as AIC is minimax-rate optimal for estimating the regression function, it cannot be a consistent model selector, as demonstrated in Yang (2005), making these priors ill-suited for variable selection.

The mallard data
As Switzerland is a small and mountainous country, it provides for large variation in its topography and physio-geography. As such, elevation is a good candidate to predict species occurrence at a large spatial scale. It can serve as a proxy for habitat type, intensity of land use, temperature, as well as some other biotic factors (Kéry et al., 2010). The data used in the illustration was collected by the Swiss breeding bird survey, and had been previously used to derive abundance estimates in Kéry et al. (2005).
The monitoring program for common breeding bird species comprises more than 250 1-km 2 quadrats distributed in a grid sample across Switzerland. Throughout the breeding season, each quadrat is surveyed two or three times annually by an experienced surveyor along a route, recording the date and whether visual or acoustic contact was made. Elevation (elev) and forest cover (forest) were matched for the studied locations from the Swiss Federal Statistical Office (Kéry and Schmid, 2004). Given that the route length (length) across quadrats was not homogenous, route length (within a quadrat) was considered to account for variation in effective sample area. To model the detection probabilities, survey duration divided by route length (ivel) was used as a measure of effort. Also the date (date) was considered for the detection component since the surveys were collected over a three month period, and behavioral changes that might affect detection could be expected. Using the built-in feature of our algorithm to account for the polynomial structure in the predictors, we considered a full quadratic surface for the predictors, both in the presence as well as in the detection component. The dataset contains 235 quadrats, of which two were surveyed once, 42 twice, and 191 were visited three times.

Results
As mentioned above, given that this dataset contains only a few covariates, even when considering the full quadratic surfaces, it is possible to perform complete enumeration of the model space (which has 1,235 models). The results from our analyses are summarized in terms of the MPIPs (calculated using (4.1)), the top ranked models (in terms of their posterior probabilities), and the Median Probability Model (MPM), which is the model containing only terms whose MPIPs are greater than 0.5. These measures were all obtained for each method using the posterior probabilities from the joint model for presence and detection. Table 1 displays the MPIPs calculated with EPEs, RPEs, FPEs and AIC w . Although the MPIPs obtained from EPE are lower than those from the two other estimates (RPE and FPE), for the most part all three share the same ordering, with the exception of the length 2 term in the presence component. It is worth noting that, although the MPIPs are comparable for the three alternatives, those from RPE are considerably closer to those from EPE than those from RPE, especially for the detection component. The MPIPs from AIC w are considerably higher for most predictors than any of their Bayesian counterparts, implying that good models resulting from AIC selection are more complex, as expected.
Using each of the first three columns displayed in Table 1 one can extract a median probability model (MPM). Following the same approach, with the last column in Table 1, we obtain the 50% threshold model using the AIC weights. The MPM matches for RPE and FPE, and this model in turn is similar to that from EPE, but the latter excludes the forest term in the presence component. In spite of this discrepancy, it is noteworthy that the MPIP using EPE for this term is 0.4305, being relatively close to the 0.5 threshold for the MPM. The comparable model obtained using AIC weights is considerably larger than all the MPMs resulting with EPE, RPE and FPE, all of which are nested within it.  Table 3 displays the five highest probability models (HPMs) under the three calculation alternatives, as well as those resulting from AIC based ranking. Remarkably, the highest probability model is the same under the true posterior probabilities and the two estimation methods considered. Among the set of top models resulting from EPE, four are among the top five from RPE, and three are among those from FPE. Additionally, the model ranked fifth using EPE, which does not match with any of the top five HPMs from RPE or FPE, is ranked eighth and ninth with RPE and FPE, respectively. Also, models ranked fifth under RPE (which coincides with model four with FPE) and fifth under FPE, which are not among the top five with EPE, are respectively ranked eighth and seventh with EPE. Again, more complex top models result from AIC selection in the presence components, and notably the model posterior probabilities are highly diluted across the model space, with the five top models concentrating only about 8% of the posterior mass. This contrasts markedly with the mass harnessed by the top five models with the other three methods, which are approximately 26% with FPE, 43% for RPE and 55% with EPE. The results in Tables 1-3 indicate that estimating the model posterior probabilities using either RPE or FPE yield reasonable approximations to the actual posterior probabilities. In particular, all methods rank models similarly, and if model averaging was to be performed, these would all produce comparable results, as the derived MPIPs resemble each other under the three alternatives. Nonetheless, following the results from Table 1 we prefer RPEs, as these appear to be converging faster towards the benchmark posterior values (EPEs). These results are consistent with the findings from exhaustive simulation experiments conducted in Taylor-Rodriguez et al. (2015), where overwhelming evidence was found in favor of renormalized model posterior estimates when compared to the frequency-based ones in the multiple linear regression problem. For occupancy models, this behavior is more conspicuous in the detection component than in the presence one, possibly due to the additional uncertainty arising from only partially observing the presence indicators. In addition to the observation that the renormalized posteriors are closer to those from enumeration, in larger model spaces where not all models are visited by the stochastic search, it is possible to calculate renormalized posteriors for a larger set of models than those visited, while with frequency-based estimates this is not possible.

Blue hawker data
During 1999 and 2000, an intensive volunteer surveying effort coordinated by the Centre Suisse de Cartographie de la Faune (CSCF) was conducted to analyze the distribution of the blue hawker, Ashna cyanea (Odonata, Aeshnidae), a common dragonfly in Switzerland. Repeated visits to 1-ha pixels took place to obtain the corresponding detection history. In addition to the survey outcome, the x-and y-coordinates, thermal level, the date of the survey, and the elevation were recorded. Surveys were restricted to the known flight period of the blue hawker, which occurs between May 1 and October 10. In total, 2,572 sites were surveyed at least once during the surveying period. The number of surveys per site ranges from 1 to 22 times within each survey year, with as many as 67% of the sites being surveyed only once, and only 5% of the sites being surveyed more than 3 times. As such, the analysis of this data set is an illustration of a considerably more challenging problem. Kéry et al. (2010) summarize the results of this effort using AIC-based model comparisons. To select the predictors in the detection component, the authors follow a backwards elimination approach while keeping the presence component fixed at the most complex model. To select the presence model, they choose among a group of three models while using the chosen detection model. The full models considered in this study are Φ −1 (p) = λ 0 + λ 1 year + λ 2 elev + λ 3 elev 2 + λ 4 elev 3 + λ 5 date + λ 6 date 2 Φ −1 (ψ) = α 0 + α 1 year + α 2 elev + α 3 elev 2 + α 4 elev 3 , where the term year denotes I {year=2000} .
Assuming these full models and intercept only base models (and disregarding the polynomial hierarchy among predictors), the model space for this problem contains 2 6+4 = 1, 024 models in the joint model space. However, if the polynomial structure is respected, without considering interactions (for compatibility with the analysis in Kéry et al. (2010)), the size of the model space for the detection component reduces to 24 models, and to eight models for the presence. This corresponds to a total of 192 models in the combined space. In the exercise below, when using the proposed approach we enforce the strong heredity condition through the priors over the model space.
As in the analysis of the Mallard dataset, we obtain the EPEs, the RPEs, and the FPEs. The model ranks obtained with the posterior probabilities (or their estimates) are compared to those resulting from AIC selection. The functions used to conduct selection with AIC did not constrain the model space to respect strong heredity, hence for the AIC selection all 1024 models were considered. All results are compared to the models ultimately recommended by Kéry et al. (2010), given by Detection: 1, elev, elev 2 , date, date 2 Presence: 1, elev, elev 2 , elev 3 . Table 4 shows the MPMs from either of the approaches considered obtained with the MPIPs found in Table 6 of Appendix B. The MPMs obtained with RPE and FPE coincide, and are similar to that from EPE, with the the latter additionally including the elev 2 term. The pseudo-MPM that results when using AIC weights contains all the term included in the MPMs from RPE and FPE, but adds the elev 3 and year terms in the detection component. Note that this model does not respect the polynomial hierarchy, including elev 3 but not elev 2 . The top ranked models in terms of the true (EPE) and estimated posterior probabilities (RPE and FPE), and from AIC-based selection are displayed in Table 5. The top model obtained with EPE, RPE and FPE are the same for both the presence and detection components, with the top AIC model not respecting the polynomial hierarchy in the detection component (including the elev 3 but not elev 2 ) and having only the year term in the presence component. Interestingly, four out of the top five models found by EPE coincide with those from RPE, whereas only two from EPE are among the top 5 discovered with FPE, indicating again faster convergence of the renormalized estimates when compared to the frequency based ones. Again, it is worth emphasizing that the probability mass with AIC weight is much more diluted across the model space than with any of its Bayesian counterparts.

Results
To conclude, a notable advantage of the Bayesian approach is that the uncertainty associated to the choice of a particular model can be assessed using the model posterior probabilities, whereas this is not the case with AIC selection, as the AIC weights do not correspond to actual posterior probabilities.

Discussion
This paper developed the first objective Bayes methodology for variable selection using single-season site occupancy models, based on intrinsic priors derived from non-informative priors. This solution uses latent variables to data-augment the analysis, helping to seamlessly calculate the model posterior probabilities. Working on the latent scale additionally facilitates the construction of a straightforward MCMC sampler and posterior estimation using sample averages. Because the intrinsic priors are built from non-informative priors, the need for hyperparameter specification is avoided, making the method entirely automatic and widely applicable. Additionally, the types of prior distributions assumed on the model space (HIP, HOP and HUP) enforce the heredity constraints required when performing selection with interactions and higher-order polynomial predictors. These classes also allow for stronger penalization than the usual equal probability prior, further helping control the false positive rate. These have been shown to be particularly useful in problems with small and moderate sample sizes (for more details see Taylor-Rodriguez et al., 2015). An important advantage of our method, relative to the AIC-based selection, is that the resulting model posterior probabilities provide a measure of uncertainty associated with choosing a particular model.
The stochastic search algorithm can be used to thoroughly explore large model spaces using the renormalized posterior estimates (instead of the frequencybased ones). This tool will allow practitioners to explore the model space without having to enumerate it or preselect a subset of models, enabling its use with larger model spaces.
The simulation experiments confirmed the ability of the method to identify the predictors present in the true model when considering both the highest and median probability models. The objective Bayes method proved to be competitive with AIC in detecting true predictors, and greatly outperformed AIC in reducing the number of false positive predictors included in the models with high posterior probabilities.
The software used throughout the article was built into the R package Oc-cOBayes available at request. This package includes functions to run the variable selection procedure, as well as some auxiliary functions to validate a set of "best" models using a hold-out data set.