Power-Expected-Posterior Priors for Generalized Linear Models

The power-expected-posterior (PEP) prior provides an objective, automatic, consistent and parsimonious model selection procedure. At the same time it resolves the conceptual and computational problems due to the use of imaginary data. Namely, (i) it dispenses with the need to select and average across all possible minimal imaginary samples, and (ii) it diminishes the effect that the imaginary data have upon the posterior distribution. These attributes allow for large sample approximations, when needed, in order to reduce the computational burden under more complex models. In this work we generalize the applicability of the PEP methodology, focusing on the framework of generalized linear models (GLMs), by introducing two new PEP definitions which are in effect applicable to any general model setting. Hyper-prior extensions for the power parameter that regulates the contribution of the imaginary data are introduced. We further study the validity of the predictive matching and of the model selection consistency, providing analytical proofs for the former and empirical evidence supporting the latter. For estimation of posterior model and inclusion probabilities we introduce a tuning-free Gibbs-based variable selection sampler. Several simulation scenarios and one real life example are considered in order to evaluate the performance of the proposed methods compared to other commonly used approaches based on mixtures of g-priors. Results indicate that the GLM-PEP priors are more effective in the identification of sparse and parsimonious model formulations.


Introduction

Motivation
In this article, the variable selection problem in Generalized Linear Models (GLMs) is analyzed from an objective and fully automatic Bayesian model choice perspective. The desire for an automatic Bayesian procedure is motivated by the appealing prospect of a method that can be easily implemented in complex models without the need to specify tuning parameters. Regarding the justification for the necessity of an objective model choice approach, we can argue that in variable selection problems we are rarely confident about any given set of regressors as explanatory variables, which translates to little prior information about the regression coefficients. Therefore, we would like to consider default prior distributions, which in many cases are improper, thus leading to undetermined Bayes factors.
Intrinsic priors (Berger and Pericchi, 1996a,b) and expected-posterior (EP) priors (Pérez and Berger, 2002) can be considered fully automatic, objective Bayesian methods for model comparison in regression models. They are developed through the utilization of "training" or "imaginary" samples, respectively, of "minimal" size, and therefore the resulting priors have the further advantage of being compatible across models; see Consonni and Veronese (2008). Intrinsic priors as well as EP priors have been proposed in many articles for the variable selection problem in Gaussian linear models (see, for example, Casella and Moreno, 2006); however, to the best of our knowledge, there is only one study that proposes this methodology for GLMs, and it is restricted to the case of the probit model (Leon-Novelo, Moreno and Casella, 2012). We believe that this is due to the fact that the derivation of such priors can be a very challenging task, especially under complex models, leading to computationally intensive solutions. Furthermore, by using minimal training samples, large-sample approximations (e.g. Laplace approximations) cannot be applied in many cases.
Our contribution with this article is twofold. First, we develop an automatic, objective Bayesian variable selection procedure for GLMs based on the EP prior methodology. In particular, we consider the power-expected-posterior (PEP) prior of Fouskakis, Ntzoufras and Draper (2015), which diminishes the effect that the imaginary data have upon the posterior distribution and therefore removes the need to use minimal training samples. Through this approach we can consider imaginary samples of sufficiently large size and are therefore able to apply, when needed, large-sample approximations. Second, we introduce a simple tuning-free Gibbs-based variable selection sampler for estimating posterior model and variable inclusion probabilities.

Bayesian variable selection for generalized linear models
Despite the importance and popularity of GLMs, Bayesian variable selection techniques for non-Gaussian models are scarce in relation to the abundance of methods available for the normal linear model. This is mainly due to the analytical intractability which arises outside the context of the normal model. The relatively limited studies that focus on non-Gaussian models mainly aim to overcome analytical intractability through the use of Laplace approximations and/or stochastic model search algorithms. Chen and Ibrahim (2003) introduced a class of conjugate priors based on an initial prior prediction of the data (similar to the concept of imaginary data) associated with a scalar precision parameter. This approach essentially leads to a GLM analogue of the g-prior (Zellner and Siow, 1980; Zellner, 1986) where the precision parameter has the role of g. However, the prior of Chen and Ibrahim (2003) is not analytically available for non-Gaussian GLMs and, therefore, Chen, Huang, Ibrahim and Kim (2008) proposed a Markov chain Monte Carlo (MCMC) based solution for this class of models. Ntzoufras, Dellaportas and Forster (2003) used a unit-information g-prior (Kass and Wasserman, 1995) for variable selection and link determination in binomial models through reversible-jump MCMC sampling. Sabanés Bové and Held (2011) consider the asymptotic distribution of the prior of Chen and Ibrahim (2003), which results in the same g-prior form used in Ntzoufras et al. (2003), and further consider mixtures of g-priors along the lines of Liang, Paulo, Molina, Clyde and Berger (2008). Computation of the marginal likelihood in Sabanés Bové and Held (2011) is handled through an integrated Laplace approximation, based on Gauss-Hermite quadrature, which allows variable selection through full enumeration for small/moderate model spaces or through MCMC model composition (MC³) algorithms (Madigan and York, 1995) for spaces of large dimensionality. Other GLM variations of g-prior mixtures have an empirical Bayes (EB) flavor, using the observed or expected information matrix evaluated at the ML estimates as the prior variance-covariance matrix (Hansen and Yu, 2003; Wang and George, 2007; Li and Clyde, 2015). A computational benefit of the EB approach is that the integrated Laplace approximation can be expressed in closed form as a set of functions of the ML estimates. For large model spaces, where full enumeration is infeasible, Li and Clyde (2015) recommend using the Bayesian adaptive sampling algorithm (Clyde, Ghosh and Littman, 2011). A related prior specification is the information-matrix prior of Gupta and Ibrahim (2009), which combines ideas from the g-prior and Jeffreys prior for GLMs (Ibrahim and Laud, 1991); under a Gaussian likelihood the information-matrix prior becomes the standard g-prior, while for g → ∞ it reduces to Jeffreys prior, which is proper only for the case of the binomial model. However, in applications Gupta and Ibrahim (2009) do not directly consider the problem of stochastic search over the entire model space. Finally, one application of Bayesian intrinsic variable selection for probit models via MCMC is presented in Leon-Novelo et al. (2012).
As seen, at present most methods for GLMs are anchored to the g-prior approach and therefore cannot be regarded as objective and fully automatic approaches, in the sense that one cannot conduct an analysis starting with non-informative, flat priors. In this work we present an automatic, objective Bayesian variable selection procedure for GLMs based on the PEP methodology. The structure of the remainder of the paper is as follows. In Section 2 we provide an overview of the PEP prior formulation and discuss the applicability problems that arise in the case of non-Gaussian models. We proceed with two alternative definitions, which generalize the applicability of the PEP prior for GLMs. In Section 3 we introduce a Gibbs-based sampler suitable for variable selection and for single-model posterior inference. Section 4 presents a hierarchical extension of the methodology, which involves assigning a hyper-prior to the power parameter that controls the contribution of the imaginary data. Illustrative examples and comparisons with other methods, using both simulated and real data sets, are presented in Section 5. Section 6 concludes with a summary and a discussion of future research directions.

PEP priors for generalized linear models
We consider n realizations of a response variable Y accompanied by a set of predictors X_1, X_2, ..., X_p which may potentially characterize the response. To fix notation, let γ ∈ {0, 1}^p index all 2^p subsets of predictors serving as a model indicator, where each element γ_j, for j = 1, ..., p, is an indicator of the inclusion of X_j in the structure of model M_γ. Moreover, let p_γ = Σ_{j=1}^p γ_j denote the number of active covariates in model M_γ. Within the GLM framework, the response Y follows a distribution which is a member of the exponential family. The sampling distribution of the response vector y = (y_1, ..., y_n)^T under model M_γ is given by

f_γ(y | β_γ, φ_γ) = ∏_{i=1}^n exp{ [y_i ϑ_γ(i) − b(ϑ_γ(i))] / a_i(φ_γ) + c(y_i, φ_γ) }. (2.1)

The functions a(•), b(•) and c(•) determine the particular distribution of the exponential family. The parameter ϑ_γ(i) is the canonical parameter which regulates the location of the distribution through the relationship g(b'(ϑ_γ(i))) = η_γ(i) = X_γ(i) β_γ, where g(•) is the link function connecting the mean of the response y_i with the linear predictor η_γ(i). We assume that an intercept term is included in all 2^p models under consideration, so β_γ is the d_γ × 1 vector of regression coefficients, where d_γ = p_γ + 1, and X_γ(i) is the i-th row of the n × d_γ design matrix X_γ with a vector of 1's in the first column and the γ-th subset of the X_j's in the remaining p_γ columns. The parameter φ_γ controls the dispersion and the function a(•) is typically of the form a_i(φ_γ) = φ_γ / w_i, where w_i is a known fixed weight that may either vary or remain constant per observation. In addition, the nuisance parameter φ_γ is commonly considered a common parameter across models; therefore, we assume throughout that φ_γ ≡ φ without loss of generality. Given the above formulation, we have that E(y_i) = b'(ϑ_γ(i)) and Var(y_i) = a_i(φ_γ) b''(ϑ_γ(i)). The GLM parameters θ_γ = (β_γ, φ) are divided into the predictor effects β_γ and the parameter φ which affects dispersion. In the following we work along the lines of Fouskakis and Ntzoufras (2016), considering the conditional PEP prior; i.e. we construct the PEP prior of β_γ conditional on φ.
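As a concrete illustration of the exponential-family form above, the following sketch (our own illustrative helper, not code from the paper) evaluates the log-likelihood of a Poisson GLM with the canonical log link, for which ϑ_i = η_i, b(ϑ) = exp(ϑ), a_i(φ) = 1 and c(y_i, φ) = −log(y_i!):

```python
import math

def poisson_glm_loglik(y, X, beta):
    """Log-likelihood of a Poisson GLM with canonical log link.

    Exponential-family form: sum_i [ y_i*theta_i - b(theta_i) + c(y_i) ]
    with theta_i = eta_i = X_i beta, b(theta) = exp(theta), c(y) = -log(y!).
    """
    ll = 0.0
    for yi, xi in zip(y, X):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor
        ll += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return ll
```

Summing the Poisson log-probabilities directly, with means λ_i = exp(η_i), yields the same value, which is a quick sanity check on the canonical-link identities.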

An overview of the PEP prior
The PEP prior, initially formulated in Fouskakis et al. (2015) for the case of the normal linear model, creatively fuses ideas from the power-prior (Ibrahim and Chen, 2000) and the EP prior (Pérez and Berger, 2002). Let us first describe the EP prior approach. Consider imaginary data y* = (y*_1, ..., y*_{n*})^T coming from the prior-predictive distribution m*(y*) of a "suitable" reference model M*. Then, given y*, for any model M_γ with sampling distribution f_γ(y* | β_γ, φ) as defined in (2.1) and a default baseline prior π^N_γ(β_γ | φ), we have a corresponding baseline-posterior distribution given by

π^N_γ(β_γ | y*, φ) = f_γ(y* | β_γ, φ) π^N_γ(β_γ | φ) / m^N_γ(y* | φ). (2.2)

The EP prior for the parameters of model M_γ is then defined as the posterior distribution in (2.2), averaged over all possible imaginary samples, i.e.

π^EP_γ(β_γ | φ) = ∫ π^N_γ(β_γ | y*, φ) m*(y*) dy*. (2.3)

The reference model M* is commonly considered to be the simplest model, i.e. the (null) intercept model in the regression framework. This selection makes the EP approach essentially equivalent to the arithmetic intrinsic Bayes factor of Berger and Pericchi (1996b).
A key issue in the implementation of the EP prior is the selection of the size n* of the imaginary sample. In order to minimize the effect of the prior on the posterior inference, the reasonable solution is to choose the smallest possible n* for which the posterior is proper. This leads to the concept of the so-called minimal training sample, which however requires calculating the arithmetic mean (or other appropriate measures of centrality) of Bayes factors over all possible minimal training samples. In addition, when it comes to regression, the same problem arises with the design matrix, as one has to choose appropriate covariate values for each minimal training sample, and this further depends upon the choice of the reference model. A computational solution to the aforementioned problems has been proposed in the literature (e.g. Casella and Moreno, 2006; Moreno and Girón, 2008); however, this solution is only applicable under the normal linear regression model and, in addition, under this approach it is not clear whether the resulting Bayes factors retain their intrinsic nature. Furthermore, the effect of the EP prior can become influential when the sample size is not much larger than the number of predictors; see Fouskakis et al. (2015) for details. Finally, when n* is small and (2.3) is hard to derive, large-sample approximations cannot be applied.
The PEP prior resolves the problem of defining and averaging over minimal training samples and at the same time scales down the effect of the imaginary data on the posterior distribution. The core idea lies in substituting the likelihood function involved in the calculation of (2.2) with a powered version of it, i.e. raising it to the power of 1/δ, similar to the power-prior approach of Ibrahim and Chen (2000). Following Fouskakis and Ntzoufras (2016), the conditional PEP prior in the GLM setup, under the null reference model M_0, is defined as

π^PEP_γ(β_γ | φ, δ) = ∫ π^N_γ(β_γ | y*, φ, δ) m^PEP_0(y* | φ, δ) dy*,

where π^N_γ(β_γ | y*, φ, δ) ∝ f_γ(y* | β_γ, φ)^{1/δ} π^N_γ(β_γ | φ) is the baseline posterior under the power-likelihood and m^PEP_0(y* | φ, δ) is the corresponding prior predictive distribution of the reference model M_0. Note that the PEP prior for the intercept term of M_0 essentially reduces to the baseline prior, i.e. π^PEP_0(β_0 | φ, δ) = π^N_0(β_0 | φ). Here the power parameter δ controls the weight that the imaginary data contribute to the "final" posterior distributions of β_γ and φ. The default choice is to set it equal to the size of the imaginary sample, i.e. δ = n*. Under this approach the contribution of the imaginary data is downweighted to account overall for one data point, leading to a minimally informative prior with a unit-information interpretation (Kass and Wasserman, 1995). Furthermore, by setting n* = n we avoid the complicated problem of sampling over numerous imaginary design sub-matrices, as in this case we have that X*_γ ≡ X_γ. As shown in Fouskakis et al. (2015), the PEP prior is robust with respect to the specification of n* and it also remains relatively non-informative even when the model dimensionality is close to the sample size.
Another advantage of setting n* = n, which becomes more obvious in the GLM framework, is that one can utilize large-sample approximations when needed. For instance, consider the baseline posterior in (2.6), which can be expressed as

π^N_γ(β_γ | y*, δ) ∝ f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ). (2.11)

This unnormalized distribution is recognized as the power-prior for GLMs (Chen, Ibrahim and Shao, 2000). If we assume a flat baseline prior for β_γ, i.e. π^N_γ(β_γ | φ) ∝ 1, then, based on standard Bayesian asymptotic theory (Bernardo and Smith, 2000), for n* → ∞ the distribution in (2.11) converges to

β_γ | y*, δ ~ N_{d_γ}( β̂*_γ , δ J*_γ(β̂*_γ)^{−1} ), (2.12)

where β̂*_γ is the MLE of β_γ for data y* and design matrix X*_γ, and J*_γ(β̂*_γ) is the observed information evaluated at β̂*_γ. Specifically,

J*_γ(β̂*_γ) = X*_γ^T W*_γ(β̂*_γ) X*_γ, (2.13)

where W*_γ(•) is the diagonal matrix of GLM weights. It is straightforward to see that the asymptotic distribution in (2.12) has a g-prior form according to the definitions for GLMs presented in Ntzoufras et al. (2003) and Sabanés Bové and Held (2011). The familiar zero-mean representation in (2.12) arises when the covariates are centered around their corresponding arithmetic means and the imaginary response data are all the same, i.e. y* = g^{−1}(0) 1_{n*}, where 1_{n*} is a vector of ones of size n*, since in this case we have that β̂*_γ = 0_{d_γ}; for details see Ntzoufras et al. (2003).
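To make the observed information matrix in (2.13) concrete, the sketch below (our own illustrative helper, specialized to the logistic model, where the GLM weights are w_i = μ_i(1 − μ_i)) computes J_γ(β) = X_γ^T W_γ(β) X_γ for a given coefficient vector:

```python
import math

def logistic_observed_info(X, beta):
    """J(beta) = X^T W X for logistic regression with the canonical logit
    link, where W = diag(mu_i * (1 - mu_i)) and mu_i = 1/(1 + exp(-eta_i))."""
    d = len(X[0])
    J = [[0.0] * d for _ in range(d)]
    for xi in X:
        eta = sum(b * x for b, x in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))
        w = mu * (1.0 - mu)                      # GLM weight for this row
        for r in range(d):
            for c in range(d):
                J[r][c] += w * xi[r] * xi[c]
    return J
```

At β = 0 every weight equals 1/4, so for an intercept-only design J reduces to n/4; scaling its inverse by δ gives the asymptotic prior covariance appearing in the g-prior form discussed above.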

PEP prior extensions for GLMs via unnormalized power-likelihoods
The sampling distribution of the imaginary data involved in the PEP prior via (2.6), (2.7) and (2.9) is a power version of the likelihood function. In the normal linear regression case, Fouskakis et al. (2015) and Fouskakis and Ntzoufras (2016) naturally considered the density-normalized power-likelihood, which is also a normal distribution with variance inflated by a factor of δ. Similar results can be derived for specific distributions of the exponential family, such as the Bernoulli, the exponential and the beta distributions, where the normalized power-likelihood is of the same distributional form. This property simplifies calculations when using the PEP methodology, especially for Gaussian models where the resulting posterior distribution and marginal likelihood are available in closed form. An application of the PEP prior using the normalized power-likelihood for MCMC-based variable selection in binary logistic regression can be found in Perrakis, Fouskakis and Ntzoufras (2015a). However, this property does not hold for all members of the exponential family. For instance, for the binomial and Poisson regression models, the normalized power-likelihoods are composed of products of discrete distributions that have no standard identifiable form. Although it is feasible to perform likelihood evaluations for each observation, the additional computational burden renders the implementation of the PEP prior methodology time-consuming and inefficient. One possible computational solution to the problem would be to utilize the exchange algorithm for doubly-intractable distributions (Murray, Ghahramani and MacKay, 2006). However, this approach would further increase MCMC computational costs.
Here we pursue a more generic approach for the implementation of the PEP methodology in GLMs by redefining the prior itself. Namely, we consider two adaptations of the PEP prior which, in principle, can be applied to any statistical model and, consequently, are applicable to all members of the exponential family. For the remainder of this paper, without loss of generality, we restrict the scale parameter φ to be fixed, which is the case for the binomial, Poisson and normal (with known error variance) regression models. Specifically, we assume that φ = 1 and remove φ from all conditional expressions to alleviate notation.
The core idea is to use the unnormalized power-likelihoods (2.8) and (2.10) and normalize the baseline posterior density (2.11), i.e.

π^N_γ(β_γ | y*, δ) = f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ) / ∫ f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ) dβ_γ, (2.15)

which is also the approach of Friel and Pettitt (2008, Eq. 4) in the definition of the power posterior. Given this first step, we proceed by proposing two versions of the PEP prior which differ with respect to the definition of the prior predictive distribution used to average the baseline posterior in (2.15) across imaginary data sets. This prior predictive distribution can alternatively be viewed as a hyper-prior assigned to y* (Fouskakis and Ntzoufras, 2016). More specifically, we define the two PEP variants as follows.
Definition 1. The concentrated-reference PEP prior of the model parameters β_γ is defined as the power-posterior of β_γ in (2.15) "averaged" over all imaginary data coming from the prior predictive distribution of the reference model M_0 based on the actual likelihood, that is

π^CR-PEP_γ(β_γ | δ) = ∫ π^N_γ(β_γ | y*, δ) m^N_0(y*) dy*, (2.16)

with

m^N_0(y*) = ∫ f_0(y* | β_0) π^N_0(β_0) dβ_0. (2.17)

In order for the above prior to exist we need to consider, for each model M_γ, similar assumptions as in Pérez and Berger (2002), i.e.

∫ f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ) dβ_γ < ∞ and m^N_0(y*) < ∞ for all y*. (2.18)
In equation (2.16), m^N_0 will not necessarily be proper but, still, by slightly abusing notation, we define the concentrated-reference PEP prior as the expectation of π^N_γ(β_γ | y*, δ) with respect to m^N_0. Furthermore, impropriety of the baseline priors in (2.16) causes no indeterminacy of the resulting Bayes factors, since π^CR-PEP_γ(β_γ | δ) depends only on the normalizing constant of the baseline prior of the parameter of the null model. Finally, the concentrated-reference PEP prior for the parameter of the null model is no longer equal to the baseline prior π^N_0(β_0), since π^CR-PEP_0(β_0 | δ) = ∫ π^N_0(β_0 | y*, δ) m^N_0(y*) dy*.

Definition 2. The diffuse-reference PEP prior of the model parameters β_γ is defined as the power-posterior of β_γ in (2.15) "averaged" over all imaginary data coming from the "normalized" prior predictive distribution of the reference model M_0 based on the unnormalized power-likelihood, that is

π^DR-PEP_γ(β_γ | δ) = ∫ π^N_γ(β_γ | y*, δ) m^Z_0(y* | δ) dy*, with m^Z_0(y* | δ) = m^N_0(y* | δ) / C_0, (2.19)

where

m^N_0(y* | δ) = ∫ f_0(y* | β_0)^{1/δ} π^N_0(β_0) dβ_0 and C_0 = ∫ m^N_0(y* | δ) dy*. (2.20)

The conditions for the existence of the diffuse-reference PEP prior, for each model M_γ, are similar to (2.18), i.e.

∫ f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ) dβ_γ < ∞ and m^N_0(y* | δ) < ∞ for all y*.
Again, the definition of the diffuse-reference PEP prior as an expectation of π^N_γ(β_γ | y*, δ) with respect to m^Z_0 is slightly abusive under improper baseline prior setups. The normalization of m^N_0(y* | δ) is adopted in order to retain the "expected-posterior" interpretation under proper baseline prior setups. The induced normalizing constant C_0 = ∫ m^N_0(y* | δ) dy* exists under any proper baseline prior setup and has no effect on the posterior variable selection measures, since it is common to all models under consideration. Additionally, impropriety of the baseline priors causes no indeterminacy of the resulting Bayes factors, since π^DR-PEP_γ(β_γ | δ) depends only on C_0, which is common across all models. Finally, the diffuse-reference PEP prior for the parameter of the null model is no longer equal to the baseline prior, since π^DR-PEP_0(β_0 | δ) = ∫ π^N_0(β_0 | y*, δ) m^Z_0(y* | δ) dy*.

Definition 1 is a special case of Definition 2, since m^N_0(y*) is a special case of m^N_0(y* | δ) with δ = 1. Because the likelihood in (2.17) is not scaled down, it provides more information from the imaginary data, resulting in a more concentrated (in relation to the alternative approach) predictive distribution. For this reason, this version is named concentrated-reference PEP (CR-PEP). The CR-PEP prior can also be expressed hierarchically, by first generating y* from m^N_0(y*) and then β_γ from the power-posterior π^N_γ(β_γ | y*, δ). In Definition 2 the likelihood involved in m^N_0(y* | δ) in (2.20) is raised to the power of 1/δ and, therefore, the information incorporated in the prior predictive distribution becomes equal to n*/δ data points, leading to a distribution which becomes increasingly diffuse as δ grows. Thus, this prior is coined diffuse-reference PEP (DR-PEP).

Further prior specifications
To complete the model formulation we need to specify a baseline prior for β_γ, under each model M_γ, and also a prior distribution on the model space M = {0, 1}^p, as we are interested in variable selection rather than single-model inference, in which case γ ∈ M is fixed. In addition, we do not need to specify a prior for φ, which is considered fixed in our setting. For models with random (under estimation) φ, we propose working along the lines of Fouskakis and Ntzoufras (2016) and using a flat prior on φ; this would just add one additional step to the MCMC algorithm presented in Section 3.

Baseline prior distributions for β γ
Common choices for the baseline prior of the regression vector β_γ, under each model M_γ, are either the flat improper prior π^N_γ(β_γ) ∝ 1 or Jeffreys' prior for GLMs (Ibrahim and Laud, 1991), which is of the form

π^N_γ(β_γ) ∝ | X_γ^T W_γ(β_γ) X_γ |^{1/2}. (2.26)

For non-Gaussian GLMs Jeffreys' prior will depend on β_γ through the matrix W_γ(•); see Section 2.1 for details. Note that Jeffreys' prior for the parameter of the null model simplifies to π^N_0(β_0) ∝ | 1_n^T W_0(β_0) 1_n |^{1/2}.

Prior distributions on model space
A common non-informative option for γ is to use a product Bernoulli distribution where the prior inclusion probability of each predictor is equal to 0.5. This leads to a discrete uniform prior on the model space, i.e.

π(γ) = 2^{−p}. (2.27)

An alternative choice, better suited for moderate to large p, is to use the hierarchical prior design γ_j | τ ~ Bernoulli(τ) independently and τ ~ Beta(1, 1), in order to account for an appropriate multiplicity adjustment (Scott and Berger, 2010).
In this case the resulting prior is given by

π(γ) = B(p_γ + 1, p − p_γ + 1), (2.28)

where B(•, •) denotes the beta function.
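The two model-space priors above can be sketched as follows (a hypothetical helper of our own, with the beta-binomial case computed for the Beta(1, 1) hyper-prior on τ):

```python
from itertools import product
from math import comb

def model_prior(gamma, kind="uniform"):
    """Prior probability of the model indexed by the 0/1 tuple gamma.

    'uniform'      : independent Bernoulli(0.5) inclusions, pi(gamma) = 2^-p.
    'betabinomial' : gamma_j | tau ~ Bernoulli(tau), tau ~ Beta(1, 1), which
                     integrates to pi(gamma) = 1 / ((p + 1) * C(p, p_gamma)).
    """
    p, k = len(gamma), sum(gamma)
    if kind == "uniform":
        return 0.5 ** p
    return 1.0 / ((p + 1) * comb(p, k))
```

Note the multiplicity adjustment: under the beta-binomial prior every model size receives total mass 1/(p + 1), so individual models of heavily populated sizes are automatically penalized relative to the uniform prior.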

Posterior Computation
In normal linear regression models the conditional PEP prior is a conjugate normal-inverse-gamma distribution, which leads to fast and efficient computations (Fouskakis and Ntzoufras, 2016). For non-Gaussian GLMs there exist no convenient conjugate distributions and the integrals involved in the derivation of the CR/DR-PEP priors are intractable. However, one can work with the hierarchical model, i.e. without marginalizing over the imaginary data, and use an MCMC algorithm in order to sample from the joint posterior distribution of β_γ and y*. For ease of exposition, for the remainder of this section we use the indicator ψ to distinguish between the CR-PEP prior (ψ = 1) and the DR-PEP prior (ψ = δ) and we simply use the general term "PEP" to denote the joint posterior. Specifically, from (2.15), (2.16) and (2.20) we have the following hierarchical form

π^PEP(β_γ, y* | y, δ) ∝ f_γ(y | β_γ) π^N_γ(β_γ | y*, δ) m^N_0(y* | ψ), (3.1)

where m^N_0(y* | 1) ≡ m^N_0(y*). A further computational problem in (3.1) relates to the prior predictive distributions m^N_γ(y* | δ) and m^N_0(y* | ψ), which are not available in closed form. One solution is to use a Laplace approximation for both. Alternatively, a more accurate solution is to augment the parameter space further and include the parameter β_0 of the reference model M_0 in the joint posterior, thus avoiding the approximation of m^N_0(y* | ψ). Based on (2.23) and (2.24), the posterior in (3.1) is expanded as

π^PEP(β_γ, β_0, y* | y, δ) ∝ f_γ(y | β_γ) [ f_γ(y* | β_γ)^{1/δ} π^N_γ(β_γ) / m^N_γ(y* | δ) ] f_0(y* | β_0)^{1/ψ} π^N_0(β_0), (3.2)

which leaves us with the need of only one Laplace approximation, for m^N_γ(y* | δ). Sampling from (3.2) for a single model M_γ, i.e. for a fixed configuration of γ, is feasible using standard Metropolis-within-Gibbs algorithms. Note that under flat baseline priors the posterior in (3.2) and the corresponding MCMC scheme are simplified. Moreover, under a flat baseline prior one may also consider using the normal approximation in (2.12) for the entire fraction appearing in (3.2), instead of using a Laplace approximation for the prior predictive m^N_γ(y* | δ). For variable selection, which is the topic of the next section, we further assign a prior on γ, based on the options discussed in Section 2.3, and utilize the algorithm of Dellaportas, Forster and Ntzoufras (2002).

Gibbs variable selection under the PEP prior
The Gibbs variable selection (GVS; Dellaportas et al., 2002) method is a stochastic search algorithm based on the vector of binary indicators γ ∈ {0, 1}^p, which represents which of the p covariates are included in a model. To formulate GVS we need to partition the regression vector β into (β_γ, β_\γ), corresponding to the components of β that are included in and excluded from the model, i.e. β_j ∈ β_γ if γ_j = 1 and β_j ∈ β_\γ if γ_j = 0, for j = 1, ..., p. As we assume that the intercept term is included in all models under consideration, β_γ and β_\γ are of dimensionality d_γ = p_γ + 1 and d_\γ = p − p_γ, respectively.
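A minimal sketch of this partition (the function name is our own, not the paper's), keeping the intercept in every model:

```python
def partition_beta(beta, gamma):
    """Split the full coefficient vector beta = (intercept, beta_1..beta_p)
    into the active part (intercept plus coefficients with gamma_j = 1)
    and the inactive part (coefficients with gamma_j = 0)."""
    active = [beta[0]] + [b for b, g in zip(beta[1:], gamma) if g == 1]
    inactive = [b for b, g in zip(beta[1:], gamma) if g == 0]
    return active, inactive
```

The dimensions d_γ = p_γ + 1 and d_\γ = p − p_γ follow immediately from the lengths of the two returned pieces.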
Under the GVS setting the joint prior of β and γ is specified as

π(β, γ) = π^N_γ(β_γ) π(β_\γ | γ) π(γ),

where the actual baseline prior choice involves only β_γ, since π(β_\γ | γ) is just a pseudo-prior used for balancing the dimensions between model spaces; see Dellaportas et al. (2002). Suitable choices for the priors of β_γ and γ have been discussed in Section 2.3; thus, in order to complete the GVS setup, we only need to specify the pseudo-prior for the inactive part of the regression vector β. In particular, we use a multivariate normal distribution of dimensionality d_\γ, with parameters specified by the ML estimates; namely,

β_\γ | γ ~ N_{d_\γ}( β̂_\γ , diag(σ̂²_{β_\γ}) I_{d_\γ} ),

where β̂_\γ and σ̂_{β_\γ} are the respective ML estimates and corresponding standard errors of β_\γ from the full model using the actual data y, and I_{d_\γ} is the d_\γ × d_\γ identity matrix.
Based on this formulation, the full augmented posterior used to build our MCMC has the following form

π(β_γ, β_\γ, β_0, γ, y* | y, δ) ∝ π^PEP(β_γ, β_0, y* | y, δ) π(β_\γ | γ) π(γ), (3.5)

where, as a reminder, ψ = 1 in the CR-PEP setting and ψ = δ in the DR-PEP setting.
Step 7: Sample y* from its full conditional distribution using a Metropolis-Hastings step.

Step 8: Update the parameter values at iteration t.

Note that the generation of γ and β_\γ (Steps 2 and 5) is straightforward, since the corresponding conditional distributions are of known form. For the rest of the parameters, β_γ, β_0 and y*, we use Metropolis-Hastings (M-H) steps. Details are provided next.

Implementation details
Concerning the binary inclusion indicators γ_j, the conditional posterior distribution π(γ_j | β, γ_\j, y*, y, δ) is a Bernoulli distribution with success probability O_j / (1 + O_j), where O_j is the conditional posterior odds of including the j-th covariate,

O_j = π(γ_j = 1, β, γ_\j, y*, y | δ) / π(γ_j = 0, β, γ_\j, y*, y | δ). (3.6)

All of the quantities involved in (3.6) are available in closed-form expressions, except for the marginal likelihood m^N_γ(y* | δ). The latter is estimated through the following Laplace approximation

m^N_γ(y* | δ) ≈ (2π)^{d_γ/2} | δ J*_γ(β̂*_γ)^{−1} |^{1/2} f_γ(y* | β̂*_γ)^{1/δ} π^N_γ(β̂*_γ), (3.7)

where β̂*_γ is the MLE for data y* given the configuration of γ, and δ J*_γ(β̂*_γ)^{−1} is equal to minus the inverse Hessian matrix evaluated at β̂*_γ. Under a Jeffreys baseline prior for β_γ, the Laplace approximation simplifies to m^N_γ(y* | δ) = (2πδ)^{d_γ/2} f_γ(y* | β̂*_γ)^{1/δ}. For the active effects β_γ of model M_γ and the intercept term β_0 of the reference model M_0, we use independence-sampler M-H steps. Specifically, for β_γ we generate new candidate values from

q(β_γ) = N_{d_γ}( β̂^all_γ , Σ̂_{β̂^all_γ} ),

where β̂^all_γ is the ML estimate from a weighted regression on y^all = (y, y*)^T, using weights w^all = (1_n, 1_n δ^{−1})^T, and Σ̂_{β̂^all_γ} is the estimated variance-covariance matrix of β̂^all_γ. The proposed move is accepted with the usual independence-sampler M-H probability, where β_γ denotes the current value of the chain. The proposal distribution of β_0 is q(β_0) = N(β̂_0, ψ σ̂²_{β_0}), with β̂_0 and σ̂_{β_0} being the respective ML estimate of β_0 and the standard error of β_0 from the null model with response data y*. The proposed move is accepted with the usual M-H transition probability, where the likelihood of the reference model is raised to the power of 1/ψ. Note that no specific fine-tuning is required for the proposal distributions of β_γ and β_0.
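The Laplace approximation used above is, in one dimension, the generic recipe sketched below (our own illustrative helper, not the paper's implementation: it locates the mode by Newton's method on numerical derivatives and matches a Gaussian at the mode):

```python
import math

def laplace_log_marginal(log_post, b0=0.0, tol=1e-10):
    """Approximate log ∫ exp(log_post(b)) db by
    log_post(b*) + 0.5*log(2*pi) - 0.5*log(-h), where b* is the mode and
    h is the second derivative of log_post at b* (both found numerically)."""
    b, eps = b0, 1e-5
    for _ in range(100):
        g = (log_post(b + eps) - log_post(b - eps)) / (2 * eps)
        h = (log_post(b + eps) - 2 * log_post(b) + log_post(b - eps)) / eps ** 2
        step = g / h
        b -= step                       # Newton step toward the mode
        if abs(step) < tol:
            break
    h = (log_post(b + eps) - 2 * log_post(b) + log_post(b - eps)) / eps ** 2
    return log_post(b) + 0.5 * math.log(2 * math.pi) - 0.5 * math.log(-h)
```

For a Gaussian integrand the approximation is exact, which provides a convenient check; for a powered GLM likelihood f(y* | β)^{1/δ} the curvature scales like J/δ, which is where the (2πδ)^{d_γ/2} factor of the simplified Jeffreys-baseline form comes from.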
Finally, for the generation of the imaginary data we propose candidate values y*' from a proposal distribution q(y*') and accept the proposed move with probability

α = min{ 1, [ f_γ(y*' | β_γ)^{1/δ} f_0(y*' | β_0)^{1/ψ} m^N_γ(y* | δ) q(y*) ] / [ f_γ(y* | β_γ)^{1/δ} f_0(y* | β_0)^{1/ψ} m^N_γ(y*' | δ) q(y*') ] },

where the marginal likelihood estimates are obtained through (3.7) and y* denotes the current value of the chain. The joint proposal density is formed by the product of independent distributions, i.e. q(y*) = ∏_{i=1}^{n*} q(y*_i), where the proposal of each imaginary observation y*_i is constructed by combining the two likelihood components of the PEP prior. Hence, for the logistic regression model we use a binomial proposal whose success probability combines the fitted probabilities of the reference model M_0 and of model M_γ, where N_i denotes the number of trials of the observed data. Similarly, for Poisson regression models under the CR-PEP prior we consider a Poisson proposal whose mean combines λ_0 = exp(β_0) and λ_γ(i) = exp(X_γ(i) β_γ). For the DR-PEP prior, the corresponding choice of a Poisson proposal with mean (λ_0 λ_γ(i))^{1/δ} was not found to be efficient in practice. Therefore, we use instead a Poisson random-walk proposal with mean equal to the value of y*_i at the current iteration. A complete and thorough description of the PEP-GVS algorithm as implemented in this work is provided in algorithmic form in the electronic appendix of this paper.
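The Poisson random-walk proposal for a single imaginary count can be sketched as follows (an illustrative stand-alone step of our own, not the paper's exact implementation: `log_target` stands for the log of the full conditional of y*_i, and a small floor keeps the proposal mean positive when the current count is zero):

```python
import math
import random

def rpois(lam, rng):
    """Knuth's Poisson sampler; adequate for the small means used here."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def poisson_rw_step(y_cur, log_target, rng):
    """One M-H update for an imaginary count: propose y' ~ Poisson(mean =
    current value) and correct for the asymmetric proposal densities."""
    def log_dpois(k, lam):
        return k * math.log(lam) - lam - math.lgamma(k + 1)
    mean_fwd = max(y_cur, 0.5)          # floor so the Poisson mean stays > 0
    y_prop = rpois(mean_fwd, rng)
    mean_back = max(y_prop, 0.5)
    log_a = (log_target(y_prop) - log_target(y_cur)
             + log_dpois(y_cur, mean_back) - log_dpois(y_prop, mean_fwd))
    return y_prop if math.log(rng.random()) < log_a else y_cur
```

Because the proposal mean depends on the current state, the two `log_dpois` terms in the acceptance ratio are required for detailed balance.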

Hyper-δ extensions
The initial PEP prior for the normal regression model can be interpreted as a mixture of g-priors where the power parameter δ is equivalent to g and the mixing density is the prior predictive of the reference model (Fouskakis et al., 2015). Thus, under the PEP approach we assign a hyper-prior to the imaginary data y*, rather than to the variance multiplier, i.e. the power parameter δ. As discussed in Section 2.1, the same representation holds asymptotically in the GLM setting given a flat baseline prior. From this perspective, a natural extension of the PEP methodology arises by introducing an extra hierarchical level to the model formulation via the assignment of a hyper-prior on δ. Under this approach one can estimate the power parameter instead of setting it a priori equal to a fixed predefined value. It should be noted, however, that when δ is not fixed at n*, the PEP priors lose their unit-information interpretation.
The hyper-δ CR/DR-PEP priors can be approximately expressed as under the baseline prior π N γ (β γ ) ∝ 1; where ψ = {1, δ}, β * γ is the ML estimate given the imaginary data, J * γ β * γ is given in (2.13) and f N dγ (•) denotes the d γ -dimensional multivariate normal distribution.Sensible options for π(δ) are the hyper-g analogues proposed in Liang et al. (2008).Specifically, we consider the hyper-δ prior which corresponds to a Beta 1, a 2 − 1 for the shrinkage factor δ 1+δ .Thinking in terms of shrinkage, Liang et al. (2008) propose setting a = 3 in order to place most of the probability mass near 1 or a = 4 which leads to a uniform prior.An alternative option is the hyper-δ/n prior given by In principle, any other prior from the related literature can be incorporated in the PEP design; for instance, the inverse-gamma hyper prior of Zellner and Siow (1980) or the recent g-prior mixtures proposed by Maruyama and George (2011) and Bayarri, Berger, Forte and García-Donato (2012).Of course, when working outside the context of the normal linear model the integration in (4.1) with respect to δ will not be tractable.Therefore, in order to incorporate the stochastic nature of δ we need to introduce one additional MCMC sampling step.In this case the augmented posterior is given by where the first quantity in the right-hand side of (4.4) is given in (3.5).The corresponding full conditionals we wish to sample from are Looking at the above expressions, a subtle point is that δ is not directly linked to the actual data y; however, it is linked indirectly via the posterior values of the parameters of models M γ (for both approaches) and M 0 (for the DR-PEP prior).Sampling from (4.5) or (4.6) is achieved by adding one simple step (after Step 7) in the PEP-GVS algorithm described in Section 3.1.Specifically, we use a random walk M-H step where we propose a candidate value δ from which has mean equal to the current value δ and variance s 2 δ .The latter is a tuning parameter 
which can be specified appropriately in order to achieve an acceptance rate between 0.2 and 0.5, as recommended by Roberts and Rosenthal (2001). The value s 2 δ = δ proved to be efficient in the examples presented in Section 5. Given this proposal, the candidate δ′ is accepted with probability α δ = min(1, A δ ), with A δ given by an expression in which ψ = ψ′ = 1 for the CR-PEP prior and ψ = δ, ψ′ = δ′ for the DR-PEP prior. This acceptance probability is derived from the Laplace approximation of the marginal likelihood presented in (3.7), keeping only the terms that involve δ. The analytic description of the PEP-GVS algorithm found in the electronic appendix includes the additional sampling step discussed here.

Illustrative examples
In this section we first present a simulation study for logistic and Poisson regression, taking into account independent and correlated predictors as well as different levels of sparsity for the true model. We proceed with a simulation study for logistic models where the number of predictors is larger and the correlation structure is more complicated. The section concludes with a real data example for binary responses.
In all illustrations we consider the CR-PEP and DR-PEP priors (introduced in Section 2.2) and their hyper-δ and hyper-δ/n extensions (presented in Section 4) with parameter a = 3, which is one of the main options proposed in Liang et al. (2008). For all PEP prior configurations we consider n* = n and X* γ = X γ , where the columns of the design matrix are centered around their corresponding sample means. For fixed δ we consider the default unit-information approach, i.e. δ = n*. Jeffreys' prior for GLMs, given in (2.26), is used as the baseline prior for β γ .
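The hyper-δ prior with a = 3 used throughout these illustrations can be sampled from via its Beta shrinkage representation: draw the shrinkage factor t = δ/(1 + δ) from a Beta(1, a/2 − 1) distribution and invert. The snippet below is a minimal sketch of that representation, not code from the paper.

```python
import random

def sample_hyper_delta(a=3.0, rng=random):
    """Draw delta whose shrinkage factor delta/(1+delta) ~ Beta(1, a/2 - 1)."""
    t = rng.betavariate(1.0, a / 2.0 - 1.0)  # shrinkage factor in (0, 1)
    return t / (1.0 - t)                     # invert t = delta / (1 + delta)

# With a = 3 the shrinkage factor is Beta(1, 1/2), placing most mass near 1
# (i.e. favouring large delta); a = 4 gives a uniform shrinkage factor.
draws = [sample_hyper_delta(a=3.0) for _ in range(1000)]
```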
We compare the PEP variants with standard g-prior methods, using the GLM g-prior formulation of Sabanés Bové and Held (2011) for the parameters of the predictor variables and a flat improper prior for the intercept term. In particular, we consider the unit-information g-prior (with g = n) and three mixtures of g-priors; namely, the hyper-g and hyper-g/n priors with a = 3 (Liang et al., 2008), and the beta hyper-prior proposed by Maruyama and George (2011). Henceforth, the latter will be referred to as MG hyper-g. Stochastic model search under these approaches is also implemented via GVS sampling.

Simulation study 1
In this first example we consider two simulation scenarios, for logistic and Poisson regression, presented in Hansen and Yu (2003) and Chen et al. (2008), respectively. Both of these scenarios are also considered by Li and Clyde (2015). The number of predictors is p = 5 in the logistic model and p = 3 in the Poisson model, where each predictor is drawn from a standard normal distribution with pairwise correlations governed by a common parameter r. Concerning the correlations between covariates we examine two cases: (i) independent predictors (r = 0) and (ii) correlated predictors (r = 0.75). In addition, four sparsity scenarios are assumed; the true data-generating models are summarized in Table 1. For the logistic case we use the same sample size as in Hansen and Yu (2003), namely n = 100, but with smaller effects in order to reflect more realistic values of odds ratios. Given the coefficients in Table 1, the odds ratios are approximately equal to 2, 2.5 and 3.5 for the sparse, medium and full models, respectively. For the Poisson simulation we use the same regression coefficients as in Chen et al. (2008), but with sample size equal to n = 100. Each simulation is repeated 100 times.
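A data-generating step of this kind can be sketched as follows. The equicorrelated structure (via a shared latent factor) and the coefficient vector are illustrative placeholders; the study's actual correlation structure and the true coefficients come from the cited references and Table 1.

```python
import math
import random

def simulate_logistic(n=100, p=5, r=0.0, beta0=0.1, beta=None, rng=random):
    """Generate standard-normal predictors with common pairwise correlation r
    (shared-factor construction) and binary responses from a logistic model."""
    beta = beta or [0.7, 0.7, 0.0, 0.0, 0.0]   # illustrative 'sparse' truth
    X, y = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                    # shared factor induces corr r
        x = [math.sqrt(r) * z + math.sqrt(1 - r) * rng.gauss(0, 1)
             for _ in range(p)]
        eta = beta0 + sum(b * xi for b, xi in zip(beta, x))
        y.append(1 if rng.random() < 1 / (1 + math.exp(-eta)) else 0)
        X.append(x)
    return X, y

X, y = simulate_logistic(n=100, r=0.75)
```

Each x_j has unit variance (r + (1 − r) = 1) and any two predictors share correlation r, matching the two cases r = 0 and r = 0.75 above.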
Since the number of predictors in both regression models is small, we assign a uniform prior on the model space, as given in (2.27). Results based on the frequency with which the maximum a-posteriori (MAP) model identifies the true data-generating model in the logistic regression simulation are summarized in Table 2. The comparison between the PEP prior approaches and the rest of the methods indicates the following: i) Overall, the PEP-based variable selection procedures perform well, since in 5 out of the 8 simulation scenarios the "best" prior for identifying the true model is at least one of the PEP priors.
ii) The PEP procedures outperform all methods under the null and sparse simulation scenarios.
iii) Under the medium model scenario the PEP priors perform as well as the rest of the methods in the case of independent predictors and slightly worse in the case of correlated predictors.
iv) Under the full model scenario the g-prior methods perform better than the PEP priors. This is no surprise, as PEP priors tend to support more parsimonious solutions in general. Comparing the fixed-δ approach with the hyper-δ and hyper-δ/n extensions, we see that under the DR-PEP approach the results are more or less the same in terms of MAP model success patterns. However, this is not the case under the CR-PEP approach, as the hyper-δ prior supports more complex models than the fixed-δ prior, while the hyper-δ/n prior lies somewhere in the middle. Interestingly, a similar pattern is observed among the g-prior and the hyper-g and hyper-g/n priors. This pattern is identified more clearly by examining the resulting posterior inclusion probabilities; the corresponding boxplots under each method and simulation scenario are presented in Figure 1 for the case of independent predictors and in Figure 2 for the case of correlated predictors. As we can see, the DR-PEP design is quite robust with respect to the choice between fixed and random δ. Also, within the category of g-prior mixtures, the MG hyper-g prior seems to have the strongest shrinkage effect. The MAP-model results from the Poisson simulations are presented in Table 3. Boxplots of posterior inclusion probabilities under each method and simulation scenario are presented in Figure 3 (for independent predictors) and in Figure 4 (for correlated predictors). Overall, the conclusions are similar to the logistic case. Specifically, looking at the differences between the PEP priors and all versions of g-priors, we conclude the following: i) The PEP procedures perform satisfactorily overall; 6 out of the 8 best MAP success patterns are attained by one of the PEP-based methods.
ii) The PEP procedures perform well overall under sparse conditions, i.e. under the null or sparse models.
iii) For the medium complexity scenarios, the hyper-g and hyper-δ CR-PEP priors yield the best results; however, under the correlated predictors scenario the true model is rarely traced by any method.
iv) For the full model with independent covariates, the MAP success rates under all methods are low; the hyper-g prior has the highest rate, with the hyper-δ CR-PEP prior being close and rather competitive. For the full model with correlated covariates, all methods fail; the hyper-δ CR-PEP prior has the highest success rate, which is only 3%.
With respect to the various PEP prior distributions, the comparison in the Poisson case leads to the same findings as in the logistic regression case. Again, the most interesting finding is that inference under the DR-PEP prior is not affected by the choice of fixed versus random δ. On the contrary, this is not the case for the CR-PEP prior, where the hyper-δ extension systematically supports more complex models. To a lesser extent, the same holds for the CR-PEP hyper-δ/n prior.
As a final remark, we note that all priors yield lower MAP success rates under the null scenario of the logistic simulations compared to the corresponding rates observed in the Poisson simulations. On the other hand, under the full scenario, the MAP success rates are higher in the logistic simulations. This can be attributed to the fact that the

Simulation study 2
In this illustration we consider a more sophisticated scenario with p = 10 potential predictors (1024 models) and a more intriguing correlation structure. Similarly to Nott and Kohn (2005), the first five covariates are generated from a standard normal distribution, while the remaining five covariates, for i = 1, ..., n and j = 6, ..., 10, are generated conditionally on the first five, thus inducing correlation between the two groups. We assume a sample size of n = 200 and consider the three logistic regression data-generating models summarized in Table 4; the resulting odds ratios for the sparse and dense simulation models are approximately equal to 2 and 3, respectively. Each simulation is repeated 100 times.
In this example we use the beta-binomial prior (2.28) with a Beta(1, 1) mixing distribution; see Scott and Berger (2010). The comparison that follows is based on the posterior inclusion probability of each covariate. Figures 5, 6 and 7 present boxplots of the posterior inclusion probabilities from the 100 simulated data sets for the null, sparse and dense simulation scenarios, respectively.
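Given the GVS output of model indicators γ, posterior inclusion probabilities are simply the ergodic averages of each inclusion indicator across MCMC iterations; the sketch below shows the computation on a toy set of draws.

```python
def inclusion_probabilities(gamma_draws):
    """Estimate posterior inclusion probabilities from GVS output.

    gamma_draws: list of 0/1 lists, one inclusion vector per MCMC iteration.
    Returns the per-covariate average, i.e. estimates of P(gamma_j = 1 | y).
    """
    n_draws = len(gamma_draws)
    p = len(gamma_draws[0])
    return [sum(g[j] for g in gamma_draws) / n_draws for j in range(p)]

# e.g. three draws over four covariates
probs = inclusion_probabilities([[1, 0, 1, 0],
                                 [1, 1, 0, 0],
                                 [1, 0, 1, 0]])
# probs == [1.0, 1/3, 2/3, 0.0]
```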
Under the null scenario, all priors except the hyper-g prior exhibit strong shrinkage on the inclusion probabilities. Under the hyper-g prior, relatively large posterior inclusion probabilities with high variability across different samples are obtained. The hyper-δ CR-PEP prior also induces more variability; however, the resulting inclusion probabilities under this method are considerably lower than those obtained under the hyper-g prior.
Under the sparse simulation scenario (true model: X 3 + X 7 + X 10), there are no striking differences among methods. All priors provide very strong support for the inclusion of X 7 and sufficient support for the inclusion of X 3, although the variability under the PEP priors is larger for the latter variable. Moreover, all methods yield very wide posterior inclusion probability intervals for predictor X 10, implying high posterior uncertainty concerning the inclusion of this variable. For the non-important variables we observe that the fixed-δ CR-PEP and the DR-PEP priors yield the lowest posterior inclusion probabilities.
Finally, in the dense simulation scenario (Figure 7), where the true model is X 1 + X 3 + X 5 + X 6 + X 7 + X 8 + X 9, the fixed-δ PEP priors generally outperform the other methods in terms of providing low posterior inclusion probabilities for the insignificant covariates X 2, X 4 and X 10. The g-prior and the hyper DR-PEP extensions yield similar posterior inclusion probabilities and generally perform well; however, they introduce some uncertainty concerning the inclusion of covariate X 4. The rest of the methods systematically support more complex models, as they provide elevated support for the inclusion of variables X 2 and X 4.

A real data example
In our last example we consider the Pima Indians diabetes data set (Ripley, 1996), which has been analyzed in several studies (e.g. Holmes and Held, 2006; Sabanés Bové and Held, 2011). The data consist of n = 532 complete records on diabetes presence (present = 1, not present = 0) according to the WHO criteria for signs of diabetes. The presence of diabetes is associated with p = 7 potential covariates, which are listed in Table 5.
For each method we used 41000 iterations of the GVS algorithm, discarding the first 1000 as a burn-in period. We assigned a beta-binomial prior on the model space (see Eq. 2.28). Table 6 shows the posterior inclusion probabilities of each covariate under the various methods. For comparison with the results presented in Sabanés Bové and Held (2011), we also include in Table 6 the resulting posterior inclusion probabilities from the Zellner and Siow (1980) inverse-gamma (ZS-IG) prior, the hyper-g/n prior with a = 4, and a non-informative inverse-gamma (NI-IG) hyper-g prior with shape and scale equal to 10 −3 . As can be seen, the posterior inclusion probabilities that we obtain from the GVS algorithm are in agreement with the results presented in Sabanés Bové and Held (2011).
For the covariates X 1 , X 2 , X 5 and X 6 , which seem to be highly influential, the results in Table 6 show no significant differences among methods. On the contrary, the posterior inclusion probabilities for the "uncertain" covariates X 3 , X 4 and X 7 vary substantially; specifically, the inclusion probabilities from the fixed-δ CR/DR-PEP priors, the hyper-δ/n DR-PEP prior and the g-prior are considerably lower than the inclusion probabilities resulting from the rest of the methods. In terms of the shrinkage factors g/(g + 1) and δ/(δ + 1), the results show that the shrinkage effect is strongest when g or δ is fixed, which consequently has the strongest effect on posterior inclusion probabilities, followed by the hyper-g/n and the hyper-g priors. Similarly, posterior inclusion probabilities under the hyper-δ/n PEP priors are lower than those resulting from the hyper-δ PEP priors. In addition, the DR design leads to a more stringent control of variable inclusion relative to the CR prior setup.
Figure 9 depicts the convergence and the estimated posterior distribution of the shrinkage parameter δ/(1 + δ) under the four PEP hyper-prior approaches. The posterior histograms are indicative of the behavior of the shrinkage parameter. Comparison between the hyper-δ (Figure 9a) and the hyper-δ/n (Figure 9b) approaches shows that the posterior distribution of the shrinkage parameter under the latter priors is more concentrated on values close to one, thus resulting in a stronger shrinkage effect. Also, the histograms in Figures 9a and 9b indicate that the posterior distributions of the shrinkage parameter under DR-PEP are more concentrated near one in comparison to the corresponding posteriors under CR-PEP. Note that the shrinkage under the fixed-δ approaches is constant and equal to 0.998, which leads to considerably lower posterior inclusion probabilities, as seen in Table 6 and Figure 8.
We conclude this example by examining the out-of-sample predictive accuracy of the various prior setups. We randomly split the data set in half in order to create a training sample and a test sample and identified the corresponding MAP and median probability models, under all methods, from the training data set. Then, based on posterior samples from the predictive distribution, we calculated the average percentages of false positives and false negatives for the test data set; the results are reported in Table 7. Overall, we cannot say that there is a dominant method in terms of predictive accuracy, as the predictions are more or less the same across the prior setups. We may note, however, that the most complex MAP model arises from the hyper-g prior, which also results in the highest false negative prediction rates. Also, the unit-information g-prior, the CR-PEP prior with fixed δ, and the DR-PEP priors lead to the most parsimonious median probability model. This model is comparable in terms of predictive performance with the model that further includes X 7 , which is indicated as the median probability model by the rest of the methods.
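The train/test evaluation described above can be sketched generically as follows. The helpers are illustrative: the posterior predictive probabilities would come from the fitted MAP or median probability model, which is not reproduced here.

```python
import random

def split_half(data, rng=random):
    """Random half/half split into training and test samples."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    half = len(idx) // 2
    return [data[i] for i in idx[:half]], [data[i] for i in idx[half:]]

def false_rates(y_test, prob_test, threshold=0.5):
    """False positive / false negative percentages for binary predictions.

    prob_test: posterior predictive probabilities of y = 1 on the test set.
    """
    preds = [1 if p > threshold else 0 for p in prob_test]
    pos = sum(1 for yt in y_test if yt == 1)
    neg = len(y_test) - pos
    fp = sum(1 for yt, yp in zip(y_test, preds) if yt == 0 and yp == 1)
    fn = sum(1 for yt, yp in zip(y_test, preds) if yt == 1 and yp == 0)
    return (100.0 * fp / max(neg, 1), 100.0 * fn / max(pos, 1))
```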

Discussion
In this paper we presented an objective and automatic Bayesian procedure, compatible across competing models, with application to the variable selection problem in GLMs. Specifically, we extended the PEP prior formulation through the use of unnormalized power-likelihoods and defined two new PEP priors, called CR-PEP and DR-PEP, which differ with respect to the definition of the prior predictive distribution of the reference model. Under the new definitions, the applicability of the PEP methodology is significantly enhanced. Although we focused on variable selection for GLMs, the CR/DR-PEP priors proposed here may in principle be used in any general model setting. At the same time, the new approaches retain the desired features of the original prior formulation; specifically, i) they resolve the problem of selecting and averaging across minimal imaginary samples, thus also allowing for large-sample approximations, and ii) they are minimally informative, as they scale down the effect of the imaginary data on the posterior distribution. We further studied the assignment of hyper-prior distributions to the power parameter δ that controls the contribution of the imaginary data. Following the hyper-g and hyper-g/n priors proposed in Liang et al. (2008), we effectively introduced the hyper-δ and hyper-δ/n analogues.
The empirical results presented in this paper suggest that the proposed PEP priors outperform mixtures of g-priors in terms of inducing stronger shrinkage on the inclusion probabilities of non-influential or partially influential predictors, thus leading to more parsimonious solutions with comparable predictive accuracy. When comparing PEP priors with fixed (δ = n) and random δ, the results indicate that the former approach induces more stringent control over the inclusion of predictors. Therefore, fixed-δ PEP priors support simpler models, which is a desirable feature when the number of covariates is large. Concerning the choice between the CR and DR prior setups, we conclude in favour of the latter, since it is rather robust with respect to the fixed versus random specification of δ.
In the near future, we aim to investigate further the theoretical properties of the PEP extensions. So far we have proofs, available in an earlier technical report (Perrakis, Fouskakis and Ntzoufras, 2015b), that both extensions result in model selection consistency for the case of the Gaussian linear model. We intend to extend this work within the formal GLM framework, possibly overcoming the problem of analytical intractability through Laplace-based approximations. In addition, we are currently working on extensions of the PEP methodology to high-dimensional problems, including the small-n, large-p case, by incorporating shrinkage priors (e.g. ridge and LASSO procedures) into the PEP design. Finally, another promising alternative is to embed the expectation-maximization variable selection approach of Ročková and George (2014) within the PEP prior.
Electronic Appendix for the paper entitled "Power-Expected-Posterior Priors in Generalized Linear Models" by D. Fouskakis, I. Ntzoufras and K. Perrakis.
C. Repeat the steps in B until convergence.

Suggested proposals for Step 6: For y* we recommend the following proposals, depending on the likelihood of the model and on the PEP prior that is used:
i) For logistic regression, a product binomial proposal distribution in which the success probabilities combine π 0 = (1 + exp(−β 0 )) −1 and π γ(i) = (1 + exp(−X γ(i) β γ )) −1 through the powers 1/ψ and 1/δ, normalized by π 0^{1/ψ} π γ(i)^{1/δ} + (1 − π 0 )^{1/ψ} (1 − π γ(i) )^{1/δ}, where N i denotes the number of trials of the observed data.
ii) For Poisson regression, a product Poisson proposal distribution given by q(y*) = ∏_{i=1}^{n*} Pois(λ* i ).
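For the Poisson case, drawing the imaginary data from the product proposal q(y*) = ∏ Pois(λ* i ) can be sketched as below. The inverse-CDF sampler is just a dependency-free stand-in (fine for moderate rates), and the rates λ* i would come from the current model parameters in the actual algorithm.

```python
import math
import random

def sample_poisson(lam, rng=random):
    """Inverse-CDF draw from Poisson(lam); adequate for moderate rates."""
    u = rng.random()
    k, p = 0, math.exp(-lam)   # p = P(K = 0)
    cdf = p
    while u > cdf:             # walk up the CDF until it exceeds u
        k += 1
        p *= lam / k           # P(K = k) from P(K = k - 1)
        cdf += p
    return k

def propose_imaginary_poisson(rates, rng=random):
    """Draw y* from the product Poisson proposal q(y*) = prod_i Pois(lam*_i)."""
    return [sample_poisson(lam, rng) for lam in rates]
```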
Table 2: Number of simulated samples (over 100 replications) in which the MAP model coincides with the true model in the logistic case of Simulation Study 1 (row-wise largest value in bold).
Table 3: Number of simulated samples (over 100 replications) in which the MAP model coincides with the true model in the Poisson case of Simulation Study 1 (row-wise largest value in bold).

Figure 8: Boxplots of batched estimates of the posterior inclusion probabilities for the seven predictors in the Pima Indians data set based on 40 batches of size 1000.
Figure 5: Posterior inclusion probabilities for Simulation Study 2 under the various priors from 100 repetitions of the null logistic simulation scenario.
Figure 6: Posterior inclusion probabilities for Simulation Study 2 under the various priors from 100 repetitions of the sparse logistic simulation scenario where the true model is X 3 + X 7 + X 10.
Figure 7: Posterior inclusion probabilities for Simulation Study 2 under the various priors from 100 repetitions of the dense logistic simulation scenario where the true model is X 1 + X 3 + X 5 + X 6 + X 7 + X 8 + X 9.

Table 7: Percentages of false negative and false positive detections for the Pima Indians data set under the MAP model and the median probability model (MPM) for the various priors.