Prior Distributions for Objective Bayesian Analysis

We provide a review of prior distributions for objective Bayesian analysis. We start by examining some foundational issues and then organize our exposition into priors for: i) estimation or prediction; ii) model selection; iii) high-dimensional models. With regard to i), we present some basic notions, and then move to more recent contributions on discrete parameter space, hierarchical models, nonparametric models, and penalizing complexity priors. Point ii) is the focus of this paper: it discusses principles for objective Bayesian model comparison, and singles out some major concepts for building priors, which are subsequently illustrated in some detail for the classic problem of variable selection in normal linear models. We also present some recent contributions in the area of objective priors on model space. With regard to point iii) we only provide a short summary of some default priors for high-dimensional models, a rapidly growing area of research.


Objective Bayes methods
In many situations a researcher is not able to translate his or her prior opinion into a prior distribution. This may happen, for example, in complex applications, where the parameter space has large dimension and a genuine elicitation of the prior dependence structure among the parameters can be out of reach. In other cases, only very limited knowledge of the problem at hand is available, and one would like to encapsulate prior ignorance into a probability distribution. In both cases, it would be helpful to use a noninformative prior in order to make Bayes' theorem work, without introducing subjective inputs into the analysis. In recent decades this has been something like the search for the "philosopher's stone" for the Bayesian community. However, using Savage's words, as reported in Kass and Wasserman (1995), ". . . it has proved impossible to give a precise definition of the tempting expression 'know nothing.'" The focus subsequently moved to the search for priors with a minimal impact on the corresponding posterior analysis, an important motivation for scientific communication. These priors have been named in many different, sometimes misleading, ways, from vague to objective, from default to noninformative or reference. Each of these terms describes a different aspect of the same problem, and Objective Bayes (OB, hereafter) has emerged as a broad term which tries to include all these strands. It is therefore not surprising that Berger (2006) warns his readers upfront that "there is no unanimity as to the definition of OB analysis, not even on its goals". We believe that after more than ten years this conclusion is still fair.
If we disregard goals, and rather focus on implementation issues, a commonly held view is that an OB method should only use the information contained in the statistical model, and no other external information (Bayarri and García-Donato, 2007); see, however, Leisen et al. (2017) for a radically different view. The above view of "objectivity" presupposes that a model has a different theoretical status relative to the prior: it is the latter which encapsulates the subjective uncertainty of the researcher, while the model is less debatable, possibly because it can usually be tested through data. Another justification is offered by the subjective-predictive approach to inference, as explicated in de Finetti's theory; see Bernardo and Smith (1994, Ch. 4) for an accessible introduction. At first sight this might look surprising, because in the celebrated representation theorem for exchangeable random variables both the model and the prior originate from a unique (subjective) predictive distribution, so that they seem to stand on an equal footing. Dawid (1982) however, in an insightful paper, clarifies how a philosophical distinction between model and prior can be drawn, even within the subjective paradigm, with the former representing a common "intersubjective" component, and the latter being specific to each individual. As an illustration, consider a sequence of 0/1 random variables. While each subject may have a distinct predictive opinion on sequences of such random variables, the very fact that each predictive distribution satisfies the condition of exchangeability implies that all subjects will share the same statistical model (product of i.i.d. Bernoulli laws in this case), while their disagreement will be confined to the distribution of the random probability of success, indexing the statistical model. 
Representation theorems for exchangeable processes beyond the 0/1 case are of course available, with a similar pattern emerging, although some further structural assumptions are needed to nail down a common intersubjective statistical model among different subjects; see again Bernardo and Smith (1994, Ch. 4).
Even if we take for granted a given statistical model, the actual implementation of any OB principle is likely to incorporate, besides the statistical model, some additional context information. This happens for instance in the construction of reference priors (Bernardo, 1979) for a parameter-vector, where the notion of inferential importance of the component parameters is crucial for a correct application of the methodology; see also Section 2. Another notable case is represented by the inferential "goal" of the analysis where the OB prior will be employed. We will argue below that a useful distinction is between priors for estimation (including prediction) and for model selection; again context matters.
In the end, our view of what constitutes an OB analysis is unavoidably pragmatic. First of all, we firmly believe that OB and subjective Bayesian analysis should complement each other, the former being helpful in particular scenarios (prior elicitation is too hard, or time consuming, or for reference analysis in scientific reporting). Subjective analysis is still a great resource, especially in applications where information about context is available and can be meaningfully incorporated. Secondly, the quality of an OB method should be judged both in terms of its theoretical foundations, and on the correspondence it exhibits to actual Bayesian procedures; see Berger and Pericchi (2001).
A communication problem with the OB approach is that the word "objective" is loaded with many interpretations and expectations. This has led Gelman and Hennig (2017) to propose a radically different approach to the subjective versus objective debate in Statistics, which actually transcends the Bayesian approach. They argue that "the words 'objectivity' and 'subjectivity' in statistical discourses are used in a mostly unhelpful way, and [. . . ] propose to replace each of them with broader collections of attributes, with objectivity replaced by transparency, consensus, impartiality, and correspondence to observable reality, and subjectivity replaced by awareness of multiple perspectives and context dependence". The advantage of their reformulation is that the replacement terms do not oppose each other, but rather complement each other, not just from a practical viewpoint, but also from a conceptual one.
We will distinguish between priors for estimation (and prediction) purposes within a given model, and priors for model selection (or comparison), where a collection of models is entertained. This distinction however is currently challenged in the analysis of high-dimensional problems characterized by a huge number of parameters and models, where sparsity-inducing priors are devised for the dual purpose of selection and estimation. In this review we will mostly focus on priors for model selection, and especially priors on the parameter space of each entertained model. One reason for this choice is that research on objective priors for estimation/prediction has a long tradition and, accordingly, it has received considerable attention over the past years; see in particular the excellent reviews by Kass and Wasserman (1995) and Ghosh (2011). On the other hand, the OB methodology for priors tailored to model selection started more recently, and its development and applications to various models have increased over the last few years, so that they could not be included in previous reviews such as Berger and Pericchi (1996), Berger and Pericchi (2001), and Pericchi (2005).

Prior distributions for estimation and prediction
"Noninformative prior" has been, for many years, the most common name for indicating any kind of prior which was proposed in an attempt to prepare the Bayesian omelette without breaking the Bayesian eggs (Savage, 1954); that is, to obtain probabilistic likelihood-based inferences without relying on informative prior distributions. For the sake of brevity, here we cannot review the long history of the selection of objective priors in Bayesian inference. The interested reader can refer to Kass and Wasserman (1996) and Ghosh (2011). Here we limit ourselves to list the most well-known existing methods and to discuss the most recent advances.
i. Uniform prior. Based on a somewhat misinterpreted principle of indifference, one can use a prior for a scalar (continuous) parameter which assigns equal probabilities to intervals having the same length. However, a uniform prior is not invariant under re-parametrization, and in many real cases there is no natural parametrization for a given model (Jaynes, 2003). In addition, a uniform prior on an unbounded parameter space is improper (i.e. its total mass is not finite). There is then no guarantee that the posterior will be proper, and propriety must be checked case by case.
ii. Invariant prior. The lack of invariance of the uniform prior has led many researchers to look for objective priors which are invariant under a certain class of transformations.
Let $(P, \Theta)$ be a statistical model for the observation $X$, where $P$ is the distribution model (a family of distributions), and $\Theta$ is the parameter associated with it. Let $Y = s(X)$ be a transformation, and suppose that the distribution model for $Y$ is still $P$, and denote by $\Lambda$ the corresponding parameter. Notice that $P$ is unchanged, and therefore we say that the model is invariant to the transformation $s(\cdot)$. If only $P$ is allowed to inform our choice of the prior, then one should require that the prior for $\theta$, $\pi_\theta$, and that for $\lambda$, $\pi_\lambda$, be such that $P_{\pi_\theta}\{\theta \in A\} = P_{\pi_\lambda}\{\lambda \in A\}$ for all sets $A$. This is named context invariance in Dawid (2006), and represents a very strong requirement because it means that it is only the structure of $P$ that matters, irrespective of the context in which it is applied.
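To exemplify, consider a scale model, whose density has the form $f(x \mid \sigma) = \sigma^{-1} h(x/\sigma)$, $\sigma > 0$, for a fixed density $h$. The model is invariant under the transformation $Y = cX$, for all $c > 0$, with $\lambda = c\sigma$ as the new scale parameter. We can imagine $X$ being the price of a commodity measured in $, and $Y$ the corresponding price in Japanese yen. The scale invariance requirement for a prior $\pi$ on $\sigma$ leads to

$$\int_A \pi(\sigma)\,d\sigma = \int_A \pi(c^{-1}\sigma)\,c^{-1}\,d\sigma \quad \text{for all sets } A,$$

whence $\pi(\sigma) = \pi(c^{-1}\sigma)\,c^{-1}$ for all $\sigma$. Setting $\sigma = c$, and noting that the equality must hold for all $c > 0$, one concludes that the only measure which satisfies the requirement is $\pi(\sigma) \propto \sigma^{-1}$, which is improper, although not uniform. It is important to note that this is the right Haar invariant measure on the group of scale transformations. A complete description of the uses of invariance in Bayesian analysis can be found in Berger (1985), Dawid (2006) and Robert (2007).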
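The invariance property can be checked numerically. The following is only an illustrative sketch: the interval endpoints and the conversion factor `c` are hypothetical values chosen for the example. It compares the mass that the improper prior $\pi(\sigma) \propto \sigma^{-1}$ assigns to an interval before and after a change of scale with the mass assigned by an improper uniform prior.

```python
import math

def mass_scale_prior(a, b):
    """Mass assigned to the interval (a, b) by the improper prior pi(sigma) = 1/sigma."""
    return math.log(b / a)

def mass_uniform_prior(a, b):
    """Mass assigned to the interval (a, b) by the improper uniform prior pi(sigma) = 1."""
    return b - a

a, b = 2.0, 5.0   # hypothetical interval for the scale parameter sigma
c = 150.0         # hypothetical conversion factor between the two currencies

# Rescaling by c maps the interval (a, b) for sigma to (c*a, c*b) for lambda = c*sigma.
scale_before, scale_after = mass_scale_prior(a, b), mass_scale_prior(c * a, c * b)
unif_before, unif_after = mass_uniform_prior(a, b), mass_uniform_prior(c * a, c * b)

print("1/sigma prior:", scale_before, scale_after)   # the two masses agree
print("uniform prior:", unif_before, unif_after)     # the two masses differ by a factor c
```

Under the $\sigma^{-1}$ prior the two masses coincide for any choice of `c`, whereas the uniform prior changes the mass by the factor `c`, which is the lack of invariance discussed for the uniform prior.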
iii. Matching prior. The rationale behind this approach is that a noninformative prior should provide inferences which are similar to those obtained from a frequentist perspective, for example in terms of credible versus confidence intervals. In this perspective, a probability matching prior is a prior distribution under which the posterior probabilities of certain regions coincide with their frequentist coverage probabilities, either exactly or approximately; see Datta and Mukerjee (2004) for details.
iv. Maximum entropy prior (Jaynes, 2003). This approach selects the prior which maximizes the entropy over a class of priors satisfying some basic restrictions. In the continuous case, the entropy of a distribution $\pi(\theta)$ is given by

$$\mathrm{Ent}(\pi) = -\int_\Theta \pi(\theta) \log \pi(\theta)\,d\theta,$$

and can be considered a measure of un-informativeness of $\pi(\cdot)$ for $\theta$.
The maximum entropy prior approach is based on the following two steps. First, one chooses a large class $\Gamma$ of potential prior distributions, characterized by a set of $k$ constraints, usually in the form of quantiles or moments; the generic set of constraints can then be written as

$$\int_\Theta g_j(\theta)\,\pi(\theta)\,d\theta = \mu_j, \qquad j = 1, \dots, k,$$

for suitable functions $g_j(\cdot)$ and given constants $\mu_j$. Next the maximum entropy prior is selected as any element in $\Gamma$ maximizing $\mathrm{Ent}(\pi)$.
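As a small illustration of the two steps, the sketch below works in a discrete setting (a six-point support, chosen here only for simplicity; the text above treats the continuous case) with a single moment constraint on the mean. It relies on the standard fact that the maximizer then has the exponential form $p_i \propto \exp(\lambda x_i)$; the function names and the bisection bounds are assumptions of this example.

```python
import math

def maxent_pmf(support, target_mean, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximum entropy pmf on a finite support subject to one mean constraint.

    The solution has the form p_i proportional to exp(lam * x_i); the multiplier
    lam is found by bisection, since the implied mean is increasing in lam.
    """
    def pmf(lam):
        shift = max(lam * x for x in support)          # for numerical stability
        w = [math.exp(lam * x - shift) for x in support]
        z = sum(w)
        return [wi / z for wi in w]

    def mean(lam):
        return sum(x * p for x, p in zip(support, pmf(lam)))

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return pmf(0.5 * (lo + hi))

def entropy(p):
    """Discrete entropy -sum p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

faces = [1, 2, 3, 4, 5, 6]
p_fair = maxent_pmf(faces, 3.5)   # constraint equal to the unconstrained mean
p_tilt = maxent_pmf(faces, 4.5)   # constraint pulling mass towards large values

print([round(p, 4) for p in p_fair], round(entropy(p_fair), 4))
print([round(p, 4) for p in p_tilt], round(entropy(p_tilt), 4))
```

When the constraint coincides with the mean of the uniform distribution the maximizer is uniform; a more demanding constraint tilts the distribution and lowers its entropy.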

v. Jeffreys and reference prior
In practical applications, however, at least before the advent of Markov chain Monte Carlo (MCMC) methods, the vast majority of researchers used Jeffreys' prior (Jeffreys, 1961)

$$\pi^J(\theta) \propto \{\det I(\theta)\}^{1/2},$$

where $I(\theta)$ is the Fisher information matrix, whose generic element $I_{ij}(\theta)$, under very general conditions and assuming a continuous parameter space, is given by

$$I_{ij}(\theta) = -E_\theta\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} \log f(Y \mid \theta)\right],$$

where $E_\theta$ denotes the expected value over the sampling space for a given value of the parameter $\theta$, and $Y$ is an observable random variable with density $f(\cdot \mid \theta)$.
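To make the definition concrete, the following sketch computes the Fisher information for a single Bernoulli observation directly from its definition as the negative expected second derivative of the log-likelihood, using a finite-difference approximation. It then compares the square root with the kernel of a Beta(1/2, 1/2) density. The Bernoulli model, the step size `h` and the function names are choices made for this illustration only.

```python
import math

def fisher_information_bernoulli(theta, h=1e-4):
    """I(theta) = -E_theta[d^2/dtheta^2 log f(Y | theta)] for one Bernoulli trial,
    with the second derivative approximated by a central finite difference."""
    def loglik(y, t):
        return y * math.log(t) + (1 - y) * math.log(1 - t)

    info = 0.0
    for y, prob in ((1, theta), (0, 1.0 - theta)):   # expectation over Y in {0, 1}
        d2 = (loglik(y, theta + h) - 2.0 * loglik(y, theta) + loglik(y, theta - h)) / h**2
        info -= prob * d2
    return info

def jeffreys_kernel(theta):
    """Unnormalized Jeffreys prior: square root of the Fisher information."""
    return math.sqrt(fisher_information_bernoulli(theta))

def beta_half_half_kernel(theta):
    """Kernel of the Beta(1/2, 1/2) density."""
    return theta ** -0.5 * (1.0 - theta) ** -0.5

for theta in (0.1, 0.3, 0.5, 0.8):
    print(theta, jeffreys_kernel(theta), beta_half_half_kernel(theta))
```

The two kernels agree up to finite-difference error, in line with the closed form $I(\theta) = 1/\{\theta(1-\theta)\}$ for this model.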
Besides being parametrization invariant, Jeffreys' prior enjoys many optimality properties in the absence of nuisance parameters. It maximizes the asymptotic divergence between the prior and the posterior for θ, under several different metrics. It is also a second order matching prior (Datta and Mukerjee, 2004) when θ is a scalar.
Although the Jeffreys' prior is probably still the most popular objective prior method among practitioners, it has some potential drawbacks which it is important to discuss. The Jeffreys' prior may be improper and there is no guarantee that the resulting posterior distribution will be proper for all possible data sets: interesting counterexamples may be found in Ye and Berger (1991). Jeffreys himself, in his original proposal, developed the method for the case of a scalar parameter. In the multidimensional case, the use of $\pi^J(\theta)$ may lead to incoherence and paradoxes (Dawid et al., 1973).
Jeffreys also suggested dealing separately with location parameters. If $\theta = (\varphi, \lambda)$, where $\varphi$ is a vector of location parameters, then Jeffreys' proposal is to use a prior proportional to $\{\det I(\lambda)\}^{1/2}$, where $I(\lambda)$ is the Fisher information for $\lambda$ computed keeping $\varphi$ fixed. This prior is called "non-location Jeffreys' prior" in Kass and Wasserman (1996). Another popular variant of the Jeffreys' method is the so-called "independent Jeffreys prior", which is the product of conditional Jeffreys' priors, obtained by computing the Jeffreys prior one parameter at a time with all other parameters considered to be fixed (Robert, 2014). This prior is not invariant with respect to parametrization.
Another serious drawback of the Jeffreys' method for selecting objective priors is that there is no guarantee of a "satisfactory" behavior when the parameter of interest is a low dimensional function ψ(θ) of the entire parameter vector θ. Here "satisfactory" means that, in repeated sampling, the use of the Jeffreys' prior should produce statistical procedures with good frequentist performance; for an interesting and well-known counterexample, see Robert (2007), p. 133. This point is important because it suggests a deeper conclusion: a "good" objective prior for a vector θ may perform unsatisfactorily for a function of the parameter which is of interest. The problem of selecting an objective prior for a specific parameter of interest ψ(θ) in the presence of other nuisance parameters has been one of the main motivations for the development of the so-called reference prior method (Bernardo, 1979; Berger and Bernardo, 1992). The goal of the reference prior approach, introduced by Bernardo (1979), is to find a prior distribution which maximizes a limiting version of the divergence, averaged over the sample space, between the prior and the corresponding posterior for a specific quantity of interest ψ = ψ(θ). The method has been refined and improved in a series of papers (Berger et al., 2012, 2015). The reference prior method has introduced two main innovations in OB thinking: i) the explicit use of the notion of information contained in a statistical experiment, measured in terms of the Shannon-Lindley relative entropy; ii) the necessity of declaring in advance an ordering of inferential importance among the parameters of the model. In fact, for a given statistical model, the reference prior for the parameter vector θ may well depend on that ordering (Berger and Bernardo, 1992). This reinforces the point that OB methods are, in general, context-dependent. Berger et al. (2015) discuss this issue in depth, and argue that there are many situations where having a single, overall objective prior would be desirable. They also propose two methods for achieving this goal.
In the scalar case, under general conditions, the reference prior coincides with Jeffreys' prior, at least when the latter can be calculated. Reference priors show, in general, very good frequentist properties in terms of coverage probability of a Bayesian credible interval. Further details on the methods for constructing priors discussed so far may be found in Kass and Wasserman (1996) or Berger (2006).
The remaining part of this section is devoted to some more recent developments.

Discrete parameter space
When the support of some of the parameters is discrete, traditional OB methods, like Jeffreys' or reference priors, cannot be directly used since they are based on the Fisher information matrix which assumes differentiability with respect to the parameters. It is important to stress that here we are not considering the case when the parameter is a model index, as for instance when it identifies a subset of covariates in a variable selection problem: see Section 3 for more details. We rather consider cases where the parameter is discrete due to the structure of the statistical model. Important examples include the number of degrees of freedom ν in a Student-t sampling model, the unknown population size N in a capture-recapture model, and change-point problems. Berger et al. (2012) discuss in detail several methods to tackle the problem. In particular they propose to embed the discrete parameter into a continuous parameter space and then apply the usual reference methodology. However it is not always clear how to practically perform the embedding. Under particular circumstances, one could add a hierarchical level to the model depending on a continuous hyperparameter, say θ, then find a reference prior for θ and use it to indirectly derive the prior for the discrete parameter.
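Example. The Hypergeometric model. Write the sampling distribution for the observation $R$ as

$$\Pr(R = r \mid M) = \frac{\binom{M}{r}\binom{N - M}{n - r}}{\binom{N}{n}}, \qquad \max(0, n - N + M) \le r \le \min(n, M),$$

where $n$ is the sample size, $N$ is the known population size, and $M \in \{0, 1, \dots, N\}$ is the unknown parameter. If we assume that, given $p$, $M \sim \mathrm{Bin}(N, p)$, it is easy to see that the marginal model is given by

$$R \mid p \sim \mathrm{Bin}(n, p).$$

The natural objective prior for $p$ would be the Jeffreys prior, that is a Beta(0.5, 0.5); the prior for $M$ would then be given by

$$\pi(M) = \binom{N}{M} \frac{B(M + 1/2,\, N - M + 1/2)}{B(1/2,\, 1/2)}, \qquad M \in \{0, 1, \dots, N\},$$

where $B(\cdot, \cdot)$ denotes the Beta function. However, the above situation is not so common and other approaches are discussed in Berger et al. (2012), mainly based on asymptotic arguments.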
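Both claims in this example can be verified numerically. The sketch below uses hypothetical values of $N$, $n$ and $p$ chosen only for illustration; the sample size $n$ is a symbol introduced here. It checks that mixing the hypergeometric distribution over $M \sim \mathrm{Bin}(N, p)$ returns a $\mathrm{Bin}(n, p)$ law, and that the Beta-binomial prior induced on $M$ by $p \sim \mathrm{Beta}(1/2, 1/2)$ is a proper distribution on $\{0, \dots, N\}$.

```python
import math

def hypergeom_pmf(r, n, M, N):
    """Pr(R = r | M): r marked units in a sample of n drawn from N units, M of them marked."""
    if r < 0 or r > n or r > M or n - r > N - M:
        return 0.0
    return math.comb(M, r) * math.comb(N - M, n - r) / math.comb(N, n)

def binom_pmf(k, n, p):
    """Binomial probability mass function."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def prior_M(M, N):
    """Beta-binomial prior on M obtained by mixing Bin(N, p) over p ~ Beta(1/2, 1/2).

    Uses B(1/2, 1/2) = pi (the constant) and works on the log scale.
    """
    log_prior = (math.lgamma(M + 0.5) + math.lgamma(N - M + 0.5)
                 - math.lgamma(M + 1) - math.lgamma(N - M + 1) - math.log(math.pi))
    return math.exp(log_prior)

N, n, p = 20, 6, 0.3   # hypothetical population size, sample size and probability

# Marginalizing M ~ Bin(N, p) out of the hypergeometric model gives Bin(n, p).
marginal = [sum(hypergeom_pmf(r, n, M, N) * binom_pmf(M, N, p) for M in range(N + 1))
            for r in range(n + 1)]
binomial = [binom_pmf(r, n, p) for r in range(n + 1)]

prior = [prior_M(M, N) for M in range(N + 1)]
print(max(abs(a - b) for a, b in zip(marginal, binomial)))
print(sum(prior), prior[0], prior[N // 2])
```

The induced prior on $M$ is symmetric and U-shaped, putting more mass near $0$ and $N$ than in the middle of the range, which mirrors the shape of the Beta(1/2, 1/2) density.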
A radically different approach is discussed in a series of papers by Villa and Walker (2014b, 2015b, 2015a), where the authors propose a general method for producing objective priors starting from the so-called "self-information" loss combined with the notion of the Kullback-Leibler divergence between models. A measure of the information loss associated to an event $E$ having probability $\pi(E)$ is called self-information loss. The most natural one is given by $I(E) = \log(1/\pi(E)) = -\log \pi(E)$. Then, they state a version of Bayes' theorem in terms of losses. In this framework, the formal derivation of the prior distribution for $\theta$ can be expressed as follows. Consider a collection of models $\{f(\cdot \mid \theta)\}$ indexed by a discrete parameter $\theta$. The worth associated to a particular value of $\theta$ is represented by the Kullback-Leibler divergence between the model indexed by $\theta$ and its nearest neighbor. That is,

$$\min_{\theta' \neq \theta} D_{KL}\big(f(\cdot \mid \theta)\,\|\,f(\cdot \mid \theta')\big),$$

where $D_{KL}(f_j \| f_k) = \int f_j(y) \log\{f_j(y)/f_k(y)\}\,dy$. Then, the above quantity represents the negative of the information loss in keeping the value $\theta$ in the parameter space. At $\theta$, the information loss can be also measured in terms of self-information loss. By equating the two expressions, one can derive the objective prior for $\theta$ as

$$\pi(\theta) \propto \exp\Big\{\min_{\theta' \neq \theta} D_{KL}\big(f(\cdot \mid \theta)\,\|\,f(\cdot \mid \theta')\big)\Big\} - 1.$$

More specialized topics related to estimation in discrete parameter spaces are: change-point problems (Girón et al., 2007), exponential families restricted to a lattice (Choirat and Seri, 2012), the degrees of freedom ν of a Student t distribution (Villa and Walker, 2014b), where the new prior is compared with two versions of the Jeffreys' prior proposed in Fonseca et al. (2008), the estimation of the number of trials in binomial and capture-recapture experiments (Villa and Walker, 2014a), and the assessment of objective prior probabilities in a model selection scenario (Villa and Walker, 2015b).
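The loss-based construction of Villa and Walker can be sketched in a few lines for a toy family. The example below is only an illustration under stated assumptions: the family (Poisson distributions whose mean is restricted to a small integer lattice) and the helper names are hypothetical choices. The prior is computed as $\exp\{\min_{\theta' \neq \theta} D_{KL}\} - 1$, normalized over the lattice.

```python
import math

def kl_poisson(a, b):
    """D_KL( Poisson(a) || Poisson(b) ), which is available in closed form."""
    return a * math.log(a / b) + b - a

def loss_based_prior(thetas, kl):
    """Normalized prior with pi(theta) proportional to
    exp{ min over theta' != theta of D_KL(f_theta || f_theta') } - 1."""
    weights = []
    for t in thetas:
        nearest = min(kl(t, s) for s in thetas if s != t)   # divergence to nearest neighbor
        weights.append(math.exp(nearest) - 1.0)
    total = sum(weights)
    return [w / total for w in weights]

thetas = [1, 2, 3, 4, 5, 6]   # hypothetical lattice of Poisson means
prior = loss_based_prior(thetas, kl_poisson)

for t, pr in zip(thetas, prior):
    print(t, round(pr, 4))
```

Values of $\theta$ whose model is well separated from all the others (here the smallest mean) receive more prior mass, since removing them from the parameter space would entail a larger information loss.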

Hierarchical Normal Model
The hierarchical normal model is still a very useful and routinely applied model because of its flexibility and modularity. However the formal derivation of objective priors has proven to be highly challenging. The most basic situation, which we now discuss, has been considered by Berger and several co-authors in a series of papers (Berger and Strawderman, 1996; Berger et al., 2005; Sun et al., 2001).
with the $\varepsilon_i$'s mutually independent with a $N_k(0, \Sigma_i)$ distribution, with $\Sigma_i$ known; for simplicity assume $B_i = I_k$ for all $i$. Also, assume that with $\tau_i \sim N_k(0, V)$. Here the issue is to find objective priors for $(\beta, V)$ with reasonably good properties. This common situation is hardly manageable from either a classical or an Empirical Bayes perspective: even when $k = 1$ the marginal likelihood may provide estimates of $V$ equal to zero! On the other hand, the usual Jeffreys' prior $\pi(V) \propto V^{-1}$ would give an improper posterior, and the problem is only hidden, not solved, if one uses a vague proper inverse gamma prior on $V$ with very small values of the shape and the scale parameters. This issue is discussed in detail in Berger and Strawderman (1996). In general, when an improper prior produces an improper posterior, the use of a vague proper prior does not solve the problem and the posterior distribution will pile up at the boundary of the parameter space, with a dramatic dependence on the values of the hyperparameters.
The problem of finding robust objective priors for this model has been tackled from a different perspective. Given that a formal reference prior cannot be derived, the idea is to leverage the notion of admissibility. Proper priors always provide admissible estimators for $\beta$; also, improper priors may be seen as the limit of appropriate sequences of proper priors. As a consequence, they are at the 'boundary of admissibility'. So, if a given improper prior results in an admissible estimator, it can be considered a valid candidate prior for an objective analysis. For the above situation, Berger et al. (2005) have proposed the following prior with independent components where the $\lambda$'s are the eigenvalues of $V$, and $d$ is the dimension of $\beta$. The admissibility of this prior has been proved by Berger et al. (2005). The above result, although very important, is not easy to extend outside the Gaussian set-up, where a useful characterization of admissibility actually exists (Brown, 1971). An important exception can be found in Spitzner (2005). For the broad class of generalized linear models, two new classes of priors are proposed from an Empirical Bayes perspective. These classes of priors 'correct' the Jeffreys' prior, produce a shrinkage effect on the maximum likelihood estimator and achieve a risk reduction.

Nonparametric models
While this review is focused on objective Bayesian methods for parametric models, it has theoretically some relevance also for Bayesian nonparametric (BNP) methods, because BNP could be more fittingly defined as "massively parametric Bayes" (Müller and Mitra, 2013). In practice however objective BNP methods are far less developed, and one can find a few reasons for this. In principle, one could argue that BNP methods are intrinsically objective in the sense that they use models with very large, if not full, support. In this context, trying to be "objective" also in the choice of the hyperparameters would seem like a daring enterprise. In the BNP literature, the Dirichlet process and its generalizations represent the staple approach to inference. Along this line of research Bush et al. (2010) and Lee et al. (2014) have proposed a minimally informative version of the mixture of Dirichlet process model, in which the size $M$ and the base measure $F_0$ are selected using the concept of local mass. In a broader perspective, one can interpret the extensions of the Dirichlet process, such as the normalized generalized gamma process (De Blasi et al., 2015), as an impulse towards objectivity, or at least towards the construction of more flexible and robust priors, which allow different tail behaviors for some specific functionals of interest. Another link between objective inference and BNP can be found in the search for those prior processes which attain minimax (adaptive) posterior concentration rates (Rivoirard and Rousseau, 2012; Hoffmann et al., 2015).

High-dimensional models
As already hinted in Section 1, current research is progressively developing objective methods which produce proper priors that can be used both in estimation and testing scenarios. One reason is the sheer complexity and dimensionality of the problems involved that make the derivation of a formal objective prior too hard or even impossible.
A second motivation is that objective improper priors for estimation may not guarantee a proper posterior when the number of parameters exceeds the sample size. Actually the difficulty is more acute, because even proper objective priors may lead to posterior distributions which are unsatisfactory from several perspectives. To illustrate this point, consider the following example (Berger et al., 2015). Assume a multinomial experiment with many cells, say $m = 1000$. In the absence of specific quantities of interest, the Jeffreys' and reference priors are both the proper Dirichlet(1/2, . . . , 1/2) prior. However, this prior is not recommended in the presence of sparsity and small sample size $n$. For example, with $n = 3$, assume we observe $x_{111} = 2$, $x_{976} = 1$ and all the other $x_j = 0$. The posterior means would be $E(\theta_i \mid x) = (x_i + 0.5)/(n + 0.5\,m)$, so that $E(\theta_{111} \mid x) = 2.5/503$, $E(\theta_{976} \mid x) = 1.5/503$ and all other parameters have a posterior mean equal to $0.5/503$. Then, cells 111 and 976 together have a total posterior probability of only about 0.008, even though all 3 observations are in these cells. Here the problem is that the prior mass, far from being noninformative, overwhelms the role of the data. We discuss these issues in more detail in Section 4.
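The arithmetic of this example is easy to reproduce. The short sketch below uses the numbers given in the text ($m = 1000$ cells, $n = 3$ observations, Dirichlet parameter $1/2$); the variable and function names are illustrative.

```python
m, n = 1000, 3              # number of cells and sample size, as in the example
counts = {111: 2, 976: 1}   # observed cells; every other cell has count zero
a = 0.5                     # common parameter of the Dirichlet(1/2, ..., 1/2) prior

def posterior_mean(i):
    """Posterior mean of theta_i under the Dirichlet-multinomial model."""
    return (counts.get(i, 0) + a) / (n + a * m)

mass_observed = posterior_mean(111) + posterior_mean(976)
mass_unobserved = 1.0 - mass_observed

print(posterior_mean(111), posterior_mean(976), posterior_mean(1))
print(round(mass_observed, 3), round(mass_unobserved, 3))
```

More than 99% of the posterior mass remains on the 998 cells in which nothing was observed, which is the sense in which the prior overwhelms the data.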

Further contributions
A recent and promising approach has been developed in Simpson et al. (2017), where the main goal is not to derive formal objective priors for a specific model. Rather, the authors aim at identifying those parts of a complex model which require a (hopefully minimal) subjective input, to be elicited in a principled way. Suppose one has a base model $M_0$, characterized by some parameter value $\xi_0$, say $f_0(\cdot \mid \xi_0)$. Then, a richer and more flexible model can be denoted by $f(\cdot \mid \xi)$. In order to characterize the complexity of $f$ compared to $f_0$, one can build a so-called penalizing complexity prior on $\xi$, which depends on a function $d(\xi)$ of the Kullback-Leibler divergence between the base model and the alternative models indexed by $\xi$. The authors propose to derive the prior from a principle of constant rate penalization, which automatically implies an exponential prior on $d(\xi)$. Details, together with a discussion of the advantages and disadvantages of the approach and of its debatable status as an objective method, can be found in Simpson et al. (2017).
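A minimal sketch of the construction is given below for a deliberately simple case, chosen for illustration and not taken from the text: the flexible model is $N(\xi, 1)$ and the base model is $N(0, 1)$, so that the Kullback-Leibler divergence is $\xi^2/2$. The distance is taken as $d(\xi) = \sqrt{2\,\mathrm{KLD}}$, the scale used by Simpson et al., which here reduces to $|\xi|$. Splitting the exponential mass evenly between the two signs of $\xi$, and calibrating the rate through a tail probability $\Pr(d > U) = \alpha$ with hypothetical values of $U$ and $\alpha$, are assumptions of this example.

```python
import math

def kld_to_base(xi):
    """Kullback-Leibler divergence between N(xi, 1) and the base model N(0, 1)."""
    return 0.5 * xi * xi

def distance(xi):
    """Distance scale d(xi) = sqrt(2 * KLD); in this toy model it equals |xi|."""
    return math.sqrt(2.0 * kld_to_base(xi))

def rate_from_tail(upper, alpha):
    """Rate lam such that Pr(d > upper) = alpha when d ~ Exponential(lam)."""
    return -math.log(alpha) / upper

def pc_prior_density(xi, lam):
    """Density on xi induced by d ~ Exponential(lam), with the mass split evenly
    between positive and negative xi (the Jacobian |dd/dxi| equals 1 here)."""
    return 0.5 * lam * math.exp(-lam * distance(xi))

lam = rate_from_tail(upper=3.0, alpha=0.05)   # hypothetical calibration

# Trapezoidal check that the induced density integrates to (about) one.
step = 1e-3
grid = [-40.0 + step * i for i in range(80001)]
vals = [pc_prior_density(x, lam) for x in grid]
total = step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

print(lam, total)
```

Constant rate penalization shows up as a fixed ratio between the densities at distances $d + 1$ and $d$, whatever the value of $d$; in this toy model the resulting prior on $\xi$ is a double exponential centered at the base model.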

Some general issues
It is common practice to regard a statistical model as a family of distributions for the observable random variables, and we follow suit. Model selection involves the computation of the posterior distribution on a collection of statistical models; we may then summarize the latter distribution in order to single out a unique representative, which is the typical goal of model selection.
To fix notation for the rest of the paper let $y = (y_1, \dots, y_n)^T$ denote the available observations and suppose we wish to compare the following two models:

$$M_0 : f_0(y \mid \theta_0) \qquad \text{versus} \qquad M : f(y \mid \theta),$$

where $\theta_0$ and $\theta$ are unknown, model specific, parameters of size $d_0$ and $d$ respectively.
If $M_0$ is nested in $M$, so that $d_0 < d$, we will henceforth assume that $\theta = (\theta_0^T, \theta_{\setminus 0}^T)^T$, so that $\theta_0$ is a parameter 'common' between the two models, whereas $\theta_{\setminus 0}$ is model specific. The use of a 'common' parameter $\theta_0$ in nested model comparison is often made to justify the employment of the same, potentially improper, prior on $\theta_0$ across models. This usage is becoming standard, but is not always appropriate, in particular when the intrinsic prior methodology is adopted; see e.g. Casella and Moreno (2006). We will return briefly to this issue below. Let $\pi(\theta_0 \mid M_0)$ be the prior under the null model $M_0$, and without loss of generality let the prior under model $M$ have the following hierarchical form:

$$\pi(\theta \mid M) = \pi(\theta_{\setminus 0} \mid \theta_0, M)\,\pi(\theta_0 \mid M).$$

To illustrate various approaches to the construction of priors on parameters, we will use the variable selection problem in normal linear regression models as an important running example. In this case, model $M$ is specified by

$$Y \mid \beta, \sigma^2, M \sim N_n(X\beta, \sigma^2 I_n),$$

where $Y = (Y_1, \dots, Y_n)^T$ is the vector of responses, $X$ is a known $n \times (p + 1)$ design matrix ($p$ covariates plus the intercept), $I_n$ is the $n \times n$ identity matrix, $\beta$ is a $(p + 1)$-vector of regression coefficients, and $\sigma^2$ is the error variance, common to all models. Therefore each model $M$ has parameters $\theta = (\beta, \sigma^2)$ of size $d = p + 2$. With $M_0$ we denote the null model having the intercept only, with parameters $\theta_0 = (\beta_0, \sigma^2)$, and with $M_F$ the full model with all $p$ covariates under consideration. For model $M$ we write $\beta = (\beta_0, \beta_{\setminus 0}^T)^T$ and $X = [X_0, X_{\setminus 0}]$, where $X_0$ is the $n$-dimensional vector of ones. All matrices $X$ are assumed to be of full rank. Moreover, in the case of variable selection, it is common to substitute the model indicator $M$ by a vector of binary indicators $\gamma = (\gamma_1, \dots, \gamma_p)$ that identify which covariates are included in the model (George and McCulloch, 1993).

Posterior measures of evidence
A natural tool for comparing model $M_0$ versus $M$ is the posterior odds (Jeffreys, 1961) defined by

$$PO_0 = \frac{\pi(M_0 \mid y)}{\pi(M \mid y)} = \frac{m_0(y)}{m(y)} \times \frac{\pi(M_0)}{\pi(M)}, \tag{7}$$

where, for each of the two models $M_k$ under comparison, $\pi(M_k)$ is the prior probability of model $M_k$, while $m_k(y)$ is the "marginal" likelihood (also called Bayesian "evidence") of $M_k$ given by

$$m_k(y) = \int f_k(y \mid \theta_k)\,\pi(\theta_k \mid M_k)\,d\theta_k,$$

with $m(y)$ denoting the marginal likelihood of $M$. The ratio of the marginal likelihoods of the two models is called the Bayes factor (BF),

$$B_0 = \frac{m_0(y)}{m(y)}.$$

From (7) it appears that the BF is the multiplicative term, or factor, which updates the prior odds $\pi(M_0)/\pi(M)$ to the posterior odds $PO_0$. The terminology is due to Good (1958), and the initial use of the BF can be attributed both to Jeffreys and Turing who introduced it independently around the same time (Kass and Raftery, 1995). Notice that if equal prior model probabilities are assumed (prior indifference between models), the posterior odds reduce to the Bayes factor. The BF does not depend on the prior model probabilities; however it depends on the prior densities $\pi(\theta_k \mid M_k)$, which in general must be proper. Notice that in some cases improper priors are allowed. For instance, Berger et al. (1998) proved a remarkable result which states that, in situations characterized by a group structure leading to invariance considerations, right Haar priors are perfectly legitimate to be used for computing BFs. Additionally, use of improper priors is common in nested scenarios, dating back to Jeffreys (1961); see also Kass and Raftery (1995). Improper priors may be used, although not in a direct way, for computing BFs; see Subsection 3.4 for more details.
Posterior model odds (and BFs) are directly related to posterior model probabilities $\pi(M_\ell \mid y)$ because
$$\pi(M_\ell \mid y) = \frac{PO_{\ell 0}}{\sum_{M_k \in \mathcal{M}} PO_{k0}}, \qquad PO_{k0} = 1/PO_{0k}, \tag{9}$$
for any models $M_\ell, M_0 \in \mathcal{M}$. If $M_\ell, M_0 \in \mathcal{M}$ are the only two models under consideration and they have the same prior probabilities, then $\pi(M_\ell \mid y) = 1/(1 + B_{0\ell})$. The posterior model probability (9) is often interpreted as the probability that $M_\ell$ is the "true" data generating model. Notice however that this interpretation is meaningful only under the M-closed view, wherein it is assumed that the true model is included in the set of models under consideration, and provided that the induced Bayesian procedure is consistent (see Section 3.3 for details). In most real life problems, the M-closed view is unrealistic. Nevertheless, measures of Bayesian model comparison support models (in $\mathcal{M}$) that are close in Kullback-Leibler divergence to the true generating mechanism; for details see Walker et al. (2004), Clyde and Iversen (2013), Chib and Kuffner (2016). A disadvantage of using $\pi(M_\ell \mid y)$, as opposed to posterior odds or BFs, is the "dilution" of the posterior probability over the space of models (George, 1999), which becomes spread out over many similar models. Dilution increases as more models are considered, so that posterior model probabilities, even for the maximum a-posteriori (MAP) model, decrease. For this reason it is advisable to report, besides the posterior probability of each model, also its posterior odds or BF against the MAP model.
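The relation between marginal likelihoods, Bayes factors and posterior model probabilities can be illustrated with a minimal sketch. The function names and the numerical values of the log marginal likelihoods below are hypothetical and serve only as an illustration; equal prior model probabilities are assumed unless stated otherwise.

```python
import math

def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities from log marginal likelihoods.

    Works on the log scale and subtracts the maximum before exponentiating,
    so that very small marginal likelihoods do not underflow.
    """
    k = len(log_marginals)
    if prior_probs is None:
        prior_probs = [1.0 / k] * k  # prior indifference between models
    log_post = [lm + math.log(pr) for lm, pr in zip(log_marginals, prior_probs)]
    shift = max(log_post)
    weights = [math.exp(v - shift) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

def bayes_factor(log_m_num, log_m_den):
    """Bayes factor of the 'numerator' model against the 'denominator' model."""
    return math.exp(log_m_num - log_m_den)

# Hypothetical values of log m_0(y) and log m_l(y), for illustration only.
log_m0, log_ml = -105.2, -103.9
b_0l = bayes_factor(log_m0, log_ml)
probs_two = posterior_model_probs([log_m0, log_ml])

# Dilution: adding three near-copies of M_l lowers each model's probability,
# while the Bayes factor of M_0 against M_l is unchanged.
probs_five = posterior_model_probs([log_m0] + [log_ml] * 4)
print(b_0l, probs_two, probs_five)
```

With two models and equal prior probabilities the second entry of `probs_two` equals 1/(1 + B_0l); in the five-model case every individual probability shrinks although the pairwise evidence is the same, which is the reason for reporting odds or BFs alongside posterior probabilities.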
For the variable selection problem, we may further calculate the posterior inclusion probabilities for each covariate $X_j$, given by
$$\pi(\gamma_j = 1 \mid y) = \sum_{\gamma:\, \gamma_j = 1} \pi(\gamma \mid y), \qquad j = 1, \ldots, p.$$
Posterior inclusion probabilities (George and McCulloch, 1993) represent an accumulated measure of evidence in favor of a covariate being present in a model structure, and have been used as an informal, empirical measure of evidence for many years. Their usefulness was highlighted in the work by Barbieri and Berger (2004), where it was proved that the median probability (MP) model, defined as the model containing only those covariates whose posterior inclusion probabilities exceed 0.5, has better predictive properties than the MAP model in specific cases. Posterior inclusion probabilities do not generally suffer from the phenomenon of posterior dilution because they can be written as
$$\pi(\gamma_j = 1 \mid y) = \frac{O_j}{1 + O_j}, \qquad O_j = \frac{\sum_{\gamma:\, \gamma_j = 1} \pi(\gamma \mid y)}{\sum_{\gamma:\, \gamma_j = 0} \pi(\gamma \mid y)}.$$
In the above expression, the numerator and the denominator of $O_j$ are both sums of $2^{p-1}$ elements, making this quantity robust when we decide to increase the number of covariates under evaluation. Similarly, when using any tool of model exploration in large model spaces, posterior inclusion probabilities are more reliably and quickly estimated than individual posterior model probabilities, due to the large number of models with small but non-zero probability involved in the denominator of (9).
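As a small illustration of these quantities, the sketch below computes inclusion probabilities, the inclusion odds O_j and the median probability model from a table of posterior model probabilities. The posterior probabilities are made up for the purpose of the example, and the function names are ours.

```python
def inclusion_probabilities(model_probs):
    """Posterior inclusion probability of each covariate.

    model_probs maps a tuple of 0/1 indicators (gamma) to pi(gamma | y).
    """
    p = len(next(iter(model_probs)))
    return [sum(pr for gamma, pr in model_probs.items() if gamma[j] == 1)
            for j in range(p)]

def inclusion_odds(model_probs, j):
    """O_j: mass of models containing covariate j over mass of those that do not."""
    with_j = sum(pr for gamma, pr in model_probs.items() if gamma[j] == 1)
    without_j = sum(pr for gamma, pr in model_probs.items() if gamma[j] == 0)
    return with_j / without_j

def median_probability_model(model_probs):
    """Model made of the covariates whose inclusion probability exceeds 0.5."""
    return tuple(int(q > 0.5) for q in inclusion_probabilities(model_probs))

# Made-up posterior over the 2^2 models with p = 2 covariates.
posterior = {(0, 0): 0.0, (1, 0): 0.4, (0, 1): 0.3, (1, 1): 0.3}
incl = inclusion_probabilities(posterior)
mp_model = median_probability_model(posterior)
map_model = max(posterior, key=posterior.get)
print(incl, mp_model, map_model)
```

In this toy posterior the MAP model contains only the first covariate, whereas the MP model contains both, which shows that the two summaries need not coincide.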
There is a growing interest in applying posterior measures of evidence in empirical research. For instance the Journal of Mathematical Psychology recently devoted a whole issue to this topic; see the introductory editorial page by Mulder and Wagenmakers (2016). One reason might be the acute dissatisfaction with current frequentist testing methods, also related to lack of reproducibility in scientific investigations; see Johnson (2013) and the recent statement by the American Statistical Association (Wasserstein and Lazar, 2016). Benjamin et al. (2017) is the outcome of a concerted effort by a large group of statisticians and scientists to define more stringent statistical standards of evidence for claiming new discoveries in many fields of science.
We close this subsection by presenting a variety of viewpoints on the issue of Bayesian model comparison from an objective standpoint. First of all it is worth pointing out that the use of the BF is not undisputed even within the OB community. Bernardo and Rueda (2002) consider testing a null model nested into a larger one. They argue that a testing problem should be regarded as a formal decision problem on whether or not to use the null model. Accordingly a loss function should be specified to take into account the amount of information which would be lost if the null model were used. Objectivity comes into the picture through the use of a reference prior on the parameter space. Dawid and Musio (2015) address the problem of the indeterminacy of the marginal likelihood of a model in the presence of an improper prior, and solve it by replacing the marginal log-likelihood with a homogeneous proper scoring rule, which is insensitive to the arbitrary scaling constant of the prior. They also show that, when suitably applied, their proposal will typically enable consistent selection of the true model. Kamary et al. (2014) propose to view the model selection enterprise as a problem in mixture modeling. Specifically the models under investigation are viewed as components of a mixture model, so that the original testing problem is transformed into an estimation problem, and accordingly the posterior probability of a model or an hypothesis is evaluated through the posterior distribution of the weights of a mixture of the models under comparison. Again improper priors can be used, although some care must be exercised. In order to perform OB methods for testing or selection, other authors rely on an unconventional use of the BF. Johnson (2005) proposes a test-based BF (TBF) for two nested models which is defined through a test statistic, rather than individual observations. 
The main idea is that the distribution of a test statistic does not depend on unknown model parameters under the null, so that some of the subjectivity that is normally associated with the definition of Bayes factors is eliminated. It remains to compute the marginal likelihood under the alternative model: this can be obtained through a prior or using a marginal maximum likelihood estimate. Further aspects are examined in Hu and Johnson (2009), while Held et al. (2015) relate BFs based on g-priors (discussed in Section 3.4) to TBFs. Finally, Johnson (2013) introduces the concept of a uniformly most powerful Bayesian test (UMPBT) for testing a null model nested in a larger alternative one. In a UMPBT the prior under the alternative hypothesis is chosen so as to maximize the probability that the Bayes factor against the null exceeds a specified threshold, for each possible value of the true parameter belonging to the alternative set.

Principles for objective model comparison
Criteria for objective Bayesian model choice
Bayarri et al. (2012) developed criteria (desiderata) to be satisfied by objective prior distributions for Bayesian model choice. A number of these criteria are applicable only in nested model comparisons. Notice that this represents a distinctive innovation relative to previous attempts in the literature, which typically proposed, based on intuition or otherwise, reasonable priors which were subsequently evaluated in terms of their properties. Here the paradigm is turned upside down: first, criteria meaningful for priors tailored to objective model selection are set out, and then priors satisfying them are derived. These criteria are grouped into four classes: basic, consistency, predictive matching and invariance. The basic criterion (C1) states that the prior of each model specific parameter, conditionally on the common ones, $\pi(\theta_{\ell\setminus 0} \mid \theta_0, M_\ell)$, should be proper, so that Bayes factors do not contain different arbitrary normalizing constants across distinct models.
Model selection consistency (C2) has been widely used as a crucial criterion for objective model selection priors. The criterion implies that if data have been generated by $M_\ell$, then the posterior probability of $M_\ell$ should converge to one as the sample size diverges. Although consistency is an important requirement, it might not be enough to differentiate between several priors, all satisfying (C2). Hence the need to better investigate the rate of convergence to the true model. Current research in high-dimensional models, on which we report in Section 4, is precisely devoted to this issue; see in particular Castillo and Mismer (2018) and Ročková and George (2018). An additional consistency criterion is information consistency (C3): if there exists a sequence of datasets with the same sample size $n$ such that the likelihood ratio between $M_\ell$ and $M_0$ goes to infinity, then the corresponding sequence of Bayes factors should also go to infinity. Information inconsistency was first discovered by Berger and Pericchi (2001) in the case of conjugate priors for location when the scale is unknown and was further studied by Liang et al. (2008). It represents a severe lack of robustness to highly specific sample information. When some aspects of the model, sample size and, to some extent, also of the observations, affect model selection priors, it is desirable that such features should disappear as $n$ grows, leading to a limiting proper prior. This requirement is named the intrinsic consistency criterion (C4).
Predictive matching (C5) is viewed as the most crucial aspect for objective model selection priors. Informally, with a minimal sample size, one should not be able to discriminate between two models, so that the BF should be close to one for all samples of minimal size. In particular, exact predictive matching occurs if the BF equals one. The minimal sample size $n^*$ is defined as the smallest sample size with a finite nonzero marginal density for the combination of models and priors; i.e. $0 < m(y^* \mid M_\ell) < \infty$ for all observations $y^*$ of size $n^*$, and for all models $M_\ell$ under the prior $\pi(\theta_\ell \mid M_\ell)$. Bayarri et al. (2012) elaborate further on the notion of predictive matching, but we omit details for the sake of conciseness.
The last two criteria are stated in terms of invariance arguments. Measurement invariance (C6) broadly states that answers should not be affected by changes of measurement units. A special type of invariance arises when the families of sampling distributions of models under consideration are such that the model structures are invariant to group transformations. The group invariance criterion (C7) states that if models $M_\ell$ and $M_0$ are invariant under a group of transformations $G_0$, then the conditional priors $\pi(\theta_{\ell\setminus 0} \mid \theta_0, M_\ell)$ should be chosen in such a way that the conditional marginal distribution $f(y \mid \theta_0, M_\ell)$ is also invariant under $G_0$. This means that if models exhibit an invariance structure, this should be preserved after marginalization. Note that $G_0$ is a group of transformations relevant to the null model $M_0$, and therefore the group invariance criterion can be understood as a formalization of Jeffreys' requirement that the prior for a non-null parameter should be "centered at the simplest model." Another use of invariance is to find priors on common parameters.
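Remarkably, Bayarri et al. (2012) accomplished the goal of finding a prior satisfying all their desiderata within the framework of normal linear regression models, which they called the robust prior. Under model $M_\ell$, as in (6), the prior takes the form
$$\pi^R(\beta_0, \beta_{\ell\setminus 0}, \sigma) = \pi(\beta_0, \sigma)\, \pi^R(\beta_{\ell\setminus 0} \mid \beta_0, \sigma) = \sigma^{-1} \int_0^\infty N\big(\beta_{\ell\setminus 0} \mid 0,\, g\, \Sigma_\ell\big)\, p^R_\ell(g)\, dg,$$
where
$$\Sigma_\ell = \sigma^2 \big(V_{\ell\setminus 0}^T V_{\ell\setminus 0}\big)^{-1}, \qquad V_{\ell\setminus 0} = \big(I_n - X_0 (X_0^T X_0)^{-1} X_0^T\big) X_{\ell\setminus 0},$$
$$p^R_\ell(g) = a\, [\rho_\ell (b + n)]^{a}\, (g + b)^{-(a+1)}\, 1\{g > \rho_\ell (b + n) - b\}, \qquad a > 0,\ b > 0,\ \rho_\ell \ge b/(b + n).$$
While the result holds for a general matrix of common predictors $X_0$, note that, if $X_0 = 1$ (i.e. when $M_0$ contains only the intercept), then $V_{\ell\setminus 0} = Z_{\ell\setminus 0}$, with $Z_{\ell\setminus 0}$ denoting the column-wise centered version of $X_{\ell\setminus 0}$.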
Regarding the hyperparameters of the above prior distribution, the default values recommended by Bayarri et al. (2012) are $a = 1/2$, $b = 1$ and $\rho_\ell^{-1} = p_\ell + 1$. Under the robust prior, the resulting Bayes factors have closed form expressions in terms of the hypergeometric function. Finally, the hyper-g prior (Liang et al., 2008), discussed in Section 3.4, using the recommended value of 3 for its hyperparameter, is a particular case of the robust prior with $a = 1/2$, $b = 1$ and $\rho_\ell^{-1} = n + 1$; similarly, the hyper-g/n prior (Liang et al., 2008), using the recommended value of 3 for its hyperparameter, may be obtained from the robust prior by setting $a = 1/2$, $b = n$ and $\rho_\ell^{-1} = 2$.

Compatibility of priors
When dealing with model choice, a prior on the parameter space under each model should incorporate not only uncertainty but also features which are germane to the comparison setting. One important feature is compatibility of priors across models; see Dawid and Lauritzen (2011) and Consonni and Veronese (2008). Informally this means that priors should be related across models, although in principle they need not be, each being conditional on a given model. Compatibility is usually applied to nested models, with parameter spaces having different dimensions, but it can be extended to more general setups whenever we can identify a benchmark model (often the null model), which is nested into every other model under consideration (encompassing from below), so that compatibility is realized between each model and the benchmark, and thus indirectly between any pair of models. Compatibility was initially proposed to lessen the sensitivity of model comparison to prior specifications, and also to facilitate the task of multiple prior elicitations when several models are entertained. However it also underlies some approaches to the construction of objective priors for Bayesian testing, e.g. the expected posterior prior (Pérez, 1998) (see Section 3.4), wherein the prior under each model is anchored to a common base measure. Another version of prior compatibility across models, named matching, was examined at the beginning of Section 3.3 within a more general theoretical setup.

Validation of Bayesian approaches
The desiderata of Bayarri et al. (2012) refer to the desirable properties of prior distributions and the induced model selection procedures. Nevertheless, when more general methods with Bayesian motivation are used (e.g. the intrinsic and the fractional Bayes factors; see Section 3.4) then an additional important property should be satisfied. According to Principle 1 of Berger and Pericchi (1996), "methods that correspond to use of plausible default (proper) priors are preferable to those that do not correspond to any possible actual Bayesian analysis". Thus an acceptable Bayesian procedure should correspond, at least asymptotically, to a prior which makes sense in the context where it is applied.

Methods with good frequentist properties
A popular alternative to the standard objective Bayes techniques is to use prior distributions that lead to good frequentist performances. This trend is especially notable in high-dimensional settings, as we discuss in Section 4. For instance, priors are selected based on the coverage of posterior intervals and false discovery rates (FDR). The former focuses on estimation (Castillo and van der Vaart, 2012; van der Pas et al., 2017), and is further discussed in Section 4, while the control of FDR is tailored to multiple comparisons and prior model probability specification (Tansey et al., 2018); see Section 3.6.

Methods for constructing objective prior distributions

Unit information principle
The unit information principle has its origin in the work of Kass and Wasserman (1995), who investigated the use of the Schwarz (1978) criterion (or BIC) as an approximation of the Bayes factor. Informally, a unit information prior (UIP) has an information content equivalent to a sample of size one. For a dataset of size $n$, the observed Fisher information matrix under model $M_\ell$ divided by $n$ can be interpreted as an estimate of the average amount of information in one data point. If $\theta_\ell \in R^{d_\ell}$, one way to construct a UIP is as follows:
$$\theta_\ell \mid M_\ell \sim N_{d_\ell}\big(\mu_{\theta_\ell},\; n\, [J_n(\mu_{\theta_\ell})]^{-1}\big), \tag{11}$$
where $J_n(\cdot)$ is the negative of the Hessian matrix of the log-likelihood. Under this prior the logarithm of the BF is asymptotically equivalent to the Schwarz criterion (BIC). In this way the unit information prior provides a Bayesian interpretation for the BIC model selection procedure.
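The connection with the Schwarz criterion can be checked in a toy case where everything is available in closed form. The sketch below is not taken from the references above: it assumes i.i.d. observations from N(mu, 1), a null model with mu = 0, and an alternative with mu ~ N(0, 1), a prior whose variance matches the information in one observation. The value 0.5 for the sample mean is arbitrary.

```python
import math

def log_bf_unit_information(ybar, n):
    """Exact log Bayes factor of the alternative (mu free) against mu = 0.

    Data are N(mu, 1); under the alternative the prior is mu ~ N(0, 1),
    which carries the information of a single observation.
    """
    return -0.5 * math.log(1 + n) + (n ** 2) * ybar ** 2 / (2 * (n + 1))

def schwarz_criterion(ybar, n):
    """Maximised log-likelihood ratio minus 0.5 * log(n) (one extra parameter)."""
    return n * ybar ** 2 / 2 - 0.5 * math.log(n)

for n in (10, 100, 10_000):
    exact = log_bf_unit_information(0.5, n)
    approx = schwarz_criterion(0.5, n)
    print(n, round(exact, 3), round(approx, 3), round(approx - exact, 4))
```

In this example the gap between the two quantities stays bounded (it approaches half the squared sample mean), while both grow linearly in n, so the relative error of the Schwarz approximation vanishes.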
There exist specifications of UIPs alternative to (11); for instance one could replace μ θ with the maximum-likelihood estimate. In the same spirit, Ntzoufras (2009) proposed a simplified version by considering independent prior distributions with means and variances equal to the corresponding posterior means and the variances (multiplied by n) obtained using a flat improper prior. The posterior model probabilities under this approach can be used as an initial yardstick for comparisons with other objective Bayes approaches.
The unit information principle can be easily combined with the power-prior approach described shortly below. Under this setting, the prior mean μ θ can be specified by "prior", or "imaginary", data. A sensible choice, for nested model comparisons, is to generate the latter under the null model. Examples of priors based on the unit information principle can be found in Ntzoufras et al. (2003) for binary response models, in Overstall and Forster (2010) for generalized linear mixed models, in Sabanés Bové and Held (2011) for generalized linear models, and in Ntzoufras and Tarantola (2008) for contingency tables.
The unit information principle rests on the notion of sample size which is straightforward for i.i.d. observations, but requires careful considerations in other settings, such as non-i.i.d. observations or in hierarchical models. In Bayarri et al. (2014) the concept of effective sample size is analyzed in detail, and applied to the construction of priors for model selection in a variety of statistical setups.

Training samples
This subsection by itself does not represent a direct method for constructing priors: its goal is rather to motivate the use of intrinsic priors which are described in the subsequent paragraph.
The difficulties in computing the Bayes factor under improper priors, mentioned in Section 3.2, have generated a few proposals that try to address them. One line of research rests on the use of training samples and led to the intrinsic Bayes factor (IBF) proposed by Berger and Pericchi (1996). The IBF employs a subset of the data, of size n * (the training sample) to convert the improper baseline prior to a proper posterior, and then uses the remaining data to calculate the Bayes factor. Next, a summary, e.g. median, arithmetic or geometric mean, of the Bayes factors over the set of all possible sub-samples of size n * can be reported, resulting in the median, arithmetic or geometric intrinsic Bayes factors respectively. Under the IBF approach, minimal training samples are often employed in order to minimize the loss of data utilized for building the prior distribution. These samples are defined such that their size is "as small as possible, subject to yielding proper posteriors" (Berger and Pericchi, 1996). The IBF has the disadvantage that in principle one should consider all possible sub-samples having a minimal sample size, and then take averages. This can be computationally costly. A way to overcome this difficulty is to resort to intrinsic priors which we describe below.
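A minimal sketch may help to see how training samples remove the arbitrary constant of an improper prior. The setup is an assumption made for illustration: i.i.d. N(mu, 1) data, null model mu = 0, alternative with the improper prior pi^N(mu) = c, and minimal training samples of size one. The data vector is made up and the function names are ours.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def bf10_improper(y, c=1.0):
    """Bayes factor of 'mu free' against 'mu = 0' under the improper prior pi^N(mu) = c."""
    n = len(y)
    ybar = sum(y) / n
    return c * math.sqrt(2 * math.pi / n) * math.exp(n * ybar ** 2 / 2)

def bf01_training(y_i, c=1.0):
    """Bayes factor of 'mu = 0' against 'mu free' on one training observation.

    The marginal of a single observation under the improper prior equals c.
    """
    return normal_pdf(y_i) / c

def arithmetic_ibf10(y, c=1.0):
    """Arithmetic intrinsic Bayes factor: average over all training samples of size one."""
    correction = sum(bf01_training(y_i, c) for y_i in y) / len(y)
    return bf10_improper(y, c) * correction

y = [0.3, -0.1, 0.8, 1.2, 0.4]  # made-up data
print(arithmetic_ibf10(y, c=1.0), arithmetic_ibf10(y, c=7.3))
```

The Bayes factor computed directly from the improper prior scales with c, whereas the arithmetic IBF is the same for any c, because the constant cancels between the full-sample and the training-sample terms.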
A related method is the fractional Bayes factor (FBF) proposed by O'Hagan (1995), which however does not require training samples. In order to compute the marginal likelihood of a given model using an improper prior, the prior is "trained" using a fraction of the full sample likelihood, that is, by raising the full likelihood to a power. Next, the calculation of the marginal likelihood is implemented using the complementary fraction of the likelihood together with the newly trained prior. The FBF is appealing because of its simplicity, and has been used to address challenging statistical problems involving model comparison. In particular, we mention here two areas: multivariate time series models (Villani, 2004, 2006; Villani, 2001), and graphical models (Carvalho and Scott, 2009; Consonni and La Rocca, 2012; Altomare et al., 2013; Leppä-aho et al., 2016). Recent theoretical work on Bayesian fractional posteriors (Bhattacharya et al., 2016), while not directly motivated by OB methodologies and having a much broader scope, may provide useful results for further investigation into properties of the FBF.
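The mechanics of the FBF can be sketched in the same toy problem used for illustration purposes only: i.i.d. N(mu, 1) data, null model mu = 0, and a flat improper prior on mu under the alternative. The closed form below is our own derivation for this special case, checked against a direct numerical integration; the data and the helper names are hypothetical.

```python
import math

def log_lik(y, mu):
    """Log-likelihood of i.i.d. N(mu, 1) observations."""
    n = len(y)
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((v - mu) ** 2 for v in y)

def trapezoid(f, lo, hi, steps=20_000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h

def fbf10_numeric(y, b):
    """FBF of 'mu free' (flat prior) against 'mu = 0', training fraction b."""
    ybar = sum(y) / len(y)
    lo, hi = ybar - 12.0, ybar + 12.0
    full = trapezoid(lambda mu: math.exp(log_lik(y, mu)), lo, hi)
    frac = trapezoid(lambda mu: math.exp(b * log_lik(y, mu)), lo, hi)
    q1 = full / frac                           # the prior constant cancels here
    q0 = math.exp((1 - b) * log_lik(y, 0.0))   # no free parameter under the null
    return q1 / q0

def fbf10_closed_form(y, b):
    """Closed form for this special case: sqrt(b) * exp((1 - b) * n * ybar^2 / 2)."""
    n = len(y)
    ybar = sum(y) / n
    return math.sqrt(b) * math.exp((1 - b) * n * ybar ** 2 / 2)

y = [0.3, -0.1, 0.8, 1.2, 0.4]  # made-up data
b = 1 / len(y)                   # fraction equivalent to one observation
print(fbf10_numeric(y, b), fbf10_closed_form(y, b))
```

The choice b = 1/n mirrors the use of a minimal training sample of size one; other fractions simply change the amount of likelihood spent on training the prior.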

Intrinsic priors
Intrinsic prior distributions were originally introduced by Berger and Pericchi (1996) in order to provide a proper Bayesian interpretation for intrinsic Bayes factors, according to the principle that a good Bayesian procedure should correspond to the use (at least asymptotically) of a sensible prior; see Section 3.3.
The intrinsic prior can be obtained by equating the limit (as n → ∞) of the arithmetic intrinsic Bayes factor with the corresponding Bayes factor obtained by using the intrinsic prior resulting in two intrinsic equations for every pair of models under comparison. For any two nested models under comparison M and M 0 , the two equations coincide. Although the intrinsic prior distributions always exist for nested model comparisons (Sansó et al., 1996), the intrinsic equations do not collapse into a single equation in non-nested cases. Therefore, the existence of the intrinsic priors is not ensured, and when they exist, we obtain a class of intrinsic prior distributions rather than a single solution (Moreno, 2005). Berger and Pericchi (1996) prove that in nested situations, the arithmetic, but not the geometric, IBF corresponds to a proper prior under the "alternative" when the "null" is simple, or when the baseline prior under the "null" is proper.
Consider the comparison of a "null" model $M_0: f(y \mid \theta_0, M_0)$, with baseline prior $\pi^N(\theta_0 \mid M_0)$, against an alternative model $M_\ell: f(y \mid \theta_\ell, M_\ell)$, with baseline prior $\pi^N(\theta_\ell \mid M_\ell)$. The baseline priors in each model are assumed to be objective, typically improper, and the superscript "N" stands for "noninformative." In this part of the paper only, we depart somewhat from the notation employed in Section 3.1 because both $\theta_0$ and $\theta_\ell$ are meant to be model specific parameters, without assuming that $\theta_0$ is a 'common' parameter. If we assume that the intrinsic priors are limits of proper intrinsic priors, then it can be shown (Moreno et al., 1998) that the pair
$$\pi^I(\theta_0 \mid M_0) = \pi^N(\theta_0 \mid M_0), \qquad \pi^I(\theta_\ell \mid M_\ell) = \pi^N(\theta_\ell \mid M_\ell)\; E^{M_\ell}_{y^* \mid \theta_\ell}\!\left[\frac{m_0^N(y^*)}{m_\ell^N(y^*)}\right]$$
solves the intrinsic equations, where the expectation is taken with respect to the sampling distribution $f(y^* \mid \theta_\ell, M_\ell)$ of a training sample $y^*$ of minimal size, and $m_k^N(y^*)$ denotes the marginal likelihood of $y^*$ under the baseline prior of model $M_k$.
If the prior $\pi^N(\theta_\ell \mid M_\ell)$ is improper, so that its expression is unique up to a constant $c_\ell$, an important feature of the intrinsic prior is that it is free from $c_\ell$. Indeed $\pi^I(\theta_\ell \mid M_\ell)$ only depends on the constant $c_0$ of the (improper) prior $\pi^N(\theta_0 \mid M_0)$ under the null model $M_0$. However, if the latter is nested into every $M_\ell$, meaning that $M_0$ can be taken as a null, or baseline, model in all pairwise comparisons, $c_0$ will appear as a multiplicative constant in the intrinsic prior distribution of each model $M_\ell$, and therefore will cancel out in the ensuing Bayes factors, causing no indeterminacy problem in the resulting model comparison procedure based on intrinsic priors.
As in Berger and Pericchi (1996), Moreno et al. (1998) also proved that, in nested model comparisons, if the baseline prior for the reference model $M_0$ is proper, then $\pi^I(\theta_\ell \mid M_\ell)$ is also proper and unique under mild conditions. In addition, Moreno et al. (1998) constructed a limiting intrinsic procedure for the case where $\pi^N(\theta_0 \mid M_0)$ is not proper. General theory for intrinsic tests and comparisons between nested models or hypotheses can be found in Moreno (1997), while for non-nested comparisons results are available in Berger and Mortera (1999) and in Cano et al. (2004). Cano and Salmerón (2013) generalized the intrinsic prior formulae, for non-nested situations, by iteration.
Objective model comparison and hypothesis testing based on intrinsic priors have been implemented in a variety of problems. Here we can only list a subset of those which have appeared in recent years: analysis of variance models with heteroscedastic errors (Bertolino and Racugno, 2000), survival analysis models (Kim and Sun, 2000), tests for the selection of the number of mixture components (Moreno and Liseo, 2003), one-sided hypothesis tests (Moreno, 2005), tests for the equality of regression coefficients with heteroscedastic errors, changepoint problems (Girón et al., 2007), one-way random effects models (Garcia-Donato and Sun, 2007), the equality of two correlated proportions (Consonni and La Rocca, 2008), two-way contingency tables, comparisons in multivariate normal regression models (Torres-Ruiz et al., 2011), Hardy-Weinberg equilibrium models, and comparison of constrained ANOVA models (Consonni and Paroli, 2017). Finally, in Pérez et al. (2017) a sensible prior to substitute the inverted gamma prior for scales is found as an intrinsic prior, and shown to generate by marginalization the horseshoe prior described in Section 4.
Moreover, intrinsic priors have been successfully used for variable selection in normal regression (Casella and Moreno, 2006), multivariate regression (Torres-Ruiz et al., 2011) and probit models (Leon-Novelo et al., 2012). For normal regression models with a finite number of predictors, a variety of priors, including the intrinsic, leads to a consistent variable selection procedure. For models whose dimension grows with the sample size $n$, Moreno et al. (2010) show that the Bayes factor for nested models under the intrinsic prior is consistent when the size of the model grows as $O(n^b)$ for $b < 1$, and this holds also for the BIC selection procedure. When $b = 1$, the Bayes factor under the intrinsic prior is still consistent, except for a small set of alternative larger models which they characterize. Finally, consistency of intrinsic posterior distributions both under model selection and model averaging is studied in Womack et al. (2014). Moreno and Girón (2008) provide a comparison between two different types of encompassing in each pairwise model comparison: "from below", so that the null model is nested into each of the remaining ones and acts as the baseline model, and "from above", considering each model as baseline when compared to the full one; only the former however guarantees the rather obvious coherency requirement that $B_{0\ell}(y)/B_{0k}(y) = B_{k\ell}(y)$. For a concise review of the intrinsic prior methodology we refer the readers to the recent publication of Moreno and Pericchi (2014).
Intrinsic priors, as virtually all commonly used priors for testing, result in pairwise model comparison procedures with unbalanced learning rates under the two rival hypotheses/models. Specifically, if $M_0$ is nested within $M_\ell$, the BF in favor of $M_\ell$ decreases only as a power of $n$ if $M_0$ holds; on the other hand, the BF in favor of $M_0$ decreases exponentially fast in the sample size when $M_\ell$ holds; see Dawid (2011). To alleviate this imbalance, one can resort to non-local priors (Johnson and Rossell, 2010), which we briefly discuss at the end of this subsection. An intrinsic version of non-local priors was implemented for the first time in Consonni et al. (2013) with an application to the comparison of nested models for discrete observations. Alternatively, as one referee pointed out, the imbalance in the learning rate can also be managed by considering "objective" losses that naturally arise in specific problems; see Goutis and Robert (1998), Plummer (2008) and Dawid and Musio (2015) for examples.
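The different rates at which evidence accumulates under the two models can be seen in a deliberately simple setting, chosen by us for illustration and not tied to intrinsic priors: i.i.d. N(mu, 1) data, null model mu = 0, and a larger model with the local prior mu ~ N(0, 1). To keep the sketch deterministic, the sample mean is replaced by representative values (1/sqrt(n) when the null holds, 0.5 when the larger model holds), which ignores sampling variability.

```python
import math

def log_bf_l0(ybar, n):
    """log Bayes factor of the larger model (mu ~ N(0, 1)) against mu = 0."""
    return -0.5 * math.log(1 + n) + (n ** 2) * ybar ** 2 / (2 * (n + 1))

for n in (100, 10_000, 1_000_000):
    # Null model true: evidence for the larger model vanishes like n^(-1/2).
    under_null = log_bf_l0(1 / math.sqrt(n), n)
    # Larger model true (mu = 0.5): evidence for the null vanishes exponentially.
    under_alt = -log_bf_l0(0.5, n)
    print(n, round(under_null, 2), round(under_alt, 2))
```

Multiplying n by 100 lowers the first log Bayes factor by roughly log(10) only, whereas the second decreases approximately linearly in n.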
Similarly to intrinsic priors, fractional priors were introduced in the objective Bayes community by Moreno (1997) in order to identify a Bayesian procedure that approximates the results obtained by the FBF. De Santis and Spezzaferri (1997) derived formulae for the calculation of intrinsic priors of the FBF.

Imaginary observations
One of the main approaches used to construct prior distributions for objective Bayes methods is the concept of imaginary observations. The basic idea (whose origin can be traced back to the work of Good, 1950) is to consider a thought experiment with an appropriate dataset that will be used to specify the normalizing constants involved in the Bayes factors when using improper priors (Spiegelhalter and Smith, 1982). The main pathway here was to adopt the "local" principle, where the imaginary dataset fully supports the null hypothesis in nested model comparisons. In order to make the induced methods minimally informative, the notions of minimal training sample and the UIP principle were used on several occasions. A "non-local" alternative was introduced by Spitzner (2011), who used the notion of "neutral" imaginary samples which result in posterior model odds that do not support either of the two hypotheses; see also Section 3.2 of Spitzner (2011) for details concerning the connection of this approach with the "non-local" priors for a simple hypothesis test. We further distinguish between fixed and random imaginary observations.

Fixed imaginary data
In this subsection, we will focus on three main approaches. We start with the description of power priors, because of their wider scope. We then continue with g-priors, and mixtures thereof, which are very popular choices in variable selection problems.

Power priors
Ibrahim and Chen (2000) and Chen et al. (2000) introduced power priors as a resourceful probabilistic procedure for the elicitation of prior information in the form of additional prior data whose importance is weighted by a power parameter. Although the primary use of the power priors was in subjective Bayes approaches, using historical data to build the prior, they can also be used (in combination with the notion of unit information priors) to build meaningful prior distributions for objective Bayesian analysis through the device of "fixed imaginary data" (Spiegelhalter and Smith, 1982). Let model $M_\ell$ be specified as in (4), and let $\pi^N(\theta_\ell \mid M_\ell)$ be an objective noninformative prior typically used for estimation purposes. Then for a set of imaginary data $y^* = (y^*_1, y^*_2, \ldots, y^*_{n^*})^T$ of size $n^*$, a sensible prior for the model parameters can be obtained by the following expression:
$$\pi(\theta_\ell \mid y^*, a_0, M_\ell) \propto f(y^* \mid \theta_\ell, M_\ell)^{a_0}\, \pi^N(\theta_\ell \mid M_\ell). \tag{13}$$
For $a_0 = 1$, the prior (13) is exactly equal to the posterior distribution of $\theta_\ell$ after observing the imaginary data $y^*$. Usually, when limited prior information is available, we let $a_0 = 1/n^*$, so that the contribution of the imaginary data to the overall posterior is equivalent to one data point; i.e. the prior has a unit information interpretation. Moreover, the imaginary data can be generated from the simplest model (when available) under comparison in order to a priori support more parsimonious models. This specification can serve as a sensible default choice to conduct Bayesian analysis in a minimally informative way.

g-priors
Zellner's (1986) g-prior is one of the standard choices of prior distributions for variable selection in normal linear regression models. It has been widely used due to its computational convenience, direct interpretation and its connection to the widely used BIC. Its original formulation is given by
$$\beta_\ell \mid \sigma^2, M_\ell \sim N_{p_\ell + 1}\big(\mu_{\beta_\ell},\; g\, \sigma^2 (X_\ell^T X_\ell)^{-1}\big), \qquad \pi(\sigma^2 \mid M_\ell) \propto \sigma^{-2}, \tag{14}$$
suppressing dependence on $X_\ell$. Up to the term $g$, the prior variance-covariance matrix of $\beta_\ell$ coincides with that of the maximum likelihood estimator of $\beta_\ell$. Formula (14) reports the original specification, wherein the improper prior for $\sigma^2$ is meant to provide no information about the error variance; however some researchers extend the term g-prior to more informative settings in which $(\beta_\ell, \sigma^2)$ jointly has a normal-inverse gamma distribution. An alternative version of the g-prior has been widely used in the literature; see for example Liang et al. (2008). In this approach, after centering all covariates, the intercept is treated as a "common" parameter, and the g-prior takes the form
$$\pi(\beta_0, \sigma^2 \mid M_\ell) \propto \sigma^{-2}, \qquad \beta_{\ell\setminus 0} \mid \beta_0, \sigma^2, M_\ell \sim N_{p_\ell}\big(0,\; g\, \sigma^2 (Z_{\ell\setminus 0}^T Z_{\ell\setminus 0})^{-1}\big), \tag{15}$$
with $\beta_{\ell\setminus 0}$ denoting the sub-vector of $\beta_\ell$ without the common parameter $\beta_0$ and $Z_{\ell\setminus 0}$ denoting the column-wise centered version of $X_{\ell\setminus 0}$.
The g-prior in (14), with $\mu_{\beta_\ell} = 0$, can be interpreted as a power prior with fixed imaginary data $y^* = 0$ of size $n$ and imaginary design matrix $X_\ell$ (the same as the sample design matrix), power parameter equal to $a_0 = 1/g$, and a flat baseline prior distribution for $\beta_\ell$ conditionally on $\sigma^2$. Similarly, the conditional distribution of $\beta_{\ell\setminus 0} \mid \beta_0, \sigma^2, M_\ell$ in (15) can be interpreted as a power prior with all imaginary data set equal to a pre-specified value.
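The power-prior reading of the g-prior can be verified numerically in the simplest possible case. The sketch below assumes a regression through the origin with a single covariate and known error variance, so that the g-prior variance reduces to g * sigma^2 / sum(x_i^2); the covariate values, the variance and the helper names are made up for the illustration.

```python
import math

def power_prior_kernel(beta, x, sigma2, g):
    """Likelihood of imaginary responses y* = 0 raised to a_0 = 1/g (flat baseline)."""
    a0 = 1.0 / g
    sse = sum((0.0 - xi * beta) ** 2 for xi in x)
    return math.exp(-a0 * sse / (2 * sigma2))

def trapezoid(f, lo, hi, steps=20_000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, steps))
    return total * h

def power_prior_variance(x, sigma2, g, half_width=15.0):
    """Variance of the normalised power prior, by numerical integration."""
    norm = trapezoid(lambda b: power_prior_kernel(b, x, sigma2, g),
                     -half_width, half_width)
    second = trapezoid(lambda b: b * b * power_prior_kernel(b, x, sigma2, g),
                       -half_width, half_width)
    return second / norm

def g_prior_variance(x, sigma2, g):
    """Scalar version of g * sigma^2 * (X'X)^(-1)."""
    return g * sigma2 / sum(xi * xi for xi in x)

x = [0.5, -1.0, 1.5, 2.0]   # made-up covariate values
sigma2 = 2.0
g = len(x)                  # unit-information choice g = n
print(power_prior_variance(x, sigma2, g), g_prior_variance(x, sigma2, g))
```

The two variances agree up to numerical error, and since the kernel is symmetric around zero the power prior is the N(0, g * sigma^2 / sum(x_i^2)) density in this scalar case.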
The g-prior has been widely used in practice for several reasons, among which: (a) analytical tractability for posterior inference; (b) connection to readily available variable selectors such as BIC; (c) ease of prior elicitation, because there is only one unspecified prior hyperparameter, namely $g$. With regard to (c), notice that $g$ has an interpretation similar to the inverse of the power parameter $a_0$ in the power prior setup. Therefore it determines the amount of prior information relative to the empirical or imaginary data. The information introduced by the prior can be measured by the ratio $n/g$, which can be regarded as the effective sample size of the prior. Hence for the default choice $g = n$ the prior information is equivalent to adding one observation to our analysis, while for $g = 1$ it is equivalent to adding $n$ observations. The prior mean of $\beta_\ell$ is usually set equal to zero, also to favor shrinkage of parameter values towards zero, especially for those components which are not especially relevant. Alternative choices of $g$ have been proposed in the literature; see for example Foster and George (1994) and Fernández et al. (2001). Empirical Bayes approaches have also been proposed for the specification of $g$; see for example George and Foster (2000), Hansen and Yu (2001) and Liang et al. (2008). Both versions (14) and (15) of g-priors with $g = n$ asymptotically lead to a BIC based variable selection procedure.
Zellner's g-prior leads to a consistent variable selection method; however it suffers from an "information paradox" (Liang et al., 2008). In response to this criticism, Zellner (2008) argued that a Bayesian procedure which places a high posterior model probability (but not equal to one), even on a limiting perfectly fitted model, is a reasonable answer, in line with the philosophy of Box ("all models are wrong"), and with Jeffreys (1961) who claimed that there is always an infinite number of models that can perfectly fit the data. Finally, the posterior model probability eventually converges to one as the sample size increases, which again is a plausible behavior because uncertainty progressively reduces as data information is accumulated.
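The information paradox can be checked numerically. The sketch below assumes the closed-form Bayes factor reported by Liang et al. (2008) for a model with $p$ covariates against the intercept-only model under the centered g-prior (15), namely $(1+g)^{(n-1-p)/2}\,[1+g(1-R^2)]^{-(n-1)/2}$, where $R^2$ is the ordinary coefficient of determination. The function name is hypothetical.

```python
import math


def log_bf_g_prior(n, p, r2, g):
    """Log Bayes factor of a p-covariate model versus the null model for a fixed g."""
    return (0.5 * (n - 1 - p) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - r2)))


n, p, g = 20, 2, 20
bound = 0.5 * (n - 1 - p) * math.log1p(g)  # value attained when R^2 = 1
for r2 in (0.5, 0.9, 0.99, 1.0):
    print(f"R^2 = {r2}: log BF = {log_bf_g_prior(n, p, r2, g):.3f} (upper bound {bound:.3f})")
```

Even for a perfect fit the Bayes factor stops at the finite value $(1+g)^{(n-1-p)/2}$, so the evidence against the null model cannot grow without bound for fixed $n$ and $g$.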

Mixtures of g-priors
A natural extension of g-priors can be obtained by considering a hyper-prior π(g) in order to "let the data decide" about the value of g. Although Zellner (1986) had already suggested such an extension, no solid scientific arguments existed before the work of Liang et al. (2008), which justified theoretically the use of hyper-priors. Since g is nothing but the power parameter as described in the previous paragraph, any mixture of g-priors can be considered as a power-prior with fixed imaginary data and a hyper-prior placed on a 0 , that controls the amount of prior information which is fed into the posterior.
Within the normal linear regression model formulation, Cui and George (2008) and Liang et al. (2008) introduce in (15) the hyper-g prior, which places a Beta prior on the shrinkage parameter $g/(g+1)$ with hyperparameters $1$ and $a/2 - 1$, leading to a mean equal to $2/a$. The induced hyper-prior for $g$ has density function $\pi(g) = \frac{a-2}{2}(1+g)^{-a/2}$, for $g > 0$. Liang et al. (2008) suggested the value $a = 4$ (uniform prior on the shrinkage parameter), or $a = 3$, with prior mean shrinkage equal to $2/3$. Another sensible choice is $a = 2(1 + 1/n)$, so that $E[g/(g+1)] = n/(n+1)$, which corresponds to the shrinkage of the unit-information setup of the g-prior (i.e. for $g = n$). Generally, any choice $2 < a \le 4$ leads to robust answers (Dellaportas et al., 2012), except for choices extremely close to 2, which eventually activate the Jeffreys-Lindley-Bartlett paradox. A practical disadvantage of the hyper-g variable selection method is that, for non-important covariates, it results in posterior covariate inclusion probabilities which are inflated towards 1/2 in comparison with other methods; for examples and discussion see Dellaportas et al. (2012).
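A short simulation can confirm the link between the density of $g$ and the Beta prior on the shrinkage factor. The sketch below draws $g$ by inverting the distribution function $F(g) = 1 - (1+g)^{-(a-2)/2}$, which follows by integrating the density above; the function names and the sample size are illustrative choices, not part of the cited papers.

```python
import random


def hyper_g_density(g, a):
    """Density pi(g) = (a - 2)/2 * (1 + g)^(-a/2), for g > 0 and a > 2."""
    return 0.5 * (a - 2.0) * (1.0 + g) ** (-0.5 * a)


def sample_hyper_g(a, size, seed=0):
    """Inverse-CDF draws of g, using F(g) = 1 - (1 + g)^(-(a - 2)/2)."""
    rng = random.Random(seed)
    # 1 - U lies in (0, 1], so the power below is always well defined.
    return [(1.0 - rng.random()) ** (-2.0 / (a - 2.0)) - 1.0 for _ in range(size)]


def mean_shrinkage(a, size=50_000, seed=0):
    """Monte Carlo estimate of E[g/(1+g)], to be compared with 2/a."""
    draws = sample_hyper_g(a, size, seed)
    return sum(g / (1.0 + g) for g in draws) / size


for a in (3.0, 4.0):
    print(f"a = {a}: simulated mean shrinkage {mean_shrinkage(a):.3f}, theory {2.0 / a:.3f}")
```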
Under the hyper-g prior, the induced variable selection method is consistent in terms of prediction and of model selection (C2) for any true model except the null, and it is information consistent (C3). Model selection consistency under the null is achieved under the hyper-g/n prior, whose density is $\pi(g) = \frac{a-2}{2n}(1+g/n)^{-a/2}$, for $g > 0$. Alternatively, one can consider the reparametrization $g = n g^*$ and place a hyper-g prior on $g^*$. The reciprocal of the variance multiplier, $1/g^* = n/g$, measures the units of information, in data points, added to the analysis via the prior. Under this parametrization, a $\mathrm{Beta}(1, a/2-1)$ prior is assigned to the factor $g^*/(g^*+1) = g/(g+n)$. In a similar manner, Ley and Steel (2012) use a Beta distribution with hyperparameters $b$ and $c$ on $g/(g+n)$ (they also use a more specific horseshoe type of prior for the same shrinkage factor). Computations in normal linear regression models are relatively straightforward because the marginal likelihoods involved in all model comparisons require the computation of one-dimensional integrals.
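The one-dimensional integral mentioned above can be sketched as follows for the hyper-g prior. The code assumes the fixed-$g$ Bayes factor against the null model given by Liang et al. (2008), $(1+g)^{(n-1-p)/2}[1+g(1-R^2)]^{-(n-1)/2}$, and averages it over $\pi(g)$. The change of variable $u = g/(1+g)$ maps the integral to $(0,1)$, giving $\frac{a-2}{2}\int_0^1 (1-u)^{(a+p)/2-2}(1-R^2u)^{-(n-1)/2}\,du$. The midpoint rule and the grid size are illustrative choices; a production implementation would rather use a hypergeometric function or an adaptive quadrature.

```python
def bf_hyper_g(n, p, r2, a=3.0, grid=20_000):
    """Bayes factor of a p-covariate model versus the null under the hyper-g prior.

    Midpoint-rule evaluation of
    (a-2)/2 * int_0^1 (1-u)^((a+p)/2 - 2) * (1 - r2*u)^(-(n-1)/2) du,
    where u = g/(1+g) is the shrinkage factor.
    """
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) * h
        total += (1.0 - u) ** (0.5 * (a + p) - 2.0) * (1.0 - r2 * u) ** (-0.5 * (n - 1))
    return 0.5 * (a - 2.0) * total * h


for r2 in (0.0, 0.5, 0.9):
    print(f"R^2 = {r2}: BF = {bf_hyper_g(30, 2, r2):.4g}")
```

When $R^2 = 0$ the integral is available in closed form and equals $(a-2)/(a+p-2)$, which gives a quick sanity check of the numerical routine.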
Mixtures of g-priors include the Cauchy prior of Zellner and Siow (1980) which can be re-expressed as a mixture of g-priors with an inverse gamma hyper-prior with parameters 1/2 and n/2 (Liang et al., 2008), the approaches by Maruyama and George (2011) and George and Maruyama (2014), and the robust prior of Bayarri et al. (2012). Maruyama and George (2011) propose to use a Beta-prime distribution for g under which g/(1 + g) has a Beta prior with hyperparameters b and c and proposed values c = 1/4 and b = (n − p − 1)/2 − (1 − c) for model M when the number of covariates p is lower than n − 1. Therefore, this prior uses model specific hyperparameters: a feature that was not adopted in the original formulation of Liang et al. (2008).
Extensions to generalized linear models have been introduced by Sabanés Bové and Held (2011), Li (2013) and Li and Clyde (2016), where calculations of the posterior probabilities can be based on Laplace approximations or on trans-dimensional MCMC methods. Additional articles related to mixtures of g-priors include the work of Malesios et al. (2017), in which hyper-g variable selection is implemented for zero-inflated Poisson epidemic models for sheep-pox incidence, and the work of Sabanés  where they implement hyper-g priors in generalized additive models with penalized splines. Mukhopadhyay and Minerva (2017) propose a mixture of g-priors for variable selection when the number of regressors increases with the sample size. Som et al. (2015) introduce the block hyper-g priors in order to avoid undesirable behaviors appearing when one coefficient is much larger than the rest. Wetzels et al. (2012) apply the hyper-g priors in ANOVA designs, while Wang (2017) studies the behavior of hyper-g priors in ANOVA models when the number of parameters grows with the sample size.
Building on the seminal ideas of Jeffreys (1961) and with the goal to generalize the priors developed by Zellner and Siow (1980), Bayarri and García-Donato (2008) propose divergence based (DB) priors for general testing purposes in an objective framework. A DB prior for the comparison of two models is a function of a unitary symmetrized Kullback-Leibler divergence between the two models. This function is chosen so that the resulting prior has a desirable tail behavior. They apply their methodology in challenging scenarios such as irregular models and mixture models, showing that DB priors are well defined and enjoy appealing properties.

Random imaginary data
We proceed with the more recent introduction of prior distributions that treat imaginary data as stochastic components. The idea was independently introduced by Pérez and Berger (2002) and Neal (2001), while the power version of this prior was later introduced by Fouskakis et al. (2015) in order to reduce the amount of information introduced by the size of the training dataset. Pérez and Berger (2002) have developed priors for Bayesian hypothesis testing through the device of "imaginary training samples" (Good, 1950; Spiegelhalter and Smith, 1980; Iwaki, 1997). The expected posterior prior (EPP) for the parameter under a given model is the expectation of the posterior distribution given imaginary observations $y^*$ of size $n^*$, where the expectation is taken with respect to a suitable probability measure $m^*(y^* \mid M^*)$ under a reference model $M^*$, while the posterior distribution is computed via Bayes's theorem starting from a baseline, typically improper, prior. Specifically, consider model $M$ with distribution $f(\cdot \mid \theta, M)$ and baseline prior $\pi^N(\theta \mid M)$. The EPP is given by
$$\pi^{EPP}(\theta \mid M) = \int \pi^N(\theta \mid y^*, M)\, m^*(y^* \mid M^*)\, dy^*, \qquad (16)$$

Expected posterior priors
where $\pi^N(\theta \mid y^*, M) \propto f(y^* \mid \theta, M)\,\pi^N(\theta \mid M)$ is the posterior distribution of $\theta$ under model $M$ conditionally on the imaginary data $y^*$ for the given baseline prior $\pi^N(\theta \mid M)$. Consider now the comparison of several models having the same structure. There will typically exist a model $M_0$ which is nested into each of the remaining models (the simplest model). In this case setting $M^*$ to $M_0$ is a reasonable choice, under the "local" principle described previously in this section. Accordingly $m^*(y^* \mid M^*)$ will be the prior predictive distribution under $M_0$, namely
$$m^*(y^* \mid M^*) = \int f(y^* \mid \theta_0, M_0)\, \pi^N(\theta_0 \mid M_0)\, d\theta_0, \qquad (17)$$
where $f(\cdot \mid \theta_0, M_0)$ is the distribution under model $M_0$, with model specific parameter $\theta_0$, and $\pi^N(\theta_0 \mid M_0)$ is the baseline prior under $M_0$. Notice that $m^*(y^* \mid M^*)$ may be improper; this will occur in (17) whenever $\pi^N(\theta_0 \mid M_0)$ is improper. If $M^* = M_0$, then it is straightforward to show that the EPP for the parameter $\theta$ reduces to the intrinsic prior for nested model comparison because, writing $m^N(y^* \mid M) = \int f(y^* \mid \theta, M)\,\pi^N(\theta \mid M)\, d\theta$ for the marginal distribution under $M$ and $m^N(y^* \mid M_0)$ for the right-hand side of (17),
$$\pi^{EPP}(\theta \mid M) = \pi^N(\theta \mid M) \int f(y^* \mid \theta, M)\, \frac{m^N(y^* \mid M_0)}{m^N(y^* \mid M)}\, dy^*.$$
Additionally, it is immediate to verify that $\pi^{EPP}(\theta_0 \mid M_0) = \pi^N(\theta_0 \mid M_0)$, so the EPP and the intrinsic prior for $\theta_0$ also coincide. Pérez and Berger (2002, Eq. 2.1) provide conditions for the existence of the EPP; namely that $\pi^N(\theta \mid y^*, M)$ is proper and that the expectation in (16) is positive and finite.
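The two-stage definition of the EPP lends itself to direct simulation. The toy sketch below is our own illustration, not an example taken from the cited papers: the reference model $M_0$ is $y \sim N(0,1)$, the larger model $M$ is $y \sim N(\theta,1)$ with a flat baseline prior, and the training sample has size $n^* = 1$. In this case $\pi^N(\theta \mid y^*, M)$ is $N(y^*, 1)$, so the EPP is the $N(0, 2)$ distribution, which the simulation should reproduce.

```python
import random
import statistics


def sample_epp(size, seed=0):
    """Draws from the EPP of theta in the toy normal example (n* = 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        y_star = rng.gauss(0.0, 1.0)    # imaginary observation generated under M0
        theta = rng.gauss(y_star, 1.0)  # draw from the baseline posterior N(y*, 1)
        draws.append(theta)
    return draws


draws = sample_epp(50_000)
print(f"mean = {statistics.mean(draws):.3f}, variance = {statistics.variance(draws):.3f}")
```

The simulated variance should be close to 2, that is, the unit variance of the imaginary observation plus the unit posterior variance.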
EPPs offer the same advantages as intrinsic priors, among which: i) impropriety of baseline priors causes no indeterminacy in the resulting Bayes factor; ii) they provide an effective way of establishing compatibility of priors across models, as already mentioned in Section 3.3, because all priors are anchored to the same baseline measure $m^*(\cdot)$. On the downside, EPPs rely on features of the imaginary training sample, such as the size $n^*$, or, in variable selection problems, the choice of the imaginary design matrices $X^*$ for each competing model. The selection of a minimal training sample size $n^*$ has been proposed (Berger and Pericchi, 2004) to make the information content of the prior as small as possible, and this is an appealing idea. But even under this setup, the resulting prior can be influential when the sample size $n$ is not much larger than the total number of parameters under the full model; see Fouskakis et al. (2015) for a discussion of the difficulties associated with the implementation of the EPP with particular reference to variable selection.
Under the variable selection problem in normal linear regression models, Womack et al. (2014) and Fouskakis et al. (2017a) show that the EPP, using $M_0$ as the reference model, a minimal training sample of size $n^* = p + 2$ and default baseline priors, can be expressed as the mixture of g-priors in (18), where $\mathrm{Beta}(t \mid a, b)$ denotes the density of the Beta distribution with parameters $a$ and $b$ evaluated at $t$, $X^*_0$ is a $(p+2) \times (p_0+1)$ imaginary design matrix under model $M_0$ and $X^* = [X^*_0, X^*_{\setminus 0}]$ is a $(p+2) \times (p+1)$ imaginary design matrix under model $M$. Imaginary design matrices are formed by suitably subsetting the original full imaginary design matrix. Fouskakis et al. (2015) and Fouskakis and Ntzoufras (2016b) introduced the power-expected-posterior (PEP) prior and the power-conditional-expected-posterior (PCEP) prior respectively, as generalized versions of the EPPs, by combining ideas from the power prior method of Ibrahim and Chen (2000) and the unit information prior approach of Kass and Wasserman (1995). The goal is to produce a minimally informative prior, and at the same time to diminish the effect of training samples within the EPP methodology. In practice, the PEP methodology is sufficiently insensitive to the size $n^*$ of the training sample, because PEPs are constructed using unit information ideas, so that one may even take $n^* = n$.

Power expected posterior priors
Under the PEP methodology, as a first step, the likelihoods involved in the EPP distribution are raised to the power $1/\delta$ ($\delta \ge 1$) and then they are density-normalized. The power parameter $\delta$ could be set equal to $n^*$, to represent information equal to one data point. For $\delta = 1$ the PEP prior is equivalent to the EPP. Regarding the size $n^*$ of the training sample, Fouskakis et al. (2015) set it equal to $n$. This choice gives rise to significant advantages: for example, in the variable selection problem it leads to setting the imaginary design matrix equal to the observed one, so that the selection of a training sample of covariates, and its effect on the posterior model comparison, is avoided, while the prior information content is still held equivalent to one data point.
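Here is an outline of the PEP method. Suppose we wish to compare model $M_0$ and $M$ with $M_0$ nested in $M$. Assuming $M^* = M_0$, the PEP prior is obtained from the EPP in (16) by replacing each likelihood with its density-normalized power version, that is
$$\pi^{PEP}(\theta \mid M; \delta) = \int \pi^N(\theta \mid y^*, M; \delta)\, m^N(y^* \mid M_0; \delta)\, dy^*,$$
with $\pi^N(\theta \mid y^*, M; \delta) \propto f(y^* \mid \theta, M; \delta)\, \pi^N(\theta \mid M)$, $m^N(y^* \mid M_0; \delta) = \int f(y^* \mid \theta_0, M_0; \delta)\, \pi^N(\theta_0 \mid M_0)\, d\theta_0$, and
$$f(y^* \mid \theta, M; \delta) = \frac{f(y^* \mid \theta, M)^{1/\delta}}{\int f(y^* \mid \theta, M)^{1/\delta}\, dy^*}.$$
When the density-normalized power likelihood is not a distribution of a known form, one can resort to a suitable extension of the above method, as illustrated in Fouskakis et al. (2017b).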
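The effect of the power parameter can be illustrated with a toy simulation, which is our own sketch under simplifying assumptions rather than an example from the cited papers. Take $M_0$: $y_i \sim N(0,1)$ and $M$: $y_i \sim N(\theta,1)$ with a flat baseline prior. Raising a unit-variance normal likelihood to the power $1/\delta$ and normalizing gives a normal density with variance $\delta$, so imaginary data are $N(0,\delta)$ under $M_0$ and the baseline posterior is $N(\bar y^*, \delta/n^*)$. The resulting prior for $\theta$ is $N(0, 2\delta/n^*)$, and the choice $\delta = n^*$ gives variance 2 whatever the training sample size.

```python
import math
import random
import statistics


def sample_pep(n_star, delta, size, seed=0):
    """Draws from the PEP prior of theta in the toy normal example."""
    rng = random.Random(seed)
    sd_obs = math.sqrt(delta)            # N(theta, 1)^(1/delta), normalized, is N(theta, delta)
    sd_post = math.sqrt(delta / n_star)  # baseline posterior sd given n* imaginary points
    draws = []
    for _ in range(size):
        # imaginary sample mean under the reference model M0
        y_bar = statistics.fmean(rng.gauss(0.0, sd_obs) for _ in range(n_star))
        draws.append(rng.gauss(y_bar, sd_post))
    return draws


for n_star, delta in ((5, 1), (5, 5), (25, 25)):
    var = statistics.variance(sample_pep(n_star, delta, 20_000))
    print(f"n* = {n_star}, delta = {delta}: simulated variance {var:.3f}, theory {2 * delta / n_star:.3f}")
```

With $\delta = 1$ (the EPP) the prior tightens as $n^*$ grows, whereas with $\delta = n^*$ its spread stays fixed, which matches the insensitivity to the training sample size described above.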
Under the variable selection problem in normal linear regression models Fouskakis et al. (2017a) show that the PEP prior, using M 0 as the reference model, a training sample size equal to n, the default baseline priors and δ = n, can be expressed as a mixture of g-priors where Σ \0 is defined in analogy with Σ * \0 in (18) based on the sample design matrix.

Empirical Bayes approaches
Empirical Bayes (EB) approaches have been traditionally used to alleviate prior elicitation in multi-parameter setups (e.g. hierarchical models) by setting some prior hyperparameters equal to the corresponding sample estimates. The main criticism against EB is the obvious double use of the data, which violates a basic principle of Bayesian theory. This can however be mitigated by combining EB with other ideas described in the previous section, such as the unit information principle, in order to minimize the re-use of the data, especially in cases when the sample size is not large. EB methods in model selection usually focus on the specification of the prior for a small number of parameters, typically those causing the sensitivity of the Bayes factor. Estimates of hyperparameters are obtained either by maximizing a suitable integrated likelihood, see for example George and Foster (2000), or by controlling the false discovery rate (Tansey et al., 2018). With regard to the variable selection problem, EB methods have been used to specify (a) the parameter $g$ in the g-prior (George and Foster, 2000; Liang et al., 2008); (b) the prior inclusion probability (George and Foster, 2000; Scott and Berger, 2010; Castillo and Misner, 2018); (c) the shrinkage parameter under the lasso setting (Yuan and Lin, 2005).
Finally we note that empirical versions of EPP and PEP can be produced by using the empirical distribution of the actual data to specify the predictive measure under the reference model, see for example Pérez and Berger (2002).

Non-local priors
Recall that criterion C7 described in Section 3.3 can be understood as a formalization of Jeffreys' criterion for comparing nested models. This says that the prior for the specific parameter of the larger model (the alternative hypothesis) should be "centered at the simplest model". In practice this has been implemented by assigning a continuous prior having its mode at the parameter value specified by the null model. Priors of this type are called local priors. On the other hand, Johnson and Rossell (2010) proposed the use of non-local priors in order to improve convergence rates in favor of the true null hypothesis. Such priors have densities which vanish on the null subspace. Examples of such priors are the moment prior and the inverse moment prior; for details see Johnson and Rossell (2010). In a discussion of Consonni and La Rocca (2011), Rousseau and Robert suggest casting the testing problem in a decision-theoretic setup and using the well-known duality between prior and loss function (Rubin, 1987) to replace non-local priors with suitable loss functions that take into account the distance from the null.

Comparison of priors for Bayesian variable selection in normal linear models
For the variable selection problem in normal linear regression models, most of the priors discussed in the previous sections can be expressed as mixtures of g-priors. Table 1 provides a summary. Save for the first three, all the remaining priors are mixtures of g-priors. Moreover, with the exception of the EPP and Maruyama and George prior, they can be written in the general form of the robust prior (10) with π R (g) replaced by a specific distribution as detailed in Table 1. The robust prior fulfills all the desiderata of Bayarri et al. (2012). Regarding the rest of the priors in Table 1, we have the following results with respect to the seven criteria.
• All priors lead to consistent model selection procedures (criterion C2 ); for the g-prior see Fernández et al. (2001); for the Cauchy, the hyper-g and hyper-g/n see Liang et al. (2008) (with the hyper-g only to suffer from model selection inconsistency when the true model is the null model); for the Maruyama and George prior see Maruyama and George (2011); for the EPP see  and finally for the PEP prior see Fouskakis et al. (2015) and Fouskakis and Ntzoufras (2016a).
• Liang et al. (2008) showed that the g-prior suffers from information inconsistency; while the Cauchy, the hyper-g and hyper-g/n priors satisfy the criterion C3 of information consistency. Finally, Fouskakis and Ntzoufras (2017) proved that model selection under PEP is free from information inconsistency.

Objective priors on model space
Within the M-closed view of model selection (i.e. the true model is included in $\mathcal{M}$), the default choice to express ignorance or indifference between two or more models under comparison was, for many years, the uniform distribution on the model space $\mathcal{M}$, that is $\pi(M) = 1/|\mathcal{M}|$ for all $M \in \mathcal{M}$, where $|\mathcal{M}|$ denotes the cardinality of $\mathcal{M}$. For variable selection problems, letting $p$ denote the potential number of predictors beyond those which must be present in all models, the uniform prior distribution $\pi(M) = 2^{-p}$ is obtained by assuming that each predictor enters the model independently with inclusion probability 1/2. In recent years, this choice has become progressively less popular, because it does not account for structural features, notably sparsity, dimensionality, and collinearity of predictors. In particular Chipman et al. (2001) and George (2010) discuss how to construct dilution priors, which are uniform over neighborhoods of models regarded as similar according to some criterion. Scott and Berger (2010) argue that prior model probabilities should take into consideration multiplicity issues inherent in model comparisons. When applied to variable selection problems, this principle can be implemented by assuming that, conditionally on a random probability of inclusion $\omega$, each predictor can enter a model independently, so that $\pi(M \mid \omega) = \omega^{p_M}(1-\omega)^{p-p_M}$, where $p_M$ denotes the number of predictors included in $M$.
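Next, a hyper-prior is assigned to $\omega$; in particular if $\omega \sim \mathrm{Beta}(a_\omega, b_\omega)$, the resulting prior becomes
$$\pi(M) = \frac{B(a_\omega + p_M,\; b_\omega + p - p_M)}{B(a_\omega, b_\omega)}, \qquad (22)$$
where $B(\cdot,\cdot)$ is the Beta function and $p_M$ is the number of predictors in model $M$; this is commonly known as the beta-binomial prior on model space. The default choice $a_\omega = b_\omega = 1$ results in a uniform distribution for $\omega$. Under this specification, (22) reduces to
$$\pi(M) = \frac{1}{p+1}\binom{p}{p_M}^{-1}, \qquad (23)$$
which induces a uniform prior on model size: $\pi(\{M \in \mathcal{M} : p_M = d\}) = 1/(p+1)$ for $d = 0, 1, \ldots, p$.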
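The beta-binomial prior is easy to evaluate directly. The sketch below computes the prior probability of a single model with $k$ of the $p$ candidate predictors by integrating $\omega^{k}(1-\omega)^{p-k}$ against a $\mathrm{Beta}(a_\omega, b_\omega)$ density, and then the implied prior on model size; function names are hypothetical.

```python
import math


def log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)


def beta_binomial_model_prior(k, p, a=1.0, b=1.0):
    """Prior probability of one specific model containing k of the p predictors."""
    return math.exp(log_beta(a + k, b + p - k) - log_beta(a, b))


def prior_on_model_size(d, p, a=1.0, b=1.0):
    """Total prior probability of all models of size d (there are C(p, d) of them)."""
    return math.comb(p, d) * beta_binomial_model_prior(d, p, a, b)


p = 10
for d in (0, 5, 10):
    print(f"size {d}: per-model prior {beta_binomial_model_prior(d, p):.3e}, "
          f"size prior {prior_on_model_size(d, p):.4f}")
```

With $a_\omega = b_\omega = 1$ every model size receives probability $1/(p+1)$, so an individual model of intermediate size, which shares that mass with many competitors, gets far less prior weight than the null or the full model.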
The choice of a uniform prior on $\omega$ provides more support to individual models having either low or high dimensionality and does not penalize for complexity. Wilson et al. (2010) propose $a_\omega = 1$ and $b_\omega = \lambda p$, where $\lambda$ is a positive constant, resulting in a prior on model dimension having expectation approximately equal to $1/\lambda$, and a behavior similar to a geometric distribution for low values of the dimension. This prior also corresponds to an approximate penalization equal to $\log(\lambda + 1)$ in log-odds scale for each additional covariate added to the model. Castillo et al. (2015) investigate high-dimensional linear regression models under sparsity constraints. Conditionally on the size of the set of predictors, the prior on the regression parameter is a mixture of point masses at zero and continuous distributions. Assuming the prior and the design matrix satisfy some conditions, they show a variety of contraction properties for the posterior distribution, including the correct selection of at least the coefficients that are significantly different from zero. Further results of their approach are reported in Section 4. Womack et al. (2015) take a geometric approach, and argue, using isometry considerations on model space, that the appropriate distribution on model size is a truncated Poisson, while the prior probability of models having the same size is uniform. This provides a consistent model selection procedure. Another common way to specify Bayesian procedures which account for multiple testing is via the control of the false discovery rate (FDR); see for example Storey (2003).
We close this section with two alternative treatments of the specification of the prior on the model space. The first approach, introduced by Dellaportas et al. (2012), argues that we should jointly specify the prior on the model parameters and the model space; see Robert (1993) for related ideas. The key point is that, by relating the two aspects, sensitivity of posterior model probabilities to the prior variance of the model coefficients can be avoided by suitable specification of prior model probabilities $\pi(M)$, $M \in \mathcal{M}$. For example in the g-prior setup it is straightforward to see that setting $\pi(M) \propto g^{(p_M+1)/2}$ in (14) or $\pi(M) \propto g^{p_M/2}$ in (15), with $p_M$ the number of covariates in $M$, will eliminate any dependence of the posterior model probability $\pi(M \mid y)$ on the prior variance multiplier $g$. To illustrate the method, consider the modified g-prior specification (15), conditional on the intercept and error variance. Dellaportas et al. (2012) propose to use prior model probabilities whose structure involves a baseline model weight $p(M)$, which should reflect prior features of the model not related to the prior distribution on the model parameters, such as model dimension or complexity, or sparsity preferences. They note that setting $p(M) \propto 1$ will result in posterior model probabilities "which are asymptotically equivalent to those implied by BIC". Alternative choices of $p(M)$ can be obtained by matching the log-posterior model probabilities to suitable information criteria, although $p(M)$ should not change according to the sample size. The approach based on the joint specification on model and parameter spaces not only avoids the sensitivity of the posterior model probabilities to the prior uncertainty of model parameters, but also produces Bayesian model averaging estimators which do not suffer from the Jeffreys-Lindley-Bartlett paradox.
The second approach to the specification of prior model probabilities is proposed by Villa and Walker (2015b), and it is strictly related to the method for obtaining objective priors in models with discrete parameter space, already discussed in Section 2. The basic idea is that each model $M$ has a worth, which only depends on how "close" in KL-divergence $M$ is to its nearest neighbor in the collection of models under consideration (the smaller the divergence, the smaller the worth, because it means that $M$ can be excluded with a small loss). Since the worth depends on no other considerations, the method can claim to fall within the objective methodology. This leads to the specification in (24), where $D_{KL}$ is the KL-divergence; see Section 2. This approach has been illustrated in a variety of simple model comparisons (nested and non-nested) in Villa and Walker (2015b), and in Villa and Walker (2017) for the testing setup described in Lindley (1957). Villa and Lee (2015) have extended the method to variable selection in normal linear regression models. In such problems, (24) is proportional to one for all models, which induces the uniform prior on model space. To resolve this issue, Villa and Lee (2015) introduced an additional loss function based on the dimensionality/complexity of the model.
Finally, Spitzner (2011) introduced the idea of "neutral" data which support neither of the two hypotheses/models under consideration. This idea can be naturally accommodated for the construction of "objective" priors on the model space.

High-dimensional models
Current applications of statistical methods often deal with high-dimensional models, wherein the derivation of an objective prior, defined according to a well established formal rule, like Jeffreys' or reference prior, is virtually impossible; see also Section 2. In regression settings, common default priors such as the g-prior and its extensions to random g, are not defined when the number of predictors p is larger than the sample size n, save for the generalized g-prior of Maruyama and George (2011). The "robust" prior of Bayarri et al. (2012) suffers from the same problem because it requires the existence of the maximum likelihood estimator for each model under consideration. Similarly the intrinsic, or more generally the Expected Posterior prior (EPP), methodology would require a training sample size n * bigger than n. This means that the training design matrix X * should be taller than the observed X matrix, with extra rows that would need to be fixed exogenously. This raises inevitable concerns for the OB approach, although they could be mitigated through a suitable discounting factor within the PEP methodology. More generally, high-dimensional problems pose new challenges that need be addressed through novel methodologies.
1. Sparsity. Consider the sparse normal means problem in (25), where $n$ is typically very large. Let $\theta_0 = (\theta_{01}, \ldots, \theta_{0n})$ be the true mean value. Under sparsity, in the near-black sense, the number $p_n$ of $\theta_{0i}$'s different from zero (signals) is allowed to grow with $n$ but at a slower rate, so that $p_n = o(n)$. The goal is to estimate $\theta = (\theta_1, \ldots, \theta_n)$, distinguishing signal from noise.
2. Shrinkage. Bayesian methods are ideally suited for creating suitable shrinkage in many dimensions, as has been recognized for many decades, starting from the seminal work of Stein (1956). Indeed sparsity and shrinkage, though distinct, are closely related as we look for priors that do shrink strongly on noise components. On the other hand, strong signals should be clearly picked-up, and model estimates of the corresponding parameters should undergo negligible shrinkage. Priors which achieve this goal are often named, in this context, robust.
3. p >> n situations. High-dimensionality often means that the number of parameters p exceeds the sample size n, a situation which is routinely found today in many applications. Improper priors cannot deal with these cases, and accordingly suitable proper priors need be developed.
A large body of research has been deployed to develop default proper priors for high dimensional models. Typically the performance of these priors is assessed in relation to: 1) computational efficiency; 2) frequentist assessment, especially in terms of the speed of concentration of the posterior parameter distribution, or functionals thereof, to the true value, and in terms of coverage of credible sets; 3) ease of interpretation, so that tuning hyperparameters (when present) can be readily set in specific applications.
The number of papers dealing with the above topics has literally mushroomed in the last decade, and we cannot even try to provide a reasonably exhaustive review of the various contributions. Accordingly, we shall merely present a highly selective account in order to provide the interested reader with some useful signposts. A general point to make is that, in these situations, the typical use of proper priors makes the distinction between objective priors for estimation and testing redundant. Most of the proposals can be collected under two broad categories: 1) spike-and-slab priors and 2) global-local priors.
The spike-and-slab prior (George and McCulloch, 1993) for $\theta_i$ is a two-point mixture of distributions, one being absolutely continuous and heavy-tailed (the slab), and the other a Dirac measure at zero. More formally, conditionally on a latent binary random vector $\gamma = (\gamma_1, \ldots, \gamma_n)^T$, one has, independently for $i = 1, \ldots, n$,
$$\theta_i \mid \gamma_i \sim (1-\gamma_i)\,\delta_0(\cdot) + \gamma_i\,\psi(\cdot \mid \lambda_1), \qquad (26)$$
where $\delta_0(\cdot)$ is the Dirac delta function while $\psi(\cdot \mid \lambda_1)$ is the slab distribution, possibly depending on a fixed hyperparameter $\lambda_1$. The latent vector $\gamma$ in turn is assigned a distribution $\pi(\gamma \mid \nu)$. Castillo and van der Vaart (2012) show that, under the prior (26) and a suitably chosen value for $\nu$, or a suitable beta-prior $\pi(\nu)$, the whole posterior distribution concentrates on the true value at the minimax rate. The same result holds for several posterior estimators, under a convex loss, targeted to both location and spread parameters. Castillo et al. (2015) provide contraction results in a Gaussian regression setup under a family of joint distributions for the size of the active covariates (signals) and the regression parameter which includes the spike-and-slab prior. A remarkable result is that the product of Laplace priors for the individual regression coefficients, whose mode is the popular lasso estimator, produces a posterior distribution which fails to contract at the same speed as the mode.
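A draw from a spike-and-slab prior can be sketched in a few lines. The code below makes two assumptions that the text leaves open: the indicators $\gamma_i$ are independent Bernoulli($\nu$) variables, which is only one possible choice of $\pi(\gamma \mid \nu)$, and the slab $\psi$ is a Laplace density with scale `slab_scale` playing the role of $\lambda_1$. Both the function name and the parameter names are hypothetical.

```python
import random


def sample_spike_and_slab(n, nu, slab_scale=1.0, seed=0):
    """One prior draw of (theta_1, ..., theta_n) from a point-mass/Laplace mixture."""
    rng = random.Random(seed)
    theta = []
    for _ in range(n):
        is_signal = rng.random() < nu  # gamma_i ~ Bernoulli(nu), assumed independent
        if is_signal:
            magnitude = rng.expovariate(1.0 / slab_scale)  # |theta_i| ~ Exp with mean slab_scale
            theta.append(magnitude if rng.random() < 0.5 else -magnitude)
        else:
            theta.append(0.0)          # spike: exact zero
    return theta


theta = sample_spike_and_slab(n=1000, nu=0.05)
n_signals = sum(1 for t in theta if t != 0.0)
print(f"{n_signals} nonzero components out of {len(theta)}")
```

Most components are exactly zero, which is the sense in which the prior encodes sparsity, while the few nonzero ones are spread according to the slab.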
Several elaborations of (26) have been considered, with special emphasis on continuous relaxations, that is replacing δ 0 (·) with a peaked continuous density (George and McCulloch, 1993;Ishwaran and Rao, 2005). The motivation is twofold: to enhance flexibility and to make the ensuing Bayesian analysis amenable to fast deterministic computation (Ročková and George, 2014). In particular, Ročková and George (2018) introduce the spike-and-slab lasso (SS-LASSO) prior where both components of the mixture are Laplace distributions, so that the resulting prior can be viewed as a compromise between the theoretical benchmark (26) and the (computationally convenient) single Laplace prior. A thorough theoretical evaluation of the SS-LASSO priors is undertaken in Ročková (2018), where connections with current penalized likelihood methods are established in order to enhance interpretation, and risk results are proved for estimators not only of functionals of the posterior distribution of θ i (especially the mode) but, importantly, for the whole posterior distribution. Castillo and Misner (2018) provide convergence results of the posterior distribution associated to a variety of spike and slab prior distributions when the key sparsity hyperparameter is calibrated via marginal maximum likelihood empirical Bayes.
An alternative approach, which is easy to implement using generic sampling tools, and is typically fully automatic, is represented by continuous scale mixture priors. Among the many existing proposals, and limiting ourselves to the general set-up exhibited in (25), we mention the normal-exponential-gamma prior (Griffin and Brown, 2010), and the very popular horseshoe prior (Carvalho et al., 2010; Polson and Scott, 2012b), which is hierarchically specified as in (27), for $i = 1, \ldots, n$; that is, the $\theta_i$'s are conditionally independent given the local parameters $\lambda_i$'s, which in turn are conditionally i.i.d. given the global parameter $\tau$. An interesting representation of the above priors is obtained by considering $\kappa_i = (1 + \tau^2\lambda_i^2)^{-1}$, $i = 1, \ldots, n$. Then the marginal posterior mean of $\theta_i$, conditionally on $\tau$, is
$$E(\theta_i \mid y_i, \tau) = \{1 - E(\kappa_i \mid y_i, \tau)\}\, y_i. \qquad (28)$$
Thus $\kappa_i \in [0, 1]$ operates as a local shrinkage factor for the $i$-th component of the model. On the other hand $\tau$ acts as a global parameter. The horseshoe prior is thus a global-local shrinkage prior because it is able to combine robustness in the tails with sparsity. The resulting conditional prior for $\kappa_i$ has a U-shape, depending on $\tau$, whence the name horseshoe given to the entire prior structure.
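The U-shape of the prior on the shrinkage factor can be seen by simulation. The sketch below assumes the standard horseshoe specification of Carvalho et al. (2010), in which the local scales $\lambda_i$ follow a standard half-Cauchy distribution; under that assumption and $\tau = 1$, $\kappa_i$ has a Beta(1/2, 1/2) distribution. Function names are hypothetical.

```python
import math
import random


def sample_kappa(tau, size, seed=0):
    """Prior draws of kappa_i = 1 / (1 + tau^2 * lambda_i^2), lambda_i ~ half-Cauchy(0, 1)."""
    rng = random.Random(seed)
    kappas = []
    for _ in range(size):
        lam = abs(math.tan(math.pi * (rng.random() - 0.5)))  # half-Cauchy draw by inversion
        kappas.append(1.0 / (1.0 + (tau * lam) ** 2))
    return kappas


def fraction_between(values, low, high):
    """Share of values falling in the interval [low, high)."""
    return sum(1 for v in values if low <= v < high) / len(values)


kappas = sample_kappa(tau=1.0, size=50_000)
print("near 0 (almost no shrinkage):", fraction_between(kappas, 0.0, 0.1))
print("middle                      :", fraction_between(kappas, 0.45, 0.55))
print("near 1 (almost total)       :", fraction_between(kappas, 0.9, 1.0 + 1e-12))
```

Mass piles up near $\kappa_i = 0$ (signals left almost untouched) and near $\kappa_i = 1$ (noise shrunk almost to zero), with comparatively little mass in between.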
The horseshoe prior approach has to be completed with the choice of a prior on $\tau$. This is the most sensitive issue and no clear default choice exists, although the common proposal is to adopt a half-Cauchy prior (Polson and Scott, 2012a). This issue is discussed in depth in Vehtari (2017a, 2017b), who propose an intuitive way of formulating the prior for $\tau$ based on prior assumptions on the effective number of nonzero parameters. Further elaborations on horseshoe priors are provided in Polson and Scott (2012a), Polson and Scott (2012b) and Bhadra et al. (2016).
The frequentist properties of the horseshoe priors have been analyzed in a series of papers; see for instance Datta and Ghosh (2013), who consider the asymptotic properties of the multiple testing rule induced by the estimator (28), and van der Pas et al. (2017), who consider the frequentist coverage of posterior intervals of the location parameters, and discuss the irreconcilability between adaptivity and honesty when the level of sparsity is unknown.
Compared with (27) with $\sigma = 1$, the Dirichlet-Laplace prior models the global parameter $\tau$ and the local parameters $\phi_i$'s independently.
An alternative way of modeling the scale parameters in a hierarchical setting with proper priors is given in Pérez et al. (2017). Instead of assuming the usual conjugate inverse Gamma or the half-Cauchy (Gelman, 2006), the authors suggest considering a Gamma mixture of Gamma densities, which is named Scaled Beta2 (SB2). It was previously derived in Girón et al. (2006) as an intrinsic prior for the scale parameter in a linear model. The two parameters of the mixing Gamma determine the behavior of the marginal density around zero and for large values, respectively, and make the SB2 family quite appealing for its flexibility. Additionally, the Cauchy-Scaled Beta2 is shown to represent an explicit horseshoe distribution.
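The Gamma-mixture construction can be checked numerically. In the sketch below, a Gamma variable with shape $p$ has a rate that is itself Gamma distributed with shape $q$ and rate $b$; integrating out the rate gives the density $\frac{\Gamma(p+q)}{\Gamma(p)\Gamma(q)\,b}\,(x/b)^{p-1}(1 + x/b)^{-(p+q)}$, that is, a beta prime (Beta2) law scaled by $b$. The symbols $p$, $q$, $b$ and the function names are notation chosen for the example and may differ from the parameterization adopted in Pérez et al. (2017).

```python
import numpy as np

def sample_sb2_mixture(p, q, b, size, rng):
    """Gamma mixture of Gammas: x | rho ~ Gamma(p, rate=rho), rho ~ Gamma(q, rate=b)."""
    rho = rng.gamma(shape=q, scale=1.0 / b, size=size)
    # NumPy parameterizes the Gamma by scale, so rate rho becomes scale 1 / rho.
    return rng.gamma(shape=p, scale=1.0 / rho)

def sample_sb2_direct(p, q, b, size, rng):
    """Scaled beta prime: b * G1 / G2 with G1 ~ Gamma(p, 1) and G2 ~ Gamma(q, 1)."""
    return b * rng.gamma(p, 1.0, size) / rng.gamma(q, 1.0, size)

rng = np.random.default_rng(0)
p, q, b = 0.5, 0.5, 2.0
x_mix = sample_sb2_mixture(p, q, b, 200_000, rng)
x_dir = sample_sb2_direct(p, q, b, 200_000, rng)

# Quantiles are compared instead of moments because the tails are heavy.
probs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.round(np.quantile(x_mix, probs), 3))
print(np.round(np.quantile(x_dir, probs), 3))
```

The two sets of quantiles should agree up to Monte Carlo error. With $p = q = 1/2$, the case used here, the same law is obtained for the square of a half-Cauchy variable with scale $\sqrt{b}$, which illustrates how the family relates to the half-Cauchy choice.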
Finally, non-local priors can also be represented as mixtures; in this case the mixing parameter is a latent truncation. Rossell and Telesca (2017) thoroughly investigate their behavior in high-dimensional settings, showing their good performance in terms of both model selection and estimation.

Discussion
Objective Bayesian analysis is here to stay, and so is the search for priors that allow its efficient implementation in a great variety of situations. Although we presented many such priors, we also tried to highlight principles and methods behind them. Paraphrasing a Reviewer of our paper: there is a galaxy of stars (priors) out there, but fortunately we also have categories to study, evaluate and organize them into meaningful systems.
Below we report on a few outstanding issues which are worthy of further consideration.
• OB priors for estimation and model selection. This distinction was posited at the very beginning of our review, because the conceptual framework underlying the construction of priors for estimation is different from that leading to priors for model selection, with the latter largely influenced by the approach initiated by Jeffreys (1961); see for instance the desiderata illustrated in Section 3.3.
Consider, however, a setting where prediction under model uncertainty is the goal, so that model averaging techniques (Hoeting et al., 1996) are employed. In this case one is potentially confronted with two separate priors on the parameter space of the same model: one to determine the model posterior probability, and another one to compute predictions (conditionally on a given model). This dichotomy is, however, hardly ever discussed explicitly. Typically the prior employed for model selection is also used to carry out estimation/prediction; see for instance Pérez and Berger (2002, Sect. 6) with regard to expected posterior priors, but the motivation is mostly pragmatic and confined to a specific data analysis. Interestingly, in the area of Bayesian experimental design, it is not uncommon to entertain two distinct priors for the same parameter of a given model, because one distinguishes between a prior for design and a prior for inference; see Han and Chaloner (2004) and earlier references therein.
• Priors for high-dimensional models. Our account of this body of research, in this article, is clearly too limited, especially with regard to important technical results on: i) sparsity conditions; ii) assumptions on the priors and features of the underlying model; iii) posterior contraction rates for several notions of recovery of the true model; iv) new computational tools, also alternative to traditional MCMC algorithms. We believe that a review paper devoted to default priors in high-dimensional settings will be a useful gift to the Bayesian community.
In this connection, a point we would like to raise concerns methods for evaluating the performance of priors in high dimensions. Currently this is measured in terms of rates of contraction of the posterior distribution (or functionals thereof) to the underlying true values. Among the desiderata that we laid out in Section 3.3, it seems that only properness of the prior and model selection consistency are taken into account. Actually, consistency becomes a rather weak property for evaluating priors, while the rates at which such consistency is achieved become more crucial. However, as one Reviewer pointed out, insistence only on frequentist properties is open to criticism, as one would like to embrace a "more Bayesian" perspective, possibly along the lines of newly formulated desiderata.
• Computational aspects. Computational aspects are becoming increasingly important for evaluating any statistical methodology. This is of course the case in high-dimensional settings, where scalability of a procedure is an obvious concern. From this perspective, Section 4 does not even come close to providing a reasonably complete account of current technology and trends, although some of the papers we reference contain substantial material on computation; see e.g. Ročková and George (2014) on leveraging the EM algorithm for variable selection. As already hinted above, we expect that a full treatment of this topic is better left to a specific review paper.
On a related point, we note that complex models pose challenges even with regard to traditional objective priors, such as the reference, and often the Jeffreys, priors, which are hard to obtain in closed form. On the other hand, it is also true that often the exact knowledge of the functional form of the prior is not strictly necessary. Nowadays, the vast majority of applications of Bayesian methods rely on the use of Monte Carlo, or other simulation methods, where the evaluation of the prior, rather than its form, is important. Also, it is often the case that, from a mathematical perspective, the hard step in computing the prior is the evaluation of an expected value. In this context, it is reasonable to include the algorithm for evaluating the prior within the general simulation method. This approach has been discussed in Lafferty and Wasserman (2013), and only sporadically mentioned in other papers (Berger and Sun, 2008; Berger et al., 2009).
• Priors for model selection based on the desiderata of Bayarri et al. (2012). The general methodology was illustrated in Section 3.3, and in our opinion it represents a major conceptual innovation which deserves to be carefully considered.
We still see some outstanding difficulties: i) Non-nested models. The method is currently predicated on the comparison between two nested models. This of course is not a major drawback if one can find a null model which is nested into every other model under consideration, as we mentioned in Section 3.3. However, when this is not the case, the problem remains open, unless some other forms of encompassing are implemented. Notice that the comparison of non-nested models is also problematic for other more specific approaches, such as the intrinsic, or the EP, prior. ii) Scope. The implementation of the methodology within normal linear regression models represents a major accomplishment; yet it remains to be seen whether the general idea can be extended to other substantive statistical settings.
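Returning to the computational point made above, namely that evaluating a prior can matter more than knowing its closed form, the following toy sketch evaluates an unnormalized Jeffreys prior pointwise by estimating the Fisher information, an expected value, through Monte Carlo. The Poisson model is chosen only because the exact answer, $\theta^{-1/2}$, is available for comparison; the function names are hypothetical and the code is not the algorithm of any of the papers cited.

```python
import numpy as np

def poisson_score(x, theta):
    """Derivative in theta of the Poisson log-likelihood log f(x | theta)."""
    return x / theta - 1.0

def jeffreys_prior_mc(theta, draws=200_000, seed=0):
    """Unnormalized Jeffreys prior sqrt(I(theta)), where the Fisher information
    I(theta) = E[score^2] is estimated by simulating data from the model."""
    rng = np.random.default_rng(seed)
    x = rng.poisson(theta, size=draws)
    fisher_info = np.mean(poisson_score(x, theta) ** 2)
    return np.sqrt(fisher_info)

# Compare the simulated evaluation with the known closed form theta^(-1/2).
for theta in (0.5, 2.0, 10.0):
    print(theta, jeffreys_prior_mc(theta), theta ** -0.5)
```

Within a simulation-based analysis, an evaluation of this kind would be called at each parameter value where the prior is needed, at the cost of additional Monte Carlo noise that has to be kept under control.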