An Assessment of the Effects of Prior Distributions on the Bayesian Predictive Inference

Predictive inference is one of the oldest methods of statistical analysis and is based on observable data. Prior information plays an important role in the Bayesian methodology, and researchers in this field often exercise subjective judgement when adopting a noninformative prior. This study tests the effects of a range of prior distributions on Bayesian predictive inference for different modelling situations, such as linear regression models under normal and Student-t errors. Findings reveal that different choices of prior not only provide different prediction distributions of the future response(s) but also change the location and/or the scale or shape parameters of the prediction distributions.


Introduction
The posterior distribution for the parameters of a set of observations is typically the major objective of a Bayesian statistical analysis. A posterior distribution also implies a marginal density, known as the prediction distribution, of any future observations from a model. Predictive inference is one of the oldest and most useful methods of statistical inference. A prediction density of future response(s) can be derived from various statistical models. In general, predictive inference uses the observed responses from a performed experiment to make inferences about the behavior of the unobserved future response(s) of a future experiment (Aitchison & Dunsmore, 1975). Details of predictive inference methods and applications of prediction densities can be found elsewhere (Geisser, 1993; Rahman, 2008; Rahman et al., 2010; Rahman & Upadhyay, 2015; Rahman & Harding, 2016).
Various methods can lead to a prediction density, and different researchers have considered different approaches to prediction problems. General prediction problems have been discussed by Jeffreys (1961), Aitchison and Sculthorpe (1965), and Faulkenberry (1973). Goldberger (1962), Wilson (1967) and Hahn (1972) studied prediction problems by the classical method. Fraser and Haq (1970) obtained the prediction distribution by using the structural density function, and later Haq (1982) developed the structural relations approach. Most of these authors contributed to prediction problems under the assumption of independent normal errors. Some researchers have discussed prediction problems and their applications in many areas from the Bayesian viewpoint (Zellner & Chetty, 1965; Aitchison & Dunsmore, 1975; Zellner, 1976; Sutradhar & Ali, 1989; Geisser, 1993; Rahman, 2008). Unlike others, Rahman (2009, 2011) obtained prediction distributions using the Bayesian method for a range of statistical models under the t errors assumption. In many practical situations where the underlying distributions have heavier tails, models with Student-t errors are most appropriate.
A prior distribution of the unknown parameters is an essential component of Bayesian predictive inference. The prior distribution describes the researcher's subjective belief about the unknown parameter(s) before observing the given data. When the distributions of the data and the prior information differ significantly from each other, an inferential conflict exists between the sources of information. For example, the posterior distribution may be strongly affected by the prior information for models under the normal distribution assumption (O'Hagan & Forster, 2004). Some conflicts may be caused by the data (for instance, outliers) or by the prior knowledge. In such situations, if the model has a light-tailed distribution, the conflict may strongly influence the posterior distribution and potentially lead to inappropriate statistical inferences (Andrade & O'Hagan, 2006). A number of researchers suggest using models under heavy-tailed distributions such as Student-t distributions (e.g., see Dawid, 1973; O'Hagan, 1979; Fernandez et al., 1995; Le & O'Hagan, 1998; Haro-Lopez & Smith, 1999).
In Bayesian predictive inference, an important issue is how to define the prior distribution and examine its effect on statistical inference. In a model where both the prior and posterior distributions come from the same family of distributions, the prior distribution is considered conjugate (Seber & Lee, 2003). That is, a conjugate prior is one for which applying Bayes' technique results in a posterior from the same family of distributions as the prior. On the other hand, if no prior information is available for the unknown parameters, researchers want a prior distribution with minimal influence on the inference. Such a prior is called a noninformative prior or uniform prior (Bernardo & Smith, 1994). Kass and Wasserman (1996) stated two different interpretations of noninformative priors: i) noninformative priors are formal representations of ignorance; ii) there is no objective, unique prior that represents ignorance; instead, noninformative priors are chosen by agreement. In the second interpretation, noninformative priors are the default to use when there is insufficient information to otherwise define the prior. However, nowadays the first interpretation is regarded as largely untenable, so the focus is on considering different priors to see whether any is preferable in some sense. Box and Tiao (1973) define a noninformative prior as one which provides little information relative to the experiment. Pericchi and Walley (1991) have a quite different view: they argue that no single probability distribution can model ignorance satisfactorily, and that large classes of distributions are therefore needed. They use the first interpretation of Kass and Wasserman (1996) but realise that a single distribution is not enough. Noninformative priors are accordingly also classified as vague priors, flat priors, reference priors, etc. (e.g., see Berger & Bernardo, 1989).
The common form of noninformative prior is also known as the Jeffreys prior (Jeffreys, 1946), which is based on the Fisher information criterion. For a model with parameter space Θ ⊆ R, the Fisher information is

$$I(\theta) = E\left[\left(\frac{\partial}{\partial\theta}\log f(x\mid\theta)\right)^{2}\right],$$

where f(x|θ) is the sampling distribution and the expectation is taken over f(x|θ). Under regularity conditions,

$$I(\theta) = -E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(x\mid\theta)\right].$$

In such a setting, the Jeffreys prior for θ is defined by

$$p(\theta) \propto \sqrt{I(\theta)},$$

that is, proportional to the square root of the Fisher information at θ. In general the Jeffreys prior may be improper. If θ has the Jeffreys prior and h is a monotone differentiable function of θ, the prior induced on h(θ) by the Jeffreys prior on θ is p(h) ∝ √I(h), so the Jeffreys prior is invariant under reparameterization. Since the prior favours values of θ for which I(θ) is large, the effect is to minimise the influence of the prior relative to the information in the data. In this sense the Jeffreys prior can be regarded as attempting to be noninformative about θ.
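As a small illustration of this construction (not part of the original derivation), the following Python sketch uses SymPy to recover the Jeffreys prior for a Bernoulli parameter, where the closed form I(θ) = 1/{θ(1 − θ)} is well known; the model choice here is an assumption made purely for demonstration.

```python
# A minimal sketch, assuming a Bernoulli model f(x|theta) = theta^x (1-theta)^(1-x):
# derive the Fisher information symbolically and form the Jeffreys prior.
import sympy as sp

theta, x = sp.symbols('theta x', positive=True)
log_lik = x * sp.log(theta) + (1 - x) * sp.log(1 - theta)  # log f(x|theta)

# Fisher information: I(theta) = -E[d^2/dtheta^2 log f(x|theta)];
# the expectation is taken by substituting E[x] = theta.
d2 = sp.diff(log_lik, theta, 2)
fisher = sp.simplify(-d2.subs(x, theta))
print(fisher)                 # a form equivalent to 1/(theta*(1 - theta))

# Jeffreys prior is proportional to sqrt(I(theta)): here a Beta(1/2, 1/2)
jeffreys = sp.sqrt(fisher)
print(sp.simplify(jeffreys))  # proportional to theta^(-1/2) * (1-theta)^(-1/2)
```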
This study examines the effects of a range of prior distributions on predictive inference for different modelling situations under normal and Student-t errors. It shows that different choices of prior not only provide different prediction distributions but also change the location and/or the scale or shape parameters of the prediction distributions. Ultimately these also affect measures such as Bayesian credible intervals for the predictive inference.
The remainder of the paper is organised as follows. In Section 2, the prediction densities of a future response are derived using the Bayesian approach under different situations of conjugate prior distributions. In Section 3, the prediction distribution is obtained under non-conjugate prior information. In Section 4, the prediction densities of a single and of a set of unobserved responses are produced for the simple linear regression model with normal errors and with Student-t errors. Finally, a summary of the significant results and concluding remarks are provided in Section 5.

Prediction Under Conjugate Prior
The probability function of the prior information is called the prior distribution, which illustrates the subjective belief of a researcher about the unknown parameter(s) before observing the data. In Bayesian inference, when both the prior and posterior distributions come from the same family of distributions, the prior distribution is regarded as a conjugate prior. Let f(x|θ) be the probability model of the observed data x given the unknown parameter(s) θ, and hence the likelihood function for θ; let g(θ) be the prior distribution for θ and f(θ|x) the posterior distribution for θ given x. If g(θ) and f(θ|x) both belong to the same family of distributions then they are called conjugate distributions, and g(θ) is the conjugate prior for f(x|θ). The term conjugate prior and its application appeared in a study by Raiffa and Schlaifer (1961). Before them, essentially the same theory was developed by Barnard (1954), who considered the distributions involved as being closed under sampling.
Where they exist, conjugate priors are often used for their computational convenience (Seber & Lee, 2003), and they provide the most transparent view of the relationship between the prior and the posterior. This section analyses the effects of a conjugate prior on predictive inference under various modelling situations.
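As a quick illustration of conjugate updating (not from the paper itself), the following Python sketch uses the standard Beta-Bernoulli pair: a Beta(a, b) prior on a success probability remains Beta after observing k successes in n trials. All numerical values are assumptions chosen for the example.

```python
# A minimal sketch, assuming a Beta(a, b) prior and Bernoulli data:
# the posterior after k successes in n trials is Beta(a + k, b + n - k).
from scipy import stats

a, b = 2.0, 2.0          # hypothetical prior hyperparameters
n, k = 20, 14            # hypothetical data: 14 successes in 20 trials

posterior = stats.beta(a + k, b + n - k)
print(posterior.mean())          # posterior mean (a + k)/(a + b + n) = 0.666...
print(posterior.interval(0.95))  # a central 95% credible interval
```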

Unequal but Known Variances
Let $Y_j|\mu \sim N(\mu, \sigma_1^2)$ for $j = 1, 2, \ldots, n$; $\mu \sim N(\mu_0, \sigma_0^2)$; and let $\sigma_1$, $\mu_0$ and $\sigma_0$ be known. That is,

$$p(y|\mu, \sigma_1) \propto \sigma_1^{-n} \exp\Big\{-\frac{1}{2\sigma_1^2}\sum_{j=1}^{n}(y_j-\mu)^2\Big\}$$

is the likelihood function for a set of observations $y = (y_1, y_2, \ldots, y_n)'$ and

$$g(\mu|\mu_0, \sigma_0) \propto \exp\Big\{-\frac{1}{2\sigma_0^2}(\mu-\mu_0)^2\Big\}$$

is a prior density function of the parameter µ. The posterior density function of µ given y can be obtained from the relationship

$$p(\mu|y) \propto p(y|\mu, \sigma_1)\, g(\mu|\mu_0, \sigma_0). \qquad (1)$$

The enclosed exponential term $\xi = \frac{1}{\sigma_1^2}\sum_{j=1}^{n}(y_j-\mu)^2 + \frac{1}{\sigma_0^2}(\mu-\mu_0)^2$ (say) can, by completing the square in µ, be expressed in the more convenient form

$$\xi = \frac{(\mu - m')^2}{s'^2} + \Psi(n, \sigma_0, \sigma_1),$$

where the term $\Psi(n, \sigma_0, \sigma_1)$ is free of µ and is absorbed into the normalizing constant. Let $m'$ and $s'^2$ be the mean and variance of the posterior distribution of µ. Then

$$m' = \frac{\mu_0/\sigma_0^2 + n\bar{y}/\sigma_1^2}{1/\sigma_0^2 + n/\sigma_1^2} \quad \text{and} \quad s'^2 = \Big(\frac{1}{\sigma_0^2} + \frac{n}{\sigma_1^2}\Big)^{-1}.$$

The posterior mean $m'$ is thus the weighted average of the prior mean $\mu_0$ and $\bar{y}$, where the weights are the proportions of the posterior precision. In addition, the posterior variance satisfies $1/s'^2 = n/\sigma_1^2 + 1/\sigma_0^2$, i.e., the posterior precision equals the precision of $\bar{y}$ plus the prior precision.
Consider a future response $y_f$ from the same normal model with mean µ and known variance $\sigma_1^2$, i.e., $y_f|\mu \sim N(\mu, \sigma_1^2)$. Using the Bayesian method, the prediction distribution of $y_f$ for a given y can be derived as

$$p(y_f|y) = \int p(y_f|\mu)\, p(\mu|y)\, d\mu, \qquad (5)$$

where $p(y_f|\mu)$ and $p(\mu|y)$ are as given above. The combined exponent

$$\xi_f = \frac{(y_f-\mu)^2}{\sigma_1^2} + \frac{(\mu-m')^2}{s'^2}$$

has a convenient expression as the sum of a quadratic form Q in µ and a term which does not contain µ,

$$\xi_f = \Big(\frac{1}{\sigma_1^2} + \frac{1}{s'^2}\Big)(\mu - \mu^{*})^2 + \frac{(y_f - m')^2}{\sigma_1^2 + s'^2}, \qquad (6)$$

where $\mu^{*} = \big(y_f/\sigma_1^2 + m'/s'^2\big)\big/\big(1/\sigma_1^2 + 1/s'^2\big)$. Using (6) in (5) and then integrating over µ, the prediction density of $y_f$ can be written as

$$p(y_f|y) = \Psi_f \exp\Big\{-\frac{(y_f - m')^2}{2(\sigma_1^2 + s'^2)}\Big\},$$

where $\Psi_f = \{2\pi(\sigma_1^2 + s'^2)\}^{-1/2}$ is the normalizing constant.
Hence the prediction density of a future response $y_f$ follows a normal distribution with mean equal to the posterior mean $m'$ and with variance $\sigma_1^2 + s'^2$, which is greater than the posterior variance $s'^2$.
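As an illustrative numerical check (not part of the original derivation), the following Python sketch simulates from the posterior and the predictive under assumed values of the constants, and compares the Monte Carlo moments with the closed forms above.

```python
# A minimal sketch, assuming hypothetical values for mu0, sigma0, sigma1 and n:
# verify m', s'^2 and the predictive variance sigma1^2 + s'^2 by Monte Carlo.
import numpy as np

rng = np.random.default_rng(0)
mu0, sigma0, sigma1, n = 1.0, 2.0, 1.5, 25
y = rng.normal(0.7, sigma1, size=n)               # simulated observed sample
ybar = y.mean()

prec = 1/sigma0**2 + n/sigma1**2                  # posterior precision
s2_post = 1/prec
m_post = s2_post * (mu0/sigma0**2 + n*ybar/sigma1**2)

# Predictive draws: mu ~ posterior, then y_f | mu ~ N(mu, sigma1^2)
mu_draws = rng.normal(m_post, np.sqrt(s2_post), size=200_000)
yf_draws = rng.normal(mu_draws, sigma1)
print(yf_draws.mean(), m_post)                    # both near the posterior mean
print(yf_draws.var(), sigma1**2 + s2_post)        # predictive variance check
```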

Equal but Unknown Variances
This subsection considers a common unknown variance σ² for both the model and the prior distribution.
Consider a normal prior for the location parameter µ with known mean $\mu_0$ and the common unknown variance σ² from the model, such that $\mu|\sigma \sim N(\mu_0, \sigma^2)$, and

$$g(\mu, \sigma|\mu_0) \propto \sigma^{-1} e^{-\frac{1}{2\sigma^2}(\mu-\mu_0)^2}$$

is a joint prior density function of the parameters µ and σ. Thus, for $Y_j|\mu,\sigma \sim N(\mu,\sigma^2)$, the joint posterior density function of µ and σ can be expressed as

$$p(\mu, \sigma|y) \propto \sigma^{-(n+1)} \exp\Big\{-\frac{1}{2\sigma^2}\Big[\sum_{j=1}^{n}(y_j-\mu)^2 + (\mu-\mu_0)^2\Big]\Big\}. \qquad (9)$$

The normalizing constant of (9), denoted by Ψ(c), can be obtained by integrating the joint density function with respect to σ and µ. Completing the square in µ,

$$\sum_{j=1}^{n}(y_j-\mu)^2 + (\mu-\mu_0)^2 = (n+1)(\mu - m)^2 + S,$$

where $m = \frac{n\bar{y}+\mu_0}{n+1}$ and $S = \sum_{j=1}^{n} y_j^2 + \mu_0^2 - (n+1)m^2$. The integration of (9) with respect to σ yields the marginal posterior density function of µ as

$$p(\mu|y) \propto \big[S + (n+1)(\mu-m)^2\big]^{-n/2}.$$

Then the prediction distribution of $y_f$ for given y can be obtained from

$$p(y_f|y) = \int_{\sigma}\int_{\mu} p(y_f|\mu,\sigma)\, p(\mu,\sigma|y)\, d\mu\, d\sigma.$$

Using appropriate integrations over σ and µ we get the density function of the future response as

$$p(y_f|y) \propto \Big[S + \frac{n+1}{n+2}(y_f-m)^2\Big]^{-n/2}.$$

In this case, although the conditional posterior distribution of µ given σ is normal, the marginal posterior distribution of the location parameter µ and the prediction distribution of a future response $y_f$ both follow a Student-t distribution with $n-1$ degrees of freedom and the same location $\frac{n\bar{y}+\mu_0}{n+1}$, but with different squared scales, $\frac{S}{(n-1)(n+1)}$ and $\frac{S(n+2)}{(n-1)(n+1)}$ respectively.
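As a hedged Monte Carlo check of this result (not part of the original text; the data are simulated with assumed values), the following Python sketch draws from the joint posterior using its standard scale-mixture structure and compares the predictive with the Student-t form derived above.

```python
# A minimal sketch, assuming mu0 and simulated data y: under the joint posterior
# above, sigma^2 | y is S / chi^2_{n-1} and mu | sigma, y ~ N(m, sigma^2/(n+1)),
# so the predictive of y_f should be Student-t with n-1 df and location m.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu0, n = 0.5, 30
y = rng.normal(1.0, 2.0, size=n)                  # hypothetical observed data
ybar = y.mean()

m = (n * ybar + mu0) / (n + 1)                    # common location
S = np.sum(y**2) + mu0**2 - (n + 1) * m**2

sig2 = S / rng.chisquare(n - 1, size=200_000)     # sigma^2 from its marginal
mu = rng.normal(m, np.sqrt(sig2 / (n + 1)))       # mu | sigma, y
yf = rng.normal(mu, np.sqrt(sig2))                # y_f | mu, sigma

# Compare with the closed-form t_{n-1} implied by the derivation above
scale = np.sqrt(S * (n + 2) / ((n - 1) * (n + 1)))
print(np.mean(yf < m + scale), stats.t.cdf(1.0, df=n - 1))  # should agree
```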

Prediction under Nonconjugate Prior
This section deals with predictive inference for the normal model under noninformative prior information (Jeffreys, 1961).
Let $Y_j|\mu, \sigma \sim N(\mu, \sigma^2)$ for $j = 1, 2, \ldots, n$. A nonconjugate (noninformative) prior distribution of the parameters µ and σ is $p(\mu, \sigma) \propto \frac{1}{\sigma}$. Hence the joint posterior density function of the parameters is

$$p(\mu, \sigma|y) \propto \sigma^{-(n+1)} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(y_j-\mu)^2\Big\},$$

with normalizing constant, say $\Psi^{-1}(\cdot)$, obtained by integrating this density over σ and µ. Writing $\sum_{j=1}^{n}(y_j-\mu)^2 = (n-1)s^2 + n(\mu-\bar{y})^2$, integration over σ gives

$$p(\mu|y) \propto \big[(n-1)s^2 + n(\mu-\bar{y})^2\big]^{-n/2},$$

and a further integration over µ yields the normalizing constant. Combining $p(\mu, \sigma|y)$ and $p(y_f|\mu, \sigma)$, the prediction distribution of $y_f$ for given y can be obtained from

$$p(y_f|y) = \int_{\sigma}\int_{\mu} p(y_f|\mu, \sigma)\, p(\mu, \sigma|y)\, d\mu\, d\sigma.$$

Using appropriate integrations with respect to σ and µ we get the prediction density of the future response $y_f$ as

$$p(y_f|y) \propto \Big[(n-1)s^2 + \frac{n}{n+1}(y_f-\bar{y})^2\Big]^{-n/2},$$

where $\bar{y} = \frac{1}{n}\sum_{j=1}^{n} y_j$ and $s^2 = \frac{1}{n-1}\sum_{j=1}^{n}(y_j-\bar{y})^2$, the proportionality constant being the corresponding normalizing constant. Hence

$$\frac{y_f - \bar{y}}{s\sqrt{1 + 1/n}}$$

is a Student-t variate with $n-1$ degrees of freedom. That is, under the noninformative prior the prediction density of a future response $y_f$ has a Student-t distribution with $n-1$ degrees of freedom.
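To make this result concrete (an illustrative aside with assumed data, not from the paper), the following Python sketch computes a 95% Bayesian prediction interval for a single future response using the standardized t form above.

```python
# A minimal sketch, assuming simulated data: under p(mu, sigma) ∝ 1/sigma,
# (y_f - ybar) / (s * sqrt(1 + 1/n)) is Student-t with n-1 df, so a 95%
# prediction interval is ybar ± t_{0.975, n-1} * s * sqrt(1 + 1/n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 20
y = rng.normal(3.0, 1.7, size=n)          # hypothetical observed sample
ybar, s = y.mean(), y.std(ddof=1)

t_crit = stats.t.ppf(0.975, df=n - 1)
half = t_crit * s * np.sqrt(1 + 1/n)
print(ybar - half, ybar + half)           # 95% prediction interval for y_f
```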

Simple Linear Regression Model
The simple linear regression model relates a single explanatory variable (x) to a single response variable (y) linearly in the parameters. The model can be represented as $y_j = \beta_0 + \beta_1 x_j + e_j$ for $j = 1, 2, \ldots, n$, where $y_j$ is the jth observation on the response variable, $x_j$ is the jth observation on the explanatory variable, $e_j$ is the error term associated with the jth response, and $\beta_0$ and $\beta_1$ are the intercept and slope parameters respectively. This linear model can be expressed in the convenient form

$$y = X\beta + e, \qquad (14)$$

where $\beta = (\beta_0, \beta_1)'$ is a $2\times 1$ parameter vector; y and e are the $n\times 1$ response and error vectors respectively; and X is the $n\times 2$ design matrix of the explanatory variable.
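For readers who prefer a computational view (an illustrative aside with simulated data; all numerical values are assumptions), the matrix form (14) can be sketched in Python as follows.

```python
# A minimal sketch of y = X beta + e, with X the n x 2 design matrix whose
# rows are (1, x_j) and beta = (beta0, beta1)'.
import numpy as np

rng = np.random.default_rng(3)
n, beta0, beta1, sigma = 40, 2.0, 0.8, 1.0       # hypothetical true values
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])             # n x 2 design matrix
y = X @ np.array([beta0, beta1]) + rng.normal(0, sigma, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # least-squares (X'X)^{-1} X'y
print(beta_hat)
```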

Prediction under the Normal Distribution Error
Assume $e_j$ is independently and identically distributed as a normal variable with mean zero and variance σ² for $j = 1, 2, \ldots, n$, that is, $e_j|\sigma \sim N(0, \sigma^2)$. Then $e|\sigma \sim N(0, I_n\sigma^2)$. The likelihood function of the response vector y is

$$p(y|\beta, \sigma) \propto \sigma^{-n} \exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}.$$

Consider a noninformative prior distribution of the unknown parameters β and σ as $p(\beta, \sigma) \propto \frac{1}{\sigma}$. Thus the joint posterior density function of the parameters for given y can be expressed as

$$p(\beta, \sigma|y) \propto \sigma^{-(n+1)} \exp\Big\{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\Big\}, \qquad (15)$$

with normalizing constant, say $\Psi^{-1}(\cdot)$, obtained by integrating (15) over σ and β. Integration with respect to σ yields

$$p(\beta|y) \propto \big[(y - X\beta)'(y - X\beta)\big]^{-n/2} = \big[(n-2)s^2 + (\beta - \hat{\beta})'X'X(\beta - \hat{\beta})\big]^{-n/2}, \qquad (16)$$

where $\hat{\beta} = (X'X)^{-1}X'y$ and $s^2 = \frac{1}{n-2}(y - X\hat{\beta})'(y - X\hat{\beta})$. Using (16) in (15) and then integrating over β by means of the multivariate Student-t integral gives the normalizing constant of the joint posterior density function of the parameters.
Moreover, the future simple linear regression model can be expressed as

$$y_f = X_f\beta + e_f, \qquad (17)$$

where $\beta = (\beta_0, \beta_1)'$ is the $2\times 1$ parameter vector of the future response; $y_f$ and $e_f$ are the $1\times 1$ future response and error values respectively; and $X_f$ is the $1\times 2$ design matrix of the future model's explanatory variable. In the future model we also assume that $e_f$ is independently and identically distributed as a normal variable with mean zero and variance σ², that is, $e_f|\sigma \sim N(0, \sigma^2)$. Thus the pdf of $y_f$ given β and σ is

$$p(y_f|\beta, \sigma) \propto \sigma^{-1} \exp\Big\{-\frac{1}{2\sigma^2}(y_f - X_f\beta)^2\Big\}.$$

Now the prediction distribution of $y_f$ for given y, X and $X_f$ can be obtained from

$$p(y_f|y) = \int_{\sigma}\int_{\beta} p(y_f|\beta, \sigma)\, p(\beta, \sigma|y)\, d\beta\, d\sigma. \qquad (20)$$

Integrating with respect to σ provides the result

$$p(y_f|y) \propto \int_{\beta} \big[(y - X\beta)'(y - X\beta) + (y_f - X_f\beta)^2\big]^{-(n+1)/2} d\beta. \qquad (21)$$

Using (21) in (20) and then integrating over β, we obtain the following prediction density function of $y_f$:

$$p(y_f|y) \propto \Big[(n-2)s^2 + \frac{(y_f - X_f\hat{\beta})^2}{1 + X_f(X'X)^{-1}X_f'}\Big]^{-(n-1)/2},$$

where $\hat{\beta}$ and $s^2$ are as defined above and the proportionality constant is the corresponding normalizing constant. That is, the prediction distribution of a single future response is a Student-t distribution with $n-2$ degrees of freedom, location $X_f\hat{\beta}$ and squared scale $s^2\{1 + X_f(X'X)^{-1}X_f'\}$.
Moreover, following the same approach as above, the prediction density function of a set of future responses $y_f = (y_1, y_2, \ldots, y_{n_f})'$ for given y, X and $X_f$ can be obtained as

$$p(y_f|y) \propto \Big[(n-2)s^2 + (y_f - X_f\hat{\beta})'\{I_{n_f} + X_f(X'X)^{-1}X_f'\}^{-1}(y_f - X_f\hat{\beta})\Big]^{-(n+n_f-2)/2},$$

where $X_f$ is the $n_f\times 2$ design matrix of the future model, and $\hat{\beta}$ and $s^2$ are as defined above. This is an $n_f$-dimensional multivariate Student-t density with $n-2$ degrees of freedom.
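As a worked illustration of the single-response result (an aside with simulated data; the future design point is an assumption), the following Python sketch computes a 95% Bayesian prediction interval from the t form derived above.

```python
# A minimal sketch, assuming simulated regression data: the predictive of a
# single y_f is Student-t with n-2 df, location X_f beta_hat and squared
# scale s^2 * (1 + X_f (X'X)^{-1} X_f').
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n = 40
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([2.0, 0.8]) + rng.normal(0, 1.0, size=n)   # hypothetical data

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
s2 = resid @ resid / (n - 2)

Xf = np.array([1.0, 5.0])                        # future design point x_f = 5
loc = Xf @ beta_hat
scale = np.sqrt(s2 * (1 + Xf @ XtX_inv @ Xf))

t_crit = stats.t.ppf(0.975, df=n - 2)
print(loc - t_crit * scale, loc + t_crit * scale)  # 95% prediction interval
```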

Prediction under the Student-t Error
Assume that each of the n components of e in the simple linear regression model (14) is uncorrelated with, but not independent of, the others, and has the same univariate Student-t distribution with location 0, scale σ > 0 and ν degrees of freedom. Therefore the joint pdf of the n elements of e|σ is an n-dimensional multivariate Student-t distribution,

$$p(e|\sigma) \propto \sigma^{-n}\Big[1 + \frac{e'e}{\nu\sigma^2}\Big]^{-(\nu+n)/2}.$$
Here $E(e) = 0$ and $\mathrm{Cov}(e) = \frac{\nu\sigma^2}{\nu-2} I_n$ for ν > 2; thus the elements of e, and hence those of y, are uncorrelated but not independent. Therefore the probability density function of the realized vector y becomes

$$p(y|\beta, \sigma) \propto \sigma^{-n}\Big[1 + \frac{(y - X\beta)'(y - X\beta)}{\nu\sigma^2}\Big]^{-(\nu+n)/2}.$$

Let a single response from the future model in (17) follow the assumption that $e_f|\sigma$ has a univariate Student-t distribution with ν degrees of freedom, i.e., $e_f|\sigma \sim t_1(0, \sigma, \nu)$. The realized error e and the future error $e_f$ are combined to form an (n+1)-dimensional multivariate Student-t distribution with ν degrees of freedom. Accordingly, the joint density function of the combined responses y from the performed experiment and $y_f$ from the future experiment becomes

$$p(y, y_f|\beta, \sigma) \propto \sigma^{-(n+1)}\Big[1 + \frac{Q}{\nu\sigma^2}\Big]^{-(\nu+n+1)/2}, \qquad (25)$$

where

$$Q = (y - X\beta)'(y - X\beta) + (y_f - X_f\beta)^2 = Q_y + Q_{y_f}.$$

Let a noninformative joint prior distribution of the unknown parameters β and σ² be $p(\beta, \sigma^2) \propto \sigma^{-2}$. It is assumed that the degrees of freedom of the error distribution are given and that the elements of β and log σ² are independently and uniformly distributed. Combining the prior information with the joint density function in (25) by means of Bayes' theorem, we have the following joint posterior density of β and σ² for y and $y_f$:

$$p(\beta, \sigma^2|y, y_f) \propto (\sigma^2)^{-\frac{n+3}{2}}\Big[1 + \frac{Q}{\nu\sigma^2}\Big]^{-(\nu+n+1)/2}. \qquad (26)$$

The prediction distribution of a future response can be obtained by solving the integral

$$p(y_f|y) \propto \int_{\sigma^2}\int_{\beta} p(\beta, \sigma^2|y, y_f)\, d\beta\, d\sigma^2. \qquad (27)$$

That is, in this case we obtain the prediction distribution of the future response(s) from the joint posterior density of the unknown parameters for the combined responses generated from both the performed and the future models. Now the joint posterior density in (26) can be expressed in the following convenient form:

$$p(\beta, \sigma^2|y, y_f) \propto (\sigma^2)^{-\frac{n+3}{2}}\Big[\frac{Q + \nu\sigma^2}{\nu\sigma^2}\Big]^{-(\nu+n+1)/2}. \qquad (28)$$

Using the transformation $Q + \nu\sigma^2 = t^{-1}$, the Jacobian of the transformation is $|J| = \frac{1}{\nu t^2}$, with t ranging from 0 to $Q^{-1}$. Hence equation (28) can be written as

$$p(\beta, t|y, y_f) \propto t^{\frac{n-1}{2}}(1 - Qt)^{\frac{\nu-2}{2}}. \qquad (29)$$

If we put $z = Qt$, then the joint pdf in (29) becomes

$$p(\beta, z|y, y_f) \propto Q^{-\frac{n+1}{2}}\, z^{\frac{n-1}{2}}(1 - z)^{\frac{\nu-2}{2}}. \qquad (30)$$

Equation (30) confirms that z has a beta distribution, $z \sim B\big(\frac{n+1}{2}, \frac{\nu}{2}\big)$. After integrating with respect to z, the above equation reduces to

$$p(\beta|y, y_f) \propto Q^{-\frac{n+1}{2}}. \qquad (31)$$

Now $Q = Q_y + Q_{y_f}$ can be expressed as a convenient quadratic form in the parameter vector β,

$$Q = (\beta - \tilde{\beta})'A(\beta - \tilde{\beta}) + Q_0, \quad \text{with} \quad A = X'X + X_f'X_f \ \text{ and } \ \tilde{\beta} = A^{-1}(X'y + X_f'y_f),$$

where $Q_0$ does not involve β. It is noted that A is free of the unknown regression parameter vector β.
Using this convenient form of Q, the probability density function in (31) can be expressed as

$$p(\beta|y, y_f) \propto \big[Q_0 + (\beta - \tilde{\beta})'A(\beta - \tilde{\beta})\big]^{-\frac{n+1}{2}}.$$

The prediction density for $y_f$ can then be obtained by integrating the above equation with respect to the elements of β using the multivariate Student-t integral. Hence the prediction distribution of $y_f$ given a set of observed responses y is obtained in its ultimate form as

$$p(y_f|y) \propto Q_0^{-\frac{n-1}{2}} = \Big[(n-2)s^2 + \frac{(y_f - X_f\hat{\beta})^2}{1 + X_f(X'X)^{-1}X_f'}\Big]^{-(n-1)/2}, \qquad (33)$$

where $\hat{\beta} = (X'X)^{-1}X'y$, $s^2 = \frac{1}{n-2}(y - X\hat{\beta})'(y - X\hat{\beta})$, and the proportionality constant is the corresponding normalizing constant.
Therefore, under Student-t errors the prediction distribution of a future response has a univariate Student-t distribution with $n-2$ degrees of freedom, location $X_f(X'X)^{-1}X'y$ and variance $\frac{(n-2)s^2}{n-4}\{1 + X_f(X'X)^{-1}X_f'\}$ (for n > 4). It is also revealed that the degrees of freedom of the prediction distribution do not depend on the degrees-of-freedom parameter ν of the t-distribution of the error terms.
Furthermore, employing the same procedure, the prediction distribution of a set of future responses $y_f = (y_1, y_2, \ldots, y_{n_f})'$ from the simple linear regression model [i.e., $y_f = X_f\beta + e_f$, with $e_f|\sigma \sim t_{n_f}(0, \sigma, \nu)$] for the given y, X and $X_f$ can be obtained as

$$p(y_f|y) \propto \Big[(n-2)s^2 + (y_f - X_f\hat{\beta})'\{I_{n_f} + X_f(X'X)^{-1}X_f'\}^{-1}(y_f - X_f\hat{\beta})\Big]^{-(n+n_f-2)/2},$$

where $\hat{\beta}$ and $s^2$ are the same as in (33) and $X_f$ is the $n_f\times 2$ future design matrix. Therefore, under Student-t errors a set of future responses $y_f$ has the prediction density of an $n_f$-dimensional Student-t distribution with $n-2$ degrees of freedom. This indicates that the shape parameter of the prediction distribution depends on the size of the observed sample n and the dimension of the regression parameter vector β. The location and scale matrix of the prediction distribution are $X_f\hat{\beta}$ and $s^2\{I_{n_f} + X_f(X'X)^{-1}X_f'\}$ respectively. This result coincides with that of Zellner (1971) and Hahn (1972), who considered normal error terms and obtained the distribution by the classical method. Consistent results were also obtained for the normal errors regression model by using the structural distribution (Fraser & Haq, 1970) and the structural relations of the model (Fraser & Ng, 1980) approaches instead of the Bayesian approach. So the prediction density is unaffected by a departure from the model with independent normal errors to Student-t errors under the classical, structural distribution, structural relations and Bayesian methods.
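The robustness claim can be checked by simulation (an illustrative aside with an assumed toy setup, not from the paper): since the predictive does not depend on ν, the normal-theory prediction interval based on a t with n−2 df should keep its nominal coverage even under heavy-tailed, jointly multivariate-t errors.

```python
# A minimal sketch, assuming nu = 3 and a fixed design: realized and future
# errors share one chi-square mixing variable (the scale-mixture form of the
# multivariate t), yet the t_{n-2} prediction interval retains 95% coverage.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, nu, sigma = 30, 3, 1.0
beta = np.array([1.0, 0.5])                       # hypothetical true parameters
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])
Xf = np.array([1.0, 4.0])                         # future design point x_f = 4

XtX_inv = np.linalg.inv(X.T @ X)
lev = 1 + Xf @ XtX_inv @ Xf                       # 1 + X_f (X'X)^{-1} X_f'
t_crit = stats.t.ppf(0.975, df=n - 2)

hits, reps = 0, 5000
for _ in range(reps):
    w = rng.chisquare(nu) / nu                    # shared mixing variable
    e = rng.normal(0, sigma, size=n + 1) / np.sqrt(w)
    y, ef = X @ beta + e[:n], e[n]
    beta_hat = XtX_inv @ X.T @ y
    s2 = np.sum((y - X @ beta_hat) ** 2) / (n - 2)
    yf = Xf @ beta + ef
    hits += abs(yf - Xf @ beta_hat) <= t_crit * np.sqrt(s2 * lev)

print(hits / reps)                                # close to 0.95
```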

Conclusion
The posterior distribution for the parameters of a set of observations is typically the major objective of a Bayesian statistical analysis. A posterior distribution implies a marginal distribution, known as the prediction distribution, for the outcomes of any future sample observations from a model, and prediction distributions have many applications in statistical inference. In this paper, the effects of a range of conjugate and non-conjugate prior distributions on Bayesian predictive inference have been tested. The prediction distributions for future response(s), conditional on a set of observed responses, have also been derived for the simple linear regression model under both normal and Student-t errors. Findings reveal that if a conjugate prior is used for the normal error model with known but unequal variances, then the joint posterior distribution of the parameters, the marginal posterior distribution of the location parameter µ, and the prediction distribution of a future response all remain normal, with the prediction distribution centred at the posterior mean but with a larger variance.