On the Independence Jeffreys prior for skew-symmetric models with applications

We study the Jeffreys prior of the skewness parameter of a general class of scalar skew-symmetric models. It is shown that this prior is symmetric about 0, proper, and with tails $O(|\lambda|^{-3/2})$ under mild regularity conditions. We also calculate the independence Jeffreys prior for the case with unknown location and scale parameters. Sufficient conditions for the existence of the corresponding posterior distribution are investigated for the case when the sampling model belongs to the family of skew-symmetric scale mixtures of normal distributions. The usefulness of these results is illustrated using the skew-logistic model and two applications with real data.


Introduction
The need for modelling data presenting departures from symmetry has fostered the development of distributions that can capture skewness. A popular method to produce this sort of distribution consists of adding a parameter that controls skewness to a symmetric distribution. In this line, Azzalini (1985) proposed a transformation to produce an asymmetric normal density, termed skew-normal, as follows
$$\mathrm{sn}(y; \mu, \sigma, \lambda) = \frac{2}{\sigma}\,\phi\left(\frac{y-\mu}{\sigma}\right)\Phi\left(\lambda\,\frac{y-\mu}{\sigma}\right), \qquad (1)$$
where $y \in \mathbb{R}$, $\mu \in \mathbb{R}$, $\sigma \in \mathbb{R}_+$, $\lambda \in \mathbb{R}$, $\phi$ is the standard normal probability density function (PDF), and $\Phi$ is the standard normal cumulative distribution function (CDF). The parameter $\lambda$ is often interpreted as a skewness parameter given that the density (1) is asymmetric for $\lambda \neq 0$, and it reduces to the normal PDF for $\lambda = 0$. Subsequently, Wang et al. (2004) showed that, in particular, this method can be extended to any continuous symmetric density $f$, with support on $\mathbb{R}$ and mode at 0, through the transformation
$$\mathrm{ss}(y; \mu, \sigma, \lambda) = \frac{2}{\sigma}\,f\left(\frac{y-\mu}{\sigma}\right)\pi\left(\lambda\,\frac{y-\mu}{\sigma}\right), \qquad (2)$$
where $\pi$, termed skewing function, is a function that satisfies $0 \leq \pi(y) \leq 1$ and $\pi(-y) = 1 - \pi(y)$. It follows, then, that the CDF of any continuous distribution that is symmetric about 0 can be used as a skewing function. Several choices for $f$ and $\pi$ have been explored in the literature, such as the power exponential distribution with power $\delta \in \mathbb{R}_+$ (Azzalini, 1986), the Student-t distribution with $\nu \in \mathbb{R}_+$ degrees of freedom (Azzalini and Capitanio, 2003), the logistic distribution (Nadarajah, 2009), among others. Distributions obtained by means of this method are called skew-symmetric distributions. These distributions are widely used nowadays in several contexts such as binary regression (Bazán et al., 2010), meta-analysis (Guolo, 2012), data fitting (Branco et al., 2012), among many others.
* UNIVERSITY OF WARWICK, DEPARTMENT OF STATISTICS, COVENTRY, CV4 7AL, UK. E-mail: Francisco.Rubio@warwick.ac.uk
† MEMOTEF, SAPIENZA UNIVERSITÀ DI ROMA. E-mail: brunero.liseo@uniroma1.it
It has been found that several skew-symmetric models present inferential issues. For instance, Azzalini (1985) showed that the Fisher information matrix of the parameters (µ, σ, λ) is singular at λ = 0 for the skew-normal sampling model. In addition, the maximum likelihood estimator of the parameter λ can be infinite with positive probability. The cases with infinite estimators are more commonly found in small and moderate samples. These inferential issues are present in other skew-symmetric models (Hallin and Ley, 2012). Some authors have proposed the use of the Bayesian approach in order to avoid these inferential problems (Liseo and Loperfido, 2006; Branco et al., 2012). In Bayesian practice it is often of interest to employ noninformative priors given that they typically produce posterior inference with appealing frequentist properties. However, due to the singularity of the Fisher information matrix at λ = 0 of some skew-symmetric models, the use of the Jeffreys-rule prior, which is defined as the square root of the determinant of the Fisher information matrix, has been avoided in this kind of model. In addition, the calculation of this sort of prior is typically cumbersome.
Reference priors, which are another kind of noninformative priors, have been studied for the skew-normal and the skew Student-t models in Liseo and Loperfido (2006) and Branco et al. (2012). An alternative noninformative prior is the independence Jeffreys prior. This prior is constructed as the product of the Jeffreys priors for each parameter, while treating the remaining parameters as fixed.
In this paper, we study the independence Jeffreys prior associated with the class of skew-symmetric distributions obtained by using a CDF as a skewing function in (2). In Section 2, we analyse the Jeffreys prior of the skewness parameter λ in skew-symmetric models without location and scale parameters. We show that this prior is proper, symmetric about 0, and with tails $O(|\lambda|^{-3/2})$ under rather mild regularity conditions. Using these results, we construct the independence Jeffreys prior for the general model with location and scale parameters. In Section 3 we obtain easy-to-check sufficient conditions for the propriety of the posterior distribution when the sampling model f in (2) belongs to the family of scale mixtures of normal distributions. The case of samples containing censored observations is covered as well. In Section 4, we present the use of these results on the skew-logistic distribution. In Section 5, we illustrate the use of these models in the context of binary regression and stress-strength models. We conclude with some discussion and extensions of this work in Section 6.
2 Independence Jeffreys prior for univariate skew-symmetric models
Throughout we focus on the study of skew-symmetric models of the type
$$s(y; \mu, \sigma, \lambda) = \frac{2}{\sigma}\,f\left(\frac{y-\mu}{\sigma}\right)G\left(\lambda\,\frac{y-\mu}{\sigma}\right), \qquad (3)$$
where f is a continuous symmetric density function with support on R, and G is a CDF with continuous symmetric density g with support on R. This structure covers many cases of practical interest such as the skew-normal distribution (Azzalini, 1985), the skew-t distribution (Azzalini and Capitanio, 2003), the skew-logistic distribution (Nadarajah, 2009), among many others.
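As a rough numerical illustration of this model class, the sketch below builds a skew-symmetric density from a symmetric kernel f and a symmetric CDF G, here taken to be the standard normal pair (the skew-normal case). The function names, the parameter values, and the integration grid are illustrative assumptions and are not part of the original text.

```python
import math


def norm_pdf(z):
    """Standard normal density, used here as the symmetric kernel f."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF, used here as the skewing function G."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def skew_symmetric_pdf(y, mu, sigma, lam, f=norm_pdf, G=norm_cdf):
    """Density (2 / sigma) * f(z) * G(lam * z) with z = (y - mu) / sigma."""
    z = (y - mu) / sigma
    return 2.0 / sigma * f(z) * G(lam * z)


def trapezoid(func, a, b, n=20000):
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (func(a) + func(b))
    for i in range(1, n):
        total += func(a + i * h)
    return total * h


# Hypothetical parameter values: the density should integrate to one
# whatever the value of lam.
total_mass = trapezoid(lambda y: skew_symmetric_pdf(y, 1.0, 2.0, 3.0), -30.0, 30.0)
print(total_mass)
```

Passing a different symmetric pair through the `f` and `G` arguments (for instance the logistic PDF and CDF) gives other members of the family; setting `lam = 0` recovers the symmetric density f itself.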
The Jeffreys prior of the parameter λ is defined, up to a proportionality constant, as the square root of the Fisher information I(λ), that is, $\pi(\lambda) \propto \sqrt{I(\lambda)}$. The following result characterises the cases where this prior is well-defined at λ = 0.
Proof. See appendix.
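For reference, the Fisher information of λ can be written out explicitly. The following is a sketch for the model without location and scale parameters, assuming the density $2 f(x) G(\lambda x)$ and that differentiation under the integral sign is permitted; it is a standard calculation and not a statement taken from the text.

```latex
\begin{align*}
\frac{\partial}{\partial\lambda}\log\{2 f(x)\,G(\lambda x)\}
  &= \frac{x\, g(\lambda x)}{G(\lambda x)}, \\
I(\lambda)
  &= \int_{-\infty}^{\infty} \frac{x^{2} g(\lambda x)^{2}}{G(\lambda x)^{2}}\; 2 f(x)\,G(\lambda x)\,dx
   = 2\int_{-\infty}^{\infty} x^{2} f(x)\,\frac{g(\lambda x)^{2}}{G(\lambda x)}\,dx, \\
I(0)
  &= 4\, g(0)^{2} \int_{-\infty}^{\infty} x^{2} f(x)\,dx ,
\end{align*}
```

where the last line uses $G(0) = 1/2$. Provided $0 < g(0) < \infty$, this suggests that I(0) is finite exactly when f has a finite second moment; for the skew-normal model it gives $I(0) = 2/\pi$.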
Particular cases of Remark 1 have already been reported in the literature. For instance, Branco et al. (2012) report the presence of a pole at λ = 0 in the Jeffreys prior of λ for the skew Student-t model with ν ≤ 2 degrees of freedom. Remark 1 shows that this feature is present in many other skew-symmetric models, and that this sort of singularity is linked to the existence of the moments of the underlying symmetric density f. Liseo and Loperfido (2006) and Branco et al. (2012) show that the Jeffreys priors of the parameter λ, for the cases where f and g are normal or Student-t distributions, are proper, decreasing in |λ|, and with tails $O(|\lambda|^{-3/2})$. Their proofs rely upon basic properties of these models, which suggests that there may be other models that lead to a Jeffreys prior of λ with the same properties under some reasonable regularity conditions. In order to establish this result, we introduce the following set of sufficient conditions.
Condition S Let f and g be continuous density functions with support on R that satisfy the following conditions:
(i) f and g are symmetric about 0.
(ii) f is unimodal and there exists a finite constant M such that 0 < f(x) < M, for all x ∈ R.
Conditions S.i-S.ii include models of practical interest, such as the normal distribution, the Student-t distribution, the exponential power distribution, the logistic distribution, among many others. Condition S.iii is simply used to restrict ourselves to those cases where the Jeffreys prior of λ exists. Theorem 1 provides conditions for the finiteness of I(0). For λ ≠ 0, however, the finiteness of I(λ) may require a case-by-case analysis (for a more detailed study of this point we refer the reader to Hallin and Ley, 2012). The following theorem shows that the results in Liseo and Loperfido (2006) and Branco et al. (2012) can be extended to the family of distributions that satisfies Condition S.
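Theorem 1 Let f and g be density functions that satisfy Condition S. Then, the Jeffreys prior of λ associated to model (3) with (µ, σ) = (0, 1) satisfies the following:
(i) The Jeffreys prior of λ is given by
$$\pi(\lambda) \propto \sqrt{\int_{-\infty}^{\infty} x^{2} f(x)\,\frac{g(\lambda x)^{2}}{G(\lambda x)}\,dx}. \qquad (4)$$
(iii) The tails of π(λ) are of order $O(|\lambda|^{-3/2})$.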

Proof. See appendix.
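The symmetry and tail order stated in Theorem 1 can be checked numerically in a particular case. The sketch below evaluates the Fisher information of λ for the skew-normal model with (µ, σ) = (0, 1), using the equivalent symmetric form $I(\lambda) = 2\int_0^\infty x^2 f(x)\, g(\lambda x)^2 / \{G(\lambda x)[1 - G(\lambda x)]\}\,dx$, which follows from folding the negative half-line onto the positive one. The truncation point, the grid size, and the function names are assumptions made for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)


def norm_pdf(z):
    """Standard normal density (plays the role of both f and g)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal CDF; erfc keeps accuracy in the lower tail."""
    return 0.5 * math.erfc(-z / SQRT2)


def fisher_info_lambda(lam, n=4000):
    """Trapezoidal approximation of I(lam) for the skew-normal model.

    The integration range [0, 12 / max(1, |lam|)] is an assumed truncation:
    beyond it either f(x) or g(lam * x) is numerically negligible.
    """
    upper = 12.0 / max(1.0, abs(lam))
    h = upper / n
    total = 0.0
    for i in range(1, n + 1):  # the integrand vanishes at x = 0
        x = i * h
        t = lam * x
        kernel = norm_pdf(t) ** 2 / (norm_cdf(t) * norm_cdf(-t))
        weight = 0.5 if i == n else 1.0
        total += weight * x * x * norm_pdf(x) * kernel
    return 2.0 * total * h


def jeffreys_unnormalised(lam):
    """Unnormalised Jeffreys prior, the square root of I(lam)."""
    return math.sqrt(fisher_info_lambda(lam))


# Doubling a large |lam| should divide I(lam) by roughly 2**3,
# i.e. divide the prior by roughly 2**1.5.
ratio = fisher_info_lambda(40.0) / fisher_info_lambda(80.0)
print(ratio)
```

Under these assumptions the ratio should come out close to 8, in line with a prior that decays like $|\lambda|^{-3/2}$, and `fisher_info_lambda(0.0)` should be close to 2/π.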
Based on the tail behaviour, symmetry, and properness of the Jeffreys prior of λ shown for the skew-normal model in Liseo and Loperfido (2006), Bayes and Branco (2007) proposed an approximation to this prior using a Student-t distribution with ν = 1/2 degrees of freedom and an empirical choice for the scale parameter (π/2). Branco et al. (2012) also proposed a similar approximation for the Jeffreys prior of λ of the skew Student-t model. Theorem 1 shows that this approximation might be reasonable in other cases as well. However, the quality of this approximation and the choice for the scale parameter seem to require a case by case analysis.
In Section 4 we show that this approximation is reasonable for a skew-logistic sampling model.
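To make the Student-t approximation concrete, the following sketch evaluates a Student-t density with ν = 1/2 degrees of freedom and scale π/2 (the values quoted above for the skew-normal case) and checks that its tails decay at the rate $|\lambda|^{-3/2}$. The function name and the evaluation points are illustrative assumptions; the sketch does not assess how close the approximation is to the exact prior.

```python
import math


def student_t_pdf(lam, nu=0.5, scale=math.pi / 2.0):
    """Student-t density with nu degrees of freedom, location 0 and given scale."""
    log_const = (
        math.lgamma(0.5 * (nu + 1.0))
        - math.lgamma(0.5 * nu)
        - 0.5 * math.log(nu * math.pi)
        - math.log(scale)
    )
    z = lam / scale
    return math.exp(log_const - 0.5 * (nu + 1.0) * math.log1p(z * z / nu))


# The density behaves like |lam|**(-(nu + 1)) in the tails, so for nu = 1/2
# doubling a large |lam| should divide it by about 2**1.5.
tail_ratio = student_t_pdf(100.0) / student_t_pdf(200.0)
print(tail_ratio, 2.0 ** 1.5)
```

The choice ν = 1/2 is what matches the tail order of the Jeffreys prior; the scale only affects the behaviour near the origin, which is why it has to be tuned model by model.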
Condition S.iii can be relaxed to those cases where I(λ) < ∞ for all λ ≠ 0, possibly leading to an undefined Jeffreys prior at λ = 0 such as those models studied in Branco et al. (2012). The results (ii)-(iii) in Theorem 1 are valid under these relaxed assumptions given that they can be proved using essentially the same technique. The results in Branco et al. (2012) also suggest that it is possible to obtain a proper Jeffreys prior π(λ) for some sampling models despite the singularity of the Fisher information at λ = 0. However, the use of priors containing singularities might be less appealing to practitioners.
We now study the independence Jeffreys prior associated to the skew-symmetric model (3) including location and scale parameters. In the next section we also show that this prior leads to a proper posterior distribution under mild conditions.

Theorem 2 The independence Jeffreys prior of (µ, σ, λ) corresponding to a skew-symmetric model (3) that satisfies Condition S is given by
$$\pi(\mu, \sigma, \lambda) \propto \frac{1}{\sigma}\,\pi(\lambda), \qquad (5)$$
where π(λ) is the function defined in (4).

Existence of the posterior
In this section, we provide sufficient conditions for the existence of the posterior distribution under the use of the priors studied in the previous section.
Proof. The result follows by the properness of (4) under Condition S. Liseo and Loperfido (2006) show that (4) is proper for the skew-normal sampling model (Azzalini, 1985); and Branco et al. (2012) show that this is also the case for the prior (4) associated to a skew-symmetric Student-t sampling model (Azzalini and Capitanio, 2003). In Section 4 we show that the prior (4) associated to a skew-logistic sampling model (Nadarajah, 2009) is also proper.
For the general model (3), with unknown location and scale parameters, the independence Jeffreys prior (5)

Proof. See appendix.
Since the skew-symmetric distributions of interest are continuous, it follows that the probability of obtaining repeated observations is zero. This implies that we can conduct valid Bayesian inference based on this prior whenever n ≥ 2 for almost any sample. In the Appendix we show that the proof of the propriety of the posterior distribution of (µ, σ, λ), under the assumptions in Theorem 3, can be reduced to proving the propriety of the posterior distribution in the symmetric case, that is, the case where y is an i.i.d. sample from a scale mixture of normals f with location and scale parameters (µ, σ) and the prior structure is $\pi(\mu, \sigma) \propto \sigma^{-1}$. The propriety of the posterior distribution under the latter assumptions is studied in Fernández and Steel (1998), who also show that the presence of repeated observations in the sample may destroy the existence of the posterior distribution for some scale mixture of normal sampling models. They also present sufficient conditions for the propriety of the posterior distribution in cases when the sample contains repeated observations. We refer the reader to Fernández and Steel (1998) for further details on this.
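As an illustration of how this prior structure enters a computation, the sketch below evaluates an unnormalised log-posterior for a skew-normal sampling model under a prior of the form $\sigma^{-1}\pi(\lambda)$. Two assumptions are made for convenience: π(λ) is replaced by the Student-t kernel with ν = 1/2 and scale π/2 discussed in Section 2, and the function names are hypothetical. It is a sketch of the target that a sampler would work with, not an implementation used by the authors.

```python
import math

LOG_2PI = math.log(2.0 * math.pi)
NEG_INF = float("-inf")


def log_norm_cdf(t):
    """log Phi(t), returning -inf if Phi(t) underflows to zero."""
    p = 0.5 * math.erfc(-t / math.sqrt(2.0))
    return math.log(p) if p > 0.0 else NEG_INF


def log_prior(mu, sigma, lam, nu=0.5, scale=math.pi / 2.0):
    """log of sigma**-1 * pi(lam), with pi(lam) an assumed Student-t kernel."""
    if sigma <= 0.0:
        return NEG_INF
    return -math.log(sigma) - 0.5 * (nu + 1.0) * math.log1p((lam / scale) ** 2 / nu)


def log_likelihood(mu, sigma, lam, data):
    """Skew-normal log-likelihood for an i.i.d. sample."""
    total = 0.0
    for y in data:
        z = (y - mu) / sigma
        total += (
            math.log(2.0) - math.log(sigma) - 0.5 * LOG_2PI
            - 0.5 * z * z + log_norm_cdf(lam * z)
        )
    return total


def log_posterior(mu, sigma, lam, data):
    """Unnormalised log-posterior of (mu, sigma, lam)."""
    lp = log_prior(mu, sigma, lam)
    if lp == NEG_INF:
        return lp
    return lp + log_likelihood(mu, sigma, lam, data)
```

A function of this kind can be passed to any general-purpose sampler or optimiser; the propriety results above are what justify treating its exponential as a density when the sample has at least two distinct observations.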
Another scenario of interest is when the observations are recorded as sets of positive probability due to some kind of censoring mechanism, that is, when the collected sample consists of sets $S_1, \dots, S_n$ with $P(y_i \in S_i) > 0$, $i = 1, \dots, n$. This framework clearly covers all kinds of interval censoring. The following result shows that the independence Jeffreys prior (5) produces a proper posterior distribution in this case as well.

Proof. See appendix.
This result implies that the posterior distribution of (µ, σ, λ) exists whenever the sample of set observations contains at least two observations that do not overlap.
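For interval-censored data, each observation contributes the probability of its interval to the likelihood. The sketch below computes such contributions for a skew-normal sampling model by numerical integration of the density; the use of Simpson's rule, the grid size, and the function names are assumptions made for illustration.

```python
import math


def skew_normal_pdf(y, mu, sigma, lam):
    """Skew-normal density (2 / sigma) * phi(z) * Phi(lam * z)."""
    z = (y - mu) / sigma
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * math.erfc(-lam * z / math.sqrt(2.0))
    return 2.0 / sigma * phi * big_phi


def interval_probability(a, b, mu, sigma, lam, n=2000):
    """P(a < y <= b) by composite Simpson's rule (n must be even)."""
    h = (b - a) / n
    total = skew_normal_pdf(a, mu, sigma, lam) + skew_normal_pdf(b, mu, sigma, lam)
    for i in range(1, n):
        coeff = 4.0 if i % 2 else 2.0
        total += coeff * skew_normal_pdf(a + i * h, mu, sigma, lam)
    return total * h / 3.0


def censored_log_likelihood(intervals, mu, sigma, lam):
    """Log-likelihood of a sample recorded as finite intervals (a_i, b_i]."""
    return sum(
        math.log(interval_probability(a, b, mu, sigma, lam)) for a, b in intervals
    )
```

Adding the log of the prior $\sigma^{-1}\pi(\lambda)$ to `censored_log_likelihood` gives an unnormalised log-posterior for set observations, which by the result above is proper when at least two of the intervals do not overlap.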
4 Skew-Logistic model
Nadarajah (2009) showed that an interesting member of the skew-symmetric family (3) is the skew-logistic distribution, obtained by using the logistic PDF and CDF,
$$f(t) = \frac{e^{-t}}{(1 + e^{-t})^{2}} \quad \text{and} \quad G(t) = \frac{1}{1 + e^{-t}}.$$
The skew-logistic density can be written in closed form, after some algebra, as follows
$$\mathrm{sl}(y; \mu, \sigma, \lambda) = \frac{1}{4\sigma}\,\mathrm{sech}^{2}\left(\frac{y-\mu}{2\sigma}\right)\left[1 + \tanh\left(\lambda\,\frac{y-\mu}{2\sigma}\right)\right], \qquad (6)$$
where tanh(·) and sech(·) represent the hyperbolic tangent and the hyperbolic secant functions, respectively. For this sampling model, the Jeffreys prior (4) can be written as
$$\pi(\lambda) \propto \sqrt{\int_{0}^{\infty} x^{2}\,\mathrm{sech}^{2}\left(\frac{x}{2}\right)\mathrm{sech}^{2}\left(\frac{\lambda x}{2}\right)dx}. \qquad (7)$$
It is easy to check that the logistic density satisfies Condition S and therefore (7) is proper, as a consequence of Theorem 1. The tail behaviour, symmetry, and properness shown in this result suggest the use of a Student-t approximation, such as the one proposed in Bayes and Branco (2007) for the skew-normal model. Empirically, we have found that π(λ) can be reasonably well approximated by a Student-t distribution with 1/2 degrees of freedom and scale parameter 4/3. Figure 1 illustrates the quality of this approximation.
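The closed form of the skew-logistic density can be verified numerically against its definition as 2/σ times the logistic PDF times the logistic CDF. The sketch below does this at a few arbitrary points; the function names and the test values are illustrative assumptions.

```python
import math


def logistic_pdf(t):
    """Logistic density; written in terms of |t| to avoid overflow."""
    e = math.exp(-abs(t))
    return e / (1.0 + e) ** 2


def logistic_cdf(t):
    """Logistic CDF, evaluated in a numerically stable way."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def skew_logistic_from_definition(y, mu, sigma, lam):
    """(2 / sigma) * f(z) * G(lam * z) with logistic f and G."""
    z = (y - mu) / sigma
    return 2.0 / sigma * logistic_pdf(z) * logistic_cdf(lam * z)


def skew_logistic_closed_form(y, mu, sigma, lam):
    """sech^2(z / 2) * (1 + tanh(lam * z / 2)) / (4 * sigma)."""
    z = (y - mu) / sigma
    sech = 1.0 / math.cosh(0.5 * z)
    return sech * sech * (1.0 + math.tanh(0.5 * lam * z)) / (4.0 * sigma)


# Compare the two expressions at a few arbitrary (y, lam) pairs.
for y, lam in [(-2.0, 1.0), (0.5, -2.5), (3.0, 4.0)]:
    a = skew_logistic_from_definition(y, 0.2, 1.7, lam)
    b = skew_logistic_closed_form(y, 0.2, 1.7, lam)
    print(y, lam, a, b)
```

The agreement relies on the identities $e^{-z}/(1+e^{-z})^2 = \mathrm{sech}^2(z/2)/4$ and $1/(1+e^{-t}) = [1 + \tanh(t/2)]/2$, which also account for the factor $1/(4\sigma)$ in the closed form.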
For the general skew-logistic model with unknown location and scale parameters it follows that the posterior distribution of (µ, σ, λ) using the independence Jeffreys prior (5) is proper, given that the logistic distribution can be represented as a scale mixture of normals (Stefanski, 1991), under the conditions in Theorem 3. Consequently, the results in Theorem 4 also hold for a skew-logistic sampling model.

Applications
In this section we present two applications of the Bayesian models studied in the previous sections for skew-symmetric distributions using real data. In the first application we consider the use of two skew-symmetric distributions in the context of binary regression to produce a more robust model. Using the Jeffreys prior (4)

Binary regression
Binary and binomial observations are common in contexts such as biology, medicine, quality control, among others (see e.g. Collet, 1999 for a good survey of this). Generalised linear models are a useful tool for modelling this sort of observations given that the probability of observing y successes (failures) of a binomial random variable $Y \in \{0, 1, \dots, n\}$ can be related to a certain set of covariates $x = (1, x_1, \dots, x_k)^{\top}$, through the model where $\beta = (\beta_0, \dots, \beta_k)^{\top}$ is a vector of regression coefficients, $F(\cdot\,; \theta)$ is a univariate distribution function with shape parameter $\theta \in \Theta$, and $F^{-1}$ is called the link function. The most common links correspond to the cases where F is the logistic distribution (logit) or the standard normal distribution (probit), which are often referred to as the canonical links. It has been found that these links do not always provide a good fit (see e.g. Aranda-Ordaz, 1981), and also that link misspecification can affect the inference about the parameters (Czado and Santner, 1992).
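The role of the distribution function F can be illustrated with a small sketch of a binomial log-likelihood in which the success probability is F evaluated at the linear predictor. This assumes the usual binomial form with success probability $F(x^{\top}\beta)$; the function names and the data layout are hypothetical, and the logistic CDF is used only as a default choice of F.

```python
import math


def logistic_cdf(t):
    """Logistic CDF (logit link), evaluated in a numerically stable way."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def binomial_log_likelihood(beta, rows, F=logistic_cdf):
    """Binomial log-likelihood with success probability F(x' beta).

    rows is an iterable of (x, n, y): covariate vector (with leading 1),
    number of trials, and number of successes.
    """
    total = 0.0
    for x, n, y in rows:
        eta = sum(b * xi for b, xi in zip(beta, x))  # linear predictor
        p = F(eta)
        total += (
            math.log(math.comb(n, y))
            + y * math.log(p)
            + (n - y) * math.log(1.0 - p)
        )
    return total
```

Replacing the default `F` by a skew-symmetric CDF with a fixed or sampled skewness parameter gives an asymmetric link without changing the rest of the likelihood.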
An approach for constructing more robust links consists of using a flexible distribution as a link function. In this line we can find a number of references such as Copenhaver and Mielke (1977), Chen et al. (1999), Basu and Mukhopadhyay (2000), Kim et al. (2008), Bazán et al. (2010), and Wang and Dey (2010). In this application, we propose a hierarchical prior based on the Jeffreys prior (4) for the generalised linear model (8) with a skew-symmetric link as in (3), as described below.
We propose a hierarchical prior structure for the parameters of model (9), based on a modification of the Jeffreys prior presented in Chen et al. (2008). We adopt the hierarchical prior structure $\pi(\beta, \lambda) = \pi(\beta \mid \lambda)\,\pi(\lambda)$, where $\det X^{\top} W(\beta, \lambda) X$ denotes the determinant of the matrix $X^{\top} W(\beta, \lambda) X$, $W(\beta, \lambda) = \mathrm{diag}(w_1(\beta, \lambda), \dots, w_m(\beta, \lambda))$, and π(λ) is given by (4). Since the prior (11) is proper for any fixed λ, it follows that the hierarchical structure (10) is also proper if the conditions in Theorem 1 are satisfied, for example, under the use of the skew-normal or the skew-logistic link. In addition, for the skew-normal and the skew-logistic links we can employ the Student-t approximations of π(λ) described in Section 4 in order to facilitate the implementation of prior (10). The proposed prior also presents the following invariance property.

Proof. The result is a consequence of Theorem 5 from
As discussed by Chen et al. (2008), this result implies that $C_0(X)$ and $C(X)$ are scale-invariant with respect to the covariates, which is a desirable property in Bayesian modelling, particularly for conducting variable selection.
In order to illustrate the use of the proposed Bayesian model, we analyse the popular data set reported in Bliss (1935). The aim of this experiment was to model the response of confused flour beetles to gaseous carbon disulphide. Aranda-Ordaz (1981) mentioned that this data set presents asymmetric departures from the logistic model, hence the use of a skew-symmetric link seems appropriate. Figure 2 shows the predictive dose-response curves associated with four links: the skew-logistic link with the prior (10), the skew-normal link with the prior (10), the logit link together with the prior π(β|λ = 0), which corresponds to the Jeffreys prior described in Chen et al. (2008), and the probit link together with the prior π(β|λ = 0), which again corresponds to the Jeffreys prior described in Chen et al. (2008). The Bayes factors of the different links against the skew-logistic link, AIC, and BIC values, shown in Table 6, favour the use of an asymmetric model and slightly favour the skew-logistic link over the other competitors. Table 5 shows the predicted observations with these models, obtained by multiplying the number of subjects $n_j$ by the predicted probability at the corresponding dose level. This table suggests a better fit of the asymmetric models.
[Table 5. Columns: Dose, $n_j$, $y_j$, and predicted counts under the logit, probit, skew-logistic, and skew-normal links.]
The parameter θ is called the stress-strength coefficient and it has been applied in several contexts (see Rubio and Steel, 2013). Note that, unlike Rubio and Steel (2013), who model the joint distribution of (X, Y), here we are making distributional assumptions on the difference Z. In a Bayesian context, if we have a sample of paired observations $(x_i, y_i)$, $i = 1, \dots, n$, from (X, Y), then we can obtain a sample from the posterior distribution of θ by first obtaining a sample from the posterior distribution of (µ, σ, λ), obtained in turn by using the sample of differences $z_i = x_i - y_i$, and then by plugging these values into (12). It is worth pointing out that this approach can only be applied when the sample is complete and it does not contain censored observations. For a more general approach that covers these cases see Rubio and Steel (2013).
We consider the data set presented in Venkatraman and Begg (1996) which contains 72 lesion scores obtained using both a clinical scheme without a dermoscope (X Test), and a dermoscopic scoring scheme (Y Test). Their main interest was assessing the information provided by the use of the dermoscope. Here, we analyse the subset of n = 51 non-diseased patients.
The sample skewness of the differences $z_i = x_i - y_i$ is 0.57, and Figure 3a shows the histogram of these differences. These features suggest the need for using an asymmetric model. For this purpose, we compare the performance of the skew-normal distribution (1) and the skew-logistic distribution (6) together with the independence Jeffreys prior (5). Since the sample of differences does not contain repeated observations, it follows that the posterior distribution of (µ, σ, λ) is proper and consequently the posterior of θ is well-defined for both sampling models. Figure 3b shows the posterior distributions of θ. We can observe that the inference on θ under both distributional assumptions is very similar. A 95% posterior credible interval for θ for the skew-logistic model is (0.54, 0.76), while the corresponding interval for the skew-normal model is (0.52, 0.74). These intervals do not contain the value θ = 0.5; this approach therefore leads to results similar to those obtained in Rubio and Steel (2013).

Discussion
We have studied the Jeffreys prior of the skewness parameter of a general class of scalar skew-symmetric models as well as the independence Jeffreys prior for the same class of models with unknown location and scale parameters. We have shown that this sort of prior has appealing properties such as symmetry, properness, and identifiable tail behaviour that allow in many cases a tractable approximation that facilitates its implementation. We have also presented easy-to-check conditions for the existence of the posterior distribution for a general subclass of skew-symmetric sampling models. Given that the prior on the skewness parameter has heavy tails, $O(|\lambda|^{-3/2})$, the corresponding Bayesian models are expected to have good frequentist properties, since heavy-tailed priors are usually employed as "vague priors". This feature was illustrated using a simulation study in Section 4.1.
One of the unpleasant properties of the priors studied in this paper is that they are well-defined at λ = 0 only when the transformed symmetric density f has a finite second moment. This can be considered as a limitation for the use of these Bayesian models. However, things must be considered in perspective. In some cases where our prior cannot be defined at zero, maximum likelihood estimation also fails, and we do not know of any other broadly satisfactory alternative method. Inspired by the structure of the independence Jeffreys prior (5), we can construct a more general benchmark prior for skew-symmetric models as follows: $\pi(\mu, \sigma, \lambda) \propto \sigma^{-1} p(\lambda)$, where p(λ) is any proper prior. Using this prior structure, the corresponding posterior distribution is proper for any skewing function G if f in (3) is a scale mixture of normals, the sample size n ≥ 2, and all the observations are different. The proof of this result is similar to that of Theorem 3. The study of appropriate choices for p(λ) is a matter of further research.
After the same change of variable u = λx we can rewrite the Fisher information as du.

Proof of Theorem 4
The result follows again by using inequality (19) and Theorem 4 from Fernández and Steel (1998).