Bayesian Model Choice for Directional Data

Abstract This article is concerned with the problem of choosing between competing models for directional data. In particular, we consider the question of whether or not two independent samples of axial data come from the same Bingham distribution. This is not a straightforward question to answer, due to the intractable nature of the parameter-dependent normalizing constant of the Bingham distribution. We propose three different methods to perform this task within a Bayesian framework, and apply the methodology to a real dataset on earthquakes in New Zealand. R code to run our methods is available in online supplementary materials.


Introduction
The Bingham distribution, introduced by Bingham (1974), is one of the most important and useful models for data which arise as unsigned directions (axes). Bingham (1974) derived its main properties, and since then it has been studied further; see, for example, Kent (1987), Mardia and Jupp (2000), Kume and Walker (2014), and Dryden (2005), who studied the properties of the Bingham distribution in very high dimensions.
The Bingham distribution can be constructed by conditioning a zero-mean multivariate Normal (MVN) distribution in R^q to lie on the unit sphere S^{q−1}. For x ∈ S^{q−1}, the density of the Bingham distribution with respect to the uniform measure on S^{q−1} is given by

f(x; A) = exp(−x^T A x) / c(A),   (1)

where the parameter A is a real symmetric q × q matrix and c(A) is the corresponding normalizing constant.
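To make the intractability of c(A) concrete, the density kernel and a naive Monte Carlo estimate of the constant can be sketched in a few lines. This is an illustrative Python snippet (the article's supplementary code is in R), using the sign convention f(x; A) ∝ exp(−x^T A x); the function names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def bingham_unnorm(x, A):
    # Unnormalized Bingham density exp(-x^T A x) for a unit vector x.
    return np.exp(-x @ A @ x)

def mc_normalizing_const(A, n_samples=200_000, rng=rng):
    # Crude Monte Carlo estimate of c(A) = E_uniform[exp(-x^T A x)]
    # (expectation under the uniform measure on the sphere), obtained by
    # normalizing standard Gaussian draws to lie on S^{q-1}.
    q = A.shape[0]
    z = rng.standard_normal((n_samples, q))
    x = z / np.linalg.norm(z, axis=1, keepdims=True)
    return np.mean(np.exp(-np.einsum('ij,jk,ik->i', x, A, x)))

A = np.diag([20.0, 10.0, 0.0])   # A = diag(lambda, 0) as in the text
c_hat = mc_normalizing_const(A)
```

A crude estimator of this kind is not accurate enough to use within inference, which is one motivation for the specialized approximations and exchange-type algorithms discussed in this article.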
Let A = VΛV^T be the Singular Value Decomposition of A. Then maximum likelihood estimates of V (whose columns are the eigenvectors of A) are easily obtained. However, inferring the eigenvalues of A (the elements of the diagonal matrix Λ) is a rather challenging task, since evaluation of the parameter-dependent normalizing constant c(A) is required, which, in the general case, is not available in closed form. This poses significant challenges when fitting the Bingham distribution to data in either a frequentist or Bayesian setting. Several approaches have been proposed in the literature to approximate/estimate the normalizing constant, and hence facilitate maximum likelihood estimation, based on asymptotic expansions (Kent 1987), saddlepoint approximations (Kume and Wood 2005) and, more recently, holonomic gradient methods (Sei and Kume 2015). However, these are nontrivial numerical methods to implement, and in any case they involve some level of approximation of c(A). Likewise, standard approaches to model selection involving Bingham distributions would require evaluation of c(A).
In the Bayesian setting, MCMC is also complicated by c(A), since standard Metropolis-Hastings updates would require evaluation of this constant at the current and proposed configurations. This is known as a doubly intractable problem, for which various algorithms have been proposed (e.g., Møller et al. 2006; Murray, Ghahramani, and MacKay 2006; Walker 2011). The specific case of the Bingham distribution was considered by Walker (2014) and Fallaize and Kypraios (2016).
The above discussion relates to inference given data from a single Bingham distribution. In this article, we consider, within a Bayesian framework, the problem of model selection in the context of Bingham distributions. That is, suppose we have two independent datasets consisting of axial data, considered to be realizations from Bingham distributions. A natural question that often arises is the following: do these two datasets come from the same Bingham distribution? A common approach is to compute model probabilities/Bayes factors for the two competing scenarios. Again, standard techniques are complicated by the intractability of the constant c(A). Recall that, generally speaking, there are two approaches for calculating Bayes factors/model probabilities using MCMC:

• within-model estimation. This involves estimating the model evidence (marginal likelihood) for each model separately, for example, using MCMC samples and Chib's method. In the doubly intractable case, although samples can be obtained from each posterior distribution using one of the aforementioned algorithms, one would also have to plug in an approximation to the unknown normalizing constant (c(A) for the Bingham distribution). So in practice, the model evidence has to be approximated.
• across-model estimation. Reversible jump MCMC can be used to run one sampler which jumps between competing models and thus allows direct (Monte Carlo) estimation of the model probabilities. In the doubly intractable case, the Reversible Jump Exchange algorithm (Caimo and Friel 2013) is an across-model approach that enables exact (in a Monte Carlo sense) calculation of posterior model probabilities without the need to approximate the intractable constant, but it is more computationally intensive and requires careful design to construct an efficient sampler.
Although these general approaches are available, implementations and performance are context-specific. The main purpose of this article is to show how methods in each of the above categories can be successfully applied to perform model selection for Bingham distributions, and to investigate performance in terms of the ability to discriminate between competing models depending on sample size and parameter configurations. Specifically, we investigate three different approaches: the Reversible Jump Exchange algorithm of Caimo and Friel (2013); a within-model approach in which the model probabilities are approximated; and an alternative to the calculation of Bayes factors, based on calculating (multivariate) highest posterior density regions for a certain parameter of interest, one value of which corresponds to distributional equality. The latter two approaches are computationally fast and easy to implement. They only require that the practitioner be able to fit a Bingham distribution to directional data within a Bayesian framework, something that can be done efficiently using the algorithm of Fallaize and Kypraios (2016). They also provide a useful cross-check for the Reversible Jump Exchange algorithm, which is appealing in the sense that it provides exact (Monte Carlo) calculation of posterior model probabilities, but requires more careful sampler design and is more computationally intensive. Hence, all three approaches are useful in practice.
The article is structured as follows. In Section 2 we introduce a motivating example of data from earthquakes in New Zealand. In Section 3, we review how Bayesian inference can proceed for the Bingham distribution, and in Section 4 we describe our three strategies for performing model choice for the Bingham distribution in the Bayesian framework. In Section 5, we illustrate the methods on simulated data, and perform a simulation study to investigate the extent to which it is possible to discriminate between models as parameters and sample sizes vary. In Section 6, we apply our methods to the New Zealand earthquake data, before concluding with a discussion.

Motivating Application
Here, we introduce a motivating example, which relates to earthquake data in New Zealand. An earthquake gives rise to three orthogonal axes, and geophysicists are interested in analyzing such data in order to compare earthquakes at different locations and/or at different times. Specifically, an earthquake gives rise to a pair of orthogonal axes, known as the compressional (P) and tensional (T) axes, from which a third axis, known as the null (B) axis, is obtained via B = P × T. Each of these quantities is determined only up to sign, and so models for axial data are appropriate. Arnold and Jupp (2013) treated these data as orthogonal axial 3-frames in R^3, but, following Fallaize and Kypraios (2016), we will illustrate our methods using the B axes. In general, an orthogonal axial r-frame in R^q, r ≤ q, is an ordered set of r axes, {±u_1, ..., ±u_r}, where u_1, ..., u_r are orthonormal vectors in R^q (Arnold and Jupp 2013). The familiar case of axial data on the sphere S^2 is the special case corresponding to q = 3, r = 1, which is the case we consider here.
The data consist of three clusters of observations derived from earthquakes in New Zealand, where each observation represents a B axis obtained from a particular earthquake. The first two clusters each consist of 50 observations from earthquakes near Christchurch which took place before and after a large earthquake on the 22nd of February 2011. For these two clusters, the P and T axes are quite highly concentrated in the horizontal plane, and as a result the majority of the B axes are concentrated about the vertical axis (see Figure 1). From an applied viewpoint, it is of interest to geophysicists to establish whether there is a difference between the pattern of earthquakes before and after the large earthquake. The third cluster is a more diverse set of 32 observations obtained from earthquakes in the north of New Zealand's South Island. Visually, this cluster appears to exhibit more dispersion than the other two samples. Hence, one approach is to model the B axes using the Bingham distribution, and the principal task is to assess the evidence that the different samples are from populations with the same Bingham distribution or not.

Bayesian Inference
Consider the probability density function of the Bingham distribution as given in (1). If A = VΛV^T is the Singular Value Decomposition of A, where V is orthogonal and Λ = diag(λ_1, ..., λ_q), then it can be shown that if x is drawn from a distribution with probability density function f(·; A), the random vector V^T x is drawn from a distribution with density f(·; Λ) (see, e.g., Kume and Walker 2006; Kume and Wood 2007). Therefore, for ease of exposition and without loss of generality, we assume that A = Λ = diag(λ_1, ..., λ_q). Moreover, to ensure identifiability, we assume that λ_1 ≥ λ_2 ≥ ... ≥ λ_{q−1} ≥ λ_q = 0 (Kent 1987), and denote by λ := (λ_1, ..., λ_{q−1}) the ordered vector of parameters to be inferred. Hence, the probability density function becomes

f(x; λ) = exp(−Σ_{i=1}^{q−1} λ_i x_i^2) / c(λ)

with respect to the uniform measure on the sphere, where

c(λ) = ∫_{S^{q−1}} exp(−Σ_{i=1}^{q−1} λ_i x_i^2) dx.

(Throughout, we write λ in place of A or Λ, since it is to be understood that A is of the form A = Λ = diag(λ, 0).) This integral can be expressed as a hypergeometric function of matrix argument (see, e.g., Mardia and Jupp 2000), which is an infinite series involving zonal polynomials and hence not a tractable function to work with in practice. We note that in the case where there are only two distinct eigenvalues of A, the distribution reduces to a Watson distribution, in which case the normalizing constant can be easily evaluated. The Watson distribution is appropriate when there is rotational symmetry about some axis, which is not the case for the Bingham distribution in general. Here, we assume that this possibility has been discounted, and that the general Bingham distribution is appropriate.

[Figure 1 caption: The data have been rotated to principal axes, such that the principal axis is the North-South axis. Relatively, the two Christchurch clusters are more concentrated about the vertical axis, and the South Island cluster more dispersed. All points have been plotted in the same hemisphere to enable easier comparison of the variability in each sample. In some cases, the antipodal point was the observation; as these are axial data, both are treated as the same observation and it does not matter which is observed.]
The likelihood contains the normalizing constant c(λ), and the fact that there does not exist a closed-form expression for it makes Bayesian inference for λ challenging. Denote by π(λ) a prior distribution that is placed on λ. For data x = (x_1, ..., x_n), the posterior distribution of interest is then

π(λ | x) ∝ π(λ) c(λ)^{−n} exp(−n Σ_{i=1}^{q−1} λ_i τ_i),   (4)

where τ_i = (1/n) Σ_{j=1}^{n} x_{ji}^2, which also involves the intractable normalizing constant c(λ).
As explained in Fallaize and Kypraios (2016), in principle one can employ one of the proposed methods in the literature to approximate the normalizing constant (see, e.g., Kent 1987;Kume and Wood 2005;Sei and Kume 2015) and then use, for example, a Metropolis-Hastings algorithm to sample from π(λ|x).However, despite such an approach being feasible, it is not entirely satisfactory.First, the approximation would have to be computed at each iteration of the Markov chain and second, and most importantly, the stationary distribution of such an MCMC algorithm will not be the distribution of interest, but rather an approximation to it (despite how good this might be).Fallaize and Kypraios (2016) showed how one can bypass the need of calculating the normalizing constant by making use of recent developments in computational methods for doubly intractable distributions.
Of course, in general we will not have V = I. In practice, we do the following. Let X be the n × q data matrix, that is, row j of X is x_j^T, and let X̄ = (1/n) X^T X. Then let QΛ̄Q^T be the Singular Value Decomposition of X̄, whereby the diagonal elements of Λ̄ are the τ_i above. These are unchanged under any rotation of X. Hence, in practice, we take our data τ_i, i = 1, ..., q, to be the eigenvalues of X̄. Due to the identifiability constraints on λ, we work with the ordered eigenvalues (with τ_1 being the smallest and so on), and τ_q is redundant since the constraint λ_q = 0 is imposed. This procedure is equivalent to performing inference for λ as above, with V = I.
This can be viewed as (and is equivalent to) the following hybrid algorithm for performing inference for V and λ. Q is in fact the maximum likelihood estimate (mle) of V. Given the data X in the original coordinates, compute Q and set Y = XQ. Then the eigenvalues of (1/n) Y^T Y are the same as those of X̄, and the ith eigenvalue is also equal to (1/n) Σ_{j=1}^{n} y_{ji}^2. So working with the eigenvalues of X̄ computed from the original data is equivalent to taking the mle of V, using this to rotate the data, and performing inference for λ using the posterior distribution (4) with the rotated data.
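The equivalence between working with the eigenvalues of the scatter matrix and first rotating to principal axes can be checked numerically. A short numpy sketch (simulated, non-Bingham data, used only to illustrate the linear algebra; variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative data: n unit vectors in R^3 (rows of X).
n, q = 100, 3
Z = rng.standard_normal((n, q)) * np.array([0.2, 0.5, 1.0])
X = Z / np.linalg.norm(Z, axis=1, keepdims=True)

Xbar = (X.T @ X) / n              # scatter matrix Xbar = (1/n) X^T X
tau, Q = np.linalg.eigh(Xbar)     # eigenvalues tau_i (ascending, so tau_1
                                  # is smallest) and the mle Q of V
Y = X @ Q                         # rotate the data to principal axes

# The eigenvalues are unchanged by the rotation, and they equal the
# per-coordinate mean squares (1/n) sum_j y_{ji}^2 of the rotated data.
tau_rot = np.sort(np.linalg.eigvalsh((Y.T @ Y) / n))
col_means = np.sort(np.mean(Y ** 2, axis=0))
```

Since each row of X is a unit vector, the eigenvalues also sum to one, which is why τ_q carries no extra information once the others are known.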

Preliminaries
Given a set of K competing models indexed by a model indicator k ∈ {1, ..., K}, the principal interest in Bayesian model selection lies in quantifying the extent to which the observed data support each model k. The Bayesian treatment of model choice involves joint inference for k and a parameter vector λ_k, where the model indicator determines the dimension of the parameter, which may vary from model to model.
Suppose data x are assumed to have been generated by model k; the posterior distribution is

π(λ_k | x, k) = π(x | λ_k, k) π(λ_k | k) / π(x | k),

where π(x | λ_k, k) represents the likelihood function and π(λ_k | k) the prior distribution of the parameters of model k. The model evidence for model k is the integral

π(x | k) = ∫ π(x | λ_k, k) π(λ_k | k) dλ_k.

It represents the probability of the data x given a particular model k and is vital in Bayesian model choice. The reason is that it allows us to make statements about posterior model probabilities, which can be written as

π(k | x) = π(x | k) π(k) / Σ_{k′=1}^{K} π(x | k′) π(k′),

where π(k) is the prior probability of model k. (In our examples we have two competing models and take π(k) = 0.5 as a default choice in the absence of any prior information to the contrary.) These posterior model probabilities give rise to Bayes factors, which give the relative posterior likelihoods of pairs of models; for example,

B_12 = π(x | k_1) / π(x | k_2) = [π(k_1 | x) / π(k_2 | x)] × [π(k_2) / π(k_1)],

where B_12 is the Bayes factor of model k_1 relative to model k_2, given the observed data x, assuming that the prior model probabilities π(k_1) and π(k_2) are independent of x. The larger the Bayes factor, the greater the evidence in favor of model k_1 compared to model k_2. Jeffreys (1998) presents a scale on which one can interpret the strength of evidence of one model against another.
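Given (estimates of) the log model evidences, the posterior model probabilities and the Bayes factor follow directly from the formulas above. A small Python sketch with illustrative numbers (the log-evidence values are made up):

```python
import math

def posterior_model_probs(log_evidence, prior=None):
    # Posterior model probabilities pi(k|x) from log model evidences
    # log pi(x|k), computed stably via the log-sum-exp trick.
    K = len(log_evidence)
    prior = prior or [1.0 / K] * K
    log_post = [le + math.log(p) for le, p in zip(log_evidence, prior)]
    m = max(log_post)
    w = [math.exp(lp - m) for lp in log_post]
    s = sum(w)
    return [wi / s for wi in w]

# Two models with equal prior probability 0.5 (the default in the text):
probs = posterior_model_probs([-105.2, -107.5])
bayes_factor_12 = math.exp(-105.2 - (-107.5))   # B_12 = pi(x|k1)/pi(x|k2)
```

With equal prior model probabilities, the ratio of the posterior model probabilities equals the Bayes factor, as in the display above.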

Computing Bayes Factors
In practice, the posterior model probabilities, which are essential for conducting Bayesian model choice, are very rarely analytically available. Therefore, we very often rely on computational methods. One of the most favored computational methods is Markov chain Monte Carlo (MCMC). The main idea behind MCMC is to construct a Markov chain which has the posterior distribution of interest as its invariant distribution.
In general, there are two main approaches for computing Bayes factors: (i) across-model and (ii) within-model estimation. The within-model approaches aim to approximate/estimate the model evidence π(x | k) for each model k = 1, ..., K. There are a number of approaches, ranging from Laplace approximations to path sampling; see Friel and Wyse (2012) for a comprehensive review. The across-model strategy relies on constructing a Markov chain that crosses the joint model and parameter space and samples from π(λ_k, k | x), and one of the most popular approaches used in this context is the reversible jump MCMC (RJMCMC) algorithm of Green (1995).
In the context of the Bingham distribution, neither within-model strategies nor across-model strategies (such as RJMCMC) can be used straightforwardly, because the normalizing constant c(λ) in the Bingham likelihood is not available analytically. Here, we present an implementation of the Reversible Jump (RJ) exchange algorithm of Caimo and Friel (2013), which is a modification of the reversible jump algorithm of Green (1995) based on the exchange algorithm of Murray, Ghahramani, and MacKay (2006). Essentially, the RJ exchange algorithm enables us to implement an RJ algorithm without having to compute/approximate the unknown normalizing constant c(λ). We will also present a within-model approach for estimating the model evidence in the context of the Bingham distribution. The motivation for this is to provide a useful method for low-dimensional models, to serve as an alternative reference against which to compare the RJ exchange algorithm.

Reversible Jump Exchange Algorithm
As described in Section 4.2, the evidence in favor of competing models can be obtained in an across-model fashion. In this case, the parameter space is (λ_k, k), and a Markov chain which samples from the joint posterior distribution π(λ_k, k | x) can in principle be constructed using RJMCMC. Given such a chain, inference for the model index k is readily available, since it is directly incorporated as a parameter. Given a current state (λ_k, k), an RJMCMC algorithm essentially consists of two steps: (i) within-model moves, which perform updates of the form λ_k → λ′_k for the current model k, and (ii) across-model moves, which perform updates of the form (λ_k, k) → (λ′_{k′}, k′), whereby a move from model k to model k′ is proposed, along with a parameter vector λ′_{k′} for that model. In our setting, both types of move are complicated by the fact that the normalizing constant from the Bingham likelihood would need to be evaluated when computing the acceptance probabilities.
We now show how across-model updates can be performed in our setting using an exchange version of RJMCMC proposed by Caimo and Friel (2013), in order to provide model evidence in favor of competing Bingham models (Algorithm 2). We denote model k by an indicator m_k. Within the algorithm, a move from model k to k′ is proposed from a proposal distribution h(k′ | k), and given the proposed value k′, a candidate value for the model-specific parameter λ′_{k′} is then drawn from a proposal density w_{k′}(λ′_{k′} | φ_{k′}, m_{k′}). The latter is a model-specific proposal density with parameters φ_{k′}, from which λ′_{k′} is drawn independently of the current value of λ_k. Hence, this is a form of independence sampler, which obviates the need to tune parameters for the proposal densities for trans-dimensional moves in RJMCMC, which is often a very challenging task.
Values for the parameters φ_k which parameterize the densities w_k(λ_k | φ_k, m_k) can be obtained by fitting the Bingham distribution to the individual models k = 1, ..., K using pilot MCMC runs on each model separately (Fallaize and Kypraios 2016), prior to running the main RJMCMC sampler. In our examples, we find that a multivariate normal (MVN) distribution can approximate π(λ_k | x) sufficiently well, and hence φ_k consists of a mean vector and covariance matrix for the parameters (λ_1, ..., λ_{q−1}) corresponding to model k, estimated from the pilot runs. (The conclusion that the MVN approximation is adequate for use as a proposal distribution was reached by monitoring the performance of the MCMC, where good mixing and acceptance rates were observed, in addition to visually comparing the fitted MVN contours with those from a kernel density estimate.) Proposals which do not respect the constraints on the parameters are automatically rejected.
Algorithm 3 Reversible Jump exchange algorithm
1. for k = 1, ..., K do
2.   Draw samples {λ_k} from π(λ_k | x, k)
3.   Set μ_k and Σ_k as the sample mean and sample covariance of {λ_k}
4. end for
5. repeat
6.   perform a within-model update of λ_k using Algorithm 1
7.   perform an across-model move using Algorithm 2
8. until the desired number of posterior samples is obtained

Given a proposal λ′_{k′}, auxiliary data ỹ are drawn from f(·; λ′_{k′}, m_{k′}) (Step 3 of Algorithm 2), the sampling mechanism of the data under model k′ with parameter λ′_{k′}. In our case, this will be a form of Bingham distribution, for which exact samples can be drawn using the algorithm of Kent, Ganeiber, and Mardia (2018). The overall acceptance probability is given by

α = min{1, [f(x; λ′_{k′}) π(λ′_{k′} | k′) π(k′) h(k | k′) w_k(λ_k | φ_k, m_k) f(ỹ; λ_k)] / [f(x; λ_k) π(λ_k | k) π(k) h(k′ | k) w_{k′}(λ′_{k′} | φ_{k′}, m_{k′}) f(ỹ; λ′_{k′})]}.   (5)

This is the Metropolis-Hastings ratio of the move on the augmented state space {ỹ, λ_k, k}, and the choice of proposal for the auxiliary data (ỹ ∼ f(·; λ′_{k′}, k′)) ensures that the ratio c(λ_k)/c(λ′_{k′}) appears in both numerator and denominator and hence cancels. This is an RJMCMC version of the exchange algorithm (Murray, Ghahramani, and MacKay 2006), as described in Caimo and Friel (2013). The auxiliary data ỹ are drawn from the same sampling mechanism as the observed data, with parameters set at the candidate state (λ′_{k′}, k′). Intuitively, the roles of x and ỹ are reversed (or "exchanged"): the auxiliary data ỹ are offered to the current state (λ_k, k) in exchange for the observed data x, in order to persuade a move to the candidate state (λ′_{k′}, k′).
Therefore, this is a form of reversible jump MCMC algorithm which avoids the need to evaluate the awkward normalizing constants, in the same spirit as the exchange algorithm of Murray, Ghahramani, and MacKay (2006); Caimo and Friel (2013) call this the Auto-RJ exchange algorithm.
The process is summarized in Algorithm 3. In our applications, there are K = 2 competing models, and we set h(k′ | k) = 1 for k′ ≠ k, so that a move to the other model is proposed with certainty.
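The cancellation of c(·) in the acceptance ratio can be illustrated numerically: computing the ratio with unnormalized likelihoods gives exactly the same answer as including arbitrary stand-in normalizing constants, provided the auxiliary dataset matches the observed sample size. An illustrative Python sketch (helper names, prior and proposal values are ours; equal model priors and a symmetric model proposal h are assumed, so those terms drop out):

```python
import numpy as np

rng = np.random.default_rng(4)

def log_q(data, lam):
    # Unnormalized Bingham log-likelihood for data (rows = unit vectors
    # in R^q) and eigenvalue parameters lam = (lam_1, ..., lam_{q-1}):
    # log q(data | lam) = -sum_j sum_i lam_i * x_{ji}^2.
    lam_full = np.append(lam, 0.0)
    return -np.sum((data ** 2) @ lam_full)

def log_accept_ratio(x, y_aux, lam_cur, lam_prop,
                     log_prior_cur, log_prior_prop, log_w_cur, log_w_prop):
    # Log acceptance ratio of the across-model exchange move, written
    # entirely in terms of unnormalized likelihoods. Because the
    # auxiliary data y_aux are proposed from the candidate model, the
    # intractable constants c(.) cancel and are never evaluated.
    return (log_q(x, lam_prop) + log_prior_prop + log_w_cur + log_q(y_aux, lam_cur)
            - log_q(x, lam_cur) - log_prior_cur - log_w_prop - log_q(y_aux, lam_prop))

# Sanity check of the cancellation: adding arbitrary "-n * log c" terms
# to each likelihood leaves the ratio unchanged when the auxiliary
# dataset has the same size n as the observed data.
n = 50
x = rng.standard_normal((n, 3)); x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.standard_normal((n, 3)); y /= np.linalg.norm(y, axis=1, keepdims=True)
lam_cur, lam_prop = np.array([20.0, 10.0]), np.array([20.0, 6.0])
r_unnorm = log_accept_ratio(x, y, lam_cur, lam_prop, -1.0, -1.2, -2.0, -2.5)
logc_cur, logc_prop = 3.7, 5.1   # stand-ins for the unknown log c(lam)
r_norm = ((log_q(x, lam_prop) - n * logc_prop) + (-1.2) + (-2.0)
          + (log_q(y, lam_cur) - n * logc_cur)
          - (log_q(x, lam_cur) - n * logc_cur) - (-1.0) - (-2.5)
          - (log_q(y, lam_prop) - n * logc_prop))
```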

Estimating Model Evidence
In this section, we present a within-model approach for estimating the model evidence based on the work of Chib (1995). Suppose that data x = (x_1, x_2, ..., x_n) are assumed to have been generated from a Bingham distribution with parameter λ. The approach follows from noticing that, for any fixed point λ′, (4) implies that

π(x) = π(x | λ′) π(λ′) / π(λ′ | x).

Hence, to estimate the model evidence π(x), we require both an estimate of the (normalized) posterior density at λ′, π(λ′ | x), and an estimate of the normalizing constant c(λ′), which enters the likelihood π(x | λ′). For the latter, we used the method of Kent (1987), based on asymptotic expansions. Another possibility is the holonomic gradient method of Sei and Kume (2015), as implemented in the R package hgm. We also used this method, but occasionally encountered numerical instability in computing the normalizing constant for some of the datasets we simulated. Avoiding such potential problems is one argument in favor of methods which do not require computation of the constant. With regard to estimating the posterior density at λ′, once we have samples from π(λ | x) using the exchange algorithm of Fallaize and Kypraios (2016), we can produce a kernel density estimate of the posterior density at λ′. Of course, in practice, because of the curse of dimensionality this approach can only be used when the number of parameters is fairly small. However, this is not uncommon in the applications in which the Bingham distribution is used, where observed data are typically points on S^2.
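Chib's identity holds at any fixed point λ′, which provides a useful self-check. The sketch below verifies this in a toy conjugate normal-normal model (entirely hypothetical, chosen because the posterior and likelihood are exact there; in the Bingham setting π(λ′ | x) is replaced by a kernel density estimate and c(λ′) by an approximation):

```python
import math

def norm_logpdf(v, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

def chib_log_evidence(x, theta_star, sigma2=1.0, mu0=0.0, tau02=4.0):
    # Chib's identity: log pi(x) = log p(x | theta*) + log pi(theta*)
    #                              - log pi(theta* | x),
    # evaluated at any fixed point theta*. In this conjugate
    # normal-normal toy model (x_i ~ N(theta, sigma2), theta ~ N(mu0,
    # tau02)) the posterior is available exactly, so no kernel density
    # estimate is needed.
    n = len(x)
    taun2 = 1.0 / (1.0 / tau02 + n / sigma2)         # posterior variance
    mun = taun2 * (mu0 / tau02 + sum(x) / sigma2)    # posterior mean
    loglik = sum(norm_logpdf(xi, theta_star, sigma2) for xi in x)
    return (loglik + norm_logpdf(theta_star, mu0, tau02)
            - norm_logpdf(theta_star, mun, taun2))
```

Evaluating the estimate at two different points θ* and checking agreement is a cheap diagnostic that carries over to the Bingham case, where any disagreement reflects error in the density and constant estimates.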

Model Selection via Contour Probabilities
We now consider an alternative approach, which uses posterior contour probabilities (Held 2004) of a suitable parameter in order to provide a measure of the evidence against the simpler model. The probability can be interpreted in a similar fashion to a p-value, and henceforth any reference to p-values will mean the posterior contour probability described below.
Suppose that we are given two independent samples of unit vectors in S^{q−1}, x = (x_1, x_2, ..., x_n) and ỹ = (y_1, y_2, ..., y_m). Let us assume that x and ỹ are independent random samples from Bingham distributions with probability density functions f(x; λ_x) and f(y; λ_y), respectively, where λ_x = (λ_1^{(x)}, ..., λ_{q−1}^{(x)}) and λ_y = (λ_1^{(y)}, ..., λ_{q−1}^{(y)}). Assessing whether or not the samples x and ỹ come from the same Bingham distribution is equivalent to testing the hypothesis λ_x = λ_y versus λ_x ≠ λ_y. We propose fitting a Bingham distribution to the combined dataset ω = (x, ỹ) = (x_1, x_2, ..., x_n, y_1, y_2, ..., y_m). By independence, the joint density of ω is ∏_{i=1}^{n+m} f(ω_i; λ^{(i)}). Assuming a particular structure for the {λ^{(i)}} allows us to assess the evidence against λ_x = λ_y. Specifically, let z_i = I(ω_i is a y sample), i = 1, ..., n + m, where I(·) is the indicator function. Then, we endow the {λ^{(i)}} with the structure

λ^{(i)} = λ_x + z_i α,   (6)

where α = (α_1, ..., α_{q−1})^T ∈ R^{q−1}. In other words, the combined data constitute n + m independent observations from the Bingham distribution, where the x samples take parameter value λ_x = (λ_1, λ_2, ..., λ_{q−1}) and the ỹ samples take parameter value λ_y = λ_x + α. Therefore, we can fit the Bingham distribution to the combined dataset ω within a Bayesian framework, where each observation ω_i has parameter λ^{(i)} of the form (6), and then determine whether the posterior distribution of the coefficients α_i, i = 1, ..., q − 1, is "significantly" different from the zero vector.

At first glance, one might investigate the corresponding univariate credible intervals of some predefined level separately. However, it is rather unclear how this would translate into a simultaneous probability statement, since the coefficients are not independent. One option is to calculate simultaneous credible bands at various levels and determine the smallest level such that the point of interest (i.e., the zero vector in our case) is contained in the simultaneous credible band. Held (2004) proposed instead to use (multivariate) highest posterior density (HPD) regions to check the support of the posterior distribution for a certain parameter vector of interest. Held (2004) defined the (posterior) contour probability Pr(α_0 | ω) as 1 minus the content of the HPD region of the posterior distribution π(α | ω) which just covers α_0:

Pr(α_0 | ω) = Pr{π(α | ω) ≤ π(α_0 | ω) | ω}.

If the functional form of the posterior density of α is available analytically, then we can obtain an unbiased estimate of the contour probability by Monte Carlo; in other words, by computing the proportion of posterior samples with density value less than or equal to π(α_0 | ω). In our context, the functional form of the posterior density is unavailable, and therefore we use the modified Rao-Blackwell estimate of the simple Monte Carlo estimator proposed by Held (2004) and implemented in the bayesSurv R package. Using the algorithm proposed in Fallaize and Kypraios (2016), one can sample from the posterior distribution of the parameters (λ_x, α), and then determine the posterior simultaneous credible region of the coefficients α_i, i = 1, ..., q − 1.
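The simple Monte Carlo estimator of the contour probability is easy to sketch when the posterior density is available. Here is an illustrative Python version using a bivariate normal as a stand-in for π(α | ω) (in the Bingham application the density has no closed form, which is why the Rao-Blackwellized estimator from bayesSurv is used instead; all names here are ours):

```python
import numpy as np

rng = np.random.default_rng(5)

def mvn_logpdf(pts, mean, cov_inv, log_det_cov):
    # Log density of a multivariate normal, vectorized over rows of pts.
    d = np.atleast_2d(pts) - mean
    quad = np.einsum('ij,jk,ik->i', d, cov_inv, d)
    k = mean.size
    return -0.5 * (k * np.log(2 * np.pi) + log_det_cov + quad)

def mc_contour_probability(samples, logpdf, alpha0):
    # Simple Monte Carlo estimate of Held's contour probability: the
    # proportion of posterior draws whose density does not exceed the
    # density at the point of interest alpha0.
    return float(np.mean(logpdf(samples) <= logpdf(alpha0)[0]))

# Illustrative "posterior" for alpha: a bivariate normal.
mean = np.array([1.0, 0.5])
cov = np.array([[1.0, 0.3], [0.3, 0.5]])
cov_inv = np.linalg.inv(cov)
log_det = np.log(np.linalg.det(cov))
draws = rng.multivariate_normal(mean, cov, size=100_000)
logpdf = lambda p: mvn_logpdf(p, mean, cov_inv, log_det)

p_contour = mc_contour_probability(draws, logpdf, np.zeros(2))
```

A small contour probability at α_0 = 0 indicates that the zero vector lies far out in the tails of π(α | ω), that is, evidence against λ_x = λ_y.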

Computational Intensity
In applications of practical interest, observations are unit vectors in R^3, which is the situation for our numerical results in Sections 5 and 6. The most intensive case we have considered in our simulations is the reversible jump exchange algorithm with a total of 1000 observations (which is larger than typical axial datasets). Running the sampler for 100,000 sweeps took 4 min on a 3.4-GHz CPU using a single core. The other two approaches require only runs of the within-model sampler of Fallaize and Kypraios (2016), which is less intensive. The additional computation of the HPD region for the contour method is minimal, and ran in a matter of seconds using the simult.pvalue function from the bayesSurv R package.
For higher dimensions, that is, data in R^q for q larger than 3, there will of course be increased computational cost. However, our focus in this article is on investigating the use of the computational methods considered in the cases of most practical importance for real applications of axial data.

Simulation Study
The aims of this section are to (i) investigate the performance of the proposed methods and the extent to which they agree, and (ii) assess the extent to which it is possible to discriminate between models as parameters and sample sizes vary. We address these aims by performing simulation studies in which datasets of varying size are generated using different parameter combinations, to which we apply our model selection methods.
In this section, and also in the real data application in Section 6, the observations are unit vectors in R^3, so the Bingham distribution has two (eigenvalue) parameters λ_1 ≥ λ_2 ≥ 0 to be inferred. For a prior distribution, we use a product of exponential distributions subject to the above constraint, that is,

π(λ_1, λ_2) ∝ exp(−μ_1 λ_1 − μ_2 λ_2) I(λ_1 ≥ λ_2 ≥ 0),

where the μ_i are rate parameters. In all of our numerical results we have used μ_i = 0.01.
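A draw from this constrained prior can be obtained by simple rejection from the unconstrained product of exponentials; a minimal Python sketch (our own helper, not part of the supplementary R code):

```python
import numpy as np

rng = np.random.default_rng(6)

def sample_prior(mu1=0.01, mu2=0.01, size=1, rng=rng):
    # Draw from the prior pi(lam1, lam2) proportional to
    # exp(-mu1*lam1 - mu2*lam2) * I(lam1 >= lam2 >= 0), by rejection
    # from the unconstrained exponential product. NumPy's exponential
    # sampler is parameterized by the scale 1/mu.
    out = []
    while len(out) < size:
        l1 = rng.exponential(1.0 / mu1)
        l2 = rng.exponential(1.0 / mu2)
        if l1 >= l2:
            out.append((l1, l2))
    return np.array(out)

draws = sample_prior(size=1000)
```

With μ_1 = μ_2 the acceptance rate of the rejection step is 1/2 by symmetry, so this is cheap; with μ_i = 0.01 the prior is very diffuse over the constrained region.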

Comparison
In order to test the performance of the different methods we have proposed, we performed simulations for two different parameter cases. In the first case, we set λ_x = λ_y = (20, 10), and in the second case we set λ_x = (20, 10) and λ_y = (20, 6). In both cases we simulated data for two different sample sizes, n = m = 20 and n = m = 200. For the four different combinations of parameter and sample size, we simulated 200 sets of data and performed inference using each of the three approaches we have introduced. The results are presented in Figure 2, where m_1 denotes the event that both populations have the same Bingham distribution. The first row shows the results for λ_x = λ_y = (20, 10) when n = 20 (left) and n = 200 (right). For both sample sizes, there is very good agreement between the RJMCMC and Chib's method. As expected, for the larger sample size, the (true) simpler model is favored for most of the datasets, whereas there is more variability in p(m_1) for the smaller sample size of 20. For the contour probabilities, with a sample size of 20, the smallest p-value was 0.046, and all other p-values were greater than 0.1 (75% were greater than 0.5). When the sample size was increased to 200, there were some smaller p-values, but all but 9 were greater than 0.05, as we would expect. In summary, the methods agree well and correctly favor the simpler model.
The second row shows the results when λ_x = (20, 10) and λ_y = (20, 6). Again, we see good agreement between the two approaches for calculating p(m_1). In the case when n = 20 (left), the majority of the points are concentrated in the upper right of the plot, with p(m_1) > 0.8, despite the samples coming from different Bingham distributions. In fact, in the next section we will investigate further the ability to discriminate between models for various parameter settings, and see that it is quite hard to distinguish between different Bingham distributions for the parameters used here when sample sizes are small. When n = 200 (right), there is a dense cluster of points in the bottom left of the plot (p(m_1) < 0.2), but still a fairly uniform spread of values across the remaining range, indicating that it is still reasonably challenging to discriminate between the two different Bingham distributions when the sample size is 200 per group. We also note that, for the points corresponding to larger estimates of p(m_1), the p-values from the contour method are larger too; for example, the points in the upper right of the plot have p-values greater than 0.01. So the three approaches are again in good agreement.

Model Discrimination
Now that we have a reliable method for determining model evidence, we attempt to answer the following question: how well can we discriminate between models in different parameter and sample size settings? Here, we present some further simulation experiments to investigate this. In particular, we wish to know how well we can discriminate between models when the parameters in each population are close together or far apart, for different sample sizes. Specifically, we set λ_x = (20, 10) and consider λ_y of the form (20, λ_2^{(y)}), with λ_2^{(y)} ∈ {2, 4, 6, 8, 10}. Hence, the cases range from equal parameters through to quite different parameters, in terms of the second eigenvalue. In each case, we also consider varying the sample size, with n = m and n ∈ {20, 100, 200, 500}. For each scenario, we simulated 200 datasets and ran the reversible jump method. Let p_1 be the posterior probability of the simpler model, that is, the model in which both populations have the same Bingham distribution. We estimate P(p_1 > 0.5) via the proportion of the 200 repetitions for which our estimate of p_1 was greater than 0.5; the results are plotted in Figure 3. We see that, as expected, it becomes easier to discriminate between models as n increases and as the parameters become either very similar or very different. More specifically, once we have n = 100, there is a high chance of choosing the correct model when λ_2^{(y)} = 2, 4, or 10. When n = 200 or 500, most cases are clear, apart from λ_2^{(y)} = 8, in which case there is still a good chance that the simpler model will be selected, even though the parameters are different. For the case n = 200, λ_2^{(y)} = 6, there is also a probability of around 0.4 that the simpler model will be preferred, which agrees with our observations in Section 5.1.

Earthquake Data Results
We now turn our attention to the motivating application described in Section 2. Recall that the data consist of three orthogonal axes arising from earthquakes in three situations: 50 observations near Christchurch which took place before a large earthquake on the 22nd of February 2011; 50 observations near Christchurch which took place after this large earthquake; and 32 observations obtained from earthquakes in the north of New Zealand's South Island. We label the three samples as CCA (Christchurch before the large earthquake), CCB (Christchurch after the large earthquake) and SI (South Island). We model the B axes from each earthquake, and assess the evidence that the samples come from the same Bingham distribution. Fallaize and Kypraios (2016) analyzed these data in a Bayesian framework by fitting Bingham models to the B axes from each of the individual clusters and considering the posterior distributions of the parameters of the diagonal component of the Bingham parameter matrix. Let us denote these parameters from the CCA, CCB, and SI models by λ^A_i, λ^B_i, and λ^S_i, respectively, i = 1, 2. The authors investigated informally whether there is any evidence of a difference between the two Christchurch clusters by considering the bivariate quantity (λ^A_1 − λ^B_1, λ^A_2 − λ^B_2), approximating its posterior distribution by a bivariate Normal distribution, and visually assessing whether the origin (0, 0) was contained comfortably within a 95% probability region. Here, we perform formal model selection to assess the evidence that the samples of B axes come from the same, or different, Bingham distributions. We perform two pairwise comparisons: CCA-CCB and CCA-SI.
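The informal check of Fallaize and Kypraios (2016) can be sketched as follows, assuming posterior draws of the bivariate difference are available. The function name `origin_in_region` is ours, and the chi-square(2) quantile is computed in closed form rather than from a library:

```python
import numpy as np

def origin_in_region(diff_samples, level=0.95):
    """diff_samples: posterior draws (rows) of the bivariate difference
    (lambda_1^A - lambda_1^B, lambda_2^A - lambda_2^B). Fit a bivariate
    Normal and report whether (0, 0) lies inside the `level` probability
    ellipse, via the squared Mahalanobis distance of the origin."""
    mu = diff_samples.mean(axis=0)
    cov = np.cov(diff_samples, rowvar=False)
    d2 = mu @ np.linalg.solve(cov, mu)   # squared Mahalanobis distance of (0, 0)
    q = -2.0 * np.log(1.0 - level)       # chi-square(2) quantile, closed form
    return bool(d2 <= q)

# Toy check: samples centred at the origin should contain (0, 0).
rng = np.random.default_rng(1)
print(origin_in_region(rng.normal(size=(5000, 2))))  # True
```

This is only the informal visual check recast as a computation; the formal model selection methods of this article do not rely on the Normal approximation.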
Let the data for the two samples under consideration be stored in matrices x̃ and ỹ, where the columns of x̃ are the observed axes x_i ∈ S^2, i = 1, . . ., n, and similarly for ỹ, which contains m observed axes. The combined data are contained in the 3 × (n + m) matrix ω. In practice, we take our data for x̃ to be the eigenvalues τ_x = {τ^x_j, j = 1, . . ., 3 : τ^x_1 < τ^x_2 < τ^x_3, Σ_j τ^x_j = 1} of the sample moment of inertia matrix x̃x̃^T/n, and similarly for ỹ. This is equivalent to first rotating to principal axes and taking V to be the identity matrix. (Recall that A = VΛV^T is the Singular Value Decomposition of the Bingham parameter matrix A.) For the combined data (i.e., when fitting one Bingham model to the combined samples) we can then take the data to be the weighted average of τ_x and τ_y.
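A minimal sketch of this preprocessing step, assuming the axes are stored as columns of a NumPy array (the function names and the sample-size weighting in `combined_tau` are our own illustrative choices):

```python
import numpy as np

def moment_of_inertia_eigvals(x):
    """x: 3 x n array whose columns are unit axes on the sphere S^2.
    Returns the eigenvalues tau_1 <= tau_2 <= tau_3 of the sample
    moment of inertia matrix x x^T / n; they sum to 1 since each
    column has unit norm (the matrix has unit trace)."""
    n = x.shape[1]
    S = x @ x.T / n
    return np.linalg.eigvalsh(S)  # returned in ascending order

def combined_tau(tau_x, n, tau_y, m):
    """One reading of the weighted average used for the combined data."""
    return (n * np.asarray(tau_x) + m * np.asarray(tau_y)) / (n + m)

# Toy usage with random unit axes:
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 50))
x /= np.linalg.norm(x, axis=0)
tau = moment_of_inertia_eigvals(x)
print(tau.sum())  # 1.0 up to rounding
```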
The results are summarized in Table 1. For the CCA-CCB comparison, the model evidence strongly favors the simpler model, whereas for the CCA-SI comparison, the model evidence supports the case that the data come from two different Bingham populations. The contour probabilities lead to the same conclusions. This suggests that there is no difference in the pattern of Christchurch earthquakes before and after the large event, but that Christchurch and South Island earthquakes follow different patterns. For a test of equality of two populations based on full axial frame data, Arnold and Jupp (2013) obtained a p-value of 0.890 for the CCA-CCB comparison and a p-value of less than 0.001 for the CCA-SI comparison, so our results agree with their conclusions.
The plots from the contour method are shown in Figure 4. For the CCA-CCB comparison, the point (0, 0) is comfortably contained within the main point cloud, whereas for the CCA-SI comparison it lies well outside it, agreeing with the conclusions from the numerical results in Table 1.

Discussion
In this article, we have introduced three strategies for performing model choice for the Bingham distribution in the Bayesian setting, which is not a straightforward task due to the intractable nature of the normalizing constant. Two of the approaches avoid the need to approximate this constant, while the third uses an approximation. Applications to simulated data show that the methods can successfully discriminate between models, although if the parameters are not very different, relatively large sample sizes are needed to detect a true difference. Analysis of a real dataset shows how formal model choice can be performed in a situation of practical interest to geophysicists.
Although we have focused on the Bingham distribution, the methods could be useful for other distributions used in directional statistics which have intractable normalizing constants. Also, although the RJMCMC and Chib's methods are in principle applicable to any number of models, we have focused on choosing between two models in our applications. It would be interesting to investigate performance for other distributions and as the number of competing models increases.

R code
The R code to run our methods is available in the online supplementary materials. This includes a "readme" file explaining how to run the code.

Figure 1. The B axes for the earthquake data. Top left: Christchurch before the large earthquake. Top right: Christchurch after the large earthquake. Bottom: South Island. The data have been rotated to principal axes, such that the principal axis is the North-South axis. The two Christchurch clusters are relatively more concentrated about the vertical axis, and the South Island cluster more dispersed. All points have been plotted in the same hemisphere to enable easier comparison of the variability in each sample. In some cases the antipodal point was the observation; as these are axial data, both are treated as the same observation and it does not matter which is observed.

Figure 3. Results from the simulation experiments described in Section 5.2. For each sample size/parameter combination, we simulated 200 datasets and computed the proportion of times that the simpler model was favored, as determined by the rule p_1 > 0.5.

Figure 4. Samples of (α_1, α_2) from the contour method for the CCA-CCB data (left) and CCA-SI data (right). The cross marks the point (0, 0) and the triangle denotes the maximum likelihood estimate.

Table 1. Results from the earthquake data.