Bayesian Classification of Multiclass Functional Data

We propose a Bayesian approach to estimating parameters in multiclass functional models. Unordered multinomial probit, ordered multinomial probit and multinomial logistic models are considered. We use finite random series priors based on a suitable basis such as B-splines in these three multinomial models, and classify the functional data using the Bayes rule. We average over models based on the marginal likelihood estimated from Markov Chain Monte Carlo (MCMC) output. Posterior contraction rates for the three multinomial models are computed. We also consider Bayesian linear and quadratic discriminant analyses on the multivariate data obtained by applying a functional principal component technique on the original functional data. A simulation study is conducted to compare these methods on different types of data. We also apply these methods to a phoneme dataset.


Introduction
Functional data analysis (FDA) deals with the analysis of data occurring in the form of functions. Wang et al. (2016) gave an overview of FDA, including functional principal component analysis, functional linear regression, and the clustering and classification of functional data. FDA is drawing increasing attention in many areas, such as biomedicine, environmental studies, and economics (Ullah and Finch, 2013). Mallor, Moler and Urmeneta (2017) proposed a model based on functional principal component analysis to predict household electricity consumption. Wagner-Muns et al. (2018) proposed a method that uses functional principal component analysis to forecast traffic volume. Classification of functional data, especially when the data units can come from more than two categories, is a fundamental problem of interest. Generalized linear models are often used to classify functional data (Müller and Stadtmüller, 2005; James, 2002). Linear discriminant analysis is also used for functional data classification (James and Hastie, 2001). Preda, Saporta and Lévéder (2007) proposed partial least squares regression on functional data for linear discriminant analysis. Rossi and Villa (2006) adapted support vector machines to functional data classification. Li and Yu (2008) proposed a functional segment discriminant analysis (FSDA), which combines classical linear discriminant analysis and support vector machines. Wavelet approaches have also been applied to classify and cluster functional data (Ray and Mallick, 2006; Antoniadis et al., 2013; Chang, Chen and Ogden, 2014; Suarez and Ghosal, 2016). There are also nonparametric approaches to functional data classification (Biau, Bunea and Wegkamp, 2005; Ferraty and Vieu, 2003). However, only a few approaches have been proposed in the context of Bayesian classification of functional data. Wang, Ray and Mallick (2007) developed a Bayesian hierarchical model which combines adaptive wavelet-based function estimation with logistic classification. Zhu, Vannucci and Cox (2010) proposed a Bayesian hierarchical model that takes into account random batch effects and selects effective functions among multiple functional predictors. Stingo, Vannucci and Downey (2012) proposed a Bayesian conjugate normal discriminant model on the wavelet transform of the functional data. Zhu, Brown and Morris (2012) introduced two Bayesian approaches: a Gaussian, wavelet-based functional mixed model and a robust, wavelet-based functional mixed model.
In this paper, we consider a response Y taking values k = 1, . . ., K, with a functional covariate {X(t), t ∈ [0, 1]}. The main problem is to estimate the probability P(Y = k|X), which can be conveniently modeled as H_k(∫ β(t)X(t) dt), where H_k is a cumulative distribution function and β(·) is an unknown coefficient function (or vector of coefficient functions). The unordered multinomial probit, ordered multinomial probit and multinomial logistic models considered in this paper correspond to different choices of H_k, k = 1, . . ., K. For the ordered multinomial probit model, there are additional order restrictions. Finite random series priors (Shen and Ghosal, 2015) are applied to the three multinomial models. We compare these methods with Bayesian linear and quadratic discriminant analyses applied to the data reduced to multivariate form by a functional principal component technique. Following a Bayesian approach, the posterior distribution of the parameters is obtained using the training data, and the classification rules are then applied to the test data using the posterior probability of class membership. The primary goal of a basis expansion method is to reduce a complex problem to a simpler one which either has a known solution or is likely to be easier to solve. A prior on a function through a finite random series is a standard tool in nonparametric Bayesian inference, but in the context of functional data the technique has not been utilized to its fullest potential, especially regarding the theoretical properties of Bayesian methods. Only Shen and Ghosal (2015) treated an example of functional linear regression using finite random series priors. We take that idea but develop it in the context of functional data classification. Characterizing the posterior contraction rate is a major goal of this paper. For this, we need to estimate the complexity of the model and the prior concentration. Even though the model reduces to a finite-dimensional setting from the computational point of view, the effect of the residual bias in the approximation of the function must be properly addressed. Hence the treatment is substantially different from that of a parametric problem. In particular, the dimension of the basis must adapt to the smoothness and the sample size, which is achieved by placing a prior on it.
The paper is organized as follows. In Section 2, the three functional multinomial models are described. Section 3 describes the application of the finite random series prior to these models. The marginal likelihood estimation is described in Section 4. In Section 5, the posterior contraction rates of the three functional multinomial models with finite random series priors are computed. Section 6 describes the Bayesian discriminant analysis of functional data, which is used for comparison with the proposed models. In Section 7, a simulation study is conducted on various types of data. In Section 8, the three multinomial models and Bayesian discriminant analysis are tested on a phoneme dataset.

Ordered Multinomial Probit Model
Let X_i(t), i = 1, . . ., n, t ∈ [0, 1], be the observed functional data associated with a categorical variable Y_i taking possible values 1, . . ., K. We assume that (X_i, Y_i), i = 1, . . ., n, are independent and identically distributed (i.i.d.) observations. Following Albert and Chib (1993), we consider the model described implicitly as follows: there exists a latent variable W_i distributed as N(∫ β(t)X_i(t) dt, 1), for i = 1, . . ., n, and Y_i = k if and only if γ_{k−1} < W_i ≤ γ_k, k = 1, . . ., K. (2.1) The latent variables W_i, i = 1, . . ., n, are independent. The coefficient function β(·) is unknown. The cut-points γ_k are also unknown, except that γ_0 = −∞ and γ_K = ∞. To ensure identifiability, we set γ_1 = 0. Under the assumed model, the probability of choosing category k is given by P(Y_i = k|X_i) = Φ(γ_k − ∫ β(t)X_i(t) dt) − Φ(γ_{k−1} − ∫ β(t)X_i(t) dt), (2.2) where Φ stands for the distribution function of the standard normal distribution.
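The class probabilities in (2.2) can be sketched numerically by differencing the normal CDF at consecutive cut-points. The following minimal Python sketch (function name and numbers are illustrative, not from the paper, which works in R) assumes the linear predictor ∫ β(t)X_i(t) dt has already been computed:

```python
import numpy as np
from scipy.stats import norm

def ordered_probit_probs(eta, cuts):
    """Class probabilities P(Y=k|X) = Phi(gamma_k - eta) - Phi(gamma_{k-1} - eta).

    eta  : linear predictor, the integral of beta(t) X_i(t) dt
    cuts : interior cut-points (gamma_1, ..., gamma_{K-1}); gamma_1 = 0 is fixed,
           gamma_0 = -inf and gamma_K = +inf are appended here.
    """
    gamma = np.concatenate(([-np.inf], cuts, [np.inf]))
    cdf = norm.cdf(gamma - eta)
    return np.diff(cdf)

# Illustrative: K = 3 categories, cut-points (0, 1.5), linear predictor 0.7
p = ordered_probit_probs(eta=0.7, cuts=[0.0, 1.5])
```

The probabilities sum to one by construction, since consecutive CDF differences telescope.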

Unordered Multinomial Probit Model
Let X_i(t), i = 1, . . ., n, be as in Section 2.1; the same setup is used in Section 2.3.
Let U_il = ∫ β_l(t)X_i(t) dt + ε_il, l = 1, . . ., K, with independent standard normal errors ε_il, and define the differenced latent variables W_il = U_il − U_iK, l = 1, . . ., K − 1. (2.3) The probability of choosing the kth (k = 1, . . ., K − 1) alternative is given by P(Y_i = k|X_i) = P(W_ik > 0 and W_ik ≥ W_il for all l = 1, . . ., K − 1), (2.4) and the probability of choosing alternative K is given by P(Y_i = K|X_i) = P(W_il < 0 for all l = 1, . . ., K − 1). (2.5)
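Because these orthant probabilities have no simple closed form for general means, a crude Monte Carlo sketch can make the choice rule concrete. The covariance with 2 on the diagonal and 1 off the diagonal matches the differenced-error structure noted later in Section 3.2; everything else here (function name, sample size, means) is illustrative:

```python
import numpy as np

def unordered_probit_probs(means, n_draws=200000, seed=0):
    """Monte Carlo estimate of the category probabilities (2.4)-(2.5).

    means : length K-1 vector of E[W_l]; the latent vector W has covariance
    with 2 on the diagonal and 1 off-diagonal (differenced standard normal
    errors).  Category l (l < K) is chosen when W_l is the largest latent
    variable and positive; category K when all W_l are negative.
    """
    rng = np.random.default_rng(seed)
    K1 = len(means)
    Sigma = np.eye(K1) + np.ones((K1, K1))   # 2 on diagonal, 1 off-diagonal
    W = rng.multivariate_normal(means, Sigma, size=n_draws)
    # argmax category if the maximum is positive, otherwise category K (coded K1)
    choice = np.where(W.max(axis=1) > 0, W.argmax(axis=1), K1)
    return np.bincount(choice, minlength=K1 + 1) / n_draws

p = unordered_probit_probs([0.5, -0.3])   # K = 3 categories
```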

Multinomial Logistic Model
In this model, the probability of choosing category k is given by P(Y_i = k|X_i) = exp(∫ β_k(t)X_i(t) dt) / Σ_{l=1}^K exp(∫ β_l(t)X_i(t) dt). (2.6) To ensure model identification, set β_K(t) = 0. Then the probability of choosing category k (k = 1, . . ., K − 1) is given by P(Y_i = k|X_i) = exp(∫ β_k(t)X_i(t) dt) / (1 + Σ_{l=1}^{K−1} exp(∫ β_l(t)X_i(t) dt)), (2.7) and the probability of choosing category K is given by P(Y_i = K|X_i) = 1 / (1 + Σ_{l=1}^{K−1} exp(∫ β_l(t)X_i(t) dt)). (2.8)
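Numerically, (2.7)-(2.8) form a softmax over the K − 1 linear predictors plus a zero for the identified Kth category. A minimal sketch (illustrative names and numbers):

```python
import numpy as np

def logistic_probs(etas):
    """Multinomial logistic probabilities with the identification beta_K(t) = 0.

    etas : length K-1 linear predictors eta_l = integral of beta_l(t) X(t) dt;
    the Kth predictor is fixed at 0, so
    P(Y = k) = exp(eta_k) / (1 + sum_l exp(eta_l)).
    """
    z = np.concatenate([np.asarray(etas, float), [0.0]])   # append eta_K = 0
    z -= z.max()                                           # numerical stability
    e = np.exp(z)
    return e / e.sum()

p = logistic_probs([1.0, -0.5])   # K = 3 categories
```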

Finite Random Series Prior
The functional coefficient β(t) (or β_1(t), . . ., β_K(t) for the unordered multinomial probit and multinomial logistic models) is given a prior which is a finite linear combination of chosen basis functions: β(t) = Σ_{j=1}^J θ_j ψ_j(t), where {ψ_1(t), . . ., ψ_J(t)} is a basis, for example formed by B-splines, Fourier functions, or wavelets. A prior is put on the unknown coefficients (θ_1, . . ., θ_J). The number of basis functions J is also unknown and should be given a hyperprior. Instead of sampling across different dimensions using reversible jump MCMC (Green, 1995), which poses computational difficulties for complicated models, we can implement MCMC for a given value of J and repeat it for the relevant values of J. Thus, we can compute the marginal likelihood m(Y|J) for potentially interesting values of J and obtain the posterior probability of J, as discussed in Section 4.
The advantage of using a finite random series prior is that the inner product between the functional coefficient and the functional data reduces to a simple linear combination, ∫ β(t)X_i(t) dt = Σ_{j=1}^J θ_j Z_ij, where Z_ij = ∫ ψ_j(t)X_i(t) dt is known and can be computed by Simpson's rule.
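As an illustration of computing Z_ij by Simpson's rule, the sketch below uses a toy polynomial basis in place of B-splines and an illustrative curve X_i; only the quadrature step mirrors the text:

```python
import numpy as np
from scipy.integrate import simpson

# Hypothetical setup: one curve X_i observed on an equally spaced grid on [0, 1]
t = np.linspace(0, 1, 101)
X = np.sin(2 * np.pi * t)                 # illustrative functional observation
psi = [np.ones_like(t), t, t ** 2]        # toy basis standing in for B-splines

# Z_ij = integral of psi_j(t) X_i(t) dt, computed by Simpson's rule on the grid
Z = np.array([simpson(p * X, x=t) for p in psi])
```

Here Z[0] approximates ∫ sin(2πt) dt = 0 and Z[1] approximates ∫ t sin(2πt) dt = −1/(2π).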

Ordered Multinomial Probit Model
Using a finite random series β(t) = Σ_{j=1}^J θ_j ψ_j(t), the model in (2.1) can be rewritten as W_i ∼ N(Σ_{j=1}^J θ_j Z_ij, 1), where Z_ij = ∫ ψ_j(t)X_i(t) dt. (3.2) Define θ = (θ_1, . . ., θ_J)^T and Z_i = (Z_i1, . . ., Z_iJ)^T. Then (3.2) can be written compactly as W_i ∼ N(Z_i^T θ, 1). (3.3) Assign a conjugate prior θ ∼ N_J(θ_0, B_0), where θ_0 is a J × 1 mean vector, B_0 is a J × J covariance matrix, and N_J stands for the J-variate normal distribution. Then the posterior distribution of θ is given by θ | Y, W ∼ N_J(B(B_0^{−1}θ_0 + Z^T W), B), where B = (B_0^{−1} + Z^T Z)^{−1}, (3.4) Z = (Z_1, . . ., Z_n)^T, and W = (W_1, . . ., W_n)^T. We follow the scheme introduced by Albert and Chib (1993). The posterior distribution of W_i is given by W_i | Y_i = k, θ ∼ TN(Z_i^T θ, 1, γ_{k−1}, γ_k), (3.5) where TN(Z_i^T θ, 1, γ_{k−1}, γ_k) is the truncation of the (univariate) normal distribution with mean Z_i^T θ and variance 1 to the interval (γ_{k−1}, γ_k). Albert and Chib (1993) assigned a diffuse prior on the cut-points. However, model averaging needs a proper prior, and a normal prior is not appropriate due to the order restriction on γ_1, . . ., γ_K. Albert and Chib (1997) proposed a transformation of the cut-points which avoids the order restriction: α_k = log(γ_{k+1} − γ_k), k = 1, . . ., K − 2. (3.6) Note that γ_1 = 0, and the cut-points are recovered by the inverse map γ_k = Σ_{j=1}^{k−1} exp(α_j), k = 2, . . ., K − 1. (3.7) Then γ can be reparameterized by α = (α_1, . . ., α_{K−2}). Assign a multivariate normal prior with mean α_0 and covariance A_0 on α. To sample γ, apply the following Metropolis-Hastings steps.
1. Sample α′ from a proposal distribution q(α′, α|Y, θ, W). Here we allow the proposal density to depend on the data and the two remaining blocks for the convenience of computing the marginal likelihood later.
2. Move to α′ from the current α with the acceptance probability ρ(α, α′|Y, θ, W) given in (3.8).
3. Compute γ by the inverse map (3.7).
To implement the MCMC sampling, first draw γ by the above steps, and then sample W and θ from the posterior distributions (3.5) and (3.4), respectively.
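The two Gibbs steps (3.5) and (3.4) with the cut-points held fixed can be sketched as follows. All data and dimensions below are synthetic and illustrative; the full sampler would also update γ via the Metropolis-Hastings steps above:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)

# Synthetic setup: n observations, J basis coefficients, K ordered categories
n, J, K = 200, 3, 3
Z = rng.normal(size=(n, J))                       # stand-in for the Z_ij integrals
theta_true = np.array([1.0, -0.5, 0.5])
gamma = np.array([-np.inf, 0.0, 1.0, np.inf])     # cut-points, gamma_1 = 0 fixed
W_true = Z @ theta_true + rng.normal(size=n)
Y = np.searchsorted(gamma, W_true)                # categories coded 1..K

theta0, B0inv = np.zeros(J), np.eye(J)            # N(theta0, B0) prior with B0 = I
B = np.linalg.inv(B0inv + Z.T @ Z)                # posterior covariance in (3.4)
theta = np.zeros(J)
for it in range(200):
    # (3.5): W_i | theta, Y_i = k ~ N(Z_i' theta, 1) truncated to (gamma_{k-1}, gamma_k)
    mu = Z @ theta
    a, b = gamma[Y - 1] - mu, gamma[Y] - mu       # truncnorm bounds are standardized
    W = truncnorm.rvs(a, b, loc=mu, scale=1.0, random_state=rng)
    # (3.4): theta | W ~ N(B(B0^{-1} theta0 + Z'W), B)
    theta = rng.multivariate_normal(B @ (B0inv @ theta0 + Z.T @ W), B)
```

After a short burn-in, draws of θ concentrate near the data-generating coefficients.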
The values of γ sampled from the Metropolis-Hastings algorithm converge quickly. We demonstrate this on the real data in Section 8 by plotting the sampled values of γ.

Unordered Multinomial Probit Model
Let β_l(t) = Σ_{j=1}^J θ_lj ψ_j(t), where l = 1, . . ., K. Then (2.3) can be rewritten in terms of the basis coefficients and the transformed data Z_ij = ∫ ψ_j(t)X_i(t) dt. Let θ̃_lj = θ_lj − θ_Kj, where j = 1, . . ., J. Define θ̃_l = (θ̃_l1, . . ., θ̃_lJ)^T and Z_i = (Z_i1, . . ., Z_iJ)^T. Then (3.9) expresses W_i = (W_i1, . . ., W_i,K−1) as a multivariate normal vector with mean (Z_i^T θ̃_1, . . ., Z_i^T θ̃_{K−1}) and covariance Σ. In the model described in Section 2, Σ is known, with 2 on the diagonal entries and 1 on all off-diagonal entries. The only parameter that needs to be estimated is Θ = (θ̃_1, . . ., θ̃_{K−1}). In order to draw the matrix Θ using Gibbs sampling, we can stack the data in matrix form using the matrix normal distribution MN(M, U, V), whose density is proportional to |U|^{−p/2} |V|^{−n/2} exp{−tr[V^{−1}(W − M)^T U^{−1}(W − M)]/2}, where M is an n × p mean matrix, U is an n × n row variance matrix, V is a p × p column variance matrix, tr stands for the trace of a matrix, and |U| and |V| denote the determinants of U and V, respectively.
Here the row variance-covariance matrix is I_n, the identity matrix of rank n, since W_1, . . ., W_n are independent. We consider the matrix normal prior Θ ∼ MN_{J×(K−1)}(U_0, V_0, Σ). By a standard conjugacy calculation, the posterior of Θ is again matrix normal, with parameters given in (3.13). To draw a sample of W, we use the method introduced by McCulloch and Rossi (1994). Let the vector Z_{i,•} denote the ith row of Z, the vector Θ_{•,l} denote the lth column of Θ, the matrix Θ_{•,−l} denote Θ without the lth column, the scalar Σ_{l,l} denote the (l, l)th entry of Σ, Σ_{−l,−l} denote Σ without the lth row and the lth column, Σ_{−l,l} denote the lth column of Σ without the lth entry, and Σ_{l,−l} denote the lth row of Σ without the lth entry. We draw W_il from the conditional truncation of the normal distribution with mean m_il and variance τ²_il to the interval (a, b) determined by Y_i, where m_il = Z_{i,•}Θ_{•,l} + Σ_{l,−l}Σ_{−l,−l}^{−1}(W_{i,−l} − Z_{i,•}Θ_{•,−l})^T and τ²_il = Σ_{l,l} − Σ_{l,−l}Σ_{−l,−l}^{−1}Σ_{−l,l}. (3.14) To implement the Gibbs sampling, we draw samples from (3.13) and (3.14).
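The conditional mean m_il and variance τ²_il in (3.14) are the usual multivariate normal conditioning formulas. A small sketch (illustrative names) using the fixed Σ of this section with K = 3:

```python
import numpy as np

def conditional_params(mu, Sigma, w, l):
    """Mean and variance of W_l | W_{-l} = w_{-l} for W ~ N(mu, Sigma):

    m_l  = mu_l + Sigma_{l,-l} Sigma_{-l,-l}^{-1} (w_{-l} - mu_{-l})
    tau2 = Sigma_{l,l} - Sigma_{l,-l} Sigma_{-l,-l}^{-1} Sigma_{-l,l}
    """
    idx = np.arange(len(mu)) != l                 # indices other than l
    S_inv = np.linalg.inv(Sigma[np.ix_(idx, idx)])
    s = Sigma[l, idx]
    m = mu[l] + s @ S_inv @ (w[idx] - mu[idx])
    tau2 = Sigma[l, l] - s @ S_inv @ s
    return m, tau2

Sigma = np.eye(2) + np.ones((2, 2))               # 2 on diagonal, 1 off-diagonal
m, tau2 = conditional_params(np.zeros(2), Sigma, np.array([0.0, 1.0]), 0)
```

With zero means and W_2 = 1, conditioning gives m = 0.5 and τ² = 1.5.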
where θ −k denotes all the blocks except the kth one.

Marginal Likelihood and Model Averaging
In Section 3, we described the MCMC sampling technique for a given value of J, which must be repeated for all possible values of J. In actual computation, however, it is impossible to consider all values of J. With a given prior on J, for example a geometric or Poisson distribution, the posterior probabilities for very small or very large values of J decay to zero very quickly, so these values need not be considered. Let J_1, . . ., J_S denote the values of J we need to consider. If we can obtain the marginal likelihood m(Y|J_s), then we can compute the posterior probability of J_s using Bayes's rule: P(J_s|Y) = m(Y|J_s)p(J_s) / Σ_{s′=1}^S m(Y|J_{s′})p(J_{s′}), (4.1) where p(J_s), s = 1, . . ., S, is the prior probability of J = J_s.
For each given J_s, we have a misclassification rate r_s, defined as the ratio of the number of falsely classified data to the total number of data. Then we can obtain the averaged misclassification rate for each multinomial model, r = Σ_{s=1}^S r_s P(J_s|Y). We call this the model averaging method.
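A sketch of this model averaging computation: posterior weights P(J_s|Y) from log marginal likelihoods and log prior probabilities (combined on the log scale for numerical stability), followed by the weighted misclassification rate. The numbers are purely illustrative:

```python
import numpy as np

def model_average(log_m, log_prior, rates):
    """Posterior weights P(J_s|Y) from log marginal likelihoods, and the
    model-averaged misclassification rate r = sum_s r_s P(J_s|Y)."""
    log_w = np.asarray(log_m) + np.asarray(log_prior)
    log_w -= log_w.max()                 # log-sum-exp trick for stability
    w = np.exp(log_w)
    w /= w.sum()
    return w, float(w @ np.asarray(rates))

# Illustrative numbers: log m(Y|J_s) for J = 5, 6, 7 and a geometric(0.5) prior on J
w, r = model_average([-410.2, -405.7, -406.9],
                     np.log([0.5 ** 5, 0.5 ** 6, 0.5 ** 7]),
                     [0.10, 0.07, 0.08])
```

The averaged rate always lies between the smallest and largest per-model rates.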
The marginal likelihood can be written as the normalizing constant of the posterior density, m(Y|J_s) = p(Y|B, J_s)π(B|J_s) / π(B|Y, J_s), (4.3) where B can be any convenient value of the parameter in the support of the posterior distribution, such as the posterior mean, because (4.3) holds for any B. The numerator is the product of the likelihood and the prior, and the denominator is the posterior density at B. For a given B*, the posterior density π(B*|Y, J_s) can be estimated from the Gibbs output (Chib, 1995) or the Metropolis-Hastings output (Chib and Jeliazkov, 2001). The estimated marginal likelihood on the logarithmic scale is then log m̂(Y|J_s) = log p(Y|B*, J_s) + log π(B*|J_s) − log π̂(B*|Y, J_s). (4.4) The following subsections give the details of estimating π(B*|Y, J_s).
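The identity (4.3) can be checked exactly on a toy conjugate model where the posterior density is available in closed form, which is the idea underlying Chib's estimator. The model below (normal data with a normal prior on the mean) is illustrative and not from the paper:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
n = 5
y = rng.normal(loc=0.3, size=n)

# Toy conjugate model: y_i ~ N(mu, 1), mu ~ N(0, 1); the posterior of mu is
# N(sum(y)/(n+1), 1/(n+1)), so Chib's identity can be evaluated exactly.
post_var = 1.0 / (n + 1)
mu_star = post_var * y.sum()                       # posterior mean, a "convenient value"

log_m_chib = (norm.logpdf(y, mu_star, 1).sum()     # log p(y | mu*)
              + norm.logpdf(mu_star, 0, 1)         # log prior at mu*
              - norm.logpdf(mu_star, mu_star, np.sqrt(post_var)))  # log posterior at mu*

# Direct marginal likelihood: integrating out mu gives y ~ N(0, I + 11')
log_m_direct = multivariate_normal.logpdf(y, np.zeros(n),
                                          np.eye(n) + np.ones((n, n)))
```

The two log marginal likelihoods agree to floating-point precision, illustrating that (4.3) is an identity rather than an approximation; the Monte Carlo error in practice comes only from estimating the posterior ordinate.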

Ordered Multinomial Probit Model
There are two parameter blocks in this model, θ and α, where α is the transformation of γ in (3.6). Given θ* = G^{−1} Σ_{g=1}^G θ^{(g)} and α* = G^{−1} Σ_{g=1}^G α^{(g)}, where {θ^{(g)}, α^{(g)}}_{g=1}^G are from the MCMC output, the joint posterior density can be decomposed as π(θ*, α*|Y, J_s) = π(α*|Y, J_s) π(θ*|Y, J_s, α*). The draws of W from the Gibbs sampler are from the distribution [W|Y, J_s], so π(θ*|Y, J_s, α*, W) cannot be averaged directly over the Gibbs sampling output. Additional sampling of W is needed: we sample θ^{(m)} from the density π(θ|Y, J_s, α*, W), and given θ^{(m)}, we sample W^{(m)}; averaging π(θ*|Y, J_s, α*, W^{(m)}) over these reduced-run draws yields an estimate of π(θ*|Y, J_s, α*). The explicit distribution of α* given (Y, J_s) is unknown, and hence the draws of α are obtained from Metropolis-Hastings sampling. By the local reversibility condition (see Chib and Jeliazkov (2001) for details), the posterior density of α* can be written as a ratio of two expectations, π(α*|Y, J_s) = E_1[ρ(α, α*|Y, J_s, θ, W) q(α, α*|Y, J_s, θ, W)] / E_2[ρ(α*, α|Y, J_s, θ, W)], where ρ(α, α*|Y, J_s, θ, W) is defined in (3.8), q(α, α*|Y, J_s, θ, W) is the proposal density, the expectation E_1 is with respect to the distribution π(θ, α, W|Y, J_s), and E_2 is with respect to the distribution π(θ, W|Y, J_s, α*) × q(α*, α|Y, J_s, θ, W).

Unordered Multinomial Probit Model
The only unknown parameter is Θ. Given Θ* = G^{−1} Σ_{g=1}^G Θ^{(g)}, where {Θ^{(g)}} are from the Gibbs sampling output, the posterior density of Θ at Θ* can be estimated by the Rao-Blackwellized average π̂(Θ*|Y, J_s) = G^{−1} Σ_{g=1}^G π(Θ*|Y, J_s, W^{(g)}), where {W^{(g)}}_{g=1}^G are also from the Gibbs sampling output. For the unordered multinomial probit model, we also need to estimate the likelihood at a convenient value in the support of the posterior distribution. From Section 3.2, Θ = (θ̃_1, . . ., θ̃_{K−1}), where θ̃_l = θ_l − θ_K, l = 1, . . ., K − 1. Then (2.5) can be rewritten as a (K − 1)-dimensional normal orthant probability involving the columns Θ_{•,l} of Θ.
Due to the exchangeable correlation structure of Σ, (4.13) can be reduced to a one-dimensional integral (Dunnett, 1989), given in (4.14). The expression in (4.12) can also be reduced to the same form as (4.14). Then (4.14) can be approximated by Gaussian quadrature as Σ_{q=1}^Q w_q f(x_q), (4.15) where w_q and x_q are the weights and roots of the Laguerre polynomial of order Q.

Thus, the likelihood of the unordered multinomial probit model can be approximated using (4.15).
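The Gauss-Laguerre rule in (4.15) approximates integrals of the form ∫₀^∞ e^{−x} f(x) dx by Σ_q w_q f(x_q). A minimal sketch using NumPy's built-in nodes and weights, with an illustrative integrand whose exact value is known:

```python
import numpy as np

# Gauss-Laguerre rule of order Q: the nodes x_q are the roots of the Laguerre
# polynomial of order Q, and w_q are the corresponding quadrature weights.
Q = 20
x, w = np.polynomial.laguerre.laggauss(Q)

# Example: integral_0^inf e^{-x} x^2 dx = Gamma(3) = 2, recovered exactly
# because the rule of order Q integrates polynomials up to degree 2Q - 1.
approx = np.sum(w * x ** 2)
```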

Multinomial Logistic Model
There are K − 1 unknown parameter blocks, θ_1, . . ., θ_{K−1}. Given θ*_l = G^{−1} Σ_{g=1}^G θ_l^{(g)}, where {θ^{(g)}}_{g=1}^G are from the Metropolis-Hastings sampling output, the joint posterior density can be written as the product of full conditional densities, π(θ*_1, . . ., θ*_{K−1}|Y, J_s) = Π_{i=1}^{K−1} π(θ*_i|Y, J_s, θ*_1, . . ., θ*_{i−1}). By local reversibility, each full conditional density can be written as a ratio of expectations, where q(θ_i, θ*_i|Y, J_s, Ψ*_{i−1}, Ψ_{i+1}) is the proposal density, Ψ*_{i−1} = (θ*_1, . . ., θ*_{i−1}), Ψ_{i+1} = (θ_{i+1}, . . ., θ_{K−1}), E_1 is the expectation with respect to the distribution π(θ_i, Ψ_{i+1}|Y, J_s, Ψ*_{i−1}), and E_2 is that with respect to π(Ψ_{i+1}|Y, J_s, Ψ*_i) × q(θ*_i, θ_i|Y, J_s, Ψ*_{i−1}, Ψ_{i+1}).

Posterior Contraction Rate
For classification problems, the most important object to study is the misclassification rate. By examining convergence to the true distribution, it follows that the Bayes procedure has a misclassification rate close to that of the oracle procedure, which uses the true values of the regression functions and other parameters (if any), e.g., the cut-points in the ordered multinomial probit model. In the Bayesian nonparametric setting, Hellinger convergence is established by applying the general theory (Ghosal and van der Vaart, 2017). Thus, in this section, we consider only the contraction rate of the posterior distribution with respect to a metric on the probabilities of the categories, which is equivalent to the Hellinger distance on the joint distribution. The posterior contraction rates of the three multinomial models with finite random series priors can be obtained using calculations similar to those in Shen and Ghosal (2015) on posterior contraction rates for finite random series.
We use ≲ to denote an inequality up to a constant multiple, and f ≍ g for f ≲ g ≲ f. For a vector θ, ‖θ‖ denotes its Euclidean norm; similarly, for a function f with respect to a measure G, ‖f‖_G denotes its L_2(G)-norm. Let N(ε, T, d) denote the ε-covering number of a set T for a metric d. Let h²(p, q) = ∫(√p − √q)² dμ be the squared Hellinger distance, and let K(p, q) = ∫ p log(p/q) dμ and V(p, q) = ∫ p log²(p/q) dμ be the Kullback-Leibler (KL) divergence and variation.
Theorem 1. Assume that the true class probabilities are bounded away from zero. Let ε_n ≥ ε̄_n be two sequences of positive numbers satisfying ε_n → 0 and nε̄²_n → ∞. Let X_0 be such that P(X ∈ X_0) = 1 and π_k(x), k = 1, . . ., K, is bounded away from 0 for x ∈ X_0. Let W_n be a subset of the parameter space such that conditions (5.1)-(5.7) hold for some positive constants a_2 and a_1 > a_2 + 2. The proof follows from Theorem 4 of Ghosal and van der Vaart (2007a), by observing the bound (5.8) and by a Taylor expansion (5.9). Let Π be a generic notation for priors on the number J of basis functions. As in Shen and Ghosal (2015), the priors on J and the coefficients θ = (θ_1, . . ., θ_J)^T of the basis functions need to satisfy conditions (A1) and (A2). For the ordered multinomial probit model, we add condition (A3).
where c_3 is some positive constant, H is chosen sufficiently large, and ε > 0 is sufficiently small. The geometric distribution (with t_1 = t_2 = 0) and the Poisson distribution (with t_1 = t_2 = 1) on J satisfy (A1). Multivariate normal distributions on θ and γ satisfy (A2) and (A3), respectively.
Remark 1. Parameter estimation plays a secondary role here. The problem of estimating model parameters is interesting in its own right but is not necessary for good classification. Cai and Hall (2006) and Yuan and Cai (2010) showed that estimation of the parameter function and prediction from an estimator of the parameter function have different characteristics.
Further, by Jensen's inequality, we have (5.21). If k = 1, by the mean value theorem and the uniform positivity of the standard normal density on compact intervals, we obtain (5.22), so (5.23) is also small. By the mean value theorem and the uniform positivity of the standard normal density on compact intervals again, we have (5.25). Hence |γ_2 − γ_{02}| is small. Similarly, we can show that |γ_k − γ_{0k}| is small for any k.
Following the same arguments as in (5.18)-(5.20), the same posterior contraction rate is obtained.
Theorem 4. Assume that ‖X‖_1 = ∫ |X(t)| dt is a bounded random variable, that the priors satisfy conditions (A1) and (A2), and that the basis ψ(t) satisfies (5.10) and (5.11) with r = ∞. Then the multinomial logistic model admits the same form of posterior contraction rate. Proof. The proof is similar to that of Theorem 3.

Discriminant Analysis
As a comparison to the multinomial models, we use Bayesian discriminant analysis to classify the functional data. Instead of modeling the class probability directly, discriminant analysis uses Bayes's rule together with the marginal likelihood of Y_i (Gelman et al., 2013). Classical discriminant analysis applies only to multivariate data. For functional data, we can reduce each curve to a set of orthogonal linear functionals f_j = ∫ β_j(t)X(t) dt, j = 1, . . ., m, and determine the classification probabilities from these. Ideally, β_1(t), . . ., β_m(t) would be treated as unknown, but putting a prior on these functions subject to identifiability restrictions is complicated. We instead take β_1(t), . . ., β_m(t) to be known as the first m principal components (Ramsay and Silverman, 2005), but let the means and the covariance matrices be unknown. Discriminant analysis can then be applied to the m principal component scores.
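The reduction of discretized curves to m principal component scores can be sketched with an SVD of the centered data matrix; this stands in for the functional principal component step of Ramsay and Silverman (2005), and the toy curves below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 101)
# Toy curves: random combinations of two smooth shapes plus small noise
X = (rng.normal(size=(50, 1)) * np.sin(2 * np.pi * t)
     + rng.normal(size=(50, 1)) * np.cos(2 * np.pi * t)
     + 0.05 * rng.normal(size=(50, 101)))

Xc = X - X.mean(axis=0)                       # center the discretized curves
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
m = 2
scores = U[:, :m] * s[:m]                     # first m principal component scores
```

The rows of Vt play the role of the discretized β_j(t), and the score matrix is the multivariate input handed to the discriminant analysis.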

Linear Discriminant Analysis
Linear discriminant analysis assumes that, for each of the K categories, the vector of linear functionals (f_1, . . ., f_m) follows a normal distribution with a common covariance matrix: (f_il1, . . ., f_ilm)^T ∼ N(μ_l, Σ), where μ_l is the population mean of category l, l = 1, . . ., K, i = 1, . . ., n_l, and n_l is the number of data in category l. Then the probability of choosing category k is given by P(Y = k|f_1, . . ., f_m) = p_k φ(f_1, . . ., f_m; μ_k, Σ) / Σ_{l=1}^K p_l φ(f_1, . . ., f_m; μ_l, Σ), where φ(f_1, . . ., f_m; μ, Σ) is the multivariate normal density function with mean μ and covariance Σ, and p_l, l = 1, . . ., K, is the probability of choosing category l.
The variables f_il1, . . ., f_ilm are the m principal component scores of X_i(t) in category l, where l = 1, . . ., K. Define f_il = (f_il1, . . ., f_ilm)^T, where i = 1, . . ., n_l and Σ_{l=1}^K n_l = n. To estimate the mean μ_l for each category l and the common covariance Σ among all categories, we use the conjugate normal-inverse-Wishart prior with hyperparameters (Gelman et al., 2013) for (μ_l, Σ). Then the posterior distribution of (μ_l, Σ) can be obtained by first drawing Σ from its inverse-Wishart posterior and then μ_l | Σ from its normal posterior, with the updated hyperparameters given in (6.5) and (6.6).
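The normal-inverse-Wishart update underlying (6.5) and (6.6) follows the standard conjugate formulas (Gelman et al., 2013). A sketch with illustrative hyperparameters and data:

```python
import numpy as np

def niw_update(mu0, kappa0, nu0, Lambda0, F):
    """Normal-inverse-Wishart posterior hyperparameters.

    Prior: Sigma ~ InvWishart(nu0, Lambda0), mu | Sigma ~ N(mu0, Sigma/kappa0).
    F is an (n x m) matrix of principal-component score vectors for one class.
    """
    n, fbar = F.shape[0], F.mean(axis=0)
    S = (F - fbar).T @ (F - fbar)                 # centered sum of squares
    kappan = kappa0 + n
    mun = (kappa0 * mu0 + n * fbar) / kappan      # shrunken posterior mean
    nun = nu0 + n
    d = (fbar - mu0)[:, None]
    Lambdan = Lambda0 + S + (kappa0 * n / kappan) * (d @ d.T)
    return mun, kappan, nun, Lambdan

rng = np.random.default_rng(3)
F = rng.normal(loc=[1.0, -1.0], size=(30, 2))     # illustrative scores, one class
mun, kappan, nun, Lambdan = niw_update(np.zeros(2), 1.0, 4.0, np.eye(2), F)
```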

Quadratic Discriminant Analysis
Quadratic discriminant analysis is defined in a similar way, except that each category has its own covariance matrix. The probability of choosing category k is given by the same formula as in linear discriminant analysis, with the common Σ replaced by category-specific Σ_k. To estimate the mean μ_l and the covariance Σ_l for each category l, where l = 1, . . ., K, we use the conjugate normal-inverse-Wishart prior with hyperparameters for (μ_l, Σ_l), for l = 1, . . ., K. Then the posterior distribution of (μ_l, Σ_l) can be obtained in the same order as before.


Simulation Study
The simulated data are generated following different data-generating processes. All of the simulated data have three categories. In all cases considered below, we generate the functional data from a Gaussian process at the discrete time points 0, 0.01, . . ., 0.99, 1, with mean function sin t and variation kernel 100 exp{−100(t_i − t_j)²}, where t_i and t_j are the discrete time points.
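Sampling a curve from the stated Gaussian process can be sketched via a Cholesky factor of the kernel matrix (the small jitter added to the diagonal is an implementation detail for numerical stability, not part of the model):

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 101)                    # grid 0, 0.01, ..., 1 as in the text
mean = np.sin(t)
# Variation kernel 100 exp{-100 (t_i - t_j)^2}
K = 100 * np.exp(-100 * (t[:, None] - t[None, :]) ** 2)
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(t)))   # jitter for numerical stability
X = mean + L @ rng.normal(size=len(t))        # one simulated functional observation
```

Repeating the last line with fresh normal draws produces the i.i.d. curves used in the simulation.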
For the ordered multinomial probit data, the coefficient function β(t) is plotted in Figure 1 (a), and the four threshold points are chosen to be −∞, 0, 8, ∞. These four cut-points form three intervals. If the inner product of a functional datum with the coefficient function, plus a standard normal variable, falls in the kth interval (γ_{k−1}, γ_k), then the functional datum is assigned to category k.
For the unordered multinomial probit data, the coefficient functions β_1(t), β_2(t), β_3(t) are plotted in Figure 1 (b)-(d). The inner products of a functional datum with the three coefficient functions are each perturbed by a standard normal variable. We sample these three normal variables, and the functional datum is assigned to the category with the largest sampled value.
For the multinomial logistic data, the coefficient functions β_1(t), β_2(t) are plotted in Figure 1 (e)-(f), and the third coefficient function β_3(t) is taken to be identically zero. We compute the probability of a functional datum falling into each category, and the datum is assigned to the category with the largest probability.
For data satisfying the assumptions of linear discriminant analysis, we generate curves from three Gaussian processes with different mean functions sin t + 2 cos t, sin t, and sin t − 3 cos t, but the same variation kernel exp{−30(t_i − t_j)²}.
For data satisfying the assumptions of quadratic discriminant analysis, we generate curves from three Gaussian processes with different mean functions and different variation kernels. The mean functions are sin t + 2 cos t, sin t, and sin t − 3 cos t, and the variation kernels are exp{−2 sin²(π(t_i − t_j))}, exp{−30(t_i − t_j)²}, and exp{−|t_i − t_j|}, respectively.
In this simulation study, we generate a total of 900 functional data (300 per category) for each type of dataset. We construct the training set from 720 of them (240 per category) and the testing set from the remaining 180 (60 per category).

Basis Functions
For the models using the finite random series prior, we consider the B-spline basis. The B-spline basis functions on the interval [0, 1] can be created using the R package fda. In this simulation study, we put a geometric prior with p = 0.5 on J. We consider only J = 5, . . ., 15 basis functions, since the posterior probability outside this range is negligible. The B-spline basis functions are evaluated at the same discrete time points as the functional data, that is, 0, 0.01, . . ., 0.99, 1.
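A cubic B-spline basis analogous to the one produced by create.bspline.basis in the R package fda can be sketched in Python as follows; the clamped, equally spaced knot placement is an assumption mirroring the fda default:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(J, t, degree=3):
    """Evaluate J B-spline basis functions of the given degree on [0, 1] at t.

    Interior knots are equally spaced; boundary knots are repeated degree + 1
    times (a clamped knot vector), so the basis sums to one on [0, 1].
    """
    n_interior = J - degree - 1
    knots = np.concatenate([np.zeros(degree + 1),
                            np.linspace(0, 1, n_interior + 2)[1:-1],
                            np.ones(degree + 1)])
    # The jth basis function is a BSpline with coefficient vector e_j
    return np.column_stack([BSpline(knots, np.eye(J)[j], degree)(t)
                            for j in range(J)])

t = np.linspace(0, 1, 101)        # the grid 0, 0.01, ..., 1 used in the text
Psi = bspline_basis(8, t)         # 101 x 8 matrix of basis evaluations
```

The columns of Psi can then be paired with the curves via Simpson's rule to obtain the Z_ij of Section 3.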

Results
Under the chosen models, we apply the Bayesian estimation methods described in Section 3 to the training data. In this study, 5000 MCMC iterations are obtained, and the first 1000 are discarded as burn-in. We use the last 4000 MCMC draws of the parameter B to classify the 180 transformed testing data, where B = (θ, γ_2, γ_3) for the ordered multinomial probit model, B = Θ for the unordered multinomial probit model, B = (θ_1, θ_2) for the logistic model, B = (μ_1, μ_2, μ_3, Σ) for the linear discriminant analysis model, and B = (μ_1, μ_2, μ_3, Σ_1, Σ_2, Σ_3) for the quadratic discriminant analysis model. Then we use the techniques described in Section 4 to average the results from the multinomial models. As a comparison with the Bayesian methods, a linear support vector machine (SVM) is also fitted to the principal components of the training data and used to make predictions on the testing data. To apply the SVM, we use the R package e1071.
Table 1 shows the averaged misclassification rates for each data type under different models.

Application
We also test our models on phoneme data. This dataset can be found in the R package fds, and also at https://www.math.univ-toulouse.fr/staph/npfda/. The original data contain 2000 (X, Y) pairs and five categories. For computational efficiency, we use only 900 of them from three categories. We split the data into training and testing sets by randomly sampling from each class, keeping the same percentage of samples in each class as in the complete set. The size of the testing set is 20% of the total data size; that is, we have 240 data per class in the training set and 60 data per class in the testing set. We put a geometric prior with p = 0.5 on J, and it suffices to consider J = 5, . . ., 15 B-spline basis functions. We obtain 5000 MCMC iterations and discard the first 1000 as burn-in. According to Table 2, the unordered multinomial probit model is the best model for the phoneme data. For these data, the categories are not naturally ordered, and hence the ordered multinomial probit model is not natural for this problem, but we include it in the analysis for comparison. Figure 2 displays the cut-point γ_2 sampled by Metropolis-Hastings under different values of J, and we can see that γ_2 converges within about 500 iterations. Tables 3, 4, and 5 show the estimates and standard errors of the posterior means for the phoneme data under the ordered multinomial probit model, the multinomial logistic model, and the unordered multinomial probit model, for J = 6, J = 10, and J = 14, respectively. We choose these values of J because under them each model attains the largest posterior probability P(J|Y). Although the ordered multinomial probit model is not intuitive in this context, its performance is not much inferior.

Table 1 :
Averaged misclassification rates for simulated data

Table 2 :
Averaged misclassification rates for phoneme data

Table 3 :
Estimate and standard error of the posterior mean for the ordered multinomial model (J = 6)

Table 4 :
Estimate and standard error of the posterior mean for the multinomial logistic model (J = 10)

Table 5 :
Estimate and standard error of the posterior mean for the unordered multinomial model (J = 14)