Bayesian Estimation of MIRT Models with General and Speciﬁc Latent Traits in MATLAB

Multidimensional item response models have been developed to incorporate a general trait and several speciﬁc trait dimensions. Depending on the structure of these latent traits, diﬀerent models can be considered. This paper provides the requisite information and description of software that implement the Gibbs sampling procedures for three such models with a normal ogive form. The software developed is written in the MATLAB package IRTm2noHA . The package is ﬂexible enough to allow a user the choice to simulate binary response data with a latent structure involving general and speciﬁc traits, specify prior distributions for model parameters, check convergence of the MCMC chain, and obtain Bayesian ﬁt statistics. Illustrative examples are provided to demonstrate and validate the use of the software package.


Introduction
Item response theory (IRT) is a popular approach for solving various measurement problems, and has been found useful in a wide variety of applications (e.g., Bezruckzo 2005; Boomsma, van Duijn, and Snijders 2001;Chang and Reeve 2005;Embretson and Reise 2000;Feske, Kirisci, Tarter, and Plkonis 2007;Imbens 2000;Kolen and Brennan 1995;Lord 1980).For binary response data, IRT provides a collection of models that describe how persons and items interact to yield probabilistic 0/1 responses.The influence of each person and item on the response is modeled by distinct sets of parameters, namely, the person latent trait(s) θ i and the item's characteristics ξ j so that P (y = 1) = f (θ i , ξ j ), where f can assume a probit or a logit function, giving rise to what is called in the IRT literature a normal ogive model or a logistic model.Unidimensional IRT models assume one θ i parameter for each person, signifying that each test item measures some facet of the unified latent trait.In some situations, it suffices to assume one latent dimension and use the unidimensional model.However, in many other situations where multiple traits are required in producing the manifest responses to an item, multidimensional IRT (MIRT; Reckase 2009) models need to be considered.A special case of the MIRT model is the multi-unidimensional IRT model (Lee 1995;Sheng 2008b;Sheng and Wikle 2007), applicable in situations where a test involves multiple traits with each item measuring one of them.The items measuring a specific latent trait can be grouped together to form a subtest, and hence the test can be viewed as one consisting of multiple such unidimensional subtests.In such situations, it is also reasonable to assume that each subtest measures some facet of an overall or general latent trait dimension and hence to consider MIRT models incorporating both the general trait and traits specific for individual subtests so that separate inferences can be made about persons for each of them.Here, the general trait signifies the unified latent dimension that involves cognitive process common for all test items, whereas the specific trait involves cognitive process needed for answering items in an individual subtest that may be related to the general trait or other specific traits.Different such MIRT models have been developed in the literature so that the general trait may be underlying the specific traits to form a hierarchical structure (Sheng and Wikle 2008), or the general and specific traits form an additive structure, i.e., each item measures a general trait and a specific trait directly (Sheng and Wikle 2009).These two types of models make different assumptions on the underlying continuous dimensions.Their latent structures can be comparable to that of second-order factor models (Schmid and Leiman 1957) and that of bifactor models (Holzinger and Swineford 1937, though bifactor models assume that the general and specific factors are orthogonal), respectively, which have been well studied in the factor analytic framework.Recent research on factor analysis indicates that the second-order and bifactor models, though with similar interpretations, are not mathematically equivalent (Gustafsson and Balke 1993;McDonald 1999;Mulaik and Quartetti 1997;Yung, Thissen, and McLeod 1999).In particular, Yung et al. (1999) have demonstrated that the former is actually hierarchically nested within the latter (see also Rindskopf and Rose 1988).This finding further sheds light on the relation between the two MIRT models considered in this paper.
Parameter estimation offers the basis for the theoretical advantages of IRT, and hence has been a major concern in the application of IRT models.As the influence of items and persons on the responses is modeled by distinct sets of parameters, simultaneous estimation of these parameters results in statistical complexities in the estimation task, which have made estimation procedure a primary focus of psychometric research over decades (Birnbaum 1969;Bock and Aitkin 1981;Molenaar 1995).Recent attention is focused on the fully Bayesian approach using Markov chain Monte Carlo (MCMC) simulation techniques, which are extremely general and flexible and have proved useful in practically all aspects of Bayesian inferences, such as parameter estimation or model comparisons.One of the simplest MCMC algorithms is Gibbs sampling (Casella and George 1992;Geman and Geman 1984).The method is straightforward to implement when each full conditional distribution associated with a particular multivariate posterior distribution is a known distribution that is easy to sample.Gibbs sampling has been applied to common unidimensional models (Albert 1992;Sahu 2002) and two-parameter normal ogive (2PNO) multi-unidimensional models (Lee 1995) using the data augmentation idea of Tanner and Wong (1987).Sheng andWikle (2008, 2009) further extended their approaches and developed the Gibbs sampling procedures for the MIRT models with a general trait and several specific trait dimensions.The model with an additive structure, in particular, has been found to provide a better description to the data in various situations compared with the other IRT models (Sheng and Wikle 2009).
This paper provides a MATLAB (The MathWorks, Inc. 2007) package that implements Gibbs sampling for the 2PNO MIRT models incorporating a general and several specific traits with the option of specifying uniform or conjugate priors for item parameters.One may note that different R packages have been developed to estimate dichotomous and/or polytomous IRT models (see e.g., Anderson, Li, and Vermunt 2007;Johnson 2007;Mair and Hatzinger 2007;Rizopoulos 2006).These packages focus on the maximum likelihood estimation of either Rasch models or models with one latent trait, and shall be differentiated from the fully Bayesian approach and/or the multidimensional dichotomous IRT models considered in this study.The remainder of the paper is organized as follows.Section 2 reviews the models and Section 3 briefly describes the MCMC algorithms implemented in the package IRTn2noHA.In Section 4, a brief illustration is given of Bayesian model choice or checking technique for testing the adequacy of a model.The package IRTn2noHA is introduced in Section 5, where a description is given of common input and output variables.In Section 6, illustrative examples are provided to demonstrate the use of the source code.Finally, a few summary remarks are given in Section 7.This paper only considers normal ogive models given that more complicated MCMC procedures have to be adopted for the logistic form of these models (e.g., Patz and Junker 1999a,b) and that the logistic and normal ogive forms of the IRT model are essentially indistinguishable in model fit or parameter estimates given proper scaling (Birnbaum 1968;Embretson and Reise 2000).

IRT models with general and specific traits
Suppose a K-item test consists of m subtests, each containing k v dichotomous (0/1) items, where v = 1, 2, . . ., m.Let y vij denote the ith person's response to the jth item in the vth subtest, where i = 1, 2, . . ., n and j = 1, 2, . . ., k v1 .With a probit link, the probability of a correct response can be defined as where θ vi is a scalar person trait parameter specific for the vth latent dimension, α vj is a positive scalar slope parameter representing the item discrimination in the vth dimension, and β vj is a scalar intercept parameter that is related to the location in the vth dimension where the item provides maximum information.This probability function takes the form of that of the multi-unidimensional model (see Sheng and Wikle 2007;Sheng 2008b) where m specific latent traits, θ vi , are considered.We may incorporate an additional general latent dimension by assuming a hierarchical structure so that each θ vi either (a) is a linear function of the general trait parameter θ 0i , or (b) linearly combines to form the general trait parameter, (Sheng and Wikle 2008), where δ v and λ v are positive coefficients relating each θ vi with θ 0i .
For simplicity, the two formulations are referred to as hierarchical model 1 and hierarchical model 2, respectively.
Alternatively, the general and specific latent dimensions can assume an additive structure so the probability function is defined as (Sheng and Wikle 2009), where α 0vj is a positive scalar slope parameter associated with θ 0i .Hence, a correct response is assumed to be determined directly by two latent traits -a general one and a specific one.Due to the additive structure of the latent traits, this model is referred to as the additive model in this paper.As Figure 1 shows, it differs from the two hierarchical models in the effect of the general trait on item responses.One may note the similarity of this model with the original bifactor model proposed by Holzinger and Swineford (1937), where each item has a nonzero loading on a primary factor and a second loading on only one of several secondary factors.However, they differ in several aspects.First, the additive model is for binary response data and is hence essentially similar to the bifactor full-information factor model presented by Gibbons and Hedeker (1992); see also Li, Bolt, and Fu (2006).Second, as with other IRT models, the additive model uses "difficulties" (which are related to intercepts in our discussion) and "discriminations" instead of "thresholds" and "loadings" as with factor models.Third, in the additive model, it is assumed that the slope is positive, whereas in the bifactor model, the threshold can assume negative values.Finally, the bifactor model constrains the correlation between the general and specific dimensions to be zero, whereas the additive model does not, though it is pointed out later in the paper that large correlations tend to result in high collinearity, which creates problem in parameter estimation.Since the additive model specified in ( 4) is more complex than the conventional IRT models or even the hierarchical models, certain constraints need to be adopted to help identification.See Section 3.3 for a detailed illustration of the model specification.

MCMC algorithms
Gibbs sampling is one of the simplest MCMC algorithms that can obtain item and person parameter estimates simultaneously.The method is particularly straightforward when each full conditional distribution can be obtained in closed form.Its general underlying strategy is to iteratively sample item and person parameters from their respective posterior distributions, conditional on the sampled values of all other person and item parameters.This iterative process continues for a sufficient number of samples after the posterior distributions reach stationarity (a phase known as burn-in).As with standard Monte Carlo, the posterior means of all the samples collected after the burnin stage are considered as estimates of the true model parameters.Similarly, their posterior standard deviations are used to describe the statistical uncertainty.However, Monte Carlo standard errors cannot be calculated using the sample standard deviations because subsequent samples in each Markov chain are autocorrelated (e.g., Patz and Junker 1999b).Among the standard methods for estimating them (Ripley 1987), batching is said to be a crude but effective method (Verdinelli and Wasserman 1995) and hence is considered in this paper.
Here, with a long chain of samples being separated into contiguous batches of equal length, the Monte Carlo standard error for each parameter is then estimated to be the standard deviation of these batch means.The Monte Carlo standard error of the estimate is hence a ratio of the Monte Carlo standard error and the square root of the number of batches.More sophisticated methods for estimating standard errors can be found in Gelman and Rubin (1992).
The Gibbs sampler for each MIRT model involving both general and specific latent traits is now described.

Gibbs sampling for the hierarchical model 2
Assuming θ 0i ∼ N ( v λ v θ vi , 1), the hierarchical model 2 calls for a slightly different Gibbs sampling procedure.In particular, we can assume a multivariate normal prior density for person specific trait parameters θ i so that θ i ∼ N m (0, P), where P is a covariance matrix with the variances being fixed at 1.It follows that the off-diagonal element of P is the correlation ρ st between θ si and θ ti , s = t, i = 1, . . ., m.It is noted that the proper multivariate normal prior for θ vi with their location and scale parameters specified to be 0 and 1 ensures unique scaling and hence is essential in resolving a particular identification problem for this model (see e.g. Lee 1995, for a description of the problem).Moreover, introduce an unconstrained covariance matrix Σ, where Σ = [σ vv ] m×m , so that the correlation matrix P can be readily transformed from Σ using (the same approach was adopted for the multi-unidimensional model; see Lee 1995;Sheng 2008b).A noninformative prior can be assumed for Σ so that p(Σ) Hence, with prior distributions assumed for ξ vj and λ, where λ = (λ 1 , . . ., λ m ) , the joint posterior distribution of (θ, ξ, Z, where f (y|Z) is as defined in ( 6).
The implementation of the Gibbs sampler thus involves six processes, namely, a sampling of the augmented Z parameters from ( 7); a sampling of the item parameters ξ from (10) assuming uniform priors or from (11) assuming conjugate normal priors; a sampling of person specific trait parameters θ from where A and B are as defined in (8); a sampling of the person general trait parameters θ 0 from (3); a sampling of the hyperparameters λ from where (an inverse Wishart distribution), where S = n i=1 α mj ) 1/km ; and a transformation of Σ to P using (14).

Gibbs sampling for the additive model
For the additive model, the Gibbs sampler again involves the introduction of an augmented variable Z so that Z vij ∼ N (η vij , 1), where η vij = α 0vj θ 0i +α vj θ vi −β vj .Assume a multivariate normal prior distribution for person general and specific trait parameters θ i , where θ i = (θ 0i , θ 1i , . . ., θ mi ) , so that θ i ∼ N m+1 (0, P).Here, P is a constrained covariance matrix, with the diagonal element being 1 and the off-diagonal element being the correlation ρ st between θ si and θ ti , s = t, i = 0, . . ., m.It is noted again that the location and scale parameters for person parameters are specified to be 0 and 1, which ensures unique scaling and helps resolve the model identification problem.Similar to the procedure described in Section 3.2, introduce an unconstrained covariance matrix Σ = [σ vv ] (m+1)×(m+1) so that P can be transformed from Σ using ( 14), and assume p(Σ) ∝ |Σ| − m+2 2 .Then, with prior distributions assumed for ξ vj , where ξ vj = (α 0vj , α vj , β vj ) , the joint posterior distribution of (θ, ξ, Z, Σ) is where is the likelihood function with p vij being as defined in (4).
Due to an additional model indeterminacy resulted from the additive nature of θ 0i and θ vi , further constraints are adopted to help identify the model.Specifically, at each iteration, the θ 0i and θ vi parameters are normalized to have a mean of 0 and a standard deviation of 1, with the corresponding α 0vj , α vj and β vj parameters being rescaled so that the likelihood is preserved (see Bafumi et al. 2005).This procedure, as mentioned in Section 3.1, also improves mixing and computational efficiency.

Bayesian model choice or checking
In Bayesian statistics, the adequacy of the fit of a given model is evaluated using several model choice or checking techniques, among which, Bayesian deviance and posterior predictive model checks are considered and briefly illustrated.

Bayesian deviance
The Bayesian deviance information criterion (DIC; Spiegelhalter, Best, Carlin, and van der Linde 2002) is based on the posterior distribution of the deviance.This criterion is defined as DIC = D + p D , where D = E(−2 log L(y|ϑ)) is the posterior expectation of the deviance (with L(y|ϑ) being the model likelihood function, where ϑ denotes all model parameters) and p D = D − D( θ) is the effective number of parameters (Carlin and Louis 2000).Further, D( θ) = −2 log(L(y| θ)), where θ is the posterior mean.To compute Bayesian DIC, MCMC samples of the parameters, ϑ (1) , . . ., ϑ (G) , can be drawn using the Gibbs sampler, . Small values of the deviance suggest a better-fitting model.Generally, more complicated models tend to provide better fit.Hence, penalizing for the number of parameters (p D ) makes DIC a more reasonable measure to use.

Posterior predictive model checks
The posterior predictive model checking (PPMC; Rubin 1984) method provides a popular Bayesian model checking technique that is intuitively appealing, simple to implement, and easy to interpret (Sinharay and Stern 2003).The basic idea is to draw replicated data y rep from its posterior predictive distribution p(y rep |y) = p(y rep |ϑ)p(ϑ|y)dϑ, and compare them to the observed data y.If the model fits, then replicated data generated under the model should look similar to the observed data.A test statistic known as the discrepancy measure T (y, ϑ) has to be chosen to define the discrepancy between the model and the data.For each y rep drawn from the predictive distribution, the realized discrepancy T (y) = T (y, ϑ) can be compared with the predictive discrepancy T (y rep ) = T (y rep , ϑ) by plotting the pairs on a scatter plot.Alternatively, one can obtain a quantitative measure of lack of fit by calculating the tail-area probability or the PPP-value (Sinharay, Johnson, and Stern 2006), P (T (y rep ) ≥ T (y)|y) = T (y rep )≥T (y) p(y rep |ϑ)p(ϑ|y)dy rep dϑ.
To implement the method, one draws G samples from the posterior distribution of ϑ using Gibbs sampling.Then for each simulated ϑ (g) , a y rep(g) can be drawn from the predictive distribution so there are G draws from the joint posterior distribution p(y rep , ϑ|y).The predictive test statistic T (y rep(g) ) and the realized test statistic T (y) are computed and subsequently compared to provide graphical and numerical evidence about model inadequacy.Specifically, the proportion of the G simulated samples for which the replicated data could be more extreme than the observed data, i.e., 1 G G g=1 I(T (y rep(g) ) ≥ T (y)), provides an estimate of the PPP-value.Extreme PPP-values (close to 0 or 1) indicate model misfit.It is noted that although both are defined as tail-area probabilities, the PPP-value has to be differentiated from the traditional hypothesis-testing p-value in that the posterior predictive checking approach does not perform a hypothesis test (Gelman, Meng, and Stern 1996).The choice of the discrepancy measure is critical in implementing this method.Sinharay et al. (2006), after evaluating a number of discrepancy measures for assessing IRT models, concluded that the odds ratio for measuring associations among item pairs, T (y) = OR ij = n 11 n 00 n 01 n 10 , is powerful for detecting lack of fit when the model assumption is violated.Hence, this measure is adopted as the test statistic for the PPMC method in this paper.

Package IRTm2noHA
The package IRTm2noHA contains two major user-callable routines.A function for generating binary response data using the multidimensional model involving a general and several specific latent dimensions titled, simm2noHA, and a function that implements Gibbs sampler to obtain posterior samples, estimates, convergence statistics, or model choice/checking statistics, gsm2noHA.
The function simm2noHA has input arguments n, kv, model, b, r, and iparm for the number of respondents, the number of items in each subtest, the MIRT model based on which the binary response is generated (model = 1 for the hierarchical model 1, model = 2 for the hierarchical model 2, and model = 3 for the additive model), the actual δ or λ coefficients relating the general trait to the specific traits in the two hierarchical models, the actual correlation matrix for person specific traits in the hierarchical model 2 or for person traits in the additive model, and user-specified item parameters, respectively.The optional iparm argument allows the user the choice to input item parameters for the model, or randomly generate them from uniform distributions so that α vj ∼ U (0, 2), β vj ∼ U (−2, 2) for the hierarchical models, or ) for the additive model.The user can further choose to store the simulated person (theta) and item (item) parameters.It is noted that a transformation is adopted in the Gibbs sampler for the hierarchical model 1 to rescale the θ vi estimates (see Section 3.1).To keep the simulated values on the same scale, θ vi are also normalized to have a mean of 0 and a standard deviation of 1 for this model.
With the required user-input binary response data (y) and number of items in each subtest (kv), the function gsm2noHA initially reads in starting values for person and item parameters (th0, item0) and the person hyperparameters (b0, sigma0), or sets them to be θ = 1, and vj = 1, and P (0) = I for the additive model.It then implements the Gibbs sampler to a userspecified MIRT model incorporating general and specific traits (model) and iteratively draws random samples for the parameters and hyperparameters from their respective full conditional distributions.The prior distributions for the item parameters or the δ/λ parameters can be uniform (unif = 1, default) or conjugate (unif = 0).In the latter case, the user can specify any values of interest or use the default values, namely, µ αv = 0, µ βv = 0 (xmu), σ 2 αv = 1, σ 2 βv = 1 (xvar) for α vj and β vj , and µ δv /µ λv = 0 (bmu), σ 2 δv /σ 2 λv = 1 (bvar) for δ v /λ v in the hierarchical models, or µ α 0v = 0, µ αv = 0, µ βv = 0 (xmu), σ 2 α 0v = 1, σ 2 αv = 1, σ 2 βv = 1 (xvar) for α 0vj , α vj and β vj in the additive model.It is noted that the prior location and scale parameters for α vj , β vj , or α 0vj can be set to be different across the m subtests.The algorithm continues until all the (kk) samples are simulated, with the early burn-in samples (burnin) being discarded, where kk and burnin can be 10, 000 and kk/2 (default) or any values of interest.It then computes the posterior estimates, posterior standard deviations, and Monte Carlo standard errors of the person (pparm), item (iparm) or hyperparameter (hparm) estimates.Posterior samples after the burn-in stage of these parameters can also be stored (samples) for further analysis.
In addition to Monte Carlo standard errors, convergence can be evaluated using the Gelman-Rubin R statistic (Gelman et al. 2004) for each person or item parameter.The usual practice is using multiple Markov chains from different starting points.Alternatively, a single chain can be divided into sub-chains so that convergence is assessed by comparing the between and within sub-chain variances.Since a single chain is less wasteful in the number of iterations needed, the latter approach is adopted to compute the R statistic (gr) with gsm2noHA.The Bayesian deviance estimates, including D, D( θ), p D and DIC, can be obtained (deviance) to measure the relative fit of a model.Moreover, the PPMC method can be adopted using the odds ratio as the discrepancy measure so that PPP-values (ppmc) are obtained to evaluate model misfit.Extreme PPP-values can be further plotted using the function ppmcplt, in which the threshold (crit) can be 0.01 (default) or any level of interest so that PPP-values larger than 1 − crit/2 are denoted using the right triangle sign and those smaller than crit/2 are denoted using the left triangle sign.The functions' input and output arguments are completely specified in the m-files.

Illustrative examples
To demonstrate the use of the IRTm2noHA package, simulated and real data examples are provided in this section to illustrate item parameter recovery as well as model comparisons.The code to reproduce the results of each example is provided in the m-file v34i03.m.

Parameter recovery
For parameter recovery, tests with two subtests were considered so that the first half measured one latent trait and the second half measured another.A 1000-by-32 (i.e., n = 1000, m = 2, k 1 = 16, k 2 = 16, and K = 32) dichotomous data matrix was generated from each of the three MIRT models described in Section 2. The item parameters were randomly simulated from α 0vj ∼ U (0, 2), α vj ∼ U (0, 2), and β vj ∼ U (−2, 2), and are shown in the first column of Tables 1 through 7. Gibbs sampling was subsequently implemented to recover model parameters assuming the noninformative uniform or informative conjugate prior distributions described previously.The posterior means and standard deviations of item parameters (α v , β v , and/or α 0v ), as well as their Monte Carlo standard errors of estimates and Gelman-Rubin R statistics were obtained and are displayed in the rest of the tables.The estimation accuracy was also evaluated using the average square error between the actual and estimated item parameters, which is shown at the bottom of each table .The Gelman-Rubin R statistic provides a numerical measure for assessing convergence for each item parameter.With values close to 1, it is determined that in the implementations of the Gibbs sampler, Markov chains reached stationarity with a run length of 10, 000 iterations and a burn-in period of 5, 000 iterations.The posterior estimates of the item parameters shown in the seven tables are fairly close to the true parameters, suggesting that each MIRT model incorporating a general trait and several specific trait dimensions recovers its model parameters with enough accuracy.In addition, the two sets of posterior estimates, resulted from different prior distributions, differ only slightly from each other, signifying that the posterior estimates are not sensitive to the choice of uniform or conjugate priors for the item and δ/λ parameters.This point is further supported by the small differences between the average square errors.and ρ 12 = .8(chain length = 10,000, burn-in = 5,000).

Model comparison
To illustrate the use of the Bayesian model choice or checking techniques for evaluating the relative (mis)fit of a model, two 1000-by-30 dichotomous data matrices were simulated from the hierarchical model 2 and the additive model, respectively, so that 15 items measured θ 1 and another 15 items measured θ 2 .This way, the three MIRT models incorporating both general and specific latent dimensions could be evaluated and compared under the situation where the actual structure was hierarchical or additive.To generate the data matrices, item parameters were randomly drawn from the uniform distributions described in Section 5.The Gibbs sampler assuming noninformative priors for the item and δ/λ parameters was subsequently implemented for each of the three MIRT models so that 10, 000 samples were simulated with the first 5, 000 set to be burn-in.The Gelman-Rubin R statistics suggest that the chains converged to their stationary distributions within 10, 000 iterations.Hence, the Bayesian deviance estimates, including D, D( θ), p D and DIC, were obtained from each implementation and are displayed in Table 8.It is clear from the table that both posterior deviance (column 2) and Bayesian DIC (column 5) estimates pointed to the more complicated additive model, indicating that it provided a better description of the simulated data regardless of the actual relationship between general and specific traits.A close examination of these estimates reveals that compared with the situation where the latent structure was hierarchical, the additive model resulted in a larger reduction in the deviance or DIC estimate and hence had a higher improvement in model-data fit over the hierarchical models when the actual structure was additive.Another interesting observation of Table 8 pertains to the effective number of parameters (p D ).Given the model complexity, the additive model consistently had a larger p D than the two hierarchical models.Among the two scenarios considered, the difference in p D between the two types of models was noticeably larger when the actual structure was additive.It has to be noted that the difference between the DIC estimates for the two hierarchical models was rather trivial.Indeed, the two models had almost identical values of D, D( θ), p D and DIC, and hence were essentially similar in model-data fit in spite of different assumptions made regarding the latent structure.
PPMC was also implemented to obtain PPP-values for the three MIRT models where odds ratios were used as the discrepancy measure.with a threshold of 0.01, the plots indicate that the additive model had relatively fewer number of extreme predicted odds ratios and hence provided a better description than the hierarchical models in the two situations considered.On the other hand, the overwhelmingly larger numbers of extreme PPP-values clearly suggest the lack of fit of the hierarchical models when the actual structure was additive (see Figures 2(b), 2(d)).The two hierarchical models performed fairly similarly using PPMC.These results agree with those using DIC.
Hence, the model comparison results based on the Bayesian deviance and PPMC criteria suggest that the additive model performs better than the hierarchical models in a wide range of test situations and even when the actual latent structure is hierarchical.This is consistent with findings from Sheng and Wikle (2009).The fact that the additive model works better even when the hierarchical model is true can be explained by mathematically showing that the latter is a special case of the former.Indeed, we can rewrite ( 2) and (3) as θ vi = δ v θ 0i + vi and θ 0i = λ v θ vi + i , where vi ∼ N (0, 1) and i ∼ N (0, 1); replace θ vi in the hierarchical model so that the systematic component in ( 1 and show that each of these two forms can be viewed as a constrained version of the additive model as defined in (4).This is similar to Yung et al.'s (1999) finding that the second-order factor model is hierarchically nested within the bifactor model.

Empirical example
A subset of the College Basic Academic Subjects Examination (CBASE ; Osterlind 1997) English data was further used to illustrate the Bayesian approach for model choice or model checking.The data contain independent binary responses of 1, 200 college students to a total of 41 English multiple-choice items.The English test is further organized into levels of increasing specificity by two subtests, namely, writing and reading, so that 16 items are in one subtest and 25 are in the other.One may assume that the test measures an overall English dimension with two specific trait dimensions.Model comparison is hence necessary for establishing the model that provides a relatively better representation of the data.The three MIRT models incorporating both general and specific traits were then each fit to the CBASE data using Gibbs sampling with a run length of 10, 000 iterations and a burn-in period of 5, 000, which was sufficient for the chains to converge.As expected, Bayesian deviance and PPMC, displayed in Table 9 and Figure 3, respectively, pointed to the more complicated additive model as it had a smaller Bayesian DIC estimate and fewer number of extreme PPP-values for odds ratios.Given the observations made from the simulation study in Section 6.2, comparisons of the DIC and p D estimates using Bayesian deviances or that of the extreme PPP-values using PPMC suggest that the actual latent structure for the CBASE data is closer to be additive, with the general and specific traits being moderately correlated.

Discussion
With functions for generating dichotomous response data from and implementing the Gibbs sampler for the 2PNO MIRT models that incorporate a general trait and several specific trait dimensions, IRTm2noHA allows the user the choice to set the number of total or burn-in samples, specify starting values or prior distributions for model parameters, check convergence of the Markov chain, as well as obtain Bayesian model choice or model checking statistics.
The package leaves it to the user to choose between uniform and conjugate priors for the item parameters and certain hyperparameters.In addition, the user can choose to set the location and scale parameters for the conjugate normal priors of α vj , β vj , and/or α 0vj to reflect different prior beliefs on their distributions.For example, if there is a stronger prior opinion that the item intercept parameters in the additive model should be close to 0, a smaller σ 2 βv can be specified in the gsm2noHA function such that xvar = [ones(m+1,1), ones(m+1,1), 0.5*ones(m+1,1)].If, however, this prior opinion concerns with the item intercepts in the first subtest, not the entire test, the input argument becomes xvar = [ones(m+1,1), ones(m+1,1), [0.5; ones(m,1)]].This way, different prior distributions can be specified for α vj , β vj , or α 0vj across the m subtests.
The intertrait correlations can be estimated fairly accurately using gsm2noHA.
In the examples of parameter recovery in Section 6.1, ρ 12 was set to be 0.5 for the hierarchical model 2, and it was estimated to be 0.5505 assuming uniform priors and 0.5554 assuming informative priors.
Similarly, for the additive model, the actual ρ 01 , ρ 02 , and ρ 12 parameters were 0.4, 0.4, and 0.8, respectively.Their posterior estimates were 0.3865, 0.4615, and 0.8148, respectively, assuming uniform priors, and 0.3728, 0.4758, and 0.8173, respectively, assuming informative priors.Hence, in addition to item or person parameters, the package provides a good estimation of the actual interrelations among latent traits provided that the latent structure is specified correctly.As illustrated, the additive model allows the general trait to be correlated with each specific trait.However, high correlations between them would give rise to collinearity and create problem with the Gibbs sampler.
One should note that during an implementation of the Gibbs sampler, if a Markov chain does not converge within a run length of certain iterations, additional iterations can be obtained by invoking the gsm2noHA function with starting values th0, item0, sigma0, and/or b0 set to be their respective posterior samples drawn on the last iteration of the Markov chain (see Sheng 2008a, for a demonstration of such procedure).
The illustrative examples provided in Section 6 deal with two subtests for simplicity.On a 2.8GHz Intel dual core processor with 2GB of RAM, the execution time for the three examples is about 2-3 minutes, 5.5-6 minutes and 10.5-11 minutes, respectively.For tests with three or more subtests, IRTm2noHA can be used in a similar fashion with possibly increased complexity and consequently a longer computing time.In addition, Bayesian deviance and PPMC are adopted in this paper to evaluate the model-data fit of a candidate model.One may also want to consider Bayes factors, which provide more reliable and powerful results for model comparisons in the Bayesian framework.However, they are difficult to calculate due to the difficulty in exact analytic evaluation of the marginal density of the data (Kass and Raftery 1995) and hence are not considered in the paper.Finally, this paper adopts the Gelman-Rubin R statistic to assess convergence.Its multivariate extension, the Brooks-Gelman multivariate potential scale reduction factor (Brooks and Gelman 1998), may be considered as well.

Figure 1 :
Figure 1: Graphical representations of the three MIRT models incorporating general and specific latent traits.(Circles represent latent traits and squares represent observed item responses.) (a) with an actual hierarchical structure (b) with an actual additive structure (hierarchical model 1) (c) with an actual hierarchical structure (d) with an actual additive structure (hierarchical model 2) (e) with an actual hierarchical structure (f) with an actual additive structure (additive model)

Figure 2 :
Figure 2: Plots of PPP-values for odds ratios for the three MIRT models fit to multidimensional data involving general and specific traits.(Triangles represent extreme PPP-values, where values smaller than 0.005 are denoted using left triangle signs and values larger than 0.995 are denoted using right triangle signs.)

Figure 3 :
Figure 3: Plots of PPP-values for odds ratios for the three MIRT models with the CBASE data.(Triangles represent extreme PPP-values, where values smaller than 0.005 are denoted using left triangle signs and values larger than 0.995 are denoted using right triangle signs.)

Table 7 :
Posterior mean, standard deviation (SD), Monte Carlo standard error of the estimate (MCSE) and Gelman-Rubin R statistic for β v in the additive model where ρ 01 = .4,ρ 02 = .4

Table 8 :
Graphical representations of extreme PPPvalues are shown in Figure2, where the upper diagonal is left blank due to symmetry.Here, Bayesian deviance estimates for the three MIRT models fit to the simulated data where the actual general and specific traits form a hierarchical structure or an additive structure.

Table 9 :
Bayesian deviance estimates for the three MIRT models with the CBASE data.