New Developments in Mokken Scale Analysis in R

Mokken (1971) developed a scaling procedure for both dichotomous and polytomous items that was later coined Mokken scale analysis (MSA). MSA has been developed ever since, and the developments until 2000 have been implemented in the software package MSP (Molenaar and Sijtsma 2000) and the R package mokken (Van der Ark 2007). This paper describes the new developments in MSA since 2000 that have been implemented in mokken since its first release in 2007. These new developments pertain to invariant item ordering, a new automated item selection procedure based on a genetic algorithm, inclusion of reliability coefficients, and the computation of standard errors for the scalability coefficients. We demonstrate all new applications using data obtained with a transitive reasoning test and a personality test.


Introduction
Mokken scale analysis (MSA; Mokken 1971; Sijtsma and Molenaar 2002) is a scaling technique for ordinal data, mainly used for scaling test and questionnaire data. Having approximately 3,810 hits on Google Scholar, MSA is a rather popular scaling method, though not as popular as Rasch analysis (13,900 hits) or factor analysis (1,380,000 hits). MSA is based on nonparametric item response theory (IRT) models and consists of two parts: (1) an automated item selection procedure (AISP), which partitions a set of ordinal variables (from here on called items) into scales (called Mokken scales), possibly leaving some items unselected, and (2) methods to investigate the goodness-of-fit of nonparametric IRT models for each of the Mokken scales. For a thorough discussion of nonparametric IRT, MSA, and data analysis strategies we refer to the literature; nonparametric IRT and MSA for (ordinal) dichotomous item scores were originally developed by Mokken (1971; also see Mokken and Lewis 1982).

The aim of the R package (R Development Core Team 2011) mokken (Van der Ark 2007, 2010b) was to offer all methods and procedures available in MSP as free and open source software (licensed under the GPL) that runs irrespective of the operating system, and to provide a platform for new developments in MSA. The first version of mokken was a free version of MSP, but over the years procedures and methods resulting from new results in MSA have been embedded. The availability of the package has also generated new research. Because MSA is now embedded in R, it is much easier to compare MSA with other scaling procedures (e.g., Brusco, Koehn, and Steinley 2011; Smits, Timmermans, and Meijer 2011; Straat, Van der Ark, and Sijtsma 2011).

This paper discusses new developments in MSA and their implementation in mokken. The new developments are classified into four categories: investigating invariant item ordering, a new algorithm for the AISP, computation of test-score reliability, and computation of standard errors of scalability coefficients. The remainder of the paper is organized as follows: Section 2 briefly discusses nonparametric IRT models and Mokken scale analysis; Section 3 discusses the new developments and the corresponding code for mokken; Section 4 demonstrates MSA by applying the functions in mokken to data obtained using a transitive reasoning test and a personality test; and Section 5 concludes with a discussion.
A summary of nonparametric IRT and MSA

Nonparametric IRT models

Suppose a test or a questionnaire contains a set of items which are numbered 1, ..., J and indexed by j. For convenience, but without loss of generality, suppose that each item has m + 1 ordered answer categories. Let X_j denote the score on item j with realization x_j = 0, 1, ..., m. Scores X_1, ..., X_J are referred to as item scores. If m = 1 the item is called dichotomous; if m > 1 the item is called polytomous. The test score is defined as X_+ = ∑_{j=1}^{J} X_j. In IRT it is assumed that a latent trait θ triggers the item responses. It is also assumed that the ordering of the scores of each item reflects the hypothesized ordering on θ. The assumptions are:

Unidimensionality: only one latent variable θ is required to explain the association between the item scores.

Local independence: P(X_1 = x_1, ..., X_J = x_J | θ) = ∏_{j=1}^{J} P(X_j = x_j | θ).

Latent monotonicity: P(X_j ≥ x | θ_a) ≤ P(X_j ≥ x | θ_b), for all θ_a < θ_b, for j = 1, ..., J and x = 1, ..., m.

Nonintersection: if for a fixed value θ_0, P(X_i ≥ x | θ_0) ≥ P(X_j ≥ y | θ_0), then P(X_i ≥ x | θ) ≥ P(X_j ≥ y | θ) for all θ. This holds for all item pairs i, j (i ≠ j) and for all pairs of item scores x, y.
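To make the assumptions concrete, the following sketch (our own illustration, not from the paper) simulates dichotomous item scores from a two-parameter logistic model, a parametric special case satisfying unidimensionality, local independence, and latent monotonicity; all parameter values are made up.

```r
# Illustrative simulation under the monotone homogeneity model (2PL special
# case): theta is the latent trait, alpha/beta are hypothetical item parameters.
set.seed(1)
N <- 1000; J <- 5
theta <- rnorm(N)                                  # latent trait values
alpha <- runif(J, 0.5, 2)                          # discrimination (> 0)
beta  <- seq(-1.5, 1.5, length.out = J)            # difficulty
P <- plogis(outer(theta, 1:J, function(t, j) alpha[j] * (t - beta[j])))
X <- matrix(rbinom(N * J, 1, P), N, J)             # item scores X_1, ..., X_J
Xplus <- rowSums(X)                                # test score X_+
```

Because each response depends on θ only through its own item step response function, the simulated data satisfy local independence by construction.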
Local independence implies that the response to any item is unrelated to any other item when the latent variable level θ is controlled for. Latent monotonicity means that the item step response functions are nondecreasing functions of θ. (Latent monotonicity is usually referred to as monotonicity, but we prefer the term latent monotonicity to distinguish it from manifest monotonicity, which is introduced later on.) Nonintersection means that the item step response functions do not intersect.
The assumptions of unidimensionality, local independence, and latent monotonicity define the most general nonparametric IRT model: the monotone homogeneity model (Mokken 1971), also known as the nonparametric graded response model (Hemker, Sijtsma, Molenaar, and Junker 1997). The assumptions of unidimensionality, local independence, latent monotonicity, and nonintersection define the double monotonicity model (Mokken 1971). Several other nonparametric IRT models have been proposed (see Van der Ark 2001, for an overview). All popular unidimensional parametric IRT models, such as the Rasch model (Rasch 1960), the two- and three-parameter logistic models (Birnbaum 1968), and the graded response model (Samejima 1969), also assume unidimensionality, local independence, and latent monotonicity. Therefore, investigation of the assumptions of nonparametric IRT models is also useful when parametric IRT models are used. In addition, parametric IRT models assume that the item step response functions have a parametric functional form. The next two paragraphs describe the measurement properties of nonparametric IRT models.

Ordinal person measurement. For dichotomous items, the monotone homogeneity model implies stochastic ordering of θ by X_+ (known under the acronym SOL), that is,

P(θ > a | X_+ = L) ≥ P(θ > a | X_+ = K) for all a and for all K < L    (1)

(Grayson 1988; Huynh 1994; Hemker, Sijtsma, Molenaar, and Junker 1996).

Ordinal item measurement. In several situations it is important to know whether the ordering of the items in terms of difficulty or popularity is the same for all levels of the trait measured by the test (for an overview see Ligtvoet, Van der Ark, Te Marvelde, and Sijtsma 2010). This property is known as invariant item ordering (IIO; Sijtsma and Junker 1996; Sijtsma and Hemker 1998). Sijtsma and Hemker (1998) defined IIO as follows: a set of J items has an IIO if the items can be numbered 1, ..., J and ordered accordingly such that

E(X_1 | θ) ≥ E(X_2 | θ) ≥ ··· ≥ E(X_J | θ) for all θ.    (2)

For dichotomous items, IRT models that have nonintersecting item response functions (e.g., the Rasch model and the double monotonicity model) imply IIO. Other IRT models for dichotomous items do not imply IIO.
For polytomous IRT models an invariant ordering of items cannot be defined unequivocally. Stronger invariant ordering properties than IIO in Equation 2 were coined latent scales (Ligtvoet, Van der Ark, Bergsma, and Sijtsma 2011). In order of increasing strictness of ordering, they are the latent scale for cumulative probability models (LS-CPM), which means that for all x and all θ

P(X_1 ≥ x | θ) ≥ P(X_2 ≥ x | θ) ≥ ··· ≥ P(X_J ≥ x | θ);    (3)

the latent scale for continuation ratio models (LS-CRM), which means that for all x and all θ

P(X_1 ≥ x | X_1 ≥ x−1, θ) ≥ P(X_2 ≥ x | X_2 ≥ x−1, θ) ≥ ··· ≥ P(X_J ≥ x | X_J ≥ x−1, θ);    (4)

and the latent scale for adjacent category models (LS-ACM), which means that for all x and all θ

P(X_1 = x | X_1 = x−1 ∨ X_1 = x, θ) ≥ P(X_2 = x | X_2 = x−1 ∨ X_2 = x, θ) ≥ ··· ≥ P(X_J = x | X_J = x−1 ∨ X_J = x, θ).    (5)

The hierarchical structure of IIO and the latent scales can be summarized as follows (Ligtvoet et al. 2011):

IIO ⇐ LS-CPM ⇐ LS-CRM ⇐ LS-ACM.    (6)

Sijtsma and Hemker (1998) showed that only a few very restrictive parametric and nonparametric polytomous IRT models imply IIO (Equation 2); the most well known being the rating scale model (Andrich 1978), the sequential rating scale model (Tutz 1990), and the strong double monotonicity model (Sijtsma and Hemker 1998). Most commonly used polytomous IRT models, such as the graded response model (Samejima 1969), the partial credit model (Masters 1982), the generalized partial credit model (Muraki 1992), and the monotone homogeneity model for polytomous items (Molenaar 1997), do not imply IIO. For detailed information on these methods see Molenaar and Sijtsma (2000); Sijtsma and Molenaar (2002); Van der Ark (2007, 2010b).

Scalability coefficients

Three types of scalability coefficients play an important role in MSA. For each pair of items, there is an item-pair scalability coefficient H_ij, i, j = 1, ..., J, i ≠ j.
Let COV(X_i, X_j) be the covariance between X_i and X_j, and let COV(X_i, X_j)^max be the maximum covariance between X_i and X_j given the marginal distributions of X_i and X_j. If the variances of X_i and X_j are both positive, then H_ij is the normed covariance between the item scores:

H_ij = COV(X_i, X_j) / COV(X_i, X_j)^max    (7)

(Molenaar 1991). The range of H_ij is (−∞, 1]. The monotone homogeneity model implies that 0 ≤ H_ij ≤ 1 for all i ≠ j, so negative values of H_ij indicate that at least one of the items does not fit the monotone homogeneity model, and that item may be removed. In general, higher H_ij values result in stronger scales, but in MSA the magnitude of positive H_ij coefficients is seldom interpreted (Van der Ark, Croon, and Sijtsma 2008).
For each item, there is an item scalability coefficient H_j, j = 1, ..., J. Let R_(j) = X_+ − X_j; R_(j) is called the rest score. Let COV(X_j, R_(j)) be the covariance between X_j and R_(j), and let COV(X_j, R_(j))^max be the maximum covariance between X_j and R_(j) given the marginal distributions of X_j and R_(j). If X_j and R_(j) both have positive variance, then H_j is the normed covariance between the item score and the rest score:

H_j = COV(X_j, R_(j)) / COV(X_j, R_(j))^max.    (8)

The monotone homogeneity model implies that 0 ≤ H_j ≤ 1 for all j, so a negative item-scalability coefficient indicates that the corresponding item violates the monotone homogeneity model. Van Abswoude, Van der Ark, and Sijtsma (2004) argued that H_j can be interpreted in a similar way as the discrimination parameters in parametric IRT.
An H_j < 0 indicates that the corresponding item response function is generally decreasing; if H_j = 0, the item gives no information on θ; and H_j > 0 indicates that the item response function is generally increasing. If H_j = 1, the item is a so-called `Guttman item' with perfect discrimination (Guttman 1950; Sijtsma and Molenaar 2002). To avoid items with very little discriminatory power, Mokken advocated retaining only those items for which H_j is greater than some positive lower bound c. In software, the default lower bound c = .3 is most often used.
For the entire set of items, there is a test-scalability coefficient H. The monotone homogeneity model implies that 0 ≤ H ≤ 1. If H = 1, the test data follow a perfect Guttman scalogram. Mokken (1971) proposed the following rules of thumb for H: a scale is considered weak if .3 ≤ H < .4, moderate if .4 ≤ H < .5, and strong if H ≥ .5. The predicates `weak', `moderate', and `strong' refer to the degree to which the ordering of persons by test score X_+ accurately reflects an ordering on θ.
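The three coefficients can be computed directly from their definitions. The sketch below is our own minimal implementation, not the mokken code (in practice one would use coefH): the sample maximum covariance is obtained by pairing the sorted item scores, and the total-scale H is computed with the commonly used ratio of summed covariances.

```r
# Sample scalability coefficients as normed covariances (Equations 7 and 8).
# cov.max pairs the sorted scores: the maximum covariance given the marginals.
cov.max <- function(x, y) cov(sort(x), sort(y))
Hij <- function(X, i, j) cov(X[, i], X[, j]) / cov.max(X[, i], X[, j])
Hj <- function(X, j) {
  R <- rowSums(X[, -j, drop = FALSE])       # rest score R_(j) = X_+ - X_j
  cov(X[, j], R) / cov.max(X[, j], R)
}
Htot <- function(X) {                       # ratio of summed (max) covariances
  J <- ncol(X); num <- 0; den <- 0
  for (i in 1:(J - 1)) for (j in (i + 1):J) {
    num <- num + cov(X[, i], X[, j])
    den <- den + cov.max(X[, i], X[, j])
  }
  num / den
}

# A perfect Guttman scalogram yields H = 1:
Xg <- rbind(c(0, 0, 0), c(1, 0, 0), c(1, 1, 0), c(1, 1, 1))
Htot(Xg)   # 1
```

The Guttman example illustrates the rule of thumb: when no respondent passes a harder item while failing an easier one, all covariances attain their maximum and H = 1.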
Function coefH in mokken (see Van der Ark 2007, 2010b) yields sample estimates of the scalability coefficients, which will be denoted by Ĥ_ij, Ĥ_j, and Ĥ, respectively. Recent changes in coefH are discussed in Section 3.3.

Automated item selection procedure
The AISP partitions a set of items into so-called Mokken scales, possibly leaving some items unscalable. A Mokken scale is a set of items for which, for a suitably chosen positive lower bound c, all inter-item covariances are strictly positive and H_j ≥ c > 0 (cf. Mokken 1971, p. 184). Partitioning a set of items into Mokken scales using the AISP is an exploratory method for obtaining sets of items that satisfy some basic observable properties implied by the monotone homogeneity model and that have reasonable discriminatory power. Mokken (1971) devised a hierarchical clustering algorithm for the AISP; for a detailed description we refer to Sijtsma and Molenaar (2002, Chap. 5). The AISP can be run in mokken using the function aisp (see Van der Ark 2010b). Because sample estimates of the scalability coefficients are used to check whether the criteria of a Mokken scale are met, sampling fluctuations may affect the partitioning of an item set into Mokken scales. Recently, a new algorithm for the AISP has been included, which is discussed in Section 3.2.

Methods for investigating latent monotonicity
Manifest monotonicity is an observable property of the test data. Let r and s be realizations of R_(j) (r, s = 0, ..., [J−1]·m); then manifest monotonicity is defined as

P(X_j ≥ x | R_(j) = s) ≥ P(X_j ≥ x | R_(j) = r) for all j, x, and s > r.

For dichotomous items, latent monotonicity implies manifest monotonicity, but for polytomous items some counterexamples have been found (Junker and Sijtsma 2000). However, Molenaar and Sijtsma (2000, p. 128) assumed that in practice, also for polytomous items, manifest monotonicity can be used to assess latent monotonicity. Manifest monotonicity can be investigated using check.monotonicity in mokken (see Van der Ark 2007, 2010b).
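As a toy illustration (our own, not mokken's implementation, which additionally groups small rest-score groups and tests violations), manifest monotonicity for a dichotomous item amounts to checking that the proportion of 1-scores is nondecreasing across rest-score groups:

```r
# Estimate P(X_j = 1 | R_(j) = r) for each rest score r and inspect ordering.
set.seed(2)
N <- 2000
theta <- rnorm(N)
beta <- c(-1, -0.5, 0.5, 1)                       # illustrative difficulties
P <- plogis(outer(theta, beta, "-"))              # Rasch-type probabilities
X <- matrix(rbinom(N * 4, 1, P), N, 4)
j <- 1
restscore <- rowSums(X[, -j])                     # R_(j), here 0, ..., 3
p.hat <- tapply(X[, j], restscore, mean)          # est. P(X_j = 1 | R_(j) = r)
all(diff(p.hat) >= 0)                             # TRUE if no sample violation
```

Even for data generated under latent monotonicity, small rest-score groups can produce sample reversals, which is why check.monotonicity combines groups (argument minsize) and tests whether violations are significant.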
Methods for investigating intersection of item step response functions

Molenaar and Sijtsma (2000, pp. 74-88) describe three methods to investigate nonintersection: method "pmatrix", method "restscore", and method "restsplit". Method "pmatrix" investigates for all item pairs i, j (i ≠ j; X̄_i > X̄_j), and for all combinations of realizations x, y, whether

P(X_i ≥ x, X_k ≥ z) ≥ P(X_j ≥ y, X_k ≥ z) for all k ≠ i, j and all z    (9)

and

P(X_i < x, X_k < z) ≤ P(X_j < y, X_k < z) for all k ≠ i, j and all z.    (10)

Let R_(ij) be the test score minus the scores on X_i and X_j, with realization r (r = 0, ..., [J−2]·m). Method "restscore" investigates for all item pairs i, j (i ≠ j; X̄_i > X̄_j), and for all combinations of realizations x, y, whether

P(X_i ≥ x | R_(ij) = r) ≥ P(X_j ≥ y | R_(ij) = r) for all r.    (11)

Method "restsplit" is a variant of method "restscore"; it investigates for all item pairs i, j (i ≠ j; X̄_i > X̄_j), and for all combinations of realizations x, y, whether

P(X_i ≥ x | R_(ij) ≥ r) ≥ P(X_j ≥ y | R_(ij) ≥ r) for all r.    (12)

Equations 9, 10, 11, and 12 are observable properties of the double monotonicity model. Methods "pmatrix" and "restscore" can be investigated in mokken using functions check.pmatrix and check.restscore; method "restsplit" is not yet included in mokken. For dichotomous items these methods are useful because the double monotonicity model allows ordinal item measurement. For polytomous items, the double monotonicity model does not imply ordinal item measurement (e.g., Sijtsma, Meijer, and Van der Ark 2011). As a result, these methods have little use for polytomous items. New methods for investigating IIO, which can be applied to both dichotomous and polytomous items, have been developed (discussed hereafter); we advocate using these instead.
New developments in MSA

Investigating IIO

Ligtvoet et al. (2010) derived an observable property of IIO (Equation 2), which they coined manifest IIO (MIIO). MIIO as implemented in mokken means that items can be numbered and ordered accordingly such that

E(X_i | R_(ij) = r) ≥ E(X_j | R_(ij) = r) for all i < j and all r.    (13)

Furthermore, they developed a method to investigate MIIO and investigated the sensitivity and specificity of the new method. First, the items are ordered by descending mean item score and numbered accordingly, such that X̄_1 ≥ X̄_2 ≥ ··· ≥ X̄_J. Second, for each item pair (i, j), i < j, it is investigated whether Equation 13 is violated, yielding J(J−1)/2 Boolean outcomes on violation of MIIO. For a particular item pair, the following statistical testing procedure is applied to determine whether or not the item pair violates Equation 13: if the sample means of X_i and X_j given R_(ij) = r are reversely ordered (i.e., the sample mean of X_i given R_(ij) = r is less than that of X_j), a one-sided t-test is conducted to decide whether the violation is significant. Violations less than minvi (default minvi = m·.03) are ignored, to avoid testing very small violations on a scale from 0 to m.

Ligtvoet et al. (2011) studied observable properties of the latent scales (Equations 3, 4, and 5). They proved that MIIO (Equation 13) is also an observable property for all latent scales. Moreover, they found two other observable properties for the latent scales, and devised a method for investigating these observable properties similar to the method for investigating MIIO. The observable properties are called manifest scale of the cumulative probability model (MSCPM), which is implied by all latent scales but not by IIO, and increasingness in transposition (IT; Rosenbaum 1987), which is implied only by the latent scale for adjacent category models (Equation 5). Hence, Equation 6 can be extended to

IIO  ⇐  LS-CPM  ⇐  LS-CRM  ⇐  LS-ACM
 ⇓         ⇓                      ⇓
MIIO     MSCPM                   IT

For a description of these manifest properties see Ligtvoet et al. (2011).
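The testing step for one rest-score group can be sketched as follows. This is our own simplified rendition, not the mokken code (which also handles rest-score grouping, the minsize argument, and reporting); the helper name and the use of a paired t-test are our assumptions.

```r
# For one rest-score group: given the assumed order (item i at least as
# popular as item j), flag a significant reversal of the conditional means.
miio.violation <- function(xi, xj, m = 1, minvi = m * 0.03, alpha = .05) {
  d <- mean(xi) - mean(xj)        # should be >= 0 under the assumed order
  if (d >= -minvi) return(FALSE)  # no reversal, or too small to bother testing
  t.test(xi, xj, paired = TRUE, alternative = "less")$p.value < alpha
}

xi <- c(0, 0, 1, 0, 0, 1, 0, 0, 0, 0)   # scores on item i within one group
xj <- c(1, 1, 1, 0, 1, 1, 1, 1, 0, 1)   # scores on item j within the same group
miio.violation(xi, xj)                  # TRUE: a significant reversal
```

Swapping the arguments gives FALSE, because the conditional means are then in the assumed order and no test is needed.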
A pilot study showed that the methods based on MIIO, MSCPM, and IT may suggest different items as candidates for removal. Under violations of IIO, the method based on MSCPM was the most sensitive and, in general, suggested the removal of more items than the methods based on IT and MIIO. At first glance, this may seem a strange result because IT relates to a stricter ordering property than MSCPM. It may be explained as follows: suppose that IIO does not hold. By implication, none of the latent scales hold, and it is expected that all methods indicate that IIO should be rejected. Unlike the latent scales, the three methods are not (necessarily) hierarchically related; in fact, the three methods investigate different properties in the data. Therefore, the three methods may suggest different items for deletion.

The function check.iio in mokken consists of methods to investigate IIO, and is used as follows:

check.iio(X, method = "MIIO", minvi = default.minvi, minsize = default.minsize,
  alpha = .05, item.selection = TRUE, verbose = FALSE)

The argument method indicates the method used to investigate IIO. Besides default option "MIIO", options "MSCPM" and "IT" are also allowed.

As an example, we use ten items from the Adjective Checklist (Gough and Heilbrun 1980). Each item has five ordered answer categories (0 = completely disagree, 1 = disagree, 2 = agree nor disagree, 3 = agree, 4 = completely agree). The respondents were instructed to consider whether an adjective described their personality, and to mark the answer category that fits this description best. The first ten items are extremely popular or unpopular self-descriptive adjectives. The unpopular items (indicated by an asterisk) were reversely coded. For example, the item "cruel" is extremely unpopular as a self-descriptive adjective.

Step 1 shows, for each item, the number of conflicting items. Item j is a conflicting item to item i (and vice versa) if their estimated item response functions intersect, resulting in a violation of Equation 13. In this example, the estimated item response functions of items 2
(unintelligent*) and 4 (unfriendly*), and items 9 (honest) and 10 (deceitful*) intersect. To check this, pairs of estimated item response functions can be plotted using plot(check.iio(X)). The item or items having the highest number of conflicting items are candidates for removal. If there is more than one candidate for removal, then the item having the lowest Ĥ_j value is removed.
In this example, items 2, 4, 9, and 10 all have one conflicting item. Item 2 had the lowest Ĥ_j value and was removed.
Step 2 shows, for each of the remaining items, the number of conflicting items. The procedure is repeated until no more violations occur. Finally, the fourth element ($HT) gives the value of coefficient H^T computed on the remaining items. The output for methods MSCPM and IT is similar to the output for method MIIO.

New algorithms for the automated item selection procedure
Mokken's AISP aims at partitioning a set of items into Mokken scales, possibly leaving some items unscalable. The hierarchical clustering algorithm starts with the two items having the highest H_ij value that is significantly greater than 0; hence, a scale consists of at least two items. The algorithm keeps adding items that meet the criteria of a Mokken scale, until no more items are left that meet the criteria. Then the procedure is repeated for the remaining unselected items, resulting in a second Mokken scale; then the procedure is repeated again, until no more scales can be formed. Hence, the idea is to have as many good items as possible in the first scale, then as many good items as possible (that were not selected in the first scale) in the second scale, and so on.

Straat, Van der Ark, and Sijtsma (in press) noticed two problems in the hierarchical clustering algorithm. First, although the objective of the AISP is clear, the algorithm lacked a clear objective function. Thus, it is difficult to assess whether an optimal partitioning was obtained. Second, in some cases the algorithm does not yield Mokken scales because some item-scalability coefficients H_j are slightly less than the required lower bound c. They proposed an objective function (for details, see Straat et al. in press) that is completely in line with Mokken's original idea, and devised a genetic algorithm for the AISP. The two algorithms are included in the function aisp:

aisp(X, search = "normal", lowerbound = .3, alpha = .05, popsize = 20,
  maxgens = default.maxgens, pxover = 0.5, pmutation = 0.1, verbose = FALSE)

The argument search (name taken from MSP) gives the algorithm: "normal" is the hierarchical clustering algorithm, "ga" is the genetic algorithm. The arguments lowerbound, alpha, and verbose affect the hierarchical clustering algorithm: lowerbound specifies lower bound c for scalability coefficient H_j; alpha specifies the nominal Type I error rate for testing H_ij = 0; and verbose specifies whether the steps in the scaling procedure should be sent to the screen. The arguments popsize, maxgens, pxover, and pmutation affect the genetic algorithm: popsize is the size of the population of partitionings; maxgens is the number of generations considered before the genetic algorithm stops (maxgens depends on the number of items; its default value is 1000 × 10^{log₂(J/5)}); pxover is the probability of cross-over; and pmutation is the probability of mutation. The following code gives the partitioning of the items into Mokken scales according to the hierarchical clustering algorithm and the new genetic algorithm.
R> library(mokken)
R> data(acl)
R> X <- acl[, 1:10]
R> scale.normal <- aisp(X)
R> scale.ga <- aisp(X, search = "ga")
R> cbind(scale.normal, scale.ga)

Function aisp returns a J × 1 matrix containing the partitioning of the items. The numbers 1, 2, ... indicate the scale that the items are assigned to. A value 0 means that the item is unscalable. Using default option search = "normal" resulted in two scales: Scale 1 consists of four items and Scale 2 consists of four items; two items are unscalable. Using search = "ga" resulted in one scale consisting of seven items, leaving three items unscalable. The partitioning provided by the two algorithms is rather different. The genetic algorithm yielded a better partitioning in terms of Mokken's desiderata because the longest scale consists of more items than in the partitioning obtained using the hierarchical clustering algorithm. The hierarchical clustering algorithm started with item pair ("reliable", "dependable"), and then added item "honest" to the first scale. The inclusion of "honest" prevented the inclusion of any other item except "deceitful*", because due to the presence of "honest" the item-scalability coefficients of the remaining items (except "deceitful*") did not exceed the lower bound. Due to the hierarchical structure of the algorithm, "honest" stayed in the first scale. The genetic algorithm also considered partitionings without "honest", which resulted in a longer first scale.
The two algorithms were compared in a simulation study (Straat et al. in press).On the one hand, results showed that the genetic algorithm yielded better partitionings in terms of the objective function, and the genetic algorithm did not yield scales violating the criteria of a Mokken scale.On the other hand, results showed that the genetic algorithm can be slow if the number of items is large.

Computing standard errors for scalability coecients
Several heuristic rules in MSA involve interpreting absolute values of the scalability coefficients. For example, Mokken used the label `strong scale' if H ≥ .50, and an item is selected in a Mokken scale only if H_j ≥ c and all H_ij > 0. For a sound application of these guidelines, the standard errors of the scalability coefficients should be taken into account. For example, if Ĥ = .51 but its standard error equals .09, then the qualification `strong scale' is not appropriate, because the standard error indicates that the population value of H may well be less than .50. Van der Ark, Croon, and Sijtsma (2008) derived standard errors for the scalability coefficients for dichotomous item scores using marginal models. This approach required the evaluation of all possible response patterns. For a typical psychological test having J = 20 items, this results in 2^20 > 1,000,000 response patterns, which is infeasible for practical computation.
For polytomous items, the curse of dimensionality is even greater, because 20 Likert items (5 ordered answer categories) result in almost 100 trillion response patterns. Kuijpers, Van der Ark, and Croon (2011) generalized the approach to scalability coefficients for polytomous items, and solved the dimensionality problem. The standard errors are estimated under the assumption that the observed frequencies of the item-score patterns follow a multinomial distribution. Using a normal approximation, the standard errors can be used to compute confidence intervals. Because the theoretical maximum of the scalability coefficients equals 1, these confidence intervals may not be range preserving; alternatively, bootstrap confidence intervals of the scalability coefficients may be computed (Van Onna 2004).

As of version 2.6 of mokken, function coefH, which provides the scalability coefficients, also gives the theoretical standard errors:

coefH(X, se = TRUE, nice.output = TRUE)

By default coefH returns an object of class noquote, which presents the scalability coefficients and standard errors in a pretty format. Argument se is a Boolean variable indicating whether or not standard errors should be computed. For huge sample sizes in combination with large numbers of items, the computation of standard errors may take rather long, or memory problems may occur. In that case, it is advised to set se = FALSE. Argument nice.output is a Boolean variable indicating whether the resulting output should be of class noquote: nice.output = TRUE produces pretty output of class noquote, whereas nice.output = FALSE produces a matrix, which may be convenient if the scalability coefficients are used as input later on, for example, in simulation studies.

Because the scalability coefficients are ratios, their standard errors can be very large, even for large sample sizes, when the item scores have a skewed distribution. This is illustrated by the following code: the scores on two items are simulated for a large sample (N = 10,000), coefficient H is computed, and a cross-classification of the item scores is presented.
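The illustration code itself appears to have been lost in extraction; the sketch below follows the description in the text, with the item-score distributions chosen by us. With heavily skewed items the maximum covariance is tiny, so the ratio, and hence its standard error, is unstable; in mokken one would inspect coefH(cbind(X1, X2), se = TRUE).

```r
# Two highly skewed dichotomous items for N = 10,000 respondents (the
# response probabilities are our assumption, not the paper's).
set.seed(3)
N <- 10000
X1 <- rbinom(N, 1, 0.97)
X2 <- ifelse(X1 == 1, rbinom(N, 1, 0.98), rbinom(N, 1, 0.50))
H12 <- cov(X1, X2) / cov(sort(X1), sort(X2))   # sample H for two items
table(X1, X2)                                  # cross-classification
```

The cross-classification shows that one cell carries almost all observations, so the covariance (and with it H) rests on the few respondents in the sparse cells.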

Computation of test-score reliability
Test-score reliability (denoted ρ_XX′) is the degree of stability of a respondent's test score across independent replications of a test administration. It is one of the most important concepts in psychometrics. Most often, psychological tests are administered only once, so there are no replications of the test administration. Even if a test is administered multiple times to the same sample, it is very unlikely that these would be independent replications, because at the second administration the respondents are very likely to remember the test items from the first administration. As a result, the test-score reliability cannot be computed exactly and has to be estimated. The most common estimator is Cronbach's alpha (Cronbach 1951), which estimates the test-score reliability using the data obtained in a single test administration.
Cronbach's alpha is in fact one of the worst estimators, because it underestimates the test-score reliability to a larger degree than readily available alternative statistics (e.g., lambda-2) do (Schmitt 1996; Sijtsma 2009). Mokken (1971; also see Sijtsma and Molenaar 1987; Molenaar and Sijtsma 1988) derived a statistic that is an unbiased estimator of the test-score reliability given that the double monotonicity model holds, which is a rather strong assumption. This statistic, which is known under several names including "rho" and the MS statistic, is included in MSP but suffers from a programming error (Van der Ark 2010a). As a result, MSP may not produce the correct value.
Alternatively, Van der Ark, Van der Palm, and Sijtsma (2011) proposed a statistic coined the latent class reliability coefficient (LCRC), which is an unbiased estimator of the test-score reliability given that an unconstrained latent class model holds. First, the joint density of the item scores is estimated using a latent class model (e.g., Linzer 2011; Vermunt, Van Ginkel, Van der Ark, and Sijtsma 2008), and, second, the reliability coefficient is derived from the latent class model. When latent class models are used for density estimation, the number of latent classes should be large to allow precise estimation of the density; issues such as interpretation of the latent classes, identifiability, and local maxima, which are important in the traditional use of latent class models, are less important here. Compared to the assumptions of the double monotonicity model (unidimensionality, local independence given θ, latent monotonicity, and nonintersection), the assumptions of the unconstrained latent class model (local independence given class membership) are rather weak, especially when the number of latent classes is large. This gives LCRC an advantage over the MS statistic. A drawback of LCRC is that the number of latent classes must be specified beforehand, which means that the fit of several latent class models must be investigated. Simulation studies showed that MS and LCRC are superior to Cronbach's alpha, and that if the data are not unidimensional, LCRC is the statistic to be preferred (Van der Ark et al. 2011).
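Two of the estimators just mentioned are easy to compute from a single covariance matrix. The sketch below (our own helper names, not the mokken code) implements Cronbach's alpha and Guttman's lambda-2; in mokken, these are obtained together with MS and LCRC from check.reliability.

```r
# Cronbach's alpha and Guttman's lambda-2 from an item-score matrix X.
cronbach.alpha <- function(X) {
  J <- ncol(X)
  J / (J - 1) * (1 - sum(apply(X, 2, var)) / var(rowSums(X)))
}
guttman.lambda2 <- function(X) {
  J <- ncol(X)
  C <- cov(X); diag(C) <- 0                 # keep off-diagonal covariances
  (sum(C) + sqrt(J / (J - 1) * sum(C^2))) / var(rowSums(X))
}

# Three parallel (here: identical) items give perfect reliability:
x <- c(0, 1, rbinom(48, 1, 0.5))
X <- cbind(x, x, x)
c(cronbach.alpha(X), guttman.lambda2(X))    # both equal 1
```

Lambda-2 is never smaller than alpha, which is one way to see that alpha is the more pessimistic of the two estimators.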
The MS statistic, Cronbach's alpha, lambda-2, and LCRC have been included in mokken in the function check.reliability.

Table 2: Characteristics of the 12 items measuring transitive reasoning.

Real-data examples
This section is a demonstration of the new methods in mokken. For the larger part of the demonstration, we use the scores of 425 children on 12 items measuring transitive reasoning (Verweij, Sijtsma, and Koops 1996). Transitive reasoning is the ability to deduce a relationship from two or more other relationships of equality or inequality. For example, Item 1 (Table 2) contains three sticks of different length, here labeled A (12 cm), B (11.5 cm), and C (11 cm). First, the test administrator shows the child two pairs of sticks, consecutively; for example, first A and B, and then A and C. Second, the test administrator asks the child whether B is longer than, equal to, or shorter than C. If the item consists of four objects rather than three, the test administrator shows three pairs of objects that, together, are sufficient to deduce the relationships among all objects, and asks for the relationship between the objects in each of the three remaining pairs. Table 2 shows the characteristics of the 12 items. The items are ordered by the proportion of correct answers. For items 11 and 12, transitive reasoning was not a necessary ability to solve the items; they are referred to as pseudo items. For more detailed information, see Verweij et al. (1996). The data set transreas is available from the package mokken. The first variable is the students' grade, which is not used for scaling and is omitted.
R> library(mokken)
R> data(transreas)
R> X <- transreas[, -1]

First, it should be investigated whether all items measure transitive reasoning. If this is true, it is expected that the 12 items form a Mokken scale, that all Hij values are positive, and that all Hj values are greater than or equal to an appropriately chosen lower bound c; here c = .3.

R> coefH(X)
The results (output not included) show that the 12 items do not form a Mokken scale because several negative Hij values were found, and several values of Hj were less than .3.
Second, because the entire set of items does not form a Mokken scale, one should find the largest subset of items that does form a Mokken scale. Both algorithms in the AISP were used.
R> scale.normal <- aisp(X)
R> scale.ga <- aisp(X, search = "ga")
R> cbind(scale.normal, scale.ga)

Unlike the example on page 11, the two algorithms yield the same item partitioning. The two pseudo items (11 and 12) are unscalable (Scale == 0). This seems logical because these items do not measure transitive reasoning. Also, item 5 is unscalable, which is difficult to explain.
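The sensitivity of the partitioning to the chosen lower bound can be explored by rerunning the AISP over a grid of values via the lowerbound argument of aisp. A minimal sketch, assuming the transreas data have been loaded as above:

```r
# Sketch: rerun the AISP for several lower bounds to see how the
# partitioning changes (e.g., when item 5 enters a scale).
library(mokken)
data(transreas)
X <- transreas[, -1]
for (c in c(.25, .30, .35)) {
  cat("lowerbound =", c, "\n")
  print(aisp(X, lowerbound = c))
}
```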
If the lower bound is set lower than .30 (e.g., .25), item 5 is included in the scale. Items 2 and 4 form a separate second scale, which may be explained by the fact that both items pertain to equalities rather than inequalities. Items 1, 3, 6, 7, 8, 9, and 10 form a strong scale: H > .5. Items 8 (a very easy item, p value = .97) and 9 (a difficult item, p value = .30) have a perfect scalability coefficient, which means that none of the children made a Guttman error; that is, none of the children failed the easy item and passed the difficult item. However, if the standard errors are taken into account, the results seem less promising. Notice that the standard errors of the coefficients Hij are generally very large. For example, when computing the bounds of the asymptotic 95% confidence intervals by Hij ± 1.96 · se(Hij), it becomes clear that for some coefficients the population value may be negative (e.g., H1,9 = .336 but P(−.174 < H1,9 < .846) = .95). The standard errors of the Hj coefficients are also rather large, although it is reasonable to assume that the Hj values in the population exceed the lower-bound value .3. Given the standard error of coefficient H, it is plausible that the population value of H does not exceed .5; when computing the 95% confidence interval, we find P(.413 < H < .617) = .95. Therefore, the label 'moderate scale' is more appropriate than 'strong scale'. Several other properties may be investigated, such as monotonicity, using the function check.monotonicity, and nonintersection of item step response functions, using check.pmatrix and check.restscore. These functions are described in Van der Ark (2007). Fourth, one may be interested in the reliability of the selected items. As can be expected for sets of item scores that are not completely unidimensional, coefficients MS and LCRC have higher values than alpha and lambda-2, which underestimate the true test-score reliability. The value of LCRC may show small deviations across runs due to local maxima in the estimation of the latent class models. Also, the value of LCRC may change for different numbers of latent classes. Whether the test-score reliability is high enough depends on the purpose of the test. If the test scores are used to investigate the difference in transitive reasoning between two groups (e.g., boys and girls), the test-score reliability suffices. If the test scores are used in a correlational study, then test-score reliability is more important: the correlation between the test scores and any other variable cannot exceed the square root of the product of the reliabilities of both test scores (Lord and Novick 1968, pp. 69-74). The test-score reliability is most important for individual diagnosis. Following Lord and Novick (1968, p. 59), the lower and upper bounds of the 95% confidence interval of a respondent's true score equal X+ ± 1.96 · σX · √(1 − ρXX′). Using LCRC as a plug-in estimate of ρXX′, and the sample standard deviation as a plug-in estimate of σX, the following code shows the confidence interval bounds for test score X+ = 4. Note that set.seed(1) eliminates differences in rXX over replications due to local maxima in the latent class model. The researcher must decide whether this interval is precise enough for the purpose of the psychological test. As the test-score reliability increases, the interval becomes smaller. Fifth, one may be interested to find out whether the items are invariantly ordered. Here, the items are dichotomous. As a result, only the default option (method = "MIIO") in check.iio can be used.
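The call for the dichotomous items can be sketched as follows. The item selection in the comment is an assumption taken from the partitioning reported above (items 1, 3, 6, 7, 8, 9, and 10); output is omitted:

```r
# Sketch: invariant item ordering for the selected dichotomous items.
# The column selection below assumes the scale found by the AISP above.
library(mokken)
data(transreas)
Y <- transreas[, -1][, c(1, 3, 6, 7, 8, 9, 10)]  # assumed scale membership
summary(check.iio(Y))  # for dichotomous items only method = "MIIO"
                       # (the default) is available
```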

Discussion
With approximately four new versions per year since 2007, mokken is frequently upgraded and new features are added. This is not likely to stop. The to-do list already contains the implementation of several more existing and new methods. The following methods are implemented in MSP (see Molenaar and Sijtsma 2000) but not yet in mokken: MSA for different groups in the sample, especially the check of equal item (step) ordering per group; the search extended option in the AISP; the frequency distribution of the number of Guttman errors; and method "restsplit" for the investigation of nonintersection of item step response functions. Several aspects of MSP have not been implemented in mokken because standard R code can be used to obtain the results; for example, computing the test scores and providing a histogram of the test scores can be done easily using standard R code. The following new methods will hopefully be implemented in mokken: an investigation of local independence based on Straat et al. (2011), a branch-and-bound algorithm for item selection based on Brusco et al. (2011), and confidence regions that are graphically depicted in the plots of estimated item response functions and item step response functions.
check.reliability(X, MS = TRUE, alpha = TRUE, lambda.2 = TRUE, LCRC = FALSE, nclass = nclass.default)

Arguments MS, alpha, lambda.2, and LCRC are Boolean variables indicating whether or not MS, Cronbach's alpha, lambda-2, and LCRC should be computed. Argument nclass denotes the number of latent classes used for computing LCRC; the default value equals J − 1. The output is a list containing the values of the required test-score reliability estimators.
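Because LCRC depends on the chosen number of latent classes, it is prudent to compare its value across several choices of nclass. A hedged sketch (estimation may be slow, and values may vary slightly across runs due to local maxima):

```r
# Sketch: compare LCRC across several numbers of latent classes.
library(mokken)
data(transreas)
X <- transreas[, -1]
set.seed(1)  # reduce run-to-run variation due to local maxima
for (k in 2:5) {
  rel <- check.reliability(X, LCRC = TRUE, nclass = k)$LCRC
  cat("nclass =", k, " LCRC =", round(rel, 3), "\n")
}
```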
R> set.seed(1)
R> rXX <- check.reliability(Y, LCRC = TRUE)$LCRC
R> sigmaX <- sd(apply(Y, 1, sum))
R> cat(" Lower bound:", round(4 - 1.96 * sigmaX * sqrt(1 - rXX), 4), "\n",
+      "Upper bound:", round(4 + 1.96 * sigmaX * sqrt(1 - rXX), 4), fill = TRUE)
Lower bound: 2.7066
Upper bound: 5.2934

R> test.scores <- apply(Y, 1, sum)
R> hist(test.scores)

and the number of negative Hij values per item can be obtained by

R> Hij <- coefH(Y, se = FALSE)$Hij
R> apply(Hij, 1, function(x) sum(x < 0))

Because SOL holds under the monotone homogeneity model (Hemker et al. 1997) and the monotone homogeneity model is the most general IRT model, SOL also holds for other popular IRT models for dichotomous item scores. SOL allows the ordering of persons on the latent trait by their test score. Although, in practice, the test score is almost always used to order or classify persons, test users do not always investigate whether SOL is a reasonable assumption (Van der Ark and Bergsma 2010). Weak SOL allows dividing the sample into a group of respondents having high latent-trait values and a group of respondents having low latent-trait values, using X+ = K as a criterion. The practical relevance of weak SOL is that it allows selecting the n respondents having the highest or lowest latent-trait values by means of X+. Examples are the selection of the 10 most depressed respondents in the sample or the single most qualified respondent in the sample. In general, the stricter form of SOL (Equation 1) does not hold for polytomous items (Hemker et al. 1997), but violations are rare if the number of items exceeds five (Van der Ark 2005).

Following Ligtvoet et al. (2010), as elsewhere in MSA, significance is tested at nominal level alpha without a correction for multiple testing, because each significant violation in itself provides evidence against MIIO (cf. Molenaar and Sijtsma 2000, p. 72). If one or more significant t values are found, the item pair is said to violate MIIO. Adjacent rest-score groups r, r + 1, ..., containing few observations, may be joined to increase statistical power. The conventions for joining rest-score groups used for all methods in MSP and mokken are applied (Molenaar and Sijtsma 2000, p. 67). If MIIO does not hold for all item pairs, items are removed one by one using a backward selection algorithm until no violations remain. For the remaining set of items for which MIIO holds, coefficient HT is computed (HT means coefficient H computed on the transposed data matrix). Ligtvoet et al. (2010) advocated that, if MIIO holds, .3 < HT ≤ .4 may be interpreted as a weak ordering, .4 < HT ≤ .5 as a moderate ordering, and HT > .5 as a strong ordering. Ligtvoet et al.
(2011) extended the method from IIO to latent scales (Equations 3, 4, …). Changing the default values of arguments minvi, minsize, and alpha changes the minimum violation, the minimum number of respondents in each rest-score group, and the nominal Type I error rate of the statistical tests, respectively. If item.selection == FALSE, the backward selection algorithm is omitted, and if verbose == TRUE, additional output is sent to the screen. Investigating IIO is demonstrated using the item scores on the first ten items in the data set acl. Data set acl contains the item scores of 433 first-year psychology students on 218 items from a Dutch version of the Adjective Checklist. They are indicators for a response style called communality. Respondents who have a high test score are particularly good at giving responses that are commonly accepted. Investigating IIO of these 10 items provides information on whether these adjectives are invariantly ordered in popularity. Function check.iio returns an object of class iio.class containing all test results. Typically, check.iio is used in combination with summary, returning a list. The first element of the output ($method) echoes the method used. The second element ($item.summary) shows a summary of the results for each item; it includes itemH = the item-scalability coefficient Hj (Equation 8); #ac = the number of possible violations in which the item can be involved; #vi = the number of actual violations in which the item is involved; maxvi = the maximum violation; sum = the sum of all violations in which the item is involved; tmax = the maximum t statistic; tsig = the number of times the item appears in a significant violation of MIIO; and crit = the crit value (Molenaar and Sijtsma 2000, p. 47), which is a weighted sum of the other components (i.e., itemH, #ac, etc.) and is used by some researchers as a diagnostic statistic. High crit values indicate bad items (for more details and examples see, e.g., Van Schuur 2011, p. 54). The third element ($backward.selection) shows the backward selection procedure: Step 1
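The demonstration described above can be sketched as follows; the output elements correspond to the $method, $item.summary, and $backward.selection components just described:

```r
# Sketch: investigate IIO for the first ten items (communality adjectives)
# of the acl data, using the default method ("MIIO").
library(mokken)
data(acl)
communality <- acl[, 1:10]
res <- summary(check.iio(communality))
res$method              # echoes the method used
res$item.summary        # itemH, #ac, #vi, maxvi, sum, tmax, tsig, crit
res$backward.selection  # items removed per step of the backward selection
```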

Table 1: Default maximum number of generations (maxgens) in the genetic algorithm of the AISP, and computation time (CPU) in minutes and seconds for the two algorithms in the AISP for an increasing number of items. Default settings were used on an ASUS U6Sg notebook under Windows 7.

Table 1 (first row) shows, for 10, 15, …, 40 items, the maximum number of generations in the genetic algorithm and the computation time of the two algorithms in minutes and seconds. Hence, for large numbers of items the genetic algorithm may not be suitable.
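For large item sets the computational burden of the genetic algorithm can be reduced by lowering the maximum number of generations (argument maxgens, as in Table 1). A hypothetical sketch; the value 500 is an arbitrary illustration, and the trade-off is a greater risk of a suboptimal partitioning:

```r
# Sketch: run the genetic-algorithm AISP with fewer generations to
# reduce computation time; maxgens = 500 is an assumed, illustrative value.
library(mokken)
data(transreas)
X <- transreas[, -1]
scale.ga.fast <- aisp(X, search = "ga", maxgens = 500)
```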
This suggests strong support for IIO. Methods check.restscore and check.pmatrix, described in Van der Ark (2007), are also suitable for investigating IIO for dichotomous items. A demonstration of check.iio for polytomous items is taken from the Adjective Checklist. Items 101 to 110 are adjectives that describe an aggressive personality. Negatively worded items are indicated by an asterisk. The 8 selected items have rather different mean item scores, with Item 104 (unkind) being the least popular and Item 105 (impatient) being the most popular. It is investigated whether the ordering of the adjectives is invariant across all levels of aggression. Investigating the least restrictive manifest ordering property, MIIO, yielded the following results (output not included). The eight items do not meet the MS-CPM. Several significant violations were found. Only if four items were removed did no significant violations of MS-CPM remain, leaving a scale consisting of four items. For the four remaining items HT = .48, suggesting moderate support for IIO. Finally, method "IT" can be applied, yielding the following results:

R> summary(check.iio(X, method = "IT"))

A. List of acronyms

AISP : automated item selection procedure
IIO : invariant item ordering
IRT : item response theory
IT : increasing in transposition
LCRC : latent class reliability coefficient
LS-ACM : latent scale for adjacent category models
LS-CPM : latent scale for cumulative probability models
LS-CRM : latent scale for continuation ratio models
MIIO : manifest invariant item ordering
MS : Molenaar Sijtsma statistic
MS-CPM : manifest scale for cumulative probability models
MSA : Mokken scale analysis
SOL : stochastic ordering of the latent trait by the test score