From “Aisle” to “Labile”: A Hierarchical National Adult Reading Test Scale Revealed by Mokken Scaling

Decline in cognitive ability is a core diagnostic criterion for dementia. Knowing the extent of decline requires a baseline score from which change can be reckoned. In the absence of prior cognitive ability scores, vocabulary-based cognitive tests are used to estimate premorbid cognitive ability. It is important that such tests are short yet informative, to maximize information and practicability. The National Adult Reading Test (NART) is commonly used to estimate premorbid intelligence. People are asked to pronounce 50 words ranging from easy to difficult but whether its words conform to a hierarchy is unknown. Five hundred eighty-seven healthy community-dwelling older people with known age 11 IQ scores completed the NART as part of the Lothian Birth Cohort 1936 study. Mokken analysis was used to explore item responses for unidimensional, ordinal, and hierarchical scales. A strong hierarchical scale (“mini-NART”) of 23 of the 50 items was identified. These items are invariantly ordered across all ability levels. The validity of the interpretation of this briefer scale’s score as an estimate of premorbid ability was examined using the actual age 11 IQ score. The mini-NART accounted for a similar amount of the variance in age 11 IQ as the full NART (NART = 46.5%, mini-NART = 44.8%). The mini-NART is proposed as a useful short clinical tool to estimate prior cognitive ability. The mini-NART has clinical relevance, comprising highly discriminatory, invariantly ordered items allowing for sensitive measurement, and adaptive testing, reducing test administration time, and patient stress.

The NART comprises words of graded difficulty starting with more commonly used words, such as "ache" and "chord," and becoming more difficult as it progresses to less frequently used words, such as "syncope" and "campanile." Whereas NART items may be considered as forming an informal hierarchy, as planned by the test's constructors, it is important to investigate item properties explicitly to determine whether the items conform to a formal hierarchy of difficulty and whether this hierarchy is the same for all respondents (i.e., is the ordering for people with higher levels of ability the same as for those with lower ability levels). The effect of ability level on item ordering was investigated by Deary, Watson, Booth, and Gale (2013), who determined that the strength of hierarchies of item ordering of the Warwick-Edinburgh Mental Well-being Scale varied according to the cognitive ability of the sample. Item difficulty represents the ease with which an item is responded to correctly, with mean item scores used to indicate an item's difficulty level. Establishing whether a scale has hierarchical properties adds another dimension to its use beyond total summed scores. If a scale has hierarchical item ordering, it implies that the items are ordered relative to each other and that all are ordered along the latent trait being measured. A hierarchy with the same ordering of difficulty for all subgroups from the population of interest is referred to as invariant item ordering (IIO; Sijtsma & Junker, 1996). IIO simplifies the interpretation of the results by avoiding different item ordering between relevant subgroups, which would warrant further analysis to find the reason for these differences (Sijtsma & Hemker, 1998).
From a clinical perspective, hierarchical tests are attractive for their ease of use and scoring (Kempen, Myers, & Powell, 1995). Confirming hierarchies of item difficulty has meaningful clinical implications; continuing to test patients on words that they are predictably going to be unable to pronounce correctly may cause undue distress without adding any valuable information. Also, responses to individual items and not just total scores can provide insight into a respondent's level of ability based on the item's location in the hierarchy (Watson, Deary, & Austin, 2007). Hierarchical tests have proven valuable in the assessment of several constructs, for example, psychological distress (Watson, Deary, & Shipley, 2008), feeding difficulty in dementia (Watson, 1996), and activities of daily living (Fieo, Watson, Deary, & Starr, 2010; Kempen & Suurmeijer, 1990).
An analogy of climbing a staircase can be used to illustrate the properties of a hierarchical scale. Each step represents an item in a scale. Any given height within the staircase represents the level of latent trait. It follows that you cannot reach the ninth step without having previously climbed the eight steps below; and by having climbed to the ninth step you will not have reached any step above this level.
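The staircase analogy corresponds to what is often called a perfect Guttman response pattern: once a respondent fails an item, no harder item is passed. The following is a minimal illustrative sketch, not part of the original analysis; the function names and the 0/1 coding (1 = correct) are assumptions made for the example.

```python
def is_guttman_pattern(responses):
    """Return True if no correct response follows an incorrect one.

    `responses` is a list of 0/1 scores ordered from the easiest item
    to the hardest item (hypothetical coding: 1 = correct).
    """
    seen_failure = False
    for score in responses:
        if score == 0:
            seen_failure = True
        elif seen_failure:
            # A step was climbed after a lower step was missed.
            return False
    return True


def highest_step(responses):
    """Return the 1-based position of the hardest item passed (0 if none)."""
    highest = 0
    for position, score in enumerate(responses, start=1):
        if score == 1:
            highest = position
    return highest


if __name__ == "__main__":
    print(is_guttman_pattern([1, 1, 1, 0, 0]))  # consistent with a staircase
    print(is_guttman_pattern([1, 0, 1, 0, 0]))  # skips a step
    print(highest_step([1, 1, 1, 0, 0]))
```

Real response data rarely fit this deterministic pattern exactly, which is why probabilistic approaches such as Mokken scaling are used instead.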
The degree to which items in a test form a hierarchy can be determined using Mokken scaling analysis which searches multivariate data for unidimensional, ordinal, and hierarchical scales. Mokken scaling is a nonparametric application of item response theory (IRT) that explores the relationship between items and the latent trait (Watson et al., 2012). Mokken scaling analysis can be applied to examine clinically valuable properties of items within scales, including item discrimination. Discrimination reflects the degree of accuracy with which an item can distinguish between respondents of different levels of the latent trait and indicates the strength of the relationship between an item and the other items within a scale. Considering item discrimination allows for the creation of scales with greater precision without having to increase the number of items. For example, Sabourin, Valois, and Lussier (2005) used IRT methods to create a four-item abbreviated form of the Dyadic Adjustment Scale, which was as effective as the original 32 item scale. Similarly, a 10-item scale was derived from the 19-item Feelings Scale without the loss of measurement precision (Edelen & Reeve, 2007).
IRT methods have been applied to two measures of premorbid intelligence: a French language version of the NART, the fNART (Mackinnon, Ritchie, & Mulligan, 1999), and the Adult Reading Test (ART; Letz et al., 2003). Mackinnon et al. (1999) used a two-parameter logistic IRT model to examine the measurement properties of the 40-item fNART. The discrimination of the scale items varied considerably, with several of the items contributing little to the assessment of premorbid intelligence. A refined 33-item fNART resulted from the elimination of seven items with poor discriminatory power. Letz et al. (2003) fit a one-parameter logistic (Rasch) model to the items of the ART, adapted from the North American Adult Reading Test (NAART; Blair & Spreen, 1989). Rasch analysis provided an improved ordering of difficulty from the original subjective ranking, finding "two" to be one of the least difficult items and "demesne" to be the most difficult item. Results from this Rasch analysis formed the basis for the implementation of a computerized-adaptive ART whereby items are matched to respondents by difficulty. This prevents individuals from being presented with items far beyond their ability level, helping to reduce frustration or anxiety, and minimizes the boredom or carelessness of those with higher ability when faced with very easy items.
The possibility of deriving a briefer scale from the NART from which to estimate premorbid IQ is not new. Beardsall and Brayne (1990) explored the idea of creating a shortened version of the NART. A regression equation was developed based on scores from the first 25 words of the NART to predict scores on the remaining 25 words (i.e., items 26 to 50). This method provided a reasonably accurate estimation of the full NART score, with predicted NART and true NART scores correlating strongly, r = .93, p < .001. While the application of the Short NART left a proportion (23-31%) of the variance unaccounted for, the accuracy with which the Short NART predicted Wechsler Adult Intelligence Scale IQ was effectively equal to that of the full NART (Crawford, Parker, Allan, Jack, & Morrison, 1991). The authors suggest the application of the Short NART with reasonable confidence where helpful or convenient in place of the full scale.
While these studies have analyzed and refined the assessment of premorbid cognitive ability, to our knowledge, there has been no application of Mokken scale analysis to the NART. Therefore the aim of the present study was to examine the item properties and the hierarchical structure of the NART by assessing the fit of the items to Mokken's monotone homogeneity model (MHM) and the nonintersection of item response functions (IRFs). The IRF is the building block of IRT and represents the probability of endorsing an item as a function of the underlying trait (Fraley, Waller, & Brennan, 2000). When the assumptions of the MHM hold, the IRFs increase as levels of the latent trait increase, meaning that respondents can be ordered with respect to their latent trait level based on the summed total score of their responses. Nonintersection, now more commonly referred to as IIO, is an item ordering property whereby the IRFs for total scores on a set of items do not intersect and is crucial for establishing hierarchical scales. In the case of dichotomous items where IRFs are nonintersecting, IIO (formally known as double monotonicity) can be said to hold. Establishing the fit of the data to these models would allow the use of total scale scores (in the case of the MHM) and individual items (IIO) to assess estimated levels of premorbid cognitive ability. Additionally, this analysis aims to determine the contribution of each item. Redundant items can be removed to form a new brief scale.

Participants
The Lothian Birth Cohort 1936 (LBC1936) comprises 1091 community-dwelling older adults most of whom completed the Moray House Test No. Twelve (MHT) (Scottish Council for Research in Education (SCRE), 1933) of verbal reasoning at a mean age of 11 as part of the Scottish Mental Survey of 1947 (Scottish Council for Research in Education, 1949; Deary, Whalley, & Starr, 2009). The Scottish Mental Survey of 1947 (SMS1947) measured the mental ability of almost all Scottish schoolchildren born in 1936 and attending school at age 11 years on June 4th 1947 using the MHT. The MHT is a well-validated measure of general intelligence comprising mostly verbal reasoning items with a maximum possible score of 76. Childhood MHT scores were highly correlated with the Stanford-Binet intelligence test, r = .81 in boys (N = 500) and r = .78 in girls (N = 500; SCRE, 1933). Between 2004 and 2007 those residing in Edinburgh and the Lothians who may have taken part in the SMS1947, who were then approximately age 70, were contacted and invited to participate in the LBC1936. The Community Health Index was used to identify potential participants born in 1936. All participants spoke English as their first language.
Social class was derived from the participants' reported highest occupational level as well as that of participants' fathers. Social class for the participants was calculated using the Office of Population Censuses and Surveys; Classification of Occupations, 1980. Social class of participants' fathers was calculated using the General Register Office's Census, 1951 Classification of Occupations. Both were classified as one of six categories from I (professional) to V (unskilled) with lower numbers designating higher social class. Married women also reported the occupation of their spouses which was used if higher. The recruitment and testing of this cohort has been described in detail elsewhere (Deary, Gow, Pattie, & Starr, 2012; Deary et al., 2007).
Participants in the LBC1936 returned for detailed cognitive and physical testing from age 70 (wave 1, N = 1091), and item level responses to the NART were recorded at wave 3 (2012), at a mean age of about 76 years. Age 70 IQ was measured by the MHT (M = 65.7, SD = 7.7) corrected for age in days at time of testing, and converted to an IQ score (mean IQ = 102.42, SD = 13.16).
Self-reported medical background was obtained for all participants at the cognitive and physical assessment. After excluding those who had a self-reported clinical history of dementia (N = 8), data from all other participants returning at wave 3 with complete NART item level data were included for analysis (N = 587, 51% male). Mini Mental State Examination (MMSE) (Folstein, Folstein, & McHugh, 1975) scores indicated that 99.6% of this sample scored ≥23. The characteristics of study participants are shown in Table 1.
The percentage of respondents correctly pronouncing the NART items was used to indicate level of item difficulty with lower percentages indicating greater degree of difficulty.

Mokken Scaling
Mokken scaling analysis was applied to investigate whether the ordering of items by difficulty is the same for all respondents, making it invariantly ordered.
The fit of the items to Mokken scaling properties was assessed by examining whether they conformed to the four assumptions: unidimensionality, local stochastic independence, monotonicity, and nonintersection. Mokken scaling analysis was performed using the Mokken package in R (van der Ark, 2007). These assumptions were investigated using a hierarchical clustering algorithm, scalability coefficients, latent monotonicity, and the H T coefficient.
Automated item selection procedure (AISP). The assessment of unidimensionality involves an automated item selection procedure (AISP) which partitions items into scales, or groups of related items measuring a common latent trait, using a hierarchical clustering algorithm. The AISP is a bottom-up sequential item selection method based on interitem covariances and the strength of the association between the items and the latent trait. The process begins with the selection of the pair of items with the highest positive item-pair scalability coefficient (H ij ). Further items are then added one at a time, and the procedure continues until no remaining item meets the selection criterion. From any items remaining unselected a new scale can be formed in the same way. Any items remaining outside a scale are deemed unscalable (Sijtsma & Molenaar, 2002).
Scalability coefficients. Item, item-pair, and scale scalability coefficients are computed and used as criteria for partitioning items into scales and as measures of strength of the scales. Item scalability coefficients (H i ) express item discrimination. This coefficient is analogous to an item-test correlation or a factor loading. Item-pair scalability coefficients (H ij ) reflect the joint scalability of item pairs. The scale scalability coefficient (H) expresses the strength of the overall scale. A general rule of thumb for interpretation of H exists: scales with values below 0.3 are not considered unidimensional, values between 0.3 and 0.4 are considered weak, values between 0.4 and 0.5 are indicative of medium strength, and values greater than 0.5 can be considered strong (Mokken, 1971).
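To make the three scalability coefficients concrete, the sketch below computes them for dichotomous items using the standard covariance-ratio definition (observed covariance divided by the maximum covariance attainable given the item marginals). It is a hedged illustration in plain Python, not the routine used in the Mokken package, and it assumes that every item has at least one correct and one incorrect response; the function names are hypothetical.

```python
from itertools import combinations


def _cov_and_max(x, y):
    """Observed covariance and maximum possible covariance of two 0/1 items."""
    n = len(x)
    px = sum(x) / n
    py = sum(y) / n
    pxy = sum(a * b for a, b in zip(x, y)) / n
    cov = pxy - px * py
    # With fixed marginals, the joint proportion cannot exceed min(px, py).
    cov_max = min(px, py) - px * py
    return cov, cov_max


def scalability(data):
    """Return (H_ij per item pair, H_i per item, H for the scale).

    `data` is a list of respondent rows, each a list of 0/1 item scores.
    Assumes no item is answered correctly by everyone or by no one.
    """
    k = len(data[0])
    cols = [[row[i] for row in data] for i in range(k)]
    pair_cov, pair_max = {}, {}
    for i, j in combinations(range(k), 2):
        pair_cov[i, j], pair_max[i, j] = _cov_and_max(cols[i], cols[j])
    h_ij = {p: pair_cov[p] / pair_max[p] for p in pair_cov}
    h_i = []
    for i in range(k):
        pairs = [p for p in pair_cov if i in p]
        h_i.append(sum(pair_cov[p] for p in pairs) /
                   sum(pair_max[p] for p in pairs))
    h = sum(pair_cov.values()) / sum(pair_max.values())
    return h_ij, h_i, h


if __name__ == "__main__":
    toy = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 0]]
    h_ij, h_i, h = scalability(toy)
    print(h_ij, h_i, h)
```

In this toy data set the last row breaks the hierarchy, which lowers H below the value of 1 obtained for perfectly hierarchical data.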
Latent monotonicity. The assessment of monotonicity is important as it enables the respondents to be ordered on the latent trait with respect to the summed score of the scale. Items violating this assumption can be detected and removed if necessary. To avoid the model being rejected due to trivially small violations only violations greater than the default minimum of 0.03 are considered relevant (van der Ark, 2007).
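One common way to inspect monotonicity is to group respondents by their rest score (the total score excluding the item under study) and to check that the proportion answering the item correctly does not decrease as the rest score increases. The sketch below is a simplified illustration of that idea under stated assumptions: it does not merge small rest-score groups or test violations for significance, both of which a full implementation would normally do, and the function name and `minvi` parameter name are chosen for the example only.

```python
def monotonicity_violations(data, item, minvi=0.03):
    """List decreases larger than `minvi` in proportion correct on `item`.

    `data` is a list of respondent rows of 0/1 scores. Respondents are
    grouped by rest score (total score minus the score on `item`).
    Returns tuples (lower_rest_score, higher_rest_score, size_of_decrease).
    """
    groups = {}
    for row in data:
        rest = sum(row) - row[item]
        groups.setdefault(rest, []).append(row[item])
    rests = sorted(groups)
    props = [sum(groups[r]) / len(groups[r]) for r in rests]
    violations = []
    for a in range(len(rests)):
        for b in range(a + 1, len(rests)):
            decrease = props[a] - props[b]
            if decrease > minvi:
                violations.append((rests[a], rests[b], decrease))
    return violations


if __name__ == "__main__":
    hierarchical = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
    print(monotonicity_violations(hierarchical, item=0))
```

The threshold of 0.03 mirrors the minimum violation size mentioned above; smaller decreases are ignored as trivial.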
Invariant item ordering. The method to investigate IIO used here is advocated by van der Ark (2012) and can be performed using the Mokken package in R by running the check.iio command. Here, all violations of IIO are detected and removed. The item with the largest violation is removed first and the remaining items checked again for IIO violations. This is done iteratively as the exclusion of one item may affect the IIO violations of the remaining items. Scalability coefficient H T is computed and is considered a measure of the accuracy of item ordering within a scale with a similar rule of thumb for interpretation as H (Ligtvoet et al., 2010).
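The intuition behind IIO can be illustrated with a deliberately crude check: split the sample into lower and higher scorers and look for item pairs whose difficulty ordering reverses between the two groups. This is only a sketch for intuition and is not the procedure implemented by check.iio, which works with rest-score groups and iterative item removal; the median split, the function name, and the `min_diff` threshold are assumptions made for the example.

```python
from itertools import combinations


def ordering_reversals(data, min_diff=0.03):
    """Return item pairs whose difficulty order flips between score groups.

    `data` is a list of respondent rows of 0/1 scores. Respondents are
    split at the median total score into a lower and a higher group.
    """
    totals = [sum(row) for row in data]
    cut = sorted(totals)[len(totals) // 2]
    low = [row for row, t in zip(data, totals) if t < cut]
    high = [row for row, t in zip(data, totals) if t >= cut]
    if not low or not high:
        raise ValueError("median split produced an empty group")
    k = len(data[0])
    p_low = [sum(row[i] for row in low) / len(low) for i in range(k)]
    p_high = [sum(row[i] for row in high) / len(high) for i in range(k)]
    reversals = []
    for i, j in combinations(range(k), 2):
        d_low = p_low[i] - p_low[j]
        d_high = p_high[i] - p_high[j]
        # A reversal: item i is easier than j in one group, harder in the other.
        if (d_low > min_diff and d_high < -min_diff) or \
           (d_low < -min_diff and d_high > min_diff):
            reversals.append((i, j))
    return reversals


if __name__ == "__main__":
    crossing = [[1, 0, 0], [1, 0, 0], [0, 1, 0],
                [0, 1, 1], [0, 1, 1], [1, 1, 1]]
    print(ordering_reversals(crossing))
```

An empty result is consistent with, but does not establish, an invariant ordering; two groups are far too coarse for a formal conclusion.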
Reliability can be estimated using the Molenaar-Sijtsma statistic (MS) (Molenaar & Sijtsma, 1984). MS provides a direct estimate of the test score reliability with MS > 0.7 indicative of a reliable scale.

Graphical Analysis
The R package KernSmoothIRT (Mazza, Punzo, & McGuire, 2014) was used to graphically present item properties. The package applies kernel smoothing in the estimation of item response functions and related graphical analysis. It provides several plotting and analytical methods to consider properties of the items, subjects, and test as a whole. The exploratory nature of the package makes it ideal to be used alongside Mokken analysis since it provides plots which can be helpful when examining the monotonicity and discrimination of items. For more details on this package see Mazza et al. (2014).

Validation
The present study had access to childhood IQ scores which enabled the retrospective validity of the NART items as proxies for prior cognitive ability across the life span to be assessed. The correlation between NART items and prior and concurrent cognitive ability, both measured by converting MHT scores at age 11 and age 70 into IQ scores, was investigated. Regression and correlation analyses were performed using SPSS v. 19.0. Table 1 shows descriptive statistics for the sample variables. Mean (SD) total NART score for this sample was 35.3 (7.7), equivalent to an IQ of 112.3 (based on regression equations calculated by Nelson and Willison (1991)). The mean (SD) MHT score at age 11 for this sample of the LBC1936 cohort was 50.6 (11.6) compared with a mean of 36.7 (16.1) for Scotland (N = 70,805) (Deary, Gow, Pattie, & Starr, 2012; SCRE, 1949). Converted to an IQ score, the mean for this sample, 0.864 standard deviations above a population mean of 100 (SD = 15), is 113.
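The arithmetic behind this conversion can be sketched as a simple standardization against the Scottish population values quoted above. This is an illustration only: the study's own conversion also corrected for age in days at testing, which is not modeled here, and the function name is hypothetical. Using the rounded means and standard deviation from the text reproduces the reported figures to within rounding.

```python
def mht_to_iq(score, pop_mean, pop_sd, iq_mean=100.0, iq_sd=15.0):
    """Standardize an MHT score against a population and rescale to IQ units.

    Returns (z, iq), where z is the number of population standard
    deviations that `score` lies above `pop_mean`.
    """
    z = (score - pop_mean) / pop_sd
    return z, iq_mean + iq_sd * z


if __name__ == "__main__":
    # Sample mean 50.6; Scottish mean 36.7 and SD 16.1 (values from the text).
    z, iq = mht_to_iq(50.6, 36.7, 16.1)
    print(round(z, 2), round(iq))
```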

Results
Ordering the items from least to most difficult (Table 2) reveals several inconsistencies between the ordering by mean scores in this sample and the test administration order. For example, "capon" and "drachm," which are seventh and 33rd in the test administration order, respectively, are the 22nd and 50th items in the ordering by sample mean scores.
Note. NART = National Adult Reading Test; NART order = item number of word order in current NART testing procedure/hierarchy (i.e., Item 1, "chord," presented first); Correct (%) = percentage of respondents correctly pronouncing the items, with higher percentages indicating lower difficulty.
The Mokken automated item selection procedure partitioned 38 of the 50 items into one scale, three items into a second scale, and determined the remaining nine items to be unscalable (see Appendix A for a table of items in each scale). The scalability coefficients of the 38 items of scale 1 were examined. All item-pair scalability coefficients (H ij s) were non-negative and all item scalability coefficients were above 0.3, indicating that these 38 items belong in the same unidimensional Mokken scale. There were no significant violations of monotonicity. All 38 items of this abbreviated NART form a Mokken scale meeting MHM criteria (H = 0.471, SE = 0.017) (see Appendix B for a table of 38 abbreviated NART items ordered by discrimination).

The Mini-NART
Removing the items that violated IIO resulted in a 23-item scale (the "mini-NART") which had no more significant violations of IIO (Table 3). The total scale scalability coefficient for this subset was 0.534 (SE = 0.017), indicating a strong Mokken scale. H T was 0.71, which indicates sufficient accuracy of item ordering within this scale (Ligtvoet, van der Ark, Te Marvelde, & Sijtsma, 2010). Reliability was very high (MS = 0.89).
The pattern of correlations between the NART and the mini-NART and IQ measured at age 11 and age 70 is presented in Figure 1. The NART and the empirically derived mini-NART correlated positively with age 11 IQ (NART: r = .68, p < .001; mini-NART: r = .67, p < .001). Both original and short versions of the NART correlated with age 70 IQ (NART: r = .66, p < .001; mini-NART: r = .62, p < .001).
To investigate the predictive accuracy of the total score from the 23-item mini-NART, regression analyses were carried out. The mini-NART accounted for 44.8% of the variance in age 11 IQ, tested 65 years previously in this sample, whereas the full version of the NART accounted for 46.5% of the variance. The 38-item abbreviated NART, conforming to the properties of the MHM, accounted for 48.3% of the variance. The regression equations (with 95% confidence interval (CI)) estimating an individual's premorbid cognitive ability from performance on the mini-NART and NART are presented below:
For ease of use, the table in Appendix C converts NART, abbreviated NART, and mini-NART scores to predicted IQ scores using these regression equations.

Item Discrimination
Looking at some items rejected by Mokken scaling, it is clear that some NART items are not adequately distinguishing between respondents and are not contributing much to the accurate estimation of premorbid functioning. Figure 2 graphically presents the discriminatory power of two items of the NART: "leviathan" (mini-NART) and "radix" (unscalable). These IRFs, produced by KernSmoothIRT, provide a representation of item discrimination. The slope here reflects the rate of change, designating the level of effectiveness at any point along the latent trait (DeJong & Molenaar, 1987). The poor discrimination value (H i = 0.001) of item 19 ("radix") is reflected in the relatively flat IRF. This means that large differences in ability are associated with very modest changes in the probability of correctly pronouncing "radix." Practically speaking, two people of different levels of ability are likely to achieve the same score on this item. This item adds little information to the overall estimate of premorbid cognitive ability as some respondents of different levels of ability have similar response profiles. The curve of item 43 ("leviathan") is very steep in the region of higher ability, with small differences in ability at this level associated with substantial differences in the likelihood of correctly pronouncing the word.

Discussion
The present study investigated the hierarchical nature of the NART by determining whether the data conformed to the assumptions of the MHM and IIO in 587 mostly healthy older adults with prior IQ measured at age 11. It demonstrated the utility of Mokken scaling and graphical analyses in exploring item level responses in the NART.
Two subscales within the NART were revealed: (a) a 38 item abbreviated NART scale conforming to the MHM, and (b) a 23 item mini-NART with IIO. The items in the abbreviated NART can be stochastically ordered by degree of latent trait. However this ordering is not invariant across respondents of different levels of latent trait, that is, the total score of this abbreviated NART, but not individual items, can be used by clinicians and researchers to obtain an estimation of a respondent's level of premorbid cognitive ability.
The mini-NART, comprising only items strongly related to the latent trait with good discrimination values, conforms to a strong and invariantly ordered hierarchy. This adds value and clinical relevance to a scale since it implies a consistent ordering of items which is invariant for all values of the latent trait. Individual items within the mini-NART can be used to approximate a respondent's level of premorbid cognitive ability. A person's estimated prior cognitive ability can be represented by the score on a single item in the Mini-NART, the most difficult item correctly responded to. This scale could be applied adaptively whereby only a section of the NART either in the higher or lower difficulty range of the scale needs to be applied, according to the ability of the individual patient. The test can be administered in order of ascending difficulty starting with "aisle" or descending difficulty starting with "labile." For example, a participant who is able to correctly pronounce "labile" or "sidereal" would most likely be able to pronounce all other (less difficult) items in the scale. Likewise, any participant unable to correctly pronounce "aisle" or "debt" would most likely be unable to correctly pronounce any of the other (more difficult) words.
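One way such adaptive administration could work, sketched here purely as an illustration, is a binary search over the difficulty-ordered items: each response halves the range in which the hardest correctly pronounced item can lie. The sketch assumes a perfectly hierarchical respondent (every item easier than a passed item is also passed), which real responses will only approximate, and it is not a procedure described in the original study; the function and parameter names are hypothetical.

```python
def adaptive_score(ask, n_items):
    """Locate the hardest item passed using binary search.

    `ask(k)` returns True if item k (1-based, easiest first) is answered
    correctly. Assumes a perfect hierarchy: passing item k implies passing
    every easier item. Returns (estimated_score, items_administered).
    """
    lo, hi = 0, n_items  # items 1..lo are passed; items above hi are failed
    asked = 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        asked += 1
        if ask(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo, asked


if __name__ == "__main__":
    # Hypothetical respondent who can pronounce the 17 easiest of 23 items.
    score, asked = adaptive_score(lambda k: k <= 17, 23)
    print(score, asked)
```

Under these idealized assumptions a 23-item hierarchy needs at most five items per respondent; in practice, a clinician would likely administer a few neighboring items as well to guard against isolated inconsistent responses.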
Administering IIO scales adaptively can help to reduce the time needed to test patients, reducing the burden placed on the patient and helping to diminish the stress or frustration of the patient (van der Lee, Roorda, Beckerman, Lankhorst, & Bouter, 2002). Although the NART in full is a relatively quick scale to administer, the reading of progressively more difficult and infrequently encountered words aloud may still cause embarrassment and anxiety among those who are experiencing difficulty. Participants with early dementia or mild cognitive impairment with awareness of declining cognitive abilities are likely to be anxious facing a lengthy test battery. Shorter tests with less potential for distress and embarrassment may reduce the likelihood of participants withdrawing from testing, and may be particularly useful in clinical (medical) environments where time is limited. Adaptive testing or tailored assessment appears to be increasingly appealing in addressing the need for quick and reliable measurement. Ware et al. (2003) reported that an adaptive form of the Headache Impact Survey performed better than the traditional version in terms of reducing respondent burden, measuring change over time, and in test reliability and validity. Like the Rasch-derived computerized-adaptive ART (Letz et al., 2003), the mini-NART can be applied adaptively but, importantly, without the expense and practical implications of testing patients with a computerized test.
IRT methods can be used to ensure a scale is measuring what it is designed to measure (Langenbucher et al., 2004; Noerholm et al., 2004). With regard to the NART, 12 items were identified that did not conform to the unidimensional MHM, indicating that in this sample the NART in full includes items not measuring the same latent trait. Also, Mokken scaling suggests that "drachm," "topiary," and "prelate" form a separate cluster which may measure something other than premorbid cognitive ability. The inclusion of these items may mean that the total NART score does not solely reflect premorbid cognitive ability. Rasch analysis of the ART, which has several items in common with the NART, identified "aeon" and "banal" as candidates for removal on the basis of misfit statistics (Letz et al., 2003). Neither of these items was retained in the mini-NART, which adds validity to the removal of these items from the full NART.
By removing poor discriminatory items, the mini-NART with similar predictive accuracy was identified. We have found that adding extra items to the mini-NART does not increase the amount of variance of age 11 IQ explained in this sample. This mini-NART, like the Short NART, offers predictive accuracy effectively equal to that of the full scale. However the mini-NART avoids the complications of the Short NART testing process. Beardsall and Brayne (1990) suggest testing patients on the first half (Short NART) and applying a regression equation to predict the full score for patients scoring between 12 and 20 on this Short NART. If a patient scores less than 12 on the Short NART this score should be taken as the full NART score and for those scoring over 20 the full NART should be administered to determine their score. To observe these discontinuation rules a tally of errors must be kept during testing. Short NART total scores must then be converted to a NART error score before premorbid ability can be estimated. The mini-NART requires no extra calculations and has the distinct advantage of being a hierarchical scale.
One limitation of the mini-NART as a means of estimating premorbid cognitive ability is that with only 23 words, it is not as finely graded as the full 50-item scale or the 38-item abbreviated NART. With only 23 items it may not differentiate as efficiently between the higher levels of cognitive ability since its ceiling level of 23 items is predictive of an IQ score of 119. In this sample of 587 participants, 59 have IQ scores greater than 119. However, using the full 50-item NART, this ceiling is only extended by approximately two IQ points to 121. An estimated IQ based on a maximum score should be interpreted as a lower-limit estimate only, with a mini-NART score of 23 indicative of an IQ of 119 or higher.
The present analysis demonstrates the utility of IRT in examining item properties of established scales and how this insight can be used in the development of a shorter hierarchical scale. This study applied novel methods in a well-characterized sample with relatively large numbers. A particular strength of this study is the availability of a valid intelligence test score from age 11 for the sample, which ensures the scores are free from age-related decline. This permitted the validity of the mini-NART to be assessed using the actual premorbid cognitive ability. Dykiert and Deary (2013) and Crawford, Deary, Starr, and Whalley (2001) also utilized the prior ability of the LBC to examine the retrospective validity of the NART. Due to the rarity of actual premorbid ability data, previous validation studies typically compared NART performance with measures of current abilities (Crawford et al., 1989; Nelson, 1982).
Some limitations of the study should be noted. The self-selected LBC1936 cohort is not fully representative of the population. First, the cohort is geographically restricted. The LBC1936 cohort is also somewhat restricted in range with regard to childhood cognitive ability. The individuals in this sample are of a higher than average ability level, scoring almost 14 MHT points higher at age 11 than their peers across Scotland (Scottish Council for Research in Education, 1949; Deary, Gow, Pattie, & Starr, 2012). This is reflected in how few items there are with low percent correct in the NART in this above-average ability sample. Performing the same analysis on a more representative sample with lower cognitive abilities, with fewer participants approaching ceiling performance for many items, would be a valuable extension to this analysis. Also, this analysis was carried out using a sample of elderly participants without self-reported dementia. The self-reported history of dementia is subject to the accuracy of recall. However, with only 1% of participants scoring less than 24 points on the MMSE, suggesting possible dementia, the sample is mostly cognitively healthy. To examine the generalizability of these findings it is necessary to examine the accuracy of the mini-NART in a cross-validation sample before applying the scale in clinical practice. Replication using participants with a range of abilities, and diagnoses of dementia and mild cognitive impairment, is necessary to investigate the performance of the mini-NART in pathological cognitive decline. Also, the NART and mini-NART account for less than 50% of the reliable variance in premorbid cognitive ability, leaving a significant percentage unaccounted for. However, this is a lower-bound estimate which does not account for restriction of range or measurement error.
The value of H T here is very high and, as such, it is worth noting that in some cases elevated H T values can be caused by violations of local stochastic independence (Watson, Wang, & Thompson, 2014). Local stochastic independence is violated when items within a scale are linked (i.e., the response to one item is dependent on the response to another). In the case of the NART local stochastic independence is very unlikely to have been violated since the responses are not dependent on each other.
One possible reason why IIO did not hold for some items is that people's knowledge of some of the more difficult and unusual words, some of which depend on specialist experience (e.g., medical terms like syncope, puerperal), is quite unpredictable, which will have an effect on responses. This could also help to explain the inconsistencies between the item ordering by mean scores and the test administration order. The effect of regional variation in pronunciation is also likely to contribute to this irregular response ordering. With regard to unscalable items, it is possible that poor agreement between raters could be partly responsible. Crawford et al. (1989) found "aeon" to have an agreement rate closer to chance than perfect agreement, which could help to explain why this item did not follow the typical pattern of response one would expect.

Conclusions
Good scales with good psychometric properties, including IIO, are sought for accurate assessment in clinical practice and this paper demonstrates how Mokken scaling can help contribute to this goal. Mokken scaling analysis revealed that some NART items do not contribute to the measurement of premorbid cognitive ability in this sample and identified other items whose contribution is low. This analysis identified a useful, unidimensional, and highly discriminatory scale within the NART; the mini-NART, a hierarchical subset of 23 invariantly ordered items. While further research to support the validity of the mini-NART, particularly in populations more representative of the general population, is necessary, the 23-item scale is presented as a promising alternative to the original NART for both clinicians and researchers. The mini-NART could prove to be of clinical and practical benefit in the estimation of premorbid cognitive ability.