Item-Level Scores on the Boston Naming Test as an Independent Predictor of Perirhinal Volume in Individuals with Mild Cognitive Impairment

We explored the methodological value of an item-level scoring procedure applied to the Boston Naming Test (BNT), and the extent to which this scoring approach predicts grey matter (GM) variability in regions that sustain semantic memory. Twenty-seven BNT items administered as part of the Alzheimer’s Disease Neuroimaging Initiative were scored according to their “sensorimotor interaction” (SMI) value. Quantitative scores (i.e., the count of correctly named items) and qualitative scores (i.e., the average of SMI scores for correctly named items) were used as independent predictors of neuroanatomical GM maps in two sub-cohorts of 197 healthy adults and 350 mild cognitive impairment (MCI) participants. Quantitative scores predicted clusters of temporal and mediotemporal GM in both sub-cohorts. After accounting for quantitative scores, the qualitative scores predicted mediotemporal GM clusters in the MCI sub-cohort; clusters extended to the anterior parahippocampal gyrus and encompassed the perirhinal cortex. This was confirmed by a significant yet modest association between qualitative scores and region-of-interest-informed perirhinal volumes extracted post hoc. Item-level scoring of BNT performance provides complementary information to standard quantitative scores. The concurrent use of quantitative and qualitative scores may help profile lexical–semantic access more precisely, and might help detect changes in semantic memory that are typical of early-stage Alzheimer’s disease.


Introduction
Semantic memory (SM) is a multidimensional ability that enables the processing of semantic knowledge (SK). SK representations are stored throughout the entire cortex [1,2], and, as with information processed by any other type of memory, these are learnt and, subsequently, accessed via mechanisms of encoding and retrieval, respectively. A very large meta-analysis based on an "Activation Likelihood Estimate" methodology indicates that the system responsible for encoding and retrieving SK is supported by a widespread cerebral network (the SM network) that involves the limbic and hippocampal/parahippocampal regions, but also extends to the inferoparietal, middle-temporal, and prefrontal cortices [3].

BNT Literature Review
The BNT is a test of picture naming that, over the years, has been used in clinical research to characterise anomias and deficits of semantic processing in individuals with neurological conditions. Individuals at a mild stage of dementia of the AD type perform worse than controls on the BNT [34], and their performance is a significant predictor of further cognitive decline after 30 months [35]. Individuals with dementia of the AD type also perform significantly worse on the BNT than individuals with a form of dementia with Lewy Bodies of comparable severity [36], and the performance of individuals with a diagnosis of frontotemporal dementia depends on the pathophysiological profile of the disease. Individuals with a behavioural variant obtain similar levels of performance to that of individuals with AD dementia, whereas individuals with a semantic variant perform significantly worse [37,38]. Data indicate that the total BNT score of individuals with amnestic MCI is within the range of normality. These individuals, however, tend to make disproportionately more semantic errors (such as paraphasias and circumlocutions) [39], but, at the same time, they also tend to provide significantly fewer spontaneous responses than those of healthy controls, suggesting that naming difficulties in this group cannot be fully accounted for by SK disruption [40]. Findings obtained from neuroimaging studies provide further evidence in support of a range of diverse mechanisms characterising BNT performance in individuals with MCI and AD dementia. Reduced BNT performance at the MCI stage is associated with GM density in the left anterior temporal lobe. This association extends to the bilateral mediotemporal lobe when individuals with MCI and dementia of the AD type are analysed as part of a single inferential model [41], indicating that disease processes may contribute to altering the underlying neurocognitive mechanisms of BNT performance at some point along the clinical trajectory. 
Moreover, changes in volume in the left hippocampus in a sample defined along the continuum between MCI and AD dementia are predictive of longitudinal BNT decline over a span of ~2 years [42]. When MCI individuals are evaluated as a function of their subsequent progression to AD dementia, performance on the BNT appears to be a significant predictor of atrophy progression. In fact, AD converters who were low-BNT MCI performers develop significantly more atrophy than high-BNT MCI performers in a large set of prefrontal, temporal, and parietal regions one year after conversion to dementia [43]. Taken together, these findings suggest that the neurological resources that support performance on the BNT at the MCI stage are influenced by disease mechanisms, but are also associated with the integrity of a series of regions known to sustain memory retrieval and semantic control.
In light of this heterogeneous pattern, the aforementioned study hypotheses were addressed to test the added value of a BNT item-level score, under the assumption that this qualitative index predicts patterns of GM density independent of those predicted by "standard" quantitative scores.

Participants
The Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu) was identified as an appropriate source of data to address the study hypotheses. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. For up-to-date information, see www.adni-info.org.
The online repository of the ADNI initiative includes item-level scores of individual BNT performances (see Section 2.2.1. for more details). These can be found within the "Item Level Data" database, a spreadsheet that includes scores from individual items of a number of tests and batteries [44]. As of April 2023, availability of item-level data is limited to the first phase of ADNI (i.e., "ADNI1") that includes > 700 participants with a clinical diagnosis of "control", "MCI", or "AD dementia". For the purpose of this study, control (n = 205) and MCI (n = 369) individuals with available cognitive and good quality neuroimaging data were considered for inclusion. As repeated BNT performance is characterised by practice effects [45], baseline data only were scrutinised to address the study hypotheses. All participants included in this study were between the ages of 55 and 90, had a Mini Mental State Examination score between 24 and 30, were fluent in English or Spanish, and were supported by a reliable informant that could contribute to the diagnostic procedures. All healthy controls had a Clinical Dementia Rating score of 0 and no symptoms indicating the presence of depression or cognitive impairment. MCI participants instead had a Clinical Dementia Rating score of 0.5, a subjective memory complaint, objective memory decline (as informed by education-corrected scores on Wechsler Memory Scale Logical Memory II), no decline in activities of daily living, and no dementia.

Boston Naming Test Scores
Based on the original 60-item version of the test [32], a short version of the BNT is included as part of the ADNI neuropsychological battery, and this consists of the 30 odd-numbered (i.e., 1, 3, 5, 7, 9, etc.) trials only. Each individual BNT performance was initially inspected for data-cleaning purposes. At this stage, 8 control and 17 MCI datasets were discarded because of missing data on one or more BNT item responses (marked by ADNI as "999"), leaving a cohort of 197 cognitively normal controls and 352 individuals with MCI. Items were then classified as "correct" (if correctly named on the first attempt) or "incorrect" (including those named only after the provision of a cue).
The values for word-related SMI were obtained from a normative study carried out on a cohort of >1200 native English speakers [46]. SMI values were not available for 3 of the 30 words ("OCTOPUS", "VOLCANO", and "TRELLIS"). As a result, item-level scoring was carried out on the remaining 27 items. For each individual, the following scores were calculated: (1) a quantitative BNT score based on the 27 items that had a valid SMI value; and (2) a qualitative BNT score obtained by averaging item-level SMI scores of correctly named items. A qualitative BNT score could not be calculated for 2 MCI participants due to missing data, as they had correctly named <2 items. These datasets were discarded, and this left us with a sub-cohort of 350 MCI individuals. All participants included in this study were tested in English.
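The two scores described above can be illustrated with a minimal Python sketch. This is not the authors' scoring script: the function name `score_bnt`, the data layout, and the SMI values for "BED" and "WHISTLE" are assumptions made for illustration only (the value for "RHINOCEROS" is the one quoted in this article). Items without a normative SMI value are excluded from both scores, and no qualitative score is returned when fewer than 2 items are named correctly, mirroring the exclusion rule reported in the text.

```python
from statistics import mean

# SMI norms keyed by item name. Values for BED and WHISTLE are made up
# for illustration; RHINOCEROS uses the value quoted in the article.
# None marks an item with no normative SMI value (e.g., OCTOPUS).
SMI_NORMS = {"BED": 5.2, "WHISTLE": 4.6, "RHINOCEROS": 3.680, "OCTOPUS": None}


def score_bnt(responses, smi_norms, min_correct=2):
    """Return (quantitative, qualitative) scores for one participant.

    responses maps item name -> True if the item was named correctly on
    the first attempt, False otherwise. Items lacking an SMI norm are
    ignored in both scores.
    """
    valid = [item for item in responses if smi_norms.get(item) is not None]
    correct = [item for item in valid if responses[item]]
    quantitative = len(correct)
    # No qualitative score when too few items were named correctly.
    if quantitative < min_correct:
        return quantitative, None
    qualitative = mean(smi_norms[item] for item in correct)
    return quantitative, qualitative


# Hypothetical participant: OCTOPUS is correct but has no SMI norm,
# so it contributes to neither score.
participant = {"BED": True, "WHISTLE": True, "RHINOCEROS": False, "OCTOPUS": True}
quantitative, qualitative = score_bnt(participant, SMI_NORMS)
print(quantitative, qualitative)
```

In this toy case the quantitative score counts the two correctly named items with a valid norm, and the qualitative score is the mean SMI of those same two items.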

MRI Processing
Volumetric T1-weighted images (all acquired at 1.5 T) were extracted from each individual MRI protocol archived by ADNI. This was carried out by minimising the temporal distance (in days) between BNT and MRI dates. Images were acquired in compliance with the ADNI1 MRI specification protocol [47]. All files were preprocessed and analysed with Statistical Parametric Mapping 12 (Wellcome Centre for Human Neuroimaging, London, UK), running under MATLAB (version R2014b; Mathworks Inc., Natick, MA, USA). Images were manually reoriented according to their bicommissural axis, and then they underwent segmentation into three tissue-specific probabilistic maps: GM, white matter, and cerebrospinal fluid. Global volumetric indices (in ml) were calculated for each native-space output file using the "get_totals" script (www0.cs.ucl.ac.uk/staff/g.ridgway/vbm/get_totals.m). The values of the three tissue maps were summed up for each participant to obtain individual indices of intracranial volumes. A brain-parenchymal ratio was then calculated (GM plus white matter volume divided by intracranial volume) to obtain proxies of tissue density. Finally, segmented GM maps were modulated, normalised to the Montreal Neurological Institute space, and smoothed (with a 6 mm full-width at half maximum Gaussian kernel) for the purpose of data modelling.
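The arithmetic behind the brain parenchymal ratio can be sketched as follows. The function name and the example volumes are hypothetical; the formula follows the description above, with intracranial volume taken as the sum of the three tissue volumes.

```python
def brain_parenchymal_ratio(gm_ml, wm_ml, csf_ml):
    """Return (GM + WM) / intracranial volume.

    Intracranial volume is the sum of the GM, white matter, and
    cerebrospinal fluid volumes, all expressed in ml.
    """
    icv = gm_ml + wm_ml + csf_ml
    if icv <= 0:
        raise ValueError("intracranial volume must be positive")
    return (gm_ml + wm_ml) / icv


# Hypothetical volumes (ml), chosen only to make the arithmetic easy.
print(brain_parenchymal_ratio(600.0, 450.0, 350.0))  # 0.75
```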
As the aim of this study was to establish the link between GM and both quantitative and qualitative aspects of BNT performance, an additional neuroanatomical index was computed. This was to regress out an aspect of variability that is typically associated with neurodegenerative changes due to AD, and that can constitute a major confound in this type of analysis. The WFU Pickatlas toolbox [48], in tandem with the human Brodmann Atlas, was used to define a left hippocampal region of interest (ROI). The "get_totals" script was then used to extract individual left hippocampal volumes from the set of GM tissue maps. To confirm the validity of these scores, maps of structural covariance were calculated (see Sections 2.3 and 3 for more details).

Data Analyses
Two sets of multiple-regression inferential models were designed to test the linear association between GM maps and the quantitative and qualitative aspects of BNT performance. This was carried out separately for the sub-cohorts of controls and MCI individuals.
Model 1 tested the statistical effect of quantitative scoring (i.e., the count of correctly named items, out of 27). Four covariates were added to regress out aspects of variability not related to the study hypothesis. First, years of education were included as a proxy of cognitive reserve, as carried out in recent clinical and epidemiological research [49,50]. In alignment with a bi-componential framework of reserve [51] and with a dual view of active and passive processes contributing to inter-individual variability in the association between retained neural resources and behavioural functioning [52], two MRI-derived indices were also added to the models as covariates: brain parenchymal ratio and left hippocampal volume (see Section 2.2.2 for methodological details on how these values were obtained). The former was included as a dynamic index of brain reserve [53], whereas the latter served as a proxy of neurodegeneration typically associated with AD (i.e., a left-lateralised ROI was chosen, given the language-based nature of the BNT). Age, finally, was added as a fourth covariate.
Model 2 tested the statistical effect of qualitative scoring (i.e., the average SMI score of correctly named BNT items). This model was corrected for the four aforementioned covariates and, additionally, for quantitative scores that were included as a fifth covariate. This served to isolate the effect of qualitative scoring that is statistically independent of quantitative scoring. In Model 2, we tested the negative association between qualitative scores and GM, based on the principle whereby words with a lower SMI score tend to be considered more difficult.
Given the exploratory nature of the analyses, both Model 1 and Model 2 used a cluster-forming threshold of p < 0.01 and were corrected for Family-Wise Error (FWE) at the cluster level. No further corrections for multiple comparisons were applied. Only clusters surviving an FWE-corrected p < 0.05 were reported as significant. To verify maps of structural covariance of the left hippocampus, two multiple-regression models (one for each sub-cohort) were launched, using ROI volumes as a predictor of global GM maps. These analyses were corrected for age, intracranial volume, and GM ratio, and were thresholded at a complete FWE-corrected p < 0.05. Peak coordinates were converted from the Montreal Neurological Institute space to the Talairach space via a nonlinear transform and were interpreted using the Talairach Daemon client (www.talairach.org/daemon.html).
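The structure of the two models can be sketched at the level of a single voxel with ordinary least squares. This is a simplified illustration, not the Statistical Parametric Mapping implementation used in the study: all data below are random numbers standing in for the real variables, the helper `fit` is hypothetical, and no thresholding or cluster-level correction is attempted.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 350  # size of the MCI sub-cohort reported in the text

# Synthetic stand-ins for the real variables (random numbers, not ADNI data).
education = rng.normal(15, 3, n)
bpr = rng.normal(0.75, 0.03, n)          # brain parenchymal ratio
hippocampus = rng.normal(3.0, 0.4, n)    # left hippocampal volume
age = rng.normal(75, 7, n)
quantitative = rng.integers(10, 28, n).astype(float)  # count out of 27
qualitative = rng.normal(4.9, 0.1, n)    # mean SMI of correct items
gm_voxel = rng.normal(0.5, 0.1, n)       # GM value at one voxel


def fit(predictor, covariates, y):
    """Fit y ~ intercept + predictor + covariates by least squares.

    Returns the regression coefficient of the predictor of interest.
    """
    X = np.column_stack([np.ones(len(y)), predictor, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])


base = [education, bpr, hippocampus, age]

# Model 1: quantitative score, adjusted for the four covariates.
beta_model1 = fit(quantitative, base, gm_voxel)

# Model 2: qualitative score, adjusted for the same four covariates
# plus the quantitative score as a fifth covariate.
beta_model2 = fit(qualitative, base + [quantitative], gm_voxel)

print(beta_model1, beta_model2)
```

Because the quantitative score enters Model 2 as a covariate, the coefficient for the qualitative score reflects only the variance it does not share with the item count.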

Descriptive Variables
Main demographic descriptors and other variables relevant to this study are summarised in Table 1. On average, MRI data were acquired at a temporal distance of approximately 22 days from the administration of the BNT (i.e., maximum absolute distance: 83 and 92 days, in control and MCI individuals, respectively). Although the two sub-cohorts were analysed separately, controls were slightly older and had significantly more balanced (i.e., ~50%/50%) proportions of males and females. As expected, a larger proportion of APOE ε4 carriers was observed among MCI individuals. The latter also scored significantly lower on the Mini Mental State Examination and on the quantitative BNT performance (calculated on either all 30 items or only on those 27 with a valid SMI score). The proportion of controls and MCI individuals who named each of the 27 BNT items correctly (and the statistical differences between the two diagnostic groups) is reported in Table 2. As expected, words with a higher SMI score tended to be named more often, as shown by a positive coefficient of correlation between SMI scores and the proportion of participants who correctly named each item (controls: rho = 0.455; MCI participants: rho = 0.488; n = 27 items in both cases).
"RACQUET" was scored based on the American English spelling "RACKET", as reported by SMI norms [46]. The difference between "RHINO" (SMI = 3.875) and "RHINOCEROS" (SMI = 3.680) was negligible, and the latter score was used. Items 13 ("OCTOPUS"), 23 ("VOLCANO"), and 57 ("TRELLIS") were not included in the analyses, as they lacked a normative SMI score. Abbreviations: BNT: Boston Naming Test; MCI: mild cognitive impairment; N/A: not assessed (no difference between the two sub-cohorts); SMI: sensorimotor interaction.

Model 1: Quantitative BNT Performance
In the sub-cohort of controls, a significant positive association was found between quantitative BNT performance and GM density in three large clusters centred around the temporal and mediotemporal lobe, bilaterally, and extending to part of the prefrontal, cerebellar, and striatal territory (Table 3, Figure 1A). In the sub-cohort of individuals with a diagnosis of MCI, a significant positive association was found between quantitative BNT performance and GM density in three clusters (Table 4). Two of these were located in the temporal and mediotemporal lobe, bilaterally, and a third one stretched to the territory of the anterior right insula. When compared with the clusters found in the sub-cohort of controls, these results covered a larger portion of the temporal lobe, expanding to the superior temporal cortex and middle temporal gyrus (Figure 1B).

Model 2: Qualitative BNT Performance
No significant results were found in the sub-cohort of cognitively healthy controls. In the sub-cohort of individuals with MCI, a significant negative association was found between average SMI and GM density in two ventromedial clusters located bilaterally in the temporal lobe (Table 5; Figure 1C). These included the anterior fusiform gyrus (BA20), the entorhinal cortex (BA28), and the perirhinal cortex (BA35), and extended to the right posterolateral cerebellum (Crus II).
No significant cluster was found in association with somatosensory or motor areas.

Post Hoc ROI Analyses
To complement voxel-based models, ROI analyses were run with a specific focus on the mediotemporal regions that are typically linked to SM retrieval [9]. The same methodology described in Section 2.2.2 in relation to the left hippocampus was adopted to also extract regional volumes from the right hippocampus, the left and right entorhinal cortices (Brodmann Area 28), and the left and right perirhinal cortices (Brodmann Area 35). ROI validity was scrutinised via structural covariance models (like those described in Section 2.2.2 in relation to the left hippocampus). The outcome maps confirmed the validity of the extracted scores (see Supplementary Figure S1 for an illustration of these resulting maps).
To test for the independent association between mediotemporal volumes and qualitative BNT scores, standardised residual scores were calculated to regress out quantitative BNT performance from qualitative scores. Coefficients of correlation were then run to test the association between the residual scores and volumes of the aforementioned selected brain regions. Of all one-tailed p-values (significance was tested in a single direction to reflect the outcome of whole-brain analyses), only a single p-value was significant. There was a modest but significant correlation between the BNT residual qualitative score and the volume of the left perirhinal cortex in the MCI sub-cohort (r = −0.089, n = 350, p = 0.048; Figure 2).
Figure 1. Voxel-based analyses showing significant associations with (A) quantitative BNT performance in the sub-cohort of cognitively healthy controls; (B) quantitative BNT performance in the MCI sub-cohort (positive, green); and (C) qualitative BNT performance in the MCI sub-cohort (negative, blue). Analyses were adjusted for age, years of education, brain parenchymal ratio, and left hippocampal volume. Statistical parametric maps are rendered on the Montreal Neurological Institute 152 T1-weighted template.
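The residualisation step and the subsequent correlation can be sketched as follows. The data are synthetic, the helper names are hypothetical, and "standardised" is implemented here as simple z-scoring of the residuals, which is an assumption about the exact procedure. The last line applies the standard conversion from a Pearson coefficient to a t statistic (with n − 2 degrees of freedom) to the reported coefficient; this gives a value of about −1.67, which would then be compared against Student's t distribution to obtain a one-tailed p-value.

```python
import numpy as np


def standardised_residuals(y, x):
    """Residuals of y after simple linear regression on x, z-scored."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid / resid.std(ddof=1)


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def t_from_r(r, n):
    """t statistic for a Pearson correlation (n - 2 degrees of freedom)."""
    return float(r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2))


# Synthetic stand-ins (random numbers, not ADNI data).
rng = np.random.default_rng(1)
n = 350
quantitative = rng.integers(10, 28, n).astype(float)
qualitative = 5.5 - 0.02 * quantitative + rng.normal(0, 0.05, n)
perirhinal = rng.normal(1.0, 0.1, n)

# Remove the variance that qualitative scores share with the item count,
# then correlate what is left with the regional volume.
resid = standardised_residuals(qualitative, quantitative)
r = pearson_r(resid, perirhinal)
print(r, t_from_r(r, n))

# Conversion applied to the coefficient reported in the text.
print(t_from_r(-0.089, 350))
```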

Discussion
We studied the association between GM structural integrity and the quantitative and qualitative aspects of BNT performance, scored via an item-level method. To operationalise qualitative aspects, we focused on the SMI score of each word in order to explore semantic difficulty in association with a well-defined thematic aspect (linked to pericentral somatosensory and motor regions). The findings indicate that item-level scores predict an independent portion of GM variability in the temporal and mediotemporal lobe of individuals with MCI, with a major peak of significance emerging from the perirhinal portion of the anterior parahippocampal gyrus. This supports our first hypothesis, as qualitative BNT scoring expands the characterisation of semantic processing beyond "traditional" BNT scores. However, our second hypothesis is not supported by the findings, as no significant association was found in the proximity of somatosensory or motor areas. Qualitative scores were also associated with GM density in the right posterolateral cerebellum. This cerebellar region (corresponding to Crus II) shows functional connectivity with the cerebral hubs of the default-mode network [54] and is involved in semantic processing [55].
Successful naming of BNT items depends on a set of cognitive abilities, including sensorimotor functioning (i.e., visual perception and phono-articulatory skills), visual recognition, SM, and lexical access [56]. Although it is not possible to link incorrectly named items to deficits or setbacks in any specific function from the above list, people who are neurologically healthy and individuals diagnosed with MCI typically do not have any sensorimotor difficulties. Moreover, the sense of familiarity elicited by BNT images is typically very high among older adults, with scores ranging between 7.00 and 6.37 (the latter was in association with "ABACUS", i.e., the 60th and, presumably, the hardest BNT item) on a 1 to 7 Likert scale (with 1 being "not at all familiar" and 7 being "very familiar") [57]. Comparably, an analysis of familiarity elicited in an anterograde visual recognition task showed that this very basic sub-process of declarative memory is preserved in MCI [58]. This suggests that incorrect BNT trials are mainly due to difficulties in lexical/SM access in these two populations. Although qualitative and quantitative scores were strongly correlated in the sub-cohort of MCI individuals (r = −0.720, n = 350), the variance inflation factor between the two variables was within the acceptable range (VIF = 2.078), and qualitative scores predicted an independent portion of GM density in the temporal and mediotemporal lobe, i.e., a region that plays a central role in SM. This, however, only emerged as a modest correlation when data were analysed via an ROI-based approach, indicating that, although qualitative BNT performance does provide additional evidence about the integrity of regions that sustain lexical-semantic access beyond quantitative scores, this added value is limited.
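For two predictors, the variance inflation factor follows directly from their correlation as VIF = 1 / (1 − r²). The short sketch below (function name hypothetical) applies this textbook formula to the rounded coefficient quoted above. It yields a value close to the reported one, with the small difference presumably reflecting rounding of r.

```python
def vif_two_predictors(r):
    """Variance inflation factor for two predictors with correlation r."""
    r_squared = r * r
    if r_squared >= 1.0:
        raise ValueError("predictors are perfectly collinear")
    return 1.0 / (1.0 - r_squared)


# Using the rounded correlation reported in the text.
print(round(vif_two_predictors(-0.720), 3))  # 2.076
```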
The limited number of items included in the ADNI version of the BNT (i.e., n = 30) leads to ceiling effects more easily than the original 60-item version of the test, and this cluster of datapoints contributes to the strength of the correlation between quantitative and qualitative scores (i.e., we found that 50 controls and 42 MCI participants obtained a flawless 27 out of 27 quantitative score, and this resulted in an average SMI of 4.931 for all of them). We argue that, with the inclusion of more BNT items and the subsequent increase in numerical variability, the relation between quantitative and qualitative performance may become weaker, similarly to what is observed with the Category Fluency Test.
In a study carried out in a sample of 76 early-AD participants, the standard, quantitative performance on the abbreviated 15-item BNT included as part of the Consortium to Establish a Registry for Alzheimer's Disease was associated with a standardised uptake ratio of the radiolabelled Pittsburgh Compound B tracer (i.e., sensitive to amyloid pathology) in the left entorhinal cortex [59]. A second neuromolecular study was carried out in a sample of 64 participants with AD of variable severity (mild to severe) and used AV1451 (also known as "Flortaucipir"), a radiotracer that selectively binds to neurofibrillary tangles of hyperphosphorylated TAU. BNT performance was not associated with tracer uptake in the mediotemporal lobe in this cohort, but was associated with a widespread pattern that extended to the anterior cingulate and to the frontal, temporal, and parietal lobes [60]. This finding indicates that quantitative BNT scores might be sensitive to underlying AD-related changes in the neural tissue of the mediotemporal lobe, but this might be visible only during the earliest phases of the disease. In this respect, the pattern of mediotemporal GM associations we found in both sub-cohorts (Figure 1A,B) might be partly linked to variability in underlying AD mechanisms. No study, to our knowledge, has investigated the link between the neuromolecular imaging of AD pathology and BNT in healthy controls. A study carried out in a mixed sample (n = 96) of healthy older controls and participants with AD who underwent lumbar puncture, however, found that a lower amyloid/TAU ratio is a significant predictor of BNT performance [61]. This indicates that more studies are needed to characterise the link between quantitative BNT performance and early AD pathology.
In light of the findings of this study, we argue that the additional inclusion of qualitative BNT scores may allow for a more fine-grained characterisation of neuroimaging-informed patterns indicative of amyloid and TAU pathology.
Although BNT items are roughly sorted in order of difficulty (in this study, from Item 1: "BED" to Item 59: "PROTRACTOR"), item-level scores assessing words' "frequency of use" indicate that the difficulty levels of individual items fluctuate throughout task administration [62]. This legitimises a more in-depth focus on individual items, as different items are associated with different semantic difficulties, and two performances that are quantitatively equal may be qualitatively different. Interestingly, however, our second hypothesis is not supported by the results, as we did not find any significant cluster/coordinate that extended to the somatosensory or motor area. The main reason we selected SMI as an index of semantic difficulty was to be able to visualise its independent effects, as these may have been visible in a topographically different set of regions. SMI facilitates semantic processing (as referents with higher potential for interaction are associated with a stronger semantic activation [63]) and is also linked to a better verbal free recall performance [64].
The findings of the current study, however, suggest that SMI ended up being a general index of semantic difficulty, rather than a specific feature capturing potential for body-object interactions. Arguably, a larger set of BNT items will result in larger score variability, and this, in turn, may allow SMI to express its construct validity fully.
The visual comparison of the findings obtained in the MCI sub-cohort suggests that qualitative scores may be less influenced by semantic control processes than quantitative scores. The unique portion of GM variability accounted for by qualitative scores, in fact, was limited to temporal, mediotemporal, and cerebellar default-mode-related areas, whereas quantitative scores were associated with a much wider pattern that also extended to other cortical regions. Although it is not possible to interpret naming difficulties (i.e., resulting in variability in quantitative scores) in terms of failure in access/control processes, standard quantitative scoring gives the participant a second and third chance by providing them with semantic and phonological cues that help direct semantic control resources [65]. Conversely, the variability of qualitative scores is captured as an average of correctly named items and should thus be less susceptible to the influence of semantic control. This suggests that qualitative scores could complement the standard quantitative scores in constructing the profile of lexical-semantic access (and, more generally, declarative memory) of individuals who undergo neuropsychological testing.

Limitations
As this study was, to our knowledge, the first one to explore the item-level scoring of BNT performance, we gave particular emphasis to methodological and feasibility-related aspects, but did not equally investigate the clinical value of the approach (e.g., by focusing on diagnostic differences). We expect that this methodology may be effective, for instance, at detecting abnormal lexical-semantic processing in tertiary-care neurological settings or in samples of community-dwelling adults. This is an area that will be worth investigating with ad hoc research projects. Along similar lines, we relied on ADNI's clinical diagnoses but did not include biomarker status in our design. This investigation was mostly concerned with the effectiveness of the methodology, but we recognise that the inclusion of biomarkers (e.g., to test the association between item-level lexical-semantic indices and peripheral levels of pathology) could help establish the effectiveness of this method at the level of the individual clinical referral. A third clinical aspect to consider is longitudinal follow-ups and the extent to which this methodology can highlight AD progression and/or is influenced by test-retest practice effects. Finally, applications in the context of other neurological conditions (e.g., semantic dementia), in relation to other axes of semantic difficulty (e.g., 'age of acquisition' or 'frequency of use') or, more generally, the transposition of item-level methodological principles to the scoring of other verbal tests (e.g., the Prose Memory Test [66]) are also warranted.

Conclusions
The aim of this research was to assess the value of item-level scoring of BNT performance in relation to "standard" quantitative scoring. Whereas quantitative BNT scores are routinely used as a measure of lexical-semantic access, qualitative scores may provide a more fine-grained proxy of SM that is less influenced by semantic control. To test this idea, we designed two voxel-based models assessing the linear association between GM density and quantitative and SMI-informed qualitative BNT scores. We found that the item-level scoring of BNT performance is a significant predictor of temporal and mediotemporal GM in people with MCI. Albeit modest, this statistical effect is independent of traditional quantitative scores that are based on an item count. This may inform the design of ad hoc studies that could help evaluate the clinical potential of this approach.
It is important to design alternative methodologies to maximise the informativity of neuropsychological tests. This is particularly relevant for methodologies that can be applied retrospectively, not by collecting new data, but by devising novel opportunities based on innovative post-processing approaches. Item-level scores obtained from verbal