Factor structure and measurement invariance of a neuropsychological test battery designed for assessment of cognitive functioning in older Mexican Americans

Introduction: The present study sought to investigate the measurement invariance of commonly used neuropsychological tests in an ethnically (Hispanic vs. non-Hispanic) and linguistically (Spanish vs. English) diverse sample.
Methods: Participants were 736 middle-aged and older adults (mean age = 62.1, SD = 9.1) assessed at baseline. Measurement invariance testing was performed using multiple-group confirmatory factor analysis.
Results: A five-factor model (memory, attention/executive functioning/processing speed, language, visuospatial, and motor) fit the data well (CFI = 0.979, RMSEA = 0.047) and the composite reliability of the factors ranged from .76 (visuospatial) to .97 (motor). The five-factor model was found to possess strict measurement invariance for ethnicity and language without a decrement in fit compared to a strong (scalar) invariance model (ΔCFI = .000, ΔRMSEA = .002).
Discussion: These results indicate that a five-factor model is suitable for estimating cognitive functioning in Mexican Americans and non-Hispanic whites without bias by ethnicity or language.


Background
Assessment of cognitive functioning is an essential tool in cognitive aging and neurodegenerative disease research. For instance, cognitive test scores often serve as outcome variables when studying group differences or rates of change over time, or when evaluating the impact of an intervention [1,2]. Furthermore, neuropsychological tests are often used to make inferences about the absence or presence of a latent pathological process, such as Alzheimer's disease, or to assist with differential diagnosis among competing possibilities [3]. Clinical neuropsychologists rely heavily on cognitive test scores to identify a patient's strengths and weaknesses and use these results to make targeted recommendations for intervention and care [4]. Neuropsychological assessment is a noninvasive method for capturing useful information about the behavioral manifestations of an underlying neurodegenerative disease.
Although neuropsychological assessment is useful for understanding patterns of cognitive decline, this pursuit can be complicated by the considerable heterogeneity in cognitive phenotypes, not only across different neurodegenerative conditions, but across groups of individuals who differ along one or more dimensions [5,6]. In other words, factors such as racial and ethnic diversity are associated with differences in the clinical presentations of those with both normal and pathological aging. When examining group differences in cognitive test results, it is important to distinguish between true differences in ability and differences that arise due to measurement error. Bias refers to error that varies systematically with other grouping variables. However, group differences in cognitive test results are not in and of themselves reflective of bias; differences in life history variables like education and environmental enrichment can cause real differences in cognitive abilities that may be validly captured by test score differences [7][8][9][10]. Therefore, to help disentangle true differences in ability from systematic error variance, it is essential to validate the ability of a cognitive battery to make unbiased measurements of cognition across diverse groups.
Measurement invariance is the term used to describe the ability of a test score to estimate an underlying trait with equal validity across groups or over time [11,12]. The specific types of measurement invariance and their mathematical properties have been described in detail elsewhere [12][13][14]. Because of the importance of making comparisons of group mean differences, we sought to determine whether a comprehensive battery of cognitive tests possesses at least "strong" (scalar) invariance for estimating cognitive functioning across different groups. This type of invariance testing uses confirmatory factor analysis (CFA) to constrain the model's factor loadings and intercepts to be equal across groups. If such a model fits the data well, and its fit is not substantially worse than that of a less constrained model ("weak," or metric, invariance, which only applies equality constraints to factor loadings), then differences in group means on the factors being estimated can be interpreted validly [15].
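The constraints described above can be written out in standard multiple-group factor-analytic notation. The sketch below is illustrative only; the symbols are assumptions introduced here and do not appear in the original text: x is the vector of test scores for person i in group g, τ the intercepts, Λ the factor loadings, η the latent factors with group mean κ, and ε the residuals with mean zero.

```latex
\begin{align}
x_{ig} &= \tau_g + \Lambda_g \eta_{ig} + \varepsilon_{ig}
  && \text{(unconstrained measurement model)} \\
\Lambda_g &= \Lambda \quad \text{for all } g
  && \text{(metric, or weak, invariance)} \\
\Lambda_g &= \Lambda,\ \tau_g = \tau \quad \text{for all } g
  && \text{(scalar, or strong, invariance)} \\
\mathrm{E}[x_{ig}] - \mathrm{E}[x_{ig'}] &= \Lambda\,(\kappa_g - \kappa_{g'})
  && \text{(implied by scalar invariance)}
\end{align}
```

Under the scalar constraints the intercepts cancel in the last line, so a difference in expected test scores between two groups can arise only from a difference in their latent factor means. This is why at least scalar invariance is needed before group means are compared.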
The goal of the present study is to examine the measurement invariance properties of a cognitive test battery used in the assessment of Mexican Americans and non-Hispanic white Americans when administered in either Spanish or English. We seek to determine the type of invariance that can be achieved with this battery. Strict or strong invariance would allow for the battery to be used to compare differences in group means, whereas weak or configural invariance would suggest the possibility that group mean comparisons could be systematically affected by bias, thus not allowing valid group mean comparisons to be made. Our approach to establishing the factor structure and measurement invariance of this battery was modeled after the work of Park et al. [16] with the Alzheimer's Disease Neuroimaging Initiative neuropsychological battery.

Participants
Participants were 741 volunteers, aged 50 years and older, in the Health and Aging Brain among Latino Elders (HABLE) study who provided informed consent to participate in this research. More specific details about the HABLE study have been published previously [17][18][19][20]. Briefly, the HABLE study is a community-based epidemiological study that focuses on understanding cognitive changes in a predominantly Mexican American sample recruited from Tarrant County, Texas. As part of this larger ongoing study, participants undergo a review of medical history, medications, and health behaviors; neuropsychological assessment; blood collection; and a medical evaluation that includes a review of systems, Hachinski Ischemic Score, and neurological examination. Participants were excluded from neither the parent HABLE study nor the present study on the basis of dementia severity. The evaluation was completed in English or Spanish depending on the participant's preference, using standard versions of each test, described in Section 2.3 below.

Procedure
From the initial sample of 741 participants, we identified three predominant ethnicity/language subgroups: those with Hispanic ethnicity whose primary language was English (n = 110), those with Hispanic ethnicity whose primary language was Spanish (n = 489), and those with non-Hispanic ethnicity whose primary language was English (n = 137). There was only one participant who reported non-Hispanic ethnicity and Spanish as primary language; because there were no other individuals with this ethnicity/language combination in the sample, this participant was excluded. An additional four participants were excluded because they reported English as their first language but were tested in Spanish, resulting in a total sample size of 736 for analysis.

Measures
Neuropsychological assessment was performed using the following instruments, collected at participants' baseline study visit.

Consortium to Establish a Registry for Alzheimer's Disease List Learning
The Consortium to Establish a Registry for Alzheimer's Disease (CERAD) list learning task provides measures of immediate and delayed verbal memory. In this test, participants are shown a list of 10 words at the rate of one word every 2 seconds. Immediately after, they are asked to recall as many words as possible. This is completed three times, and the order of word presentation changes with each trial. The sum across the three learning trials is the variable used for immediate memory in the present study (CERAD total). Following a brief delay, participants are again asked to recall as many words as they can remember from the list (CERAD delayed recall). Finally, participants are asked to complete a recognition task where familiarity with target items and foils is endorsed or denied (CERAD recognition) [21].

Wechsler Memory Scale-III Logical Memory I and II
This test provides a measure of verbal auditory memory and delayed retention. Participants are presented with short stories read aloud and are instructed to try to remember the stories exactly as they are told. They are then asked to repeat the stories back as best as they can remember (Logical Memory I). After a delay of approximately 30 minutes, participants are again asked to recall as much of the stories as they can (Logical Memory II) [22].

Wechsler Adult Intelligence Scale-III Digit Span
This test contains two components: forward and backward Digit Span. In the forward portion of the test, participants are read a series of numbers and asked to repeat them back exactly as heard. The backward portion presents participants with a series of numbers and asks them to repeat the numbers back in reverse order. This subtest provides measures of working memory and attention [23].

Executive Interview
This is a screening measure of executive functioning abilities and neurological soft signs that can be used to identify individuals with executive functioning deficits that may be associated with increased risk of functional deficits. The executive interview is a clinician-administered scale with 25 items, with each item scored as 0 (intact), 1 (mild deficits), or 2 (severe deficits). Higher scores reflect more pronounced deficits in executive functioning abilities. Scores were recoded to match the direction of other data (higher scores reflecting better performance) by subtracting each participant's observed score from the maximum observed score in the sample (27) [24].

Trail Making Test
Trail Making Test parts A and B provide information on processing speed, mental flexibility, and executive functioning. Part A provides the participant with a sheet of paper displaying circles containing the numbers 1-25 in an array spread over the page. The participant is asked to connect the numbered circles in order as quickly as possible, beginning at 1 and ending at 25. Part B is similar, but includes letters and numbers. Participants are asked to draw the lines while alternating between numbers and letters. For both tests, completion time is the primary outcome measure; scores were recoded to match the direction of other data (higher scores reflecting better performance) by subtracting each participant's observed score from the maximum observed score (217″ for A and 377″ for B) [25].
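The recoding used for the Executive Interview and Trail Making Test scores (maximum observed score minus each participant's score) can be sketched as follows. This is a minimal illustration rather than the authors' code; the function name and the example completion times are hypothetical and are not study data.

```python
def reverse_score(scores, max_observed=None):
    """Recode scores so that higher values mean better performance.

    Each score is replaced by (maximum observed score - score).
    Missing values (None) are passed through unchanged.
    """
    observed = [s for s in scores if s is not None]
    if max_observed is None:
        # Default to the largest score seen in the sample.
        max_observed = max(observed)
    return [None if s is None else max_observed - s for s in scores]


# Hypothetical completion times in seconds (illustrative only).
tmt_a_seconds = [35, 62, 217, None, 48]
print(reverse_score(tmt_a_seconds))  # [182, 155, 0, None, 169]
```

Because the transformation is linear, it changes only the sign of each variable's association with the others, which keeps all factor loadings in the same direction.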

Boston Naming Test
The Boston Naming Test is a test of confrontation naming where up to 60 line-drawn pictures of common objects are presented to examinees who are asked to provide the name for each picture. Higher scores represent better naming ability [26].

FAS and Animal Fluency
The FAS test provides the participant with a letter (F) and then asks them to name as many different words beginning with that letter as they can in 1 minute. This is then repeated with the letters A and S. The total number of words generated across the three trials (excluding proper nouns and the same words with different endings) is the outcome variable of interest. Similarly, the Animal Fluency test asks the participant to name as many animals as they can within 1 minute.

CLOX: An Executive Clock Drawing Task
This test contains two parts. In CLOX1, the participant is presented with a blank piece of paper and is instructed to draw a clock with the hands set to 1:45. In CLOX2, the participant watches the examiner draw a clock that is set to 1:45 inside a circle. The participant is then asked to copy the clock. Higher scores represent better performance [27].

Grip Strength Test
A hand dynamometer is used to measure grip strength in participants' hands bilaterally. Two trials per hand are administered to obtain a measure of gross motor function. Higher values represent greater grip strength [25].

Model selection
To identify the most appropriate model to be subjected to measurement invariance testing, we hypothesized a model based on the specific test scores available in our battery and the published literature pertaining to the cognitive domains underlying neuropsychological test performance [28,29] with particular emphasis on a similar study performed using the Alzheimer's Disease Neuroimaging Initiative neuropsychological data [16]. The hypothesized model contained five factors: (1) memory, (2) attention/executive functioning/processing speed, (3) language, (4) visuospatial, and (5) motor. We modeled residual correlations between indicator variables sharing method variance for the three CERAD variables with one another, the two Logical Memory variables with each other, FAS with animal fluency, the two Trail Making Test variables with each other, the two Digit Span variables with one another, and the same-handed Grip Strength variables with one another. To judge the quality of model fit, we relied on the comparative fit index (CFI), Tucker-Lewis index (TLI), root mean square error of approximation (RMSEA), and standardized root mean square residual (SRMR). Good fit is indicated by CFI and TLI values > .95, RMSEA values (including 90% confidence intervals) < .06, and SRMR values < .09 [30]. All models described in this study were run using a robust full information maximum likelihood estimator and all indicator variables were standardized before being analyzed.
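As a rough illustration of how these indices relate to the model and baseline chi-square statistics, the sketch below uses the textbook (unscaled) formulas and then applies the cutoffs stated above. It is an assumption-laden simplification: robust estimators report scaled versions of these statistics, software packages differ on whether RMSEA uses N or N - 1, and multiple-group RMSEA may include a further adjustment. The function names are hypothetical.

```python
import math


def fit_indices(chi2_model, df_model, chi2_base, df_base, n):
    """Textbook CFI, TLI, and RMSEA from model and baseline chi-squares."""
    # Noncentrality estimates, floored at zero.
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, 0.0)
    denom = max(d_model, d_base)
    cfi = 1.0 if denom == 0 else 1.0 - d_model / denom
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    # Undefined when the baseline ratio equals 1 (not handled in this sketch).
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    # Uses N - 1; some programs use N instead.
    rmsea = math.sqrt(d_model / (df_model * (n - 1)))
    return {"cfi": cfi, "tli": tli, "rmsea": rmsea}


def meets_cutoffs(cfi, tli, rmsea_upper, srmr):
    """Apply the cutoffs in the text: CFI, TLI > .95; RMSEA < .06; SRMR < .09."""
    return cfi > 0.95 and tli > 0.95 and rmsea_upper < 0.06 and srmr < 0.09


# Hypothetical chi-square values (illustrative only).
print(fit_indices(chi2_model=150.0, df_model=100, chi2_base=2100.0, df_base=100, n=500))
```

Passing the upper confidence limit of the RMSEA to `meets_cutoffs` reflects the requirement that the interval, not just the point estimate, fall below .06.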

Measurement invariance testing
After ensuring good model fit in the entire sample, we examined model fit in each of the three groups separately to ensure a reasonably good fitting model in each group before proceeding with more formal measurement invariance testing. Our approach to measurement invariance testing followed typical procedures, in that we began by specifying a configural invariance model and then incrementally applied increasingly restrictive equality constraints across groups. The procedure involved moving from configural to metric (weak) invariance, scalar (strong) invariance, and finally strict (residual variance) invariance models. Meaningful changes in model fit were judged by ΔCFI values of .01 or greater [13,31] and the likelihood ratio χ² difference test [32]. Data were analyzed using Mplus version 8 [33] and R version 3.4.2 [34], including the lavaan (version 0.5-23.1097) [35] and semTools (version 0.4-14) [36] packages.
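The ΔCFI part of this decision rule can be sketched as a walk up the sequence of nested models, stopping at the first step where CFI drops by .01 or more. This is a simplified illustration, not the authors' procedure: it omits the χ² difference test, and the function names and CFI values below are hypothetical.

```python
def compare_nested(fits, threshold=0.01):
    """Compare each model with the previous, less constrained one.

    fits: list of (name, cfi) ordered from least to most constrained.
    Returns (name, cfi_drop, meaningful) for each added set of constraints.
    """
    results = []
    for (_, prev_cfi), (name, cfi) in zip(fits, fits[1:]):
        # Round to three decimals, as CFI is conventionally reported.
        drop = round(prev_cfi - cfi, 3)
        results.append((name, drop, drop >= threshold))
    return results


def highest_level_supported(fits, threshold=0.01):
    """Return the most constrained model reached before a meaningful CFI drop."""
    supported = fits[0][0]
    for name, _, meaningful in compare_nested(fits, threshold):
        if meaningful:
            break
        supported = name
    return supported


# Hypothetical CFI values (illustrative only).
fits = [("configural", 0.980), ("metric", 0.979), ("scalar", 0.978), ("strict", 0.978)]
print(highest_level_supported(fits))  # strict
```

With a robust maximum likelihood estimator, the accompanying χ² difference test generally requires a scaled difference statistic rather than a simple subtraction of the two χ² values.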

Results
Participant demographic data, dementia outcome variables, and neuropsychological test scores are shown in Table 1. In the total sample, ages ranged from 50 to 100 years and years of education ranged from 0 to 20 years. The three ethnic/language groups differed in age, education, gender composition, and scores on the Mini-Mental State Examination. All neuropsychological test scores differed across groups as well. The groups did not differ in dementia severity, as measured by the Clinical Dementia Rating.
The five-factor model fit the data well in the entire sample, CFI = .979, TLI = .973, RMSEA = 0.047 (95% confidence interval [0.041, 0.053]), SRMR = 0.028. The factor loadings derived from this model are shown in Table 2 and the estimated factor correlations are shown in Table 3. In addition to the good overall model fit, all factor loadings were strong (> .40) and in the expected direction. All factors were strongly positively correlated, with the exception of the Motor factor, which had a weak positive correlation with the other four factors. The model also fit reasonably well in each of the three ethnicity/language subgroups, as shown in Table 4, confirming the appropriateness of further measurement invariance testing.
As can be seen in Table 5, all of the constrained models fit the data well and there was no decrement in model fit when additional equality constraints were applied. Therefore, the results indicate that the five-factor model can be considered to have strict measurement invariance across the ethnic and language groups used in this study.
The most constrained measurement invariance model ("strict + means") applied group equality constraints to the factor means and found no decrement in fit compared to a less-constrained (strict) model. As such, the three groups are considered to have equal global factor means for all five factors when these factors are estimated using a latent variable framework. In other words, when the factors are estimated without measurement error, such as with CFA, there is no evidence of group mean differences on those factors.
Finally, the composite reliability of the single-factor model was calculated using Raykov's approach [37]. The composite reliabilities (true score variance divided by observed score variance) of the five factors in the entire sample and in the subsamples are shown in Table 6. The visuospatial factor was consistently the least reliable, with composite reliabilities below the recommended threshold of r = .80 [38] but above .70. In contrast, the motor factor was consistently the most reliable (> .96). For the most part, the remaining factors possessed high reliability, especially in the total sample, but there were a few cases of lower than desirable reliability for some of the factors in the subsamples (see Table 6).
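The definition given above (true score variance divided by observed score variance) can be illustrated with one common formula for the composite reliability of a unit-weighted sum of a factor's indicators. This sketch is an assumption and may not match the exact variant the authors used: it takes the true score variance to be the squared sum of loadings times the factor variance, and the error variance to be the sum of residual variances plus twice the sum of any residual covariances. The function name and parameter values are hypothetical.

```python
def composite_reliability(loadings, residual_variances,
                          residual_covariances=(), factor_variance=1.0):
    """Composite reliability of a unit-weighted sum of indicators.

    true variance  = (sum of loadings)^2 * factor variance
    error variance = sum of residual variances
                     + 2 * sum of residual covariances (each pair listed once)
    """
    true_var = (sum(loadings) ** 2) * factor_variance
    error_var = sum(residual_variances) + 2.0 * sum(residual_covariances)
    return true_var / (true_var + error_var)


# Hypothetical two-indicator factor (illustrative only, not values from Table 2).
rel = composite_reliability(loadings=[0.8, 0.7], residual_variances=[0.36, 0.51])
print(round(rel, 3))  # 0.721
```

The formula makes clear why a factor with only two indicators tends to be less reliable: the squared sum of loadings grows with each added indicator faster than the summed residual variance does.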

Discussion
The current results indicate that a five-factor model of cognition (comprising memory; attention, executive functioning, and processing speed; language; visuospatial; and motor factors, as measured by 19 different neuropsychological variables) is capable of measuring cognitive functioning with equal validity in this diverse sample of participants regardless of ethnicity (Hispanic/non-Hispanic) or primary language (Spanish/English). Because this model demonstrated strict invariance, group comparisons can be made with confidence that the CFA model provides valid estimates of actual differences in global cognitive functioning and that these estimates are not influenced by systematic bias to a meaningful degree. The current results provide a strong first step toward establishing the measurement invariance of a comprehensive model for cognitive functioning in older adults from both Hispanic and non-Hispanic ethnic groups and who speak either Spanish or English. Importantly, our results suggest that these five cognitive factors can essentially be measured without bias using commonly used neuropsychological tests that were developed in English but were administered to many in Spanish. These findings can be further bolstered by replication in independent samples and in samples that contain additional racial, ethnic, and linguistic heterogeneity.
Measuring cognitive functioning with equal validity across diverse groups is an essential requirement of neuropsychological assessment instruments used in cross-cultural and cross-linguistic research and clinical practice. Considering the growing diversity of the U.S. population, more research is needed to determine whether commonly used neuropsychological assessment instruments are capable of measuring cognitive functioning with equal validity in groups that differ on important dimensions such as ethnicity and language. This is especially important for work in the area of cognitive aging, where clinicians and researchers are increasingly likely to assess patients and research participants from diverse backgrounds, where the potential introduction of systematic measurement error could affect the construct validity of the instruments used to measure cognition. Research has indicated that life history variables that covary on dimensions of race and ethnicity are important contributors to cognitive functioning in older adults [7][8][9][10]. In addition, dementia onset is earlier in Mexican Americans compared to non-Hispanic whites [39][40][41], and the brain variables driving cognitive decline can differ based on race and ethnicity [42]. In particular, racial and ethnic differences in rates of diabetes and depression are likely to contribute to health disparities in minority groups [19,[43][44][45][46]. Such findings highlight the importance of ensuring that observed differences on cognitive tests are valid and not simply a reflection of test bias.
In this study, 10 different tests, together yielding a total of 19 outcome variables, were used to derive estimates of cognitive functioning in five neuropsychological domains. Not only were these factors invariant to the ethnic and linguistic group differences in our sample, but they were also found to possess, for the most part, high reliability. Although the focus of the current article was on measurement invariance, this finding of high composite reliability is important as well, as it indicates that the current model is capable of estimating cognitive functioning with high precision, which is another important attribute for examining group differences [47]. Future research can make use of this model to provide an essentially unbiased estimate of cognition across five domains for the purpose of studying cognitive aging in older Spanish- and English-speaking Mexican Americans and English-speaking Caucasian Americans.
Although the observed neuropsychological test scores often differed across ethnic and language groups (Table 1), our results show that the five underlying cognitive factors did not significantly differ across groups. This finding is likely due to the fact that CFA was used to derive estimated factor means, which, in the context of a latent variable model, are not affected by measurement error. In contrast, the observed test scores shown in Table 1 reflect a combination of trait ability plus measurement error. Therefore, when using these tests to estimate the five factors in our model, it is necessary to use CFA to avoid the potentially biasing effects of measurement error. Failing that, application of demographically corrected normative data to the observed scores may help attenuate the effects of systematic bias on the observed test results to some extent [48,49].
Despite the many strengths of this study, including the large and diverse sample, there are also some limitations. This was a cross-sectional study, which only allows for conclusions to be drawn about group differences at a single point in time. Longitudinal measurement invariance must be established before such a model can be used to draw conclusions about changes in cognitive functioning over time [13]. Our group is currently collecting longitudinal data from a community-based sample of Mexican Americans that will be used for such analyses, which should be a primary goal for future research in the area of cross-cultural neuropsychology. In addition, the current sample was recruited from a single geographic region, and the Hispanic participants were predominantly from a Mexican American background. One specific limitation of the five-factor model pertains to the breadth of the visuospatial and motor factors. The visuospatial factor is indicated by two test scores, CLOX1 and CLOX2, while the motor factor is indicated by four test scores, dominant and nondominant hand Grip Strength, with two trial scores per hand. As such, the model provides a narrowly focused ability estimate for these two cognitive domains, which may be undesirable in some assessment contexts.
Few neuropsychological tests have been designed a priori to provide unbiased estimates of cognitive functioning across diverse groups, and few existing test batteries have been subjected to post hoc validation for this purpose [13,50]. Without confirmation that a test is essentially free from ethnic, racial, or linguistic bias, it is difficult to determine the relative contributions of true cognitive differences versus systematic error variance when interpreting observed test score differences. The present study therefore makes an important contribution to the literature by providing clinicians and researchers with another tool for generating valid cognitive outcome measures across two important dimensions of diversity.