Measurement Properties of the Health Literacy Questionnaire in the Understanding Multiple Sclerosis Massive Open Online Course Cohort: A Rasch Analysis

Background: Online health education and other electronic health improvement strategies are developing rapidly, highlighting the growing need for valid scales to assess health literacy (HL). One comprehensive HL scale is the Health Literacy Questionnaire (HLQ), but little is known about its measurement properties in online health education cohorts. Objective: The purpose of this study was to determine whether the multidimensional HLQ is an appropriate tool to measure HL in a cohort of Understanding Multiple Sclerosis (MS) online course enrollees. Methods: Participants who enrolled in the first two open enrollments of the Understanding MS online course completed the HLQ (N = 1,182) in an online survey prior to beginning course materials. We used Rasch analysis to assess the measurement properties of the HLQ. Key Results: The nine Domains of the HLQ each had ordered category function and a good fit with the Rasch model. Each Domain was one-dimensional and exhibited good internal consistency and reliability. None of the 44 individual items of the HLQ demonstrated item bias or local dependency. However, while the overall fit was good, a few measurement gaps were identified in this cohort in each of the nine Domains, meaning that the HLQ may have low measurement precision for some participants. Conclusion: Our analysis of the HLQ indicated acceptable measurement properties in a cohort of Understanding MS online course enrollees. Although reliable information on nine separate constructs of HL was obtained in the current study, indicating that the HLQ can be used in similar cohorts, its limitations must also be considered. [HLRP: Health Literacy Research and Practice. 2022;6(3):e200–e212.] Plain Language Summary: In this study, we have shown that the HLQ is suitable for measuring HL in online public health education platforms for chronic diseases, including multiple sclerosis.
This finding adds to the evidence that the HLQ can be widely used in measuring HL in different settings, populations, and health educational platforms.

The World Health Organization defines health literacy (HL) as "the cognitive and social skills which determine the motivation and ability of individuals to gain access to, understand and use information in ways which promote and maintain good health" (Nutbeam, 1998). HL has gained considerable attention in recent years (Sørensen et al., 2012) because of increasing evidence demonstrating its strong association with health inequalities and health outcomes (Beauchamp et al., 2015; Berkman et al., 2011).
HL plays a vital role in achieving effective participation and empowerment of people and communities (Nutbeam, 1998). It is also an important component of public health and a determinant of health equity. Multiple sclerosis (MS) is a chronic neurodegenerative disease in which the immune system attacks and gradually impairs the function of the central nervous system (Wilkins, 2017). Adequate HL has been shown to be associated with improved self-care skills, symptom management, understanding and use of health information, participatory decision-making, and compliance with treatments, and to empower patients, families, and caregivers and foster shared decision-making for optimized collaborative care in chronic neurological diseases like MS (Chiovetti, 2006; Henson, 2016; Jafari et al., 2020; Lejbkowicz et al., 2012; Rieckmann et al., 2015). Conversely, lower levels of HL are associated with poor health outcomes and increased health care use in people living with MS (Marrie et al., 2014).
In this study, we assessed HL among enrollees in the Understanding MS online course, including members of the MS community (e.g., people living with MS, caregivers) and interested laypeople, prior to beginning coursework. There are myriad HL assessment tools designed for use in a variety of study populations. Of these, we chose the Health Literacy Questionnaire (HLQ). The HLQ was developed using a comprehensive "validity-driven" approach (Osborne et al., 2013), and the tool comprises nine independent Domains with 44 total items that holistically capture different aspects of HL (Osborne et al., 2013). The HLQ has excellent psychometric properties and has been culturally adapted and/or validated in different populations, settings, and languages. For example, it has been adapted and/or validated in German (Nolte et al., 2017), Danish (Maindal et al., 2016), Slovakian (Kolarcik et al., 2017), Dutch (Rademakers et al., 2020), and Iranian cohorts (Ahmadi & Salehi, 2019), as well as in health professional university students (Mullan et al., 2017), older adults (Huang et al., 2019; Morris et al., 2017), people with metabolic and cardiovascular risk (Debussche et al., 2018; Richtering et al., 2017), and recently hospitalized patients (Jessup et al., 2017). However, despite the wide applicability of the HLQ, the suitability of an instrument may differ across settings or populations. Therefore, it is important to assess the performance of the HLQ in the population of interest before applying the instrument and interpreting scores (Osborne et al., 2013).
Rasch modeling is an approach used to evaluate the psychometric properties of self-reported health outcome scales like the HLQ (Richtering et al., 2017; Tennant & Conaghan, 2007). Whereas conventional item response theory (IRT) builds response models to fit the data, Rasch modeling does the reverse, testing whether observed responses fit the pattern of the Rasch model (Bond & Fox, 2015; Hendriks et al., 2012), which is a special case of IRT. The Rasch model requires the identification and measurement of a single attribute at a time. The Rasch approach has several advantages, including providing valid summation of raw (ordinal) scores and assessment of category response ordering, item difficulty relative to person ability, and item bias and response dependency (Prieto et al., 2003), which are key to assessing scale validity, reliability, and one-dimensionality. Here, we extend the current evidence on the applicability of the HLQ using Rasch analysis and validate the HLQ for use in a large online health education setting for the first time.
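To make the model concrete: in its simplest (dichotomous) form, the Rasch model states that the probability of endorsing an item depends only on the difference between person ability and item difficulty, both expressed in logits. The following minimal Python sketch is purely illustrative (the analysis in this study used Winsteps, not custom code):

```python
import math

def rasch_probability(theta, delta):
    """Dichotomous Rasch model: probability that a person with
    ability `theta` (logits) endorses an item with difficulty
    `delta` (logits)."""
    return math.exp(theta - delta) / (1 + math.exp(theta - delta))

# When ability equals difficulty, endorsement probability is exactly 0.5.
print(rasch_probability(1.0, 1.0))  # 0.5
```

A person 2 logits above an item's difficulty endorses it with probability of about 0.88; one 2 logits below, about 0.12 — this symmetry around 0.5 is what places persons and items on the same scale.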

METHODS

Ethics
This study was conducted in compliance with the Declaration of Helsinki of 1975, as revised in 2000, and was approved by the University of Tasmania's Social Science Human Research Ethics Committee (H0017924; H0018314). All participants gave their informed consent.

Study Design and Data Collection
We have developed a freely available massive open online course (MOOC) entitled "Understanding MS." The course presents participants with up-to-date, evidence-based information on the biology, management, and prevention of MS. The course content is described in detail elsewhere (Claflin et al., 2020). Participants in this cohort study were those who expressed interest in taking part in course-related research on their enrollment form. The research team contacted interested participants via email with details about the cohort study and a link to the surveys, project information sheet, and consent form.
Study participants completed an online survey prior to beginning course materials, including demographic questions and the 44-item HLQ. The data were de-identified at collection using course platform-generated participant ID numbers and remained so for analysis.

The Health Literacy Questionnaire
The HLQ contains nine independent Domains, with a total of 44 items (Osborne et al., 2013), that capture different aspects of HL to assess population, group, and individual HL needs. The HLQ consists of two parts containing items with differing response option formats. Part 1, comprising Domains 1-5, contains items with a 4-point Likert-type response option rating scale assessing the level of agreement from (1) strongly disagree to (4) strongly agree. Part 2, covering Domains 6-9, contains items with a 5-point Likert-type response option rating scale assessing the level of capability/difficulty on each item from (1) can't do or always difficult to (5) always easy. The complete HLQ provides nine separate Domain scores. Each Domain score is calculated by averaging the scores of the items that define that Domain. The HLQ does not provide an overall score (Osborne et al., 2013).
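Scoring is therefore straightforward: each Domain score is the mean of that Domain's item scores, and no total score is formed. A minimal sketch (the item values are illustrative, not real HLQ data):

```python
def hlq_domain_score(item_scores):
    """Mean of the item scores within one HLQ Domain.
    The HLQ yields nine such Domain scores and no overall score."""
    return sum(item_scores) / len(item_scores)

# Hypothetical 4-item Domain answered on the 4-point agreement scale.
print(hlq_domain_score([3, 4, 3, 4]))  # 3.5
```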

Statistical Analysis
Stata 16.1 was used for data cleaning, data management, and descriptive statistics for the cohort. Participants were identified by a numerical user ID generated by the online course platform. Using this ID, we identified participants who completed a pre-course survey in both enrollments and included only the data collected before the first enrollment. Similarly, we identified repeated responses and retained the most complete survey or, if equally complete, the survey completed first. Participants who did not complete the full HLQ were excluded. Rasch analysis was conducted using Winsteps software, version 4.5.5 (Linacre, 2019). Rasch analysis is a probability-based psychometric method that tests whether observed responses fit the pattern of the Rasch model (Bond & Fox, 2015), a special case of item response theory. If the model requirements are met, it identifies the measurement and structural properties of a scale (or instrument), including the relative difficulty of each item, and maps these item difficulties against person ability levels. In this way, it is possible to ascertain whether the difficulty level of the items is appropriate for assessing individuals with a particular level of skill (Bond & Fox, 2015). Rasch modeling is widely used to assess the psychometric properties of scales, test items, and questionnaires in health and education (Bessing et al., 2021; Bond & Fox, 2015; Morris et al., 2017).
We first examined the category function for each of the nine Domains of the HLQ using the category frequencies, average measures, category fit statistics, threshold estimates, and probability curves. Diagnosing appropriate response category function enhances the validity and reliability of the HLQ (Bond & Fox, 2015; Cordier et al., 2018). We evaluated the category ordering of item response options by assessing whether each response option category had a minimum of 10 observations and whether the average observed logit measures of the categories increased monotonically in accordance with the specified response option scale. Failure to meet this requirement indicates either poorly defined categories or the inclusion of items that are not consistent with the construct being measured. We also assessed category step (threshold) ordering using the Andrich thresholds and category characteristic curves, including detailed inspection of the item category distractor frequencies for each response category. Ideally, the Andrich thresholds (the boundaries between adjacent categories) should increase monotonically, with no overlaps or large gaps (>5 logits) between two adjacent categories. Step (threshold) disordering can mean that there are gaps, that a category is underutilized, or that a category defines only a narrow section of the construct being measured (Cordier et al., 2018).
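The category probabilities underlying these checks come from the rating scale (Andrich) model. The sketch below is a simplified stand-in for what Winsteps estimates: given a person ability, an item difficulty, and the ordered Andrich thresholds, it returns the probability of each response category.

```python
import math

def category_probabilities(theta, delta, thresholds):
    """Rating scale (Andrich) model category probabilities.
    `theta`: person ability (logits); `delta`: item difficulty (logits);
    `thresholds`: Andrich thresholds tau_1..tau_m for an (m+1)-category item.
    Returns a list of probabilities, one per category, summing to 1."""
    exponents = [0.0]  # the lowest category has a cumulative exponent of 0
    running = 0.0
    for tau in thresholds:
        running += theta - delta - tau  # accumulate (theta - delta - tau_j)
        exponents.append(running)
    denom = sum(math.exp(e) for e in exponents)
    return [math.exp(e) / denom for e in exponents]
```

With well-ordered thresholds, a high-ability person's probability mass concentrates in the top category and a low-ability person's in the bottom one, producing the distinct peaks expected in the category probability curves.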
The overall fit to the Rasch model expectations for each of the nine HLQ Domains was assessed in the Understanding MS course enrollees (Tennant & Conaghan, 2007). We also analyzed the fit to the Rasch model expectations of all items within each of the HLQ Domains. We determined goodness of fit for each HLQ Domain and corresponding individual items using the mean square (MNSQ) and z-standardized scores (Bessing et al., 2021;Bond & Fox, 2015). We used 0.6-1.4 MNSQ infit and outfit values as the acceptable fit range in this study, which is recommended for rating scales and surveys (Linacre, 2019).
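For intuition about these fit statistics, the MNSQ values can be sketched from standardized residuals. In this simplified version (the actual estimation was done in Winsteps), outfit is the unweighted mean of squared standardized residuals, which makes it sensitive to outliers, while infit is the information-weighted analogue:

```python
def fit_mnsq(observed, expected, variances):
    """Simplified infit/outfit MNSQ for one item across persons.
    `observed`: observed scores; `expected`: model-expected scores;
    `variances`: model variances of the observations."""
    squared_std_residuals = [
        (o - e) ** 2 / v for o, e, v in zip(observed, expected, variances)
    ]
    # Outfit: plain mean of squared standardized residuals.
    outfit = sum(squared_std_residuals) / len(squared_std_residuals)
    # Infit: squared residuals weighted by information (variance).
    infit = sum((o - e) ** 2 for o, e in zip(observed, expected)) / sum(variances)
    return infit, outfit
```

Values near 1.0 mean the observed variation matches model expectations; values above ~1.4 (underfit) signal excess unmodeled variation, as seen for the one slightly underfitting HLQ item, and values below ~0.6 (overfit) signal overly predictable responses.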
We assessed several other psychometric properties of the HLQ Domains and items. These included one-dimensionality, internal consistency and reliability, Cronbach alpha test reliability, sex differential item functioning (DIF), response dependence, and scale/item targeting (described in Table 1). Although the Cronbach alpha test reliability is not important for the Rasch model, it does provide additional information about reliability according to classical test theory.
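Of these properties, the Cronbach alpha is the one with a simple closed form from classical test theory: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) for k items. A self-contained sketch with illustrative data:

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for one domain.
    `item_scores`: list of k per-item score lists, each of length n persons."""
    k = len(item_scores)
    n = len(item_scores[0])

    def sample_variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    sum_item_variances = sum(sample_variance(item) for item in item_scores)
    totals = [sum(item[p] for item in item_scores) for p in range(n)]
    return k / (k - 1) * (1 - sum_item_variances / sample_variance(totals))
```

Two perfectly parallel items yield alpha = 1.0; values of at least 0.7 are the conventional threshold for acceptable internal consistency noted in Table 1.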

RESULTS
In total, 8,334 people enrolled in the first two enrollments of the Understanding MS MOOC. Of these, 2,680 (32.2%) were invited to take part in the cohort studies because they indicated at enrollment that they were interested in participating in research. Of those invited, 1,261 (47.1%) completed the pre-course surveys in iterations 1 and 2. Of those who completed the pre-course surveys, 1,182 (93.7%) had complete HLQ data for analysis (Figure 1).

Participant Characteristics
The characteristics of study participants are presented in Table 2. The average age of participants was 48 years, and most participants were women (86%), married or in a de facto partnership (68%) and spoke English at home (92%). Participants were highly educated, with 57% having an associate degree or higher.

Rasch Analysis
HLQ category function. When examining the category function for each of the nine HLQ Domains, there were more than 10 observations per category, and the average category measures increased monotonically across the 4 or 5 distinct ordered response option categories, depending on the Domain (Table 3). This indicated that the rating categorization of each of the nine HLQ Domains was satisfactory and well defined.

Table 1. Measurement Properties, Definitions, and Statistical Tests in Winsteps With Acceptable Values

Category function. Definition: evaluates whether the threshold values (i.e., the spaces between each of the Health Literacy Questionnaire [HLQ] Domain categories or choices) were ordered or disordered; this supports the reliability of the HLQ Domains. Test: the Rasch-Andrich thresholds for the rating scale were used, based on the following criteria: (1) a minimum of 10 observations for each category (N > 10); (2) average category measures increase monotonically with categories; (3) mean square (MNSQ) 0.6-1.4; (4) category thresholds increase monotonically with categories; (5) category thresholds are at least 1.4 and at most 5 logits apart; and (6) there are distinct peaks for every category probability curve (Boone et al., 2013; Linacre, 2002).

Fit statistics. Definition: a test of the extent to which the data fit the Rasch model, for items, persons, and the whole Domain. Test: fit statistics are based on MNSQ and Z-standardized scores; infit and outfit MNSQ between 0.6 and 1.4 is considered acceptable for rating scales (surveys) (Bessing et al., 2020; Bond & Fox, 2015; Linacre, 2019).

One-dimensionality. Definition: the ability of each of the nine HLQ Domains to measure a separate single health literacy (HL) construct. Test: principal component analysis of the residuals, with a Rasch-explained dimension >40% (Linacre, 2019) and a first contrast eigenvalue ≤2.0 supporting one-dimensionality (Bond & Fox, 2015).

Internal consistency and reliability. Definition: the extent to which the items in each of the nine HLQ Domains measure the same concept. Test: a person or item reliability of ≥0.7 and a person or item separation index of ≥1.5 support good internal consistency and reliability; a Cronbach alpha of ≥0.7 also supports good internal consistency and reliability (Bond & Fox, 2015; Linacre, 2019; Tennant & Conaghan, 2007).

Differential item functioning (DIF). Definition: DIF measures whether there is any bias in responses to the HLQ items between groups in the sample who have similar levels of HL. Test: we assessed DIF for sex using the Mantel-Haenszel approach; a DIF contrast of ≥0.64 logits with a two-tailed p value of ≤.05 is considered statistically significant (Bond & Fox, 2015).

Scale targeting. Definition: a measure of this cohort's ability to endorse the items of each HLQ Domain against the difficulty level of those items arrayed along the same continuum. Test: assessed with a person-item threshold graph. A well-targeted scale should have participants and items spread across the continuum. The mean person-item ability/difficulty is zero log-odds units (logits); items located below zero are the easiest, and the people closest to them have less ability, whereas items above zero are the most difficult, and the people closest to them have greater ability (Pallant & Tennant, 2007).

Examination of the Andrich thresholds showed that the thresholds increased monotonically along the continuum, indicating that the categories were distinct for each of the nine Domains. However, 6 of the 9 Domains had Andrich threshold magnitudes >5 logits between the last two categories, indicating potential measurement gaps between item response category difficulty levels and participants' ability.
We further examined the category probability curves for each of the nine HLQ Domains and the respective items within each Domain. There were distinct thresholds for each response option category within each item, and each category also exhibited a distinct peak on the probability curves. For example, Figure 2 shows the appropriateness of the response categories for the HLQ item "Find information about health problems." In Figure 2, the Y-axis (ranging from 0 to 1) depicts the expected probability that respondents endorse each response category. The X-axis represents the person's location relative to the item's difficulty: positive values indicate participants more able to endorse the item, and negative values indicate participants less able to endorse it. Figure 2 indicates that participants with a positive attitude toward "Find information about health problems" (high positive values on the X-axis) were more likely to endorse higher categories (category 5, Always easy). Similarly, participants with a negative attitude toward this item (low values on the X-axis) tended to endorse lower categories (category 1, Can't do or always difficult). This pattern held across all Domains and individual items of the HLQ (see Figure A and Figure B for graphical depictions of all remaining items). Together, this indicates that, on average, participants with higher ability (more agreeable) increasingly endorsed higher categories, while those less agreeable increasingly endorsed lower categories, as expected. This suggests that the response categories of the HLQ function as intended.

Note. For Domains 1-5, categories: 1 = Strongly disagree, 2 = Disagree, 3 = Agree, 4 = Strongly agree. For Domains 6-9, categories: 1 = Can't do or always difficult, 2 = Usually difficult, 3 = Sometimes difficult, 4 = Usually easy, 5 = Always easy.
Note. The MNSQ acceptable limits for productive measurement were 0.6-1.4. Full domain items are available from the authors of the HLQ. HLQ = Health Literacy Questionnaire; Infit = overfit coefficient; MNSQ = mean square; Outfit = underfit coefficient.

Fit to the Rasch model. Each of the nine HLQ Domains showed good overall fit to the Rasch model (Table 4). All items under each HLQ Domain were scored appropriately and functioned as expected (Linacre, 2002). The individual items were within the acceptable range for good model fit, although the item "I spend quite a lot of time actively managing my health" (in Domain 3: Actively managing my health) was slightly underfitting, with an outfit MNSQ of 1.51 (cut-off 1.40) (Table 5). This implies there was too much variation in participants' responses to this item (Bond & Fox, 2015).
One-dimensionality. The principal component analysis (PCA) of the residuals for each of the nine HLQ Domains supported the one-dimensionality of each model (Table 5). The Rasch dimension in each Domain explained >47% of the variance in the data; explained variance >40% is considered a strong measurement dimension (Linacre, 2019). The unexplained variance in each Domain had a first contrast eigenvalue <2.0, implying that there was no significant second dimension after extracting the Rasch dimension and that the unexplained variance in each Domain was mainly random noise.
Reliability and internal consistency of the HLQ. The person separation of ≥2.0 and person reliability of ≥0.8 for each of the nine HLQ Domains suggested that the items within each Domain were sensitive enough to differentiate at least two person ability levels (low and high) (see Table 4). The item separation of >4 and item reliability of >0.9 observed for each HLQ Domain were above the recommended values (separation ≥3 and reliability ≥0.9) (Cordier et al., 2018; Linacre, 2010). This indicated that our sample was large enough to confirm at least three item difficulty levels (low, medium, and high) in each HLQ Domain, supporting the construct validity of the HLQ (Linacre, 2010).
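The separation index and reliability reported above relate the spread of estimated measures to their measurement error: reliability is the proportion of observed variance that is "true" variance, and separation is the ratio of true spread to error. A sketch with hypothetical measures and standard errors (Winsteps reports these directly):

```python
def separation_and_reliability(measures, standard_errors):
    """Rasch person (or item) separation index and reliability.
    `measures`: estimated person/item measures in logits;
    `standard_errors`: the corresponding standard errors."""
    n = len(measures)
    mean = sum(measures) / n
    observed_var = sum((m - mean) ** 2 for m in measures) / (n - 1)
    error_var = sum(se ** 2 for se in standard_errors) / n  # mean error variance
    true_var = max(observed_var - error_var, 0.0)
    reliability = true_var / observed_var          # e.g., >=0.7 acceptable
    separation = (true_var / error_var) ** 0.5     # e.g., >=1.5 acceptable
    return separation, reliability
```

A separation of 2.0 corresponds to a reliability of 0.8, which is why the two thresholds are quoted together in Rasch reporting.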
HLQ differential item functioning. The DIF contrast values for items in each HLQ Domain were <0.64, offering no evidence of item bias (see Table 5). This indicated that participants with the same level of HL within each Domain responded consistently to the items in that Domain irrespective of their sex.
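The Mantel-Haenszel approach behind this DIF test pools 2x2 endorsement tables across strata of matched HL scores into a common odds ratio; a log odds ratio near zero (well below the 0.64-logit contrast threshold) is consistent with no DIF. A simplified sketch with hypothetical counts (not data from this study):

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel common odds ratio across score strata.
    Each stratum is a 2x2 table (a, b, c, d):
      a = focal group endorsing,     b = focal group not endorsing,
      c = reference group endorsing, d = reference group not endorsing."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Two strata in which women and men endorse at identical rates: OR = 1,
# ln(OR) = 0 logits, i.e., no evidence of DIF.
print(mantel_haenszel_or([(10, 10, 10, 10), (20, 20, 20, 20)]))  # 1.0
```

Taking `math.log` of the result converts the odds ratio to the logit scale on which the 0.64 DIF contrast criterion is expressed.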
Item targeting. The person-item Rasch-Andrich threshold distributions for each of the nine HLQ Domains on a log-odds unit (logit) scale are shown in Figure 3. The Domain difficulty (endorsability) levels ranged from -6.0 to 8 logits, and the person ability (agreeability) levels ranged from -10 to 10 logits. This indicated that the HLQ Domain difficulty levels cover most of the ability levels for the separate HL constructs among our participants.
However, each of the nine HLQ Domains contained gaps between participants' self-reported ability and Domain difficulty levels, which does not support fully adequate scale targeting. For example, in Figure 3, the items of HLQ Domain 4 (HLQ scale 4) did not cover participants with ability below -4.7 logits; for these participants, all items were too difficult to endorse. Similarly, no item difficulty matched participants with ability levels between 1 and 3 logits, meaning that items below their ability were too easy, and items above their ability too difficult, to endorse. It is important to note that these are self-reported rather than test-based measurements and may not reflect true ability and difficulty levels.

DISCUSSION
The main purpose of this article was to assess the appropriateness of the HLQ to measure HL among Understanding MS online course participants and to provide evidence that the HLQ may be used in online health education settings.
Rasch analysis was specifically used to rigorously assess the HLQ's psychometric properties. We found that the HLQ is an appropriate tool for the assessment of the nine separate HL constructs in Understanding MS online course enrollees, comprising both MS community members and interested laypeople.

Note. The MNSQ acceptable limits for productive measurement were 0.6-1.4. Infit and Outfit values close to 1.0 show acceptable fit and that the scale is more productive for measurement. Rasch-explained dimension >40% = strong measurement of dimension; first contrast eigenvalue ≤2.0 = no significant variance explained by an additional dimension apart from the Rasch dimension. Infit = overfit coefficient; Outfit = underfit coefficient; MNSQ = mean square; ZSTD = Z-standardized scores.

In the study cohort, the nine Domains of the HLQ each had adequately ordered rating-scale categories, with enough separation between them, and ordered Rasch-Andrich threshold measures. The 4-point and 5-point rating scale response options used in the HLQ Domains functioned appropriately in our sample. Because of this, we maintained these rating options for the assessment of HL in our cohort.
The results of this study are consistent with previous work that found the HLQ to have good measurement properties, with robust construct validity and reliability for each of its nine Domains (Ahmadi & Salehi, 2019; Huang et al., 2019; Kolarcik et al., 2017; Maindal et al., 2016; Morris et al., 2017; Nolte et al., 2017; Osborne et al., 2013; Richtering et al., 2017). This indicates that the nine HLQ Domains each consistently measure a single HL construct and together provide reliable and valid information on nine distinct HL constructs. Our work suggests that the HLQ provides fair assessments of individual HL levels across sex groups in the Understanding MS online course cohort, as is expected of a good measurement tool (Tennant & Conaghan, 2007). This suggests that sex differences found in previous studies are likely to be true differences rather than measurement artifacts (Maindal et al., 2016).
In the Understanding MS online course cohort, the HLQ Domains were unable to adequately distinguish all participant ability levels in HL, as evidenced by varied gaps between participant ability and item difficulty levels. This inadequate participant ability targeting suggests that the HLQ's measurement precision is reduced for participants with very low or low HL and for those with moderate or high HL in the study cohort. This finding is similar to the results of two previous validation studies evaluating cohorts with similarly high mean educational attainment or socioeconomic status (Richtering et al., 2017). In these prior studies, the HLQ was found to have inadequate targeting for participants with high HL among older adults who presented to the emergency department after a fall and in a population with moderate-to-high cardiovascular risk (Richtering et al., 2017). Despite these few measurement gaps in the HLQ Domains, our Rasch model analysis validated the use of the HLQ to assess HL in this cohort. It is encouraging to find the HLQ a suitable instrument for use in this cohort, considering that our study was, if anything, overpowered and thus more likely to find statistically significant problems that are not clinically significant or relevant to public health.
We have shown that the HLQ can be used to measure HL in online health educational platforms. This adds to the existing knowledge on the validated modes of delivery for HL measurements, such as self-administered paper-based, online/web-based, face-to-face, and telephone-based interviews, that have been explored in different settings and populations (Ahmadi & Salehi, 2019; Debussche et al., 2018; Huang et al., 2019; Jessup et al., 2017; Kolarcik et al., 2017; Maindal et al., 2016; Morris et al., 2017; Mullan et al., 2017; Nolte et al., 2017; Rademakers et al., 2020; Richtering et al., 2017).

STRENGTHS, LIMITATIONS, AND FUTURE RESEARCH
The major strength of this study is the use of the Rasch modeling approach to provide a rigorous psychometric assessment of the HLQ in a large cohort of online learners, which included both members of the MS community and the general public. This study also had limitations. Our study had a moderate participation rate (44.1% of invited participants), which is typical of online surveys; this may have introduced nonresponse selection bias, so our findings should be interpreted with caution. A small proportion of the study cohort were men (13.5%). Although this is a common issue among MS-related cohorts, given that MS affects nearly three times as many women as men (Shull et al., 2020), it suggests that the results of the DIF analysis assessing the influence of sex should be interpreted with caution. The high education level of our cohort may have contributed to the observed gaps in Domain targeting. Future research should examine changes in HL in this cohort and test the sensitivity of the HLQ Domains to change over time.

CONCLUSION
Here we present a robust psychometric validation of the HLQ in a large cohort in an online health education setting. The strong psychometric properties demonstrated by the HLQ in this study indicate that it is an appropriate tool for the assessment of HL among participants in the Understanding MS online course and in similar settings.

Figure 3. Person-item threshold distribution graph for each of the nine Health Literacy Questionnaire (HLQ) Domains. The horizontal axis is the relative location of person ability (blue) or item difficulty (red) on a log-odds unit (logit) scale. The vertical axis is a count of people with a particular ability level (blue) or items of a particular difficulty level (red). The mean location is zero. Items (red) or people (blue) located below zero are considered less difficult or of lower ability, respectively; items or people located above zero are considered more difficult or of higher ability. These graphs provide evidence of a good overall match between person ability and item difficulty, with some measurement gaps.