Evaluating the Longitudinal Item and Category Stability of the SF-36 Full and Summary Scales Using Rasch Analysis

Introduction The Medical Outcome Study Short Form 36 (SF-36) is widely used for measuring Health-Related Quality of Life (HRQoL) and has undergone rigorous psychometric evaluation using Classic Test Theory (CTT). However, Item Response Theory-based evaluation of the SF-36 has been limited with an overwhelming focus on individual scales and cross-sectional data. Purpose This study aimed to examine the longitudinal item and category stability of the SF-36 using Rasch analysis. Method Using data from the 1921-1926 cohort of the Australian Longitudinal Study on Women's Health, responses of the SF-36 from six waves of data collection were analysed. Rasch analysis using Winsteps version 3.92.0 was performed on all 36 items of the SF-36 and items that constitute the physical health and mental health scales. Results Rasch analysis revealed issues with the SF-36 not detected using classical methods. Redundancy was seen for items on the total measure and both scales across all waves of data. Person separation indexes indicate that the measure lacks sensitivity to discriminate between high and low performances in this sample. The presence of Differential Item Functioning suggests that responses to items were influenced by locality and marital status. Conclusion Previous evaluations of the SF-36 have relied on cross-sectional data; however, the findings of the current study demonstrate the longitudinal efficacy of the measure. Application of the Rasch Measurement Model indicated issues with internal consistency, generalisability, and sensitivity when the measure was evaluated as a whole and as both physical and mental health summary scales. Implications for future research are discussed.


Introduction
To be deemed effective and useful, health measures must fulfil several requirements including validity, reliability, interpretability, and responsiveness to change [1]. Measurement invariance is another important characteristic, ensuring that the same construct is being consistently measured across different populations and settings, and over time. Considerations of measurement invariance are important for longitudinal studies that seek to gauge change in a construct, across a broad population and over time. When studies involve an older population, measurements may be vulnerable to instability as the participants age, their living circumstances may change, and their physical and cognitive abilities may decline [2,3].
The Medical Outcome Study Short Form 36 (SF-36) is one of the most commonly used questionnaires for monitoring Health-Related Quality of Life (HRQoL) across a multitude of populations and settings, including client groups and healthy populations [4][5][6][7][8][9][10]. HRQoL refers to aspects of quality of life that are impacted by an individual's mental and physical health [11].
Development of the SF-36 came about following difficulties during the Health Insurance Experiment (HIE), in which participants refused to complete a lengthy health survey [9]. In response, Ware et al. [9] constructed a health survey that was both comprehensive and relatively short. The initial survey, the SF-18, comprised 18 items measuring physical functioning, role limitations relating to poor health, mental health, and health perceptions [9]. Subsequently, additional items were added to create the 20-item SF-20 version and the 36-item SF-36 version, which is now the most commonly used.
The SF-36 measures eight key health concepts: (1) physical functioning (PF); (2) role limitations due to physical health problems (RL-P); (3) bodily pain (BP); (4) general health (GH); (5) vitality (V); (6) social functioning (SF); (7) role limitations due to emotional problems (RL-E); and (8) mental health (MH) [9]. From the eight scales, the survey generates overall physical and mental health component summary scores. Both summary measures include scores from all eight subscales; however particular correlations are present; the physical functioning, role limitations-physical, and bodily pain scales should correlate highest with the physical component score (PCS) and lowest with the mental component score (MCS) [12]. The mental health, role limitations-emotional, and social functioning scales should correlate highest with the MCS and lowest with PCS, with the remaining general health and vitality scales found to correlate moderately with both the PCS and MCS [12]. Summary score results can be compared with gender and age-group norms derived from the general population, e.g., United States population norms [12].
The SF-36 is now widely used for both research and clinical purposes and has undergone rigorous psychometric evaluation nationally and internationally using Classic Test Theory (CTT) [6,7,9,10]. CTT seeks to determine the reliability of a whole instrument by evaluating the ratio of true-score variance to observed-score variance. Observed results are therefore treated as the respondent's "true score" in combination with error [13].
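The CTT decomposition described above can be written compactly. The symbols X (observed score), T (true score), and E (error) are introduced here for illustration only and are not the authors' notation; the reliability expression assumes that true scores and errors are uncorrelated.

```latex
X = T + E, \qquad
\rho_{XX'} = \frac{\sigma_T^{2}}{\sigma_X^{2}} = \frac{\sigma_T^{2}}{\sigma_T^{2} + \sigma_E^{2}}
```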
A relatively new approach to psychometric test design is Item Response Theory (IRT; Edelen and Reeve) [14]. IRT models are typically considered to be unidimensional, assessing instrument reliability at item-level rather than instrument-level, by determining the unique contribution of each item to the construct or trait being measured. IRT considers the importance of participants' responses, whereby the probability of their answering a particular item correctly is based on their responses to other items of greater or lesser levels of difficulty or challenge [14]. Within IRT, the Rasch Measurement Model (RMM) is the most frequently applied IRT approach to investigating the unidimensionality of items that make up scales and to determining if responses are indeed measuring a single dimension only, through the examination of item fit statistics [15].
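For orientation, the dichotomous form of the Rasch model can be stated as follows. The notation is an assumption made for this illustration: \theta_n denotes the trait level of person n and \delta_i the difficulty of item i, both in logits. The SF-36 items are polytomous, so the model actually fitted would extend this form with category thresholds.

```latex
P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}
```

Under this form, the probability of endorsing an item is 0.5 when the person's trait level equals the item's difficulty, and it rises as the trait level exceeds the difficulty.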

Application of IRT/RMM to the SF-36.
Under the assumptions of an IRT model, instruments deemed reliable should meet the following properties: unidimensionality, hierarchical ordering of items, and reproducibility of scale items across client populations [16]. Unidimensionality assumes that a collection of items represent and assess a single construct, that is, fit a single one dimensional model [16]. Item hierarchy refers to a hypothesised continuum along which instrument items should progress in difficulty from easier to more challenging to answer. In other words, the probability of answering the more difficult items is higher for those individuals with higher levels of the latent trait being measured, while those with lower levels of the trait have a lower probability of answering items at the upper end [16]. Reproducibility relates to item hierarchy whereby item order and calibrations along the continuum are seen to remain relatively stable or constant across different groups of assessment respondents and assessment occasions [16]. Item reproducibility or stability is considered essential to the ability to accurately measure between-group differences and within-group changes over time [16].
IRT-based evaluation of the SF-36 has overwhelmingly focused on individual scales, particularly the Physical Functioning-10 subscale, with only some studies having examined particular psychometric properties of the SF-36 as a whole instrument or by component summary scores [5,6,12,16,17].

Unidimensionality and Item Fit.
Only a few analyses have investigated the model-fit of the SF-36 as a whole. A prospective cohort study, involving a sample of 583 participants who were opioid-dependent, assessed item-model fit and latent trait factors for the eight SF-36 subscales and for the whole instrument [6]. The RMM reliability estimates of all eight SF-36 subscales (including a revised PF-10 subscale) established that each measured a single latent trait [6]. Investigation of the dimensional structure of the instrument as a whole confirmed the presence of an eight-factor model; that is, the SF-36 measured eight distinct latent traits [6].
Analysis confirming a two-factor structure, reflecting the SF-36 physical and mental health components, has also been conducted using principal component analysis, with the physical and mental health domains accounting for 70% of the total variance across both standard and acute forms [12]. A single-administration survey with a general U.S.A. population sample (n = 634) evaluated the item-fit of the SF-36 physical and mental HRQoL domains using the RMM [5]. In this analysis, eight items in the physical domain had disordered thresholds, whereby a person responding to higher or lower levels of a categorical scale did not necessarily possess higher or lower levels of the trait that was being assessed [5]. The authors suggested collapsing some category options to overcome this issue [5]. In terms of the HRQoL domains' unidimensionality, the mental health items were seen to fit RMM expectations, whereas the physical domain required discarding of the seven misfitting items to produce a 14-item domain that met RMM requirements. Survey data for 395 Taiwanese patients with chronic lung disease were analysed to conduct similar assessments of the SF-36 mental and physical health domains, with the authors concluding that each domain was unidimensional [7].
Differential item functioning (DIF) analysis using IRT-based techniques has also been undertaken with the SF-36. DIF refers to the unequal endorsement of instrument items by respondents of different groups, given that the items intend to measure the same latent trait [10]. The presence of DIF undermines instrument construct validity and may compromise the ability to compare instrument scores across different groups of respondents [10]. Yu et al. [10] utilised the multiple-indicator, multiple-causes (MIMIC) technique and an IRT-based methodology to detect whether DIF existed in the SF-36 physical and mental health domains. Data were extracted from the 1994-95 cohort of the Southern California Kaiser Permanente database (n = 7,538), which evaluated the health outcomes of patients receiving pharmacist consultations. DIF across SF-36 physical and mental health domains was analysed in relation to the presence of five key disease types: hypertension, rheumatic conditions, respiratory diseases, depression, and diabetes. Results indicated the presence of statistically significant DIF for a total of five items, both physical and mental health-based, for the hypertension, respiratory, and diabetes groups, respectively [10]. The authors concluded that the presence of DIF for only five of 36 items did not warrant significant concern regarding the overall construct validity of the SF-36; however, they advised caution when using the SF-36 to compare groups based on hypertension in particular, which returned a DIF effect for two items in the physical health domain [10].

Cross Cultural Item Response Patterns.
Rasch modelling has also been applied to translated versions of the SF-36 to examine its cross-cultural validity. An assessment of the appropriateness of a Korean version of the SF-36 with 510 elderly Korean adults was conducted using the RMM [17]. The authors verified the presence of unidimensionality in the instrument and determined through step calibration that the response options of three- and five-point scales for items were appropriate for this population [17]. Goodness-of-fit statistics, however, determined that nine items across the instrument were not appropriate for this population, in terms of being incongruent with other items, having significant overlap with other items, or creating confusion due to misinterpretation of the meaning of items [17].

Item Stability.
While item-model fit and determination of the presence of DIF are important, these properties can mean very little if item responses are inconsistent or changeable over time. Evaluation of the stability of item responses is important to determining the rigour of an instrument. Most IRT evaluations of SF-36 data have been cross-sectional and therefore stability of item response has not been evaluated [5-7, 10, 17]. Two studies assessed performance across repeated administrations, following pre-post designs [18,19]. Martin et al. [18] utilised the SF-36 as one of three evaluation tools pre- and post-treatment for rheumatoid arthritis (n = 339), but with the aim of comparing the measurement properties of these tools and determining sensitivity to change rather than stability. IRT analysis of the PF-10 revealed weaknesses in sensitivity to treatment response at 6 and 12 months, with the authors suggesting construction of a more comprehensive measure. McHorney et al. [19] compared IRT and Likert scoring methods for the SF-36 Physical Functioning-10 scale, using a pre-post design. The findings showed apparent differences in patients with very high and low physical functioning, suggesting that the Rasch model of scoring may have important implications for clinical interpretations of the scale [19].
Only one longitudinal study has evaluated properties of the SF-36 using IRT methodologies. The first administration of the standardised SF-36 was conducted as part of a four-year longitudinal Medical Outcomes Study of patients (N = 3,445) with chronic medical and psychiatric conditions [16]. Examination of the reproducibility of the item calibrations of the Physical Functioning-10 scale was conducted, from baseline to two years [16]. A high degree of consistency in item calibration between the two time points was found, both in order and magnitude [16]. However, this longitudinal study only evaluated the stability and structural validity of the Physical Functioning-10 scale using IRT. The stability of the remaining SF-36 subscales, the physical and mental health domains, and the measure as a whole over time has not been examined using IRT to date.
A lack of evaluation regarding the performance of the SF-36 over time presents a significant gap in the literature, with unanswered questions about its measurement stability. It is vital that the long-term reliability of the SF-36 is examined, to determine its true suitability for inclusion in large-scale longitudinal studies tracking participants, particularly as they age over extended periods of time. This study therefore seeks to use an IRT-based methodology to evaluate the item stability of the SF-36 total and component summaries in a large, longitudinal data set. The following questions guided this research:
(1) Is there disordering or dysfunction within the SF-36 items against the construct being measured?
(2) Do the SF-36 items have a consistent hierarchy of difficulty and good distribution across all waves of a longitudinal survey?
(3) Is the SF-36 differentiating discrete subgroups of people reliably (e.g., urban vs. regional)?
(4) Does the SF-36 measure one or more constructs?
(5) Were all items in the SF-36 instrument used by all participant subgroups in the same way?

Methods
Data were from an Australian prospective, population-based survey. which includes all Australian citizens and permanent residents. Women living in regional and remote areas were sampled at twice the rate of women living in urban areas in order to allow for meaningful statistical comparisons between urban and country-dwelling women. Over 40,000 women responded to the baseline postal survey in 1996, with response rates across the three age groups ranging between 37% and 52% [20]. Although some immigrant groups were underrepresented and tertiary educated women were overrepresented, the responding samples were considered to be "reasonably representative" of the Australian female adult population following a comparison to census data [21]. Each cohort has since been surveyed every three years on a rolling basis, commencing with the 1946-51 cohort in 1998, the 1921-26 cohort in 1999, and the 1973-78 cohort in 2000. Only data from the 12,432 respondents in the 1921-26 cohort were analysed in the current study. At the commencement of the longitudinal survey, these women were aged 70-75 years, and at the time of survey six, they were aged in their early nineties (N = 4,055), with most attrition being due to death (N = 5,273).
A study analysed potential biases introduced through the attrition of participants from this cohort between survey one and survey five [22]. Nondeath attrition was related to having less education, not being born in Australia, being a current smoker, and having poorer health in this cohort. Analysis comparing the survey population to the Australian Census data collected over the same time period showed an increase in the underrepresentation of women from non-English speaking backgrounds and an increase in the overrepresentation of current and ex-smokers. Differences between the study population and the national population were considered to have changed "only slightly" between survey one and survey five.

Instrument.
The SF-36 HRQoL scale is included in each survey. At baseline in 1996, mean scores for the 1921-26 cohort were lower than for other cohorts for the physical health subscales (PF, RP, and BP) and higher than for other cohorts for the mental health subscales (MH, RE, and BP) [23]. Over time, mean PF scale scores have declined, but with significant variation across different subgroups within the cohort [24]. Mean MH scores have remained relatively stable [25].

Data Analysis.
A two-stepped approach was taken to evaluate the reliability and validity of the SF-36 across surveys one to six. First, Rasch analyses using Winsteps version 3.92.0 [26], with the joint maximum likelihood estimation method [27], were performed on all 36 items for each of the six waves of data collection and then on the items that constitute the physical health scales (PF 10 items, RP 4 items, BP 2 items, and GH 5 items), the mental health scales (V, SF 2 items, RE 3 items, and MH 5 items), and the item measuring health transition for each wave of data. The RMM was adopted for the data analysis since the 6-point response Likert scale was invariant across all the 36 items. The RMM adopts a "the data fit the model" approach: "The empirical data must meet the prior requirements of Rasch model in order to achieve objective measurement" [28, p. 65]. Several criteria including item infit and outfit statistics, reliability measures, rating scale functioning, and differential item functioning (DIF) were used to investigate the quality of the SF-36 total scale, physical health scale, and mental health scale. Item fit statistics indicate the extent to which the data match the expectations of the RMM. Outfit and infit mean squares (MNSQ) as well as their standardized forms (ZSTD) were used.

Is There Disordering or Dysfunction within the SF-36 Items against the Construct Being Measured? Response Scale.
Category and step (threshold) disordering of the response scale was examined. To determine whether the rating response scales were being used in the expected manner, the rate at which average measure scores (frequency endorsed) increased in relation to category increases was examined for even distribution. A uniform category distribution is achieved when average measure scores increase monotonically as the category increases. If categories are poorly defined or items are included that do not fit the construct, then non-uniformity occurs. Fit mean squares (MNSQ) below 0.7 or above 1.4 indicate a category misfit. When a disordered category is observed, consideration should be given to collapsing it with an adjacent category [29].
The distance between categories is indicated by Andrich thresholds, or step calibrations. If there is no overlap, then categories should progress monotonically. Disordered steps indicate that a category covers only a narrow segment of the variable, rather than a problem with the sequencing of category definitions. An increase of at least 1.0 logit between thresholds indicates distinct categories on a 5-category scale, whereas an increase of more than 5.0 logits indicates a gap in the variable [30].
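The ordering check on step calibrations can be sketched as below. This is a minimal illustration rather than the Winsteps procedure: the function name and the example threshold values are hypothetical, and only the monotonic ordering and the 1.0-logit minimum advance described above are checked.

```python
def check_thresholds(thresholds, min_advance=1.0):
    """Check Andrich thresholds (logits) for ordering and minimum advance.

    thresholds: step calibrations listed from the lowest to the highest step.
    min_advance: smallest advance (logits) treated as a distinct category.
    """
    # Advance between each pair of adjacent thresholds.
    advances = [b - a for a, b in zip(thresholds, thresholds[1:])]
    return {
        "ordered": all(d > 0 for d in advances),
        "distinct": all(d >= min_advance for d in advances),
        "advances": advances,
    }


# Hypothetical step calibrations for a 5-category item (illustrative only).
ordered_example = check_thresholds([-2.1, -0.8, 0.4, 1.9])
disordered_example = check_thresholds([-1.0, 0.5, 0.2, 1.4])
print(ordered_example["ordered"], disordered_example["ordered"])
```

In the second example the third threshold falls below the second, so the steps are flagged as disordered.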

Do the SF-36 Items Have a Consistent Hierarchy and Good Distribution across All Waves? Person and Item Fit Statistics.
Misfitting items and the pattern of responses for each survey respondent were identified using fit statistics. These are used to determine whether an instrument is a valid measure of the construct it claims to measure. Fit statistics were examined to determine whether the items contribute to the measurement of a single construct, and to assess the reliability of any one person's responses. Item and person measures are reported in log-odds units (logits). The constructs reviewed in this study are health-related quality of life as a whole, as well as quality of life related to physical health and mental health. Two statistics, the unstandardised mean square (MNSQ) and its standardised form, Z-Standard (Z-STD), were used to measure item and person infit and outfit. MNSQ values for infit and outfit should be close to 1.0 to fit the model for rating scales, but values within the range of 0.7-1.4 are considered acceptable [15]. The model is degraded by underfit (i.e., values > 1.0), which indicates possible other sources of variance in the data and requires further investigation to determine its cause. Conversely, overfit (values < 1.0) does not always degrade the model but could result in a misinterpretation that the model worked better than expected [15]. Z-STD values for outfit have an expected value of 0. If a value exceeds ±2, it is deemed to fall outside of the predicted model [15].
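As a simplified illustration of how infit and outfit mean squares are formed and then judged against the 0.7-1.4 range, the sketch below uses a dichotomous Rasch model rather than the rating scale model fitted in Winsteps. The function names and the example data are hypothetical and are not taken from the study.

```python
import math


def rasch_p(theta, delta):
    """Probability of endorsing a dichotomous item (theta, delta in logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def item_fit(responses, thetas, delta):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    sq_resid, variances, z_sq = [], [], []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, delta)
        var = p * (1.0 - p)              # model variance of the response
        sq_resid.append((x - p) ** 2)    # squared raw residual
        variances.append(var)
        z_sq.append((x - p) ** 2 / var)  # squared standardised residual
    outfit = sum(z_sq) / len(z_sq)         # unweighted mean square
    infit = sum(sq_resid) / sum(variances)  # information-weighted mean square
    return infit, outfit


def classify_mnsq(mnsq, low=0.7, high=1.4):
    """Label a mean square against the acceptable range used in the text."""
    if mnsq > high:
        return "underfit"
    if mnsq < low:
        return "overfit"
    return "acceptable"


# Hypothetical data: four persons at 0 logits answering an item at 0 logits.
infit, outfit = item_fit([1, 0, 1, 0], [0.0, 0.0, 0.0, 0.0], delta=0.0)
print(classify_mnsq(infit), classify_mnsq(outfit))
```

Outfit weights every response equally and is therefore more sensitive to unexpected responses from persons far from the item's difficulty, whereas infit weights responses by their model variance.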
The person reliability statistic is equivalent to Cronbach's alpha used in CTT and indicates a measure's internal consistency (the relatedness amongst items) [15]. When person reliability values are low (i.e., < 0.8), the implications are twofold: (1) an instrument may not be sensitive enough to distinguish between high and low performers and more items are required; or (2) there were not enough persons in the sample with both high and low extreme values (a narrow range of person measures).
Person separation (if the outlying measures are accidental) and the person separation index (PSI)/strata (if the outlying measures represent true performances; calculated as (4 × person separation + 1)/3) are used to classify people. Person separation reports whether the test separates the sample into enough levels, with a reliability of 0.5 indicating separation into only one or two levels, 0.8 indicating separation into 2-3 levels, and 0.9 indicating separation into 3 or 4 levels [29]. Low person separation suggests that the instrument is not sensitive enough to separate high and low performers. A PSI/strata value of 3 is needed to consistently identify three different levels of performance (i.e., the minimum level required to attain a reliability of 0.9). Item separation verifies the item hierarchy: fewer than 3 levels (high, medium, and low), with item reliability < 0.9, indicates that the sample is too small to confirm the construct validity (item difficulty hierarchy) of the instrument.
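The relationship between separation, reliability, and strata can be made concrete with a short sketch. It assumes the standard Rasch identity, reliability = G^2 / (1 + G^2) for a separation of G, together with the strata formula given above; the function names are hypothetical.

```python
import math


def reliability_from_separation(g):
    """Reliability implied by a separation of g."""
    return g ** 2 / (1.0 + g ** 2)


def separation_from_reliability(r):
    """Separation implied by a reliability of r (0 <= r < 1)."""
    return math.sqrt(r / (1.0 - r))


def strata(g):
    """Number of statistically distinct levels: (4 * separation + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0


# Reliability benchmarks mentioned in the text.
for r in (0.5, 0.8, 0.9):
    g = separation_from_reliability(r)
    print(f"reliability {r:.1f}: separation {g:.2f}, strata {strata(g):.2f}")

# Mean person separations reported for the total and physical health scales.
total_scale = round(reliability_from_separation(0.73), 2)
physical_scale = round(reliability_from_separation(0.93), 2)
print(total_scale, physical_scale)
```

Under this identity, the mean person separations of 0.73 and 0.93 reported in the results correspond to reliabilities of about 0.35 and 0.46, which agrees with the reported person reliabilities.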

Does the SF-36 Measure One or More Constructs? Dimensionality of the Scale.
Dimensionality is tested by the following: (a) finding potentially problematic items by checking for negative point-biserial correlations; (b) identifying misfitting persons or items using Rasch fit statistics; and (c) conducting Rasch factor analysis using principal components analysis (PCA) of the standardised residuals [31]. PCA of residuals checks that there are no further principal components (dimensions) after the intended or Rasch dimension is removed. No further dimensions are indicated if the residuals for pairs of items are uncorrelated and normally distributed. The criteria for concluding that no further dimensions were present in the residuals were as follows: (1) >60% of the variance is explained by the Rasch factor; (2) an eigenvalue of <3 on the first contrast; and (3) variance explained by the first contrast is <10% [32].
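The three residual-based criteria can be expressed as a simple check. This is a sketch only: the function name and the example inputs are hypothetical, and the PCA of standardised residuals itself would come from the Rasch software output.

```python
def unidimensionality_checks(rasch_variance_pct, first_contrast_eigenvalue,
                             first_contrast_pct):
    """Evaluate the three criteria listed above for a single analysis.

    rasch_variance_pct: percent of raw variance explained by the Rasch factor.
    first_contrast_eigenvalue: eigenvalue of the first residual contrast.
    first_contrast_pct: percent of variance explained by the first contrast.
    """
    checks = {
        "rasch_variance_over_60": rasch_variance_pct > 60.0,
        "first_contrast_eigenvalue_under_3": first_contrast_eigenvalue < 3.0,
        "first_contrast_variance_under_10": first_contrast_pct < 10.0,
    }
    checks["all_met"] = all(checks.values())
    return checks


# Hypothetical inputs, for illustration only.
meets_all = unidimensionality_checks(65.0, 2.1, 6.5)
fails_some = unidimensionality_checks(55.0, 2.5, 12.0)
print(meets_all["all_met"], fails_some["all_met"])
```

Reporting each criterion separately, rather than a single pass/fail flag, makes it clear which part of the evidence for unidimensionality is weak.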
The person-item dimensionality map provides a schematic representation of how person abilities and item difficulties are distributed using a logit scale. Items that represent similar difficulty will occupy the same place on the logit scale. If a person is represented on the logit scale with no corresponding item, then there are gaps in the item difficulty continuum. Another indicator of overall distribution is the person measure score. If people in the sample are more able than the most difficult item on a scale, then the person measure score location will be lower than the centralised item mean measure score (i.e., <50). If people in the samples are less able than the items on a scale, then the mean person location will be higher (i.e. >50).

Were All Items in the SF-36 Instrument Used by All Groups in the Same Way? Differential Item Analysis.
A differential item functioning (DIF) analysis was performed to investigate whether items in the instrument were used by all groups in the same way. DIF is noticeable when a response to an item is influenced by a characteristic of the respondent other than their ability on the underlying trait. For DIF analysis, the sample was categorised by marital status (single, widowed, divorced, married, de facto, and other) and location (urban vs. regional). In determining DIF when comparing two groups (i.e., urban and regional), the hypothesis "this item has the same difficulty for two groups" is used. The difference in the difficulty of the item between the two groups, indicated by the DIF contrast, should be at least 0.5 logits with a p-value < 0.05 for DIF to be noticeable. In determining DIF when comparing more than two groups (i.e., marital status), the hypothesis "this item has no overall DIF across all groups" is used. DIF is then determined using the chi-square statistic and a p-value < 0.05 [29].
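The two-group decision rule can be sketched as follows. The function names and the example difficulties are hypothetical; the p-value is taken as an input because it would be supplied by the Rasch software rather than computed here.

```python
def dif_contrast(difficulty_group_a, difficulty_group_b):
    """DIF contrast: difference in item difficulty (logits) between two groups."""
    return difficulty_group_a - difficulty_group_b


def is_noticeable_dif(contrast, p_value, min_contrast=0.5, alpha=0.05):
    """Apply the two-group rule: |contrast| >= 0.5 logits and p < 0.05."""
    return abs(contrast) >= min_contrast and p_value < alpha


# Hypothetical item difficulties for urban and regional respondents.
contrast = dif_contrast(0.42, -0.15)
flagged = is_noticeable_dif(contrast, p_value=0.01)

# A small contrast is not flagged even when it is statistically significant.
small_but_significant = is_noticeable_dif(0.20, p_value=0.001)
print(flagged, small_but_significant)
```

Requiring both conditions guards against flagging trivially small contrasts that reach statistical significance only because the sample is large.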

SF36 Total Scale Rasch Analysis for Six Waves of Data Collection.
Total Rasch scale item statistics for six waves of data collection are shown in Table 1. When all 36 SF-36 items were calibrated using the RMM for the six waves of data collection, MNSQ infit statistics ranged from 0.13 to 2.43 and outfit statistics ranged from 0.22 to 2.64 (see Table 2). The mean item measure was 0.00 logits (SD = 1.12). With respect to logit measures, there was a broad range, the lowest value being -3.01 and the highest value being +2.31. This resulted in an average item separation index of 77.98 and an average item reliability of 1.00 over the six waves (see Table 3).
The SF-36 total scale person-item map in Supplemental Figure 1 shows evidence of consistent hierarchical ordering of the SF-36 total scale items. Items which were less difficult are located at the bottom of the person-item map while more difficult items are located at the top of the map. The figure also shows that while each of the waves had a reasonable distribution of items in relation to item difficulty, several of the SF-36 total scale items have the same level of difficulty.
The average person measure was 0.75 logits (SD = 0.23) over the six waves of data collection (see Table 3). The mean person separation was 0.73 with a mean reliability of 0.35 (see Table 3). When examining the overall RMM output of the SF-36 total scale, the average person measure (0.75 logits) was higher than the average item measure (0.00 logits). The range of logit values for items was from +1 to -3 logits. The person reliability was 0.35 and item reliability was 1.00. This places the item reliability for the SF-36 total scale in the acceptable range and the person reliability in the unacceptable range.
The separation index for items was greater than 2.0, indicating adequate separation of the items on the construct being measured. However, the separation index for persons was less than 2.0, indicating inadequate separation of participants on the construct.
Item fit to the unidimensionality requirement of the RMM was also examined. Eleven out of the 36 items were found to have MNSQ infit and outfit statistics inside the 0.70 to 1.30 range and/or a z-score that fell inside the +2 to -2 range. Specifically, items GH01:Q1, PF01:Q3A, PF04:Q3D, PF06:Q3F, PF07:Q3G, MH04:Q9F, VT03:Q9G, VT04:Q9I, SF02:Q10, GH02:Q11A, and GH04:Q11C met the RMM requirements (see Table 2). In other words, only 30.6% (11 of 36) of the SF-36 total scale items met the RMM requirements. The following items had an Infit MNSQ statistic that was less than 0.
The Winsteps RMM program determines the dimensionality of a scale by using a Rasch-residual principal components analysis. When the item residuals from the RMM output were factor analysed, no significant factor loadings were present (see Table 4). This indicated that the unidimensional requirement of the SF-36 total scale was met. The raw variance explained by the SF-36 total scale over the six waves of data collection ranged from 58.5% to 62.1% and the unexplained variance in the first contrast ranged from 11.9% to 14.5%. The residual analysis completed indicated that no second dimension or factor existed. Linacre [32] suggests that a first single factor with 60% or greater of the accounted for variance is considered a reasonable unidimensional construct. "A second factor or residual factor should not indicate a substantial amount of variance if unidimensionality is tenable" [33, p. 192].
The point-measure correlation (PTMEA) ranges from +1 to -1 "with negative items suggesting improper scoring or not functioning as expected" [33, p. 192]. An inspection of the PTMEAs for the SF-36 total scale indicated that items GH01:Q1, SF01:Q6, BP01:Q7, and VT02:Q9E had consistently negative PTMEAs over the six waves of data collection. The rest of the SF-36 total scale items had positive PTMEAs with acceptable values, supporting item-level polarity.
The functioning of the six rating scale categories was examined for the SF-36 total scale. Rating scale frequency and percent indicated that all categories were used by the participants. The category use statistics are presented in Table 5. The category logit measures ranged from -3.19 to 2.86 (see Table 5). None of the categories had an infit MNSQ score outside the 0.7-1.30 range and/or a z-score outside the +2 to -2 range. The results indicated that the six-level rating scale used in the SF-36 total scale fits appropriately to the predictive RMM (see Supplemental Figure 2), and the full range of ratings was used by the participants who completed the SF-36 total scale. The probability curves for the rating scales of the six waves of data collection illustrated that each threshold estimate represented a separate point on the measure variable and each response category was the most probable category for some part of the continuum.
To investigate the possibility of item bias, differential item functioning (DIF) analysis was conducted to determine whether different groups of participants based on marital status and area of residence (urban versus regional; see Table 6) responded differently on the SF-36 total scale items, despite having the same level of the latent trait being measured [34]. Three of the SF-36 items exhibited a consistent pattern of DIF over the six waves of data collection for both marital status and area of residence, those being MH01:Q9B, MH02:Q9C, and MH05:Q9H. It should be noted that these three items also exhibited MNSQ infit scores outside the 0.7-1.30 range and/or a z-score that fell outside the +2 to -2 range.

SF36 Physical
Table 7). When the 21 SF-36 items were calibrated using the RMM for the six waves of data collection, the items were found to have MNSQ infit statistics ranging from 0.18 to 2.66 and outfit statistics ranging from 0.19 to 2.77 (see Table 8). The mean item measure was 0.00 logits (SD = 0.99). With respect to logit measures, there was a broad range, the lowest value being -2.49 and the highest value being +1.79 (see Table 9). This resulted in an average item separation index of 60.32 and an average reliability of 1.00 over the six waves of data collection (see Table 9). The separation index for items was greater than 2.0, indicating adequate separation of the items on the construct being measured.
The SF-36 physical health scale person-item map is located in Supplemental Figure 3 and reports evidence of the hierarchical ordering of the SF-36 physical health scale items. Items which are easier are located at the bottom of the SF-36 physical health person-item map while more difficult items are located at the top of the map. The patterns of more challenging items and less difficult items on the person-item map for each of the six waves of data collection appear to be fairly consistent. It should also be noted that several of the SF-36 physical health scale items have the same level of difficulty.
The average person measure was 1.91 logits (SD = 0.39) over the six waves of data collection (see Table 9). The mean person separation was 0.93 with a mean reliability of 0.46 (see Table 9). A mean person separation index of less than 2.0 indicates inadequate separation of participants on the SF-36 physical health construct. When examining the overall RMM output of the SF-36 physical health scale, the average person measure (1.91 logits) was higher than the average item measure (0.00 logits). The range of logit values for items was from +1.62 to -2.49 logits. The person reliability was 0.46 and item reliability was 1.00. Reliability values of .80 or greater are generally considered desirable [35]. This places the item reliability for the SF-36 physical health scale in the acceptable range and the person reliability below the desired range.
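The separation index and the reliability coefficient reported above are two expressions of the same information. Under the usual Rasch convention, reliability R and separation G are related by R = G^2 / (1 + G^2), so a reliability of .80 corresponds to a separation of 2.0. The following is a minimal sketch of that conversion; the function names are illustrative and are not part of the Winsteps output, and the inputs are the mean values quoted in the text.

```python
import math


def reliability_from_separation(separation: float) -> float:
    """Convert a separation index G into a reliability R = G^2 / (1 + G^2)."""
    return separation ** 2 / (1.0 + separation ** 2)


def separation_from_reliability(reliability: float) -> float:
    """Convert a reliability R (0 <= R < 1) into a separation G = sqrt(R / (1 - R))."""
    if not 0.0 <= reliability < 1.0:
        raise ValueError("reliability must lie in [0, 1)")
    return math.sqrt(reliability / (1.0 - reliability))


if __name__ == "__main__":
    # Mean values quoted in the text for the physical health scale.
    for label, separation in [("person", 0.93), ("item", 60.32)]:
        reliability = reliability_from_separation(separation)
        print(f"{label}: separation {separation:.2f} -> reliability {reliability:.2f}")

    # The .80 reliability benchmark expressed as a separation index.
    print(f"reliability 0.80 -> separation {separation_from_reliability(0.80):.2f}")
```

Read this way, the person separation of 0.93 and the person reliability of 0.46 describe the same shortfall against the 2.0 and .80 benchmarks rather than two independent findings.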
Item fit to the unidimensionality requirement of the RMM was also examined. Seven out of the 21 items were found to have MNSQ infit and outfit statistics inside the 0.70 to 1.30 range and/or a z-score that fell inside the +2 to -2 range. Therefore, items 1:Q1, PF01:Q3A, PF04:Q3D, PF06:Q3F, PF07:Q3G, GH02:Q11A, and GH04:Q11C met the RMM requirements (see Table 2). In other words, only 7/21 or 33.3% of the SF-36 physical health scale items met the RMM requirements. The following items had an Infit MNSQ statistic that was less than 0. [...] When the item residuals from the RMM output were factor analysed, no significant factor loadings were present (see Table 10). This indicated that the unidimensionality requirement of the SF-36 physical health scale was met. The raw variance explained by the SF-36 physical health scale over the six waves of data collection ranged from 41.6% to 48.9% and the unexplained variance in the first contrast ranged from 17.4% to 22.4%. The residual analysis indicated that no second dimension or factor existed.
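The infit and outfit statistics used in this screening are both mean squares of the residuals between observed and model-expected responses. Outfit is the unweighted mean of the squared standardized residuals, whereas infit weights each squared residual by the model variance of the response. The sketch below illustrates the two definitions on a small set of hypothetical dichotomous responses; the data, the function names, and the helper for the 0.70-1.30 screening range are illustrative assumptions and do not reproduce the study data or the Winsteps implementation.

```python
def outfit_mnsq(observed, expected, variance):
    """Unweighted mean of squared standardized residuals."""
    z_squared = [(x - e) ** 2 / v for x, e, v in zip(observed, expected, variance)]
    return sum(z_squared) / len(z_squared)


def infit_mnsq(observed, expected, variance):
    """Information-weighted mean square: sum of squared residuals over sum of variances."""
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    return sum(squared_residuals) / sum(variance)


def within_mnsq_range(mnsq, lower=0.70, upper=1.30):
    """Screening rule quoted in the text: MNSQ between 0.70 and 1.30."""
    return lower <= mnsq <= upper


if __name__ == "__main__":
    # Hypothetical responses to one item from five persons (not study data).
    observed = [1, 0, 1, 1, 0]
    # Model-expected scores (success probabilities) for the same persons.
    expected = [0.8, 0.3, 0.6, 0.9, 0.2]
    # For a dichotomous item the model variance is p * (1 - p).
    variance = [p * (1 - p) for p in expected]

    infit = infit_mnsq(observed, expected, variance)
    outfit = outfit_mnsq(observed, expected, variance)
    print(f"infit = {infit:.2f}, within range: {within_mnsq_range(infit)}")
    print(f"outfit = {outfit:.2f}, within range: {within_mnsq_range(outfit)}")
```

Values well below 1.0, as in this toy example, correspond to overfit (responses more predictable than the model expects), while values well above 1.0 correspond to underfit.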
The functioning of the six rating scale categories was examined for the SF-36 physical health scale. The category logit measures ranged from -3.86 to 5.43 (see Table 11). Of the six rating scale categories, only one, category six, had infit MNSQ scores outside the 0.7-1.30 range and/or a z-score outside the +2 to -2 range over the six waves of data collection. The infit MNSQ scores for this rating category ranged from 2.03 to 3.18 (see Table 11). The results indicated that the six-level rating scale used in the SF-36 physical health scale might not be the most robust to use (see Supplemental Figure 3); however, the full range of ratings was used by the participants who completed the SF-36 physical health scale. The probability curves for the rating scales of the six waves of data collection illustrated that each threshold estimate represented a separate point on the measure variable and that each of the first five response categories was the most probable category for some part of the continuum. Rating category six was problematic.
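The probability-curve criterion described above, that each response category should be the most probable category somewhere along the continuum, can be illustrated with a short sketch. It assumes an Andrich rating scale parameterisation, which the text does not state explicitly, and uses hypothetical threshold values rather than the estimates in Table 11. Categories are indexed 0 to 5 in the code, corresponding to rating categories one to six.

```python
import math

# Grid of person measures (logits) from -6.0 to 6.0 in steps of 0.1.
GRID = [i / 10 for i in range(-60, 61)]


def category_probabilities(theta, difficulty, thresholds):
    """Category probabilities under an Andrich rating scale model (assumed here).

    The log-numerator for category k is the sum over j <= k of
    (theta - difficulty - thresholds[j]); category 0 has log-numerator 0.
    """
    log_numerators = [0.0]
    running = 0.0
    for tau in thresholds:
        running += theta - difficulty - tau
        log_numerators.append(running)
    peak = max(log_numerators)  # subtract the maximum for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(difficulty, thresholds, grid=GRID):
    """Return the set of categories that are most probable at some grid point."""
    modal = set()
    for theta in grid:
        probs = category_probabilities(theta, difficulty, thresholds)
        modal.add(probs.index(max(probs)))
    return modal


if __name__ == "__main__":
    ordered = [-2.0, -1.0, 0.0, 1.0, 2.0]      # hypothetical, ordered thresholds
    disordered = [-2.0, -1.0, 0.0, 2.0, 1.0]   # hypothetical, last two reversed
    print("ordered:", sorted(modal_categories(0.0, ordered)))
    print("disordered:", sorted(modal_categories(0.0, disordered)))
```

With ordered thresholds every category is modal over some interval of the measure, whereas reversing two adjacent thresholds leaves the category between them never the most probable, which is the pattern usually taken to justify collapsing categories.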
The Rasch output logit performance scores for the participants were compared to determine if any of the SF-36 physical scale items exhibited differential item functioning (DIF), based on marital status and area of residence (urban versus regional) (see Table 12). Four of the SF-36 physical health items exhibited a consistent pattern of DIF over the six waves of data collection. Item PF03:Q3C demonstrated DIF based on marital status alone while items GH02:Q11A, GH04:Q11C, and GH05:Q11D exhibited DIF based on both marital status and area of residence (see Table 12). It should [...]

SF36 Mental
[...] Table 13). The mean item measure was 0.00 logits (SD = 1.12).
With respect to logit measures, there was a broad range, the lowest value being -3.01 and the highest value being +2.31 (see Table 14). This resulted in an average item separation index of 79.17 and an average reliability of 1.00 over the six waves (see Table 15). The separation index for items was greater than 2.0, indicating adequate separation of the items on the construct being measured. The SF-36 mental health scale person-item map is shown in Supplemental Figure 5 and reports evidence of the hierarchical ordering of the SF-36 mental health scale items. It should also be noted that several of the SF-36 mental health scale items have the same level of difficulty. The average person measure was 0.75 logits (SD = 0.23) over the six waves of data collection (see Table 15). The mean person separation was 0.73 with a mean reliability of 0.35 (see Table 15). A mean person separation index of less than 2.0 indicates inadequate separation of participants on the SF-36 mental health construct.
When examining the overall RMM output of the SF-36 mental health scale, the average person measure (0.75 logits) was higher than the average item measure (0.00 logits). The range of logit values for items was from +2.13 to -2.08 logits. The person reliability was 0.35 and item reliability was 1.00. Reliability values of .80 or greater are generally considered desirable [35]. This places the item reliability for the SF-36 mental health scale in the acceptable range and the person reliability below the desired range.
Item fit to the unidimensionality requirement of the RMM was also examined. Nine out of the 14 items were found to have MNSQ infit and outfit statistics inside the 0.70 to 1.30 range and/or a z-score that fell inside the +2 to -2 range; thus, items VT01:Q9A, MH01:Q9B, MH03:Q9D, VT02:Q9E, MH04:Q9F, VT03:Q9G, MH05:Q9H, VT04:Q9I, and SF02:Q10 met the RMM requirements (see Table 14). In other words, only 9/14 or 64.3% of the SF-36 mental health scale items met the RMM requirements. The following items had [...] When the item residuals from the RMM output were factor analysed, no significant factor loadings were present (see Table 16). This indicated that the unidimensionality requirement of the SF-36 mental health scale was met. The raw variance explained by the SF-36 mental health scale over the six waves of data collection ranged from 62.5% to 66.1% and the unexplained variance in the first contrast ranged from 15.1% to 16.5%.
An inspection of the PTMEAs for the SF-36 mental health scale indicated that, for all other items, the PTMEA correlations had acceptable values. All the SF-36 mental health scale items had PTMEAs that were positive, supporting item-level polarity.
The functioning of the six rating scale categories was examined for the SF-36 mental health scale. Items which are easier are located at the bottom of the SF-36 mental health person-item map while more difficult items are located at the top of the map. The patterns of more challenging items and less difficult items on the person-item map for each of the six waves of data collection appear to be fairly consistent. The category logit measures ranged from -3.86 to 2.57 (see Table 17). Of the six rating scale categories, only one, category one, had infit MNSQ scores outside the 0.7-1.30 range and/or a z-score outside the +2 to -2 range over the six waves of data collection. The infit MNSQ scores for this rating category ranged from 1.38 to 1.41 (see Table 17). The results indicated that the six-level rating scale used in the SF-36 mental health scale might not be the most robust to use (see Supplemental Figure 6); however, the full range of ratings was used by the participants who completed the SF-36 mental health scale. The probability curves for the rating scales of the six waves of data collection illustrated that each threshold estimate represented a separate point on the measure variable and that each of the latter five response categories was the most probable category for some part of the continuum. Rating category one was problematic.
The Rasch output logit performance scores for the participants were compared to determine if any of the SF-36 mental scale items exhibited differential item functioning (DIF), based on marital status and area of residence (urban versus regional) (see Table 18). Six of the SF-36 mental health items exhibited a consistent pattern of DIF over the six waves of data collection. Items SF01:Q6, MH01:Q9B, MH02:Q9C, MH03:Q9D, MH04:Q9F, and MH05:Q9H exhibited DIF based on both marital status and area of residence (see Table 18). It should be noted that items MH01:Q9B and MH03:Q9D had infit MNSQ statistics that fell outside the 0.7-1.30 range. SF-36 mental health items MH01:Q9B and MH03:Q9D appear to be particularly problematic items based on the RMM analysis findings.

Is There Disordering or Dysfunction within the SF-36 Items against the Construct Being Measured?
For the SF-36 as a total measure, the rating scale categories increased monotonically, indicating that rating response scales were being used as expected and are appropriate for measurement across all waves. Previous longitudinal evaluation of the measure using CTT methods found poor test-retest reliability between two time points two weeks apart [36]. Previous research using IRT methods has been largely cross-sectional, providing little longitudinal evaluation of the measure using this method [5,6,10,17]. In this sample, the pattern of more and less difficult items is consistent, indicating that item difficulty remained stable across each wave. Despite consistency across time in this sample, redundancy emerged as an issue, with several total scale items displaying the same level of difficulty across all waves of data. This was seen again in both the SF-36 mental and physical health summary scores. It appears that redundant items span all uses of the measure, which suggests that item descriptors need to be more specific to avoid overlap across similar items.
Category Six of the SF-36 physical health summary scale and Category One of the SF-36 mental health scale had scores outside the acceptable range, which may indicate these rating categories are not robust for use in longitudinal studies. Disordered categories were seen in a previous evaluation of the SF-36, with the authors suggesting that some category response options be collapsed [5]. The current findings are consistent with this issue. Further investigation into the category disordering in the SF-36 mental and physical health response scales is warranted, and collapsing the response option categories may improve this, as suggested in previous literature [5,17].
When examining summary statistics for total SF-36 items, the mean person reliability fell in the unacceptable range. Inadequate person separation reliability was also seen across all waves of data, in both summary scales. The person separation index indicates that the instrument, used as a whole and as summary scales, is not sensitive enough to separate high and low performances in the sample [29]. This presents an issue with internal consistency across all presentations of the measure. By comparison, using classical methods, the measure was seen to discriminate between patients pre- and postoperation [37]. Results using IRT suggest that the measure is unable to discriminate between high and low performances.
While results of IRT have raised doubts about the measure's internal consistency, results from classical testing methods report strong internal consistency, reflected in high Cronbach's alpha scores. When validating the measure in patients with endometriosis, Cronbach's alpha for the total scale was above acceptable cut-offs [38]. Internal consistency scores have also been seen to be above .9 for the full scale and above .7 for each subscale [39]. In addition to internal consistency, the measure displayed acceptable content validity, correlating strongly with similar measures [38]. IRT assesses instrument reliability at the item level rather than at the instrument level, and also considers the importance of participant responses.
The contrast between results from IRT and CTT could be due to the further focus at item level that is characteristic of IRT. It is possible that overlapping items identified in the person-item map are contributing to the lack of sensitivity in the scale. Adding more items or altering current items to improve sensitivity may improve the person reliability. Further investigation into the similarity and specificity of these items is warranted, to ensure items capture the full variable being measured.

Do the SF-36 Items Have a Consistent Hierarchy and Good Distribution across All Waves?
Several items on the total scale and both summary scales were found to have Infit statistics outside of the acceptable range. Many of the items remained problematic regardless of whether they were investigated as part of the whole measure or by summary scale. The number of misfitting items was slightly lower in the summary scales; however, this may be due to the fewer items included in the summary scale analyses. These underfitting items create concerns about degradation of the model and the validity of the measure as a measure of health-related quality of life [15]. Further investigation into such items is required to determine the reason for underfit. While overfit items do not degrade the model, they can result in misinterpretation of the model as working better than expected and also warrant further investigation [15].

Does the SF-36 Measure One or More Constructs?
The measure proved to be unidimensional across total scale and summary score analyses, indicating that responses to each scale are likely to be determined by a single trait. As a total scale, the first single factor accounted for close to 60% of the variance across all six waves and the factor was considered unidimensional [32]. Residual analysis also indicated that no second dimension or factor existed, further confirming unidimensionality of the total scale [33]. Analysis of all eight subscales revealed each scale measured a single latent trait [6]. Principal components analysis of the physical and mental health summary scores has confirmed the presence of a two-factor model [12], and the results of the current study support the mental and physical health scales. Results suggest the responses to the measure are determined by a single factor. While the responses may be determined by a single factor, previously identified misfitting and overlapping items may degrade the model and validity, suggesting that it may not be health-related quality of life that is determining response to these items. Further research should aim to correct misfitting items and reassess unidimensionality.

Were All Items in the SF-36 Instrument Used by All Groups in the Same Way?
It appears that marital status and area of residence influence responses to both total and summary scale items. Differential item functioning has been identified in the SF-36 previously, with health issues such as hypertension, respiratory issues, and diabetes influencing responses on five items in the measure [10]. Previously, the presence of DIF has been considered negligible, as it was only present for a small number of items [10]. As the SF-36 is a health-related quality of life measure, it is plausible that marital status or area of residence would have an impact in this domain, as these factors can influence healthcare use and quality of life. However, the presence of DIF limits the comparability of scores across different populations.
While several items on each summary scale and total scale exhibited DIF, only item 24:Q9B demonstrated DIF across analysis of total scale and items in the summary scales. This particular item also demonstrated Infit statistics outside the acceptable range, proving to be particularly problematic in every presentation of the measure. Several other items demonstrated DIF and misfit. Given the number of items exhibiting DIF and misfit across all presentations of the measure, further investigation is needed into these specific items.

Limitations and Future Research
While the current study revealed differences between IRT and CTT evaluations of the SF-36, it did not compare each method in the same sample. Future research may apply both methods to the same sample in order to explain the differences between methods and the advantages of applying different frameworks when developing and evaluating measures. It may also be beneficial to compare methods longitudinally. A further limitation is the rate of attrition in the sample. While attrition is to be expected in a longitudinal study, results between waves should be interpreted in light of this.
The results suggest the SF-36 is not as sound as previously suggested. It can be delivered as eight subscales and future research may apply the RMM to each subscale to evaluate the efficacy of the measure in this form. Based on the RMM findings in the current study, future research should further evaluate this measure using IRT methods. Results suggest multiple items need to be reassessed to avoid degrading the model and to improve performance of the SF-36 as a reliable measure of health-related quality of life.

Conclusions
Previous evaluations of the SF-36 have relied on cross-sectional data; however, the findings of the current study demonstrate the longitudinal efficacy of the measure. While use of the measure remained consistent across time for both the whole measure and summary scales, several issues were identified. Previous studies evaluating the SF-36 using CTT methods describe the measure as reliable and valid. However, evaluating the measure by application of the RMM indicated issues with internal consistency, generalisability, and sensitivity when the measure was evaluated as a whole and as both physical and mental health summary scales.

Data Availability
The survey data used to support the findings of this study were supplied by the Data Access Committee of the Australian Longitudinal Study on Women's Health by formal request. Requests for access to these data should be made to the Data Access Committee of the Australian Longitudinal Study on Women's Health.

Disclosure
This research was performed as part of the employment of the authors.