Background

Endometriosis is a common, chronic gynecological disease among women of reproductive age. It is defined by the growth of endometrium-like tissue outside the uterine cavity, including on the ovaries and other pelvic structures [1]. The condition is associated with a variety of symptoms; the main clinical symptoms are dysmenorrhea (painful menstruation), dyspareunia (painful intercourse), dyschezia (painful bowel movements), lower back pain, and chronic pelvic pain [1–6]. Chronic pelvic pain has been suggested to be the most important clinical factor of endometriosis [7] and is commonly reported among women with the condition. Moreover, endometriosis is a progressive disease that worsens over time [8].

Among gynecological conditions, endometriosis is the third leading cause of hospitalization in the United States [9]. The exact prevalence is unknown because endometriosis can only be definitively diagnosed during pelvic surgery, usually laparoscopy or laparotomy; therefore, most prevalence estimates are based on surgical populations [10]. Estimates vary widely [11], but the disease is generally estimated to occur in 5–10 % of women in the general population [2, 10–15]. In women with pelvic pain, the prevalence is estimated to be three or more times higher [2, 8, 16].

In addition to clinical symptoms, women with endometriosis experience a range of non-clinical symptoms. Depression and isolation are feelings often experienced. Women with endometriosis report worse emotional well-being than women with a primary diagnosis of depression, hypertension, diabetes mellitus, heart disease, and arthritis [17]. Problems with sex life and relationships are also common [17, 18]. Women with endometriosis have reported having less intercourse and more frequent interruption of intercourse due to pain [4]. Additionally, women with endometriosis have difficulty in fulfilling work and social commitments [19] and often report fatigue or lack of energy [6, 20].

The existence of endometriosis-associated symptoms has an adverse impact on physical, mental, and social well-being and therefore a negative effect on health-related quality of life (HRQOL) [19, 21–24]. This impact is additionally magnified by the degree of severity of the condition; more severe cases are associated with greater reduction of HRQOL [18, 25].

Treatments aim to alleviate or significantly reduce pain, thereby reducing the burden of the illness. For chronic pain, the most important measures of treatment response and reduction in illness burden involve patient-reported outcomes (PROs) because the patient is the most important judge of whether changes are important or meaningful [26, 27]. Clinical trials of endometriosis treatment have reported significant improvement in HRQOL assessed using PRO measures following treatment [28–36]. Disease-specific PRO measures, such as the Endometriosis Health Profile-30 [37], have been developed and used as measures of treatment efficacy. In addition, generic HRQOL PRO measures are used in studies of endometriosis, with the Medical Outcomes Study Short Form 36 (SF-36) being one of the most common [22].

Although disease-specific instruments are more sensitive to disease experiences than generic instruments [38], the SF-36 has the advantage of allowing comparisons across diseases and between patients’ scores and those of the general public. This information is useful in establishing a thorough understanding of disease impact in relation to other conditions and healthy individuals. The SF-36 has been found to be responsive to change in health status in women receiving treatment for endometriosis [39] but has not been validated specifically for this condition.

The purpose of this study is to evaluate the validity of the SF-36 in endometriosis, using data from two clinical trials. A secondary objective is to examine the responsiveness and minimally important difference (MID) of the SF-36 in patients with endometriosis. Use of the SF-36 in endometriosis offers at least two advantages over disease-specific measures for this condition or its symptoms. First, as a generic health measure, it allows the HRQOL of women with endometriosis to be compared with that of people with other diseases. Second, generic health measures tend to be less sensitive to the disease experience than disease-specific measures [38]. Thus, to the extent that the SF-36 detects improvements resulting from treatment, this would be stronger evidence of a treatment effect.

Methods

Data

Data came from two phase III studies of a treatment for endometriosis-related symptoms. Study A is a 24-week, multicenter, open-label, randomized, parallel-group, non-inferiority study investigating the efficacy and safety of daily oral administration of 2 mg dienogest versus intramuscular administration of 3.75 mg leuprorelin acetate every 4 weeks for the treatment of symptomatic endometriosis in 248 subjects with endometriosis [40]. Study B is a 12-week, double-blind, randomized, placebo-controlled, parallel-group study designed to investigate the efficacy and safety of daily oral administration of 2 mg dienogest versus placebo for pelvic pain in 198 subjects with endometriosis [41].

Measures

Data from three PRO measures and two clinician-completed measures were collected in both trials. Two of the PROs and both clinician-completed measures were used to validate the SF-36. The three PROs are described first below followed by the descriptions of the clinician-completed instruments.

Medical Outcomes Study Short Form 36

The SF-36 is one of the most widely used generic measures of health [2] and is commonly used in studies of common gynecological conditions, including endometriosis [22]. The SF-36 is a self-administered, generic health status questionnaire that measures 8 health concepts [42, 43]: “physical functioning (PF), role limitations due to physical problems (RP), bodily pain (BP), general health perception (GH), vitality (VT), social functioning (SF), role limitations due to emotional problems (RE), and mental health (MH).” The typical factor structure of the SF-36 hypothesizes that PF, RP, BP, and GH are subscales of the physical component, while RE, VT, MH, and SF are subscales of the mental component.

Scores can be calculated for each domain or as Physical and Mental Component Summary scores (PCS and MCS) [43]. Scores are generally transformed to a range from 0 to 100 for the 8 subscales; the two component summary scores are norm-based, standardized to a mean of 50.0 and an SD of 10.0. For all subscales and both components, a higher score indicates better health status. In this study, version 2 of the SF-36 was used.
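The two transformations described above can be illustrated with a minimal sketch. It shows only the generic arithmetic (linear rescaling to 0 to 100, and standardization to mean 50 and SD 10); the function names, the raw score range, and the norm values are hypothetical placeholders, not the licensed SF-36 scoring algorithm or published norms.

```python
def to_0_100(raw, lowest, highest):
    """Linearly rescale a raw subscale score to the 0-100 range."""
    return (raw - lowest) / (highest - lowest) * 100.0


def to_norm_based(score, norm_mean, norm_sd):
    """Standardize a score so the reference population has mean 50 and SD 10."""
    z = (score - norm_mean) / norm_sd
    return 50.0 + 10.0 * z


# Hypothetical subscale with a raw range of 10 to 30 and a raw score of 25.
rescaled = to_0_100(25, lowest=10, highest=30)

# norm_mean and norm_sd are illustrative placeholders, not published norms.
norm_based = to_norm_based(rescaled, norm_mean=80.0, norm_sd=20.0)
```

With these made-up inputs, the rescaled score sits three quarters of the way up the raw range, and the norm-based score falls a quarter of an SD below the reference mean.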

The pelvic pain visual analog scale

As pain is the most dominant symptom of endometriosis, patients indicated their endometriosis-associated pelvic pain on a 100 mm visual analog scale (VAS). The ends of the VAS were anchored with the descriptions (0) “absence of pain” to (100) “unbearable pain.”

Patient satisfaction with treatment

Only patients in Study B rated their satisfaction with treatment (very much satisfied, much satisfied, minimally satisfied, neither satisfied nor dissatisfied, minimally dissatisfied, much dissatisfied, very much dissatisfied). This was used to assess the extent to which changes in the SF-36 subscales and components show differences for varying levels of treatment satisfaction.

The Biberoglu and Behrman severity profile

The Biberoglu and Behrman scale (B&B) [44] is a physician-completed questionnaire based on patient interview referring to the previous 4 weeks. The B&B evaluates three cardinal symptoms reported by endometriosis patients: dysmenorrhea, dyspareunia, and pelvic discomfort/pain. Each symptom has four possible intensities (0 = none, 1 = mild, 2 = moderate, and 3 = severe) based on the patient’s self-assessment of pain and the gynecological palpation by the attending physician. A summary score on these three items (0 = none, 1–3 = mild, 4–6 = moderate, and 7–9 = severe) is calculated. Physicians also rate 2 items on the same 0–3 scale that evaluate physical signs of endometriosis: pelvic tenderness and induration, yielding a summary score from 0 (none) to 5–6 (severe). A total symptom severity score is calculated by summing the pain/discomfort and physical signs scales.
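As an illustration of the pain/discomfort summary score described above, the following sketch sums the three symptom ratings and maps the total to the stated severity bands. The function name and input checks are assumptions made for this example; the physical signs score is not included because only its end points are given in the text.

```python
def bb_pain_summary(dysmenorrhea, dyspareunia, pelvic_pain):
    """Sum three 0-3 symptom ratings and label the total severity band."""
    items = (dysmenorrhea, dyspareunia, pelvic_pain)
    if any(item not in (0, 1, 2, 3) for item in items):
        raise ValueError("each symptom must be rated 0, 1, 2, or 3")
    total = sum(items)
    if total == 0:
        label = "none"
    elif total <= 3:
        label = "mild"
    elif total <= 6:
        label = "moderate"
    else:
        label = "severe"
    return total, label


# Hypothetical patient: moderate dysmenorrhea, mild dyspareunia, severe pelvic pain.
total, label = bb_pain_summary(2, 1, 3)
```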

Clinical global impressions of change

At the end-of-study visit, only in Study B, the investigator assessed each patient’s improvement relative to symptoms at baseline on the clinical global impressions of change (CGI-C) [45], a 7-point scale: 1 = “Very much improved,” 2 = “Much improved,” 3 = “Minimally improved,” 4 = “No change,” 5 = “Minimally worse,” 6 = “Much worse,” 7 = “Very much worse.” CGI-C was administered at week 12 in the placebo-controlled study.

Assessment points

The SF-36 was completed at baseline and end of study (week 24 for Study A; week 12 for Study B). The pelvic pain VAS was completed at baseline and every 4 weeks in both studies. The B&B was completed at baseline and week 12 for both studies, and week 24 for Study A. Finally, the CGI-C and patient satisfaction with treatment were completed at week 12 for Study B only.

Analyses

As the factor structure of the SF-36 is generally well established and because sample sizes for the two trials were relatively small, analyses began with confirmatory factor analyses (CFA). A confirmatory factor analysis of the SF-36 was first conducted on Study A at baseline. Once a satisfactory measurement model was obtained, confirmatory analyses were conducted using baseline data from Study B to see whether a comparable factor structure was supported. The remaining psychometric analyses were conducted on both trial datasets separately based on the results of the factor structure from the CFA.

Confirmatory factor analysis

Confirmatory factor analyses using structural equation modelling were conducted to confirm the measurement model and fit of subscales within the hypothesized structure of the SF-36. The analyses assessed the fit of an 8-factor and 2-summary-score solution as specified in the SF-36 standard scoring manual [46]. Because confirmatory analyses require relatively large samples, with requirements increasing as models become more complex [47], and the samples in each trial were relatively small (Study A = 252 and Study B = 198), the analyses were performed at the level of the subscales and components rather than the items, using total scores for each subscale. Specifically, the factors of physical functioning, role physical, bodily pain, and general health were hypothesized as subscales of the Physical Component Summary (PCS), and the factors of role emotional, vitality, mental health, and social functioning were hypothesized as subscales of the Mental Component Summary (MCS) [46]. Overall model fit was assessed and factor loadings were evaluated for acceptable magnitude (factor loadings of at least 0.40 are conventionally considered acceptable).

Adequacy of fit was assessed using several fit indices: Comparative Fit Index (CFI), standardized root mean square residual (SRMR), and root mean square error of approximation (RMSEA) [47, 48]. In addition, modification indices were examined for any anomalous results (e.g., correlated errors, secondary loadings that were not explicitly modelled).

In the context of structural equation modelling, several fit statistics provide information about the adequacy of the model to explain the data [47]. In general, a model explains the data well if the CFI, which compares the fit of the hypothesized model with that of a null model, is 0.9 or better, though there is some disagreement about whether 0.9 or 0.95 should be the lower threshold for the CFI [48]. The SRMR measures the mean absolute difference between observed and model-implied correlations; values of <0.1 are considered acceptable [48]. As such, the SRMR is a measure of “badness of fit,” as a larger value represents a larger discrepancy between the hypothesized model and the data. Finally, the RMSEA is also a measure of “badness of fit,” assessing the discrepancy between the predicted and observed data per degree of freedom; values <0.08 are considered acceptable [49]. The 90 % confidence interval (CI) for the RMSEA should be narrow, giving additional confidence in the estimate. Once the model had been run and acceptable fit was achieved using baseline data from Study A, the model was confirmed using baseline data from Study B.
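To make the fit criteria concrete, the sketch below computes the CFI and RMSEA from chi-square statistics using common textbook formulas and then checks the three thresholds quoted above. The source does not state which software or exact formulas were used, so the formulas, the function names, and all numeric inputs here are assumptions for illustration only; they are not results from either trial.

```python
import math


def cfi(chi2_model, df_model, chi2_null, df_null):
    """Comparative Fit Index from model and null-model chi-square values."""
    d_model = max(chi2_model - df_model, 0.0)
    d_null = max(chi2_null - df_null, d_model)
    if d_null == 0:
        return 1.0
    return 1.0 - d_model / d_null


def rmsea(chi2_model, df_model, n):
    """RMSEA point estimate (some software divides by n rather than n - 1)."""
    return math.sqrt(max(chi2_model - df_model, 0.0) / (df_model * (n - 1)))


def acceptable_fit(cfi_value, srmr_value, rmsea_value):
    """Apply the thresholds quoted in the text to each index."""
    return {
        "CFI": cfi_value >= 0.90,
        "SRMR": srmr_value < 0.10,
        "RMSEA": rmsea_value < 0.08,
    }


# Made-up chi-square values for illustration; not taken from the study.
example_cfi = cfi(chi2_model=76.0, df_model=19, chi2_null=740.0, df_null=28)
example_rmsea = rmsea(chi2_model=76.0, df_model=19, n=248)
flags = acceptable_fit(example_cfi, srmr_value=0.05, rmsea_value=example_rmsea)
```

With these invented inputs the CFI and SRMR pass their thresholds while the RMSEA does not, which is the kind of mixed pattern the thresholds are meant to expose.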

Internal consistency reliability

Once the factor structure of the SF-36 was confirmed, internal consistency was assessed (Cronbach’s alpha; standardized items are reported, though the results for unstandardized items were identical to the third decimal place) for each subscale first using baseline data from Study A and then with baseline data from Study B.

Test–retest reliability was not performed due to the relatively long lags between SF-36 assessments (Study A: 24 weeks; Study B 12 weeks).

The internal consistency reliability was assessed using Cronbach’s formula for coefficient alpha:

$$\alpha = \frac{{N \cdot \bar{c}}}{{\left( {\bar{\upsilon } + \left( {N - 1} \right) \cdot \bar{c}} \right)}}$$

where N is the number of components (items or tests), \(\bar{\upsilon }\) equals the average variance, and \(\bar{c}\) is the average of all covariances between the components. In addition, the item-rest correlation (i.e., the multiple correlation coefficient “R” for each item, having regressed each item on the remaining items in the scale) was examined to see whether any items are less correlated with the remaining items.

The standardized alpha was presented. This was based on standardized scores (mean = 0 and standard deviation = 1) for each of the items. There are no tests of statistical significance for alpha; the values are presented descriptively on an interval level scale from 0 to 1.0, with higher scores indicating a more reliable (precise) instrument. The target Cronbach’s standardized alpha is at least 0.70, though patterns of item-to-item correlations and item-to-total correlations are also important, as are the number of items in the subscale. Moreover, an alpha that is too high (e.g., approaching 1.0) can indicate a set of items that are likely to be redundant, so this is not optimal.
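The alpha formula above can be written out directly in terms of the average item variance and the average inter-item covariance. The following is a minimal sketch using only the Python standard library; the function names and the small response matrix are hypothetical and are not the study's data or analysis code. The standardized alpha is obtained here by z-scoring each item before applying the same formula.

```python
from itertools import combinations
from statistics import mean, stdev, variance


def covariance(x, y):
    """Sample covariance between two equally long score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def cronbach_alpha(items):
    """Alpha = N * c_bar / (v_bar + (N - 1) * c_bar); items is a list of columns."""
    n = len(items)
    v_bar = mean([variance(col) for col in items])
    c_bar = mean([covariance(a, b) for a, b in combinations(items, 2)])
    return n * c_bar / (v_bar + (n - 1) * c_bar)


def standardized_alpha(items):
    """Alpha computed after z-scoring each item (mean 0, SD 1)."""
    z_items = [[(x - mean(col)) / stdev(col) for x in col] for col in items]
    return cronbach_alpha(z_items)


# Hypothetical responses: three items answered by five respondents.
responses = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha_raw = cronbach_alpha(responses)
alpha_std = standardized_alpha(responses)
```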

Construct validity

Construct validity, the extent to which the instrument measures what it is intended to measure, was evaluated in a variety of ways. Specifically, SF-36 subscale and component scores were correlated with the pelvic pain VAS item (at baseline and end of study for both studies), B&B (pelvic discomfort and pain and total score; at baseline and end of study for both studies), and patient treatment satisfaction rating (at week 12 in Study B). Spearman correlation coefficients were used to evaluate these relationships.
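For readers unfamiliar with the Spearman coefficient, the sketch below computes it as the Pearson correlation of ranks, with tied values given their average rank. It is a standard-library illustration with hypothetical helper names and invented scores, not the analysis code used in the study.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores: higher pain VAS tends to go with lower bodily pain (BP) scores.
vas = [10, 35, 60, 80]
bp = [90, 70, 40, 45]
rho = spearman(vas, bp)
```

A negative coefficient is the expected direction here, because higher VAS values mean more pain while higher SF-36 scores mean better health.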

Known groups/discriminant validity

Known groups/discriminant validity was assessed through the ability of the SF-36 subscale and component scores to discriminate between groups of patients according to the levels of symptom severity, based on the B&B symptom severity using analysis of variance (ANOVA) with Scheffe’s post hoc comparisons. Mean differences between four symptom severity groups at baseline were compared to assess the relationship between SF-36 scores and symptom severity item scores at baseline for both studies. Subjects were stratified depending on their symptom severity item scores. The groups were 0 (none), 1 (mild), 2 (moderate), and 3 (severe).

A similar ANOVA strategy evaluated differences in mean SF-36 subscale and component scores by VAS pain severity groups. Quartiles of VAS pain severity groups were created after examination of descriptive statistics, and Scheffe’s post hoc comparisons of mean SF-36 scores between quartiles were carried out.
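The one-way ANOVA underlying these group comparisons can be sketched as follows. The example computes only the F statistic and its degrees of freedom from group scores; p-values and Scheffe's post hoc comparisons are omitted because they need F-distribution quantiles that a statistics package would normally supply. The function name and the data are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Invented SF-36 scores for three severity groups (e.g., mild, moderate, severe).
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```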

Finally, for Study B, the mean change in SF-36 was compared for different values of the CGI-C. These ANOVAs indicate whether patients whom the clinician rated as “Very much improved” had significantly higher mean scores on the SF-36 subscales and components than those with clinician ratings of less improvement.

Responsiveness and minimal important difference

To evaluate responsiveness of the SF-36 subscale and component scores, correlations were computed between changes in the SF-36 and changes in the pain VAS for Study A, and between changes in the SF-36 with changes in the pain VAS and the CGI-C for Study B.

Two methods—a priori and data-based—were used to establish change thresholds for assessing the relationship between minimal change in pain and the corresponding change in the SF-36 bodily pain subscale and the PCS. First, we used as a priori thresholds those suggested by Farrar et al. [50] to anchor important changes in pain using a 0–10 numerical rating scale. Farrar et al. [50] found that changes of 1–2 points were considered small but important to patients. Applying this finding to the 0–100 (“absence of pain” to “unbearable pain”) VAS scale, those with a 10- to 29-point change toward the “0” end on the VAS scale were considered as having a small but important change between baseline and end of study, while VAS reductions of 30 points or more were considered moderate to large improvements. Therefore, VAS improvements of 10–29 points represent a “responder,” and changes in the VAS of less than 10 points in either direction (i.e., ±9 points) were considered the stable group (“non-responder”).

Changes in VAS scores were grouped into 5 change categories:

  • Decrease of at least 30 mm (very much improved)

  • Decrease between 10 and 29 mm (minimally improved)

  • Decrease of 9 mm up to an increase of 9 mm (no change)

  • Increase between 10 and 29 mm (worse)

  • Increase of at least 30 mm (very much worse).
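The five change categories listed above can be expressed as a small classification function. This is a sketch under the assumption that VAS scores are recorded in whole millimetres; the function name is hypothetical.

```python
def vas_change_category(baseline_mm, end_mm):
    """Classify the change in pain VAS (end minus baseline, in mm)."""
    change = end_mm - baseline_mm  # negative values mean less pain
    if change <= -30:
        return "very much improved"
    if change <= -10:
        return "minimally improved"
    if change < 10:
        return "no change"
    if change < 30:
        return "worse"
    return "very much worse"


# Hypothetical patient whose pain VAS fell from 62 mm to 45 mm.
category = vas_change_category(62, 45)
```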

The second approach established change thresholds from the distributions of change observed in each study rather than from a priori thresholds: categories of “minimal change” and “no change” were defined from histograms of the change scores in the pain VAS. The resulting “minimal change” category was consistent with that noted above (a change of 10–30 points), while the “no change” group had a slightly larger range (−10 to 10).

A step-wise triangulation approach was used to establish an MID for the SF-36 subscales. First, distribution-based approaches were used to evaluate MID for Study A and then for Study B. An anchor-based method using the CGI-C measure from Study B was used to confirm an MID. Another way of exploring the MID is to use receiver operating characteristic (ROC) curves to examine the sensitivity and specificity of different cut points in SF-36 change scores for distinguishing patients who improved from those who showed no change over the trial. The final cut point is one that strikes a balance between sensitivity and specificity, and correctly identifies the greatest proportion of patients with detectable improvement without incorrectly identifying patients as having improvement when in fact they did not. Two different ROC curves were computed based on the pain VAS categories of change noted above. In the first analysis, the a priori category of “minimally improved” adapted from Farrar et al. [50] was compared with that of “no change.” In a second analysis, the data-derived categories of “minimally improved” and “no change” were compared.
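The ROC-based search for a cut point can be sketched as a scan over candidate thresholds in SF-36 change scores. The source does not specify the exact rule used to balance sensitivity and specificity, so this example assumes one common choice, maximizing their sum (Youden's index); the function name and the change scores are invented for illustration.

```python
def best_cut_point(improved, stable):
    """Scan cut points; a patient is flagged as improved if change >= cut."""
    best = None
    for cut in sorted(set(improved) | set(stable)):
        true_pos = sum(x >= cut for x in improved)
        true_neg = sum(x < cut for x in stable)
        sensitivity = true_pos / len(improved)
        specificity = true_neg / len(stable)
        youden = sensitivity + specificity - 1
        if best is None or youden > best["youden"]:
            best = {
                "cut": cut,
                "sensitivity": sensitivity,
                "specificity": specificity,
                "youden": youden,
                "correct": (true_pos + true_neg) / (len(improved) + len(stable)),
            }
    return best


# Invented SF-36 change scores for the two pain VAS change groups.
minimally_improved = [12, 15, 20, 22, 8]
no_change = [0, 5, -3, 10, 2, 9]
result = best_cut_point(minimally_improved, no_change)
```

The "correct" entry is the overall proportion of patients classified correctly at the chosen cut point, which corresponds to the percentages of correctly classified cases reported in the results.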

Results

Table 1 presents the baseline patient characteristics for Studies A and B.

Table 1 Patient characteristics at baseline

Confirmatory factor analysis

The model fit statistics of the CFA for both trials are presented in Table 2. The factor loadings for both trials and correlations between the PCS and MCS are presented in Fig. 1. The CFI was 0.92 and 0.91 for Studies A and B, respectively, between the recommended thresholds of 0.9 and 0.95. The SRMR was below the threshold deemed acceptable for both of the studies, further confirming the hypothesized factor structure, that is, the mean differences between the data-derived correlations and those implied by the model were trivial. However, the reported RMSEA values were outside of the acceptable range, especially for Study B where the 90 % CI was entirely above the recommended threshold of 0.08. It is possible, however, for the RMSEA to be unacceptably high in simpler models, such as those analyzed here [51]. In this case, both the CFI and SRMR indicate acceptable fit and the RMSEA can be ignored. Also, as shown in Fig. 1, all factor loadings were above an acceptable threshold of 0.40.

Table 2 Confirmatory factor analysis model fit statistics

Fig. 1 Confirmatory factor analysis factor loadings (standardized)

Internal consistency reliability

The results of this part of the analysis are presented in Table 3. Although the confirmatory factor analyses needed to be performed at the level of subscales and components, the internal consistency reliabilities could be calculated for the items within each subscale. In general, internal consistency reliability of the subscales was acceptable, with alpha above the generally accepted reliability threshold of 0.70. The two scales that were closest to this threshold were general health for Study A (alpha = 0.73) and role physical for Study B (alpha = 0.75). The “alpha-if-deleted” changed little for each of the eight subscales, suggesting a high degree of internal consistency for each subscale. The one notable exception was item 5b (“Accomplished less than you would like”) for role emotional. This was the case for both trials. Standardized and unstandardized values were calculated, but were negligibly different (at the third decimal place).

Table 3 Reliabilities (Cronbach’s alpha) of SF-36 subscales and components

Construct validity

Construct validity was assessed by correlations between the SF-36 subscales and components and the pain VAS and B&B pelvic discomfort and pain scores. Results of construct validity analyses with the pain VAS for both trials are presented in Table 4. In Study A, five SF-36 subscales (PF, RP, BP, VT, and MH) and one component (PCS) were statistically significantly correlated with the pain VAS at baseline. At end of study, all subscales and both components were statistically significantly related to the pain VAS. For Study B, a similar, though slightly more compelling, set of results emerged. Both SF-36 components and all subscales, except GH, were statistically significantly related to the pain VAS at baseline. At end of study, like Study A, all subscales and both components were statistically significantly related to the pain VAS, though the correlations for Study B were generally larger except for MH. Of particular note is that the correlation of BP with the pain VAS was moderate [52] for both studies at baseline and end of study. The PCS was weakly correlated with the pain VAS for Study A and Study B at baseline; at end of study, it was moderately correlated in both studies (r = −.41 and −.44). Other dimensions show only a weak or sometimes very weak relationship.

Table 4 Pearson correlations between SF-36 subscale and component scores and pain VAS, baseline and end of study

Spearman correlations between SF-36 subscales and components and the B&B pelvic discomfort and pain exhibited a similar pattern of correlations at baseline and end of study for both studies (results not shown). Correlations tended to be larger at end of study than at baseline and for Study B compared with Study A. Not surprisingly, BP had the strongest correlation of the subscales with the B&B pelvic discomfort and pain; the PCS had a slightly weaker correlation with the B&B pelvic discomfort and pain.

Known groups/discriminant validity

Mean differences in SF-36 subscale and component scores were compared by level of symptom severity on the B&B symptom severity (none, mild, moderate, severe) using ANOVA with Scheffe’s post hoc comparisons. Results of these analyses are presented in Table 5 for baseline and end of study for both studies (for details by B&B symptom severity, see Appendix Table 9). For Study A at baseline, with the exception of RE and the MCS, all SF-36 subscales and the PCS were significantly associated with levels of symptom severity. At end of study for Study A, all SF-36 subscales and components, except GH, MH, SF, and the MCS, were significantly associated with levels of symptom severity when comparing pelvic pain severity groups: patients with lower B&B symptom severity scores (i.e., less severe) had better mean SF-36 subscale and component scores. The association was particularly strong for the bodily pain SF-36 score and the PCS.

Table 5 Discriminant validity of the SF-36 scores: ANOVA by Biberoglu & Behrman symptom severity level at baseline and end of study

Similar, though somewhat less robust, results were seen for Study B at baseline. Mean scores on PF, RP, BP, VT, SF, and the PCS varied significantly by symptom severity level of the B&B. At end of study, however, mean scores for every SF-36 subscale and component varied significantly by B&B severity level.

Responsiveness and minimally important difference of the SF-36

Responsiveness of the SF-36 subscales and components was evaluated by examining relationships between changes in the SF-36 and changes in the pain VAS and, for Study B, categories of CGI-C and patient satisfaction with treatment. The scoring on the SF-36 change variables is such that a lower or negative score indicates that the respondent got worse (i.e., their end-of-study score was lower/worse than their baseline score). Conversely, for the change in the pain VAS score, a lower or negative score represents an improvement (i.e., their end-of-study score was lower/better than their baseline score). Table 6 presents correlations between changes in SF-36 subscales and components and changes in the pain VAS and summaries of ANOVA F tests for comparisons of the mean changes in SF-36 with categorical changes in the pain VAS (very much improved, improved, no change, worse, very much worse). It was hypothesized that those who reported improvement in pain should also report improvements in their SF-36 scores, especially the BP score.

Table 6 ANOVAs assessing mean change in SF-36 by mean change in pelvic pain VAS from baseline to end of study and correlations between changes in SF-36 scores and changes in pain VAS from baseline to end of study

For both trials, correlations between changes in SF-36 and changes in the pain VAS indicated that decrements in pain VAS scores (i.e., lessening pain) were correlated with improvements in SF-36 subscale and component scores (i.e., greater SF-36 scores). This was particularly notable for the BP subscale and the PCS. These results are reflected in the negative correlations seen in the first two columns of Table 6.

For Study A, those whose mean pain VAS scores improved from baseline to end of study had significantly higher mean improvement in the PCS and all SF-36 subscales, except for GH, MH, and SF. Bodily pain and PCS exhibited a particularly strong and statistically significant relationship. For Study B, those whose mean pain VAS scores improved from baseline to end of study had significantly higher mean improvement in all SF-36 subscales and components, except for MH and the MCS.

Improvement based on the CGI-C and patient satisfaction with treatment in Study B was associated with improvement in the SF-36 for several subscales and the PCS. Specifically for the CGI-C, the SF-36 subscales of BP, GH, RE, and VT all had significantly higher means for patients whose clinicians indicated that they had greater improvement in their symptoms since baseline. For patient satisfaction with treatment, mean scores for RP, BP, GH, and PCS were greater for those who had greater satisfaction with treatment for their condition.

Minimally Important Differences analyses

Study A

Table 7 presents the results of the MID analyses. The results suggest widely varying MIDs for the SF-36 subscales and components, ranging from about 4 to over 20 for Study A and from under 4 to 20 for Study B. Given the central role that pain plays in endometriosis, the BP subscale and the PCS (of which BP is a component) will be the focus of detailed results. As seen in Table 7, half of the standard deviation of the change in BP is 15. This is larger than the standard error of measurement (SEM) (10.4). The SEM describes the error associated with the measure. Wyrwich has shown that this approach closely mirrors results using an approach based on patient global assessment of change [2, 38]. Moreover, these are associated with a substantial effect size (ES) of 1.43, suggesting that a change of this size is meaningful.
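The distribution-based quantities referred to above (half the standard deviation of change, the SEM, and the ES) can be sketched as follows. The text does not give the formulas used, so this example assumes commonly used definitions: SEM as the baseline SD multiplied by the square root of one minus the reliability, and ES as the mean change divided by the baseline SD. The function names and inputs are hypothetical and do not reproduce the values in Table 7.

```python
import math
from statistics import mean, stdev


def half_sd_of_change(changes):
    """Half of the sample standard deviation of the change scores."""
    return 0.5 * stdev(changes)


def standard_error_of_measurement(baseline_sd, reliability):
    """Assumed definition: baseline SD times sqrt(1 - reliability)."""
    return baseline_sd * math.sqrt(1.0 - reliability)


def effect_size(changes, baseline_sd):
    """Assumed definition: mean change divided by the baseline SD."""
    return mean(changes) / baseline_sd


# Invented change scores, baseline SD, and reliability (e.g., Cronbach's alpha).
changes = [10, 20, 30]
half_sd = half_sd_of_change(changes)
sem = standard_error_of_measurement(baseline_sd=20.0, reliability=0.75)
es = effect_size(changes, baseline_sd=20.0)
```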

Table 7 Results of minimally important difference

Receiver operating characteristic curves were calculated to compare those who showed minimal change on the pain VAS versus those who did not change, using the cut points adapted from Farrar et al. [50] (see Table 8). The results of the ROC curves (not presented) suggest that a score between 16 and 21 represents a balance between sensitivity and specificity, correctly classifying 73 % of cases. The second method of setting thresholds of change (using distributions of change based on the data rather than the a priori thresholds suggested by Farrar et al. [50]) suggested that a score of 21 represents a balance between sensitivity and specificity, correctly classifying 70 % of cases (detailed results not presented).

For the PCS, half of the standard deviation of change is 4.6 which is also the value for the SEM (see Table 8). This corresponds to a large ES of 0.97. The score from the ROC curves (using the Farrar et al. [50] method) that balances sensitivity and specificity is 3.7 and correctly classifies 61 % of cases. The score from the ROC curves (using the alternative method for establishing change categories) that balances sensitivity and specificity is 3.8 and correctly classifies 61 % of cases (results not presented).

Table 8 Summary of results from minimally important difference analyses

Study B

For BP, half of the standard deviation of the change is 11 (see Table 8). This is larger than the SEM (7.2), and these values correspond to an ES of 0.99. The results of the ROC curves based on the pre-defined cut points suggested by Farrar et al. [50] suggest that a score of 10 represents a balance between sensitivity and specificity, correctly classifying 63 % of cases (results not presented). Using pain VAS cut points based on the data in the study (alternative method), a score of 9 represents a balance between sensitivity and specificity, correctly classifying 63 % of cases (results not tabled).

For the PCS, half of the standard deviation of change is 3.5 while the value for the SEM is 3.1. This corresponds to an effect size of 0.71. The score from the ROC curves (Farrar et al. [50] method) that balances sensitivity and specificity is 2.9 and correctly classifies 61 % of cases (results not presented). ROC curves using the alternative method for establishing thresholds of change suggest that a score of 3 balances sensitivity and specificity and correctly classifies 61 % of cases (results not tabled).

Using the anchor-based approach (CGI-C) for Study B, the difference between patients rated “minimally improved” and those rated “no change” in their condition corresponded to a BP change of 10.7 and a mean improvement in PCS of 4 (see Table 8).

Summary of MID results

A summary of the results from the MID analyses is presented in Table 8. The results suggest some triangulation on an MID for both the BP subscale and the PCS, although there was more variability in a possible MID for bodily pain for Study A. For example, a possible MID ranged from 10.4 (SEM) to 21 (ROC curves). A score of around 15–16 seems to fall in the middle of this range for a minimally important change from a patient’s perspective for Study A. For Study B, there was much more consistency in the possible MID values for bodily pain. A score of 11 is a likely value for a minimally important change from the patients’ perspective, based on the half standard deviation of the change, the SEM. The ROC curves suggest a score of 9–10, which is close to the value suggested by the other approaches. Thus, based on these two studies, it appears that a change in the bodily pain subscale between 11 and 16 represents a meaningful change to patients.

Results for the PCS are a little tighter and generally more consistent across the two trials than for the bodily pain subscale. A possible MID ranged from 2.9–3.0 (ROC curves) to 4.6 (half standard deviation of change and SEM). The ROC curves for Study A yielded a value of 3.7–3.8; half standard deviation of change for Study B resulted in a value of 3.5; the anchor-based results using the CGI-C resulted in a value of 4.1. Therefore, it is likely that a change in the PCS in the range of 3.7–3.8 is a meaningful change to patients.
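The "middle of the range" reasoning used in this summary can be checked with simple arithmetic. The Python sketch below uses only the range endpoints quoted in the text (10.4 and 21 for bodily pain in Study A; 2.9 and 4.6 for the PCS) and computes their midpoints. Taking the midpoint is an illustrative assumption, not a procedure the study states it applied.

```python
def midpoint(low, high):
    """Midpoint of a range of candidate MID values."""
    return (low + high) / 2


# Range endpoints quoted in the summary of MID results.
bp_study_a = (10.4, 21.0)  # SEM to ROC curves, bodily pain, Study A
pcs_range = (2.9, 4.6)     # ROC curves to half SD of change and SEM, PCS

bp_mid = midpoint(*bp_study_a)
pcs_mid = midpoint(*pcs_range)

print(f"BP (Study A) midpoint: {bp_mid:.2f}")
print(f"PCS midpoint: {pcs_mid:.2f}")
```

Both midpoints fall within the values highlighted in the text: around 15–16 for bodily pain in Study A and 3.7–3.8 for the PCS.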

Discussion

The purpose of this study was to establish the psychometric validity and responsiveness of the SF-36 in endometriosis. A secondary goal was to determine the MID for SF-36 subscales and components. Establishing the psychometric properties and an initial MID for SF-36 is an important step in evaluating the effect of endometriosis on women’s HRQOL and the efficacy of treatments for this condition. That the results from two different trials—an active comparator trial and a placebo-controlled trial—were very similar lends confidence in the results and the robustness of conclusions.

The overall results of the psychometric analyses provide evidence of the validity of the SF-36 for this patient population. The factor structure, construct validity, internal consistency reliability, known groups/discriminant validity, and responsiveness indicate that the SF-36, especially the BP subscale and the PCS, is a valid, reliable, and responsive instrument for measuring HRQOL for women with endometriosis.

To establish the psychometric properties of the SF-36, two measures that are generally accepted as appropriate indicators of HRQOL for women with endometriosis, the pain VAS and the B&B, were used as comparator measures. Although correlations between the SF-36 and the pain VAS were somewhat mixed (some weak but significant, others moderate), the SF-36 performed in expected ways. Further, the results of the ANOVAs with the B&B were consistent with those of the correlations with the pain VAS. Women who reported more pain at baseline on the pain VAS and whose B&B scores were more severe were significantly more likely to have poorer scores on most of the SF-36 subscales, especially the BP and PCS.

Results were also favorable for the SF-36 as a measure that is responsive to change: Patients whose pain VAS scores improved also had improved mean SF-36 scores. Further, those whose pain VAS scores improved the most had the largest improvements in SF-36 scores.

Minimally important difference estimates from this study suggest that, based on the effect size, the BP subscale and the PCS are the two dimensions of the SF-36 that show a strong effect, supporting their ability to detect treatment effects or differences. MID estimates for the bodily pain subscale are in line with those of the developer [53]. For the PCS, MID estimates were close to those that have been published elsewhere, although these were in different indications [54, 55].

The consistency of results across two different trials—active comparator and placebo-controlled—demonstrated that the SF-36 has value in describing the experience of women with endometriosis. This instrument appears to be sensitive to changes in pain or discomfort and differences in effects of treatment. Not surprisingly, given that pain is the most prevalent symptom in endometriosis, BP and PCS, which includes the BP subscale, were especially sensitive to differences in experience and changes in condition.

Recently, using some of the same clinical trial data, Gerlinger and colleagues [56] reported that the minimal important difference (MID) of the pain VAS was 10 mm. This represents the lower threshold used in the present study based on Farrar et al. [50]. Thus, the MID values for the SF-36 reported here based on the Farrar et al. approach are likely to be similar to those that would be obtained if the Gerlinger et al. MID value were used.

No single method of establishing an MID is ideal or accepted and each one makes certain assumptions about change [57]. Consequently, researchers use multiple methods and triangulate on a value that is consistent or within a consistent range across the methods used. That was the case in the present study. As seen in Tables 7 and 8, there was general consistency in MID values across the two studies. Thus, while some may take issue with the use of the pain VAS as an anchor and the particular categorizing of the pain VAS, the results from using that anchor correspond reasonably well with the MID results from the other methods used, especially for Study B.

Although there is some debate about the factor structure of the SF-36, there is general consistency in the second-order factor structure (i.e., the subscales that load under the PCS and MCS) [58–60]. The results of the present study are in line with these findings.

That the SF-36, a generic measure of health, appears to be a valid measure for endometriosis and its treatment is advantageous in at least two ways. First, comparisons can be made with other diseases and with general populations, particularly since the PCS has been normed for many populations and diseases. Second, as a generic measure of health, it is likely to be less sensitive to condition-specific changes, yet the present findings indicate that the SF-36 can detect differences in patients’ conditions and changes in their conditions. This suggests that changes in the SF-36 in the context of a clinical trial on the order of the MID reported here are likely to be meaningful and real. This lends confidence in the SF-36 being a valid and responsive measure for endometriosis, and provides evidence that BP and the PCS are especially informative when evaluating the HRQOL impact on patients with diagnosed or suspected endometriosis.