Clinically important improvement thresholds for Harris Hip Score and its ability to predict revision risk after primary total hip arthroplasty

Background Some aspects of validity are missing for the Harris Hip Score (HHS). Our objective was to examine the clinically meaningful change thresholds, responsiveness and the predictive ability of the HHS questionnaire. Methods We included a cohort of patients who underwent primary total hip arthroplasty (THA) and responded to the HHS preoperatively and at 2- or 5-year post-THA (change score) to examine the clinically meaningful change thresholds (Minimal clinically important improvement, MCII; and moderate improvement), responsiveness (effect size (ES) and standardized response mean (SRM)) based on pre- to post-operative change and the predictive ability of change score or absolute postoperative score at 2- and 5-years post-THA for future revision. Results Two thousand six hundred sixty-seven patients with a mean age of 64 years completed baseline HHS; 1036 completed both baseline and 2-year HHS and 669 both baseline and 5-year HHS. MCII and moderate improvement thresholds ranged 15.9–18 points and 39.6–40.1 points, respectively. ES was 3.12 and 3.02 at 2- and 5-years; respective SRM was 2.73 and 2.52. There were 3195 hips with HHS scores at 2-years and 2699 hips with HHS scores at 5-years (regardless of the completion of baseline HHS; absolute postoperative scores). Compared to patients with absolute HHS scores of 81–100 (score range, 0–100), patients with scores <55 at 2- and 5-years had higher hazards (95 % confidence interval) of subsequent revision, 4.34 (2.14, 7.95; p < 0.001) and 3.08 (1.45, 5.84; p = 0.002), respectively. Compared to HHS score improvement of >50 points from preoperative to 2-years post-THA, lack of improvement/worsening or 1–20 point improvement were associated with increased hazards of revision, 18.10 (1.41, 234.83; p = 0.02); and 6.21 (0.81, 60.73; p = 0.10), respectively. Conclusions HHS is a valid measure of THA outcomes and is responsive to change. Both absolute HHS postoperative scores and HHS score change postoperatively are predictive of revision risk post-primary THA. We defined MCID and moderate improvement thresholds for HHS in this study.


Background
Total Hip Arthroplasty (THA) is the second most commonly performed arthroplasty in the U.S. In 2010, 438,000 THAs were done in the U.S. [1] and its utilization rate is increasing rapidly. U.S. Medicare spent $9 billion on implantable medical device procedures in 2009 [2].
The improvements in pain and function after THA are measured with instruments such as Harris Hip Score (HHS) [3]. HHS is the most commonly used instrument for the assessment of outcomes post-THA [4]. HHS is valid and reliable [5][6][7][8] and is often used as a reference/ gold standard for assessing the construct validity of other patient-reported outcome measures (PROs) for hip outcomes [9]. HHS is more responsive than the Western Ontario McMaster Osteoarthritis Index (WOMAC) [8] (a pain and function composite measure), short form-36 (SF-36) [8,10,11] (a generic health-related quality of life measure) and the walking speed [11] (an objective measure). HHS is joint-specific, measures hip outcomes and is widely available. A surgeon or a health professional usually completes HHS.
To our knowledge, despite its widespread use, there are no published data regarding what is a clinically important difference on the HHS or whether the HHS scores are predictive of the risk of future revision surgery. Defining clinically important thresholds for outcomes instruments such as HHS is critical, since it has direct clinical care and clinical trial relevance. For clinical care, this threshold could define what proportion of patients improved (by what matters to a patient, which is more relevant than a mean change for a population and/or a p-value for group means) with any new clinical initiative. For a clinical trial, thresholds would allow a comparative assessment of one intervention vs. another, and thus allow the design of clinical trials with an adequate sample size to differentiate two interventions [12]. If HHS can predict the risk of early revision surgery, future studies could assess its utility as a screening tool for implant failures after THA. The objective of this study was to define the clinically meaningful thresholds for improvement and assess the responsiveness of HHS and examine its predictive ability for early revision surgery in patients with primary THA.

Study participants and the HHS questionnaire
The Human Ethics Committee at the Mayo Clinic approved the study and research was carried out in compliance with the Helsinki Declaration. We received a waiver of written informed patient consent for this database study. Two study cohorts of patients who underwent primary THA at the Mayo Clinic between 1993 and 2005 were examined: (1) patients who completed HHS at baseline and at follow-up at either 2-or 5-year (cohort 1); (2) patients who completed follow-up HHS at either 2-or 5-year (cohort 2). All analyses were performed for the first cohort and in addition, we performed analysis of predictive ability of final postoperative HHS scores at 2-and 5-years for the latter cohort. HHS is a composite measure, with score ranging from 0 to 100, heavily weighted by pain and function; a higher score is better. It includes four domains: pain (1 item; 44 points), physical function (7 items; 47 points), deformity (5 items; 5 points), and range of motion (5 items; 4 points) [3].

Responsiveness and predictive ability
Minimal clinically important improvements (MCII) and moderate improvements were calculated by assessing mean change from baseline to 2-and 5-year follow-up in patients who reported "somewhat better now" or "much better now" in response to the global question, "Compared to before surgery, how is your hip?" at both 2-and 5-years, respectively. To assess responsiveness, we calculated effect size (ES) and standardized response mean (SRM). We calculated ES by dividing the change in hip score from baseline to 2-years (or 5-years) by the baseline standard deviation (preoperative; SD). According to the Cohen's rule, an ES of 0.20-0.49 represents a small change, 0.50-0.79 a medium change, and ≥ 0.80 a large change. The SRM is the mean change in the patient score divided by the SD of the changed scores [13]. These analyses were performed for the cohort with pre-operative and at least one post-operative follow-up, 2 or 5-year.
Descriptive statistics were reported as number (percentage) or mean (SD) as appropriate. We examined the associations of the final HHS score at 2-or 5-years or change in HHS score from baseline to 2-and 5-years, with the risk of subsequent revision THA, at ≥ 731 days post-surgery for 2-year and ≥ 1826 days for 5-year (Day 0 for all revision analyses, respectively). Final HHS scores were categorized into ≤ 55, 56-63, 64-71, 72-80 and 81-100, based on quintiles. Only the first hip in a patient was included for this study; patients with simultaneous bilateral THAs were excluded. We performed sensitivity analyses by using the traditional categorization of HHS: < 70, poor; 70-79, fair; 80-89, good; and 90-100 excellent. Improvement in HHS was categorized a priori as ≤ 0 (no improvement or worsening), 1-20, 21-50 vs. > 50, based on clinical judgment from an orthopedic surgeon (D.G.L), a co-author of the current study. Cox proportional hazards regression was used, reporting a hazard ratio and 95 % confidence intervals. Kaplan-Meier survival was used to estimate implant survival based on the absolute and change in HHS scores at 2-or 5-years. A p-value of less than 0.05 was considered significant. Since death is a competing risk, we also performed competing risk models adjusting for death.

Cohort characteristics
Two thousand six hundred sixty-seven patients had completed baseline HHS, of whom 1036 had completed both baseline and 2-year HHS and 669 had completed both baseline and 5-year HHS; 338 patients completed all three assessments (baseline and 2-and 5-year HHS (Fig. 1). Mean age was 64 years and 51 % were women. The dempgraphic and clinic characteristics of two cohorts used for (1) responsiveness, clinically important improvement thresholds and change HHS scores (both pre-and 2-or 5-year post-arthroplasty scores) and (2) predictive ability of 2-or 5-year HHS scores only, are shown in Appendix: Table 3. The causes for early revision in the 2-and 5-year cohorts are shown in Appendix: Table 4.

Clinically important improvement thresholds for the HHS and its responsiveness
MCII threshold for HHS ranged 15.9-18 points (Table 1); and moderate improvement threshold was 39.6-40.1 points. We found that the effect sizes for HHS pre-to post-THA were large, estimated at 3.12 and 3.02 at 2-, and 5-years. SRM was 2.73 and 2.52, at 2-and 5-years, respectively (Table 1).

Predictive ability of the absolute and change HHS scores
There were 3195 hips with HHS scores at 2-years and 2699 hips with HHS scores at 5-years (regardless of the completion of the baseline HHS). Low total HHS scores at 2-and 5-years were associated with a higher risk of revision surgery after each time-point (Table 2). Compared to patients with follow-up HHS scores of 81-100, patients with HHS scores < 55 at 2-and 5-years, had 4.34 and 3.08 higher hazards of revision subsequently, both statistically significant (Table 2). Results were similar when we used the traditional clinical cut offs of < 70, 70-79, 80-89, and 90-100 (Table 2).
Compared to an improvement of > 50 points, no improvement/worsening HHS score, i.e., change ≤ 0, was associated with a 18-fold increased risk (p = 0.02) at 2years post-primary THA and improvement of only 1-20 points with a 6-fold increased risk of revision (p = 0.10) ( Table 2). K-M graphs showed that 2-year HHS scores (p =0.0018) was associated with revision risk and change HHS score at 2-years seemed to be associated as well with borderline statistical significance (p =0.062; Fig. 2a-d). Models accounting for death as a competing risk confirmed findings; the total number of patients was small for improvement (Appendix: Table 5).

Discussion
Our study is the first study to define the thresholds for patient-relevant meaningful improvements on HHS. These thresholds were 16-18 points for MCII and 40 points for moderate improvement. We used a patient anchor to define these thresholds. MCII for HHS can now be used for trial sample size calculations and comparative effectiveness studies in THA. The use of MCII in arthroplasty trials will allow future trials to be adequately powered to examine group-level differences in patient-level outcome assessment on HHS. This information will compliment the mean HHS score comparisons, an approach that combines responders and non-responders in a single group. Consensus recommendations from an international pain trial group are to report both means and %responders in clinical trials, where pain is one of the primary outcomes [14]. Similar thresholds have been reported for another instrument, the Mayo Hip Score [15], that is patientadministered, not physician-reported.
Our MCII threshold is similar to/lower than MCII estimates for other 0-100 pain scales, for example, the thresholds of 15-20 mm on WOMAC pain [16], 24 mm on HOOS pain [17], 20 mm on VAS pain in pain trials [18] and 22 mm on VAS pain in gout trials [19]. This finding establishes the sensitivity to change for HHS, an important property for an instrument. Moderate improvement reported in our study are similar to the substantial improvement of 50 % in pain [14,18] and to American College of Rheumatology (ACR50) response in patients with rheumatoid arthritis, i.e., 50 % improvement in composite criteria [20]. An earlier study  in revision THA that did not use a patient anchor defined the MCII using a statistical definition and reported it to be 2.44 points [10], much lower than the 20-unit threshold previously defined on 0-100 pain scales in trials of chronic pain [18] and gout [19]. This might be due to very low variability on HHS in this small sample, and is discordant with available literature for other similar pain scales. A 2 mm improvement on a 0-100 scale (~2-4 % change depending on the baseline) does not mean much to patients or providers, and in most instances will be indistinguishable from baseline noise. We found that HHS scores < 55 at 2-and 5-years as well as the lack of improvement or worsening on change HHS scores from baseline to 2-years were each predictive of early revision in patients with primary THA. This is a new finding, to our knowledge. The robustness of this finding is supported by the consistency of estimates at both 2-and 5-years. Our finding links HHS at early follow-up to future implant failure and indicates that absolute HHS scores and its trajectory post-THA may help to screen early implant failures after primary THA. More research is needed to see if risk prediction models using PROs such as HHS can be developed for early implant failure and validated, similar to a proposal based on Oxford scores [21].
Wright et al. reported ES and SRM of 2.5 and 1.8 for HHS in a sample of 78 THA patients at 6-month postsurgery [8], confirmed in our study and extended to a longer follow-up. A previous study showed a much larger effect size of 8.6 for HHS in patients with revision THA [10]; SRM in hip fracture was 0.75 for HHS [22]. Some reasons for the difference in findings from our study vs. previous study for HHS effect size were the difference in patient population (primary THA vs. revision THA), length of follow-up (up to 5-years vs. 6 months), and the setting (USA vs. China).
Our study strengths include a large primary THA patient sample from a Joint Registry, the use of well-  *P-value was calculated using a Cox proportional hazards regression model **The main reason for decrease in the number of patients between absolute score (top) and change HHS score analyses (bottom) was that a lower number completed both surveys (See Fig. 1 for details) accepted methods to examine validity, the performance of sensitivity analyses and robustness of estimates across 2-and 5-year data. Our study has several limitations. MCII thresholds are not absolute, but rather estimates and can vary somewhat from one population to another, and between studies. Our findings were derived from a single center primary THA cohort, and our patient characteristics are similar to several other THA studies including the one with a national sample [23][24][25], implying that our sample is representative of primary THA. However, findings may not be generalizable to revision THA. However, more studies in other populations can determine the slight variation in MCII thresholds or confirm that these thresholds hold true for those groups as well. Incorporation of 2-or 5-year HHS scores into surgeon's clinical decision-making may have biased findings to be more significant that they actually might be. However, many revisions occurred at a time much after the assessments, and it is possible that the surgeon did not access old scores. We recognize that radiographic changes after arthroplasty are very important in deciding whether revision surgery is needed or not, since asymptomatic patients sometimes show loose hip implants. The present study did not assess the predictive ability of hip radiographs alongside Harris Hip Score. This should be investigated in future studies. Our study establishes the thresholds for clinically meaningful change on HHS; however, ceiling effects have been noted with HHS [26].

Conclusions
In conclusion, we found that in patients who underwent primary THA, HHS was responsive to change and predictive of the risk of revision after primary THA. Clinically important improvement thresholds for minimal and moderate clinically important improvements on HHS were defined in this study can now be used in arthroplasty clinical trials and clinical care. This report also establishes an additional utility of HHS, in predicting early revision surgery after THA, an important outcome.