Item response theory assumptions were adequately met by the Oxford hip and knee scores

Objectives: To develop item response theory (IRT) models for the Oxford hip and knee scores that convert patient responses into continuous scores with quantifiable precision, and to provide these as web applications for efficient score conversion. Study Design and Setting: Data from the National Health Service patient-reported outcome measures program were used to test the assumptions of IRT (unidimensionality, monotonicity, local independence, and measurement invariance) before fitting models to preoperative response patterns obtained from patients undergoing primary elective hip or knee arthroplasty. The hip and knee datasets contained 321,147 and 355,249 patients, respectively. Results: Scree plots, Kaiser criterion analyses, and confirmatory factor analyses confirmed unidimensionality, and Mokken analysis confirmed monotonicity of both scales. In each scale, all item pairs shared a residual correlation of ≤ 0.20. At the test level, both scales showed measurement invariance by age and gender. Both scales provide precise measurement in preoperative settings but demonstrate poorer precision and ceiling effects in postoperative settings. Conclusion: We provide IRT parameters and web applications that can convert Oxford Hip Score or Oxford Knee Score response sets into continuous measurements and quantify individual measurement error. These can be used in future research and clinical practice.


Introduction
The Oxford Hip Score (OHS) and Oxford Knee Score (OKS) are widely used patient-reported outcome measures (PROMs). Both instruments have been used as primary outcome measures in high-profile randomized controlled trials [1,2], as clinical decision support tools [3], and as quality indicators in the UK National Health Service (NHS) PROMs program [4] and other arthroplasty registries [5]. The questionnaires were developed in 1996 (OHS) [6] and 1998 (OKS) [7] to measure the outcomes following hip and knee arthroplasty from the perspective of the patient. Each contains 12 equally weighted items with five response categories relating to severity and frequency of pain and disability (most items specifically attribute symptoms to the joint of interest). The recall period of both questionnaires is 4 weeks, and scores range from 0 to 48, with a higher value indicating a better clinical state.
The OHS and OKS were developed with classical test theory, a traditional psychometric approach which assumes a linear relationship between the observed score and the level of the underlying latent construct (hip or knee health) or true score. Although straightforward to apply, there are limitations to classical test theory [8,9]. First, all items in the scale (questionnaire) must usually be completed for valid score comparison. Second, measurement error is assumed to be constant across the measurement range and errors are assumed to cancel each other out on a population level. Third, although the scores derived by summing item responses are ordinal, they are typically treated as continuous and interval-scaled. In other words, the questions and their responses are treated as being weighted equally for analysis and interpretation despite this not necessarily being the case in the minds of patients as they answer them.
In recent years, there has been increasing interest in the application of item response theory (IRT), which uses probabilistic modelling to map specific response patterns (i.e., combinations of item responses) onto continuous scales [10,11]. This can provide more granular, continuous measurement with quantifiable uncertainty at the individual level. Metaphorically, this exchanges the ruler with large and unequally sized intervals for a ruler with many, tiny, equally sized intervals.
In IRT, all items function independently. This means that scores can be generated in the presence of missing responses, without imputation or exclusion. This can be applied deliberately, by only posing the most relevant items for an individual based on their responses to previous items. This is termed computerized adaptive testing (CAT) and can shorten and personalize assessments [12].
Researchers have previously attempted to fit OHS and OKS data to the Rasch model, a strict form of IRT model that assumes that the sum-score is a sufficient statistic for the latent score (which has interval scale properties) [13]. When this has been attempted, model fit to the unmodified questionnaires has been variable [14,15], and in some cases unconvincing [16]. Another approach is to use slightly more complex (and flexible) models, such as the graded response model (GRM) [17], to describe the relationship between item response patterns and latent constructs. This has been attempted in a recent paper which showed promising results, but there the authors used only a small proportion of available data to generate model parameters and made modifications to both questionnaires by collapsing several adjacent response options [18]. Models based on larger datasets and unmodified questionnaires may have more stable parameters, better generalizability, and leverage all available response options for more granular measurement [19].
Our first aim was to test the fit of NHS PROMs data to the GRM and establish IRT models that could derive continuous latent construct measurement from item response sets. Such models could be used by other researchers in future to quantify measurement error in clinical studies [manuscript under review with JCE] or to administer the OHS and OKS as computerized adaptive tests. If this was achieved, our second aim was to create a useable system where OHS/OKS response sets could be converted to this interval-scaled scoring. This would allow revised scoring of older datasets and might allow future datasets to benefit from improved scoring without needing specific psychometric programming experience in a study or clinical team. We planned to do this by creating an open-source web application.

Methods
All analyses were performed in R version 4.2.0. Code and data are available at: https://github.com/MrConradHarrison/IRT-modelling-for-the-OHS-and-OKS.

Data
We used publicly available NHS PROMs program data for this study. These were collected as part of a national audit across NHS England providers and include the demographics and PROM responses of patients undergoing elective primary hip or knee arthroplasty between April 1, 2012 and March 31, 2020. All patients undergoing hip or knee arthroplasty in NHS England are invited to complete the PROMs preoperatively and approximately 6 months postoperatively. This longitudinal, paired (preoperative and postoperative) dataset has been estimated to represent approximately 50% of procedures conducted during the period [20]. The data are deidentified, and ethics committee approval is not required for secondary analysis.
Hip and knee replacement procedures were assessed separately. We analyzed demographics and missing data patterns through descriptive statistics and excluded respondents with incomplete preoperative response sets listwise. We then used complete preoperative item response data to test the following key assumptions that underlie the IRT framework: unidimensionality, monotonicity, local independence, and measurement invariance.

Unidimensionality
A set of items is described as unidimensional if all items measure the same single latent construct (or factor), in this case hip or knee health. This is particularly relevant to the OHS and OKS, as other studies have suggested that they might each measure two correlated factors: pain and function [21,22]. If pain and function are experientially distinct constructs, positive changes in one could offset negative changes in the other when the scores of all items are combined. For example, a patient could experience improving function but worsening pain (two important changes), with a combined score that remains unchanged. The correlation between these factors has been estimated between 0.87 and 0.92 for the OKS [21] and 0.60 for the OHS [23].
For each PROM, we assessed unidimensionality using a scree plot, Kaiser criterion analysis (with an eigenvalue threshold of 1.0) [24], and a confirmatory factor analysis (CFA) with polychoric correlation and a diagonally weighted least squares estimator in the lavaan package (version 0.6-11) [25]. The scree plot and Kaiser criterion analysis measure the variance in item responses explained by potential factors. The CFA tests how well our theoretical, unidimensional model explains the covariance in item response data. We used the following fit statistics and thresholds to indicate good model fit: root mean squared error of approximation ≤ 0.06, standardized root mean square residual ≤ 0.08, comparative fit index ≥ 0.95, and Tucker-Lewis index ≥ 0.95 [26].

Monotonicity
Monotonicity describes a nondecreasing relationship between item scores and latent construct levels. In other words, for any given item, if respondent x has a higher score than respondent y, the overall assessment score of respondent x must not be lower than that of respondent y. This can be assessed through Loevinger's Hᵢ statistic, which compares the number of violations of this pattern (known as Guttman errors) with the number that would be expected in a set of unrelated items [27]. We took Loevinger's Hᵢ values ≥ 0.3 to indicate monotonicity [28].
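The logic behind this statistic can be sketched in a few lines. The snippet below is a minimal illustration for a single dichotomous item pair with made-up response data, not the polytomous, item-level Hᵢ computation used in our analysis (which we performed with standard Mokken analysis software):

```python
# Hypothetical sketch: Loevinger's H for one dichotomous item pair.
# A Guttman error is a respondent who endorses the harder (less often
# endorsed) item while rejecting the easier one. H compares the observed
# error rate to the rate expected if the items were unrelated.

def loevinger_h(responses):
    """responses: list of (item_a, item_b) tuples of 0/1 responses."""
    n = len(responses)
    p_a = sum(a for a, _ in responses) / n
    p_b = sum(b for _, b in responses) / n
    # Reorder so the first element of each pair is the harder item
    if p_a <= p_b:
        pairs, p_hard, p_easy = responses, p_a, p_b
    else:
        pairs, p_hard, p_easy = [(b, a) for a, b in responses], p_b, p_a
    observed = sum(1 for hard, easy in pairs if hard == 1 and easy == 0) / n
    expected = p_hard * (1 - p_easy)  # Guttman error rate under independence
    return 1 - observed / expected

# Strongly related items: Guttman errors are rare relative to chance
data = [(1, 1)] * 40 + [(0, 1)] * 30 + [(0, 0)] * 28 + [(1, 0)] * 2
print(round(loevinger_h(data), 2))  # prints 0.84, well above the 0.3 threshold
```

Values near 1 indicate few violations of the expected response ordering; values near 0 indicate items that behave as if unrelated.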

Item independence
The local independence assumption states that two items are only related by the construct that they measure. We tested for this using Yen's Q3 residual correlation statistic, with a threshold of > 0.20 indicating undesirable local dependence between items [29]. A high residual correlation may suggest that the response to one item affects the response to the other, or that both items measure a second, unintended construct.
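The idea behind Q3 can be sketched as follows: correlate, across respondents, the residuals left after the model's expected item scores are subtracted from the observed ones. The expected scores below are invented placeholders standing in for model-implied values; in practice they come from the fitted IRT model:

```python
# Hypothetical sketch of Yen's Q3 for one item pair. A residual correlation
# above 0.20 would flag local dependence. All values here are illustrative.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

observed_i = [4, 3, 2, 1, 0, 4, 2]
observed_j = [4, 2, 2, 1, 1, 3, 2]
expected_i = [3.6, 2.9, 2.1, 1.2, 0.4, 3.5, 2.2]  # stand-ins for E[score | theta]
expected_j = [3.7, 2.8, 2.0, 1.1, 0.5, 3.4, 2.1]

resid_i = [o - e for o, e in zip(observed_i, expected_i)]
resid_j = [o - e for o, e in zip(observed_j, expected_j)]
q3 = pearson(resid_i, resid_j)
print(f"Q3 = {q3:.2f}")
```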

Measurement invariance
Measurement invariance describes a consistent relationship between item response patterns and latent construct levels across different population subgroups. For example, imagine an item that asks whether the respondent has difficulty using a toilet to urinate. For a given level of knee function, the response may differ between males and females as men may be more likely to stand up while urinating. In this case, the item would show differential item functioning (DIF) by gender.
We tested for DIF by gender (male vs. female) and age (< 60 years vs. ≥ 60 years, as patients undergoing hip or knee arthroplasty before the age of 60 years have substantially higher revision and dissatisfaction rates than those aged 60 years or over [30,31]). To do this, we used the logistic regression technique described by Choi et al. [32]. This method compares the fit of different logistic regression models that aim to predict item response based on the latent construct level. The addition of covariates (age or gender) should not improve model fit unless DIF exists. If the addition of a covariate (gender or age) improved the Nagelkerke pseudo-R² value of the model by > 2%, we considered the item to exhibit DIF.
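As an illustration of this decision rule (not our fitted models), the pseudo-R² change can be computed directly from the log-likelihoods of the nested models; the log-likelihood values below are hypothetical:

```python
import math

# Hypothetical sketch: flag DIF when adding a covariate (e.g., gender) to a
# logistic model predicting item response improves Nagelkerke pseudo-R^2 by
# more than 2 percentage points. Log-likelihoods are illustrative, not fitted.

def nagelkerke_r2(ll_model, ll_null, n):
    cox_snell = 1 - math.exp(2 * (ll_null - ll_model) / n)
    max_r2 = 1 - math.exp(2 * ll_null / n)  # upper bound of Cox-Snell R^2
    return cox_snell / max_r2

n = 1000
ll_null = -650.0       # intercept-only model
ll_base = -420.0       # item response ~ latent construct level
ll_with_cov = -400.0   # item response ~ latent construct level + gender

delta = nagelkerke_r2(ll_with_cov, ll_null, n) - nagelkerke_r2(ll_base, ll_null, n)
print(f"delta pseudo-R2 = {delta:.3f}, DIF flagged: {delta > 0.02}")
```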

Graded response model
Using the mirt package (version 1.36.1) [33], we fitted GRMs to the complete preoperative item response sets in each dataset and used these to calculate IRT scores (specifically, expected a posteriori scores computed with a standard normal prior), for patients at both preoperative and postoperative time points. We compared these to test-level and item-level information generated by the models, to illustrate how measurement precision varies with the level of hip or knee health.
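As a concrete sketch of this scoring step, the snippet below fits nothing: it evaluates a graded response model with invented parameters for a toy three-item scale, then computes an EAP score and its standard error by numerical integration over a theta grid with a standard normal prior, mirroring what mirt does at much larger scale:

```python
import math

# Illustrative sketch (not the fitted NHS parameters): a graded response
# model for three 5-category items, scored by expected a posteriori (EAP)
# estimation with a standard normal prior.

def grm_probs(theta, a, bs):
    """Category probabilities P(X = k | theta) for one item.
    a: discrimination; bs: ordered boundary (threshold) parameters."""
    # P(X >= k) follows a 2PL curve for each boundary
    star = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [star[k] - star[k + 1] for k in range(len(bs) + 1)]

def eap_score(responses, items, grid_step=0.05):
    """responses: observed category per item; items: list of (a, bs)."""
    grid = [i * grid_step for i in range(-80, 81)]  # theta in [-4, 4]
    post = []
    for theta in grid:
        prior = math.exp(-theta ** 2 / 2)  # unnormalized standard normal
        like = 1.0
        for x, (a, bs) in zip(responses, items):
            like *= grm_probs(theta, a, bs)[x]
        post.append(prior * like)
    total = sum(post)
    eap = sum(t * p for t, p in zip(grid, post)) / total
    var = sum((t - eap) ** 2 * p for t, p in zip(grid, post)) / total
    return eap, math.sqrt(var)  # score and its standard error of measurement

items = [(2.0, [-2.0, -1.0, 0.5, 1.5]),
         (1.5, [-1.5, -0.5, 0.8, 1.8]),
         (2.5, [-2.2, -0.8, 0.3, 1.2])]
score, sem = eap_score([4, 3, 4], items)  # high categories -> positive theta
print(f"EAP = {score:.2f} logits, SEM = {sem:.2f}")
```

In our analysis the equivalent quantities were obtained from mirt rather than hand-rolled integration; the sketch only shows the mechanics.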
We operationalized these models as an R Shiny web application that allows researchers to upload item response sets as a comma separated values (CSV) file, convert response sets to IRT scores, and download these together with the standard error of measurement for each respondent.

Cross-walk table
As an alternative to the response-pattern-specific IRT scores generated by the web application, we used the mirt package [33] to produce cross-walk tables that translate each of the 49 possible sum-scores on each instrument into expected a posteriori sum-scores and T-scores (mean 50, standard deviation 10), based on the GRMs. These serve as quick look-up tables to convert a (0-48) sum-score on either instrument into an IRT score. The expected a posteriori sum-score is the mean of each response-pattern-specific IRT score associated with a given sum-score [34]. For example, there are 12 possible response patterns that could achieve a sum-score of 1 on the OKS. Each of these response patterns is associated with its own response-pattern-specific IRT score (available through the web application). The expected a posteriori sum-score associated with the sum-score of 1 is the mean of these 12 response-pattern-specific IRT scores.
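The construction of an EAP sum-score can be sketched on a toy three-item scale. The pattern_score function below is a stand-in for the response-pattern-specific EAP scores produced by a fitted model:

```python
import itertools

# Sketch of the cross-walk idea: the EAP sum-score for a given sum-score is
# the mean of the pattern-specific IRT scores of every response pattern that
# achieves that sum. pattern_score is an invented stand-in, not a real model.

def pattern_score(pattern):
    # Stand-in for a response-pattern-specific EAP score
    return sum(pattern) * 0.5 - 2 + 0.1 * pattern[0]

n_items, n_cats = 3, 5
crosswalk = {}
for pattern in itertools.product(range(n_cats), repeat=n_items):
    crosswalk.setdefault(sum(pattern), []).append(pattern_score(pattern))
eap_sum_scores = {s: sum(v) / len(v) for s, v in sorted(crosswalk.items())}

# Three patterns sum to 1 on this toy scale: (1,0,0), (0,1,0), (0,0,1)
print(len(crosswalk[1]), round(eap_sum_scores[1], 3))  # prints: 3 -1.467
```

On the real instruments the same enumeration runs over 5^12 patterns per scale, which is why the cross-walk is precomputed as a look-up table.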

Demographics, clinical characteristics, and ceiling effects
The demographics for each dataset are presented in Table 1. In both datasets, complete preoperative item response sets were available for 98.9% of individuals.
In respondents completing the OHS, < 0.1% achieved the ceiling score preoperatively, whereas 15.7% achieved the ceiling score postoperatively. In respondents completing the OKS, < 0.1% achieved the ceiling score preoperatively and 3.7% achieved the ceiling score postoperatively.

Unidimensionality
Scree plots and Kaiser criterion analyses suggested that both the OHS and OKS were unidimensional. This is illustrated in Figure 1.
The CFA provided further support for the assumption of unidimensionality, with both the OHS and OKS preoperative data demonstrating excellent fit to the one-factor model. The only fit statistic not to meet our prespecified threshold was the root mean squared error of approximation for the OHS (0.075, threshold ≤ 0.060).
The fit statistics for each CFA are presented in Table 2, together with the thresholds that indicate good model fit. The results of CFA assumption tests and the models' standardized pattern coefficients are presented in the Supplementary Material.

Monotonicity
All items in each scale showed Loevinger's Hᵢ statistics ≥ 0.3, confirming monotonicity. These are presented with standard errors in the Supplementary Material.

Item independence
For the OHS, the Yen's Q3 residual correlation statistic between the items relating to 'washing' and 'dressing' was 0.20. For all other item pairs in the OHS, Yen's Q3 was < 0.20. All item pairs in the OKS had a Yen's Q3 < 0.20.

Measurement invariance
The OHS items showed no DIF by age or gender. The OKS showed no DIF by age, but the item relating to 'kneeling' showed uniform DIF by gender, with an improvement in pseudo-R² of 6.17%. At any given latent construct level, men reported less difficulty kneeling down and getting up afterwards than women. When all items are administered together, the relationship between overall OKS score and latent construct level was very similar between genders (Fig. 2).

Graded response model
Having confirmed the assumptions of IRT, we fitted GRMs to both the OHS and OKS. These showed stable item parameters. Model parameters (together with 95% confidence intervals) are presented in Tables 3 and 4, and fit statistics are available in the Supplementary Material. Figure 3 demonstrates the relationship between sum-scores and scores derived from the IRT model.
The test-level information (which is closely related to measurement reliability and precision) was high across the ranges of the latent trait where most respondents are located (Fig. 4). A test-level information plot for the OHS is available in the Supplementary Material, along with item-level information plots for both scales. An information level ≥ 9.77 equates to a standard error of measurement ≤ 0.32, or a marginal reliability ≥ 0.90, which is generally considered an excellent level of precision. An information level of 5.00 is equivalent to a marginal reliability of 0.80, which some consider acceptable for group-level measurement but not for individual-level measurement [35].

Web applications
The web application for converting item response data into IRT scores can be found at: https://conrad-harrison.shinyapps.io/IRTconverter/.
Users may upload item response data for either the OHS or OKS as a CSV file and convert these to continuous IRT scores. Data are not stored by the platform or viewable by other users.
The scores are presented as person location logits, which will range from approximately −4 to 4. Users may wish to scale these into other formats (e.g., to range from 0 to 100) [36] but it is usually reasonable to analyze logit scores without further scaling. Readers should be aware that scaling the logit scores into a continuous 0-48 format does not necessarily place them onto the same ordinal 0-48 scale achieved by summing the scores of each item.
Together with the IRT score, the web applications provide standard error of measurement values for each respondent. These can be interpreted as the standard deviation of plausible IRT scores that would result in the observed response set. In other words, 95% credible intervals can be presented for each score as IRT score ± 1.96 × standard error of measurement [37].
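For example, the interval is computed directly from the two numbers the application reports (the values here are illustrative):

```python
# Sketch: a 95% credible interval around an IRT score, using the standard
# error of measurement reported alongside it. Values are illustrative.
score, sem = 1.20, 0.32
low, high = score - 1.96 * sem, score + 1.96 * sem
print(f"95% credible interval: {low:.2f} to {high:.2f} logits")
```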
Missing data are handled directly by the IRT model. There is no need to impute or exclude missing item response data. In these cases, the score is estimated from all available data and the uncertainty of the measurement is reflected in the standard error of measurement. Missing item responses can simply be left blank in the CSV file.

Table 5 is the cross-walk table for converting sum-scores into expected a posteriori sum-scores or T-scores. This can be used as a straightforward way to convert sum-scores to IRT scores, but provides less granular scoring than the response-pattern-specific scoring available through the web application.

Discussion
In this study, we found that the OHS and OKS fulfilled the assumptions of the GRM and developed models that allow specific response patterns to be mapped onto continuous scales. In future, the model parameters provided in this paper and our open-source web application can be used to:
- analyze OHS and OKS data with higher granularity, taking into account information on item characteristics;
- describe measurement precision at the individual level (e.g., for clinical decision support);
- quantify measurement error in clinical trials; and
- build computerized adaptive tests.

Fig. 1. The scree plots show a clear 'elbow' at the second factor, suggesting that most of the covariance in item responses is explained by the first factor. The horizontal dashed line shows the Kaiser criterion cutoff of 1 eigenvalue, with only the first factor accounting for more covariance than this limit. This strongly suggests unidimensionality.
Although the classical test theory scoring of the OHS and OKS allows 49 different scores (including 0), our web applications will provide more than 244 million (5^12) different possible scores for each scale (or more than two billion when possible missing data patterns are included).
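These counts follow directly from the questionnaire structure: five response options per item across 12 items, or six states per item once a missing response is allowed as a sixth possibility.

```python
# Counting distinct response patterns on a 12-item, 5-category scale:
# five options per item, or six states per item once "missing" is allowed.
n_items = 12
complete = 5 ** n_items        # complete response patterns
with_missing = 6 ** n_items    # patterns including missing responses
print(complete, with_missing)  # prints: 244140625 2176782336
```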
This change in scoring is not necessarily sufficient to alter the conclusions of studies which use the OHS or OKS. Studies of other PROMs have failed to demonstrate superiority of IRT scoring to sum-scoring against external criteria [38,39], and in this study, we found a close correlation between EAP scores and sum-scores (Fig. 3). Nonetheless, this could be tested empirically in future. By demonstrating the agreement between IRT and classical test theory scoring, our study provides valuable reassurance that the foundation of previous research and policy remains sound, while also highlighting the potential benefits of using IRT in future studies. This work is potentially more impactful for individual-level scoring (e.g., when the OKS is used as a clinical decision aid on a per-patient basis [3]) than for group-level scoring, where positive and negative differences between classical test theory and IRT scoring (which are generally small, Fig. 3) are averaged out. Although it is possible that rescoring the OKS and OHS with IRT could alter between-group or within-group comparisons (such as those made in research studies), it is likely to have a bigger impact on between-patient or within-patient comparisons (such as those made in clinical practice). Using our web application, clinicians can now also estimate the potential measurement error around an individual's score (using the standard error of measurement or 95% credible intervals). This may be particularly useful for comparing repeated measures in an individual or for comparing an individual's score to those of other patients or clinically important thresholds [3].
The parameters we have presented could be used to build computerized adaptive tests that reduce the length of the OHS and OKS by selectively administering the most relevant items for an individual, based on the responses provided so far during the assessment [12]. CAT is most effective when used with large item banks, where it can provide more precise scoring than static short forms, in some cases from even fewer items [35,40]. However, recent research has shown that CAT can also reduce the length of PROM scales with similar lengths to the OHS and OKS [41–43], and this may be appealing in the context of clinical trials where several PROMs may be administered to respondents together. The publication of these IRT parameters complements previous efforts to reduce the burden of the OHS and OKS through CAT, which have relied on either modifying the questionnaires [18] or using non-IRT techniques [44].
In the OKS, there was DIF by gender for the 'kneeling' item, but when all 12 items were combined, differential test functioning was negligible. This means that overall scores for men and women can be interpreted in a similar fashion. However, the DIF in this item may be an important consideration for future computerized adaptive test development, as it could have a more significant impact in truncated assessments. If DIF is a concern, this item can simply be omitted when calculating IRT scores.

In recent years, there has been interest in using the OKS and OHS to measure pain and function constructs separately, in addition to a more global knee health construct [21,22]. This can be achieved by breaking each scale into two discrete subscales. Here, we have shown that the items in each instrument can be considered unidimensional (collectively measuring a single knee or hip health construct). One possible interpretation of this finding is that pain and function are closely related in osteoarthritis. This is consistent with previous factor analyses that showed cross-loading of items onto both pain and function constructs [21,22]. An alternative hypothesis might be that pain causes a reduction in function. The practical implication of this study for OKS and OHS users is that although pain and function can be measured individually, there may be little merit in doing so, as these constructs are closely correlated. Using all items in each PROM together as a single scale will produce measurements with a lower standard error of measurement (higher precision) than will be achieved with individual pain and function subscales.
Although the OHS and OKS are well targeted (provide precise measurement) for preoperative patients, they both demonstrate ceiling effects postoperatively. This has been described previously [45,46], and in this study, we have demonstrated how this affects test-level information (and thus measurement precision). Respondents with high scores (e.g., an IRT score above 2.5 logits or a sum-score above 40) will demonstrate higher standard errors of measurement (lower precision and reliability). In real-world terms, the Oxford scores would struggle to differentiate between someone who casually runs 5 km once a month and an elite athlete. This is more relevant for postoperative settings than preoperative settings. Although IRT may help to model the impact of ceiling effects on measurement precision, it does not resolve the ceiling effects themselves, which can be considered an issue with the content (wording) of the PROMs' items.
There are limitations to this work. Although we used very large datasets that produced stable model parameters, these models may not generalize to other populations (e.g., where significant cultural differences may affect the relationship between latent construct level and item responses). In future, this could be examined through DIF analysis by country. All patients in this analysis were undergoing primary elective arthroplasty, so the models may not generalize to very different conditions or treatments (e.g., major trauma or complex revision arthroplasty).

Fig. 4. Test-level information (precision) of the Oxford Hip Score (panel A) and Oxford Knee Score (panel B) across latent construct levels. The x-axis represents the latent construct (knee or hip health) measured on a continuous logit scale based on respondents' specific response patterns and the graded response model. The higher the latent trait level, the better the clinical state. The distribution of postoperative scores (shaded green) is higher (clinically better) than that of preoperative scores (shaded purple). The red line represents the level of information contained in the pattern of responses that achieve the latent construct score, which is closely related to the precision or reliability of the score. Scores at the extreme negative or positive ends of each scale provide less information for the model to calculate the latent construct level. In other words, measurement is less precise at these levels. Test information is high for most preoperative response sets, but drops in the latent construct range where many postoperative respondents lie.

Table 5. Expected a posteriori (EAP) sum-scores and T-scores (mean 50, SD 10) associated with each sum-score in the Oxford Hip and Knee Scores.
With our models, it is now possible to quantify measurement error in clinical trials that use the OHS or OKS, using techniques such as plausible value imputation [34]. Plausible value imputation is similar to multiple imputation, but instead of aiming to replace missing data, latent construct measurements for each respondent are randomly drawn from a distribution of plausible values. This distribution can be normally approximated with a mean equal to the expected a posteriori IRT score and a standard deviation equal to the standard error of measurement (both available through our web application). We have demonstrated this process in an accompanying paper (manuscript under review with JCE).
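The mechanics of plausible value imputation can be sketched as follows, with invented EAP scores and standard errors and a deliberately trivial per-draw analysis:

```python
import random

# Sketch of plausible value imputation: for each respondent, draw several
# plausible latent scores from Normal(EAP, SEM), run the analysis once per
# drawn dataset, then pool, as in multiple imputation. Values are invented.

random.seed(1)
respondents = [(-0.8, 0.25), (0.3, 0.30), (1.6, 0.45)]  # (EAP, SEM) pairs
n_draws = 5
plausible = [
    [random.gauss(eap, sem) for eap, sem in respondents]
    for _ in range(n_draws)
]
# One analysis per draw (here simply the sample mean), pooled afterwards
means = [sum(draw) / len(draw) for draw in plausible]
pooled = sum(means) / n_draws
print(round(pooled, 2))
```

In a real trial, the per-draw analysis would be the study's actual statistical model, and the between-draw variability would feed into the pooled standard errors.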
Future work should apply IRT scoring to OHS and OKS datasets. It will be particularly important to understand how this additional granularity affects the instruments' sensitivity and responsiveness, and whether measurement error could have affected the results of landmark trials that have used these PROMs with classical test theory scoring. When doing this, trialists should be aware that interpretability statistics (such as minimal important difference and minimal important change) may vary with the scoring approach. Future work might also aim to define clinically important thresholds on this new, continuous, IRT scale.