Reliability and Validity of the Osteoarthritis Research Society International Minimal Core Set of Recommended Performance-Based Tests of Physical Function in Knee Osteoarthritis in Community-Dwelling Adults

Background: Reliability analysis appropriate to the type of data is crucial for performance-based tests in the knee osteoarthritis (OA) population, and studies of the various types of construct validity are limited. The purpose of this study was to evaluate the relative and absolute reliability and construct validity of the Osteoarthritis Research Society International (OARSI) recommended minimal core set of performance-based tests in knee OA in community-dwelling adults.
Methods: Fifty-five participants with primary knee OA (median age 69.0, interquartile range [IQR] 11.0) took part in this cross-sectional study. Three performance-based tests were performed in two sessions with a 1-week interval: the 30-s chair stand test, 40-m fast-paced walk test and 9-step stair climb test. Relative reliability included the intra-class correlation and Spearman's correlation coefficient (SPC). Absolute reliability included the standard error of measurement, minimum detectable change, coefficient of variation, limits of agreement (LOA) and ratio LOA. The Knee Injury and Osteoarthritis Outcome Score-Physical Function Short Form (KOOS-PS), knee extensor strength and a pain scale were analysed for convergent validity using Pearson's correlation coefficient and SPC. Analysis of Covariance was used for known-groups validity.
Results: Relative and absolute reliability were all acceptable. LOA showed small systematic bias. Acceptable construct validity was found only with knee extensor strength. All tests demonstrated known-groups validity with medium to large effect sizes.
Conclusion: The OARSI minimal core set of performance-based tests demonstrated acceptable relative and absolute reliability and good known-groups validity but poor convergent validity.


Introduction
Knee osteoarthritis (OA) poses a risk of deterioration in locomotive function and consequent frailty (1)(2). Feasible, evidence-based measurement tools are needed for early detection and follow-up of the disease state (3).
Assessment of physical function in knee OA is classified into self-reported and performance-based tests, which are conceptualised on the International Classification of Functioning, Disability and Health (ICF) (3)(4). A minimal core set of three performance-based tests was recommended by the Osteoarthritis Research Society International (OARSI); it includes the 30-s chair stand test (30sCST), 40-m fast-paced walk test (40mFPWT) and stair climb test (SCT) (3,5). The 30sCST was preferred because it has no floor effect: even participants with poor physical function can complete the test (6). The 40mFPWT, a time-based, short-distance test of maximum walking speed with untimed turns, is a good test for evaluating performance in response to environmental demand and is appropriate for lower extremity OA (7)(8). Stair negotiation is one of the most difficult tasks to overcome. It is associated with functional limitation from sarcopenia, somatosensory and visual impairment and other disorders such as OA (9).
The assessment of the psychometric properties of a test is specific to clinical conditions and should cover various forms of reliability (absolute and relative) and construct validity (e.g. convergent, divergent and known-group) (4,10,11). Two measurement property studies have targeted the OARSI recommended minimal core set of three performance-based tests in knee OA and knee arthroplasty (12)(13). One of those studies showed acceptable relative reliability of all performance-based tests except the 11-step SCT, in which the outlier was removed before analysis (12). Moreover, for the non-normal data of the 11-step SCT (outlier not removed), the minimum detectable change (MDC) was not calculated (12). It has been contended that human performance measurement, which uses a ratio scale, tends to be heteroscedastic (i.e. departure from normality; error related to the measurement value) (10,14). For this heteroscedastic type of data, the standard error of measurement (SEM) is a less effective representation of measurement error than the coefficient of variation (CV) and ratio limits of agreement (ratio LOA) (10,15). Therefore, analysis appropriate to the specific type of data is an important concern in reliability studies. Another psychometric property study of the OARSI recommended minimal core set showed that the three performance-based tests had a moderate correlation with quadriceps strength and a low correlation with a self-reported test (Knee Injury and Osteoarthritis Outcome Score-Physical Function Short Form [KOOS-PS]) and a pain scale (13). Although this study provided a thorough convergent validity analysis, it did not investigate known-group validity. In a known-group study of the 30sCST, the participants who ambulated with an assistive device were significantly different from those who did not (16). Therefore, a comparative study of the measurement properties of the comprehensive OARSI recommended minimal core set of performance-based tests is still needed (5). The purpose of this study was to evaluate the relative and absolute reliability and construct validity of the OARSI recommended minimal core set of performance-based tests in knee OA in community-dwelling adults.

Population and Sampling
This cross-sectional study of the construct validity and reliability of three recommended performance tests for knee OA was performed in a sub-district community setting of Chiang Mai, a northern province of Thailand. The sample size for both construct validity and reliability was at least 50, as stated elsewhere (17). The study project was advertised via the community volunteers, the village headman and the Subdistrict Health Promoting Hospital (SHPH) personnel. The volunteers were screened by physical therapists and were included if they (i) met the American College of Rheumatology (ACR) clinical diagnostic criteria (with classification tree) for knee OA (18)(19); (ii) were able to follow the instructions and could perform sit-to-stand, walk and climb the stairs and (iii) could read and fill out the questionnaire by themselves or with the help of relatives who read it for them.
Subjects were excluded if they had a history of (i) rheumatoid or gouty arthritis or secondary OA; (ii) pain in the lower back or lower extremities other than the knees; (iii) injury or fracture, with or without surgery, of the lower back or lower extremities; (iv) congenital or acquired anomalies of the spine or lower extremities; (v) neurological problems; (vi) heart disease or high blood pressure not controlled by medication; (vii) intra-articular injection within 3 weeks; (viii) intake of alcohol or medications that affect sleep within 24 h or (ix) hearing or visual loss.

Procedures
Participants were given appointments at the SHPH for data collection in two sessions of performance-based testing, 1 week apart, to minimise recall effect and ensure real performance change (12). In session 1, participants performed three performance-based tests (30sCST, 40mFPWT and 9-step SCT) with three independent raters, i.e. each rater rated all participants using only one test. All raters were physical therapists who had 7-12 years of clinical experience. The testing order of the tests for each participant was randomised to prevent a carry-over effect. A 5-min rest period between two consecutive measurements was allowed for energy recovery and to prevent fatigue. Before testing, the physical therapist responsible for a specific test demonstrated the task, allowed the subjects to practise it, gave feedback and stayed close by for safety. In the first session, in addition to the performance-based tests, the following data were collected: (i) baseline demographics; (ii) level of knee pain experienced over the past week, assessed by an 11-point pain numerical rating scale (NRS) with 0 = no pain and 10 = worst pain (20); (iii) self-reported difficulty in performing physical function, assessed by the KOOS-PS and (iv) isometric knee extensor force, measured with a hand-held dynamometer (HHD). In the second session, participants repeated the three performance-based tests in the same order as in the first session and with the same raters. All raters were blinded to the outcomes of the first session. Half-day training for the test-specific therapists was provided before the data collection of the first week, covering questionnaire completion (NRS and KOOS-PS) and the set-up, administration and recording of the HHD and performance-based tests.

KOOS-PS
KOOS-PS is a 7-item self-report of each individual's difficulty in performing daily functions. Each item is rated on a 5-point Likert scale ranging from no difficulty to extreme difficulty. The scale was developed by Rasch analysis of multiple samples from many countries, from which only the seven most valid items were extracted (21). KOOS-PS showed good internal consistency (0.89) and good test-retest reliability (0.85-0.86) (22). The Thai-version KOOS reported good internal consistency (0.9), high test-retest reliability in the ADL domains and moderate correlation with aggregated functional performance time (0.38 to 0.50) (23).

Isometric Knee Extensor Torque
The maximum voluntary isometric contraction was tested with a Baseline® Hydraulic Hand Dynamometer (Fabrication Enterprise Inc., Elmsford, NY, USA). The dynamometer is liquid-hydraulic, weighs 683 g and can measure up to a maximum of 90 kg. An adaptor was fixed to the distal leg just above the lateral malleolus (24). The participant sat upright on the table with the arms crossed and the hip and knee flexed at 90° (0° as full knee extension) (24). Each participant performed two maximal contractions with a 5-min rest interval (25). During each contraction, the participant was instructed to gradually develop maximal strength over a few seconds and maintain the maximal effort for 5 s (24,25). The torques of both legs were summed to give the aggregate knee extensor torque (AggKET), which was then divided by body mass to give the aggregate knee extensor torque normalised by body mass (AggKETbm, N·m/kg) (26).

Statistical Analysis
The data analysis was performed with the Statistical Package for Social Sciences (SPSS) version 17.0 (SPSS Inc., Chicago, IL, USA). The distribution of the data was checked. If skewness or non-normality was identified, natural log-transformation and back-transformation were performed (27)(28)(29)(30). The geometric mean and 95% confidence interval (CI) were estimated by Cox's modification (31)(32).
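As a rough illustration of the log-transformation and back-transformation step described under Statistical Analysis, the sketch below computes a geometric mean and a back-transformed interval for skewed timing data. It is a minimal sketch and not the study's SPSS procedure: it uses a plain t-interval on the log scale rather than Cox's modification, and the function name `geometric_mean_ci`, the sample times and the critical value passed in are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def geometric_mean_ci(values, t_crit):
    """Geometric mean with a back-transformed t-interval.

    values : positive measurements (e.g. stair climb times in seconds)
    t_crit : two-sided t critical value for n - 1 degrees of freedom,
             supplied by the caller (assumption: read from a t table)
    """
    logs = [math.log(v) for v in values]          # natural log-transformation
    n = len(logs)
    log_mean = mean(logs)
    log_se = stdev(logs) / math.sqrt(n)           # standard error on the log scale
    lower = math.exp(log_mean - t_crit * log_se)  # back-transformation
    upper = math.exp(log_mean + t_crit * log_se)
    return math.exp(log_mean), lower, upper


# Hypothetical 9-step SCT times (s); 2.776 is the 97.5th percentile of t with 4 df.
gm, lo, hi = geometric_mean_ci([11.2, 12.8, 13.5, 15.1, 24.9], t_crit=2.776)
print(f"geometric mean = {gm:.1f} s, 95% CI {lo:.1f} to {hi:.1f} s")
```

Because the interval is built on the log scale, it is asymmetric around the geometric mean after back-transformation, which is the expected behaviour for right-skewed times.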

Performance-Based Tests
The OARSI recommended minimal core set of performance-based tests (30sCST, 40mFPWT and 9-step SCT) was administered strictly according to the standard procedures provided on the OARSI website (https://www.oarsi.org/research/physicalperformance-measures). The 30sCST was performed on a chair, 43 cm in height, with a straight backrest and without armrests. The 40mFPWT was performed outdoors by walking a straight 10-m course four times. Fast walking speed was calculated by excluding the turning time. The 9-step SCT was performed in the SHPH building on a 9-step (19 cm height/step) flight of stairs with a handrail. The participants were allowed to use ambulation aids and/or the handrail during the walking and stair-climbing tests.
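The walking-speed calculation can be sketched as follows. This is an illustrative sketch only, assuming that each 10-m pass is timed separately so that turning time is never included; the function name `fast_paced_walk_speed` and the example times are hypothetical.

```python
def fast_paced_walk_speed(segment_times_s, segment_length_m=10.0):
    """Walking speed (m/s) from separately timed straight passes.

    segment_times_s  : time in seconds for each straight pass (turns untimed)
    segment_length_m : length of one pass; 10 m in the protocol described here
    """
    if not segment_times_s or any(t <= 0 for t in segment_times_s):
        raise ValueError("each pass needs a positive time")
    total_distance_m = segment_length_m * len(segment_times_s)
    return total_distance_m / sum(segment_times_s)


# Hypothetical participant: four 10-m passes of about 8 s each.
print(round(fast_paced_walk_speed([8.1, 7.9, 8.0, 8.0]), 3))
```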

Validity Analysis
Bivariate correlation coefficients were analysed among the performance-based tests, KOOS-PS, normalised AggKET, NRS pain and age. Pearson's correlation coefficient (for normally distributed data), SPC (for non-normally distributed data and ordinal scales) and their 95% CIs were calculated using Fisher's transformation (35). For convergent validity, the relationship between a performance test and a measurement of a similar construct (KOOS-PS, knee extension torque) should show at least a moderate correlation coefficient, ≥ 0.4 or ≤ −0.4 (13). All performance tests were evaluated for known-groups validity using adaptation to stair climbing as the independent variable. The variable was categorised into two groups: non-adaptation and adaptation (e.g. use of walking aids or handrail). Subgroup comparison of each performance measure was done using the t-test, or univariate Analysis of Covariance (ANCOVA) if any covariates were detected. ANCOVA was performed with standard procedures and checked for violation of assumptions, i.e. homogeneity of regression slopes (P-value for the F interaction > 0.05), homogeneity of variance (P-value for Levene's test > 0.05) and variance ratio (F max < 10:1) (27,41). The main effect F-test, estimated marginal means (statistical covariate adjustment) and partial eta squared effect size (Eta²) with 95% CI (non-centrality interval estimation) were calculated (41)(42)(43). Cohen's criteria for Eta² were 0.01 (1%) for a small effect, 0.06 (6%) for a medium effect and 0.138 (14%) for a large effect (41).
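The Fisher-transformed interval and the effect-size thresholds can be sketched as below. This is a minimal sketch under stated assumptions rather than the SPSS output used in the study: the helper names are hypothetical, the standard error 1/sqrt(n − 3) is the usual Pearson form (applying it to SPC is only an approximation), and partial eta squared is recovered from the F statistic and its degrees of freedom instead of the non-centrality method used for the reported CIs.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate CI for a correlation via Fisher's z-transformation."""
    if n <= 3 or not -1.0 < r < 1.0:
        raise ValueError("need n > 3 and -1 < r < 1")
    z = math.atanh(r)                 # Fisher's z
    se = 1.0 / math.sqrt(n - 3)       # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohen_label(eta_sq):
    """Label an effect size with the thresholds quoted in the text."""
    if eta_sq >= 0.138:
        return "large"
    if eta_sq >= 0.06:
        return "medium"
    if eta_sq >= 0.01:
        return "small"
    return "below small"


# r = -0.48 with n = 55 is the age vs 30sCST correlation reported in this study.
low, high = fisher_ci(-0.48, 55)
print(f"95% CI for r: {low:.2f} to {high:.2f}")
# Effect sizes reported for the 30sCST, 40mFPWT and 9-step SCT.
print([cohen_label(e) for e in (0.121, 0.163, 0.406)])
```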

Reliability Analysis
Absolute reliability included SEM, SEM percentage (SEM%), MDC and LOA. SEM was calculated as the square root of the mean square error, and its 95% CI was derived from the sum of squares error and the Chi-squared value from the ICC(2,1) analysis of variance table (36). MDC90 was calculated as 1.65 × √2 × SEM. SEM% was defined as (SEM/mean) × 100 and MDC90 percentage (MDC90%) as (MDC90/mean) × 100, where mean was the mean of all observations in both sessions 1 and 2 (37). A SEM% of < 10% was the acceptable random error regardless of measurement unit (38). Coefficient of variation percentage (CV%) and its 95% CI were calculated with the root mean square method (39). CV% < 10% was acceptable, as the small difference lay within 10% of the mean (10). LOA was reported as LOA = mean_diff ± 1.96 (Z-score of 95% CI) × SD_diff, where mean_diff and SD_diff were the mean and SD of the difference between sessions (40). LOA should cover '0' to show that there was a point where between-session scores were equal. LOA was separated into systematic bias (left component or mean difference) and random error (right component or SD component). The systematic bias was interpreted as a percentage of the grand mean of the sample (10). For non-normally distributed performance-based data, SEM (log scale), CV% and ratio LOA were analysed. Ratio LOA = mean_Lndiff ± 1.96 × SD_Lndiff, where Ln_diff was the difference of the log-transformed session 1 and 2 scores (Ln s1 − Ln s2) (40). The log form of the ratio LOA was later back-transformed (antilog = e^(Ln s1 − Ln s2) = s1/s2) and reported as antilog (mean_Lndiff) ×/÷ antilog (1.96 × SD_Lndiff) (14). Ratio LOA should cover '1', which indicates equal between-session scores. For ratio LOA, the bias was interpreted as the percentage difference between the repeated means.

Results
Ninety-three community dwellers came to the village sites and were screened using the ACR and eligibility criteria. Of those, 24 were excluded because they did not meet the ACR criteria (n = 7), had a history of knee trauma (n = 4) or total knee replacement (n = 4), had knee pain related to spinal conditions (n = 2), had other knee pathology, i.e. gout (n = 1) or rheumatoid arthritis (n = 2), were prone to fatigue and weakness from coronary artery disease (n = 2), had chronic obstructive pulmonary disease (n = 1) or were unable to provide responses to the personal medical history, pain scale and ACR questionnaire (n = 1). Of the remaining 69, 14 did not return after the screening session because of lack of transportation (n = 4), illness (n = 4), worsened knee pain (n = 2) or unknown reasons (n = 4). Finally, 55 participants joined both sessions 1 and 2 of the data collection, and their complete data were available. The number of patients included and the reasons for not being included in the study are summarised in Figure 1. Participant characteristics are presented in Table 1. All variables were normally distributed, except SCT, age and body mass. Only the 9-step SCT was natural log-transformed for further analysis. Its skewness and kurtosis improved to normal after transformation without the outlier.

Between Sessions 1-Week Interval Within-Rater Reliability
Descriptive data and within-rater absolute and relative reliability of the three performance-based tests are presented in Table 2. In terms of relative reliability, all performance-based tests were well above acceptable levels (ICC and SPC > 0.85, lower 1-sided 95% CI > 0.7). Of the absolute reliability measures, SEM% and MDC% could be interpreted for normally distributed data, i.e. the 30sCST and 40mFPWT. SEM% of both tests was under 10% (9.1% and 7.0%) of the mean test score and showed a small amount of random error. For CV%, the 40mFPWT and 9-step SCT achieved an acceptable level of < 10% (6.9% and 6.7%, respectively), whilst the 30sCST was 0.7 percentage points above the criterion (10.7%). The LOA of both the 30sCST and 40mFPWT showed small systematic bias (−0.9 stands and −0.002 m/s, or 6.1% and 0.2% of the grand mean, respectively) and covered '0', which meant that between-session test scores were sometimes equal. The difference between session test scores of the 30sCST and 40mFPWT, with 95% CI, lay within 3.8 stands and 0.245 m/s, respectively. For ratio LOA, the 9-step SCT showed a ratio of 1.029 of systematic bias, which signified a 2.9% difference between session test scores, and a ratio of 1.188 of random error, which signified, with 95% CI, a difference of no more than 18.8%.

Construct Validity
Bivariate correlation coefficients of the tested variables are shown in Table 3. The 'absolute' correlation between KOOS-PS and NRS pain was higher than the correlation between all performance tests and NRS pain.
Table 3. Bivariate Pearson's correlation coefficient (95% CI) (n = 55)

Known-Groups Validity
Since age was significantly (P < 0.01) related to all performance-based tests (30sCST, r = −0.48; 40mFPWT, r = −0.37; 9-step SCT, r = −0.55), ANCOVA adjusted for age was performed to compare performance ability between the non-adaptation and adaptation (during stair climbing) groups. The results of the ANCOVA are presented in Table 4. All models, except the 9-step SCT model, showed no violation of assumptions.
Although Levene's test demonstrated unequal variance, the F max was less than 10:1, and the near-equal sample size between cells (n = 27 and 28) supported a robust ANCOVA analysis. With statistical age adjustment, the non-adaptation group performed significantly better than the adaptation group, with a mean difference of 2.6 stands in the 30sCST, 0.161 m/s in the 40mFPWT and 49.2% in the 9-step SCT. The percentage of variance in the DV (performance test scores) explained by the IV (adaptation of 9-step SCT) was large in the 9-step SCT (40.6%) and 40mFPWT (16.3%) and medium in the 30sCST (12.1%).
Table 4 footnotes: (a) non-centrality interval estimation; (b) natural log transformation.

Discussion
This study aimed to evaluate the psychometric properties of the OARSI recommended minimal core set of three performance-based tests, including various forms of reliability and construct validity. In terms of reliability, all tests showed acceptable levels of relative reliability and small measurement error. For convergent validity, the tests showed moderate correlation with age and knee extensor torque but low correlation or no relationship with self-reported physical function. In terms of known-group validity, all tests could discriminate between the groups that did and did not use stair-climbing aids.
In terms of relative reliability, the 9-step SCT showed better consistency than the other two tests. To compare the degree of absolute reliability across studies, SEM% is an accurate estimate for homoscedastic data and CV% for heteroscedastic data. To determine whether a change after intervention is real, MDC% is helpful for homoscedastic data. For heteroscedastic data, ratio LOA could be a minimum criterion of change (10,14). In this study, the change in the ratio of repeated 9-step SCT test times should be at least 18.8% (with 95% CI). The relative reliability (lower 1-sided ICC) of this study was within the same range as other studies (12,44). However, the absolute reliability based on MDC90% in this study (30sCST, 21.1%; 40mFPWT, 16.3%) showed slightly more error than that of a previous study (30sCST, 16.9%; 40mFPWT, 9.5%). For the 9-step SCT, the MDC90% of the previous study (18.8% of the average mean) was comparable with the random error of the ratio LOA in this study (18.8% of the ratio between-session mean). The mean 9-step SCT time of this study (14.2 s and 14.7 s) was higher than that in a previous study (12) (13.27 s and 12.35 s). Age may have combined with knee OA to affect the performance of this group and contributed to the higher random error.
For convergent validity, AggKETbm was the only construct that passed the minimum acceptable criteria. The self-reported physical function (KOOS-PS) did not show a significant correlation with any of the performance-based tests except the 30sCST. However, the correlation coefficient between KOOS-PS and the 30sCST was less than the criterion. The 'absolute' correlation between KOOS-PS and NRS pain was higher than the correlation between all performance tests and NRS pain. Since self-report and performance-based tests were developed to assess similar function in OA (3) (e.g. ICF conceptualisation), a meaningful association was expected. The results of this study and the previous study did not support this premise. These unmet correlation criteria might reflect on the content validity of both self-reported and performance-based tests. Timed measurement of task performance reduces direct observation to one quantifiable dimension, which disregards erroneous movement (i.e. impairment) (45)(46). It has been advocated that the content assessed by the self-report is not only the patient's ability to move around but also the patient's subjective response to accumulated past experiences (i.e. pain and perceived exertion) (45). As a result, pain influences the self-report more than the pain experienced during execution of the task does (13,45,47), as shown in this study.
In this study, the failure of the three performance-based tests to show the hypothesised relationship with KOOS-PS and NRS might be a problem of construct under-representation (48). If a test under-represents the physical component of function, it fails to capture the important aspects of the construct it purports to measure. In this study, the correlation between the 30sCST and K-PS-Q3 (question number 3) was higher than the correlation with KOOS-PS, which also suggested that the 30sCST was more strongly represented by a component of KOOS-PS.
For known-group validity, this study used stair-climbing aids (use/non-use) as the discriminatory factor, which differed from the previous study (gait aid/no gait aid) (16). Since age was correlated well with all performance-based tests and also differed significantly between the groups that did and did not use stair-climbing aids (not presented in the results; independent t-test = −2.763, P = 0.008), age was adjusted for to accurately estimate the effect size. All three performance-based tests demonstrated evidence of known-groups validity by differentiating the knee OA group with aids from the group without aids. The 9-step SCT showed a larger effect size than the other two tests. This suggested that the 9-step SCT was more relevant to the stair-climbing adaptation factor than the 30sCST and 40mFPWT.
In this study, heterogeneity of data and skewness were observed in the stair-climbing performance test. The natural log was used for transformation because it is easy to interpret after back-transformation, as stated above. The Cox estimation of the 95% CI was selected because of its smallest coverage error for a medium sample size (31). To our knowledge, this is the first study to estimate various forms of absolute reliability of the recommended performance-based tests specifically for different data types (e.g. homogeneous and heterogeneous). Since some human performance measurements were recorded on a ratio scale and led to heteroscedastic errors (14), the CV% and ratio LOA were more appropriate. A 1-week duration was set to ensure between-session within-rater reliability rather than test-retest reliability, as in other studies (13,49). The source of heteroscedasticity in the 9-step SCT seemed to relate to the level of adaptation, whose division produced a large effect size. Age was a confounder for the construct validity of these performance-based tests. Given the burden of knee OA among community-dwelling elderly people in Thailand (22.5% in 2014) (50), the combined effects of sarcopenia and an arthrogenically reduced voluntary activation mechanism on the deterioration of strength and function should be considered (2,51-53).
This study had several limitations. First, the study design was cross-sectional and focused on the change between repeated measurements rather than on responsiveness to change from an intervention. Second, although the proposed sample size of > 50 is adequate (17), heteroscedastic data might need a larger sample than 'adequate'. Third, KOOS-PS had narrow valid content. It showed an inferior ability to assess lower extremity function compared with the sum scores of the KOOS function and sports items (54). The reduction of items might result in under-representation by KOOS-PS. Lastly, the measures used for convergent validation might have been better selected from other relevant outcomes such as quality of life and pain associated with function.

Conclusion
This study evaluated a number of selected psychometric properties of the OARSI recommended minimal core set of performance-based tests in knee OA. The reliability and measurement error estimates of all tests met the acceptable criteria. MDC90 and the random error component of the ratio LOA were calculated to support decisions about real change. Convergent validity was demonstrated only with knee extensor strength, indicating limited construct representation. The performance-based tests had medium to large effect sizes in discriminating between groups by adaptation to stair climbing. The study suggests using absolute reliability calculations specific to homoscedastic or heteroscedastic data.
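To illustrate the data-type-specific calculations recommended here, the sketch below computes SEM, SEM%, MDC90 and LOA for untransformed scores and the ratio LOA for log-transformed scores. It is a minimal sketch and not the study's analysis: the function names and the sample scores are hypothetical, and the SEM is taken as SD_diff/√2, which equals the square root of the two-way ANOVA mean square error only when there are exactly two sessions.

```python
import math
from statistics import mean, stdev


def absolute_reliability(session1, session2):
    """SEM, SEM%, MDC90, MDC90% and LOA for two paired sessions."""
    diffs = [a - b for a, b in zip(session1, session2)]
    mean_diff, sd_diff = mean(diffs), stdev(diffs)
    grand_mean = mean(list(session1) + list(session2))
    sem = sd_diff / math.sqrt(2)          # two-session shortcut for sqrt(MSE)
    mdc90 = 1.65 * math.sqrt(2) * sem
    return {
        "SEM": sem,
        "SEM%": 100.0 * sem / grand_mean,
        "MDC90": mdc90,
        "MDC90%": 100.0 * mdc90 / grand_mean,
        "LOA": (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff),
    }


def ratio_limits_of_agreement(session1, session2):
    """Ratio LOA: antilog(mean_Lndiff) multiplied or divided by antilog(1.96 x SD_Lndiff)."""
    ln_diffs = [math.log(a) - math.log(b) for a, b in zip(session1, session2)]
    bias = math.exp(mean(ln_diffs))                  # systematic bias as a ratio
    random_error = math.exp(1.96 * stdev(ln_diffs))  # random error as a ratio
    return {
        "bias": bias,
        "random_error": random_error,
        "limits": (bias / random_error, bias * random_error),
    }


# Hypothetical chair-stand counts (homoscedastic case).
print(absolute_reliability([10, 12, 14, 11, 13], [11, 12, 13, 12, 14]))
# Hypothetical stair-climb times in seconds (heteroscedastic case).
print(ratio_limits_of_agreement([14.2, 10.5, 20.1, 12.3], [13.8, 11.0, 19.0, 12.9]))
```

A ratio LOA whose limits include 1 corresponds to an LOA that includes 0 on the untransformed scale: in both cases equal between-session scores fall inside the limits.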