External validation of prognostic models for recovery in patients with neck pain

Background: Neck pain is one of the leading causes of disability in most countries and it is likely to increase further. Numerous prognostic models for people with neck pain have been developed, but few have been validated. In a recent systematic review, external validation of three promising models was advised before they can be used in clinical practice.
Objective: The purpose of this study was to externally validate three promising models that predict neck pain recovery in primary care.
Methods: This validation cohort consisted of 1311 patients with neck pain of any duration who were prospectively recruited and treated by 345 manual therapists in the Netherlands. Outcome measures were disability (Neck Disability Index) and recovery (Global Perceived Effect Scale) post-treatment and at 1-year follow-up. The assessed models were an Australian Whiplash-Associated Disorders (WAD) model (Amodel), a multicenter WAD model (Mmodel), and a Dutch non-specific neck pain model (Dmodel). Models’ discrimination and calibration were evaluated.
Results: The Dmodel and Amodel discriminative performance (AUC < 0.70) and calibration measures (slope largely different from 1) were poor. The Mmodel could not be evaluated since neither several of its variables nor their proxies were available.
Conclusions: External validation of promising prognostic models for neck pain recovery was not successful and their clinical use cannot be recommended. We advise clinicians to underpin their current clinical reasoning process with evidence-based individual prognostic factors for recovery. Further research on finding new prognostic factors and on developing and validating models with up-to-date methodology is needed for recovery in patients with neck pain in primary care.


Introduction
Neck pain is common and one of the leading causes of disability in most countries. 1,2 From 2005 to 2015, the prevalence of chronic neck pain increased globally by 21.1% and is likely to increase further. 1,2 Recovery from neck pain-related disability mainly takes place in the first few weeks, without further improvement thereafter. 3 Acute neck pain prognosis may be even worse than currently recognized, which underlines the importance of neck pain prognosis at intake in primary care. 3 Short-term beneficial effects and cost-effectiveness of non-invasive primary care treatment have been reported but long-term effects are still limited. [4][5][6][7]
Prognostic models are obtained by multivariable regression and aim to improve the quality of care for individual patients by estimating the probability of a future health outcome or condition being present by combining patient-specific values of multiple predictors. 8 Accurate prognostic models can be useful for clinicians to support clinical decisions and for research to risk-stratify participants for clinical trials. [8][9][10] Compared to derivation studies, models usually perform less well in external validation studies, and it is recommended first to test models' generalizability and transportability to evaluate whether their predictive performance remains accurate before broad clinical use can be advised. [11][12][13]
Numerous prognostic models for people with neck pain have been developed; however, few have been validated. [14][15][16] In a recent systematic review, three promising models that predict recovery of people with neck pain in primary care were identified. 17 However, their broad clinical use could not be recommended and further external validation was advised. 17 Therefore, the research question of this study was: can these three models be externally validated in a cohort of people with non-specific neck pain treated with manual therapy in Dutch primary care?

Methods
This external validation study including its statistical analysis was performed according to an a priori constructed and approved study protocol complying with internal university procedures. The included models were: 1) the Australian two-way model (Amodel) 18 predicting full recovery and ongoing moderate to severe disability, measured with the Neck Disability Index (NDI) in patients with Whiplash-Associated Disorders (WAD); 2) the multicenter model (Mmodel) 19 also predicting disability measured with the NDI in patients with WAD, and 3) the Dutch model (Dmodel) 20 predicting recovery measured with a Global Perceived Effect Scale (GPES) in patients with non-specific neck pain. Models' characteristics are presented in Table 1. The findings of this study were reported according to the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) recommendations. 21

ANIMO validation cohort
For validation, existing data from the 'Amersfoorts Nekonderzoek of the Master manuele therapie Opleiding' (ANIMO) study were used. Ethics approval was obtained from Erasmus Medical Centre, Rotterdam, the Netherlands (MEC-2007-359). The dataset used and analyzed during the current study is available upon reasonable request. ANIMO is a prospective cohort study that aimed to describe usual care manual therapy for patients with neck pain in the Netherlands and explored outcomes and adverse events of treatment. Patients between 18 and 80 years with neck pain consulting a directly accessible manual therapist were recruited from October 2007 until March 2008. Participants with signed informed consent and treatment indication who submitted baseline data were eligible for participation (n = 1193). The treatment received consisted of usual care manual therapy and may have included specific joint mobilizations, high velocity thrust techniques, myofascial techniques, giving advice, or specific exercises. Further study characteristics are described in detail elsewhere. 22

Measurement procedure
Participants completed socio-demographic characteristics and questionnaires at baseline, immediately post-treatment, and at 12 months. Manual therapists were blinded to the information gathered by patients' questionnaires. At baseline, patients' age, sex, marital status, employment, neck pain duration, neck pain localization, earlier episodes, associated symptoms, current medication, current smoking, current sport, imaging results, additional diagnostics, medical diagnosis, and comorbidities were recorded. Disability was measured using the Dutch versions of the NDI (scale 0-50) 23,24 and the Neck Bournemouth Questionnaire (NBQ, scale 0-70) 25; pain intensity was measured with a 10-point Numeric Rating Scale (NRS, scale 1-10), and pain-related fear was measured with the Dutch version of the Fear Avoidance Beliefs Questionnaire (FABQ-DV, scale 0-96). 26 Outcomes were measured post-treatment at discharge (mean treatment duration 37.9 days, mean number of sessions 4.3) and at 12 months follow-up, using the NDI and a GPES (7-point Likert scale).

Validation procedure
Based on models' predictors available in ANIMO, the Amodel(s) and Dmodel were suitable for validation. 20,27 The Mmodel was considered not suitable because four variables were not collected in ANIMO (i.e. cold pain threshold, impact of events scale, quotient of a sympathetic vasoconstrictor response, and left rotation) and appropriate proxy measures were lacking. 28 To compare case-mix between studies, we examined baseline demographics, models' predictors, and outcome distributions in the models' development studies and in ANIMO, expressed as means with standard deviations or as frequencies with percentages.

Handling of missing values
The ANIMO data contained missing values and we planned to perform several missing value analyses to decide on multiple imputation for main analyses and complete cases for sensitivity analysis. 29-32

Statistical analysis
Statistical validation of models' performance
We compared observed outcomes to those predicted by the models, analyzed the full original models in ANIMO, and based models' performance on discrimination and calibration measures. 10,13,33 The Amodel was analyzed in both the ANIMO trauma subset and the whole dataset. We calculated each model's linear predictor and individual probability ($p(y = 1) = 1 / (1 + e^{-\text{linear predictor}})$) for all participants immediately post-treatment and at 1-year follow-up. 34

Discriminative performance
Discriminative performance indicates whether a model is able to distinguish between patients with and without recovery. It is calculated as the concordance (c) statistic, which is comparable to the area under the curve (AUC) of the Receiver Operating Characteristic (ROC) curve for binary data. 13,35 We a priori considered discriminative performance acceptable if the AUC was ≥ 0.70. 36

Calibration performance
Calibration performance refers to the agreement between a model's predicted risks and observed event rates. 37 Preferably, this is reflected by calibration-in-the-large, a calibration slope, and a calibration plot. 13,38 The Hosmer-Lemeshow goodness-of-fit test is often performed in validation studies; a non-significant test result is taken to indicate that the model fits the data well. 36 The models were re-estimated in ANIMO on a logit scale with the linear predictor as the only predictor to calculate calibration-in-the-large and the calibration slope. 10,13,30 We evaluated calibration as the percentage of deviation from the ideal calibration slope of 1 and the ideal intercept of 0. Calibration plots' probabilities were calculated to observe whether all decile groups closely fit the ideal 45° line of identity. 10,13 We performed statistical validation procedures using IBM SPSS 24.0 and R (version 3.4.3).
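The performance measures described above can be illustrated with a minimal sketch. It is not the analysis code of this study (which used SPSS and R); the function names and the toy data are assumptions made for illustration only. Calibration-in-the-large is computed here under one common convention, as the intercept of a logistic model in which the linear predictor enters as an offset (slope fixed at 1).

```python
import math

def predicted_probability(lp):
    """Inverse logit: p(y = 1) = 1 / (1 + e^(-lp))."""
    return 1.0 / (1.0 + math.exp(-lp))

def c_statistic(y, p):
    """Share of (event, non-event) pairs in which the event has the higher
    predicted probability; tied pairs count as one half."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    non_events = [pi for yi, pi in zip(y, p) if yi == 0]
    score = 0.0
    for e in events:
        for n in non_events:
            score += 1.0 if e > n else 0.5 if e == n else 0.0
    return score / (len(events) * len(non_events))

def calibration_slope(y, lp, iterations=50):
    """Fit logit(p) = a + b * lp by Newton-Raphson; return (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        p = [predicted_probability(a + b * x) for x in lp]
        w = [pi * (1.0 - pi) for pi in p]
        g_a = sum(yi - pi for yi, pi in zip(y, p))           # score for a
        g_b = sum((yi - pi) * x for yi, pi, x in zip(y, p, lp))  # score for b
        h_aa = sum(w)
        h_ab = sum(wi * x for wi, x in zip(w, lp))
        h_bb = sum(wi * x * x for wi, x in zip(w, lp))
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

def calibration_in_the_large(y, lp, iterations=50):
    """Intercept a of logit(p) = a + lp, with the slope fixed at 1."""
    a = 0.0
    for _ in range(iterations):
        p = [predicted_probability(a + x) for x in lp]
        a += sum(yi - pi for yi, pi in zip(y, p)) / sum(pi * (1.0 - pi) for pi in p)
    return a

# Toy data (invented): linear predictors twice as extreme as the observed
# log-odds, i.e. the pattern expected from an overfitted model.
toy_lp = [-2 * math.log(3)] * 4 + [2 * math.log(3)] * 4
toy_y = [1, 0, 0, 0, 1, 1, 1, 0]
toy_p = [predicted_probability(x) for x in toy_lp]

auc = c_statistic(toy_y, toy_p)
intercept, slope = calibration_slope(toy_y, toy_lp)
citl = calibration_in_the_large(toy_y, toy_lp)
```

In this toy example the calibration slope is below 1, which is the signature of predictions that are too extreme; a slope of 1 and an intercept of 0 would indicate ideal calibration.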
Finally, we checked the number of events in ANIMO for a minimum of 100, as advised for validation studies that predict binary outcomes. 39,40

Results

Study characteristics
The baseline characteristics from the ANIMO study and from the original studies are presented in Table 2.

Amodels
The ANIMO subset consisted of people with any trauma and neck pain duration, whereas the original Amodel study included people with acute neck pain due to a motor vehicle crash only. People in ANIMO were recruited and treated in primary care with manual therapy, whereas people in the original study were allowed to pursue any treatment and were recruited from general advertisement and emergency departments. On average, people in the original study were 4.8 years younger compared to the ANIMO trauma subset, scored 17 NDI points higher on disability (0-50 scale), and had 0.9 point more pain (0-10 scale).

Dmodel
There were 8.1% fewer male participants in ANIMO compared to the Dmodel derivation study. Based on the duration of the current episode, the Dmodel derivation cohort had 26% more patients categorized as acute and 13.5% more categorized as chronic compared to ANIMO. In ANIMO, average disability at inception was 1.5 NDI points lower and the average neck pain was 2.4 points lower on an 11-point Likert scale. For the other variables, there were 8.8% fewer people with headache and 20.1% fewer with radiating arm pain. In ANIMO, 2.9% more people had a previous neck pain episode, 24.1% more had concomitant low back pain, and 6.1% more people were employed.

Missing data
There were more than 5% missing data for several baseline variables and all outcome measures (Table 2). Little's Missing Completely at Random (MCAR) test was significant at the p<0.05 level, so we assumed data were not MCAR. Significant differences in means existed for 24 of 91 variables and these differences were small, indicating Missing at Random (MAR). Explained variation of missingness varied from 11 to 100% and missing variables were to some extent associated with the other ANIMO variables. Therefore, we assumed data were MAR.
We applied multiple regression imputation for missing data using all possible predictors and outcomes, as computationally feasible. 29,31,41 We used the Multivariate Imputation by Chained Equations (MICE) procedure and generated 20 imputed sets. 42 Regression coefficient estimates and standard errors were pooled using Rubin's Rules and validation performance measures were estimated in each of the 20 completed datasets and then combined using the median. 30,43 We used imputed data for main analyses and complete cases for sensitivity analysis.
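The pooling step described above can be sketched as follows. This is an illustrative outline rather than the study's own code: the function names are hypothetical, and the input values in the usage note are invented. Rubin's Rules combine the within-imputation and between-imputation variance for a regression coefficient, whereas performance measures such as the AUC are summarized by the median across the completed datasets.

```python
import math
import statistics

def pool_rubin(estimates, standard_errors):
    """Pool one coefficient across m imputed datasets with Rubin's Rules.

    Returns the pooled estimate and its pooled standard error."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    # Within-imputation variance: average of the squared standard errors.
    within = statistics.mean([se ** 2 for se in standard_errors])
    # Between-imputation variance: sample variance of the estimates.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)

def pool_performance(values):
    """Combine a performance measure (e.g. AUC) across imputed datasets
    by taking the median, as described in the text."""
    return statistics.median(values)
```

For example, `pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])` pools three hypothetical coefficient estimates, and `pool_performance` would be called with the 20 per-dataset AUC values.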

Models' performance
The ANIMO smallest outcome groups contained 122, 247, and 40 events at post-treatment for GPE, NDI recovery, and NDI moderate/severe, respectively. At long-term, these numbers were 264, 289, and 45, respectively. These numbers revealed sufficient sample size for the Dmodel and Amodel recovery post-treatment and at long-term. The ANIMO trauma subset did not have a sufficient sample size as it contained 24 recovered people as measured by the NDI and 9 with moderate/severe outcome post-treatment, and 41 and 13 at long-term.
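The sample size check reported above amounts to comparing the size of the smallest outcome group with the minimum of 100 events. A minimal sketch, using the counts given in the text (the function name and dictionary labels are illustrative assumptions):

```python
MIN_EVENTS = 100  # minimum advised for validating models of binary outcomes

def sufficient_events(counts, minimum=MIN_EVENTS):
    """Flag, per outcome, whether the smallest outcome group reaches the minimum."""
    return {name: n >= minimum for name, n in counts.items()}

# Smallest outcome groups in the whole ANIMO cohort, as reported in the text.
whole_post = {"GPE": 122, "NDI recovery": 247, "NDI moderate/severe": 40}
whole_long = {"GPE": 264, "NDI recovery": 289, "NDI moderate/severe": 45}
# ANIMO trauma subset.
trauma_post = {"NDI recovery": 24, "NDI moderate/severe": 9}
trauma_long = {"NDI recovery": 41, "NDI moderate/severe": 13}

for label, counts in [("whole, post-treatment", whole_post),
                      ("whole, long-term", whole_long),
                      ("trauma, post-treatment", trauma_post),
                      ("trauma, long-term", trauma_long)]:
    print(label, sufficient_events(counts))
```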

Discriminative performance
Models' performance measures are described in Table 3.
Analysis of the Amodels in the whole ANIMO cohort at long-term follow-up revealed a discriminative performance for the model that predicts full recovery of 0.43 (95% CI:

Discussion
External validation in a cohort of people with neck pain of a two-way WAD model (Amodel) that predicts disability measured by the NDI, and a non-specific neck pain model (Dmodel) that predicts recovery measured by the GPE, was not successful as their discriminative performance and calibration clearly did not meet expected thresholds. A third prognostic model could not be evaluated in this study because of variable discrepancy across data sets.
The Amodels' discriminative performance was substantially below 0.70 for all time points. However, their discriminative and calibration performance could not be compared with the original studies because these measures were not described there; our study is the first to present Amodels' performance measures. 18,27 The broad confidence intervals of the Amodel for full recovery obtained in the trauma subset included an AUC of 0.70 close to their upper bounds. These broad intervals could be explained by too few events, because the ANIMO trauma subset did not reach the minimum of 100 events in the smallest outcome group. Analysis in the whole ANIMO cohort, containing sufficient events, revealed narrow intervals but with 0.52 as the upper bound value.
The Dmodel's discriminative performance in the original study was 0.66 (95% CI: 0.61, 0.71) at internal validation and 0.65 (95% CI: 0.59, 0.71) at external validation. Our validation study revealed a lower AUC of 0.53 (95% CI: 0.48, 0.58) post-treatment and 0.54 (95% CI: 0.49, 0.58) for long-term predictions. A decrease in discriminative performance from derivation to validation is not unusual. 33 The Dmodel's performance at development was already below our cut-off of 0.70 for AUC, and a 0.12 decrease of an overfitted model in another population with different case-mix is not an unexpected finding. Additionally, there may be little distinction in AUC between our validation study and the development study, as the 95% CIs are close together.
Calibration was also poor for both the Dmodel and the Amodels. At external validation, predictions are often too extreme due to overfitting at the development phase. 44 This results in low predictions being too low and high predictions being too high, as characterized by a calibration slope smaller than 1, which indicates that the original regression coefficients were too large. 13,45,46 We believe case-mix differences could not have been responsible for models' poor performance, as these differences were relatively small.
Comparison of model performance to other studies in the field is hampered: prognostic prediction models in the musculoskeletal field typically do not reach their validation phase and methodological shortcomings are common. In fact, the few models that were evaluated for external validity usually did not present model performance by means of calibration and discrimination measures. 14,17,47

Strengths and limitations
A strength of our study is the analysis in a large cohort using state-of-the-art calibration and discrimination measures. However, there are some limitations we would like to report. First, in ANIMO, multiple independent therapists at multiple sites were involved and the broad CIs derived in the large ANIMO cohort could reflect this measurement variability. Second, the validation data set had substantial missing values, which is not unusual. 48 We applied multiple imputation procedures and a sensitivity analysis on complete cases, which showed comparable values of the performance measures. Third, the EuroQol predictor for the Dmodel and the hyperarousal subscale predictor for the first Amodel were not available in ANIMO, which may have influenced model performance. However, this impact is probably negligible considering the 0.005 beta value for EuroQol. We believe that the NBQ anxious subscale predictor served sufficiently as a proxy for the hyperarousal subscale; moreover, the other Amodel, which did not contain this predictor, performed very similarly. Fourth, the predicted outcomes for the Dmodel at derivation and validation were measured at 6 months and 12 months, respectively. We believe that the impact of these different outcome times is limited, as overall prognosis for neck pain and disability at 6 and 12 months appears to be similar. 49

Implications for practice and research
Based on our findings, the clinical use of these promising models cannot, at present, be advocated. We feel this is a very important message for musculoskeletal clinicians, considering that numerous models that predict outcomes in neck pain are available to clinicians without this crucial step of subsequent external validation; implementing models too early in practice could potentially lead to undesired outcomes for patients. We advise clinicians to underpin their clinical reasoning process at this moment with separate prognostic factors that can be used with more confidence, such as baseline pain intensity, baseline neck disability, age, and past history of musculoskeletal disorders. 50 The low performance of the existing prognostic models indicates that important predictors may not have been included in the models' derivation process, and a further search for valuable model predictors is needed.

Conclusion
External validation of two promising prognostic models on neck pain recovery in primary care was not successful and their clinical use can, at present, not be advocated. Currently, no useful models are available for clinicians to predict outcomes in people with neck pain. New insights on potentially valuable prognostic factors are needed to strengthen models' derivation and updating procedures.

Conflict of Interest
The authors declare no conflicts of interest.