An automated workflow based on hip shape improves personalized risk prediction for hip osteoarthritis in the CHECK study

Objective: To design an automated work ﬂ ow for hip radiographs focused on joint shape and tests its prognostic value for future hip osteoarthritis. Design: We used baseline and 8-year follow-up data from 1,002 participants of the CHECK-study. The primary outcome was de ﬁ nite radiographic hip osteoarthritis (rHOA) (Kellgren e Lawrence grade (cid:2) 2 or joint replacement) at 8-year follow-up. We designed a method to automatically segment the hip joint from radiographs. Subsequently, we applied machine learning algorithms (elastic net with automated parameter optimization) to provide the Shape-Score, a single value describing the risk for future rHOA based solely on joint shape. We built and internally validated prediction models using baseline demographics, physical examination, and radiologists scores and tested the added prognostic value of the Shape-Score using Area-Under-the-Curve (AUC). Missing data was imputed by multiple imputation by chained equations. Only hips with pain in the corresponding leg were included. Results: 84% were female, mean age was 56 ( ± 5.1) years, mean BMI 26.3 ( ± 4.2). Of 1,044 hips with pain at baseline and complete follow-up, 143 showed radiographic osteoarthritis and 42 were replaced. 91.5% of the hips had follow-up data available. The Shape-Score was a signi ﬁ cant predictor of rHOA (odds ratio per decimal increase 5.21, 95%-CI (3.74 e 7.24)). The prediction model using demographics, physical ex- amination, and radiologists scores demonstrated an AUC of 0.795, 95%-CI (0.757 e 0.834). After addition of the Shape-Score the AUC rose to 0.864, 95%-CI (0.833 e 0.895). Conclusions: Our Shape-Score, automatically derived from radiographs using a novel machine learning work ﬂ ow, may strongly improve risk prediction in hip osteoarthritis. This an open CC (http://creativecommons.org/licenses/by/4.0/).


Introduction
Hip osteoarthritis (HOA) is often diagnosed relatively late in the disease process and currently there are no drugs available to modify disease progression. Therefore, HOA treatment is necessarily restricted to education, exercise, weight loss and analgesics. Total hip replacement (THR) often follows when these do not suffice 1 .
To guide current care and develop interventions to modify the disease course, accurate prediction of HOA development in persons presenting with hip complaints is important. Many risk factors for HOA are reported in the literature. However, no established risk prediction tool for HOA is currently available. The rise of automated image processing techniques using artificial intelligence, offers the possibility to extract information from images beyond traditional visual interpretation. For example, deep neural networks can be used on computed tomography scans to classify arterial calcifications or pulmonary peri-fissural nodules 2,3 .
Shape variations in the hip play a role in the development of HOA 4,5 . Geometric measurements for assessing hip dysplasia or cam morphology are used as clinical tools 6 , but only describe particular components of the hip shape. Statistical Shape Models (SSMs) have the potential to quantify the overall shape variation of the bone, including more subtle variations 7e10 . However, SSMs require labor-intensive manual input to outline (i.e., segment) shapes with landmark points, hampering their use in large study populations. Therefore, we developed a segmentation software system to automatically extract hip shape from standard pelvic radiographs 11,12 . This study describes the development and validation of a prediction tool for future HOA in a large cohort of persons with hip pain that had never or only recently (<6 months) consulted a physician for their complaints. Our prediction tool, Shape-Score, utilizes overall hip shape based on SSMs, using our segmentation software system on standard pelvic radiographs. Moreover, we quantified the added predictive value of our tool over clinically available predictors alone.

Participants
Cohort Hip and Cohort Knee (CHECK) aimed to examine the course of early OA in the hip and/or knee 13 . Between October 2002 and September 2005, 1,002 participants were enrolled in 10 participating centers throughout the Netherlands. Possible candidates were approached by their general practitioner and/or recruited via local media. Participants were aged 45e65 years at the time of inclusion and had pain and/or stiffness in at least one knee and/or hip. They had not, or only recently (<6 months) consulted a physician for these complaints. Exclusion criteria were (i) pathology other than OA explaining symptoms, (ii) expected inability to complete 10-year follow-up, and (iii) inability to sufficiently understand Dutch. Radiographic knee OA (defined as KL 2 or higher) was not present in patients at baseline.

Measurements Demographics
Age, gender, BMI and current smoking (yes or no) were registered. Highest education level was scored on a scale from 1 to 8, as a proxy for socio-economic status. The scale is described in Table I.

Clinical examination
Trained health professionals registered hip pain, when pain was present around the groin/buttock/upper leg. Additionally, knee pain was registered if pain was present around the knee (possible referred pain). The Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) total score was used to summarize pain, stiffness, and physical function. Analgesic use and morning stiffness were registered as present or absent. Active internal hip rotation was measured using a goniometer, according to Norkin and White 14 . Pain during internal rotation was registered (yes/no).

Basic radiographic parameters
Standardized weight-bearing anteroposterior pelvic radiographs, with 15-degree internal hip rotation were made. Presence of joint space narrowing (JSN), osteophytes and thickening of the femoral calcar (buttressing) were scored (yes/no) on baseline radiographs by five trained observers as previously described 15,16 .

Automatic quantification of subtle shape variations
SSMs provide a global representation of shape rather than reducing shape to a set of geometric measurements, enabling quantification and analysis of complex and subtle shape aspects. Using predefined (anatomical) landmark points, an object, such as the bones of the hip joint, can be outlined and segmented. Based on all landmark points across a set of images, an Statistical shape modelling (SSM) can be generated by applying principal component analysis to the aligned shapes 17 . The SSM then describes every shape by the combination of a mean shape and a linear combination of a number of shape modes. Each mode describes a distinct shape aspect. The first shape mode explains the highest proportion of variation across the dataset and each additional mode explains a smaller part of the total variation.
We developed a fully automatic segmentation system (FASS) to segment the hip using 75 landmark points (Supplementary text and S1 Figure and S2 Figure) 11,12,18 . We used all 1,373 baseline pelvic radiographs of sufficient quality with manual segmentations available from previous work, enabling to compare the predictive value of the data produced by the FASS to the data produced by manual segmentations 18 . Below, we give a concise overview of the development of the Shape-Score using FASS/SMM shape modes. More detailed information and a comparison between Shape-Scores from manual vs automatic segmentations is provided in the Supplementary text.
We used the first 26 SSM modes, explaining 90% of the overall shape variation across our dataset to develop the Shape-Score. While individual shape modes are independent by the nature of SSMs, the simultaneous effect of two (or more) shape modes on the risk for OA may interact (e.g., be multiplicative instead of additive). For example, a mode describing cam morphology may strengthen the effect of a mode with increased acetabular coverage (i.e., pincer morphology). The theoretical explanation would be that an increased acetabular coverage causes cam morphology to impinge earlier against the acetabulum, which might increase the risk of labral damage and subsequent HOA (Fig. 1) 19 . Considering interactions between all 26 modes produces 325 combinations (the sum of the arrhythmic row from 1 to 25 with a common difference of one). Adding the 26 separate modes would produce a total of 351 variables, a number too large for standard regression techniques. Therefore, we used a penalized regression technique (an elastic net) suited for high dimensional data to relate all these variables to the incidence of HOA and produce a single score representing HOA risk based on hip shape (Supplementary text). The resulting Shape-Score ranges between 0 and 1, and contains various clear and subtle aspects of hip geometry. Compared to the low-risk shape, the high-risk shape shows a cam morphology (an aspherical femoral head-neck junction), a narrower superior joint space, decreased acetabular depth in combination with lateralization of the femur. Additionally, the femoral shaft is narrower, while the femoral neck is wider. However, these shape variations do not have to coincide within a patient and a single high risk shape variation may increase the Shape-Score (Fig. 2).

Prediction model development and performance testing
To develop the prediction model we only used data of hips with pain around the hip and/or the knee of the respective leg (possible referred pain) as these are the joints where this prediction will most likely be used for in clinical practice. This is in contrast to the development of the shape model itself, in which all available baseline pelvic radiographs with data on HOA on follow-up were used. Furthermore, for a hip to be included in the prediction model a baseline radiograph of sufficient quality had to be available. Depending on these criteria, one or both hips of a participant were included in the analyses. Baseline predictors were used to predict the outcome, radiographic hip osteoarthritis (rHOA) at 8-year follow-up, defined as a KellgreneLawrence grade (KL-grade) 2, or THR 16,20 . All predictors were measured at baseline and categorized as demographics, clinical examination, basic radiographic parameters, or Shape-Score. To account for missing predictor and outcome data, we imputed 15 datasets using predictive mean matching and logistic regression 21 . We performed a sensitivity analysis including only hips with complete data using logistic regression and generalized estimating equations (GEE).
To develop the prediction model, logistic regression was used and predictors were added per category. First demographics were added, secondly variables from the clinical examination, thirdly basic radiographic parameters, finally the Shape-Score. After each step (addition of a category of predictors) we simplified the model by removing redundant predictors from the added category only, using backwards selection with a pooled alpha-level of 0.15 21,22 . To optimize parameter estimates for predictors and avoid overfitting, we used logistic ridge regression on each imputed dataset separately. Optimal penalties were based on corrected Akaike's Information Criteria 23 . When using multiple imputation, the imputed values per imputed dataset may differ, as they are drawn from a distribution. We averaged the intercepts and parameter estimates of the ridge regression models from all imputed datasets, to obtain formulas to calculate individual risks for rHOA or THR in future research or clinical work (Supplementary text) 23,24 . We calculated predicted risks and stratified all hips into arbitrary risk categories of <20%, 20e50% and >50% risk for OA. We calculated positive and negative predictive value for the low (<20%) and high risk (>50%) categories. For the low-risk category the absence of OA at 8-year Interaction between shape modes. The risk for HOA produced by a shape mode may depend on the presence of other shape features. In this hypothetical example, shape mode A represents cam morphology (Aspherical femoral head-neck junction), the shape mode B represents increased acetabular coverage. Both shape modes have a risk factor for OA. When both features coincide in one hip, the risk for OA may be greater than the sum of two risk factors from shape modes A and B. In C we combine the femur of mode A with the pelvis of mode B and simulate hip motion by applying 15 degrees of abduction to the femur. The risk for femoroacetabular impingement becomes clear and is very plausible. Femoroacetabular impingement may increase the risk of labral damage and subsequent HOA.
follow-up was considered a positive gold standard. For the highrisk category the presence of OA at 8-year follow-up was considered a positive gold standard. Performance of the model was further assessed in terms of calibration, i.e., the agreement between predictions and observed outcomes, as well as discrimination, i.e., the ability of a model to differentiate hips with developing hip OA from those which will not 25 . To assess calibration of the model, we plotted the percentage of observed OA cases in groups of hips with increasing predicted risk (n ¼ 23 per group to create 50 data points in the plot). Using a lowess smoothing function, we visualized the calibration of the model. We assessed discrimination by calculating Area Under the Curve (AUC) statistics with bootstrapped 95% confidence intervals pooled over the 15 imputed datasets. Any prediction model will perform better in the dataset used to train the model compared to a dataset containing new patients. Therefore, performance measures based on training data will be over-optimistic, also known as overfitted 26 . To internally validate the models and estimate their performance in new patients we used bootstrapping, a resampling method. We drew 1,000 bootstrap samples per imputed dataset and pooled the AUC and calibration plots to test for over-optimism of both calibration and discrimination 26 . All data analyses were performed using R v3.3.1. with MICE v2.30, caret v6.0e73, rms v5.1-0, and glmnet v2.0-5.

Results
Baseline characteristics can be found in Table I. Of 1,044 hips with data on KL-grade or THR at 8-year follow-up, 143 showed KLgrade 2 or higher and 42 had undergone THR. Among the demographics included as predictor in the initial model, smoking status was non-significant (p-value 0.74) and removed from the model. Among the predictors from clinical examination, morning stiffness (p-value 0.46) and pain on internal rotation (p-value 0.17) were removed. All basic radiographic parameters were significant (at the alpha level of 0.15) and were retained. Predictors that were retained in the models are shown in S1 Table with their respective unpenalized odds ratios. The formulas used to calculate the predicted risks are given in Supplementary text.

Model performance
The discriminative ability of the models improved each time an additional category of predictors was added, meaning that the models' ability to separate cases from non-cases increased (Fig. 3, Table II). A model containing only the Shape-Score discriminated comparable to a model combining demographics, clinical examination and basic radiographic parameters (AUC 0.798 vs 0.795).
Adding the Shape-Score to the latter model improved the discriminative ability from an AUC of 0.795 to an AUC of 0.863. Adding the Shape-Score also improved the calibration of the prediction model (Fig. 4). The calibration curve is very close to the diagonal representing optimal fit, meaning that the predicted risk closely resembles the observed risk for rHOA or THR.
Calibration slopes and AUCs in internal-validation based on bootstrapped samples differed minimally from those in development, indicating that the predictive models are not overoptimistic (AUC 0.795 and 0.864 respectively (Table II and Fig. 4). The distributions of predicted risks resulting from each of the models show that adding the Shape-Score helps to stratify more medium-risk patients into low and high-risk categories (Table III). In the sensitivity analysis, using both logistic regression and GEE on hips with complete data only, AUC values were within 0.01 of the values found using the imputed datasets, and calibration plots were comparable.

Discussion
We developed and internally validated models to predict incident rHOA or THR over 8 years in persons with first onset hip pain. Until now, no predictive model for HOA is widely used. We built a prediction model that combines innovative automated analysis of plain radiographs using machine learning, with clinical data that can easily be obtained (Fig. 5). The discriminative ability of the final model was high (AUC 0.863) given the relatively early stage of possible HOA at baseline and rHOA or THR as outcomes at 8-year follow-up.
In the literature, multiple prediction models are available for HOA. However, most are actually diagnostic models, aiming to diagnose HOA cross-sectionally, situated in a population or end-stage OA cohort. Saberi Hosnijeh et al. recently developed a prediction model for HOA in the Rotterdam cohort 27 . Their model uses demographics, urinary CTX-II levels and radiographic parameters including the Wiberg-angle and alpha-angle (to quantify acetabular coverage and cam morphology, respectively), but no parameters from the physical examination or SSM. Their model showed an AUC of 0.82 in the Rotterdam cohort and 0.71 when validated in CHECK. Furthermore, calibration in CHECK was far off the perfect slope, with observed risks being 2.5 times higher than predicted risks. Developing the model in a general population cohort and testing it in a target population with hip pain likely caused this. Our prediction model was developed in CHECK, which represents our target population, and includes parameters from physical examination and the Shape-Score. External validation was not performed as most cohorts focus on OA in later stages and/or do not have pelvic radiographs of sufficient quality available. Nevertheless, Fig. 2. High vs low risk Shape-Score. The left shows a schematic representation of the mean shape of the 5% highest Shape-Score (high risk for future HOA, in red) and 5% lowest shape-score (low risk for future HOA, in green). Compared to the low-risk hip, the high-risk hip is characterized by a cam morphology (femoral head-neck asphericity), a narrower superior joint space, and decreased acetabular depth in combination with lateralization of the femur and a higher neck-shaft angle. Additionally, in the high-risk hip the femoral shaft is narrower, while the femoral neck is wider. These shape variations, however, do not have to coincide within a patient and a single high risk shape variation may increase the Shape-Score. In the middle and on the right, a real radiograph of a low risk and high risk hip are shown.
internal bootstrap validation suggested that our model is not overoptimistic 26 . In the future, external validation should be performed, preferably in a cohort with symptomatic patients prone to HOA.
Strengths of this study include the use of a large prospective cohort with clinical complaints and inclusion criteria that allude early-stage knee and/or hip OA, with an adequate follow-up time and a sufficient number of incident rHOA or THR after 8 year (185) to test the 16 predictor candidates for the prediction models 28 . We used backward selection on clusters of predictors to mimic the flow of information in clinical care. While this may produce a slight reduction in absolute performance of the models compared to a fully data driven method, it improves the applicability of the models in clinical care and reduces the chance of overfitting. We tested the association between the Shape-Score and baseline clinical OA characteristics. The Shape-Score was related to hip OA characteristics but not knee OA characteristics (Supplementary text). Furthermore, follow-up data were rather complete and we used multiple imputation to reduce bias and increase precision of our analyses. Finally, we used optimism-adjustment methods throughout the development and validation of the models to reduce overfitting and overoptimistic results. By using bootstrap validation instead of multi-fold cross-validation we used the data available more efficiently 29 .
Combining rHOA (KL-grade 2) and THR as disease outcome may be debatable. The severity of clinical and radiographic symptoms correspond poorly in HOA, so that rHOA and THR may not always represent similar processes 30 . However, THR most often results from both clinical symptoms and radiographic signs. For a number of participants we included both hips. We used logistic regression analysis, which does not incorporate intra-participant correlation. However, GEE was not applicable in combination with the statistical packages used in the analysis, and mixed models regression had problems to converge when used on the available data. In the sensitivity analysis on the hips with complete data, the results between logistic regression analysis and GEE were very comparable.
The relationship between sex and HOA is less clear. In our models, female sex was initially associated with an increased risk of rHOA or THR, but with a decreased risk after adding the Shape-Score. This suggests that gender differences in OA risk may be related to hip shape. Higher BMI is a well-known risk factor for knee and hand OA, but its relationship with HOA is less clear 31e33 . In our models, a higher BMI had even a mild preventive effect for Fig. 3. Discrimination of the prediction models. The ability to separate cases from non-cases, is visualized as area under the curve with the 95%-confidence interval. Sensitivity and specificity of the model improved for all cut-off points after adding the Shape-Score to the model. HOA. Education served as a proxy for socio-economic status in our study. Higher educated persons had a lower risk for HOA or THR, in line with literature 34 . Smoking did not predict rHOA or THR. Although some studies show a protective effect of smoking, this effect may be caused by selection bias 35 . Pain in the hip area (groin/buttock/upper thigh) increased the risk for HOA, which is in correspondence with literature 36 . Pain around the knee sometimes directs a physician to search for a diagnosis in the knee only. However, hip OA should always be considered as a source of the pain 37 . Limited or painful internal rotation are clinical signs that suggest HOA and may predict THR 36,38e40 . In the present study, pain with/during internal rotation had a significant univariate relation with OA on follow-up (OR 1.7, 95% CI(1.3e2.3), but was eliminated from the prediction models as it's p-value was 0.17. However, the range of internal rotation was included in the prediction models, perhaps overrunning the weaker predictive effect of pain on internal rotation. The WOMAC is a tool to measure pain, stiffness and physical functioning in patients with knee and/or hip OA. While it is widely used, its predictive value for incident rHOA is unknown 41 . In this study, baseline WOMAC score was associated with future rHOA or THR. Morning stiffness is included as a diagnostic criterion for HOA in the widely used Altman criteria for HOA, and showed a high sensitivity, but low specificity 38 . Its predictive value as a risk factor for HOA is doubtful and in the present study it did not add predictive value to the models, perhaps because morning stiffness is a non-specific  Fig. 4 shows the predicted probabilities plotted against the observed outcomes in internal validation. This is used to assess the calibration for five different models. The striped black line represents a perfect match between predicted probabilities and observed outcomes, and thus perfect calibration. The dotted black line represents the calibration in training data. The colored lines each represent a different imputed dataset, and represents the mean calibration in validation, using a 1,000 bootstraps. A. Demographics. B. Demographic and clinical examination. C. Demographics, clinical and standard radiographic examination. D. Demographics, clinical, standard radiographic examination and Shape-Score produced using the fully automatic search model. The numbers given are averages over 15 imputed datasets. For the low-risk category, absence of OA at 8-year follow-up was seen as a positive gold standard. PPV should be interpreted as the probability of not developing OA when being classified as low-risk (<20 % chance). For the high-risk category, presence of OA at 8-year follow-up was seen as a positive gold standard. PPV should be interpreted as the probability of developing OA when being classified as high-risk (>50 % chance). symptom 27 . Surprisingly, the use of analgesics was negatively associated with the future risk for rHOA or THR. Possibly, this is the case because analgesics are used more often in acute pain caused by transient disorders compared to the more elongated pain trajectory in OA. JSN and osteophyte formation may be present before definite rHOA (defined as KL-grade 2) can be confirmed. These radiographic signs are known risk factors for progression to definite rHOA and indeed were strong risk factors in the present study 27 .
Buttressing, thickening of the medial femoral neck, is a radiographic sign associated with rHOA 42 . The predictive value of buttressing has not been described before but it showed to be a significant risk factor for future rHOA or THR in this study. The Shape-Score may also include JSN and osteophyte formation, characteristics used to define KL-grades. However, the current study shows that the Shape-Score has added predictive values on top of traditional radiographic characters alone.
Our Shape-Score incorporates (i) cam morphology (Aspherical femoral head-neck junction) (ii) decreased acetabular depth, and (iii) a higher neck-shaft angle as risk factors for OA. Cam morphology, decreased acetabular depth and a higher neck-shaft angle have been shown to increase risk for OA in large cohorts before 7,43e48 .
Currently, it is challenging for clinicians to predict future hip OA in patients with early-stage joint pain that cannot be explained otherwise, especially for patients who don't have definite signs of OA on the radiograph. Some of these patients will develop OA, while other might have (had) hip pain for other reasons and will not develop OA. The proposed prediction model could help clinicians to optimally inform patients about their personalized future risk for disease and to choose appropriate treatment (intensity) and may boost treatment adherence.
In the future, the proposed FASS and prediction model could be integrated into a software package that can be linked to the electronic patient record (including PACS for radiographic images). This way, the Shape-Score could be derived fully automatically to assist clinicians in estimating the risk for future HOA. Of note is that the implementation of the proposed prediction model is not dependent on time-consuming visual methods that may be subject to inter-/intra-observer variations. However, variation in positioning during image acquisition may cause differences in Shape-Score values and a standardized acquisition protocol, as used in the current cohort, is necessary. Unfortunately no data was available to assess the effect of.
Clinical trial inefficiency plays a major role in the current absence of disease-modifying OA drugs (DMOADs). By specifically selecting participants at an increased risk of incident OA, potential DMOADs have more potential to demonstrate detectable effects in a clinical trial. By adding the Shape-score or by using the Shape-Score only, we were able to stratify people for the risk of future rHOA or THR. When the Shape-Score is added to a screening with demographics, physical examination and basic radiographic parameters, 47% more patients could be stratified into the high-risk category (>50% risk), potentially improving screening efficiency.
We have developed an automatic Shape-Score tool, using machine learning algorithms, to optimally predict the risk for incident rHOA or THR based on hip shape as given by a pelvic radiograph. We demonstrated the added value of our Shape-Score in prediction models using easily obtainable parameters in persons with hip pain. Models including the Shape-Score had superior discriminative ability over models without and showed very good calibration. The Shape-Score may therefore prove to be a valuable tool for both patient care and research.

Contributions
WG, WvS, RA, and CL contributed to the data collection. WG, PW, and CL contributed to the statistical analyses. All authors contributed to the study design and interpretation of the data. All authors contributed to the preparation of the manuscript and gave final approval of the current version for submission.

Competing interests
Drs. Cootes and Lindner have a patent US 9928443, EP 2893491 issued.  5. Workflow to calculate personalized risk for future hip osteoarthritis. A. A standard weight-bearing pelvic radiograph is made in the anteroposterior direction with 15degrees internal rotation. B. The fully automatic segmentation system (FASS) annotates the anatomical landmarks on the radiograph. C. Statistical shape modelling (SSM) quantifies hip shape. D. The machine learning algorithm produces the Shape-Score, a single value representing the risk for incident HOA based on hip shape. E. Demographics, questionnaires, clinical examination, and basic radiographic parameters are assessed by trained personnel. F. Data from demographics, questionnaires, and physical examination and basic radiographic parameters are combined with the Shape-Score in easy to calculate formulas to produce accurate personalized risk for future hip osteoarthritis.

Role of funding sources
The CHECK-cohort study is funded by Reuma Nederland.
The FASS and SSM are available via http://bone-finder.com/ FASS and SSM are available via http://bone-finder.com/