Machine Learning Based Texture Analysis of Patella from X-Rays for Detecting Patellofemoral Osteoarthritis

Objective is to assess the ability of texture features for detecting radiographic patellofemoral osteoarthritis (PFOA) from knee lateral view radiographs. We used lateral view knee radiographs from MOST public use datasets (n = 5507 knees). Patellar region-of-interest (ROI) was automatically detected using landmark detection tool (BoneFinder). Hand-crafted features, based on LocalBinary Patterns (LBP), were then extracted to describe the patellar texture. First, a machine learning model (Gradient Boosting Machine) was trained to detect radiographic PFOA from the LBP features. Furthermore, we used end-to-end trained deep convolutional neural networks (CNNs) directly on the texture patches for detecting the PFOA. The proposed classification models were eventually compared with more conventional reference models that use clinical assessments and participant characteristics such as age, sex, body mass index(BMI), the total WOMAC score, and tibiofemoral Kellgren-Lawrence (KL) grade. Atlas-guided visual assessment of PFOA status by expert readers provided in the MOST public use datasets was used as a classification outcome for the models. Performance of prediction models was assessed using the area under the receiver operating characteristic curve (ROC AUC), the area under the precision-recall (PR) curve-average precision (AP)-, and Brier score in the stratified 5-fold cross validation setting.Of the 5507 knees, 953 (17.3%) had PFOA. AUC and AP for the strongest reference model including age, sex, BMI, WOMAC score, and tibiofemoral KL grade to predict PFOA were 0.817 and 0.487, respectively. Textural ROI classification using CNN significantly improved the prediction performance (ROC AUC= 0.889, AP= 0.714). We present the first study that analyses patellar bone texture for diagnosing PFOA. Our results demonstrates the potential of using texture features of patella to predict PFOA.


Introduction
Osteoarthritis (OA) is the most common degenerative joint disease causing disability. Knee is the most frequently affected of all joints. Although the etiology of OA is not fully known yet, it is a complex disease resulting from combinations of risk factors including aging, female sex, obesity, past injury, and inflammation [1]. The disease has a profound effect on quality of life affecting both physical function and well-being. Moreover, prevalence of knee OA is increasing due to aging population and increasing rates of obesity [1]. Overall, the costs of knee OA are substantial. There is no known cure for OA, and the disease is often progressive characterized with pain and loss of joint function. Therefore, more research is needed to better understand the disease. Imaging biomarkers is one of the exciting areas that can advance our understanding. Together with the developments in artificial intelligence, particularly deep learning, there is an increasing number of studies based on medical imaging of knee OA [2]. Most of these studies, however, focuses on tibiofemoral joint of the knee that consists of two condyloid articulation in which the femur has a rolling and sliding motion over the tibia [2]. However, there is a third articulation called patellofemoral (PF) joint ( Figure 1). Osteoarthritis that occur at the PF joint, patellofemoral osteoarthritis (PFOA), is highly prevalent and clinically important yet still under-investigated [3]. For instance, there are clear diagnostic guidelines for tibiofemoral (TF) joint OA ( [4,5,6,7]). Often these guidelines involve the use of well-validated radiographic grading systems for evaluating the severity of TFOA, such as Kellgren and Lawrence (KL) grading system [8] and Osteoarthritis Research Society International (OARSI) standardized atlas [4]. However, there are no similar diagnostic guidelines explicitly developed for PF joint. Moreover, in clinical setting, radiographic TF joint is often evaluated with a posterior-anterior (PA) view radiograph from which the PF joint can not be evaluated at all. In addition, clinical assessments and participant characteristics alone cannot be used to diagnose patellofemoral OA [9,10,11]. Since it has been reported that, at least in some phenotypes of knee OA, PF joint changes even precede TF joint changes ( [12,13,14]), there is a strong need to consider also PF joint when developing new imaging biomarkers to diagnose and monitor knee OA .
In the OA research field, several studies have demonstrated the use of texture analysis for knee radiographs to distinguish knees with OA and without OA [15,16,17,18]. However, these studies are all focusing on TFOA. With regard to PFOA, we showed recently that it can be detected using deep convolutional neural networks (CNNs) [11]. However, such deep learning based models are "black-boxes" and it is uncertain whether the model decisions are based on texture or some other imaging features such as shape or alignment. Therefore, in this study we investigated automated texture analysis of patella from lateral view radiographs. Specifically, we studied whether texture features can be used to distinguish the knees with definite radiographic PFOA from healthy knee radiographs. We proposed a framework based on machine learning and performed extensive experiments to demonstrate the influence of texture features on the diagnostic performance. Different region of interests (ROIs) from patellar region from lateral view radiographs (X-rays) were explored. Subsequently, we compared the performances of hand-crafted features, deep CNN features, and clinical features. Finally, we propose a stacked model where both patellar texture and clinical feature predictions are combined with a second level machine learning model -Gradient Boosting Machine (GBM) [19]. To the best of our knowledge, this is the first study to evaluate patellar bone texture in OA research.  Figure 2: Illustration of the workflow of our approach. First, we localized patellar landmarks using BoneFinder software (see Materials and Methods for more details). Subsequently, we applied intensity normalization and resampled the data to have a 0.2 mm pixel spacing. Finally, each knee is rotated in order to have an aligned patella. Three different regions of interest (ROIs) were located using patellar bone landmarks, after which a deep convolutional neural network (CNN) model was employed to predict the patellofemoral osteoarthritis (PFOA). Furthermore, we also trained a gradient boosting machine (GBM) model using handcrafted texture features (local binary patterns). To make a comparison of the proposed X-ray based methods, we trained another GBM model using clinical variables including age, sex, body mass index (BMI), the total Western Ontario and McMaster Universities Arthritis Index (WOMAC) score, and Kellgren and Lawrence (KL) score of the tibiofemoral joint. We used a stratified subject-wise 5-fold cross validation setting to measure the performance of all the methods. In addition to these individual models, we fused the predictions from these models in a second layer GBM model to improve the overall prediction performance.

Materials and Methods
The overall pipeline of our study is shown in Figure 2. In order to pre-process the data, we extracted anatomical landmark points [20] and applied intensity normalization using global contrast normalisation and a histogram truncation between the 5 th and 99 th percentiles. We then resampled the data to have a 0.2mm pixel spacing. Subsequently, we located ROIs using landmark points. Right knee images were then horizontally flipped to have a similar view with left knee images. Finally, we predicted PFOA using both handcrafted texture features using a machine learningbased approach and a deep CNN. We also trained a machine learning model (GBM [19]) on clinical features as a reference method to compare with the proposed approach.

Data
In the study, we used data from the Multicenter Osteoarthritis Study public use datasets (MOST, http://most.ucsf.edu). The MOST study is a longitudinal observational study of adults who have or are at high risk for knee OA. At baseline, there were 3,026 individuals aged 50-79 years who either had radiographic knee OA or were at high risk for developing the disease. In MOST, semiflexed lateral view radiographs were acquired according to a standardized protocol. Knee radiographs were read from the baseline to 15, 30, 60 and 84-month follow-up visits. We employed radiographs taken at the baseline visit that includes 5507 knees after removing data which have missing information. Out of 5507 knees, 953 had PFOA(17.3%).
In the MOST public use datasets, radiographic PFOA is defined from lateral view radiographs as follows: Osteophyte score ≥ 2 or the joint space narrowing (JSN) score is ≥ 1 plus any osteophyte, sclerosis or cysts ≥ 1 in the PF joint (grades 0-3; 0=normal, 1=mild, 2=moderate, 3=severe) ( Figure 3).  Individual radiographic features in the MOST dataset were graded by two independent expert readers based on the atlases from the Osteoarthritis Research Society International (OARSI) [4] which refers to the previous OARSI atlas for the patellofemoral joint [21] and Framingham Osteoarthritis Study [22]. When there was a disagreement in film readings, a panel of three adjudicators resolved the discrepancies [23].

Patellar Landmark Localization
In this study, we utilized BoneFinder ® software [20] in order to locate the landmark points along the contours of the patella (Figure 4). BoneFinder ® uses random forest Figure 4: Anatomical patellar landmarks were detected from lateral view knee radiographs. BoneFinder ® software [20] were used to locate 21 landmark points along the contours of the patella. Marginal landmarks were employed in order to align the patellar region. The line on the right between the marginal points were draw in order to illustrate the final alignment of patella.
regression voting with constrained local model approach to automatically locate 21 anatomical landmarks in patellar region from lateral knee X-rays. These points enabled us to locate ROIs within the patella with precision. Moreover, we used two marginal landmarks in order to align patella to obtain the same anatomical region among the knees in the ROI localization step ( Figure 4).

Texture Descriptor -Local Binary Patterns (LBP)
Previously in OA research, multiple texture descriptors have been applied to plain knee radiographs for several years, Fractal Signature Analysis (FSA) or fractal dimension (FD) being the most widely used ones [24,25,17,26,15]. However, FD method is susceptible to limited discrimination power due to its sensitivity to image artifacts and noise [27]. It was shown in [15] that Local Binary Patterns (LBP) [28] yields better performance in describing bone texture to detect OA for tibiofemoral joint. Therefore, in our study, we selected LBP as a handcrafted texture descriptor as well.
LBPs captures a local representation of texture. In order to compute LBP value of a pixel (c), p neighboring pixels that are evenly distributed in angle on a circle of radius r are sampled first. Then the LBP pattern is constructed by comparing the gray value of the center pixel (c) with its surrounding neighborhood of p pixels and a binary vector of p bits is extracted from this comparison. The decimal value of LBP pattern is then obtained from the binary sequence. For an N × M texture image, a LBP pattern can be the computed at each pixel -(N × M ) LBP patterns-, then the image texture can be represented by the distribution of LBP values, by a LBP histogram vector. In this study, we used r = 2, p = 8 × r and a 256 bins histogram to obtain LBP descriptions of our texture patches.

Selection of Region of Interest (ROI)
We explored three different ROIs within patellar region for texture analysis ( Figure  5). It is known that along with the progression of OA, bone is subject to changes in its structure and composition [29,30]. The radiographically most distinctive bony changes occur at the margins where osteophytes typically occur [15]. Therefore, we selected two ROIs from the inferior and superior region of patella . Their sizes are proportional to the height of the patella measured from the outermost points (20% of the patella height). To locate those ROIs, we used the marginal landmarks which were also utilized previously for alignment ( Figure 4). As the third ROI, we utilized the whole patellar region. For segmenting patella from the background, we employed landmark points and obtained a smoothed spline that roughly approximates the contour ( Figure 5). In order to detect the most informative (optimal) ROI, we employed LBP descriptor. Here, we defined the most informative region as the subregion where the texture classifier (based on LBP features) performs best to distinguish OA samples from non-OA.
We used Gradient Boosting Machine (GBM) classifier based on decision tree algorithms [19] to predict PFOA using LBP features that were extracted from ROIs defined previously ( Figure 5). We observed that superior ROI features yields better classification performance (See Results section). Therefore, we utilized superior ROI ( Figure 5a) in the subsequent experiments.

Machine Learning Models
We employed both GBM and deep CNN methods to predict PFOA from the texture patches and clinical assessments and participant characteristics. GBM is a powerful decision trees based machine learning algorithm that rely on the concept of boosting "weak learners" [19]. GBM is an iterative process -in each iteration training set is re-weighted such that it compensates for the weaknesses of the previous model. In this study, we used an efficient implementation of GBM called LightGBM [31]. Table 1 summarises all the models developed in this study.
First, we extracted LBP texture features from superior ROI texture patches and trained a GBM model based on these features (Model1). Then, we used a deep CNN architecture that was trained from end-to-end to classify superior ROI texture (Model6). In this way, texture features were learned from the imaging data itself. CNNs captures both fine-level high spatial-frequency details such as edges, lines, texture, corners, etc and recognize more complex features as it gets deeper (as the number of convolutional layer increases). We refer reader to [32] and [33] for more information on the deep neural networks. We used a three layers' CNN model where the input patches are scaled to 64 × 64 images. Details of the CNN architecture and our training strategy can be found in the Supplementary Material. In this study, we also explored more conventional machine learning based prediction models using the clinical data and risk factors. These include age, sex, body-mass index (BMI), the total Western Ontario and McMaster Universities Arthritis Index (WOMAC) score, and the KL grade of the tibiofemoral joint (Model2,3,4,5). These reference methods were also built using the GBM classifier.
In all models, we employed subject-wise stratified 5-fold cross validation. Subjectwise splitting is used to eliminate the subject-dependent bias between training and validation. That is, all the data (imaging and/or clinical) coming from a particular subject is either put in the training or the test set. Moreover, we used stratified folds where each fold represents the actual class distribution of the data. The class distribution is the ratio between the positive and negative samples. The same folds were used for all the models to have fair comparisons. All the models were trained separately, thus reported performances were derived from the separate models.

Statistical Analyses
The models were assessed using Receiver Operating Characteristics (ROC) curves, Precision-Recall (PR) curves, and Brier score [34]. In ROC curves, the true positive rate (TPR) against the false positive rate (FPR) are plotted, whereas Precision-Recall (PR) curves are composed of the precision and the true positive rate. Therefore, PR analysis is more focused on the positive class. When data is imbalanced (i.e. more negative samples than positive samples), the ROC curve might not reflect the true performance of the classifier as false positive rate increases more slowly because of the large numbers of negatives. We used the area under the ROC curves (ROC AUC) and similarly area under the PR curves (Average Precision; AP) to summarize model performances. Brier score equals to the mean squared error of the prediction. In order to compare the differences between model AUCs, we applied DeLong's test [35].

Comparison of Region of Interest (ROI)
Firstly, we compared the performance of the texture descriptor (LBP) on the superior ROI, inferior ROI, and also on the whole patella in subject-wise stratified 5-fold cross validation setting ( Figure 6). Here, we used GBM models to predict PFOA. From Figure 6, it can be seen that the model employing superior ROI features performs best yielding an AUC of 0.884 (0871-0.895) and AP of 0.697 (0.669-0.722). We chose the superior ROI in our further comparisons because it performs with higher precision at most recall levels. We observed a lower performance when we used texture features extracted from the whole patella.

Comparison of Texture Features and Clinical Variables
After testing the ROIs, we developed machine learning models based on clinical assessments and subject characteristics. In order to understand the value of texture features, we performed a thorough evaluation of age, sex, body-mass index (BMI), WOMAC and TFOA KL scores ( Figure 7). Here, we utilised GBM models and trained them to predict the probability of PFOA using different combinations of the risk factors mentioned above. Figure 7 Figure 10).

Discussions and Conclusions
In this study, we presented a machine learning method based on textural features of patella to detect PFOA from lateral view knee x-rays. Texture features were ex-  tracted both by handcrafted descriptor (LBP) and also learned directly from texture patches using CNNs. We compared-texture based models with more conventional models that use clinical assessments and participant characteristics (age, sex, BMI, WOMAC,KL). The model that uses only texture features obtained from superior ROI from patellar region yielded statistically significant improvement over the clinical features (ROC AUC of 0.884 vs. 0.817). This finding suggests that patellar texture features can provide useful imaging biomarkers for OA diagnostics.
The highest ROC AUC and AP for classifying knees without and with PFOA were obtained when model predictions based on texture features and clinical data were combined into a second level machine learning model. This can be explained by the fact that the textural features of patellar bone and the clinical features are complementing each other. Model stacking outperformed early feature fusion because the former method allow each classifier to include their own benefits.
We tested three different ROIs for texture analysis. Texture features from the superior ROI had the highest classification performance to distinguish between knees with and without PFOA. We observed a performance decrease when we used the whole patellar region for predicting PFOA compared to superior and inferior ROI. These findings suggest that the area closest to the inferior margin of the patella, where osteophytes typically occur, experience the most significant bony changes in PFOA.
This study has also limitations. The most important one is the lack of external data for validating the performances of the models. Therefore, we adopted the cross fold validation strategy to evaluate them. However, the actual generalizability of the models could only be understood on a separate test data that comes from different sources (population, hospital, device, etc.). However, because the patellofemoral OA is not studied as much as tibiofemoral OA, the number of PFOA-focused dataset is limited. Second, PFOA is not only altering the bone structure but at the same time changes the morphology and the alignment of patella [36]. In future studies, models that consider patella morphology and alignment should be explored and the additional value of textural features on the diagnosis of PFOA should be studied. Third, other views of patella such as skyline view should be analysed to confirm the location of the bone alterations. Finally, in order to extract texture features, we employed rectangular ROIs. In some cases such ROI could overlap with the background that might slightly affect the results. In principle, adopting ROIs that adhere to the boundaries would be better, but this requires exact segmentation of the patellar region, and it might also lead to a non-constant ROI size further affecting the standardized comparisons between different subjects.
In conclusion, we present the first study that evaluates patellar bone texture for detecting PFOA. Our results show that texture features of patellar bone are different between knees with and without PFOA. Good classification performance values indicate that analyzed texture features contain useful information of patellar bone structure, which seems to change in PFOA. These texture features may be used in the future as novel imaging biomarkers in OA diagnostics.

Acknowledgements
Multicenter Osteoarthritis Study (MOST) Funding Acknowledgment. MOST is comprised of four cooperative grants (Felson -AG18820; Torner -AG18832, Lewis -AG18947, and Nevitt -AG19069) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by MOST study investigators. This manuscript was prepared using MOST data and does not necessarily reflect the opinions or views of MOST investigators.
We would like to acknowledge the strategic funding of the University of Oulu, Infotech Oulu.
We gratefully acknowledge Claudia Lindner for providing the BoneFinder ® tool and lateral knee active shape model, Aleksei Tiulpin for providing an interface to BoneFinder to fully leverage multiple processors, and the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used in this research.

Funding
Funding sources are not associated with the scientific contents of the study.

Competing interests
The authors declare that they have no competing interests.

Summary Table
• We present the first study that evaluates patellar bone texture for detecting patellofemoral osteoarthritis (PFOA). • We proposed a framework based on machine learning and performed extensive experiments to demonstrate the influence of texture features on the diagnostic performance. • Different region of interests (ROIs) from patellar region from lateral view radiographs (X-rays) were studied whether texture features can be used to distinguish the knees with definite radiographic PFOA from healthy knee radiographs.  The CNN model consists of 3 convolutional layers dedicated to texture feature extraction. Each convolution layer (stride=1, padding =1) is followed by Batch normalization (BN), max pooling (2 × 2) and ReLU. After feature extraction, we used two fully connected layers to make the prediction. A dropout of 0.5 is inserted after the first fully connected layer.
In all the CNN based experiments, we used the same training strategy. We trained the models from scratch (end-to-end) using the random weight initialization. We adopted stochastic gradient descent training on a GPU. A mini-batch of 64 images were employed, and a momentum of 0.9 was used and trained without weight decay. We used a starting learning rate of 0.01 and decreased it by 10 every 8 epochs. The models were trained for 40 epochs and we selected the best performed model.