Extreme Gradient Boosting for Parkinson’s Disease Diagnosis from Voice Recordings

Background: Parkinson's Disease (PD) is a clinically diagnosed neurodegenerative disorder that affects both motor and non-motor neural circuits. Speech deterioration (hypokinetic dysarthria) is a common symptom that often presents early in the disease course. Machine learning can help movement disorders specialists improve their diagnostic accuracy using non-invasive and inexpensive voice recordings.

Method: We used the "Parkinson Dataset with Replicated Acoustic Features Data Set" from the UCI Machine Learning repository. The dataset included 45 features, comprising sex and 44 speech-test-based acoustic features, from 40 patients with Parkinson's disease and 40 controls. We analyzed the data using various machine learning algorithms, including tree-based ensemble approaches such as random forest and extreme gradient boosting. We also implemented a variable importance analysis to identify the variables most important for classifying patients with PD.

Results: The cohort included a total of 80 subjects: 40 patients with PD (55% men) and 40 controls (67.5% men). PD patients showed at least two of the three symptoms: resting tremor, bradykinesia, or rigidity. All subjects were over 50 years old, and the mean ages for PD subjects and controls were 69.6 (SD 7.8) and 66.4 (SD 8.4), respectively. Our final model achieved an AUC of 0.940 with 95% confidence interval 0.935-0.945 in 4-fold cross-validation using only six acoustic features: Delta3 (Run 2), Delta0 (Run 3), MFCC4 (Run 2), Delta10 (Run 2/Run 3), MFCC10 (Run 2), and Jitter_Rap (Run 1/Run 2).

Conclusions: Machine learning can accurately detect Parkinson's disease using inexpensive and non-invasive voice recordings. Such technologies could be deployed on smartphones to screen large populations for Parkinson's disease.


Introduction
Parkinson's disease (PD) is a neurodegenerative disorder of largely unknown cause [1]. After Alzheimer's disease, it is the second most common neurodegenerative disease [2]. In 2010, there were approximately 680,000 people over 45 years old with PD in the US, and this number is expected to rise to 1,238,000 by 2030 [3]. Diagnosis of PD currently relies on clinical examination; however, by the time PD becomes clinically apparent, there is more than 50% loss of substantia nigra neurons and an 80% decline in striatal dopamine levels [4], [5]. The current gold standard in PD diagnosis is based on motor signs and symptoms (bradykinesia, resting tremor, rigidity, postural reflex impairment) and response to dopaminergic drugs [6], [7]. However, the accuracy of commonly used clinical diagnostic criteria is around 80% [8], [9], implying that a large population with PD is undiagnosed or may be misdiagnosed with other types of neurodegenerative disorders [10]. Moreover, PD is well recognized as a disorder that affects both motor and non-motor neural circuits [6], [11], [12]. Hence, discovering novel motor or non-motor markers of PD, or improving the accuracy of currently available diagnostic tools, has the potential to better diagnose patients with PD.
Noninvasive speech tests have been explored as a marker of disease [8], [13], since deterioration of speech is consistently observed in patients with PD [14]-[16]. Naranjo et al. [17], [18] showed that patients with PD could be detected with moderately high accuracy using acoustic features extracted from a speech test. In this study, we implemented machine learning methods, specifically extreme gradient boosting [19], on Naranjo et al.'s [17] data to substantially improve PD detection accuracy from acoustic features extracted from voice recordings.

Data:
We utilized the "Parkinson Dataset with Replicated Acoustic Features Data Set" that was donated to the University of California Irvine Machine Learning repository by Naranjo et al. [17] in April 2019. The dataset includes sex, age, and 44 acoustic features extracted from voice recordings of 40 patients with PD and 40 controls. The voice recording was repeated three times (three runs) as a sustained phonation of the vowel /a/ for 5 seconds on one breath. Digital recordings were made at a 44.1 kHz sampling rate with 16 bits/sample [17].
Features: As described above, each acoustic feature was calculated separately for each of the three runs of the speech test. Thus, in addition to testing the diagnostic accuracy of our analytic approach, we were also able to investigate intra-individual changes in response across different runs of the test. We considered the acoustic features calculated for all three runs as individual predictors. Moreover, for a given acoustic feature, we created three artificial variables representing the change from one run to another (Figure 1). Therefore, our feature set included 264 acoustic features and sex for 80 subjects.
Classification: We implemented an extreme gradient boosting algorithm to distinguish between subjects with PD and controls. Gradient boosting is an ensemble machine learning method consisting of several weak models (shallow decision trees rather than overfitting deep ones), and it can be used for both regression and classification problems [19]. Because it uses weak classifiers, it is more robust against overfitting than random forest, a similar method that allows individual tree predictors to overfit [19], [22], [23]. In our work, we implemented 4-fold cross-validation to identify any overfitting by randomly splitting the data into four distinct folds. We also repeated this process multiple times and present average results.
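The feature construction described above can be sketched as follows on synthetic stand-in data (the real UCI feature names and values are not reproduced here). The three change variables are assumed to be per-feature differences between run pairs; column names are illustrative.

```python
# Sketch of building the 265-column feature set: 44 acoustic features x 3 runs,
# plus 3 change variables per feature, plus sex. Data are synthetic placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_subjects, n_features, n_runs = 80, 44, 3

# Per-run features: columns like "F1_run1", ..., "F44_run3" (132 columns).
cols = [f"F{j}_run{r}" for r in range(1, n_runs + 1) for j in range(1, n_features + 1)]
df = pd.DataFrame(rng.normal(size=(n_subjects, len(cols))), columns=cols)

# Change variables: for each acoustic feature, the difference between runs
# (run1->run2, run2->run3, run1->run3), adding 132 more columns.
for j in range(1, n_features + 1):
    for a, b in [(1, 2), (2, 3), (1, 3)]:
        df[f"F{j}_run{a}to{b}"] = df[f"F{j}_run{b}"] - df[f"F{j}_run{a}"]

df["sex"] = rng.integers(0, 2, size=n_subjects)  # 264 acoustic features + sex
print(df.shape)  # (80, 265)
```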
Variable Importance Analysis, Feature Selection, and Re-Classification: We first built the gradient boosting model using the 265 features with 4-fold cross-validation and repeated this process 100 times.
At each repetition, for each model built within the 4-fold cross-validation (4 x 100 models), we implemented a feature importance analysis that calculates the relative contribution of each feature to the corresponding model. A higher value of this metric implies that a feature is more important than another feature with a lower value [24]. By averaging the feature importances obtained from the 400 individual models, we obtained a ranking of the 265 features. Next, we built new classification models with 4-fold cross-validation by incrementally adding the top 15 most important features selected in the previous step into the model, in order of their importance ranking. We repeated each of these steps 100 times to better estimate the effect of each feature on model performance when it is introduced into the model. We then identified the step at which model performance stopped improving or started to diminish. Finally, using the features introduced up to that step, we rebuilt the gradient boosting models with 4-fold cross-validation and report various performance metrics, including specificity, sensitivity, positive predictive value, accuracy, F1 score, and area under the receiver operating characteristic curve (AUC).
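The importance-averaging and incremental re-classification procedure above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: scikit-learn's `GradientBoostingClassifier` stands in for extreme gradient boosting, the data are synthetic, and 5 repeats replace the paper's 100 for speed.

```python
# Sketch: average feature importances over models built in repeated 4-fold CV,
# then incrementally add top-ranked features and track CV F1 at each step.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=80, n_features=30, n_informative=6,
                           random_state=0)

importances = np.zeros(X.shape[1])
n_models = 0
for rep in range(5):  # the paper uses 100 repeats
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=rep)
    for tr, _ in cv.split(X, y):
        model = GradientBoostingClassifier(n_estimators=50,
                                           learning_rate=0.1).fit(X[tr], y[tr])
        importances += model.feature_importances_
        n_models += 1
importances /= n_models
ranking = np.argsort(importances)[::-1]  # most important feature first

# Incrementally introduce top-ranked features and record mean CV F1.
for k in range(1, 8):
    f1 = cross_val_score(GradientBoostingClassifier(n_estimators=50),
                         X[:, ranking[:k]], y, cv=4, scoring="f1").mean()
    print(k, round(f1, 3))
```

In practice one would stop at the step where the F1 curve plateaus, as described above.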

Comparisons:
We compared our results to those of other commonly used classification models, including logistic regression (LR), k-nearest neighbors (KNN), support vector machines (SVM), and random forest (RF).
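A comparison of this kind might be set up as below: the same 4-fold procedure applied to each baseline via scikit-learn, on synthetic stand-in data (the actual hyperparameters used in the paper are not specified, so defaults are assumed).

```python
# Sketch: evaluate LR, KNN, SVM, and RF with the same 4-fold CV and F1 metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=10, random_state=0)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegression()),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {name: cross_val_score(m, X, y, cv=4, scoring="f1").mean()
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: F1={s:.3f}")
```

Scaling is applied inside each pipeline for the distance- and margin-based models so that test folds never leak into the scaler.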

Results
Cohort: Our cohort included 40 subjects with PD (55% men) and 40 healthy controls (67.5% men). All subjects were over 50 years old, and the mean age and standard deviation (SD) for PD subjects and controls were 69.6 (SD 7.8) and 66.4 (SD 8.4), respectively. PD diagnosis required at least two of resting tremor, bradykinesia, or rigidity [17].

Classification:
We initially built the gradient boosting model with 4-fold cross-validation using the entire set of 265 predictors, 50 learners, and a learning rate of 0.1. We repeated this process 100 times by randomly splitting the data into four folds and obtained an average F1 score of 0.862 with 95% confidence interval (CI) 0.860-0.864.
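The repeated 4-fold evaluation with a confidence interval on the mean F1 can be sketched as follows. This is an illustration under stated assumptions: synthetic data, scikit-learn's `GradientBoostingClassifier` as a stand-in for XGBoost (with the 50 learners and 0.1 learning rate quoted above), 10 repeats instead of 100, and a normal-approximation CI across repeat means.

```python
# Sketch: repeat stratified 4-fold CV over random splits and report the
# mean F1 with a 95% CI computed across the repeat-level means.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=80, n_features=20, random_state=0)

rep_means = []
for rep in range(10):  # the paper uses 100 repeats
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=rep)
    clf = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)
    rep_means.append(cross_val_score(clf, X, y, cv=cv, scoring="f1").mean())

rep_means = np.array(rep_means)
mean = rep_means.mean()
half = 1.96 * rep_means.std(ddof=1) / np.sqrt(len(rep_means))  # 95% CI half-width
print(f"F1 = {mean:.3f} (95% CI {mean - half:.3f}-{mean + half:.3f})")
```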
Variable Importance Analysis: As described in the Methods section, using the total of 400 models, we ranked the variables by their importance in classification. The top 15 variables are shown on the x-axis of Figure 2.

Feature Selection and Re-classification:
To obtain a compact model, we repeated our 4-fold classification strategy 15 times, incrementally introducing a new variable into the model in order of importance. Figure 2 summarizes the F1 score with associated 95% confidence intervals for each step of this re-classification. After introducing the first six variables (Delta3 (Run 2), Delta0 (Run 3), MFCC4 (Run 2), Delta10 (Run 2/Run 3), MFCC10 (Run 2), and Jitter_Rap (Run 1/Run 2)) into the model, further additional variables did not improve the classification accuracy. We further implemented a grid search over the number of learners {20, 50, 60, 70, 80, 100} to identify whether performance could be improved. However, there was no significant difference in the AUC values of models with different parameter settings.
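The grid search over the number of learners might look like the sketch below, using `GridSearchCV` with AUC as the selection metric. The estimator is a scikit-learn stand-in for XGBoost and the data are synthetic.

```python
# Sketch: 4-fold grid search over n_estimators with AUC as the scoring metric.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=80, n_features=6, random_state=0)

search = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1),
    param_grid={"n_estimators": [20, 50, 60, 70, 80, 100]},
    cv=4,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```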
Comparisons: Since the F1 score using six variables was similar to that obtained using all 265 variables, we decided to continue with this compact model of six variables. To compare the performance of our gradient boosting model with other commonly used classification methods (LR: logistic regression, SVM: support vector machines, RF: random forest, KNN: k-nearest neighbors), we implemented the same 4-fold classification procedure for each and provide the results in Table 1. Table 1 shows that gradient boosting outperforms all of the other machine learning models on every accuracy metric considered.
The main reason for implementing 4-fold cross-validation in our work was to make our results comparable to the work of Naranjo et al. [17], [18], which is the original study utilizing this data.
However, we also implemented 5-fold cross-validation using the six selected variables and obtained an F1 score of 0.869 (95% CI 0.861-0.877) and an AUC of 0.944 (95% CI 0.939-0.949). We also repeated all of the results in Table 1 with 5-fold cross-validation, and gradient boosting remained the best performing method on all accuracy metrics (Supplementary Material).

Discussion
We were able to accurately classify persons with Parkinson's disease from voice recordings using machine learning. Acoustic features extracted from speech test recordings offer potential for computerized, non-invasive diagnostic tools. The data we used in this study included 44 acoustic features generated separately for three runs of the same speech test task. In their original studies of the same data, Naranjo et al. [17], [18] proposed a statistical approach that treats the results of these runs as repeated measures and obtained 4-fold classification accuracy, sensitivity, specificity, precision, and AUC of 0.779, 0.765, 0.792, 0.806, and 0.879, respectively. Our results show that better accuracy metrics (0.864, 0.852, 0.877, 0.940) are obtained with gradient boosting algorithms using only six acoustic features.
As reported above, our results showed that Delta3 (Run 2), Delta0 (Run 3), MFCC4 (Run 2), Delta10 (Run 2/Run 3), MFCC10 (Run 2), and Jitter_Rap (Run 1/Run 2) were the most important variables for detecting Parkinson's disease from speech. Figure 3 shows that these variables are indeed very different between controls and patients with Parkinson's disease. It is worth noting that none of the acoustic variables obtained from the first run of the speech test was selected as a predictor in the final model. There were three variables from the second run, one from the third run, one variable representing the change from run 1 to run 2, and one variable representing the change from run 2 to run 3. When we took a closer look at these two change variables, Jitter_Rap (Run 1/Run 2) and Delta10 (Run 2/Run 3), in terms of the correlations across the three runs (Figure 4), the responses to the same task in runs 1 and 3 were very highly correlated (similar) in patients with PD, while the same correlations were much smaller in controls. Interestingly, similar patterns have been reported for typing-style heterogeneity in patients with PD [25].

Figure 1
Figure 1 Acoustic features used in modeling

Figure 2
Figure 2 Feature selection and re-classification results for 4-fold cross-validation

Figure 3
Figure 3 Distribution of six important variables across controls vs cases

Figure 4
Figure 4 Correlations between three runs of two important variables

Table 1
Comparison of alternative machine learning methods