Table 1 presents the means and standard deviations of the voice-related variables assessed in this study, comparing the control group with the Parkinson's group, along with the F-test and t-test results used to evaluate the mean differences between groups. Our statistical analysis revealed significant differences between controls and Parkinson's patients across several acoustic measures; of the 39 variables examined, 24 differed significantly. These include local perturbation measures such as local percentage jitter (locPctJitter), local absolute jitter (locAbsJitter), and relative average perturbation jitter (rapJitter); shimmer measures such as local decibel shimmer (locDbShimmer) and the three- and eleven-point amplitude perturbation quotient shimmer (apq3Shimmer and apq11Shimmer); long-term vocal stability metrics such as recurrence period density entropy (RPDE), detrended fluctuation analysis (DFA), and pitch period entropy (PPE); and specific Mel-frequency cepstral coefficients (MFCCs) and their changes over time (delta coefficients), particularly MFCCs of orders 2 and 6-12 and delta MFCCs of orders 0-2, 7, 8, 11, and 12.
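The per-variable group comparison described above can be sketched as follows. This is an illustrative reconstruction with synthetic data, not the study's actual pipeline; the use of a variance test to choose between the pooled and Welch t-test variants is an assumption.

```python
# Illustrative per-feature group comparison: a variance test (F-type)
# followed by an independent-samples t-test. Data are synthetic stand-ins
# for one acoustic measure (e.g., locPctJitter).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.45, scale=0.10, size=40)    # hypothetical control values
parkinson = rng.normal(loc=0.70, scale=0.18, size=40)  # hypothetical PD values

# Variance-equality test decides which t-test variant applies
_, p_var = stats.levene(control, parkinson)
equal_var = p_var >= 0.05

t_stat, p_val = stats.ttest_ind(control, parkinson, equal_var=equal_var)
print(f"t = {t_stat:.3f}, p = {p_val:.4g}, significant: {p_val < 0.05}")
```

In practice this test would be repeated over all 39 variables, with the 24 significant ones retained.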
Following the statistical analysis, hyperparameter tuning of the Deep Neural Network (DNN) model was conducted to optimize its configuration. The optimal setup comprised 32 neurons in the first hidden layer, with tanh, ReLU, and sigmoid activation functions in successive layers. Training was guided by the Adamax optimizer, binary cross-entropy loss, a dropout rate of 0.1, and a learning rate of 0.01. We trained the DNN for 30 epochs with a batch size of 16. The model, comprising 1813 trainable parameters, exhibited an average accuracy of 72.10% ± 6.65% (CI 69.5% - 74.7%), sensitivity of 83.49% ± 9.22% (CI 79.7% - 86.9%), specificity of 60.79% ± 10.25% (CI 57% - 64.8%), precision of 68.34% ± 6.26% (CI 66% - 70.7%), F1 score of 74.89% ± 6.05% (CI 72.7% - 77.2%), and a ROC AUC of 79.94% ± 7.80% (CI 76.9% - 82.8%).
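A minimal Keras sketch of this configuration follows. The input width (24 selected features) and the sizes of the second and third hidden layers are assumptions, since only the first hidden layer's width is stated; the sketch therefore does not reproduce the reported 1813-parameter count.

```python
# Keras sketch of the reported DNN configuration. Layer widths after the
# first hidden layer are hypothetical; only the 32-neuron first layer,
# the activation sequence, and the training settings come from the text.
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(24,)),            # 24 significant acoustic features (assumed)
    layers.Dense(32, activation="tanh"),  # first hidden layer: 32 neurons, tanh
    layers.Dropout(0.1),                  # dropout rate 0.1
    layers.Dense(16, activation="relu"),      # hypothetical width
    layers.Dense(8, activation="sigmoid"),    # hypothetical width
    layers.Dense(1, activation="sigmoid"),    # binary PD vs. control output
])
model.compile(
    optimizer=tf.keras.optimizers.Adamax(learning_rate=0.01),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
# model.fit(X_train, y_train, epochs=30, batch_size=16)
```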
The optimized RF model comprised 150 trees (n_estimators), each with a maximum depth of 6 (max_depth). The model was fine-tuned with min_samples_split set to 10, min_samples_leaf to 8, max_features to sqrt, and criterion to gini, ensuring that each leaf retained sufficient samples to make a reliable prediction. The model exhibited an average accuracy of 77.34% ± 7.61% (CI 74.2% - 80.3%), sensitivity of 76.58% ± 10.7% (CI 72.7% - 81.8%), specificity of 78.1% ± 9.46% (CI 74.6% - 81.8%), precision of 78.21% ± 8.1% (CI 75.1% - 81.1%), F1 score of 77% ± 8.01% (CI 73.8% - 80.1%), ROC AUC of 85.88% ± 6.38% (CI 83.5% - 90.3%), oob_error of 0.2323 ± 0.0134, and test-error of 0.2266 ± 0.761.
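The reported configuration maps directly onto scikit-learn's RandomForestClassifier; the sketch below uses a synthetic dataset in place of the acoustic feature matrix.

```python
# scikit-learn sketch of the reported Random Forest configuration.
# The dataset is synthetic; the 24-feature width is an assumption.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=24, random_state=42)

rf = RandomForestClassifier(
    n_estimators=150,
    max_depth=6,
    min_samples_split=10,
    min_samples_leaf=8,      # each leaf keeps enough samples for stable estimates
    max_features="sqrt",
    criterion="gini",
    oob_score=True,          # enables the out-of-bag error reported above
    random_state=42,
)
rf.fit(X, y)
print(f"OOB error: {1 - rf.oob_score_:.4f}")
```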
The GB model was configured with 200 estimators and a learning rate of 0.3. Each tree had a maximum depth of 5, a minimum sample split of 30, a minimum sample leaf of 10, max_features set to sqrt, and subsample set to 0.9. This configuration resulted in an average accuracy of 83.23% ± 6.23% (CI 80.8% - 85.6%), sensitivity of 81.74% ± 8.38% (CI 74.2% - 81.4%), specificity of 84.79% ± 7.75% (CI 81.8% - 88.0%), precision of 84.59% ± 7.26% (CI 81.6% - 87.3%), F1 score of 82.91% ± 6.57% (CI 80.3% - 85.4%), and a ROC AUC of 90.46% ± 5.22% (CI 88.0% - 93.7%).
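In scikit-learn terms, this corresponds to the GradientBoostingClassifier sketch below (synthetic data, assumed feature width):

```python
# scikit-learn sketch of the reported GB configuration.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=24, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.3,
    max_depth=5,
    min_samples_split=30,
    min_samples_leaf=10,
    max_features="sqrt",
    subsample=0.9,       # <1.0 gives stochastic gradient boosting
    random_state=42,
)
gb.fit(X, y)
print(f"training accuracy: {gb.score(X, y):.3f}")
```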
Our SVM model utilized a radial basis function (RBF) kernel with a regularization parameter (C) of 0.75 and a gamma value of 0.1. The SVM exhibited an average accuracy of 83.75% ± 5.39% (CI 81.6% - 86.0%), sensitivity of 89.07% ± 6.21% (CI 86.3% - 90.9%), specificity of 78.44% ± 9.05% (CI 74.7% - 81.9%), precision of 80.98% ± 6.65% (CI 78.5% - 83.4%), F1 score of 84.62% ± 4.94% (CI 82.7% - 86.6%), and a ROC AUC of 91.31% ± 4.62% (CI 89.5% - 93.1%).
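This corresponds to scikit-learn's SVC; the standardization step and synthetic data in the sketch below are assumptions added to make it runnable (RBF kernels are sensitive to feature scale):

```python
# scikit-learn sketch of the reported SVM configuration.
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=24, random_state=42)

svm = make_pipeline(
    StandardScaler(),  # assumed preprocessing; RBF distances depend on scale
    SVC(kernel="rbf", C=0.75, gamma=0.1, probability=True, random_state=42),
)
svm.fit(X, y)
print(f"training accuracy: {svm.score(X, y):.3f}")
```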
The ESM achieved an average accuracy of 84.49% ± 6.08% (CI 82.1% - 86.8%), sensitivity of 85.74% ± 7.53% (CI 85.7% - 90.5%), specificity of 83.30% ± 9.36% (CI 79.8% - 87.0%), precision of 84.29% ± 7.96% (CI 81.0% - 87.2%), F1 score of 84.70% ± 5.95% (CI 82.4% - 87.0%), and a ROC AUC of 92.08% ± 4.94% (CI 90.0% - 95.2%).
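The composition of the ESM is not restated in this section. Assuming it denotes a stacked ensemble of the tuned base learners with a logistic-regression meta-learner (both assumptions), scikit-learn's StackingClassifier expresses the idea:

```python
# Hypothetical stacking sketch: base learners from the sections above,
# combined by a logistic-regression meta-learner trained on out-of-fold
# predictions. Dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=24, random_state=42)

esm = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=150, max_depth=6, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.3, random_state=42)),
        ("svm", SVC(kernel="rbf", C=0.75, gamma=0.1, probability=True, random_state=42)),
    ],
    final_estimator=LogisticRegression(),  # hypothetical meta-learner
    cv=5,                                  # out-of-fold predictions train the meta-learner
)
esm.fit(X, y)
print(f"training accuracy: {esm.score(X, y):.3f}")
```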
Lastly, the Ensemble Voting Model (EVM) obtained an average accuracy of 82.19% ± 6.59% (CI 79.1% - 86.0%), sensitivity of 81.02% ± 8.60% (CI 76.2% - 86.4%), specificity of 83.36% ± 9.10% (CI 77.3% - 90.5%), precision of 83.46% ± 8.00% (CI 80.5% - 86.5%), F1 score of 81.92% ± 6.72% (CI 77.3% - 86.4%), and a ROC AUC of 90.46% ± 4.08% (CI 88.9% - 92.1%).
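A voting ensemble over the same base learners can be sketched with scikit-learn's VotingClassifier; soft (probability-averaged) voting is an assumption, as the voting scheme is not restated here.

```python
# Hypothetical voting-ensemble sketch over the tuned base learners.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=24, random_state=42)

evm = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=150, max_depth=6, random_state=42)),
        ("gb", GradientBoostingClassifier(n_estimators=200, learning_rate=0.3, random_state=42)),
        ("svm", SVC(kernel="rbf", C=0.75, gamma=0.1, probability=True, random_state=42)),
    ],
    voting="soft",  # average predicted class probabilities across models
)
evm.fit(X, y)
print(f"training accuracy: {evm.score(X, y):.3f}")
```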
These results illustrate each model's capacity to effectively differentiate between PD patients and healthy controls, underscoring the utility of integrating advanced machine learning techniques into the analysis of complex voice data.
3.1 Comparison Across Models
Our model performance comparison revealed statistically significant differences. The ESM, SVM, and GB models consistently outperformed the others. ESM and SVM significantly outperformed the ANN model in accuracy (p<0.001), with no significant difference between them (p=0.993 for ESM vs. SVM). Similarly, GB's accuracy was comparable to that of ESM and SVM (p=0.9292 for ESM vs. GB; p=0.9988 for GB vs. SVM). In sensitivity, SVM significantly surpassed RF and ANN (p<0.001), while ESM and SVM were comparable (p=0.384), as were ESM and GB (p=0.1906). In specificity, ESM, GB, and SVM all showed substantial improvements over ANN (p<0.001), with no significant difference between GB and ESM (p=0.967); the GB vs. SVM comparison yielded p=0.0093.
GB and ESM were significantly better in precision than ANN (p<0.001). For F1 scores, ESM, SVM, and GB were superior to ANN (p<0.001), with no significant differences between ESM and SVM (p=1.0) or between GB and SVM (p=0.783). The ROC AUC values likewise favored ESM, SVM, and GB, with no significant differences among them (p>0.7145 for all pairwise comparisons). Conversely, the EVM was statistically inferior to GB in specificity (p=0.05) and to RF in F1 score (p=0.003). These results demonstrate the robust performance of ESM, SVM, and GB in PD diagnosis, which outclassed the other models on most metrics.
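The pairwise comparisons above presumably operate on per-fold metric scores. Since the exact test is not restated in this section, the sketch below uses paired t-tests with a Bonferroni correction on hypothetical fold accuracies; the fold scores and the choice of test are both assumptions.

```python
# Illustrative pairwise model comparison on synthetic per-fold accuracies,
# using paired t-tests with a Bonferroni-adjusted significance threshold.
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
fold_acc = {                       # hypothetical per-fold accuracies
    "ESM": rng.normal(0.845, 0.06, 30),
    "SVM": rng.normal(0.838, 0.05, 30),
    "GB":  rng.normal(0.832, 0.06, 30),
    "ANN": rng.normal(0.721, 0.07, 30),
}
pairs = list(combinations(fold_acc, 2))
alpha = 0.05 / len(pairs)          # Bonferroni correction over all pairs
for a, b in pairs:
    t, p = stats.ttest_rel(fold_acc[a], fold_acc[b])
    print(f"{a} vs {b}: t={t:+.2f}, p={p:.4f}, significant={p < alpha}")
```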
The results from our ML analysis are pivotal for clinical application, particularly for speech-language pathologists who focus on voice disorders in Parkinson's Disease. The enhanced diagnostic accuracy demonstrated by our models, particularly the SVM and Ensemble Methods, indicates that these tools can reliably identify early signs of Parkinson's Disease through routine voice assessments. This capability to detect subtle vocal changes before they become overtly apparent offers a significant advantage in early disease management, potentially allowing for earlier interventions that can alter the disease's progression and improve patient outcomes.