Classification of long-term clinical course of Parkinson’s disease using clustering algorithms on social support registry database

Although Parkinson’s disease (PD) has a heterogeneous disease course, it remains challenging to establish subtypes. We described and clustered the natural course of PD with respect to functional disability and mortality. This retrospective cohort study utilized the Korean National Health Insurance Service database, which contains the social support registry database for patients with PD. We extracted patients newly diagnosed with PD in 2009 and followed them up until the end of 2018. Functional disability was assessed based on the long-term care insurance (LTCI) and National Disability Registry data. Further, we measured all-cause mortality during the observation period. We included 2944 eligible patients. The surviving patients were followed up for 113.7 ± 3.3 months. Among the patients who died, those with and without disability registration were followed up for 61.4 ± 30.1 and 43.2 ± 32.0 months, respectively. The cumulative survival rate was highest in Cluster 1 and decreased from Cluster 1 to Cluster 6. In the multivariable Cox regression analysis, the defined clusters were significantly associated with increased long-term mortality (adjusted hazard ratio [aHR], 3.440; 95% confidence interval [CI], 3.233–3.659; p < 0.001). Further, age (aHR, 1.038; 95% CI, 1.031–1.045; p < 0.001), diabetes (aHR, 1.146; 95% CI, 1.037–1.267; p = 0.007), and chronic kidney disease (aHR, 1.382; 95% CI, 1.080–1.768; p = 0.010) were identified as independent risk factors for long-term mortality. In contrast, female sex (aHR, 0.753; 95% CI, 0.681–0.833; p < 0.001) and a higher LTCI grade (aHR, 0.995; 95% CI, 0.992–0.997; p < 0.001) were associated with a significantly decreased long-term mortality risk. We identified six clinical course clusters for PD using longitudinal data regarding the social support registry and mortality. Our results suggest that PD progression is heterogeneous in terms of disability and mortality.


Introduction
Parkinson's disease (PD), which is the second most common neurodegenerative disease after dementia, is caused by damage to dopamine-secreting neuronal cells in the substantia nigra [1]. Dopamine acts on the striatum and is mainly responsible for regulating body motor function [2]. Accordingly, dopamine deficiency causes motor symptoms such as tremors, gait disturbance, muscle spasms, and slow gait; further, it causes non-motor symptoms such as autonomic nervous system disorders, depression, and cognitive decline [3].
Symptoms in patients with early-stage PD can be sufficiently improved by a small dose of drug treatment [4]. However, as the disease progresses, the required drug dose and administration frequency increase [5]; moreover, new symptoms, including dyskinesia and freezing of gait, appear and functional disabilities worsen [6,7]. Additionally, non-motor symptoms such as sensory change and autonomic nervous system dysregulation significantly contribute to functional limitations [8]. Although PD itself is not fatal enough to shorten life expectancy [9], complications that arise with disease progression, such as falls, swallowing disorders, and gait restrictions, are strongly related to long-term prognosis and mortality [10,11]. Taken together, PD requires long-term management plans given the symptom changes with disease progression.
PD has heterogeneous disease courses, and thus can be classified into multiple subtypes [12,13]. Previous studies have defined PD subtypes in terms of symptom domains and progression speed, with the primary subtypes being the mild motor-predominant, intermediate, and diffuse malignant subtypes [14,15]. Armstrong et al. [16] reported that the mild motor-predominant type was the most common and showed slow progression, while the diffuse malignant type was observed in 9-16% of patients. Macleod et al. [9] reported that the average post-diagnosis survival period was 6.9-14.3 years, which varied considerably across patients. Mestre et al. [17] indicated the challenges of PD subtyping and the need to elucidate PD heterogeneity.
To understand the natural clinical course of PD, we explored the long-term outcomes (functional disabilities and mortality) of PD as well as the relationship of demographic features, comorbidities, and functional disabilities with long-term mortality in patients with PD using South Korea's National Health Insurance Service (NHIS) database.

Data source and patient inclusion
This retrospective, longitudinal cohort study was conducted using customized cohort data from the Korean NHIS database (NHIS-2020-1-160) [18]. This study was reviewed and approved by the Institutional Review Board (NHIS-2023-02-002), which waived the requirement for informed consent given the retrospective study design and anonymity of the NHIS data. This study was conducted in compliance with the principles of the Declaration of Helsinki.
The initial sample comprised 31,167 patients with the International Classification of Diseases, 10th Revision (ICD-10) code G20 as the primary diagnosis. The study cohort comprised patients prescribed related drugs along with the G20 diagnosis code at medical institutions of general hospital level or higher. The drugs used for PD included levodopa, dopamine agonists (ropinirole, pramipexole, etc.), entacapone, amantadine, selegiline, rasagiline, and anticholinergics (trihexyphenidyl HCl, benztropine mesylate, and procyclidine). In South Korea, patients diagnosed with PD using the G20 code can be registered in the system of 'rare and intractable diseases'. Subsequently, they can receive support from the government for a significant portion of medical expenses related to the diagnosis. The criteria for registration of PD as a 'rare and intractable disease' are presented in Supplementary Document 1.
Among the initial cohort, we extracted 3227 patients who were first diagnosed with PD in 2009. Subsequently, we excluded patients with previously registered disabilities due to brain lesions, patients with missing values, and patients aged < 40 years. Finally, we included 2944 patients with newly diagnosed PD in 2009, who were followed up for approximately 10 years (Fig. 1).

Variable definitions
Regarding sociodemographic variables, the sex and age of the patients were confirmed. Patient insurance was classified into medical aid and health insurance service types; moreover, health insurance services were further classified as self-employed and employee-insured types. The contribution of the self-employed insured type is calculated as the contribution score × value per score (Korean won). The contribution score is determined by considering the subscriber's income, property, economic activity participation rate, and sex and age of household members. In contrast, the contribution of the employee insured type is calculated as the monthly wage × contribution rate, with the subscriber paying 50% and the employer paying 50% of the insurance premium [19]. Accordingly, we used the national health insurance premium level, which is an indicator of household income level, as a proxy for socioeconomic status and classified patients into four quartile groups. Residential areas were categorized as capital, metropolitan, city, and county. Comorbidities included hypertension (I10-I15), diabetes (E10-E14), dyslipidemia (E78), ischemic heart disease (I25), atrial flutter/fibrillation (I48), chronic kidney disease (N18), cerebral stroke (I60-I64), and neoplasm (C00-D49). Finally, we confirmed all-cause mortality from the date of PD diagnosis until the end of 2018.

Fig. 1 Flowchart of patient inclusion. Abbreviations: LTCI, long-term care insurance; PD, Parkinson's disease
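The two premium formulas and the quartile grouping described in this section can be illustrated with a minimal Python sketch. It is not the authors' procedure (the study was run in R on NHIS data): the function names, the sample premiums, and the use of the standard library's quartile cut points are all assumptions made for illustration.

```python
from statistics import quantiles


def self_employed_contribution(contribution_score, value_per_score_krw):
    """Self-employed insured: contribution score x value per score (Korean won)."""
    return contribution_score * value_per_score_krw


def employee_contribution(monthly_wage_krw, contribution_rate):
    """Employee insured: monthly wage x contribution rate, split 50/50."""
    total = monthly_wage_krw * contribution_rate
    return {"total": total, "subscriber": total / 2, "employer": total / 2}


def premium_quartile(premium, cohort_premiums):
    """Return 1 (lowest) to 4 (highest) using the cohort's three quartile cut points."""
    cuts = quantiles(cohort_premiums, n=4)
    return 1 + sum(premium > cut for cut in cuts)


# Hypothetical premiums, for illustration only (not study data).
cohort = [10, 20, 30, 40, 50, 60, 70, 80]
print([premium_quartile(p, cohort) for p in cohort])
```

How ties at a cut point are assigned, and whether the medical-aid group is handled separately from the quartiles, are choices the sketch does not settle.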

Social support registry data and group definition
Two social support registry databases for patients with PD were used as indicators of functional disability. Disability-registered patients were defined as those approved in either of these two social support registries. We used the grade at the time of the first registration for patients who underwent multiple reevaluations.
First, the long-term care insurance (LTCI) of South Korea provides nursing services for individuals with limitations in daily activities due to geriatric diseases such as stroke, dementia, and PD, as well as normal elderly individuals aged ≥ 65 years with limitations in daily activities. It is provided in the form of home-based, institution-based, and special cash benefits. The review of long-term care is based on a doctor's opinion after examining the individual's condition, with the Rating Committee subsequently deciding whether to approve LTCI and the approval grade. The doctor's note form required for the LTCI application and its grade definition are provided in Supplementary Document 2 and Table S1, respectively [20,21].
Second, according to the National Disability Registry of South Korea [21], PD can be approved for brain lesion disability after > 1 year of diligent and continuous treatment, together with sufficient medical records indicating major symptoms or dopaminergic neuronal loss confirmed by single photon emission computed tomography or N-(3-[18F]fluoropropyl)-2β-carbomethoxy-3β-(4-iodophenyl)nortropane positron emission tomography. The diagnosis of disability mainly reflects the degree of overall functional impairment based on the degree and extent of paralysis, balance disorder, ataxia symptoms, and ability to perform activities of daily living. The doctor's note form required for application to the National Disability Registry of brain lesions and its grade definitions are provided in Supplementary Document 3 and Table S2, respectively.
Patients were classified according to functional disability and death as follows: survived and no disability registered (S-NDR group), survived and disability registered (S-DR group), death and disability registered (D-DR group), and death and no disability registered (D-NDR group).
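The four-group rule above, combined with the definition of disability registration as approval in either registry, can be written as a small Python sketch. The function and argument names are hypothetical and are not taken from the study's code.

```python
def outcome_group(died, ltci_approved, disability_registry_approved):
    """Map survival status and registry approval to the four outcome groups."""
    # A patient counts as disability-registered if approved in either registry.
    registered = ltci_approved or disability_registry_approved
    status = "D" if died else "S"
    registry = "DR" if registered else "NDR"
    return f"{status}-{registry}"


# Example: a surviving patient approved for LTCI only.
print(outcome_group(died=False, ltci_approved=True, disability_registry_approved=False))
```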

Statistical analysis and clustering method
All statistical analyses and clustering were performed using R software version 4.0.3 (R Core Team, R Foundation for Statistical Computing, Vienna, Austria). Statistical significance was set at P < 0.05. We used the 'NbClust' R software package to determine the optimal number of clusters [22]. We examined the Hubert index, which is a graphical method for determining the optimal number of clusters; accordingly, we determined the optimal number of clusters to be six (Figure S1). Next, we constructed a divisive hierarchical tree for auto-clustering and analyzed the baseline characteristics of the six groups (Clusters 1-6).
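To make the idea of a divisive (top-down) hierarchical tree concrete, the following Python sketch repeatedly splits the widest cluster until a requested number of clusters is reached. The text does not state which divisive algorithm or R function was used, so the DIANA-style splitting rule, the Euclidean distance, the function names, and the toy data here are all assumptions; the sketch does not reproduce the study's analysis.

```python
from itertools import combinations


def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def mean_dist(i, group, pts):
    """Average distance from point i to the members of group, excluding i itself."""
    others = [j for j in group if j != i]
    if not others:
        return 0.0
    return sum(euclid(pts[i], pts[j]) for j in others) / len(others)


def diameter(group, pts):
    """Largest pairwise distance inside a cluster (0.0 for a single point)."""
    return max((euclid(pts[i], pts[j]) for i, j in combinations(group, 2)), default=0.0)


def split(group, pts):
    """Seed a splinter group with the most dissimilar point, then move over
    every point that is closer on average to the splinter group."""
    rest = list(group)
    seed = max(rest, key=lambda i: mean_dist(i, rest, pts))
    rest.remove(seed)
    splinter = [seed]
    while len(rest) > 1:
        gains = {i: mean_dist(i, rest, pts) - mean_dist(i, splinter, pts) for i in rest}
        best = max(gains, key=gains.get)
        if gains[best] <= 0:  # nobody left prefers the splinter group
            break
        rest.remove(best)
        splinter.append(best)
    return rest, splinter


def divisive_clusters(pts, k):
    """Start from one cluster holding all points; split the widest until k clusters."""
    clusters = [list(range(len(pts)))]
    while len(clusters) < k:
        widest = max(clusters, key=lambda g: diameter(g, pts))
        if len(widest) < 2:
            break
        clusters.remove(widest)
        clusters.extend(split(widest, pts))
    return clusters


# Toy one-dimensional data, for illustration only.
points = [(0,), (1,), (10,), (11,), (50,), (51,)]
print(divisive_clusters(points, 3))
```

In practice the number of clusters passed as `k` would come from a separate criterion, such as the Hubert index mentioned above.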
Continuous and categorical variables are expressed as mean ± standard deviation and frequency (proportion), respectively, with between-group comparisons performed using analysis of variance with Tukey's post hoc comparisons and the chi-square test, respectively. A multivariable Cox proportional hazards model was established to determine risk factors for long-term mortality. Multicollinearity between variables was defined as a square root of the variance inflation factor > 2. The Cox regression analysis treated variables with six or more categories as continuous variables. We ran a time-dependent Cox regression for LTCI grade, which was determined after the diagnosis of PD.
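The multicollinearity criterion can be unpacked with a short Python sketch. It relies on the standard definition VIF = 1 / (1 − R²), where R² comes from regressing one predictor on the others. With only two predictors, R² is the squared Pearson correlation between them, which is the simplified case shown here. Under that definition, a square root of the VIF above 2 corresponds to R² above 0.75. The function names and sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_from_r2(r_squared):
    """Variance inflation factor for a predictor with the given R^2."""
    return 1.0 / (1.0 - r_squared)


def exceeds_threshold(vif, cutoff=2.0):
    """Flag multicollinearity when sqrt(VIF) is strictly above the cutoff."""
    return math.sqrt(vif) > cutoff


# Two hypothetical predictors, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
vif = vif_from_r2(pearson_r(x, y) ** 2)
print(vif, exceeds_threshold(vif))
```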

Classification based on functional disability and death
There were 478 and 722 patients in the S-NDR and S-DR groups, respectively, as well as 1313 and 431 patients in the D-DR and D-NDR groups, respectively. Table 1 shows their baseline characteristics. Figure 2 shows cumulative changes in the distribution of patients according to death and disability registration.
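As a quick consistency check, the four group sizes reported above can be summed and converted to shares of the cohort. The Python sketch below uses only the counts given in the text. The percentages it prints are derived arithmetic rather than figures reported by the authors, and the variable names are hypothetical.

```python
group_counts = {"S-NDR": 478, "S-DR": 722, "D-DR": 1313, "D-NDR": 431}
total = sum(group_counts.values())  # should match the 2944 included patients


def share(n, denominator):
    """Percentage of the denominator, rounded to one decimal place."""
    return round(100 * n / denominator, 1)


shares = {group: share(n, total) for group, n in group_counts.items()}
deaths = group_counts["D-DR"] + group_counts["D-NDR"]
registered = group_counts["S-DR"] + group_counts["D-DR"]

print(total, shares)
print("died:", share(deaths, total), "% | disability registered:", share(registered, total), "%")
```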
Patients who survived were younger and had a higher proportion of females than those who died within the observation period. The death rate was high in both the medical aid and fourth-quartile premium level groups. The rates of hypertension, diabetes, ischemic heart disease, chronic kidney disease, and stroke were high in the deceased patient groups. In the D-DR group, both LTCI and National Disability Registry grades were significantly lower (more severe disability) than in the other groups. Surviving patients were followed up for 113.7 ± 3.3 months. The D-DR and D-NDR groups were followed up for 61.4 ± 30.1 and 43.2 ± 32.0 months, respectively. The S-DR group registered later in both the LTCI and National Disability Registry than the D-DR group; furthermore, the approval rate was higher for LTCI than for the National Disability Registry.

Classification based on the auto-clustering
Table 2 shows the baseline characteristics of each cluster. Table S3 presents the distribution of patients among the four manually classified groups and auto-classified clusters. All surviving patients were allocated to Clusters 1 and 2. In contrast, all patients in Clusters 3-6 died during the observation period. The cumulative survival rate was the highest in Cluster 1 and decreased with time from Clusters 1 to 6 (Fig. 3).
Cluster 1 had the lowest average age and a high proportion of women. Additionally, it had the lowest rates of diabetes and stroke, and the patients were mainly approved for mild disability in LTCI without being registered in the National Disability Registry. Cluster 6 showed the shortest survival period (4.2 ± 3.0 months) and comprised all patients in the D-NDR group. In contrast, disability-registered patients belonged to Clusters 1 to 5. The time from diagnosis to disability registration decreased from Clusters 1 to 5. In Clusters 3 and 4, many patients were approved only for LTCI but not for the National Disability Registry. Patients in Cluster 5 had a severe degree of disability and a higher rate of registration in the National Disability Registry than the other clusters.

Discussion
We classified patients with PD according to their long-term clinical course and described the characteristics of each group. Our auto-clustering analysis revealed six phenotypes of the natural clinical course of PD, which mainly reflected functional disability and mortality. Therefore, our findings may reflect the motor symptom-oriented clinical course of patients with PD. To our knowledge, this is the first study to attempt to classify the natural clinical course of PD based on demographic factors, comorbidities, and social support registry data using the Korean NHIS database. We could objectively identify information regarding mortality approximately 10 years after PD diagnosis without loss to follow-up. Additionally, functional disabilities could be identified by combining two types of social support data, which minimized information bias. Furthermore, both the LTCI and National Disability Registry allow quantification of the disability degree through grade definition. Taken together, these characteristics allowed reliable description of the long-term clinical course of PD.
Previous studies have identified subtypes of PD using clustering methods [23], which were primarily performed based on disease progression and symptom domains. Belvisi et al. [24] performed agglomerative hierarchical clustering based on motor and neurophysiological features and identified two clinical clusters: mild motor dominant and diffuse malignant types. Additionally, they confirmed that the diffuse malignant type is characterized by increased cortical excitability and decreased plasticity. Lawton et al. [25] performed K-means clustering based on data regarding the motor, cognitive, and non-motor domains for two cohorts of patients with idiopathic PD. The following four clusters were identified: fast motor progression with symmetrical motor disease, intermediate motor progression with mild motor and non-motor disease, intermediate motor progression with severe motor disease, and slow motor progression with unilateral disease. Krishnagopal et al. [26] used the trajectory profile clustering method and described mild, severe, and mixed subtypes of PD. Salmanpour et al. [27] performed a 4-year longitudinal clustering analysis using multiple dimensions of classification algorithms and found that 35% of the patients showed a stable course, while others showed disease escalation.
Based on previous studies, our clusters can be described as follows: Cluster 1 corresponds to the mild motor or non-motor predominant slow progression type; Clusters 2 and 3 correspond to intermediate types; and Clusters 4-6 correspond to diffuse, malignant, and bilateral motor disease subtypes. Patients in Cluster 6 lacked enough time to register their disabilities since their survival time was only about 4 months. Furthermore, in Cluster 6, late diagnosis or other PD-unrelated causes may have played a more critical role in death, as indicated by the relatively higher proportion of comorbidities, including ischemic heart disease, atrial fibrillation, chronic kidney disease, and stroke. Moreover, we operationally classified patients with PD into four groups based on their survival and disability registrations. The S-NDR and S-DR groups may correspond to the mild motor or non-motor predominance and slow progression subtypes, while the D-DR and D-NDR groups had a mixture of intermediate and diffuse malignant subtypes.
Approval for LTCI is based on the assessment score indicating the degree of long-term care required by the applicant. For individuals without dementia, patients with a score of > 51 points for the sum of physical function, cognitive function, behavioral change, nursing care, and rehabilitation area are considered eligible. In contrast, the cut-off criterion for the National Disability Registry for brain lesions is a modified Barthel index ≤ 96 points. We could not present direct data regarding the Unified Parkinson's Disease Rating Scale (UPDRS) score and Hoehn and Yahr (HY) stages, which are widely used to evaluate PD. However, from the perspective of starting to need help from others, the initiation of disability registration can be considered equivalent to a total UPDRS score of 50-60 and HY stage 2.5-3 [28]. We found that patients with PD in South Korea were more often registered in the LTCI than in the National Disability Registry for functional disabilities. This could be attributed to several reasons. First, the National Disability Registry of brain lesions is primarily based on the modified Barthel index and emphasizes the evaluation of activities of daily living. In contrast, the LTCI evaluates both functional level and other motor items, including ataxia and tremor. Moreover, LTCI considers non-motor symptoms including cognitive function and problem behavior, which allows better objective evaluation of disabled patients with PD. This is evident from the fact that the time to registration with LTCI was shorter in all groups.
We identified age, male sex, greater degree of disabilities, diabetes, and chronic kidney disease as risk factors for long-term mortality after the diagnosis of PD, which is consistent with previous studies [9,29]. Moreover, socioeconomic status was not associated with long-term mortality, which is inconsistent with previous reports [30,31]. As mentioned above, this result could be attributed to the government-provided medical cost support for the diagnosis code of G20 in South Korea. Further studies are warranted to identify the associations between socioeconomic status and long-term mortality of PD in South Korea.
This study had several limitations. First, since this was a retrospective study, we could not present the individual patient's medication history or compliance, which are critical for the long-term clinical course of PD. Further, the NHIS database cannot reflect detailed records of motor and non-motor symptoms at the time of diagnosis, including the UPDRS and HY stages. Second, our findings could not reflect the biological signatures of PD. Although the NHIS database allows analysis of social registry data and objective evaluation of longitudinal events, including death, it cannot accurately reflect detailed information about the disease, imaging data, and biological information regarding the pathogenesis of each patient. Third, we could not confirm whether the death was due to complications directly related to PD. Therefore, future studies should conduct systematic research using combined databases from hospitals and the NHIS to overcome these limitations. Finally, while the dataset used in this study has sufficient clinical and demographic characteristics at the time of PD diagnosis, it lacks variables measured repeatedly after the PD diagnosis. Therefore, we applied clustering analysis to overcome this limitation of our dataset and ensure statistical robustness. Clustering analysis is an unsupervised learning approach that searches for patterns in data and provides relatively simple and intuitive results, but the results do not consider changes over time [32]. On the other hand, the growth mixture model can identify subpopulations with different patterns within the data based on repeated measurement data [33]. However, the growth mixture model involves relatively complex statistical modeling and requires repeated measurement data from sufficient time points [34]. When repeated measurement data are unavailable, as in this study, the applicability of the growth mixture model may be limited. We suggest that future studies utilizing growth mixture modeling with repeated measurement values in a PD cohort will contribute significantly to the robustness and generalizability of the results.

Conclusion
We described six PD clinical course clusters using longitudinal data from the social support registry and mortality data obtained from the NHIS database. We confirmed that PD progression is heterogeneous with respect to disability and mortality. Our findings may inform long-term management strategies for patients with PD.

Fig. 2 Annual cumulative changes in target outcomes

Table 1
Baseline characteristics according to the manually classified outcome groups

Table 2
Baseline characteristics according to the auto-clustering groups
Abbreviations: LTCI, long-term care insurance; NHIP, National Health Insurance Premium
a Mean observation months
b Excluding the medical-aid group

Table 3
Cox proportional hazards model for long-term mortality after diagnosis of Parkinson's disease

Fig. 3 Cumulative survival rate according to the auto-clustering groups