Assessing the Reproducibility of Machine-learning-based Biomarker Discovery in Parkinson's Disease

Genome-Wide Association Studies (GWAS) help identify genetic variations in people with diseases such as Parkinson's disease (PD), which are less common in those without the disease. Thus, GWAS data can be used to identify genetic variations associated with the disease. Feature selection and machine learning approaches can be used to analyze GWAS data and identify potential disease biomarkers. However, GWAS studies have technical variations that affect the reproducibility of identified biomarkers, such as differences in genotyping platforms and selection criteria for individuals to be genotyped. To address this issue, we collected five GWAS datasets from the database of Genotypes and Phenotypes (dbGaP) and explored several data integration strategies. We evaluated the agreement among different strategies in terms of the Single Nucleotide Polymorphisms (SNPs) that were identified as potential PD biomarkers. Our results showed a low concordance of biomarkers discovered using different datasets or integration strategies. However, we identified fifty SNPs that were identified at least twice, which could potentially serve as novel PD biomarkers. These SNPs are indirectly linked to PD in the literature but have not been directly associated with PD before. These findings open up new potential avenues of investigation.


Introduction
Parkinson's disease (PD) is a degenerative neurological condition that affects both the motor and non-motor aspects of movement including planning, initiation, and execution [1,2].Parkinson's symptoms result from an 80% or greater loss of dopamine-producing cells in the substantia nigra [3].Dopamine works with other neurotransmitters to coordinate nerve and muscle cells involved in movement [3].Without sufficient dopamine, the neurotransmitters' balance is disrupted, causing the distinctive symptoms of PD, such as tremors, rigidity, slow movement, and poor coordination [4].PD patients experience a significant decline in quality of life, social activities, and family relationships [5,6,7].PD is one of the most prevalent neurodegenerative disorders affecting 1-2 persons per 1,000 people at any time of their live and has a prevalence rate of 1% over the age of 60 [8].Due to a growth in the number of senior individuals and age-standardized incidence rates, the estimated number of people affected with PD in the world grew from 2.5 million in 1990 to 6.1 million in 2016 [9].
Motor symptoms are mostly used to diagnose PD [3].While non-motor symptoms (such as cognitive changes, difficulties with attention and planning, sleep disorders, sensory abnormalities, and olfactory dysfunction) are common in patients before the onset of PD, they lack specificity, are challenging to assess, and/or vary from patient to patient [1,10,11].Thus non-motor symptoms are not currently utilized to diagnose PD on their own, despite some of them being used as supportive diagnostic criteria [12,13].Additionally, there is no specific lab test to diagnose PD [3].
Genome-Wide Association Studies (GWAS) aim to identify genetic variations, especially Single Nucleotide Polymorphisms (SNPs), that distinguish individuals with a particular disease from those without it.By analyzing these differences, one can identify SNPs that are associated with the disease of interest.However, GWAS data are massive and contain more SNPs (features) than individuals (samples), which poses a challenge of high dimensionality.To address this, feature selection is a standard technique used to choose the most reliable, non-redundant, and relevant features to include in a model.The main goals of feature selection are to enhance the predictive model's performance and reduce computational expenses.In this study, we utilized five different GWAS datasets that included both Parkinson's disease (PD) cases and controls.To reduce the dimensionality of these datasets, we employed the SVFS algorithm [31] for feature selection.Additionally, we used the Random Forest (RF) algorithm [32] for classification and proposed four different approaches to integrate datasets, which we compared with the baseline approach of no integration.Finally, we collected SNPs that were identified by at least two different datasets or integration strategies.We identified four SNPs with direct links to PD and 50 with indirect links to PD in the literature.A direct link means that current literature directly links a SNP with PD; while an indirect link means that current literature suggests the involvement of a SNP in a disease other than PD but this other disease co-occurs with PD in a significant number of PD patients.These indirectlyassociated SNPs open up new potential avenues for investigating potential biomarkers for PD.

Dataset Description
Our study includes five different GWAS datasets obtained from dbGaP [33].A description of these datasets is as follows and a short summary is given in Table 1.
1. Phs000126 (Familial) [34,35,36,37,38,39,40,41] dataset combines the results of two major National Institutes of Health (NIH)-funded genetic research studies (PROGENI and GenePD) which aimed at discovering new genes that influence the risk of PD.These studies have been analyzing and recruiting families with two or more PD affected members for over eight years.There are almost 1,000 PD families in the total sample.
2. Phs000394 (Autopsy)-Confirmed Parkinson Disease GWAS Consortium (APDGC) [42] was established to perform a genome-wide association research in people with neuropathologically diagnosed PD and healthy controls.Their study's hypothesis is that by enrolling only cases and controls with neuropathologically proven illness status, diagnostic misclassification will be reduced and power to identify novel genetic connections will be increased.

Phs000089 The National Institute of Neurological Disorders and Stroke
(NINDS) repository [43,44,45,46,47] was created in 2001 with the intention of creating standardized, widely applicable diagnostic and other clinical data as well as a collection of DNA and cell line samples to enhance the field of neurological illness gene discovery.NINDS dataset is divided into NINDS1 and NINDS2.4. Phs000048 (Tier 1) The dbGaP team at NCBI calculated this Genome-Wide Association scan [48,49,50,51] between genotype and PD status of 443 sibling pairs that were at odds for PD between June 1996 and May 2004.

Preprocessing and imputation
We applied KNNcatimputer [52] for imputing missing values.To tune the parameters of KNNcatimputer using grid search, we removed from 5% to 20% of genotype values from the data, impute them and evaluate the imputation accuracy.We found that "Cohen" for distance measure and n = 20 for the number of neighbours as the parameters to optimize imputation accuracy.We also removed columns (SNPs) with more than 5% missing values.We came up with the threshold of 5% by experimenting on columns without missing values, randomly removed a certain percentage of the values, and used KNNcatimputer to determine the accuracy of imputation.On the Familial, Autopsy, NINDS1, NINDS2, and Tier1 datasets, the accuracy of imputation for different parameters was comparable to one another.In order to get consistent findings across all datasets, we chose the aforementioned parameters.The imputation accuracy for Familial, Autopsy, NINDS1, NINDS2, and Tier1 dataset was 84.04% ± 1.12, 87.38% ± 2.31, 86.31% ± 1.5, 84.53% ± 1.3, and 83.81% ± 0.42 respectively.

Feature Selection
We used the SVFS algorithm [31] for dimensionality reduction.Afshar and Usefi demonstrated the efficiency of SVFS in terms of accuracy and running time on various biological datasets, and compared it against other feature selection algorithms.As per [31] we set SVFS parameters as k = 50, T h irr = 3, T h red = 4, α = 50, and β = 5 and we use these values on all datasets and experiments.

Classification
For classification, we used the Random Forest (RF) algorithm [32].We performed repeated 5-fold cross-validation (CV) for 10 times.We used RF with the following settings: n estimator = 100, criteria = "gini", max depth = None, and min samples split = 2.

Common SNPs among datasets
We observed that the Familial dataset has the most number of samples and SNPs (Table 1).As we discuss in Section 2.6, for pair-wise dataset integration, we need to obtain the SNPs in common between two datasets.Thus, to maximize the number of SNPs in common between two datasets we decided to use the Familial dataset as the dataset to be integrated with all other datasets.The number of SNPs in common between the Familial and all other datasets are provided in Table 2.

Approaches to integrate datasets
We proposed four different approaches to integrate datasets, and compared these approaches against the baseline approach of no integration (Approach 0).For each approach, we defined two modes: (A) take the mostfrequently-selected SNPs as features for the Random Forest algorithm, and (B) extend the SNPs in A by adding those SNPs in linkage disequilibrium (LD) with the SNPs in A and used all these SNPs as features for the Random Forest algorithm (RF).We used SNIPA [53] to get SNPs in LD with the The most-frequently-selected SNPs were defined as those SNPs selected in at least fives runs out of 50 runs of the SVFS algorithm.These SNPs are referred to as the most common SNPs.

Approach 0
Approach 0 is our baseline approach.In this approach we ran the SVFS algorithm 50 rounds (5-fold CV for 10 times) on each dataset separately to find the most common SNPs per dataset.We used cross-validation to assess the classification performance of a model generated using the most common SNPs as features and RF as the classifier.This process was performed again following mode B described above.

Approach 1
In Approach 1, we selected features from the Familial dataset and found the SNPs to use for classification as per mode A and B above.Then, for each of the other four datasets, found which of the selected SNPs was available on that specific dataset and performed CV to train and assess the classification performance of a model generated using that dataset and those SNPs.

Approach 2
In Approach 2, we first obtained the intersection of SNPs between the Familial dataset and each of the other four datasets (Table 2).SNPs not in the intersection were removed from both datasets.We did then followed the same steps as per Approach 1 by selecting the features from the condensed version of the Familial dataset and doing CV on each of the other datasets.

Approach 3
In Approach 3, before doing feature selection, we got the SNPs in the intersection between the Familial dataset and the other 4 datasets, and then merged the datasets pair-wise as follows: Familial and Autopsy, Familial and NINDS1, Familial and NINDS2, and Familial and Tier1.We ran the SVFS feature selection algorithm on each of the four merged datasets and extracted the most common SNPs.Then, we performed CV to assess the classification performance of a model generated using each of the four merged datasets.
To allow for a direct comparison with the other approaches, in this approach we calculated the accuracy per each dataset in addition to the accuracy on the merged dataset.Thus, we obtained three accuracies: one for the merged dataset and one for each of the merged dataset's individual instances.

Approach 4
This approach is the same as Approach 3, but equal number of samples from each of the two datasets were randomly selected before merging them.The number of cases and controls taken from each dataset is the same.

Results and Discussion
Approach 0 (A) obtained the highest accuracy for all datasets (Table 3 and Figures 1 and 2).We anticipated that Approach 0 would achieve the highest accuracy because there is no technical variation among the samples used.Approaches 1 & 2, and Approaches 3 & 4 have similar accuracy.Approaches 3 & 4 achieved higher accuracy than approaches 1 & 2 on Autopsy, NINDS1, and NINDS2.This lead us to recommend to perform feature selection on merged datasets.As extending the list of most common SNPs with SNPs in LD with the most common SNPs (mode B) did not improve accuracy, for the following analyses we considered only the SNPs obtained using mode A.
In Figure 3, we show the number of SNPs and genes that are in common among different integration approaches for the same dataset.To get the genes where the SNPs are located we used the Biomart platform [54].On average around 6% of the SNPs identified using Approach 0 (without data integration) are replicated by at least one other approach (Supplementary Figures), while on average 38.1 ± 22% of the SNPs identified with any data integration strategy (Approaches 1 to 4) are replicated by at least one other integration approach (Figure 3).This indicates that integrating datasets increases by six-fold the proportion of replicated SNPs.The highest agreement was observed between Approaches 3 and 4, followed by Approaches 1 and 2. Figure 3(f) only includes three approaches because there were no common SNPs between Approach 1 and Approaches 2, 3, and 4. Figure 4 shows the number of identified SNPs and genes that are in agreement using different datasets per approach.The number of SNPs in agreement between datasets fell dramatically when using different datasets but the same approach.The highest pair-wise agreement is in Approach 0 between NINDS1 and NINDS2 which are two datasets generated by the same studies.This suggests that the genotyping platform and study design are the greatest source of variation.If we consider NINDS1 and NINDS2 as the same dataset then the number of genes reproduced by at least two different datasets goes from 33 in Approach 0 to 75 in Approach 4. This suggests that integrating datasets more than duplicates the number of genes that are reproduced in at least two datasets.

Phenotypes associated with potential biomarkers
To investigate whether SNPs identified by at least two datasets or two approaches (referred to as replicated SNPs) are associated with PD, we retrieved the phenotype associated with each of these replicated SNPs using the Biomart platform.Four of these replicated SNPs (rs11248060, rs239748, rs999473, and rs2313982) are directly linked to PD.By performing a literature search on the associated phenotypes, we found 50 other replicated SNPs indirectly associated with PD.That is, they are directly associated with a disease that frequently co-occurs with PD.These 50 replicated SNPs together with the manuscripts used as basis of their indirect association to PD are listed in Supplementary File 1.These 50 replicated SNPs open up new potential avenues of investigation for PD biomarkers.

Conclusion
The integration of datasets may result in a slight decrease in classification accuracy but it can significantly enhance the reproducibility of potential biomarkers.We show that on average 93% of the SNPs discovered using a single data set are not replicated in other datasets, and that dataset integration reduces that average lack of replication to 62%.A limitation of this study is that SNPs genotyped on each dataset differ due to different genotyping platforms, leading to the exclusion of many SNPs as potential biomarkers when integrating data sets.To address this issue, repeating the study on SNPs detected by whole-genome sequencing could be a solution.In addition, accessing other datasets from diverse populations would enable us to further test the replication of our identified SNPs.Moreover, our methodology is applicable to various diseases such as breast cancer, lung cancer, and colorectal cancer.

Declaration of Interest
None.

Table 2 :
Number of SNPs in common between Familial and the other four datasets SNPs in A. SNIPA parameters: genome assembly, variant set, population, and genome annotation were set as GRCH-37, 1000 genomes phase 3 V5, European, and Ensemble 87, respectively, throughout all our experiments.

Table 3 :
Comparison of datasets accuracy