Deep Learning Framework for Complex Disease Risk Prediction Using Genomic Variations

Genome-wide association studies have proven their ability to improve human health outcomes by identifying genotypes associated with phenotypes. Various works have attempted to predict the risk of diseases for individuals based on genotype data. This prediction can either be considered as an analysis model that can lead to a better understanding of gene functions that underlie human disease or as a black box in order to be used in decision support systems and in early disease detection. Deep learning techniques have gained more popularity recently. In this work, we propose a deep-learning framework for disease risk prediction. The proposed framework employs a multilayer perceptron (MLP) in order to predict individuals’ disease status. The proposed framework was applied to the Wellcome Trust Case-Control Consortium (WTCCC), the UK National Blood Service (NBS) Control Group, and the 1958 British Birth Cohort (58C) datasets. The performance comparison of the proposed framework showed that the proposed approach outperformed the other methods in predicting disease risk, achieving an area under the curve (AUC) up to 0.94.


Introduction
The human genome is the whole set of deoxyribonucleic acid (DNA) sequences in humans, which consists of approximately three billion base pairs [1,2]. Human genomes are almost identical; however, at least three million nucleotides per individual are different. The most common type of these genetic variations is Single Nucleotide Polymorphisms (SNPs). Studies have proven that SNPs are the most contributing markers in several complex and rare diseases [1]. Most SNPs are natural; however, certain SNPs are functional and affect the phenotype of interest, such as skin colour, height, infection, resistance, or responses to drugs.
Genome-wide association studies (GWASs) have proven their ability to unveil susceptibility variants in human diseases [3,4]. These studies provide a better understanding of diseases by enabling researchers to identify SNPs that significantly differ in frequencies between the affected and healthy individuals. GWASs have identified more than 4164 loci contributing to common complex diseases such as diabetes [5][6][7], cancer [8][9][10], and rheumatoid arthritis [11]. Moreover, GWASs allow researchers to develop models for complex disease risk prediction based on genetic information [12][13][14][15][16]. If the disease of interest can be identified at the early stage, specific therapy plans can be applied to delay or even prevent the outset of some diseases [17,18]. Building risk prediction models can contribute to personalised medicine becoming feasible by utilising an individual's genome to predict disease risk and the response to treatment. However, the critical issue is how to predict disease risk accurately from a huge number of SNPs.
given disease. A combination of genomic data and demographic data has been used to predict the disease risk of breast cancer by [42]. Their system used a gradient tree-boosting method in both the selection and classification phases. However, many clinical and medical applications require more accurate prediction systems.
Risk prediction methods typically apply different techniques in order to select a manageable number of SNPs. Most studies rank SNPs based on the p-value of their association with the phenotype of interest to control the number of selected SNPs and use the top associated SNPs as input to a prediction algorithm [43]. However, the predictive power of these studies is relatively poor, and discarding SNPs with a low p-value could limit the opportunity to identify inter-SNP correlations [19,44]. Moreover, there are variations associated with many diseases that have not yet been identified; hereby, analysing an expanded list of SNPs may improve the prediction system performance. For example, in [19], it was suggested that considering uncommon and rare SNPs can improve risk prediction for some diseases such as Parkinson's disease using SVM. In addition, SNPs were selected for SVM by applying different p-value thresholds. Moreover, in [44], the BootRank technique was used in order to select robust informative SNPs to be used in a risk prediction model. The BootRank technique was combined with seven different classifiers to evaluate the performance of their proposed technique. Their model improved the ability to predict the disease risk of unseen individuals in the WTCCC data.
In this work, an accurate deep-learning framework for complex disease risk prediction has been proposed. An adequate subset of SNPs that are highly correlated and nonredundant has been selected using the Joint Mutual Information (JMI) method [45]. Then, the selected features were fed to an MLP that consists of an input layer, five hidden layers, and an output layer to train the prediction system. The proposed system was evaluated using datasets from WTCCC. The comparative experimental results demonstrate the ability of the proposed to accurately predict risk for different diseases as compared to the stateof-the-art approaches including [15,32,34,39,40,44,[46][47][48][49], achieving an AUC of up to 0.94. The rest of this work is organised into three sections. Section 2 discusses materials and methods. Section 3 presents the experimental results and discussion. Finally, Section 4 concludes this work.

Genotype Datasets
Genotype data were obtained from the WTCCC [38] for seven different diseases. The diseases are Type 1 diabetes (T1D), Type 2 diabetes (T2D), inflammatory bowel disease (IBD), coronary artery disease (CAD), bipolar disorder (BD), rheumatoid arthritis (RA), and hypertension (HT), as presented in Table 1. Each disease dataset contains approximately 2000 cases. The control sets obtained from the UK National Blood Service Control Group (NBS) and 1958 British Birth Cohort (58C) contained 1500 individuals [38]. Each sample consists of 500,568 SNPs that were produced by an Affymetrix 500k chip sequencer. As recommended by the associated datasets, 809 samples and 30,956 SNPs were excluded due to deviation from Hardy-Weinberg equilibrium, bad quality, or bad clustering [38]. The dataset has been filtered to exclude SNPs based on the following threshold: a Minor Allele Frequency (MAF) of 1%, p-value < 1 × 10 −3 , and a missing rate of 5% [32,46,[50][51][52]. As a result of the filtering, the final number of samples for each data set is presented in Table 1, with 469,606 SNPs for each one. To ensure that our results are not biased to cases or control, an equal number of samples for each class have been used. Where a group of healthy samples were randomly selected from UKBS and 58C to keep the case:control ratio at 50%:50% for each disease.
Mutation exists in the gene copy that is inherited from both parents. The allele frequencies are represented by A and B for the major allele frequency and minor allele frequency, respectively. Any given SNP could have the value of AA or BB to indicate that it is a homozygous SNP and the value of AB for a heterozygous SNP. In this proposed work, we used the additive model to encode SNPs. The encoding technique counts the minor allele appearance. Consequently, the coding value 0 represents AA, 1 represents AB, and 2 represents BB. Finally, after implementing the aforementioned coding technique, the dataset is represented in numerical format. If a dataset consists of n samples and q SNPs, which can be represented by a G = n × q matrix, then G ij is the number of the minor allele of SNP j for the sample i. Let Y i be a binary indicator for the disease status of a given sample i = 1, · · · , n. The affected samples (case) are considered as having a positive class label (Y i = 1) and the healthy ones (control) as having a negative class label (Y i = 0).

Method
The proposed framework predicts the risk of an examined disease using SNP data. An MLP-based binary classifier has been developed to predict the disease risk status. A mutual information feature selection technique has been applied to decrease the feature space dimensionality and select the most discriminative SNPs. The dataset was split into (70%) training and (30%) testing sets, keeping the class ratio of each group similar to that of the whole dataset, and the testing data were only used for analysing the predictive power of the proposed system as illustrated in Figure 1. Five-fold cross-validation has been applied over the training data in order to perform feature selection. Finally, different performance metrics have been used to evaluate the predictive power of the proposed framework.

Feature Selection
The extremely large number of SNPs in the genome makes the application of machine learning techniques on SNP data computationally impossible. Consequently, the application of feature selection techniques is necessary for the selection of a significantly smaller subset of SNPs. Statistical and machine learning-based feature selection methods have demonstrated their ability to select an optimal SNP subset out of the whole genome [35,53].
In this work, JMI has been employed as a feature selection method in order to reduce computational complexity and improve risk prediction performance.
Mutual information feature selection methods have been widely applied in the biomedical field [54]. Mutual information is used to measure the features' relevancy and redundancy [55]. In the multivariate filter selection method, mutual information does not make any assumption of the data or change the original data representation [56]. In this work, JMI was used to measure the discriminative power of features and select a reduced set of SNPs to be injected into the prediction model.
JMI is a popular feature selection technique that selects a subset of features to maintain a high feature association and maximum correlation with the class of interest [45]. This method measures the information provided by the feature vector s 1 , s 2 , · · · , s q that decreases the uncertainty about the class label Y. JMI uses mutual Information to measure the amount of relevancy and redundancy between features. JMI calculates not only mutual information between features and the class label but also takes into consideration the correlation between the new feature and already selected features D, thus ensuring a good trade-off between relevancy and redundancy [45]. A higher JMI value for a feature s i means that the feature s i is relevant to the target Y and is highly complementary to the already picked features s j , j in D. The JMI for a feature s q is computed as shown in Equation (1) [57].
where : In this work, JMI was applied over the training set to select a subset of SNPs, D. F-fold cross-validation has been applied in order to create a matrix q × F, with q being the number of SNPs of each sample and F = 5. At any given fold, the subset of selected SNPs is assigned to 1, and the value of 0 is assigned to the remaining unselected ones as presented in Equation (5).
At the end of the fifth fold, the accumulated weight W for a given SNP j is calculated as presented in Equation (6). The weight for each SNP represents how many times a given SNP was selected, as illustrated in Figure 2. For example, if an SNP weight is 1, that means it has been selected in all folds. An SNP weight will be 0.6 if it has been selected in three folds. An SNP weight will be 0 if it has not been selected in any fold. Only SNPs that have a weight exceeding a threshold value (a) will be propagated to the prediction model.

Deep Learning
Artificial neural networks (ANNs) are modelling tools inspired by the function of neurons in the human brain. These networks offer an alternative way to handle complex problems and are able to perform predictions for linear/nonlinear problems. Multi-Layer Perceptron (MLP) is one of the popular feed-forward neural networks that consists of an input, hidden, and output layer. In this work, the feedforward MLP consisting of one input layer and one output layer along with five hidden layers has been employed, as conceptualised in Figure 3. Each layer contains a number of neurons, which are interconnected in multiple layers by weighted connections. The input feature vectors are passed through the multiple hidden layers downstream to the output layer [58]. The feature vectors are combined with weights to identify the informativeness of the inputs to the next layer. For any given neuron in layer L, the input is the sum of the weights for each neuron with a bias after applying an activation function in layer L − 1.
Given data input x i (i = 1, 2, 3, . . . , N), the neural model output y can be gained by Equation (7) where W is the model weight, b is the bias vector, and f is the activation function.
In this work, an MLP has been employed in order to identify patients with a certain disease, as conceptualised in Figure 3. The input layer consists of N nodes and considers SNPs as features. The output layer consists of one neuron (affected or healthy). The proposed models' hyperparameters have been optimised using a grid search of the k-fold cross-validation technique with k = 3. This technique can ensure how accurately the model would perform in practice and avoid overfitting. Since we are implementing our model on different diseases, a modular model consisting of multiple modules is used. While similarities between models are possible, training the models separately means that all the architectures are optimised using various hyperparameters. After implementing a grid-based search, all possible hyperparameter value combinations have been examined. The best performances have been achieved with five hidden layers and 512 neurons in each hidden layer for all dataset models. Different activation functions for the hidden layers were evaluated: the tanh function performed the best out of the examined activation functions in three datasets, namely RA, T1D, and T2D, while the relu function performed better in CAD and IBD. The best performance of the HT dataset has been achieved using the sigmoid function. The softmax function was used in the output layer for all models. The softmax activation function used in our model is presented in Equation (8).
where x is the input vector to the softmax function, x i is the ith element of the input vector, and k is the number of classes. Different optimizers were used as a learning algorithm. The Adam optimizer outperforms the compared optimisation functions in most models. For BD and HT datasets, the best optimizers were NADAM and rmsprop activation functions, respectively. However, for the other parameters, all models achieved their best performance using the same values. In order to avoid model overfitting, a dropout technique that drops neurons randomly along with their connections has been used with a probability of 0.6. The best performance was achieved using 200 epochs for all models. Finally, the best learning rate was achieved using a 0.001 learning rate. The proposed models can be validated using test data in order to demonstrate their high-performance ability. Possible hyperparameter values are given in Table 2. After building the networks and optimising the parameters on the seven aforementioned complex disease datasets, we came up with more than one model: the first one uses the same hyperparameters for three datasets (RA, T1D, and T2D), and the model uses the Adam optimizer and tanh activation function. A slight difference was implemented by using the relu activation function in the CAD and T1D datasets. On the other hand, two more models were implemented using NADAM, tanh and rmsprop, sigmoid as optimiser and activation functions, respectively. Finally, the proposed models can be validated using test data in order to demonstrate their high-performance ability.
The cross-entropy cost function, which is explained in Equation (9), has been used to estimate the output error.
for N data points, where t i is the truth value taking a value 0 or 1, and p i is the Softmax probability for the ith data point.

Evaluation
Different experiments were conducted to evaluate the performance of the examined deep learning prediction architecture in terms of accuracy (Equation (10)), sensitivity (Equation (11)), precision (Equation (12)), F1-score (Equation (13)), AUC, and Matthews correlation coefficient (MCC) (Equation (14)). In order to compute the metrics, different values were calculated: 1. True positive (TP): the number of samples that were correctly identified to be corresponding to the targeted disease. F1 score = 2 · Precision · sensitivity Precision + sensitivity (13) Each dataset was split into two main parts: 70% for training the system and 30% for testing the system. Furthermore, in order to identify the optimum features subset, a five-fold cross-validation technique was implemented to train data during the feature selection process. To this end, the training samples were shuffled and split into five groups. The splitting kept the ratio of the classes similar in each group to that of the original dataset. Then, the experiment of selecting features was repeated five times, and at each fold we used one group for testing the model and the remaining four groups as a training set. The final classification performance results were computed using the 30% of the original dataset that was unseen by the feature selection and training phases.
At the end of the fifth fold of the feature selection process, each SNP had a weight value, depending on how many times the SNP has been selected. SNPs with weights larger than the threshold value a were selected for the final feature vector.

Results and Discussion
The threshold value was selected experimentally, by evaluating different values and selecting the optimal value, as shown in Figure 4 and Table 3. In most datasets, the best prediction accuracy was achieved using a threshold of 0.6. However, in predicting the risk of CAD and HT, the best performance was achieved using a threshold value of 0.8 and 1, respectively.  The performance of the examined prediction deep neural network approach for each dataset in terms of accuracy, sensitivity, precision, F1 score, and MCC is presented in Table 4. The achieved results demonstrate the ability of the proposed disease risk prediction system to perform accurately. The affected samples were identified with an accuracy range between 0.796 to 0.948 for seven different complex disease datasets. Regarding the sensitivity and precision values, the proposed system was able to detect most patients in the datasets with high sensitivity values in most cases, ranging from 0.798 for HT disease to 0.934 for T1D disease. Moreover, very few healthy samples were identified as a case with precision values ranging from 0.83 to 0.966 for all diseases, apart from IBD disease, with a precision of 0.726. The proposed system performed the best in identifying the risk of CAD with an F1 score reaching 0.95. In predicting the risk of T1D, T2D, BD, HT, and RA, the F1 score of the proposed system ranged between 0.84 and 0.92. However, predicting the risk of IBD disease was the most challenging, achieving an F1 score of up to 0.782. For the MCC, the performance of our proposed system ranged between 0.606 and 0.901. The proposed MLP was compared with other state-of-the-art machine learning techniques. The best performance with the compared techniques was achieved using SVM and linear discriminant analysis (LDA). A comparison of SVM, LDA, and the proposed MLP performance in terms of F1 score is shown in Figure 5. It is evident that MLP achieves the highest prediction performance with an improvement of 1.2% to 7.9% over SVM and LDA. The best improvement was achieved in the HT dataset, while the lowest was obtained in the T1D dataset. Finally, we compared our proposed system prediction performance against other studies conducted on the WTCCC datasets as presented in Table 5. Comparing with these studies, we can guarantee that their systems dealt with datasets that have the same properties, the same number of controls and cases, and the same genotyping density. The AUC of the compared methods varied between 0.56 and 0.90 depending on the dataset and the algorithm. The proposed system outperformed the other frameworks for all datasets. The improvement of the proposed system in terms of AUC was less than 4% in identifying affected samples of T1D, T2D, and BD datasets. A better improvement of approximately 9% was achieved in predicting the risk of IBD. The risk prediction for RA, CAD, and HT was the best with an improvement of more than 15% over the best competitors.
Applying deep learning techniques to complex disease genomic datasets is not a trivial task, and the pre-processing of the data can be highly affected by many factors leading to a severe impact on the final conclusions. The proposed framework was able to select a subset of high discriminative SNPs that contributed to improving the prediction ability. A number of SNPs that have been identified with high discriminative values in our proposed system have been previously identified to be associated with diseases in other published works. Out of the selected SNPs, 75 SNPs were identified to be highly correlated with different diseases in the original dataset [38], 23 SNPs were identified in a study conducted over the same dataset [59], and 9 SNPs were identified in a study conducted only on the HT dataset [60].

Conclusions
In this work, a deep learning approach using MLP has been proposed to predict the risk of complex diseases based on genomic variations. The proposed approach exploits the JMI filter feature selection method in order to select a subset of SNPs with high discriminative power. The selected features are then fed to an MLP-based prediction algorithm to distinguish between healthy and affected samples. The proposed model has been evaluated on seven state-of-the-art datasets from WTCCC, UKNBS, and 58C. The experiment results demonstrate the superiority of the proposed model as compared to the traditional machine learning techniques, achieving an F1-score of 0.94. Moreover, the obtained results have been compared with state-of-the-art methods that were applied on the same datasets. An improvement in terms of an AUC of up to 22% compared to previous methods was achieved using the proposed approach. The proposed framework was also able to identify a number of SNPs that have high discriminative value and were previously identified to be linked with diseases in other published work. Taking into consideration the obtained prediction performance, as well as the performance of other methods proposed in the literature, it is evident that the proposed approach is applicable and efficient for complex disease risk prediction from SNP data.  Institutional Review Board Statement: Ethical review and approval were waived for this study due to all the data used being publicly available.