Cardiovascular Disease Prediction from Electrocardiogram by Using Machine Learning

Cardiovascular disease (CVD) is the leading cause of death worldwide. In 2017, CVD contributed to 13,503 deaths in Malaysia. The current approaches for CVD prediction are usually invasive and costly. Machine learning (ML) techniques allow an accurate prediction by utilizing the complex interactions among relevant risk factors. This study presents a case–control study involving 60 participants from The Malaysian Cohort, which is a prospective population-based project. Five parameters, namely, the R–R interval and root mean square of successive differences extracted from electrocardiogram (ECG), systolic and diastolic blood pressures, and total cholesterol level, were statistically significant in predicting CVD. Six ML algorithms, namely, linear discriminant analysis, linear and quadratic support vector machines, decision tree, k-nearest neighbor, and artificial neural network (ANN), were evaluated to determine the most accurate classifier in predicting CVD risk. ANN, which achieved 90% specificity, 90% sensitivity, and 90% accuracy, demonstrated the highest prediction performance among the six algorithms. In summary, by utilizing ML techniques, ECG data can serve as a good parameter for CVD prediction among the Malaysian multiethnic population.


Introduction
Cardiovascular disease (CVD) involves the heart and blood vessels and can lead to premature mortality [1]. CVD includes coronary heart disease (CHD), cerebrovascular disease, rheumatic heart disease, and other heart conditions. Approximately 17.9 million people die annually from CVD, which accounts for 31% of the total deaths worldwide [2]. In Malaysia, the incidence of ischemic heart disease increased substantially, by 54% within 10 years, and the disease remained the principal cause of death in 2017 [3]. CVD risk factors, namely, diabetes mellitus (DM), hyperlipidemia, obesity, hypertension, age, gender, smoking, and inactive lifestyle, are important predictors of CVD risk [4]-[5]. The Malaysian Cohort (TMC) project, which was initiated in 2006 to address the rising trends in non-communicable diseases, is a large prospective study involving 106,527 multiethnic participants [6]. More than 2000 parameters, including lipid profile, fasting blood glucose (FBG), body composition, blood pressure, and electrocardiogram (ECG), were obtained or measured from each participant.
ECG measures the electrical activity of the heart and has been extensively used in detecting heart diseases because of its simplicity and noninvasiveness. Moreover, independent risk markers for cardiovascular deaths can be found from ECG metrics [7], which provide comprehensive information on cardiac rhythms and conduction patterns. The standard ECG uses 12 leads from 12 vantage points recorded using 10 electrodes, six of which are placed on the chest wall and four on the limbs. Three of the limb electrodes are used to generate a recording, whereas the right leg electrode serves as an electrical ground [8]. Among the 12 leads, lead II, which measures the potential difference between the electrodes attached to the right arm and left leg, is commonly utilized for diagnosing heart diseases. Lead II readings highlight various segments within the heartbeat and display three of the most important waves: P, QRS, and T [9]. The R-R interval is the time between the R peak of one heartbeat and the R peak of the next. The heart rate variability (HRV), which is abnormal in patients with coronary artery disease, DM, and coronary heart failure, is the variation in the interval between consecutive normal heartbeats and reflects cardiac autonomic function [10]-[11]. Yadav [12] found a significant correlation among the indices of HRV by using the root mean square of successive differences (RMSSD, p = 0.018) and R-R intervals (p = 0.010). According to O'Neal [13], the standard deviations (SDs) of the R-R intervals and RMSSD are associated with an increased risk of CVD and all-cause mortality and vary by sex and race [14]. HRV can also serve as the main predictor of future vascular events [15].
Breathing rate (BR) is a key physiological parameter used in a range of clinical settings. Among the vital signs measured in acutely ill hospital patients, BR provides a highly accurate prediction of deterioration [16]. Despite its diagnostic and prognostic value, BR is still widely measured by manually counting breaths. Many algorithms have been proposed to estimate BR from ECG and photoplethysmogram signals. These BR algorithms provide an opportunity for the automated, electronic, and unobtrusive measurement of BR in healthcare and fitness monitoring [17].
Machine learning (ML)-based artificial intelligence, such as knowledge-based expert systems, differs from other methods and is extensively used in the classification and prediction of CVDs [18]. ML algorithms fall into four well-known types: supervised, unsupervised, semi-supervised, and reinforcement learning. The supervised learning methods, which include linear discriminant analysis (LDA) [19], support vector machine (SVM) [20]-[21], decision tree (DT) [23], k-nearest neighbor (kNN) [24], artificial neural network (ANN) [19], [25], logistic regression [26], and fuzzy logic [27], are widely used for group classification. ANN is widely applied in predicting CHD [28], whereas SVM is frequently adopted in classifying arrhythmia [29]. The capabilities of newer deep learning algorithms, such as the convolutional neural network (CNN), have recently been explored. Acharya [30] compared the accuracy, sensitivity, and specificity of CNN on ECG signals with and without noise.
The present study aims to identify the most significant parameters extracted from ECG signals for CVD prediction by using six types of supervised ML techniques, namely, LDA, linear and quadratic SVMs, DT, kNN, and ANN. To the best of our knowledge, this study is the first to use the raw ECG waveform in predicting CVD among Malaysian subjects. A predictive model for CVD diagnosis at an early stage is crucial in reducing and preventing the morbidity and mortality due to CVD. Furthermore, a solution for this issue is timely because the Malaysian Ministry of Health has launched the National Strategic Plan for Non-Communicable Disease (NSP-NCD 2016-2025) in response to the global challenge in combating NCDs in general and CVD in particular.

Study sample
A total of 66 subjects aged 35 to 65 years, who were recruited between April 2006 and September 2012, were selected from the participants of the TMC project for the nested case-control study. These subjects provided written informed consent for follow-up and also agreed to a 10-min ECG re-recording. Subjects with a history of stroke, myocardial infarction, or heart failure were defined as cases, whereas those without a history of CVD were defined as controls. The study was conducted in accordance with the Declaration of Helsinki, and the ethics approval was obtained from the Medical Research Ethics Committee of Universiti Kebangsaan Malaysia (Project Code: FF-205-2007).
Demographic data, height, weight, body mass index (BMI), lipid profile, FBG, and ECG were retrieved from the Electronic Cohort Information Management System database. The demographic data, such as gender, age, and ethnicity, were collected via face-to-face interviews. Weight and height were obtained using the Seca weight scale (SECA, Germany) and the Harpenden stadiometer (Holtain Limited, UK), respectively. The BMI was calculated from height and weight as BMI = weight (kg) / [height (m)]^2. The blood pressure was measured using Omron HEM-907 (Omron Corporation, Japan). All parameters were measured thrice, and the average measurements were recorded. Peripheral blood samples were collected by venipuncture from each participant after overnight fasting. Biochemical analysis was performed within 24 h of blood collection. The fasting plasma glucose and full lipid profile were analyzed using COBAS Integra® 800 (Roche Diagnostics GmbH, Germany). All tests were performed in an accredited bioanalytical laboratory. The 10-min resting ECG signals for the training and testing datasets were collected from each subject by using the Schiller electrocardiograph to develop an efficient model for CVD classification from the ECG signals. All measurements were collected at the TMC recruitment center in UKM Molecular Biology Institute (UMBI), Kuala Lumpur. MATLAB was used to extract all signals, which were originally in .xml format. After quality control, only 60 subjects were selected for the final analysis. The six other subjects were eliminated because of poor ECG signal quality during the 10-min recording.

Preprocessing
The ECG signals were preprocessed using filtering algorithms to eliminate high- and low-frequency noise. Baseline wander, which is usually caused by respiration, body movements, and inadequate electrode positioning, is a low-frequency noise picked up by the ECG signals [31]. A high-pass filter with a cutoff frequency of 0.5 Hz was used to remove the baseline wander from the signal. A linear-phase finite impulse response (FIR) filter was required for this step to avoid phase distortion, which would alter the wave features within the cardiac cycle. Power line interference at a frequency of 50 Hz introduced amplified sinusoidal noise into the ECG signals. In this study, an infinite impulse response notch filter was used to eliminate the 50 ± 0.2 Hz band. Only the segment from the third to the ninth minute of each recording was used to exclude the transient state. The sampling frequency was 500 Hz.
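The filtering chain described above can be sketched in plain Python as follows. This is a minimal illustration and not the filter design used in the study, which was carried out in MATLAB and is not specified in detail. The Hamming-windowed sinc design, the tap count of 501, the biquad notch formulation, and all function names are assumptions made for illustration. The 180 s to 540 s slice is one reading of "third to ninth minute" that matches the 360-s recordings mentioned in the results.

```python
import math

FS = 500.0  # sampling frequency in Hz, as stated in the text


def highpass_fir_taps(cutoff_hz, fs, numtaps):
    """Linear-phase high-pass FIR: Hamming-windowed sinc plus spectral inversion."""
    if numtaps % 2 == 0:
        raise ValueError("numtaps must be odd for a symmetric filter")
    mid = numtaps // 2
    fc = cutoff_hz / fs  # normalised cutoff in cycles per sample
    lowpass = []
    for n in range(numtaps):
        k = n - mid
        sinc = 2 * fc if k == 0 else math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))
        lowpass.append(sinc * window)
    total = sum(lowpass)
    taps = [-c / total for c in lowpass]  # negate the unit-DC-gain low-pass ...
    taps[mid] += 1.0                      # ... and add a delta: zero gain at DC
    return taps


def fir_filter(taps, x):
    """Causal FIR convolution; samples before the start are treated as zero."""
    y = []
    for i in range(len(x)):
        acc = 0.0
        for j, c in enumerate(taps):
            if i - j < 0:
                break
            acc += c * x[i - j]
        y.append(acc)
    return y


def notch_biquad(f0, bandwidth_hz, fs):
    """Second-order IIR notch centred on f0 (assumed design, Q = f0 / bandwidth)."""
    q = f0 / bandwidth_hz
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    a0 = 1 + alpha
    b = [1 / a0, -2 * math.cos(w0) / a0, 1 / a0]
    a = [1.0, -2 * math.cos(w0) / a0, (1 - alpha) / a0]
    return b, a


def iir_filter(b, a, x):
    """Direct-form I biquad with zero initial conditions."""
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        y.append(yn)
    return y


def steady_segment(x, fs, start_s=180.0, end_s=540.0):
    """Keep the samples between the 3-min and 9-min marks (assumed reading)."""
    return x[int(start_s * fs):int(end_s * fs)]
```

A 501-tap filter at 500 Hz has a transition band of a few hertz, so a sharper 0.5 Hz edge would need a considerably longer filter. A 50 ± 0.2 Hz stop band corresponds here to `notch_biquad(50.0, 0.4, FS)`.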
Four different algorithms, which were developed by Behar [32], Zhang [33], Pan and Tompkins, and Clifford [32], were used to detect the QRS peaks, and the accuracy of the detectors was compared. QRS peaks are essential for the feature extraction of ECG signals and for determining ECG signal quality. Signal quality indexing used the two best-performing QRS peak detectors [34]. The index was given a score of 1 if both detectors agreed on the location of each peak and a score of less than 1 otherwise. The window for signal quality indexing was 10 s. After the preprocessing phase, the HRV was analyzed, and the results were recorded.
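The agreement-based signal quality index can be illustrated with the sketch below. The text states only that a window scores 1 when the two detectors agree and less than 1 otherwise. The matching tolerance of 50 ms, the scoring formula (twice the matched peaks divided by the total number of peaks), the zero score for empty windows, and the function names are all assumptions chosen for illustration.

```python
from statistics import median


def window_sqi(peaks_a, peaks_b, start_s, window_s=10.0, tol_s=0.05):
    """Agreement score in [0, 1] for one window; 1.0 means every peak is matched."""
    end_s = start_s + window_s
    a = sorted(t for t in peaks_a if start_s <= t < end_s)
    b = sorted(t for t in peaks_b if start_s <= t < end_s)
    if not a or not b:
        return 0.0  # assumption: a window where a detector finds no beat scores 0
    matched = 0
    j = 0
    for t in a:
        # Skip peaks of detector B that lie too far before the current peak of A.
        while j < len(b) and b[j] < t - tol_s:
            j += 1
        if j < len(b) and abs(b[j] - t) <= tol_s:
            matched += 1
            j += 1  # each peak of B can be matched only once
    return 2.0 * matched / (len(a) + len(b))


def median_sqi(peaks_a, peaks_b, duration_s, window_s=10.0, tol_s=0.05):
    """Median of the per-window scores over consecutive non-overlapping windows."""
    n_windows = int(duration_s // window_s)
    scores = [
        window_sqi(peaks_a, peaks_b, i * window_s, window_s, tol_s)
        for i in range(n_windows)
    ]
    return median(scores)
```

Peak times are given in seconds. Under this scoring, a recording in which the detectors disagree on a few isolated beats can still reach a median score of 1, because the median ignores a minority of imperfect windows.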

Data extraction
Data extraction is the key to success when using the ECG signals in CVD classification. The extracted parameters were the R-R interval, HRV, and BR. The R-R interval was the most common identifiable feature and was calculated from cardiac rhythm. This interval was measured using the detected QRS peaks, and the HRV was indexed using the RMSSD of the R-R interval.
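As a small worked illustration of these two features, the sketch below computes R-R intervals from R-peak times and the RMSSD as the square root of the mean squared difference between successive R-R intervals. The function names and the example peak times are hypothetical.

```python
import math


def rr_intervals(r_peak_times):
    """Time differences (in seconds) between consecutive R peaks."""
    return [t2 - t1 for t1, t2 in zip(r_peak_times, r_peak_times[1:])]


def rmssd(rr):
    """Root mean square of successive differences of the R-R intervals."""
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    if not diffs:
        raise ValueError("at least two R-R intervals are required")
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical R-peak times in seconds.
peaks = [0.0, 1.0, 1.9, 3.0]
rr = rr_intervals(peaks)   # approximately [1.0, 0.9, 1.1]
print(round(rmssd(rr), 4))
```

For these example peaks the successive differences are about -0.1 s and 0.2 s, so the RMSSD is the square root of 0.025.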
BR was extracted using the respiratory sinus arrhythmia (RSA). RSA, which is also known as respiratory-induced frequency variation [35], is the coupling between variations in heart rate and the respiratory cycle. Heart rate increases when a person breathes in and decreases when a person breathes out. The time difference between successive R peaks was used as the amplitude of the new waveform. The wave formed by RSA was resampled at a frequency of 4 Hz through spline interpolation to perform fast Fourier transform (FFT). After the FFT process, the RSA waveforms were filtered using an FIR band-pass filter with cutoff frequencies of 0.1 and 0.6 Hz, which are equivalent to respiratory rates of 6-36 breaths per minute, to eliminate non-respiratory frequencies [36]. The respiratory signals were identified in a sinusoidal form. The BRs were calculated from the total number of sinusoidal peaks per minute.
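A simplified version of this idea is sketched below. It deliberately swaps in simpler techniques than those named in the text: linear interpolation replaces spline interpolation, and the dominant spectral peak within the 0.1 to 0.6 Hz band (found with a direct discrete Fourier transform) replaces FIR band-pass filtering followed by peak counting. The function names are hypothetical, and the sketch is meant only to show how a breathing rate can be read from an R-R series.

```python
import math


def resample_linear(times, values, fs=4.0):
    """Resample an unevenly sampled series onto a uniform grid (linear interpolation)."""
    n = int((times[-1] - times[0]) * fs) + 1
    out = []
    j = 0
    for i in range(n):
        t = times[0] + i / fs
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        t0, t1 = times[j], times[j + 1]
        frac = (t - t0) / (t1 - t0)
        out.append(values[j] + frac * (values[j + 1] - values[j]))
    return out


def breathing_rate_bpm(r_peak_times, fs=4.0, band=(0.1, 0.6)):
    """Estimate breaths per minute from the dominant in-band frequency of the R-R series."""
    rr = [b - a for a, b in zip(r_peak_times, r_peak_times[1:])]
    beat_times = r_peak_times[1:]  # each R-R value is placed at the later beat
    x = resample_linear(beat_times, rr, fs)
    mean = sum(x) / len(x)
    x = [v - mean for v in x]  # remove the mean so DC does not leak into the band
    n = len(x)
    best_f, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f < band[0] or f > band[1]:
            continue
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    if best_f is None:
        raise ValueError("recording too short to resolve the respiratory band")
    return 60.0 * best_f
```

The frequency resolution is the reciprocal of the recording length, so a 2-min segment resolves the rate to roughly half a breath per minute.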

Statistical analysis
All categorical parameters were presented as numbers and percentages, whereas continuous parameters were presented as mean and SD. Statistical analyses were performed using the chi-square test and t-test for categorical and continuous data, respectively. Boxplots were used to represent the distribution of data between the case and control groups (Appendix). A p-value threshold of 0.05 was used for declaring significance.
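The two tests can be illustrated with the standard library alone, as in the sketch below. It assumes a 2 x 2 contingency table without continuity correction for the chi-square test and the pooled-variance (Student) form of the t-test, because the text does not state which variants were used. The function names and all numbers in the tests are hypothetical. The t-test p-value is omitted because it requires the Student t distribution, which is usually taken from a statistics package.

```python
import math
from statistics import mean, variance


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def pooled_t_statistic(x, y):
    """Two-sample t statistic with pooled variance and its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return t, df
```

With the 0.05 threshold used in the study, a parameter would be declared significant when the p-value returned by `chi_square_2x2`, or the p-value corresponding to the t statistic, falls below 0.05.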

Automatic classification
At this stage, predictive models were built from significant input data by using ML algorithms for CVD risk classification. Six state-of-the-art methods that are among the most widely used algorithms for CVD classification were applied: ANN [25], LDA [19], linear and quadratic SVMs [21]-[22], DT [23], and kNN [24].
The automatic classification model was developed using supervised ML methods on 60 samples. The data obtained were randomly divided into training and test data. Table 1 shows the distribution of the data used. The training and test datasets consisted of variables or features that exhibit significant associations with the case and control groups. The ratio of the training to test data was 2:1. Two linear and four nonlinear classifiers were used to categorize the training data into groups. LDA and linear SVM were the two linear classifiers used to create a linear function that can separate the control and case data. SVM with the quadratic kernel function (i.e., quadratic SVM), DT, kNN, and ANN were used as nonlinear classifiers to differentiate both groups. All models were trained with 40 training data (20 cases and 20 controls) by using the tenfold cross-validation method, which divided the training data into 10 equal subsets. Nine subsets were used to train the model, whereas the remaining subset was utilized to test the trained model. The accuracy of the first training model was determined. This process was repeated 10 times, once for each individual subset. The average of the 10 accuracy values was used to indicate the overall training performance of the trained models.
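The tenfold procedure can be sketched generically as follows. The helper names, the random seed, and the `fit` and `predict` callables are hypothetical placeholders for whichever classifier is being evaluated. The sketch does not stratify the folds by group, because the text does not say whether stratification was used.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, validation


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Average validation accuracy over the k folds."""
    scores = []
    for train, validation in kfold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in validation)
        scores.append(correct / len(validation))
    return sum(scores) / len(scores)
```

With the 40 training samples described above, each fold holds 4 samples for validation and 36 for training.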
The LDA algorithm determines the projection direction that maximizes the between-class variance and minimizes the within-class variance. This algorithm offers a linear transformation of predictor variables that provides accurate discrimination. LDA is suited to cases in which the measurements made on the independent variables for each observation are continuous quantities [18]. SVM is a binary classifier widely used for classification and regression [39]. This technique constructs an optimal hyperplane (the decision boundary between classes) that separates the classes [38]. The hyperplane is built by maximizing the margin, or space, between the boundary and the nearest data points of each class. For this study, the linear and quadratic kernel functions were used to find the respective linear and nonlinear relations of the selected inputs to the corresponding groups [39].
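For reference, the two criteria described above are commonly written as follows in the two-class case. These are the textbook formulations rather than expressions taken from the study. The symbols $S_B$, $S_W$, $\mu_0$, $\mu_1$, $w$, $b$, and the kernel offset $c$ are notation introduced here, and the hard-margin form of the SVM is shown for simplicity.

```latex
% Fisher's criterion for LDA: maximize between-class scatter relative to
% within-class scatter along the projection direction w.
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w},
\qquad
w^{*} \propto S_W^{-1} (\mu_1 - \mu_0)

% Hard-margin SVM: the margin between the classes is 2 / \|w\|, so
% maximizing the margin is equivalent to minimizing \|w\|^2.
\min_{w,\, b} \; \tfrac{1}{2} \|w\|^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad y_i \in \{-1, +1\}

% Quadratic (second-degree polynomial) kernel used in place of the inner product.
K(x, x') = \left( x^{\top} x' + c \right)^{2}
```

Replacing the inner product with the quadratic kernel lets the same margin-maximization problem produce a curved decision boundary in the original feature space.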
The DT classifier constructs a tree from the training data by using the five selected features. The tree provides the rules to classify case and control data, and the rules were used to determine the group of the test data. Designing the tree is important to increase the classification performance [40]. In this study, Gini's diversity index was used as the split criterion, with the maximum number of splits set to 100. The kNN algorithm identifies similarities among training inputs in groups or classes. New inputs are classified by measuring the minimum distance between the test and training data. Data points that are close to one another are called neighbors [41]. The Euclidean distance with 10 neighbors was applied in this study to assign the test data to the corresponding case or control group.
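The split criterion and the neighbor vote can be illustrated with the short sketch below. The function names and toy data are hypothetical, and the sketch assumes that features are already on comparable scales, because the text does not describe any feature scaling.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini's diversity index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def knn_predict(train_X, train_y, x, k=10):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    by_distance = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy two-feature example with well-separated groups.
controls = [(0.1 * i, 0.0) for i in range(12)]
cases = [(5.0 + 0.1 * i, 5.0) for i in range(12)]
train_X = controls + cases
train_y = ["control"] * 12 + ["case"] * 12
print(knn_predict(train_X, train_y, (0.2, 0.1)))  # prints "control"
```

A pure node has a Gini index of 0 and an evenly mixed two-class node has an index of 0.5, which is why a tree prefers splits that lower this value. With an even k such as 10, a 5-to-5 vote is possible, so a practical implementation needs an explicit tie-breaking rule.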
ANN is a learning method that emulates the human brain and is an outstanding method for predicting the relationship between the input and target values [42]. ANN has been widely used in cardiology applications for pattern recognition and classification tasks [39]. A feed-forward ANN can accurately classify ECG signals when the number of hidden layers, the number of hidden neurons, the learning algorithm, and the transfer function are optimized [43]. A two-layer feed-forward backpropagation network with five input neurons and one output neuron was used in this study. The network was trained with 10 different sets of initial weights and biases (random "seed" of 1-10), 30 different numbers of hidden neurons (1-30 hidden neurons), and two different training algorithms, namely, Levenberg-Marquardt ("trainlm") and gradient descent with an adaptive learning rate ("traingda"). The log-sigmoid transfer functions were used in both layers to scale the output from 0 to 1.
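The network structure and the size of the search described above can be sketched as follows. Only the forward pass is shown. The training algorithms named in the text ("trainlm" and "traingda") are MATLAB routines and are not reimplemented here. The uniform weight initialization and the function names are assumptions.

```python
import math
import random
from itertools import product


def logsig(z):
    """Log-sigmoid transfer function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def init_network(n_inputs=5, n_hidden=10, seed=1):
    """Random initial weights and biases for a 5-input, 1-output two-layer network."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    b1 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b2 = rng.uniform(-1, 1)
    return w1, b1, w2, b2


def forward(net, x):
    """Hidden layer and output layer both use the log-sigmoid, so the output lies in (0, 1)."""
    w1, b1, w2, b2 = net
    hidden = [
        logsig(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)
    ]
    return logsig(sum(w * h for w, h in zip(w2, hidden)) + b2)


# Every combination of seed (1-10), hidden neurons (1-30), and training algorithm.
search_grid = list(product(range(1, 11), range(1, 31), ("trainlm", "traingda")))
print(len(search_grid))  # 10 * 30 * 2 = 600 candidate networks
```

The output can be thresholded (for example at 0.5, which is an assumption) to assign a subject to the case or control group.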
Each trained model of every classifier was tested with 20 sets of test data (distinct from the training data) to examine the performance in terms of specificity, sensitivity, and accuracy. Specificity refers to the ability of the trained model to categorize healthy subjects in the control group. Sensitivity refers to the categorization of CVD risk subjects into the case group. Accuracy is the average of specificity and sensitivity and represents the overall performance of the model. The model of the six classifiers that can classify the test data into the respective group with the highest performance was selected as the best model for CVD risk prediction.
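The three measures can be computed from the predicted and true labels as in the sketch below. The function name and the example labels are hypothetical. The example assumes a test set of 10 cases and 10 controls with one subject misclassified in each group, which is one split consistent with 20 test samples and two errors.

```python
def classification_metrics(y_true, y_pred, positive="case"):
    """Return (specificity, sensitivity, accuracy) for a two-class problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    specificity = tn / (tn + fp)   # controls correctly kept in the control group
    sensitivity = tp / (tp + fn)   # cases correctly placed in the case group
    accuracy = (tp + tn) / len(y_true)
    return specificity, sensitivity, accuracy


# Assumed example: 10 cases and 10 controls, one error in each group.
y_true = ["case"] * 10 + ["control"] * 10
y_pred = ["case"] * 9 + ["control"] + ["control"] * 9 + ["case"]
print(classification_metrics(y_true, y_pred))  # (0.9, 0.9, 0.9)
```

When the two groups are the same size, the fraction of correctly classified subjects equals the average of specificity and sensitivity, which matches the definition of accuracy given above.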

Results and Discussion
Figure 1 shows an example of one dataset processed with the four different R-peak detectors over a 360-s ECG recording. The round colored marks in each graph show the detected R peaks. A signal of satisfactory quality has a median signal quality index (SQI) score of 1 across the recorded ECG signal. Throughout the evaluation, the 60 TMC recordings (30 cases and 30 controls) had a median SQI score of 1. In this step, the Pan-Tompkins and Zhang peak detectors were used because of their high R-peak detection performance on the ECG signals. The marks detected by these two peak detectors were more accurate than those of the Behar and Clifford detectors (Figure 1).
Table 2 summarizes the characteristics of the TMC participants. The systolic and diastolic blood pressures (SBP and DBP, respectively), total cholesterol, RMSSD, and R-R interval were statistically significant on the basis of the mean, SD, and obtained p-values. Moreover, the participants with a history of hyperlipidemia with medication showed significant differences (p = 0.032). In this study, the parameters with significant differences in their means and SDs were used as inputs for ML classification, except for the categorical parameter, i.e., a history of hyperlipidemia with medication. The final features consisted of HRV, R-R interval, SBP, DBP, and total cholesterol. Recent studies have shown that the RMSSD and R-R intervals extracted from ECG signals can predict CVD [44]-[45]. ML was recommended to improve the performance of detection and prediction models [46].
Table 3 lists the training and classification performances of all six classifiers on the basis of the five selected inputs. The perfectly trained ANN, which achieved 90% specificity, 90% sensitivity, and 90% accuracy, was the best model for distinguishing the case and control groups. Although only two sets of data were wrongly classified, the limited dataset caused significant percentage drops in the testing performance. The other trained models (except DT) were more sensitive to CVD risk than ANN in the training set but not in the testing one. The linear classifiers, especially the SVM with the linear kernel function, also showed comparable results. These results indicate that some features were linearly separable. Additional data are required to train and test additional inputs to improve the classification performance of the CVD risk prediction model.

Conclusion
This study found that the R-R interval, RMSSD, SBP, DBP, and total cholesterol were the most significant parameters in predicting CVD. These parameters were used as inputs for six ML techniques, namely, LDA, linear and quadratic SVMs, DT, kNN, and ANN. The outputs of these automated prediction systems were compared in terms of specificity, sensitivity, and accuracy. Among the six ML algorithms, ANN showed the highest performance (90% specificity, 90% sensitivity, and 90% accuracy). The results indicate that the predictive model consisting of ECG, SBP, DBP, and total cholesterol can be used to predict the CVD risk in the multiracial Malaysian population with approximately 90% accuracy by using the ANN ML technique. These findings, however, should be validated using a larger set of individuals than the set utilized in this study.

Figure 2
Figure 2 presents an example of heart rate intervals, variance of heart rate intervals, and respiratory signals extracted from the ECG of the case group. The mean heart rate interval was 1.0 s and was supported by a histogram. The mean BR extracted from the respiratory signal was 17 breaths per minute.

Table 1.
Distribution of data for classification

Table 2.
Descriptive characteristics of the TMC participants (N=60)

Table 3.
Training and classification performance of all six classifiers