A Machine Learning Model Utilizing a Novel SNP Shows Enhanced Prediction of Coronary Artery Disease Severity

Background: Machine learning (ML) has emerged as a powerful approach for predicting outcomes based on patterns and inferences. Improving the prediction of severe coronary artery disease (CAD) has the potential to personalize prevention and treatment strategies and to identify individuals who may benefit from cardiac catheterization. We developed a novel ML approach combining traditional cardiac risk factors (CRF) with a single nucleotide polymorphism (SNP) in a gene associated with human CAD (ID3 rs11574) to enhance prediction of CAD severity; Methods: ML models incorporating CRF along with ID3 genotype at rs11574 were evaluated. The most predictive model, a deep neural network, was used to classify patients into high (>32) and low (≤32) Gensini severity score groups. This model was trained on 325 patients and validated on 82 patients. Prediction performance of the model was summarized by a confusion matrix and the area under the receiver operating characteristic curve (ROC-AUC); Results: Our neural network predicted the severity score with 81% and 87% accuracy for the low and high groups, respectively, with an ROC-AUC of 0.84 for the 82 patients in the test group. The addition of ID3 rs11574 to CRF significantly enhanced prediction accuracy from 65% to 81% in the low group and from 72% to 84% in the high group. Age, high-density lipoprotein (HDL), and systolic blood pressure were the top 3 contributors to predicting the severity score; Conclusions: Our neural network including ID3 rs11574 improved prediction of CAD severity over use of the Framingham score, which may be helpful for clinical decision making in patients at increased risk of complications from coronary angiography.

Figure S1: Machine Learning Pipeline. The machine learning model development pipeline includes 3 main steps: 1) defining inputs and outputs, 2) data pre-processing (feature selection, data transformation), and 3) model optimization and model selection.

Interquartile range outlier filter
The box-whisker plot is a standard plot commonly used to visualize data distribution and highlight outliers. Outliers in a box-whisker plot are identified using the interquartile range (IQR), with a lower bound of Q1 − 1.5×IQR and an upper bound of Q3 + 1.5×IQR [1]. Box-whisker plots of our data, together with the lower and upper bounds used for outlier filtering, are shown in Figure S2. To evaluate the effect of the outliers on machine learning model performance, the optimized sequential neural network was also applied to the total number of subjects. Evaluation by confusion matrix and ROC-AUC indicates that the predictions were highly skewed toward the low-risk class and that prediction accuracy was compromised (Figure S2).
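As a minimal sketch of this IQR filtering step (assuming the data sit in a pandas DataFrame; the column names are illustrative, not the study's actual variable names):

```python
import pandas as pd

def iqr_outlier_filter(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Drop rows with values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in any column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[keep]

# Hypothetical usage on the continuous features:
# filtered = iqr_outlier_filter(data, ["age", "total_cholesterol", "hdl"])
```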

Data transformation and normalization
Continuous input data were transformed using the Yeo-Johnson power transformation with the maximum likelihood estimator [2]. The Yeo-Johnson transformation is in the same family as the Box-Cox transformation but does not require input values to be strictly positive. This family of power transformations is commonly used to transform non-normal data into more normally distributed data. The transformation is important because normality is either required by, or improves the performance of, various machine learning approaches. The increased normality of the data after transformation and the associated λ that maximizes the likelihood estimate for each feature are shown in Figure S3. The transformed data were then normalized using a robust scaling approach (equation shown below), which avoids scaling the data by the maximum value, a statistic that is sensitive to outliers. Categorical input data were transformed with OneHotEncoder [3]. OneHotEncoder works by splitting a categorical variable (e.g., sex = male or female) into multiple logical (true/false) variables (e.g., separate male and female categories) [3], each taking the value 1 (true) or 0 (false). For example, for a female patient, female = 1 and male = 0.
Normalization:

$$x' = \frac{x}{x_{95}}, \qquad x_{95} = 95\text{th percentile}, \quad \forall \text{ features}$$

Yeo-Johnson transformation:

$$\psi(y,\lambda) = \begin{cases} \left[(y+1)^{\lambda}-1\right]/\lambda, & \lambda \neq 0,\; y \geq 0 \\ \log(y+1), & \lambda = 0,\; y \geq 0 \\ -\left[(-y+1)^{2-\lambda}-1\right]/(2-\lambda), & \lambda \neq 2,\; y < 0 \\ -\log(-y+1), & \lambda = 2,\; y < 0 \end{cases}$$

Figure S3: Applying the Yeo-Johnson power transformation to stabilize variance, minimize skewness, and increase the normality of each input feature. Histogram plots (left) of the data before (above) and after (below) transformation indicate an increase in the normality of the data.
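A sketch of this pre-processing step with scikit-learn (X_cont and X_cat are placeholders for the continuous and categorical input arrays; the 95th-percentile scaling is implemented by hand because scikit-learn's built-in RobustScaler divides by the IQR instead):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer, OneHotEncoder

# Yeo-Johnson transformation with per-feature maximum-likelihood lambda
pt = PowerTransformer(method="yeo-johnson", standardize=False)
X_cont_t = pt.fit_transform(X_cont)      # X_cont: (n_samples, n_continuous)
print("fitted lambdas:", pt.lambdas_)    # one lambda per feature (cf. Figure S3)

# Robust scaling by the 95th percentile of each transformed feature
q95 = np.percentile(X_cont_t, 95, axis=0)
X_cont_scaled = X_cont_t / q95

# One-hot encoding of categorical inputs (e.g., sex, Id3 SNP genotype)
# (use sparse=False instead of sparse_output on scikit-learn < 1.2)
ohe = OneHotEncoder(sparse_output=False)
X_cat_t = ohe.fit_transform(X_cat)       # X_cat: (n_samples, n_categorical)

X_all = np.hstack([X_cont_scaled, X_cat_t])
```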

Feature Selection
We performed feature selection using the Spearman correlation between the Gensini severity score and each feature, together with two feature elimination approaches. Features were selected if they passed at least two of the three criteria. For the Spearman criterion, a feature passed if its correlation coefficient was > 0.08. The two feature elimination approaches were recursive feature elimination (RFE) [4] fitted with an XGBoost regressor model [5] and with a logistic regression model, where features are selected at a cumulative local maximum in the change of the cross-validation score. Briefly, the Spearman correlation between two variables is a rank-based approach to quantitatively assess their association; perfect correlation, anticorrelation, and no correlation yield Spearman correlation coefficients of +1, -1, and 0, respectively. RFE fits a specified model, ranks the features by the fitted model's accuracy score, and recursively removes a specified number of the weakest features per iteration [3]; "RFE on XGBoost regressor" and "RFE on logistic regression" therefore refer to RFE fitted with XGBoost and logistic regression models, respectively [5]. Applying the Spearman correlation criterion detailed above yielded the following 9 features: age, total cholesterol, triglyceride, HDL, systolic blood pressure, HbA1c, glucose, sex, and Id3 SNP (Figure S4). RFE on the XGBoost regressor selected an optimal set of 10 features: the same 9 features indicated by Spearman correlation plus hsCRP (Figure S4). RFE on logistic regression selected an optimal set of 9 features: age, total cholesterol, triglyceride, HDL, LDL, HbA1c, sex, smoking status, and Id3 SNP (Figure S4). The overlap of all three approaches yielded 7 features: age, total cholesterol, triglyceride, HDL, HbA1c, sex, and Id3 SNP (Figure S4). Applying our overall criterion of features passing at least two of the three approaches, we arrived at the following 9 input features for further assessing the machine learning models: age, total cholesterol, triglyceride, HDL, systolic blood pressure, HbA1c, glucose, sex, and Id3 SNP.
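A sketch of the two RFE criteria using scikit-learn's RFECV, which cross-validates each candidate subset size; the "cumulative local maximum" selection rule described above is approximated here by RFECV's built-in maximum of the CV score (X, y_gensini, and y_class are placeholder arrays):

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from xgboost import XGBRegressor

# RFE fitted with an XGBoost regressor against the continuous Gensini score
rfe_xgb = RFECV(XGBRegressor(), step=1, cv=5)
rfe_xgb.fit(X, y_gensini)

# RFE fitted with logistic regression against the high/low class label
rfe_lr = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
rfe_lr.fit(X, y_class)

print("optimal n_features (XGBoost):", rfe_xgb.n_features_)
print("optimal n_features (logistic):", rfe_lr.n_features_)
```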

Splitting training and testing sets
All patients were randomly split into 2 groups using a hold-out approach, with 80% of patients (n = 325) in a training set and 20% of patients (n = 82) in a testing set. The testing set was used only for evaluating the performance of the model, not for optimizing it. Patient characteristics were not significantly different between the training and testing sets, as described in Table 1. However, the proportion of patients with a low-risk severity score was higher in the training set despite its higher average severity score. To further understand this phenomenon, the distribution of severity scores in the training and testing sets was visualized with histogram plots (Figure S5). The distributions show that the training set contains some subjects with severity scores above 100 while the testing set contains none, which likely explains this apparent discrepancy.
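A sketch of the hold-out split with scikit-learn (the random seed is illustrative):

```python
from sklearn.model_selection import train_test_split

# 80/20 hold-out split; the testing set is reserved for final evaluation only
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_class, test_size=0.20, random_state=42
)
```

Note that passing stratify=y_class would force the high/low proportions to match between the two sets; a plain random split, as used here, does not guarantee this, consistent with the difference in class proportions described above.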

Model selection and optimization
We chose SVM models [6], tree/ensemble-based models (Randomforest [7], ExtraTree [8], Adaboost [9], GradientBoosting [11], XGBoost [5]), and neural network-based models (feedforward neural network and sequential neural network) for our high versus low Gensini severity score classification due to their generally good performance on classification tasks. Each type of model has a different set of parameters requiring optimization. The optimization was done on the training set using 10-fold cross-validation. The performance of each optimized configuration was evaluated using the cross-validated F1 score:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$
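A sketch of this optimization loop for one candidate model (the parameter grid is illustrative, not the study's actual search space):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# 10-fold cross-validation on the training set, scored by F1
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring="f1", cv=10)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```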

Brief description of each model
SVM: an algorithm that constructs a line or a hyperplane separating the data into classes. Kernel methods and soft-margin SVM are extensions that allow SVM to solve non-linear problems [6]. Tree/ensemble: a family of algorithms that performs classification using many decision trees. The introduction of bagging and boosting methods has significantly increased the prediction performance of this family of algorithms. Bagging creates several subsets of the training samples and uses each subset to build a decision tree; the average prediction of all decision trees is used as the final result. Boosting fits consecutive decision trees, using the error learned from the prior iteration to penalize the input features driving the prediction error during the next round of model building and assessment. Randomforest and ExtraTree are the two algorithms here that utilize the bagging method, while Adaboost, GradientBoosting and XGBoost are the three algorithms that use the boosting method [5,6,9,11].
Neural network: a family of algorithms that iteratively weights the contribution of each input feature, feeding it into non-linear neuron-like elements, to find the set of weights that most accurately predicts the output. After each input is weighted, the sum of all weighted inputs is passed through a non-linear activation function to predict the output. A feedforward neural network is the simplest neural network, with a direct connection between the input and output layers, while a deep neural network contains multiple layers between the input and output layers [12]. The performance of each machine learning model is shown in Figure S6.
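As a compact illustration of fitting one model from each family described above (default hyperparameters, not the optimized configurations):

```python
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import f1_score

models = {
    "SVM (RBF kernel)":            SVC(kernel="rbf"),
    "bagging (RandomForest)":      RandomForestClassifier(),
    "boosting (GradientBoosting)": GradientBoostingClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "F1 =", round(f1_score(y_test, model.predict(X_test)), 2))
```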

Optimizations
SVM models: we optimized the kernel function (RBF and polynomial), the degree of the kernel function (1 to 7), and the C value (1 to 3; the soft-margin SVM penalty). Tree/ensemble-based models (Randomforest, ExtraTree, Adaboost, GradientBoosting, XGBoost): we optimized the number of trees, the number of features considered at each split, the maximum number of splits per tree, the minimum number of samples per split, and the method of selecting samples for training each tree (bootstrap). Neural network-based models (feedforward, sequential neural network): we optimized the loss function, the activation functions, the optimizer, the type of regularization and the penalty parameter that best minimizes the model loss, the learning rate, and the early-stop implementation.
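As an example, the SVM search space described above maps onto a grid like the following (the exact encoding is an assumption; note that degree applies only to the polynomial kernel):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svm_grid = [
    {"kernel": ["rbf"],  "C": [1, 2, 3]},
    {"kernel": ["poly"], "degree": [1, 2, 3, 4, 5, 6, 7], "C": [1, 2, 3]},
]

svm_search = GridSearchCV(SVC(), svm_grid, scoring="f1", cv=10)
svm_search.fit(X_train, y_train)
print(svm_search.best_params_)
```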

Sequential Neural Network
The sequential (or deep) neural network (SNN or DNN) proved to be the best machine learning model, achieving the highest accuracy when fitted to our data. An SNN is a type of fully connected neural network with multiple hidden layers between an input layer and an output layer. These hidden layers transform the information in the input data and detect the relevant inputs by reducing dimensionality while allowing as many inputs as necessary. A detailed algorithm and mathematical representation of the model can be found in Denoyer et al. (2014) [12].
Model architecture: Our model consists of a 12-neuron input layer, a 12-neuron first hidden layer, a 6-neuron second hidden layer, a 3-neuron third hidden layer, and an output layer. The 12-neuron input layer receives 7 continuous inputs after transformation and normalization (age, total cholesterol, triglyceride, HDL, systolic blood pressure, HbA1c, and glucose) and 5 categorical inputs after OneHotEncoder transformation (male, female, and three Id3 SNP genotypes). In every neural network layer, a weight and bias matrix is applied to the incoming matrix (X for the input layer, h1 for the 1st hidden layer, h2 for the 2nd hidden layer, and h3 for the 3rd hidden layer), and the resulting matrices are passed through activation functions. The activation functions implement the neuron-like non-linear elements: the sum of weighted inputs must exceed a threshold for the neuron to "fire", i.e., output a non-zero value. A rectified linear unit (ReLU) serves as the activation function of every layer except the output layer (shown in dark blue boxes in Figure S7), and a sigmoid serves as the activation function of the output layer (shown in a green box in Figure S7). Once the input data propagate through all the layers to the final sigmoid activation function, the resulting vector is the predicted output vector. Using the training set data, this predicted output was compared with the assigned output by calculating the mean square error loss function (E). Backpropagation was used to calculate the gradient of the loss function with respect to the weights of the neural network, thereby arriving at weights that minimize the difference between the model prediction and the actual output data. The Adaptive Moment Estimation (Adam) optimizer was the best-performing gradient method and was used to minimize the model loss [10].

Figure S7: Sequential neural network algorithm. The input matrix is passed through 4 layers, comprising an input layer and 3 hidden layers. In each layer, L2-regularized weight and bias matrices are applied and the resulting vector is passed through an activation function. ReLU activation functions are used for the input layer and the first two hidden layers, and a sigmoid activation function outputs the final prediction. Backpropagation is applied to calculate the gradient of the mean square error (MSE) loss, which is minimized using the Adam optimizer.
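A minimal Keras sketch of this architecture (layer sizes follow the description above; the L2 penalty of 0.01 is taken from the model-fitting section below):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(0.01)  # L2 (ridge) penalty on the weights

model = tf.keras.Sequential([
    tf.keras.Input(shape=(12,)),  # 7 continuous + 5 one-hot categorical inputs
    layers.Dense(12, activation="relu", kernel_regularizer=l2),  # 1st hidden layer
    layers.Dense(6, activation="relu", kernel_regularizer=l2),   # 2nd hidden layer
    layers.Dense(3, activation="relu", kernel_regularizer=l2),   # 3rd hidden layer
    layers.Dense(1, activation="sigmoid"),  # output: probability of high severity
])
```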
Model fitting: The aim of model fitting is to maximize the predictive power of the model, i.e., to minimize the model loss. After model optimization, we determined the best loss function to be the mean square error loss with an L2 (ridge) regularization model-complexity penalty parameter of 0.01, and Adam to be the best optimizer.
Adam has an advantage over other optimizers in that it implements an adaptive learning rate for each parameter. This learning rate is tuned using bias-corrected, exponentially decaying averages of the past gradients and the past squared gradients (described in equations (2)-(4)); the resulting adaptive step appears as the factor $\eta / (\sqrt{\hat{v}_t} + \epsilon)$ in equation (1):

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\,\hat{m}_t \quad (1)$$

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\,g_t \quad (2)$$

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,g_t^2 \quad (3)$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \quad (4)$$

where $g_t$ is the gradient at step $t$, $\eta$ is the base learning rate, $\beta_1$ and $\beta_2$ are the decay rates of the two averages, and $\epsilon$ is a small constant for numerical stability. The larger the bias-corrected average of past gradients $\hat{m}_t$, the larger the parameter update. We also implemented an early stop in the iterative optimization procedure, monitoring maximum accuracy, to enhance the performance of our model (Figure S8). Plotting the number of iterations against the model loss for the training and testing sets shows that the two losses are comparable, demonstrating that our model complexity is appropriate and that we are not overfitting the training data (Figure S8).
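Continuing the architecture sketch above, a sketch of the fitting step with these choices (MSE loss, Adam, early stopping; the patience and validation fraction are illustrative):

```python
import tensorflow as tf  # model is the Sequential network defined above

# Compile with the MSE loss and the Adam optimizer
model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])

# Early stop: halt once the monitored accuracy stops improving
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy", mode="max", patience=20, restore_best_weights=True
)

history = model.fit(
    X_train, y_train,
    validation_split=0.1,   # held-out fraction for the early-stop monitor
    epochs=500,
    callbacks=[early_stop],
)
# history.history["loss"] vs. history.history["val_loss"] gives the
# training/validation loss curves compared in Figure S8.
```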