Machine Learning Improves Risk Stratification After Acute Coronary Syndrome

The accurate assessment of a patient's risk of adverse events remains a mainstay of clinical care. Commonly used risk metrics have been based on logistic regression models that incorporate aspects of the medical history, presenting signs and symptoms, and lab values. More sophisticated methods, such as Artificial Neural Networks (ANNs), are an attractive platform for building risk metrics because they can easily incorporate disparate pieces of data, yielding classifiers with improved performance. Using two cohorts consisting of patients admitted with a non-ST-segment elevation acute coronary syndrome, we constructed an ANN that identifies patients at high risk of cardiovascular death (CVD). The ANN was trained and tested using patient subsets derived from a cohort of 4395 patients (Area Under the Curve (AUC) 0.743) and validated on an independent holdout set of 861 patients (AUC 0.767). The 1-year hazard ratio for CVD associated with the ANN was 3.72 (95% confidence interval 1.04–14.3) after adjusting for the TIMI Risk Score, left ventricular ejection fraction, and B-type natriuretic peptide. A unique feature of our approach is that it captures small changes in the ST segment over time that cannot be detected by visual inspection. These findings highlight the important role that ANNs can play in risk stratification.

Each ECG signal is first partitioned into beats, and each beat is segmented into its constituent waves. The ST-segment of each beat is taken to be all sample points falling between the S-wave and T-wave labels generated by the segmentation algorithm; beats missing either an S-wave or a T-wave label, or both, were rejected as too noisy to use. The length of the ST-segment can vary widely between beats depending on the variation of the heart rate and the condition of the patient over time; segment lengths were found to range from fewer than 16 to more than 32 samples. Segments with fewer than 16 or more than 32 samples were rejected as either too noisy to use or improperly labeled by the segmentation algorithm. For segments with lengths between 16 and 32 samples, the midpoint of the segment was found and eight samples were taken on either side of it. Doing so helped to ensure that the samples used were part of the true ST-segment, even in the presence of small errors in the placement of the S-wave and T-wave labels. The set of 16 samples is stored in a vector called x, which is used to generate the Legendre polynomial representation of the segment as follows:

$$ s_j = \frac{\Phi_L^{T}\, x_j}{\sigma(x_j)} $$

Here, s_j is the Legendre polynomial expansion for beat j, x_j is the sample vector for beat j, Φ_L is as defined above, and σ denotes the standard deviation; T indicates the matrix transpose operation.
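For concreteness, the projection can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name `legendre_coefficients`, the uniform sampling grid over [-1, 1], and the use of NumPy's Legendre utilities are all assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

def legendre_coefficients(x, order=15):
    """Project a 16-sample ST-segment onto Legendre polynomials,
    following s_j = Phi_L^T x_j / sigma(x_j). Phi_L holds polynomials
    of degree 0..order evaluated on a uniform grid over [-1, 1]; the
    grid and the function name are illustrative assumptions."""
    x = np.asarray(x, dtype=float)
    t = np.linspace(-1.0, 1.0, len(x))   # map the 16 samples onto [-1, 1]
    phi = legvander(t, order)            # Phi_L, shape (16, order + 1)
    return phi.T @ x / x.std()           # scale by the segment's std. dev.

# Example: a gently sloping, slightly noisy 16-sample segment
segment = np.linspace(0.0, 0.1, 16) + 0.01 * np.random.randn(16)
coeffs = legendre_coefficients(segment)  # 16 coefficients; the first two feed the classifier
```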

Background on Neural Networks
Neural networks consist of a set of interconnected processing units that are loosely inspired by the supposed organisation of the human brain [1]. An example of a processing unit, which is referred to as an artificial neuron in analogy with biological neurons, is shown in Supplementary Fig. S1 [2]. The neuron receives inputs from an arbitrary number of other neurons, and each input is multiplied by a weight, which represents the synapse between two neurons. The weighted inputs are then summed in a process called integration, and the resulting sum is given as input to a nonlinear function called an activation function, the output of which may then be sent to other neurons or used in other computations. The choice of activation function depends on the purpose for which the neuron is being used. Common choices include the hyperbolic tangent and sigmoid functions, since these functions are bounded and therefore help prevent computations from becoming unbounded in large networks. A feedforward neural network (FNN), which is shown in Supplementary Fig. S2, is constructed by arranging a number of these neurons into cascaded layers; the name "feedforward" indicates that information flows from the input of the network to the output, without any of the intermediate computations being stored for future use. Each layer of the network consists of an arbitrary number of neurons that receive inputs from the neurons in the preceding layer, if there is one. Each neuron may receive inputs from all or a subset of the neurons in the preceding layer; the term "fully-connected" is applied to layers in which the neurons receive inputs from all of the neurons in the preceding layer. The first layer of the network is called the input layer and contains as many neurons as there are input variables. The final layer of the network is called the output layer and contains as many neurons as there are output variables. All intermediate layers, if there are any, are called hidden layers and may contain any number of neurons. Increasing the number of hidden layers generally increases the ability of the network to identify complex patterns in the data, but also increases the risk of overfitting.
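The integration-and-activation step above is easy to make concrete. The sketch below is illustrative only (the layer sizes and variable names are assumptions), showing a single neuron and a small fully-connected feedforward pass in NumPy.

```python
import numpy as np

def neuron(inputs, weights, activation=np.tanh):
    # Integration (weighted sum) followed by a bounded nonlinearity
    return activation(np.dot(weights, inputs))

def fully_connected_layer(inputs, weight_matrix, activation=np.tanh):
    # Each row of weight_matrix holds one neuron's weights, so every
    # neuron in the layer sees every input ("fully-connected")
    return activation(weight_matrix @ inputs)

# A tiny feedforward pass: 3 inputs -> 4 hidden neurons -> 1 output
x = np.array([0.5, -1.2, 0.3])
W_hidden = np.random.randn(4, 3)   # hidden-layer weights (illustrative)
W_out = np.random.randn(1, 4)      # output-layer weights (illustrative)
output = fully_connected_layer(fully_connected_layer(x, W_hidden), W_out)
```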
Recurrent neural networks (RNNs) are similar to FNNs in that they consist of cascaded layers of interconnected neurons, but unlike FNNs, RNNs store intermediate results for use in future computations; the term "recurrent" denotes this memory-like capability. An example of an RNN is shown in Fig. 1B in the main text. The essential difference between the FNN in Fig. S2 and the RNN in Fig. 1B is that the output of the hidden layer in the RNN in one step of the computation is used as input to the hidden layer in the next step of the computation. Reuse of previous results allows the network to identify long-term dependencies in sequential data, making RNNs useful for the analysis of text as well as time series data.
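The recurrence amounts to a single update rule applied at each step. The sketch below is a minimal illustration (biases omitted, random weights); the dimensions of two inputs and 13 hidden units anticipate the classifier described later.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, activation=np.tanh):
    # The previous hidden state h_prev is fed back in alongside the
    # current input x_t; this is the "memory" of the network
    return activation(W_xh @ x_t + W_hh @ h_prev)

W_xh = np.random.randn(13, 2)       # input-to-hidden weights
W_hh = np.random.randn(13, 13)      # hidden-to-hidden (recurrent) weights
h = np.zeros(13)                    # initial hidden state
for x_t in np.random.randn(20, 2):  # an example 20-step sequence
    h = rnn_step(x_t, h, W_xh, W_hh)  # h accumulates information across steps
```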
The training of a neural network is done in a supervised fashion, wherein the inputs are applied to the network and the resultant outputs are compared with the expected outputs; the training process seeks to optimise the choice of the weights of the interconnections between neurons so as to minimise the error between the expected and computed outputs. The error is measured by an objective function that is chosen according to the use of the network, and is sent back through the network in order to adjust the weights; this process is called "backpropagation." The adjustment of the weights is typically done using a method based on gradient descent, in which the gradients of the objective function with respect to the weights are computed from the errors calculated above and used to move the weights toward values that minimise the objective. The weights may be updated after the network sees all examples, after it sees each example, or after it sees batches of examples; the first approach is called batch gradient descent, the second is called stochastic gradient descent, and the third is called mini-batch gradient descent. A single round of training in which all examples in the training set are shown to the network is referred to as an "epoch." The order of the training examples is usually shuffled at the beginning of each epoch, and the network is often trained over many epochs.
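The three update schedules differ only in how many examples contribute to each gradient step. Below is a minimal sketch of one epoch, assuming a caller-supplied `grad_fn` that returns the gradient of the objective with respect to the weights; all names and the learning rate are illustrative.

```python
import numpy as np

def train_epoch(weights, X, y, grad_fn, lr=0.01, batch_size=32):
    order = np.random.permutation(len(X))   # shuffle examples each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        # One gradient step per batch of examples
        weights -= lr * grad_fn(weights, X[idx], y[idx])
    return weights
```

Setting batch_size=1 recovers stochastic gradient descent, and batch_size=len(X) recovers batch gradient descent; intermediate values give mini-batch gradient descent.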

RNN Classifier Architecture
The time series nature of the Legendre polynomial coefficient data necessitates the use of a classifier capable of processing sequences of data; as discussed above, an RNN is well suited to this problem. The RNN employed consists of an input layer of two units, a hidden layer of 13 units, and an output layer of one unit. Although 16 coefficients were generated from each ST-segment, only the first two were used in the final RNN classifier, since higher-order coefficients were found to capture too much noise in the data to offer accurate predictions. The number of hidden units was chosen so as to ensure that the number of free parameters (the weights) of the model was limited to no more than 10% of the number of patients in the training set; with two input units, 13 hidden units, and one output unit, there are 208 weights, which is less than 10% of the number of patients in the entire Cohort-1 dataset. The classification problem to be solved by the network is binary, so only one output unit is needed. Training of the network was done using 100 epochs with a batch size of 32 patients, as this configuration was found to produce the maximal AUC. The objective function optimised was the binary cross-entropy loss, which is commonly used for binary classification problems.
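This architecture maps directly onto a few lines of code. The sketch below uses Keras as an assumed framework; the text does not specify the software or the optimiser used, so both are illustrative choices rather than the authors' implementation.

```python
from tensorflow import keras

# Input: sequences of beats, two Legendre coefficients per beat
model = keras.Sequential([
    keras.layers.SimpleRNN(13, input_shape=(None, 2)),  # 13 hidden units
    keras.layers.Dense(1, activation="sigmoid"),        # one output unit (binary)
])
model.compile(
    optimizer="adam",               # optimiser is an assumption
    loss="binary_crossentropy",     # objective named in the text
    metrics=[keras.metrics.AUC()],
)
# model.fit(X_train, y_train, epochs=100, batch_size=32)  # as described above
```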

Supplementary Figures
Supplementary Figure S1: Individual processing unit, or neuron. Inputs are multiplied by corresponding weights, summed, and passed to a nonlinear activation function. The i's correspond to inputs, the w's to weights, and the o's to outputs [44].

The high-risk group consists of those patients whose raw LRHx+ST prediction value on the Cohort-2 holdout set lies above the upper-quartile value from the Cohort-1 training set. The number of patients remaining at each labeled time point in each risk group is shown below the plot.

Supplementary Table S1: Univariate hazard ratios for the LRHx, LRST, and RNN models from Cohort-1. LR, logistic regression; RNN, recurrent neural network; CI, confidence interval. Hazard ratios and CIs represent averages over 1000 trials (each trial yields one HR and one 95% CI).