End-to-end Risk Prediction of Atrial Fibrillation from the 12-Lead ECG by Deep Neural Networks

Background: Atrial fibrillation (AF) is one of the most common cardiac arrhythmias, affecting millions of people worldwide each year, and it is closely linked to an increased risk of cardiovascular diseases such as stroke and heart failure. Machine learning methods have shown promising results in evaluating the risk of developing atrial fibrillation from the electrocardiogram. We aim to develop and evaluate one such algorithm on the large CODE dataset collected in Brazil. Results: The deep neural network model identified patients without indication of AF in the presented ECG but who will develop AF in the future with an AUC score of 0.845. From our survival model, we obtain that patients in the high-risk group (i.e. with the probability of a future AF case being greater than 0.7) have a 50% chance of developing AF within 40 weeks, while patients belonging to the minimal-risk group (i.e. with the probability of a future AF case being less than or equal to 0.1) have more than 85% chance of remaining AF free up until after seven years. Conclusion: We developed and validated a model for AF risk prediction. If applied in clinical practice, the model has the potential to provide valuable information for decision-making and patient management.


Introduction
Atrial fibrillation (AF) is becoming progressively more common worldwide within an ageing population [1]. It is associated with adverse outcomes such as cognitive impairment and can lead to more severe heart diseases if not treated early. Previous studies have found a close link between AF and increased risk of death [2] and heart-related complications, such as stroke and heart failure [3,4,5]. Good assessment of patient risk can allow more frequent monitoring and facilitate early diagnosis. Early detection of the problem might allow anticoagulation treatment to be started and help prevent death and disability.
The electrocardiogram (ECG) is a convenient, fast, and affordable option used at many hospitals, clinics, primary and specialised health centres to diagnose many types of cardiovascular diseases. Over the past 50 years, computer-assisted tools have complemented physician interpretation of ECGs. Notably, deep learning has emerged as a promising avenue to enhance automated ECG analysis, with substantial progress in recent years [6,7,8]. Prior studies have predominantly explored the use of deep neural networks (DNNs) to automatically detect AF and other cardiac arrhythmias from standard 12-lead ECGs [9,10,11]. This advancement holds valuable implications for clinical decision support, offering auxiliary tools for diagnosing cardiac arrhythmias. However, while achieving consistent diagnoses in patients, even among those with established conditions, is essential, there remains a parallel need for systems that give timely and early warning for patients who are likely to develop AF.
Combining the features obtained from DNNs with survival methods is a promising approach for accurate risk prediction. Recent studies explored this approach for the risk prediction of heart diseases [12] and mortality [13,14]. The risk prediction of AF from the 12-lead ECG has been studied before with different approaches and varying degrees of success. Raghunath et al. [15] used DNNs for a dataset collected during 30 years to directly predict new-onset AF within one year and identified the patients at risk of AF-related stroke among those predicted to be at high risk of impending AF. The authors in [16] focused on predicting future AF incidents and the time to the event but used a DNN model trained on a different dataset, and the survival analysis spanned a longer period. From our group, Zvuloni et al. [17] performed end-to-end AF risk prediction from the 12-lead ECG but did not go further to implement survival modelling and estimate the time to the AF event. Further, Biton et al. [18] present a model that used digital biomarkers in combination with deep representation learning to predict the risk of AF. Their model uses a random forest classifier including features from a pre-trained DNN where the weights are kept fixed from a different ECG classification task.
The aim of our work is to bridge the gap between these studies. While these previous studies focused either on directly predicting future AF cases within a given time frame or incorporated DNNs trained on disparate datasets for survival modelling, there exists no comprehensive approach that combines the capabilities of DNNs in AF diagnosis with survival analysis techniques for estimating time-to-event outcomes. In contrast, our approach combines both of these aspects: firstly, by employing an end-to-end trained DNN to assess the risk of AF development, and secondly, by using the DNN's output to construct a time-to-event model that forecasts the occurrence of AF from the date of ECG examination. We demonstrate the effectiveness of the method, which offers accurate prognostic insights into AF occurrences. Further, we release implementation code and trained weights to facilitate future studies.

The dataset
The model development and testing were conducted using the CODE (Clinical Outcomes in Digital Electrocardiology) dataset [19]. The CODE dataset consists of 2,322,465 12-lead ECG records from 1,558,748 different patients. The ECG records were collected in 811 counties in the state of Minas Gerais, Brazil, by a public telehealth system, the Telehealth Network of Minas Gerais (TNMG), between 2010 and 2017. A detailed description of the recordings and the labelling process for each ECG exam of the CODE dataset can be found in [11].
Information about the patients was recorded together with their ECG tracings. The ECG signals are between 7 and 10 seconds long and recorded at sampling frequencies ranging from 300 to 600 Hz. The ECG records were re-sampled at 400 Hz to generate between 2800 and 4000 temporal samples. All ECGs are zero-padded to obtain a uniform size of 4096 samples for each ECG lead, which are then used as input to the convolutional model.
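The zero-padding step can be sketched as follows. This is a minimal illustration, not the released implementation: the function name zero_pad is hypothetical, and splitting the padding evenly between the two ends of the signal is an assumption, since the text does not state where the zeros are placed.

```python
import numpy as np

TARGET_LEN = 4096  # uniform number of samples per lead stated in the text


def zero_pad(ecg: np.ndarray, target_len: int = TARGET_LEN) -> np.ndarray:
    """Zero-pad a (n_leads, n_samples) ECG along the time axis.

    Assumption: the padding is split evenly between both ends.
    """
    n_samples = ecg.shape[1]
    if n_samples > target_len:
        raise ValueError("signal is longer than the target length")
    total = target_len - n_samples
    left = total // 2
    right = total - left
    # Pad only the time axis; the lead axis is left untouched.
    return np.pad(ecg, ((0, 0), (left, right)), mode="constant")


# A 7-second recording re-sampled at 400 Hz has 2800 samples per lead.
ecg = np.ones((12, 2800))
padded = zero_pad(ecg)
print(padded.shape)
```

A 10-second recording (4000 samples) is handled the same way, with 48 zeros on each side.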
The labels for AF in the CODE dataset were extracted from the text report produced by the expert who looked at the ECGs. To improve the quality of the annotations, some exams were reviewed by doctors; disagreement with the labels produced by the University of Glasgow automatic diagnosis software was used to select the exams to be reviewed. The procedure is described in detail in [11].

Problem formulation
The study considered patients in the CODE database with at least two ECG exams or that have AF. Patients were classified into three groups (NoAF, BaselineAF, FutureAF) according to the presence or absence of a record with AF condition and whether the record with AF is the baseline or not. The ECG exams from the patients were classified into three different classes, focusing on patients who undertook multiple exams. The classification process, which is illustrated in Figure 1, is detailed as follows:
• NoAF Class: all ECG exams from patients who recorded multiple exams without presenting an AF abnormality. We exclude the last exam for each patient and exams recorded within one week of the last exam.
• WithAF Class: all ECG exams that exhibit the AF condition.
• FutureAF Class: normal ECG exams from patients who had normal ECG exams at the beginning, but who were diagnosed with AF condition in a follow-up exam. The retained records were made before the patients were first diagnosed with AF condition. We exclude all subsequent normal exams after the first positive case, and exams made within one week before this case.
The one-week threshold was set to avoid dealing with paroxysmal atrial fibrillation, a brief episode of atrial fibrillation that usually stops within 24 hours and may last up to a week. We are interested in using predictions of the FutureAF class for predicting the long-term risk of AF, so we require exams to be at least one week apart to be considered follow-up exams. Hence, ECG exams recorded within one week before the first exam with the AF condition were not added to FutureAF. Similarly, exams for which we do not follow the patient for longer than one week were not added to NoAF. We used the remaining exams for developing and testing the model. In the final dataset, 637,514 exams (92.17%) belong to the class NoAF; 41,851 (6.05%) to the class WithAF; and 12,280 (1.78%) to the class FutureAF. This final dataset was split uniformly at random and by patient into a train set, a validation set and a test set: 60% of the data were allocated for training, 10% for validation and 30% for testing. Using a single train-validation split, as we have done, is common for large datasets such as ours because cross-validation becomes computationally expensive [9,11,20]. The split was made so that all ECG records belonging to one patient ended up in the same set.
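A patient-level split of this kind can be sketched as below. The function name split_by_patient, the seed and the toy patient identifiers are hypothetical. The sketch assigns 60/10/30% of patients (not of exams) to each set, which is an assumption; the resulting exam fractions are then only approximately 60/10/30%.

```python
import random


def split_by_patient(patient_ids, fractions=(0.6, 0.1, 0.3), seed=0):
    """Return one split name per exam; all exams of a patient share a split."""
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)  # uniformly random order of patients
    n_train = round(fractions[0] * len(unique))
    n_valid = round(fractions[1] * len(unique))
    split_of = {}
    for i, pid in enumerate(unique):
        if i < n_train:
            split_of[pid] = "train"
        elif i < n_train + n_valid:
            split_of[pid] = "valid"
        else:
            split_of[pid] = "test"
    # Map the patient-level assignment back to the individual exams.
    return [split_of[pid] for pid in patient_ids]


# Toy data: 100 patients with two exams each.
exam_patient_ids = [i // 2 for i in range(200)]
exam_splits = split_by_patient(exam_patient_ids)
print({name: exam_splits.count(name) for name in ("train", "valid", "test")})
```

Shuffling patients rather than exams is what guarantees that no patient contributes records to more than one set.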

DNN architecture and training
The DNN architecture in this study was based on a deep residual neural network implemented in previous studies [11,13]. The neural network consists of a convolutional layer followed by five residual blocks and ends with a fully connected (dense) layer that passes its output to a softmax to obtain three class probabilities for the classes NoAF, WithAF and FutureAF, which by construction add up to one. While the focus is on predicting the class FutureAF from ECG exams with an absence of the AF condition, we kept the exams belonging to the class WithAF to improve the performance of the model. Hence, the developed model also has the capability of conducting automatic AF diagnosis.
The DNN model was trained by minimising the average cross-entropy loss using the Adam optimiser [21]. Default parameters were used with a weight decay of 5 × 10^-4 to regularise the model. As the results obtained in [11,13] were satisfactory, this study kept most of the selected hyperparameters from these studies. Hence, no further hyperparameter tuning was performed. The initial learning rate was 10^-3 and was reduced by a factor of 10 whenever the validation loss remained without improvement for 7 consecutive epochs. The dropout rate was manually tuned between the values 0.8 and 0.5, with the latter resulting in improved performance. The training was performed until the minimum learning rate of 10^-7 was reached or for a maximum of 70 epochs. The model with the best validation results (i.e. minimum validation loss) during the optimisation process was saved and used as the final model, as a form of early stopping.
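The learning-rate schedule and stopping rule described above can be simulated in plain Python. This is a sketch under stated assumptions, not the training code used in the study: the function name train_schedule is hypothetical, the learning rate is reduced as soon as 7 consecutive epochs show no improvement, and training stops as soon as the learning rate reaches the minimum value.

```python
import math


def train_schedule(val_losses, lr=1e-3, factor=0.1, patience=7,
                   min_lr=1e-7, max_epochs=70):
    """Return (learning rate used at each epoch, index of the best epoch)."""
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    lrs = []
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        lrs.append(lr)
        if loss < best_loss:
            # New best validation loss: this is the checkpoint that is kept.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:
            lr *= factor
            bad_epochs = 0
            # Assumption: stop once the minimum learning rate is reached.
            if lr < min_lr or math.isclose(lr, min_lr):
                break
    return lrs, best_epoch


# Toy validation losses: improvement at epoch 1, then a plateau.
losses = [1.0, 0.9] + [0.95] * 20
lrs, best_epoch = train_schedule(losses)
print(len(lrs), best_epoch)
```

In this toy run the best epoch is epoch 1, so that checkpoint would be kept as the final model even though training continues at lower learning rates.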
Despite the pronounced class imbalance, we do not employ strategies such as over- or under-sampling to mitigate it. Over-sampling risks overfitting the minority class, while under-sampling discards numerous majority samples. Since our emphasis lies not on threshold-dependent metrics like accuracy, but rather on using the resulting class probabilities for the survival model, the class imbalance becomes less influential.

Model evaluation and metrics
After the training process, the performance of the DNN model was evaluated on the test data using classification evaluation metrics: sensitivity, positive predictive value (PPV), specificity, false positive rate, F1-score, the Receiver Operating Characteristic (ROC) curve, Area Under the Receiver Operating Characteristic Curve (AUC-ROC), Precision-Recall Curve and Average Precision (AP) score. This study first evaluated the performance of the model on the task of classifying the three groups: NoAF, WithAF and FutureAF, based on the class probabilities from the DNN model. We plotted the ROC curves, the precision-recall curves and the confusion matrix, and computed the AUC score and AP scores for each class. Next, an evaluation of the model considering only the FutureAF class and the NoAF class was performed to assess the ability of the model to distinguish normal exams within the two classes, in other words, to evaluate how the model performs at AF risk prediction for patients without AF. For this task, samples labelled as WithAF class were removed. The class probabilities for the NoAF class and for the FutureAF class were normalised for each instance to sum to one. Lastly, a probability threshold that maximises the F1-score for NoAF class and FutureAF class was selected, and the threshold-based metrics, namely sensitivity, PPV, specificity and F1-score, were computed. The threshold was obtained using the validation set, while all metrics including the plots were measured using the test set.
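The two-class renormalisation and the F1-based threshold selection can be sketched as follows. The function names and the toy scores are hypothetical, and trying every observed score as a candidate threshold is one simple search strategy assumed here; the text does not specify how the search was done.

```python
def renormalise(p_noaf, p_futureaf):
    """Rescale the two class probabilities to sum to one; return P(FutureAF)."""
    return p_futureaf / (p_noaf + p_futureaf)


def f1_at(threshold, scores, labels):
    """F1-score with FutureAF (label 1) as the positive class."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1_threshold(scores, labels):
    """Try every observed score as a threshold and keep the best one."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


# Toy validation set: renormalised P(FutureAF) and true labels (1 = FutureAF).
val_scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.8]
val_labels = [0, 0, 1, 0, 1, 1]
threshold = best_f1_threshold(val_scores, val_labels)
print(threshold)
```

The threshold found on the validation set would then be applied unchanged to the test set. The search is quadratic in the number of exams, which is acceptable for a sketch but not for a dataset of this size.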

Time-to-event models
This study considers non-parametric and semi-parametric methods for time-to-event prediction. Patients in the test set belonging to the class NoAF (191,665 recordings, 116,255 unique patients) and the class FutureAF (3691 recordings, 2016 unique patients) were considered for the time-to-event prediction. We used the Kaplan-Meier method [22] and Cox proportional hazard (PH) models [23].
The Kaplan-Meier method [22] (also referred to as the product-limit method) is a non-parametric method that provides an empirical estimate of the survival probability at a specific survival time using the actual sequence of the event times. Similar to other non-parametric methods, the advantage of the Kaplan-Meier method is that it allows for the analysis without distributional assumptions. On the other hand, the Cox PH model [23] allows us to adjust for different covariates and is hence also of interest for the analysis. Cox PH models are the most commonly used semi-parametric models for survival analysis. The model assumes that the covariates have an exponential influence on the hazard: the log-hazard of an individual is the sum of the log of the population-level baseline hazard and a linear function of the corresponding covariates.
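For reference, the two estimators described above can be written in their standard textbook form. The notation is introduced here for illustration and is not taken from the paper: t_i are the distinct event times, d_i the number of events at t_i, n_i the number of patients at risk just before t_i, h_0 the baseline hazard, x the covariate vector and \beta the coefficient vector.

```latex
% Kaplan-Meier (product-limit) estimate of the survival function
\[
  \hat{S}(t) = \prod_{i:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
\]

% Cox proportional hazards model and its log-hazard form
\[
  h(t \mid x) = h_0(t)\, \exp\!\left( \beta^{\top} x \right)
  \quad\Longleftrightarrow\quad
  \log h(t \mid x) = \log h_0(t) + \beta^{\top} x
\]

% Hazard ratio associated with a one-unit increase in covariate x_j
\[
  \mathrm{HR}_j = \exp(\beta_j)
\]
```

When a covariate is an indicator for a probability group, the hazard ratio compares that group with the reference group at any time t, which is exactly the proportional hazards assumption.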
We provide two analyses for the Cox PH model: in one analysis we adjust the model for age and gender, and in a second analysis we adjust the model for comorbidities in addition to age and gender. We consider 16 variables that were recorded during a patient visit, which include comorbidities, cardiovascular risk factors and cardiovascular drug usage, namely: use of diuretics, beta-blockers, converting enzyme inhibitors, amiodarone, or calcium blockers, obesity, diabetes mellitus, smoking, previous myocardial revascularization, family history of coronary heart disease, previous myocardial infarction, dyslipidemia, chronic kidney disease, chronic lung disease, Chagas disease, arterial hypertension. The observation time T is given in weeks. During the development of the Cox PH model, patients were subdivided into four groups according to ranges of the probability output of the DNN: [0, 0.1), [0.1, 0.4), [0.4, 0.7) and [0.7, 1.0]. The study used the first group of patients, having a predicted probability of less than 0.1, as a reference and produced hazard ratios for the remaining groups. For the Kaplan-Meier model, patients were grouped according to the same intervals. We used the lifelines Python library [24].
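The grouping and the product-limit estimate can be sketched in plain Python. This is an illustration only and not the lifelines implementation used in the study: the names risk_group and kaplan_meier and the toy follow-up data are hypothetical, and the bins are assumed to be closed on the right, as in the ranges reported with the results.

```python
import bisect

EDGES = [0.1, 0.4, 0.7]
LABELS = ["(0.0-0.1]", "(0.1-0.4]", "(0.4-0.7]", "(0.7-1.0]"]


def risk_group(p_futureaf):
    """Map a FutureAF probability to its risk group (right-closed bins)."""
    return LABELS[bisect.bisect_left(EDGES, p_futureaf)]


def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each event time using the product-limit rule.

    durations: follow-up time in weeks; observed: True if AF occurred,
    False if the patient was censored at that time.
    """
    n_at_risk = len(durations)
    surv = 1.0
    curve = []
    for t in sorted(set(durations)):
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        censored = sum(1 for d, e in zip(durations, observed) if d == t and not e)
        if events:
            surv *= 1 - events / n_at_risk
            curve.append((t, surv))
        # Patients with an event or censored at t leave the risk set.
        n_at_risk -= events + censored
    return curve


# Toy cohort of five patients in one risk group.
weeks = [5, 10, 10, 20, 30]
had_af = [True, False, True, True, False]
print(risk_group(0.85), kaplan_meier(weeks, had_af))
```

In practice one curve is estimated per risk group by calling the estimator on the patients whose probability falls in that group.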

Results
We developed a model to predict whether a patient belongs to the classes NoAF, WithAF or FutureAF. Our results for the classification task are available in the supplementary material. Since our ultimate goal is to predict the risk of a future AF event, we present here the ability of the model to predict the class FutureAF and the results from survival analysis.

AF risk prediction and survival analysis
The DNN model outputs class probabilities for the three classes. In a first analysis, we excluded exams from the class WithAF in order to study the ability of the model to distinguish between FutureAF and NoAF. We compute the performance metrics using the probability of FutureAF against that of NoAF. In Table 1 we display the confusion matrix, where the predicted values are compared against the true values. In Figure 2 we show the ROC curve and the AUC-ROC score obtained for this case. The AUC-ROC score was equal to 0.845. This reveals that the model can detect elements in each class. Figure 3 displays the PR curves and the calculated average precision (AP) scores. The AP score for the class FutureAF was quite small (AP = 0.22) and its PR curve had a low area under the curve. This suggests that the model is unable to provide both high sensitivity and high PPV at once for exams in the class FutureAF. An option for applying the model on the prediction task between two classes is to select a threshold that maximises the F1-score, i.e. putting equal weights on sensitivity and PPV. The threshold was computed using the validation set and was applied to the classification task for both the validation set and the test set. The obtained optimal probability threshold was equal to 0.1043 and the corresponding performance metrics are shown in Table 2. All the metrics consider the class FutureAF as the positive class. The sensitivity and PPV values on the test set are 0.322 and 0.247, respectively. In contrast, the specificity is very high (0.981), which is mainly due to class imbalance.
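A short calculation illustrates how a high specificity and a low PPV can coexist under class imbalance. It is a sketch that assumes the metrics were computed on the test-set counts stated for the time-to-event analysis (191,665 NoAF and 3691 FutureAF recordings); the function name ppv_from_rates is hypothetical.

```python
def ppv_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Positive predictive value implied by sensitivity, specificity and class sizes."""
    true_pos = sensitivity * n_pos
    false_pos = (1 - specificity) * n_neg
    return true_pos / (true_pos + false_pos)


# Reported test-set metrics and class sizes (FutureAF is the positive class).
ppv = ppv_from_rates(0.322, 0.981, n_pos=3691, n_neg=191665)
print(f"implied PPV: {ppv:.3f}")
```

Under these assumptions the implied PPV comes out close to the reported 0.247: because NoAF exams outnumber FutureAF exams by roughly 50 to 1, even a false positive rate of about 2% produces more false positives than true positives.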
The class probabilities from the DNN model belonging to the class FutureAF were used to develop survival models. Two Cox PH models were implemented, one adjusted for age and gender, and another adjusted for comorbidities in addition to age and gender. Table 3 shows the hazard ratios of patients whose probabilities for the class FutureAF belong to one of the groups (0.1-0.4], (0.4-0.7] and (0.7-1.0], taking patients in the group (0.0-0.1] as a reference. As the table indicates, the hazard of developing AF increases when moving from a lower to a higher probability range. For the Cox PH model adjusted for age, gender and comorbidities, the probability range (0.7-1.0] had the highest hazard ratio, equal to 40.869 (95% CI: 32.83-50.87; P < 0.005). During the model assessment, however, some covariates (the three probability ranges in this case) did not pass the proportional hazards test, i.e. the null hypothesis of proportional hazards was rejected. This led the study to use a non-parametric model for further survival analysis, and a Kaplan-Meier approach was used to this end.
The survival curves generated with the Kaplan-Meier estimator are displayed in Figure 4. Note that, in the context of our study, survival time refers to the time to the event, which is the development of AF, and not to actual mortality-related survival. Therefore, survival probability refers to the likelihood that no event occurs. The shaded area highlights the 95% confidence interval of the survival probability at different survival times (exponential Greenwood confidence intervals were used [25]). Patients within the lowest risk group maintained survival probabilities greater than 0.8 during the study period of about seven years. The survival probability decreases at a higher rate when moving from patients in a lower probability range to patients in a higher probability range. The median survival times for patients in probability groups (0.0-0.1], (0.1-0.4], (0.4-0.7] and (0.7-1.0] are infinity, 248, 82 and 40 weeks respectively. The median time without developing AF is the point in time at which 50% of the patients in a group are expected to have developed the condition. This means, for example, that patients in the first cohort (probability range (0.0-0.1]) have a greater than 50% chance of not developing AF within seven years, which is why their median is reported as infinity, while patients in the last cohort (probability range (0.7-1.0]) have a 50% chance of developing AF within 40 weeks (less than a year).
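The way a median survival time is read off a Kaplan-Meier curve, including the case where it is reported as infinity, can be sketched as below. The function name median_survival_time is hypothetical and the two curves are made-up illustrative values, not the curves of Figure 4.

```python
import math


def median_survival_time(curve):
    """First time at which S(t) <= 0.5; infinity if the curve never gets there.

    curve: list of (time in weeks, survival probability), sorted by time.
    """
    for t, s in curve:
        if s <= 0.5:
            return t
    return math.inf


# Illustrative curves only (hypothetical values).
high_risk_curve = [(10, 0.90), (25, 0.70), (40, 0.50), (82, 0.30)]
low_risk_curve = [(50, 0.99), (200, 0.93), (350, 0.86)]

print(median_survival_time(high_risk_curve))
print(median_survival_time(low_risk_curve))
```

A group whose survival probability stays above 0.5 for the whole follow-up period has no finite median, which is the situation described for the lowest-risk group.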
A table below the survival curve in Figure 4 shows the number of patients at risk, censored patients (i.e. no further follow-up or the event time is beyond the study period) and patients with AF at different time intervals (50 weeks each time interval). Taking the event times 0 and 50 weeks as an example, for patients within the probability range (0.0-0.1], the number of patients at risk was 129,369 (68%), censored cases were 60,091 and 794 (0.42%) AF events were recorded after 50 weeks; while for patients within the probability range (0.7-1.0] the number of patients at risk was 61 (33.7%), censored cases were 26 and 94 (51.9%) AF events were recorded. This again provides an estimate of the time to event for patients in different risk groups.

DNN model performance
The DNN model produced a good AUC score for the class FutureAF, which suggests its potential for predicting this class. However, the AP score obtained for this class (AP = 0.22) gives a more realistic picture of the actual ability to predict it. The low score reveals the difficulty in predicting this class and suggests that there would be many false positive cases (incorrectly predicting the class FutureAF) regardless of the threshold.
Regarding the risk prediction task (normal ECG exams in FutureAF vs NoAF), the DNN model produced low sensitivity and PPV values, as shown in Table 2 (the probability threshold here maximises the F1-score). However, the specificity was as high as 0.982. This indicates that most of the exams that are truly negative are predicted as negative, so false positives are rare relative to true negatives. Hence, the information from this prediction task can be of value during a screening of a large population, i.e. one can consider that only approximately 1.8% of the individuals who will not develop AF are flagged as being at risk.

Survival analysis
The survival analysis implemented in this study provided additional and valuable information about the risk level and an estimate of the time to the event of having an AF condition. The Cox PH model produced the hazard ratios for patients belonging to four different probability groups, taking the group with the lowest risk as a reference. The Cox PH model failed the proportional hazards test; still, it provides insight into the risk level incurred by patients in different groups. As stated in [24], a model that does not meet the proportional hazards assumption can still be useful for prediction (e.g. predicting survival times) as opposed to making inferences. Recent work also suggests that virtually all real-world clinical datasets will violate the proportional hazards assumption if sufficiently powered and that statistical tests for the proportional hazards assumption may be unnecessary [26].
To understand the influence of a class probability group on the survival duration, a Kaplan-Meier model was implemented. The results showed that patients in the highest risk group (FutureAF class probability range of (0.7-1.0]) were approximately 60% likely to develop AF within one year, compared with less than 15% of patients in the minimal risk group (FutureAF class probability range of (0.0-0.1]) who would develop the condition within the complete time span of seven years. These findings demonstrate the ability of the DNN model to identify patients with impending AF and to stratify them into different risk levels. Compared to the study in [18], which used digital biomarkers from the raw 12-lead ECG, clinical information and features from deep representation learning to make AF risk prediction, our approach learns predictive features directly from the raw ECG signal. This removes the need to extract biomarkers from the ECG signal and simplifies the ECG processing pipeline. It is also worth mentioning that the median survival time obtained in [18] is more than two years for patients in probability group (0.8-1.0]. Even though the methods used to produce the survival curves differ (Cox PH model versus Kaplan-Meier), as do the classifiers (random forest versus neural network with softmax), their results seem less alarming than those in this work, where 50% of patients in the probability group (0.7-1.0] are likely to develop AF within 40 weeks (less than one year). This difference in median survival times may also be attributed to the fact that the study in [18] used a random forest classifier while this study uses a neural network with a softmax output for classification.

Clinical implications
Patients with clinical AF who are not taking anticoagulant medication have an elevated risk of stroke, and the strokes caused by AF are more severe than strokes from other causes [27]. AF does not always cause symptoms, and for roughly 20% of the population, stroke is the first manifestation of AF [28]. Thus, there is a lot of interest in detecting cases of AF before the occurrence of a stroke, by systematic screening for asymptomatic AF [29] or, more recently, by the recognition of those in sinus rhythm who will develop AF in the future [9,17,18,30,31]. Among the risk scores that use clinical variables, the CHARGE-AF risk score is one of the most accurate and well-validated and uses variables readily available in primary care settings [30]. A recent review of risk scores based on clinical variables for prediction of AF [31] found that 14 different scores are potentially useful, with AUC-ROC values between 0.65 and 0.77 for the general population, with best results for the CHARGE-AF and MHS scores. Risk scores based on standard 12-lead ECGs are a promising tool considering both practical and technical questions [9,17,18]. Reported studies, including ours, showed much higher discrimination capacity, with AUC-ROC values of around 0.85 or higher. Since ECGs are routinely performed in most subjects at risk, i.e. those older than 60 years, the prediction can be obtained automatically, without the need to input variables into a risk calculator. In this study, we also provide semi-parametric and non-parametric time-to-event models that might help inform doctors of the development of the disease for each group of patients. The model was tested in cases where the disease could be observed up to seven years after the examination, providing a more complete picture for the use of this model in clinical practice.
The ability to accurately recognise patients who have a high chance of developing AF may allow intensified surveillance of those patients, with early recognition of the appearance of AF. In this case, the early institution of anticoagulant treatment could prevent the drastic event of a stroke and change the natural history of this condition. Moreover, new therapies to prevent AF could be developed and used for preventing not only the stroke but potentially the whole set of complications related to the appearance of AF. All these clinical applications of the method deserve to be tested in controlled clinical trials, but preliminary prospective studies confirmed that AI-augmented ECG analysis could be helpful, at least, to recognise those at higher risk of developing AF [32].

Limitations
One limitation lies in the dataset used for model development and testing. Many of the patients who were considered as all-time normal (without AF during the whole data collection period) had dropped out of the follow-up before the study period ended or had a relatively short time interval between their first and last ECG records. Therefore, it is impossible to tell with certainty whether an individual was at no risk of developing AF within seven years. Censored data are common in survival analysis; in standard supervised learning, however, an ideal dataset would consist of patients who had recorded ECG exams regularly for the considered study period. Moreover, we do not show that our model performs better than existing clinical scores such as CHARGE-AF [30].
Similar to a statement in [18], during data selection, there was a bias towards individuals who had a cardiac disease or a forthcoming heart condition, since all the patients considered had attended multiple medical visits. The AF label is also solely based on the ECG analysis. This label might contain errors from medical mistakes and from problems in the extraction of the label (see [11] for a more complete discussion of the labelling process). As a result, some FutureAF exams might be AF cases that were missed during the ECG analysis. Finally, the model is developed and tested solely on patients from Brazil, and external validation in other cohorts is needed to verify the performance of the model in other populations.

Conclusion
This study employed ResNet-based convolutional DNNs for end-to-end AF risk prediction from 12-lead ECG signals. The trained DNN effectively identified ECG signal changes indicative of AF development, facilitating risk prediction and survival analysis. By integrating DNN probabilities into Cox PH and

Figure 1: Diagram of patients groups and exams categories.

Figure 3: The precision-recall curves and AP scores for FutureAF class versus NoAF class.Recall denotes the sensitivity, and precision denotes the positive predictive value.
* We adjust for the following comorbidities, cardiovascular risk factors, and drug usage: use of diuretics, beta-blockers, converting enzyme inhibitors, amiodarone, or calcium blockers, obesity, diabetes mellitus, smoking, previous myocardial revascularization, family history of coronary heart disease, previous myocardial infarction, dyslipidemia, chronic kidney disease, chronic lung disease, Chagas disease, arterial hypertension.

Figure 4: Survival curves for the different cohorts based on their probability range using the Kaplan-Meier model.

Table 2: Performance metrics on the task of predicting the class FutureAF versus NoAF.

Table 3: Hazard ratios for different probability groups from the Cox PH model.