Study On ECG Data Dependency For Atrial Fibrillation Detection Based On Residual Networks

Atrial fibrillation (AF) is an arrhythmia that can cause blood clots and may lead to stroke and heart failure. To detect AF, deep learning-based detection algorithms have recently been developed. However, deep learning models are often trained on limited datasets and evaluated within the same datasets, so their performance generally drops on external datasets, a problem known as data dependency. In this study, three different databases from PhysioNet were used to investigate the data dependency of a deep learning-based AF detection algorithm using the residual neural network (Resnet). Resnet 18, 34, 50, and 152 models were trained with raw electrocardiogram (ECG) signals extracted from each database independently. The highest accuracy, about 98–99%, was obtained on the test dataset from the same database used for training. In contrast, the lowest accuracy, about 53–92%, was obtained on external datasets extracted from a different source. These results show that data dependency arises between the training dataset and the test dataset. However, the data dependency decreased when a large amount of training data was used.


Introduction
Atrial fibrillation (AF) is the most common cardiac arrhythmia, characterized by an irregular or rapid heartbeat. The number of AF patients is expected to increase by 12.1 million [1], and the related cost of AF is estimated at USD 6-26 billion per year [2] in the US. Furthermore, AF can not only cause thrombosis, which can lead to stroke, but also contribute to heart failure and other heart diseases [3]. AF raises the risk of stroke five times [4] and the risk of death twice [5] compared to a healthy person. Therefore, considering the social cost of healthcare and the quality of life, early and accurate detection of AF is important and beneficial. In the clinical environment, AF is detected manually by visual inspection of electrocardiogram (ECG) recordings. Cardiologists inspect ECG recordings collected over about 24 hours by an ambulatory ECG device (Holter monitor). However, manually inspecting large amounts of ECG recordings can be tedious and time-consuming [6,7]. Also, the time and frequency components of the ECG are too subtle for accurate and consistent manual inspection [7]. A study showed that the manual inspections of many primary care practitioners are insufficient for accurate detection of AF [8]. This implies that there are limitations in detecting hidden patterns of AF and that extensive training of clinicians is necessary to find AF effectively.
Recently, with the emerging research on artificial intelligence (AI), automatic AF detection algorithms have been developed to resolve the above problems. Reported AI-based AF detection algorithms generally utilize machine learning or deep learning techniques. Machine learning-based AF detection algorithms employ features that are measured or calculated from the original ECG signal [9][10][11][12][13][14][15][16][17]. This feature extraction step is important for machine learning-based AF detection algorithms. However, it is generally the most time-consuming process in developing those algorithms. In recent years, deep learning-based AF detection algorithms have been developed. Deep learning is an AI technique that automatically trains a computational model to solve complex problems. The model learns a representation of the data by training multiple processing layers. Afterward, the trained model can be used to predict events on new data, with performance that can exceed human level. Due to these advantages, deep learning techniques are now widely used in various healthcare applications such as medical imaging, drug discovery, and genomics. However, despite their high performance, most deep learning-based algorithms developed in the healthcare field suffer from data dependency: the developed algorithm generally works well within the database used for its development, but its performance generally drops when it is applied to another database. Unlike general applications of deep learning, healthcare data is highly heterogeneous, ambiguous, noisy, and incomplete. Furthermore, healthcare data collected from different medical institutions, hospitals, or devices is uneven and not uniform, which can lead to worthless analysis [24]. To avoid adverse effects on patients, thorough validation is necessary before applying a deep learning-based algorithm to healthcare data. Validation using external data collected from various devices or institutions is important for evaluating the generalization performance of a deep learning-based algorithm. However, deep learning-based algorithms are generally validated on the internal database used for their development. For example, in medical imaging applications, including radiology, ophthalmology, and pathology diagnostic analysis, most deep learning-based algorithms did not employ validation using an external database [25].
A deep learning model built with AF data collected under different settings, such as sampling frequency, resolution, and acquisition environment, may suffer from data dependency. There are several open databases for heart-related research, and many previously reported AF detection algorithms were developed using these databases. However, most research does not consider data dependency, which can be a problem when the algorithms are used in a real environment. In this study, to quantify this data dependency, we experimentally investigated the data dependency of deep learning models for AF classification built with those open databases.

Results
The experiments were implemented in Python with Keras on a computer with an Intel(R) Core(TM) i7-6900K CPU at 3.20 GHz, an NVIDIA GeForce GTX 1080 Ti GPU, and the Windows 10 operating system.
We evaluated different Resnet models (Resnet 18, 34, 50, and 152), as shown in Table 3. For each trained model, the highest accuracy is about 98–99% on the test dataset of the internal database, whereas the lowest accuracy is about 53–92% on the test dataset of an external database. For the Resnet 18 model, the differences between the highest and lowest accuracy are 17.26% for the model trained on the LTAFDB, 18.87% for the AFDB, and 44.59% for the MITDB. For the Resnet 34 model, they are 17.97% for the LTAFDB, 21.24% for the AFDB, and 44.10% for the MITDB. For the Resnet 50 model, they are 16.90% for the LTAFDB, 20.44% for the AFDB, and 45.89% for the MITDB. For the Resnet 152 model, they are 16.42% for the LTAFDB, 19.34% for the AFDB, and 42.92% for the MITDB. There is no significant performance difference according to the number of layers, so the following experiments were executed with the Resnet 50 model, which has a medium depth among the models used in the experiment.
The model accuracy decreased when evaluated on an external dataset extracted from a different source. Resnet generally performs well without exploding or vanishing gradients even when the network is much deeper. However, the data dependency occurs regardless of the depth of the Resnet architecture. Therefore, a deeper network cannot resolve the data dependency. On unseen data, when the true positive rate increases, the false positive rate also tends to be higher. The true negative rate and false negative rate show the same trend. Unlike the evaluation results on the internal database, if a model shows high sensitivity on external data, its specificity is low. Similarly, high specificity on external data leads to low sensitivity in the trained models.
These results imply that the trained model may be biased toward predicting the external data as either positive or negative.
The MITDB is imbalanced and has the smallest amount of data among the databases used in this study. The models trained on it show the largest data dependency in the experimental results. In contrast, the LTAFDB has the largest amount of data among them, and the data dependency of the model trained on the LTAFDB is lower than that of the other trained models. These results imply that training the deep learning model with a large amount of balanced data can decrease the data dependency of the AF detection algorithm. However, acquiring and using a large amount of AF data may be difficult because of patient privacy and legal issues surrounding healthcare data.
When evaluated on external data extracted from the NSRDB, all Resnet 50 models trained on the LTAFDB, AFDB, and MITDB showed a specificity of more than 95% and a false positive rate of about 2-4%. The specificities of the trained models tested on the NSRDB were higher than those obtained on the other databases, except for the database used to build each model. These results imply that the normal sinus rhythm of healthy subjects suffers less from data dependency.
Data dependency in building AI models can be caused by several factors. Data imbalance is one of the most common causes. If a database has far fewer AF events than normal rhythms, which is common in medical databases in general, the model can be biased and its performance can drop when tested on an external database. Another problem is noise in ECG signals caused by motion artifacts or other sources. Physical movement of the patient during ECG measurement can cause baseline wander or unwanted noise. This noise can be reduced by digital filtering or other signal processing. However, these processes can distort or remove characteristic waveforms of the original ECG, and the processed signals can have different characteristics depending on the preprocessing method. The performance of AI models can therefore be lowered by differences in the preprocessing methods of the databases. A further problem is the discrepancy in measuring hardware. Several companies make devices for measuring ECG. These devices differ in hardware settings, such as amplifier configuration, filters, and gain, and in software settings, such as sampling frequency and resolution. The waveforms from approved ECG devices do not differ greatly, and resampling or normalization techniques can reduce these problems but cannot resolve them completely.
It can be concluded that, to avoid data dependency, deep learning-based AF detection algorithms need to be validated using various external databases at the development stage.
Limitation
There are some limitations in this study. First, this study was implemented with only one deep learning architecture, Resnet. Evaluation using other deep learning architectures would be helpful in investigating the data dependency. Second, the "Non-AF" classes in the MITDB and LTAFDB are composed of normal sinus rhythm, but the "Non-AF" class in the AFDB is composed of all other rhythms because of the absence of a normal sinus rhythm annotation. This limitation may lead to low specificity when evaluating on the AFDB. However, the sensitivity could still be effectively reflected. Third, the data used in this study come from three open-source databases, LTAFDB, AFDB, and MITDB. Using more databases collected from various locations, devices, and settings would help make the results more robust.

Methods
We train the deep learning model using three different AF databases and evaluate the data dependency using data not used for training. The training method is described in Figure 3.
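The protocol above can be sketched as a cross-database evaluation grid, in which each trained model is tested on its own database (internal) and on the others (external). This is a minimal illustration, not the authors' code: `evaluation_grid` and `accuracy_gap` are hypothetical helper names, and the accuracy values in the usage example are made-up placeholders rather than results from this study. The gap is taken as the highest minus the lowest accuracy of one trained model, which is how the accuracy differences are reported in the Results.

```python
from itertools import product

DATABASES = ["LTAFDB", "AFDB", "MITDB"]  # the three training databases named in the text


def evaluation_grid(databases):
    """List every (train, test) pairing and label it internal or external."""
    grid = []
    for train_db, test_db in product(databases, repeat=2):
        kind = "internal" if train_db == test_db else "external"
        grid.append((train_db, test_db, kind))
    return grid


def accuracy_gap(accuracy_by_test_db):
    """Highest minus lowest accuracy across test databases for one trained model."""
    values = list(accuracy_by_test_db.values())
    return max(values) - min(values)


if __name__ == "__main__":
    for train_db, test_db, kind in evaluation_grid(DATABASES):
        print(f"train={train_db:7s} test={test_db:7s} ({kind})")
    # Placeholder accuracies (percent) for one hypothetical model, NOT study results.
    placeholder = {"LTAFDB": 99.0, "AFDB": 90.0, "MITDB": 85.0}
    print(f"gap = {accuracy_gap(placeholder):.2f}")
```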

Open database
Three open databases on PhysioNet are used: the Long-Term Atrial Fibrillation database (LTAFDB) [26], the MIT-BIH Atrial Fibrillation database (AFDB) [27], and the MIT-BIH Arrhythmia database (MITDB) [28]. The LTAFDB consists of recordings from 84 subjects with paroxysmal or sustained atrial fibrillation; each is a two-channel ECG signal digitized at 128 Hz with 12-bit resolution over a 20 mV range for about 24 to 25 h [26]. The annotated rhythms are normal sinus rhythm (N), supraventricular tachyarrhythmia (SVTA), ventricular tachycardia (VT), atrial fibrillation (AF), ventricular bigeminy (B), ventricular trigeminy (T), idioventricular rhythm (IVR), atrial bigeminy (AB), and sinus bradycardia (SBR). The AFDB is composed of recordings from 25 subjects with atrial fibrillation (mostly paroxysmal); each is a two-channel ECG signal sampled at 250 samples per second with 12-bit resolution over a range of ±10 mV for 10 hours. The rhythm annotation files were prepared manually. The rhythm annotation types are atrial fibrillation (AF), atrial flutter (AFL), atrioventricular junctional rhythm (J), and other rhythms (N) [27]. In this study, two ECG recordings of the AFDB (records 00735 and 03665) were excluded because they are unavailable. The MITDB contains 48 half-hour two-channel ECG recordings obtained from 47 subjects. Of these, 23 recordings were collected from a mixed population of inpatients (about 60%) and outpatients (about 40%). The remaining 25 recordings were selected from the same set to include less common but clinically significant arrhythmias. The recordings were digitized at 360 samples per second per channel with 11-bit resolution over a 10 mV range [28]. In the LTAFDB and MITDB, we used AF rhythm as the "AF" class and normal sinus rhythm as the "Non-AF" class.
Resnet performs well without vanishing gradients because of the shortcut connection that carries the input X of a block forward and adds it to the block output F(X), as shown in Figure 4(a). If the dimensions of X and F(X) do not match, a convolution layer (Conv) and Batch Normalization (BN) are used to match the spatial resolution, as shown in Figure 4(b). In this study, we employed different 1-D Resnet models to detect AF. The architecture of the original Resnet model was converted from 2-D to 1-D to train on the 1-D ECG signal. We trained 1-D Resnet models with 18, 34, 50, and 152 layers on each of the three databases and compared them with each other. For instance, the Resnet 50 architecture is shown in Figure 4(c). We used cross-entropy, a well-known cost function for classification problems. The cost function was minimized by the Stochastic Gradient Descent optimizer. The initial learning rate was set to 0.001, the momentum to 0.9, and the weight decay to 0.0001, based on [29]. The mini-batch size was 32 [33]. The learning rate was divided by 10 when the error plateaued, and the network weights were initialized as in [34].

Performance evaluation
The confusion matrix summarizes the classification results and reports the numbers of true positives, true negatives, false positives, and false negatives. It is used to visualize the performance of the trained models on datasets composed of independent databases. The Receiver Operating Characteristic (ROC) curve plots the sensitivity against the false positive rate for various decision thresholds. The Area Under the Curve (AUC) is a popular evaluation metric that measures the area under the entire ROC curve. We used the ROC curve and AUC to present the data dependency according to the dataset. Finally, summary statistics are used to report the data dependency for the different Resnet models (Resnet 18, 34, 50, and 152). They are accuracy, defined as the proportion of correctly classified segments among the total number of segments; sensitivity, defined as the proportion of true positives among the total number of positive segments; and specificity, defined as the proportion of true negatives among the total number of negative segments.
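The accuracy, sensitivity, and specificity defined for the performance evaluation, together with the AUC, can be sketched in plain Python. This is an illustrative sketch rather than the authors' evaluation code: the function names are hypothetical, labels are assumed to be 1 for "AF" and 0 for "Non-AF", both classes are assumed to be present, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which equals the area under the ROC curve.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, 1 = AF and 0 = Non-AF (assumed coding)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def summary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and false positive rate as fractions."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # true positives / all positive segments
        "specificity": tn / (tn + fp),          # true negatives / all negative segments
        "false_positive_rate": fp / (fp + tn),  # equals 1 - specificity
    }


def auc_score(y_true, scores):
    """AUC as the probability that a positive segment outscores a negative one."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))
```

The pairwise AUC loop is quadratic in the number of segments, so it is only meant to make the definition concrete; a library routine would be preferable for large test sets.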

Figures
Figure 1

Figure 3 Overview

Table 3. Performance results of different Resnet models

The confusion matrices of the Resnet 50 model are presented in Figure 1. For the model trained on the LTAFDB, the evaluation on the internal database (LTAFDB) shows the highest true positive rate, 98.76%, and true negative rate, 98.55%. However, the highest false negative rate, 5.60%, is obtained on the external database (AFDB), and the highest false positive rate, 20.23%, is obtained on the external database (MITDB). For the model trained on the AFDB, the highest true positive rate, 99.11%, and true negative rate, 99.43%, are obtained on the internal database (AFDB). However, the highest false negative rate, 33.12%, is obtained on the external database (LTAFDB), and the highest false positive rate, 21.26%, is obtained on the external database (MITDB). For the model trained on the MITDB, the highest true positive rate, 98.27%, and true negative rate, 99.84%, are obtained on the internal database (MITDB). However, the highest false negative rate, 76.86%, and false positive rate, 7.03%, are obtained on the external database (AFDB).
The ROC curves of the Resnet 50 model are shown in Figure 2. Figure 2(a) shows the ROC curves of the model trained on the LTAFDB. The highest AUC score is 0.9994 on the internal database (LTAFDB), and the lowest AUC score is 0.9494 on the external database (MITDB). Figure 2(b) shows the results of the model trained on the AFDB. The highest AUC score is 0.9993 on the internal database (AFDB), and the lowest AUC score is 0.9190 on the external database (MITDB). Figure 2(c) shows the results of the model trained on the MITDB. The highest AUC score is 0.9999 on the internal database (MITDB), and the lowest AUC score is 0.5296 on the external database (AFDB).
To estimate the data dependency on healthy subjects only, we evaluated the trained models on the MIT-BIH Normal Sinus Rhythm database (NSRDB) [35], which is composed only of normal sinus rhythm ("Non-AF" class) recorded from subjects who had no significant arrhythmias. It gives a good estimate of the false positive rate in healthy subjects [20]. The Resnet 50 models trained on the LTAFDB, AFDB, and MITDB achieved specificities of 97.16%, 97.60%, and 95.46% and false positive rates of 2.84%, 2.40%, and 4.55%, respectively, as listed in Table 4.

Discussion
We experimentally investigated the data dependency of deep learning-based AF classification using Resnet and raw ECG signals. As indicated in Table 3, for all Resnet models used in this study (Resnet 18, 34, 50, and 152), the highest accuracy of each trained model was obtained on the test dataset extracted from the same database used for training.

Table 1. Description of three different databases, LTAFDB, AFDB, and MITDB

All the two-channel ECG signals in each database are used for the training and test datasets. Detailed descriptions of the databases are given in Table 1. The number of ECG segments used for the experiments is listed in Table 2. Since the sampling rates of the LTAFDB, AFDB, and MITDB differ (128 Hz, 250 Hz, and 360 Hz, respectively), the AFDB and MITDB signals are downsampled to 128 Hz. In addition, each ECG signal is normalized by Z-score normalization. The normalized ECG data are divided into segments of 10 seconds (1280 samples) to match the input size. The residual network (Resnet) model developed by He et al. [29] is used for AF detection because this model has recently been used in many studies on cardiac arrhythmia classification.
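The preprocessing steps described here (resampling to 128 Hz, Z-score normalization, and splitting into 10-second windows of 1280 samples) can be sketched as follows. This is a simplified pure-Python illustration, not the authors' pipeline: the function names are hypothetical, the resampler uses plain linear interpolation as a stand-in because the text does not state the downsampling method (a practical pipeline would normally apply an anti-aliasing filter first), and normalizing per recording and dropping a trailing partial window are also assumptions.

```python
import math
from statistics import fmean, pstdev

TARGET_FS = 128        # Hz, the native sampling rate of the LTAFDB
SEGMENT_SECONDS = 10   # window length stated in the text


def resample_linear(signal, fs_in, fs_out=TARGET_FS):
    """Resample by linear interpolation (stand-in method, no anti-aliasing filter)."""
    if fs_in == fs_out:
        return list(signal)
    n_out = len(signal) * fs_out // fs_in
    out = []
    for i in range(n_out):
        pos = i * fs_in / fs_out              # fractional index into the input
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1.0 - frac) + signal[hi] * frac)
    return out


def zscore(signal):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = fmean(signal)
    std = pstdev(signal)
    if std == 0:
        raise ValueError("a constant signal cannot be Z-score normalized")
    return [(x - mean) / std for x in signal]


def segment(signal, fs=TARGET_FS, seconds=SEGMENT_SECONDS):
    """Cut non-overlapping windows of fs * seconds samples (128 * 10 = 1280)."""
    size = fs * seconds
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def preprocess(signal, fs_in):
    """Resample, normalize the whole recording, then segment (order taken from the text)."""
    return segment(zscore(resample_linear(signal, fs_in)))
```

For example, `preprocess(record, 250)` would turn one AFDB channel into a list of 1280-sample windows, where `record` is a hypothetical list of raw sample values.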

Table 2. The number of data segments for AF classification

Early stopping was applied over 10 epochs to avoid overfitting. We trained and tested Resnet models with 18, 34, 50, and 152 layers on the LTAFDB, AFDB, and MITDB independently, using 80% of each database as the training dataset and 20% as the test dataset. The validation dataset consists of 20% of each training dataset.
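The data split described here can be sketched as follows: 20% of the segments of a database are held out for testing, and 20% of the remaining training portion is used for validation. This is a hedged illustration, not the authors' code: the function name, the random shuffling, and the fixed seed are assumptions, since the text does not say how segments were assigned to each subset.

```python
import random


def split_segments(n_segments, test_frac=0.2, val_frac=0.2, seed=0):
    """Return index lists (train, val, test) for the segments of one database."""
    indices = list(range(n_segments))
    random.Random(seed).shuffle(indices)      # assumed: random segment assignment
    n_test = round(n_segments * test_frac)
    test = indices[:n_test]
    remaining = indices[n_test:]              # the 80% training portion
    n_val = round(len(remaining) * val_frac)  # 20% of the training portion
    val = remaining[:n_val]
    train = remaining[n_val:]
    return train, val, test
```

With 1,000 segments, this yields 640 training, 160 validation, and 200 test segments.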