Introduction

Recent progresses in artificial intelligence and deep learning (DL) methods created a leap toward automatic decision-making in various domains including health and medicine. Sophisticated deep learning (DL) methods have been proposed for classifying biological signals1, including heart sound2,3 and electrocardiogram4. Electrocardiograph (ECG) is a recording of the electrical activity of the heart. An ECG signal shows a rhythmic behavior identified by a sequence of the patterns in a cyclic manner, where the regularity of the rhythm along with the shape of the patterns convey important information about the electrical activity of the heart. Paroxysmal atrial fibrillation (PxAF) is a type of irregularity in heart rhythm, called cardiac arrhythmia, characterized by intermittent episodes of rapid and irregular heartbeat, originating in heart atria. PxAF can cause symptoms such as palpitations, shortness of breath, dizziness, and chest discomfort that can lead to fatal conditions like cardiac stroke5. PxAF episodes often occur spontaneously and can last from a few seconds to several days before spontaneously converting back to normal sinus rhythm. Screening patients with PxAF is currently performed by physicians in their clinical practice, and the development of a reliable system for automated detection of PxAF is a need for any healthcare system for patient monitoring purposes.

Several methods have been proposed for detecting PxAF on ECG signal4,6,7,8, from which the DL-based ones are considered as the state-of-the-art of this topic6,9. Nevertheless, accurate detection of PxAF is still an open research question6,9.

We hypothesize two bottlenecks in reaching accurate PxAF diagnosis. Firstly, the class imbalance is commonly seen in most of the public ECG databases, where the size of the class with PxAF arrhythmia is by far smaller than the one with normal cases. Secondly, the backbone architectures used in the state-of-the-art studies may not be optimal as they were manually designed for image classification tasks.

One solution to tackle the first issue is to increase the group size of the minority class, i.e., the PxAF class10, by producing synthetic data from the real ones. Patients’ real data are being recorded electronically by healthcare providers and private industries. However, the recorded data is hardly accessible to scientists due to patient privacy concerns. Even when researchers are able to access high-quality data, they must ensure that the data is properly used and protected in a legal and ethical manner which is a time-consuming process11.

Generating synthetic medical data has been broadly explored for various sorts of medical data including physiological signals12. Synthetic ECG data has been reported as the case study in several reports (“Synthetic data generation for ECGs”). Recently, generative adversarial networks (GANs) have demonstrated impressive performance in medical data augmentation. However, the synthetic ECGs, generated by GAN, are mostly immature to be used as the training data due to morphological irrelevance, and thus, leveraging them in the training process can mislead the classifier. As we will see in the sequels, this important point is elaborately considered by the proposed method.

Neural architecture search (NAS), as an automated technique for designing artificial neural networks, has recently received attention from researchers and engineers. It provides a solid tool to achieve an optimized architecture for the problem of designing an optimal machine learning solution. Applicability of this technique has been explored in different domains such as biomedical engineering, in which classification of physiological signals is an important challenge13,14,15,16.

In this paper, we propose an original framework for detecting PxAF arrhythmia based on an enhanced combination of GAN and NAS. The framework is composed of three compartments: (1) data enrichment, (2) signal processing, and (3) machine learning compartments. The proposed framework introduces innovative ideas in the methodologies employed for this important research question. It proposes the use of a GAN architecture for data enrichment in a new manner, named certified-GAN, in conjunction with the original signal processing and machine learning methods. The performance of the framework is statistically evaluated both holistically and independently for each compartment. The accuracy of the framework in detecting PxAF was estimated to be 99%, exhibiting a considerable improvement in the state-of-the-art.

To the best of our knowledge, this paper is the first study proposing an automatic methodology for certified synthetic data generation and designing an accurate CNN architecture for PxAF detection. We name this combination of certified-GAN and NAS for PxAF detection as Deep-PxAF. The contributions of this paper are:

  • A novel data enrichment method is proposed that enables the generation of the certified synthetic PxAF samples based on the recommendations of an expert physician (“Certified synthetic data generation”).

  • A novel data pre-processing approach is proposed to improve the detection performance (“ECG signal processing”).

  • A cell-based neural architecture search method is employed to design a specialized CNN architecture for the PxAF detection task (“CNN architecture search”).

  • We provide extensive experiments to demonstrate the effectiveness of Deep-PxAF (“Experimental results”). Plus, we discuss the reproducibility results of the proposed method (“Discussion”).

Results show that Deep-PxAF achieves higher accuracy compared to handcrafted DL architectures and automated machine learning (AutoML) tools on the PhysioNet PxAF database17. Moreover, Deep-PxAF shows stable results with marginal differences with multiple repetitions, confirming the reproducibility of the results. The database of certified labels is open-access and can be used by any researcher for scientific purposes.

Preliminaries

Paroxysmal atrial fibrillation (PxAF)

ECG is a registration of the electrical activity of heart cells. A normal ECG is a cyclic signal composed of several waves and peaks within each cycle from which the QRS complex, T-wave, and P-wave are mostly regarded as indicative patterns of the signal. Figure 1a depicts a normal ECG signal along with the indicative patterns occurring in a certain order in time. The cyclic behavior of the ECG signal comes from the fact that heart muscles have two phases of activity: contraction and relaxation. A contraction is normally followed by a relaxation, where the contraction is initiated from the right atrium down to the ventricles and returned to its initiating point to create a self-stimulating activity through the heart muscles with a rhythmic behavior. This rhythmic action is projected to the ECG signal. The P-wave and the QRS complex coincide with the atrial and ventricular contraction, respectively, while the T-wave results from the ventricular relaxation. In the cardiac investigation, a complete relaxation followed by a left ventricle contraction is known as the cardiac cycle. However, for simplicity in ECG signal processing, a cardiac cycle can be defined as the interval between two successive R-peaks for computerized processing.

The morphology of an ECG signal conveys important information about the heart’s electrical activity and, to a lesser extent, about its mechanical activity. This includes not only the duration of the QRS complex and the time intervals between the waves and the complex but also the amplitude of the patterns. Deviation from the typical characteristics of ECG can be resulted either from a physiological condition such as sinus arrhythmia or from pathological conditions, e.g., arrhythmia. Sinus arrhythmia can be dominantly caused by respiration. PxAF is a pathological condition of the electrical heart action that can happen when the atrial contraction is performed inappropriately. PxAF can initiate an arrhythmia and requires medical considerations and sometimes appropriate management. Figure  1b shows a PxAF condition versus a normal sinus rhythm.

Figure 1
figure 1

(a) Illustration of a sinus rhythm condition. Heart rate variation within 60–100 beats per minute. (b) PxAF condition. Heart rate variability in the form of arrhythmia and P-wave alterations.

As can be seen in Fig. 1, cardiac cycles show a physiological variation of sinus rhythm with clearly visible P-waves in all the cycles. In contrast, in the PxAF case, the P-waves show noticeable alterations over the cycles along with the arrhythmia. An association between PxAF and mortality has been previously demonstrated18. It is also studied that timely detection of PxAF can improve survival in this patient group by appropriate medical management18.

Generative adversarial networks

Generative adversarial networks (GANs) are a class of deep learning architectures that have been successfully used to generate synthetic images, time-series data, and other data modalities19,20. In general, GANs are comprised of two sub-networks: the generator (G) and the discriminator (D). G generates synthetic data that is as close as possible to the real data, while D determines whether the generated data is real or not. These two sub-networks compete with each other in a two-player minimax game with a loss function of V(GD) (Eq. 1). The goal of solving Eq. (1) optimization problem is to reach Nash equilibrium21.

$$\begin{aligned} \mathop {min}_{G} \mathop {max}_{D} V (G, D) = E_{x\sim p_{data}(x)} \big [log D(x)\big ] + E_{z\sim p_{z}(z)} \big [log (1 - D(G(z)))\big ] \end{aligned}$$
(1)

Probability D(x) determines whether x is generated data or real data.

Related works

PxAF diagnosis using DL methods

Previous studies on DL-based methods showed less attention paid to PxAF detection than other forms of arrhythmia4. Pourbabaee et al.6 proposed a method for identifying patients with PxAF. Their proposed method employs raw ECG data as input; then, uses a CNN with one fully-connected layer to learn a discriminative pattern of data in the time domain. Plus, they manually tweaked various classification methods to achieve maximum performance. Shashikumar et al.8 proposed an attention-based DL method for detecting PxAF episodes from a synthetic database composed of 24-h Holter ECG recordings. Time-frequency representations of 30-s windows are fed sequentially into the CNN. Then, the extracted features are presented to a bidirectional recurrent neural network with an attention layer. Gilon et al.7 constructed a new long-term ECG database (24–96 h) for the purpose of detecting PxAF. After careful analysis by a cardiologist, 250 AF onsets of PxAF have been detected. They proposed a CNN followed by a bidirectional Gated Recurrent Units (GRU) network for PxAF detection. The network was trained to distinguish between RR intervals that precede an AF onset and RR intervals distant from any AF. They concluded that RR intervals contain information about the incoming AF episode. Tzou et al.22 proposed to predict the occurrence of PxAF by combining wavelet decomposition and a CNN classifier. Surucu et al.9 aimed to detect PxAF episodes before occurrence. Surucu et al.9 leveraged a CNN to process normalized heart rate variability features resulting in 87.76% accuracy and 87.50% f1-score in heart rate variability.

In this perspective, atrial fibrillation (AF), which is regarded as a case study with physiological similarities, has been reported in a large number of related studies. Zhang et al.23 developed a dual-domain attention cascade model called D2AFNet, which addresses the challenge of accurate AF detection. Marinucci et al.24 introduced the utilization of a fully-connected network that incorporates diverse input ECG features and tested on ECG recordings obtained through portable devices. Kamozawa et al.25 introduced a method for detecting AF from Holter-ECG recordings using a CNN. After eliminating artifacts and noises, the proposed approach first extracts abnormal waveforms using a one-dimensional CNN, then identifies AF using a two-dimensional CNN trained with segmented ECG spectrograms. Rui et al.26 and Yang et al.27 introduced the utilization of transformer models28 for the purpose of detecting AF. Their aim was to enhance the capturing of inter-heartbeat dependencies by leveraging transformers in the detection process. Compared to state-of-the-art AF detection methods, Deep-PxAF stands out as the pioneering study that utilized neural architecture search on a synthetic-verified database.

Synthetic data generation for ECGs

Medical data tend to be highly sensitive by nature and are often subject to severe usage restrictions. As a result, it is difficult for researchers to collect and share this data. A possible alternative to address the problem of data scarcity is to generate realistic synthetic data20. McSharry et al.29 and Sayadi et al.30 proposed mathematical dynamical models to generate continuous ECG signals. These models, however, were limited to one lead signal and did not provide any insight into the mechanism of disease.

Recent studies have demonstrated that GANs are extremely effective at synthesizing ECG waveforms based on a prior distribution of data. Prior works are mainly focused on efficient GAN architecture31,32,33,34,35,36,37. Delaney et al.31 studied various GAN architectures by leveraging LSTM or BiLSTM as the generator and a CNN discriminator with single or multiple Convolution-ReLU-Pooling layer(s). Results show that a BiLSTM GAN with a single Convolution-ReLU-Pooling layer provides the best performance. Zhu et al.32 used a BiLSTM-CNN GAN model to generate synthetic ECG signals. A GAN architecture based on a four-layer generator and a five-layer fully-connected discriminator is proposed in38. Banerjee and Ghose33 proposed a multi-GAN method to generate ECG waveforms for atrial fibrillation arrhythmia by combining the output of GAN models. Thambawita et al.35 proposed two GAN architectures, WaveGAN\(*\) and Pulse2Pulse, with the ability to generate synthetic 10-s ECG waveforms. Pulse2Pulse, which is based on a U-net generative model, is superior to producing realistic ECGs. Li et al.36 was the first to propose a transformer-based conditional GAN architecture, named TTS-CGAN, to generate synthetic time-series with sequences of arbitrary length. Compared to popular RNN or LSTM-based GANs for generating time-series31,39,40, TTS-CGAN has no difficulties in producing long synthetic sequences. In continuation, Xia et al.37 proposed TCGAN, an architecture combined with a transformer generator and CNN discriminator.

Despite the success of these methods, they do not guarantee that the generated data is trustworthy, resulting in the failure of classifiers to make accurate predictions. This paper sheds light on the fact that synthesizing high-quality artificial data play a crucial role in accurate predictions. Thus, we propose a novel physician-certified synthetic data generation method that provides ECG samples indistinguishable from real ones.

Neural architecture search for ECG

Several DL models have been developed for detecting a variety of cardiac arrhythmias. However, increasing the complexity of manual-designed networks does not always lead to better performance. Moreover, the introduced deep neural networks mostly require a cumbersome phase of trial-and-error, which results in enormous computational costs41. Recent advances in neural architecture search (NAS) have enabled the designing of scalable and resource-efficient neural architectures. Being inspired by the remarkable success of NAS in the computer vision domain42, several techniques very recently proposed to leverage NAS for designing accurate architectures for arrhythmia detection13,14,15,16,43.

Fayyazifar et al.43 studied the impact of manually tweaking deep neural networks for cardiac abnormality classification. Additionally, they used wavelet decomposition to enhance the classification performance of the PhysioNet Challenge 202044. Fayyazifar et al.13 employed a NAS method for AF classification where they achieved an accuracy of 84.15% on the PhysioNet challenge 201717. Heart-Darts15 proposed a heartbeat classification method by automatically designing an efficient CNN architecture with a differentiable NAS method. Heart-Darts provides state-of-the-art performance, applied to the MIT-BIH arrhythmia database45. Liu et al.16 developed a NAS-based learning method to detect cardiovascular diseases in 12-lead ECG data. In particular, they proposed a novel search strategy that optimizes different attention modules of the same network synchronously. EExNAS14 designed energy-efficient CNN architectures for detecting myocardial infarction (MI) and human activity recognition (HAR) on wearable devices.

These methods utilize NAS to design an efficient arrhythmia classifier; however, they are limited to optimizing the feature extraction part. Further, it is not conclusive that the findings of the prior studies are reproducible, especially since there is no comprehensive evaluation found in their report46.

Methodology

Method overview

We propose a novel method with three phases, comprising: (1) synthetic data generation, (2) ECG signal processing, and (3) CNN architecture search. Figure 2 depicts the bird’s eye view of the proposed method. In the first phase, we generate synthetic ECGs for the PxAF class using a GAN model. After the GAN creates synthetic ECGs, an expert physician evaluates them to identify high-quality training data. The second phase of the method employs the wavelet transform of an ECG signal along with the recurrence graph. Rhythmic information of an ECG within short length windows of 4 s is preserved in a recurrence graph. The outcome of the first stage is a sequence of the two-dimensional images, each incorporating rhythmic contents of a 4 s interval of an input ECG. In the last phase, a CNN is trained to classify the images where the architecture of the CNN is found using NAS. As we will see, the combination of these innovations noticeably improves the performance of the classification.

Phase 1: certified synthetic data generation

The public databases of ECG mostly contain a heavy class imbalance for the arrhythmia classes. The machine learning methods trained by such databases will be consequently biased for the normal classes. In order to cope with the shortage of signals from the minority class, i.e. the PxAF class, a structure GAN is invoked to create synthetic ECGs from the PxAF class. Obviously, inappropriate synthetic ECGs can mislead the classifier. Therefore, the synthetic ECGs created by the GAN are evaluated by an expert physician in terms of quality using a clearly-defined protocol. The disqualified ECGs will be discarded from the training and the synthetic ECGs certified by the expert physicians will be invoked for the learning process (“Certified synthetic data generation”).

Phase 2: ECG signal processing

ECG signal in its raw form is contaminated by different sources of noises and disturbances, such that the PxAF information can be fully concealed. In order to extract discriminant contents of PxAF from the pathological signals, a level of signal processing is required to purify indicative signal contents (“ECG signal processing”). This processing yields a sequence of 2D images, each containing the dynamics of a few seconds of the signal, to a CNN architecture, in which the ultimate classification is performed.

Phase 3: CNN architecture search

Manual design of task-specific neural architectures requires tremendous human effort and domain expertise. In addition, the knowledge learned from designing a network cannot be directly transferred to another person. Neural architecture search (NAS) is the process of automatically optimizing a neural network architecture. NAS research has shown significant progress in enabling accurate neural architectures for computer vision applications41,42,47,48. Because of this insight, we came up with the idea of leveraging NAS with the hope of improving the accuracy of PxAF detection (“CNN architecture search”).

Figure 2
figure 2

The bird’s-eye view of the proposed method.

Certified synthetic data generation

GAN architecture

In this paper, we used the Pulse2Pulse GAN model proposed by35. Here, we briefly present generator and discriminator architectures. Then, we present the procedure of certifying the quality of generated data with the help of an expert physician.

Generator

The architecture of the generator is inspired by the U-Net architecture. The U-Net implementation uses 1D convolutional layers for ECG signal generation. The network takes a 2 \(\times\) 5000 noise vector to generate a 2-lead signal, which is equal to the dimension of the output layer. The noise is passed through six down-sampling blocks followed by six up-sampling blocks. Each down-sampling block consists of a 1D-convolution layer followed by a Leaky ReLU activation. The deconvolution blocks were built from a series of four layers: an up-sampling layer, a constant padding layer, a 1D-convolution layer, and a ReLU activation function consecutively.

Discriminator

The discriminator takes an ECG as input and outputs a score indicating how close it is to a fake ECG. The architecture is composed of seven convolutional layers that follow the Convolution+Leaky ReLU+Phase Shuffle order. Using phase shuffle operation, each feature map’s phase is uniformly perturbed49. Training specification is reported in Table2.

Synthetic data certification

We observe that not all GAN-generated synthetic ECGs cannot be used as training segments due to their improper morphology, and thus, leveraging all GAN-generated segments in the training process will negatively affect the classification accuracy. Based on the morphological characteristics of ECG signal for PxAF cases, an expert physician manually verified all the synthetic ECGs and certified the valid ones (e.g., Fig. 3a) based on the directives listed in Table 1. In this table, the bizarre shape implies the condition in which the sequence of the ECG peaks and waves, and/or their shapes fundamentally differ from the ones, seen in clinical practice. This condition might be seen in a segment (directive 2), or the entire of synthetic ECG. It was also observed that the QRS complexes of the synthetic ECG are inconsistent, or accompanied by extra weird morphology (directives 3, 4). The PxAF characteristics were inconsistently seen in some of the data, affecting the learning process, and thus were eliminated (directive 5).

Figure 3
figure 3

(a) Plotting a certified synthetic PxAF sample. Plotting PxAF synthetic samples rejected by an expert physician due to (b) bizarre shape, (c) distorted PxAF, (d) inconsistent QRS-complex, (e) redundant/noisy R peaks (showing with red points), and (f) partially existing PxAF pattern in the segment. The sampling frequency is 128 Hz.

Table 1 Directives for rejecting improper synthetic ECG segments.

ECG signal processing

Figure 4 shows the major steps of the proposed signal processing pipeline. As shown, the input ECG signal is firstly decomposed to its constitutive components using wavelet transformation until the 10th level using the Daubechies 3 wavelet family. The detail of the wavelet transforms at the levels 2, 3 and 4 along with the approximation contents of the 10th level are reconstructed and added together, to eliminate the noises and the disturbances contaminating the signal. The resulting signal is then normalized by the absolute value of the points with the largest value. Next, the Shannon energy of the normalized signal is calculated using the following formula:

$$\begin{aligned} {\textbf {Y}}_{i}(t) = x^2(t) log (x^2(t)) \end{aligned}$$
(2)

where x(t) is the normalized ECG signal which is positively biased to secure non-zero values. An envelope of the resulting Shannon energy signal is found by using a non-overlapping temporal window of length 100 ms, which slides over the signal. Lastly, a recurrence 2D function of the envelope is obtained. Calculational details of finding the recurrence plot are found in50. The output of the signal processing algorithm is a 2D representation of an input signal, which is discriminant for the PxAF and the normal classes. A CNN employs 2D images for classification.

Figure 4
figure 4

Illustration of the proposed signal processing pipeline.

CNN architecture search

In general, the learning proficiency of CNNs will be improved by increasing the number of network layers. However, simply stacking the network layers may cause accuracy degradation since the deeper networks will encounter a vanishing/explosion gradient problem. Neural architecture search (NAS) methods aim to help engineers to design highly efficient neural networks from scratch41,47,51.

The NAS pipeline typically begins with a pre-defined space of network operators. Since the search space is often enormous (e.g., containing \(10^{24}\) or even more possible architectures47), it is unlikely that an exhaustive search is tractable. Thus, heuristic search methods are widely applied to speed up the search process. At an early age, each sampled architecture undergoes an individual training process from scratch, and thus the overall computational overhead is large, e.g., hundreds of GPU-days (e.g.,52 requires 3800 GPU days).

To alleviate the computing cost of NAS methods, researchers proposed to share computation among the sampled architectures, with the key idea of reusing network weights trained previously47,53 or starting from a well-trained super-network54. These efforts shed light on the one-shot NAS methods, which require training the super-network only once, and therefore run more efficiently (e.g., 2–3 orders of magnitude faster than conventional approaches).

One-shot NAS methods jointly formulate architecture search and network training48,51,55. Differentiable NAS methods solve this problem using gradient-based algorithms such as Stochastic Gradient Descent (SGD). DARTS51 is a well-known differentiable NAS method that constructs a super-network with all possible operators. DARTS utilizes a cell-based design space to search for a well-behaved cell architecture51,55. Then, the cell may be stacked any number of times to meet various hardware devices’ resource requirements. In this paper, we utilize DARTS51 to design CNN architectures due to significantly reducing the notorious design time of neural networks.

Mathematically, the final DARTS architecture is a function, \(f(x; \omega , \alpha )\), where x is input, \(\omega\) is network parameters (e.g., convolutional kernels), and \(\alpha\) in architectural parameters (e.g., indicating the importance of each operator between each pair of layers). \(f(x; \omega , \alpha )\) is differentiable to both \(\omega\) and \(\alpha\) could be optimized using the SGD algorithm. \(f(x; \omega , \alpha )\) is composed of a few cells, where each cell of DARTS is defined by a directed acyclic graph with a pre-defined number of layers and a limited set of neural operators. Each cell contains N nodes, and there is a predefined set, E, which indicates connected pairs of nodes. For each connected node pair (ij) and \(i<j\), node j takes \(x_i\) as input and propagates it through a pre-defined operator set, O, and sums up all outputs (Eq. 3). O supports separable convolution (\(3\times 3\), \(5\times 5\)), dilated convolution (\(3\times 3\), \(5\times 5\)), max/average-pooling (\(3\times 3\)), and Identify operators.

$$\begin{aligned} y^{(i,j)}(x_i)=\sum _{o \in O}\frac{exp(\alpha _{o^{(i,j)}})}{\sum _{o' \in O}exp(\alpha _{o^{(i,j)}}}.o(X_i) \end{aligned}$$
(3)

The normalization is performed by computing the Softmax function over the architectural weights. \(\alpha\) and \(\omega\) get optimized alternately in each search iteration. Afterward, the operator o with the maximum value is preserved for each edge (ij), and all other network parameters \(\omega\) are discarded. In DARTS, the type of each cell is either a normal cell for feature extraction or a reduction cell for both feature extraction and dimension reduction. After designing the optimal cell, we assemble the final network by stacking 18 normal cells with two reduction cells, where every six normal cells are followed by one reduction cell56. Last, the final architecture is re-trained from scratch to fine-tune the network parameters.

Experimental setup

Database preparation

Deep-PxAF identifies individuals who are at risk of PxAF. To this end, we utilized the PhysioNet PxAF prediction challenge database17. This database includes two-channel ECG recordings. The ECG signals were digitized with a 128 Hz sampling frequency, 16 bits per sample, and nominally 200 A/D units per millivolt. The database is divided into training and testing sets. The original train set consists of 100 records with a duration of 30 min that are collected for normal individuals and PAF patients, each with an equal number of recordings. The test set contains 50 records of 30 min duration in which 28 subjects are at risk of PxAF, and 22 subjects are healthy individuals. We completely isolate the training and testing sets. We also did not create a separate validation set to evaluate training performance since the size of the database is relatively small.

In this paper, we partitioned each 30-min ECG signal into segments of 4 s duration resulting in 512 samples/segment. To build the original database (\(D_{Original}\)), we randomly select 4231, 906, and 906 segments for train, validation, and testing, respectively. We consider two classes for training and testing sets: normal (healthy) and PxAF patients. \(D_{Original}\) contains 3545 and 2498 samples for normal and PxAF classes, respectively. The ECG data labeling tool is released alongside the codes.

We generate 10,000 synthetic segments for the PxAF class using GAN. As we have data imbalance for the PxAF class, we only synthesize PxAF segments. The original training database has been augmented with 10,000 synthetic segments (\(D_{GAN}\)). Due to the fact that most of the generated segments are not of high quality, an expert physician evaluated all synthetic data and certified 539 segments containing PxAF. Then, we add the certified synthetic PxAF segments to the original training set to make the final synthetic database (\(D_{CGAN}\)). The synthetic data generation time takes \(\approx\) 42 GPU hours on a single NVIDIA GTX 1080 Ti that produces \(\approx\) 4.3 kg CO\(_2\)57.

Configuration setup

Table 2 summarizes the configuration setup of experiments. In this paper, each DARTS cell consists of seven nodes equipped with a depth-wise concatenation operation as the output node. The convolutional operations follow the Convolution+Batch Normalization+ReLU order. The network design time (search+re-training) takes \(\approx\) 9 GPU hours on a single NVIDIA GTX 1080 Ti that produces \(\approx\) 0.97 kg CO\(_2\)57. The rest of the setup follows51.

Table 2 The configuration setup of the signal processing and neural architecture search hyper-parameters.

Baseline for comparison

Auto-Sklearn58

Auto-Sklearn is a state-of-the-art library for automated machine learning (AutoML) that is compatible with the scikit-learn library59. Auto-Sklearn automatically selects appropriate hyperparameters for a given database by leveraging Bayesian optimization60 as the search method. Auto-Sklearn uses four data preprocessing techniques, 14 feature preprocessing techniques, 15 classifiers, and a structured hypothesis space with 110 hyperparameters. Auto-Sklearn considers the past performance of similar databases and constructs ensembles from the machine learning models evaluated during the optimization to improve the optimization quality. Due to the high efficiency of Auto-Sklearn in customizing the machine learning pipeline61,62, we consider Auto-Sklearn as the second comparison baseline.

Deep residual network (ResNet)63

ResNet is a family of handcrafted architectures that won the ILSVRC competition challenge in 2015. ResNet is constructed by several back-to-back residual blocks connected to a final linear fully-connected layer. In this study, we used ResNet as the third comparison baseline since ResNet has been widely used in automated clinical diagnosis of various diseases4,64,65,66.

Performance measurement

This section introduces common quantitative metrics used for presenting how well synthetic data generation and classification methods work.

GAN performance

For evaluating the performance of GAN, we use a database containing GAN output data and original data to train a model, which is then tested on a held-out set of true examples. This requires the generated data to have labels—an expert physician provides labels to GAN output data. We statistically analyze the distribution of read ECGs and fake ECGs using Kolmogorov–Smirnov test (K–S test). Plus, we will show the Q–Q plot to look at the skewness of fake data from real data.

Classifier performance

The formulas for quantifying measurements are listed below:

$$\begin{aligned} Accuracy= & {} \frac{TP + TN}{TP + TN + FP + FN} \end{aligned}$$
(4)
$$\begin{aligned} Specificity= & {} \frac{TN}{TN + FP} \end{aligned}$$
(5)
$$\begin{aligned} Sensitivity= & {} \frac{TP}{TP + FN} \end{aligned}$$
(6)

where TP, TN, FP, and FN denote true positives, true negatives, false positive, and false negative, respectively.

Experimental results

The synthetic ECGs

The previously-described GAN is trained with 8000 epochs and a learning rate of 0.0001. Figure 5 depicts the loss function of the generator and the discriminator of the GAN. Both of the losses converge to a similar low margin implying the learning relevance. The outcomes of the GAN generator constitute our synthetic ECGs.

Figure 5
figure 5

Loss of the discriminator and generator during GAN training.

The quality of the synthetic ECGs is evaluated based on the statistical measures, separately applied to the entire original and synthetic populations, once using the outcomes of the certified-GAN and once using the GAN without accreditation of the expert physician. In both cases, the fidelity of the synthetic ECGs is evaluated by using the three PxAF-related parameters of ECG: heart rate, R-peak to R-peak interval (RR Interval), and QRS interval. These three parameters are independently calculated for the populations using the signal processing algorithm described in “ECG signal processing”. It is worth noting that these three parameters reflect the variation of the cardiac cycle and heart rate that is linked to arrhythmia.

In total, 10,000 synthetic ECGs were generated using the previously-described GAN, from which 539 were accredited by the expert physician. Figure 6 illustrates the histogram of the three PxAF-related parameters for the real and the synthetic ECGs resulting from the certified-GAN. The modal similarities are obviously seen for the synthetic and real populations.

Figure 6
figure 6

Distribution of (left) heart rates, (middle) RR interval, and QRS interval in all 539 certified segments (\(D_{CGAN}\)) compared to the original database (\(D_{Original}\)).

In order to explore the fidelity of the synthetic ECGs, descriptive statistics are calculated over the three populations: the real ECGs, the GAN, and the certified-GAN subjected to having PxAF condition. Table 3 represents the mean, standard deviation, and percentile values corresponding to the three populations. From the population perspective, the three PxAF-related parameters of the certified synthetic ECGs demonstrate very good fitness to the population of the real ECGs, with a marginal deviation of less than 2% for the mean value. This value is almost 4% for the data from the GAN. The deviation of the percentile values is less than 10%. The certified-GAN provides clear improvements in all the statistics, but the 2.5% percentile which corresponds to the outlier data.

Table 3 Mean, standard deviation (STD), 2.5%, and 97.5% percentile for HR, RR interval, and QRS interval parameters in real and synthetic ECGs.

In order to obtain a better understanding of the outperformance of the certified-GAN, the quantile distribution of the real and synthetic data, the so-called Q–Q plot, is investigated. Figure 7 illustrates the resulting Q–Q plot.

Figure 7
figure 7

Illustration of the Q–Q-plot for (left) heart rate, (middle) RR interval, and QRS interval (right).

It is obviously seen that the certified-GAN provides closer statistical distribution to the real one, as compared to the plain GAN. This is also explored by using the Kolmogorov–Smirnov Test.

Table 4 presents the results of the Kolmogorov–Smirnov (K–S) test for heart rate and QRS interval. As seen in the table, the certified-GAN improves the K–S statistics as well as the p value, showing a closer distribution to the real population. This distribution is closer to the real population than the one for the GAN, confirming the effectiveness of the certified-GAN.

Table 4 Kolmogorov–Smirnov test results.

PxAF classification performance

Table 5 compares the results of Deep-PxAF with the state-of-the-art and state-of-practice classification methods. Results show that Deep-PxAF provides the most accurate classification result with 99.0% accuracy compared to all counterparts. The analysis of the best DARTS cells searched by Deep-PxAF is provided in Section A.2.

Table 5 Comparing the results of Deep-PxAF with state-of-the-art and state-of-practice methods.

This study proposed an accurate method for screening PxAF. In this application, the trade-off between sensitivity and specificity is made by assigning the threshold of the output layer, where sensitivity and specificity are defined as:

  • Sensitivity is the probability of PxAF condition when the classification result is positive

  • Specificity is the probability of normal condition when the classification result is negative

Receiver operating characteristics (ROC) is a plot of the Sensitivity against \((1{-}Specificity)\), in which the optimal point is the point with maximal Sensitivity and specificity. Figure 8 illustrates the ROC curve for the proposed method in comparison with the ResNet-18 classification method. As can be seen in Fig. 8, Deep-PxAF provides a more favorable characteristic in terms of the compromise between Sensitivity and Specificity with a closer curve to the ideal case of the straight angle. The Area Under the Curve (AUC) of ROC for Deep-PxAF trained on \(D_{CGAN}\) is improved by 0.32% and 0.47% compared to Deep-PxAF trained on \(D_{Original}\) and ResNet-18 trained on \(D_{CGAN}\), respectively.

Figure 8
figure 8

Comparing the ROC curve of Deep-PxAF trained on \(D_{Original}\) and \(D_{CGAN}\) to the ResNet-18 trained on \(D_{CGAN}\) baseline method.

Robustness against the background noise

The robustness of the classifiers is investigated by adding background noise to the input signals and calculating the accuracy of the classifiers. Two different background noises are simulated for the investigation: a white noise with normal distribution and a color noise which is indeed a filtered white noise. We employed a Butterworth lowpass filter with a cutoff frequency of about 50 Hz, as often used in ECG acquisition systems. The accuracy of the classifiers is explored for various signal-to-noise ratios (SNR) of the white noise and the color noise, separately. Figure 9 demonstrates the profile of the accuracy and the SNR. As can be seen, the superiority of the Deep-PxAF is well preserved under the noisy conditions of the background noises.

Figure 9
figure 9

(Left) Gaussian white noise and (right) noise with Butterworth filter with 50 Hz cut-off frequency.

Discussion

This study suggested an original framework for PxAF classification using a novel combination of a GAN and NAS in conjunction with an advanced signal processing method. The proposed framework introduces a phase of generating synthetic ECG using GAN, to enhance the accuracy of the classification method by enriching the training data and overcoming class imbalance. Generating valid synthetic ECGs through a certified procedure was the main objective of this phase. Using a rich training dataset with consistent class size can evidently enhance the learning process. The experimental results showed that the enhancement in the accuracy is considerable, which was confirmed by the ROC graph (see Fig. 8). Besides, the classifier employs NAS, as a reliable architecture designer to boost the classification performance. The resulting classification method was optimized and implemented to detect patients with PxAF arrhythmia, which is regarded as an important case study with vital importance. The proposed method improved the screening accuracy by 6.1% compared to the state-of-the-art automated machine learning method58. The baseline for comparison was ResNet-18 and Auto-Sklearn which are well-known benchmarks for the machine learning method. These benchmarks were noticeably outperformed by the proposed method.

Synthetic data generation

This study employed a GAN architecture to generate synthetic ECGs and meanwhile invoked an expert physician to accredit the synthetic data. The application of GAN in generating synthetic ECG has been already explored32,35, however, the effectiveness of the generated ECGs in the training process is questionable since inappropriate synthetic data can evidently mislead the classifier. The certified-GAN which was proposed by this study effectively pruned the inappropriate signals. Results showed a noticeable improvement in the learning process using the certified-GAN. We made these synthetic signals publicly available to any researcher to explore for any scientific purposes.

Another exciting aspect of this study is the statistical techniques employed to study the fidelity of synthetic ECGs. Heart rate and R-R interval were employed as the measures for the PxAF. The statistical techniques mainly perform population-based evaluations which fit well into the scope of the learning process. The certified-GAN showed incapability to generate appropriate outliers, as reflected by the 2.5 percentile in Table3. Such outlier data cannot play an important role in the learning process performed by the proposed deep learning architecture.

ECG signal processing

In this study, the rhythmic contents of the heartbeats are innovatively preserved at the feature extraction level through signal processing and the recurrence images. Like other methods sufficing to the temporal features, there are a number of design parameters associated with the method at this level, such as the window’s length for obtaining the recurrence graph as well as the wavelet transformation. These parameters were empirically obtained based on prior knowledge of the signal. Integration of finding the optimal values for these design parameters with the optimization process might provide further improvements.

CNN architecture search

Although several NAS methods have been proposed to detect various arrhythmias13,14,15,16,43, the area is still unexplored for designing an efficient method for PxAF detection based on an optimized architecture of CNN. Moreover, the optimization process was not performed at the feature extraction level.

Design parameters

Deep-PxAF learns the dynamic variation of the heartbeats at the feature learning level by designing customized architectures for recurrence images. Several design parameters are associated with the method at this level, such as the number of training epochs. We empirically obtained these parameters based on prior knowledge about the neural architecture search. PxAF yields higher performance compared to the results of conventional machine learning techniques that are automatically tuned by Auto-Sklearn. This primarily results from our custom-designed CNN architecture’s higher feature extraction performance. On the other hand, manually tuning a generic CNN architecture6 may result in lower accuracy in comparison with Auto-Sklearn.

Clinical relevance

The classifier proposed by Deep-PxAF showed very high accuracy in detecting the pathological condition PxAF from the heart rate variability seen in normal conditions such as sinus rhythm. It is obvious that the classifier can be trained for detecting other pathological conditions. However, in order to be able to undertake the study in the patient level, we need a rich dataset of ECG signals in conjunction with comprehensive meta data of patient information. The resulting methods can be ultimately implemented in wearable ECG devices for detecting pathological conditions, i.e. PxAF, in a real-time manner. Nevertheless, a phase of clinical validation with a large number of individuals is necessitated to meet the standardization requirements. It is evident that pathological conditions like PxAF can lead to cardiac stroke, and hence, monitoring such a life-threatening condition can effectively reduce the aftermath consequences. Although the resulting classifier demands computational power in the training phase, the testing phase is light enough to be implemented in any kind of mobile technology, e.g. patch ECG, to be used for screening and patient monitoring in the clinical setting.

Statement of reproducibility

To foster reproducibility:

  • Reproducibility analysis Many works on NAS have issues regarding reproducibility due to intrinsic stochasticity. To guarantee the reproducibility of results, we follow the Reproducibility checklist proposed by Lindauer et al.46 (see Appendix A.1).

  • Code release Deep-PxAF is an open-source project. Code is made available on the GitHub repository through www.github.com/0mehdi0/Deep-PxAF.

  • Availability of database In this study, we evaluated our networks using the PhysioNet PAF database17 that is available through https://physionet.org/content/afpdb/1.0.0/. Thus, this work does not involve any new data collection or human subject evaluation. The synthetic ECGs with the corresponding ground truth labels can be downloaded from the GitHub project repository: www.github.com/0mehdi0/Deep-PxAF/tree/main/datasets. The Deep-PxAF may be freely used for scientific use or commercial algorithm development if this paper is properly cited.

Conclusion and future work

This paper suggested an original combination of certified synthetic data generation in conjunction with the NAS method for classifying a vital pathological sign of ECG signal: Paroxysmal Atrial Fibrillation (PxAF). To overcome privacy and ethical concerns for data sharing, a GAN model was used to generate synthetic data. The synthetic ECGs were purified by an expert physician to discard the irrelevant ones. We employed a CNN for the classification, for which the optimal was found by the NAS. The input images to the CNN were extracted from the ECGs using recurrence graphs of the wavelet transform. It is found that the proposed framework offers a noticeable improvement in classification performance compared to the state-of-the-art as well as the existing benchmarks. In future work, the performance of the classifier resulting from this study will be practically explored on the general population after being implemented in an appropriate platform of wearable ECG.