1 Introduction

Congenital heart disease (CHD) is the leading cause of stillbirths worldwide and the most common major congenital malformation [1, 2]. Over the last 80 years, diagnostic and therapeutic capabilities for CHD have increased significantly owing to improved diagnostic techniques [1, 3]. Regular fetal heart monitoring can improve CHD diagnosis during pregnancy, so that appropriate medical treatment can be delivered to the fetus to minimize detrimental consequences. Ultrasonography (USG), cardiotocography (CTG) and fetal phonocardiography (PCG) are the most common techniques for monitoring fetal well-being [4,5,6], and each has advantages and drawbacks. USG is a noninvasive imaging test that uses high-frequency sound waves to create real-time images or videos of fetal internal organs [7]. CTG monitors uterine contractions and fetal heart rate (FHR) with an ultrasound transducer and a pressure-sensitive uterine contraction transducer [4]. In contrast to USG and CTG, PCG observes fetal heart sounds by auscultation, obtaining an acoustic recording from the mother’s abdomen [4]. However, these techniques require extensive training, and the experts who can train providers are in limited supply because of the relatively high costs [2, 4]. Furthermore, PCG has the lowest signal-to-noise ratio (SNR), which makes it difficult to locate the fetal signal amid acoustic disturbance.

Given these limitations, the noninvasive fetal electrocardiogram (fECG) is a promising alternative tool for monitoring fetal heart activity. It can provide morphological information on the pathological status of the fetal heart as well as an accurate assessment of FHR. The fECG provides vital data on the physiological status of fetal heart activity and supports the diagnosis of fetal cardiac disorders such as arrhythmia. It is obtained by measuring the electrical activity of the fetal heart over time using electrodes placed on the mother’s skin [8,9,10]. No risk associated with measuring the noninvasive fECG signal has been observed; it is therefore considered more effective than the internal fECG signal [11].

In noninvasive fECG monitoring, FHR measurement is widely used to detect abnormalities such as fetal hypoxia. An irregular rhythm of the fetal heart is also the most common sign of fetal arrhythmia, and most cases are caused by frequent ectopic beats [6]. Hence, the first step in diagnosing these abnormalities is an effective technique for locating the fetal QRS-complexes. Changes in the width of the fetal QRS-complexes indicate more complex fetal heart states [11]. However, detecting fetal QRS-complexes is arduous because the fECG has small signal energy relative to the maternal electrocardiogram (mECG), so undesired signal components appear [12]. In addition, the overlap with the mECG and random electrical noise during acquisition complicates fetal QRS-complex detection [13, 14].

Numerous computerized algorithms have been proposed to detect fetal QRS-complexes, including the wavelet transform [8, 9], independent component analysis [10], principal component analysis [12], the adaptive neuro-fuzzy inference system [15] and adaptive filtering [16]. These algorithms can be categorized by signal processing technique: adaptive filtering, linear decomposition and non-linear decomposition [12]. Their drawback is that they still require a reference mECG signal to recreate the morphological shape of the mECG, and the morphology of the mECG is highly dependent on the location of the electrode [17]. In addition, the high computational complexity of non-linear decomposition algorithms has limited their use in real-time settings [4]. Feature analysis that relies on human intervention is also still needed and plays a significant part in detecting fetal QRS-complexes [18, 19].

Deep learning (DL) has emerged as a powerful approach, characterized by learning features directly from data without human intervention [20], whereas conventional signal processing algorithms fail to process natural data in their raw form [21]. Most QRS-complex detectors preprocess the input ECG signal and extract signal characteristics before extracting deep neural features. As a result, they do not get the full benefit of employing DL. In this paper, we propose a hybrid of a convolution layer, as used in convolutional neural networks (CNN), and a recurrent network, i.e., long short-term memory (LSTM). Combined architectures of this kind have proven their superiority in many applications, specifically biomedical signal processing [22,23,24,25,26,27,28]. We tune the hyperparameters to generate the best model for fetal QRS-complex classification. Lee et al. [11] proposed a CNN with seven convolution layers for fetal QRS-complex classification and achieved a sensitivity of 89.06%. Zhong et al. [4] also explored a CNN for fetal QRS-complex classification; the difference is the number of convolution layers, as they used three convolution blocks and achieved an accuracy of 77.38%. Both studies show that CNNs are a viable option for fetal QRS-complex classification. The contributions of this study are as follows:

  • Proposes a DL model that stacks a convolutional layer and bidirectional long short-term memory (BiLSTM) to enhance the accuracy of fetal QRS-complex classification;

  • Develops the segmentation process for the fetal QRS-complex based on knowledge of QRS segmentation in adult ECG signals; and

  • Experiments with 68 fECG records, of which five are held out as unseen data for an objective evaluation of the best model.

The rest of this paper is organized as follows: Sect. 2 presents the materials and methods of this study, Sect. 3 presents the results and discussion, and Sect. 4 presents the conclusions.

2 Materials and Methods

The research methodology of this study consisted of four main steps: (i) the acquisition of raw fECG data from Noninvasive fECG: The PhysioNet/Computing in Cardiology Challenge 2013 [29]; (ii) fECG preprocessing (noise cancellation, beat and QRS-complex segmentation); (iii) hyperparameter tuning of the DL model; and (iv) performance evaluation based on classification metrics (accuracy, sensitivity, specificity, precision and F1-score). The research methodology is illustrated in Fig. 1.

Fig. 1 The research methodology of fetal QRS-complex classification

2.1 Data Preparation

The dataset of Noninvasive fECG: The PhysioNet/Computing in Cardiology Challenge 2013 has been widely used to develop fetal QRS-complex classification algorithms [29]. The dataset consists of three sets of one-minute fECG recordings (sets A, B, and C); in this study, only set A was used. It contains 75 ECG records, digitized at 1000 samples per second per signal. Sample records are shown in Fig. 2, which displays the four leads of an fECG record. We used only a single lead for fetal QRS classification in this investigation.

Fig. 2 The sample of fECG records

2.2 fECG Preprocessing

fECG recordings have low signal quality because they are acquired noninvasively from the maternal abdomen. Extracting clean fECG recordings is therefore arduous in fetal monitoring, and eliminating noise and artefacts is essential in clinical signal preparation. The discrete wavelet transform (DWT) was used to remove noise and interference from the fECG recordings. The DWT is a powerful and effective tool for signal denoising [30, 31]. The dominant frequency is often used to determine the number of wavelet decomposition levels. The DWT applies high-pass and low-pass filtering to split the fECG recordings into the frequency bands needed to construct the successive coefficients [32]. In this investigation, we explored several wavelet families, i.e., symlets, Daubechies, Haar and biorthogonal (bior), to determine which wavelet type produces the best denoising results. The Daubechies wavelet gave the highest SNR, 3.27882 decibels, and was therefore chosen for ECG signal denoising (refer to Table 1).

Table 1 The SNR results of wavelet families
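
The decompose, threshold and reconstruct idea behind DWT denoising can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses a hand-written Haar transform (the simplest of the families listed above) instead of the chosen Daubechies wavelet, and the soft threshold value, the number of levels and the function names `haar_dwt`, `haar_idwt`, `denoise` and `snr_db` are illustrative assumptions. The SNR is computed against a known clean reference, which is only possible for a synthetic signal such as the sine wave used here.

```python
import math
import random

def haar_dwt(signal):
    """One Haar DWT level: low-pass (approximation) and high-pass (detail) halves."""
    s = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / s for a, b in pairs]
    detail = [(a - b) / s for a, b in pairs]
    return approx, detail

def haar_idwt(approx, detail):
    """Invert one Haar DWT level."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def soft_threshold(coeffs, thr):
    """Shrink every coefficient towards zero by thr (assumed thresholding rule)."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]

def denoise(signal, levels=3, thr=1.0):
    """Decompose, shrink the detail coefficients, reconstruct.

    len(signal) must be divisible by 2**levels.
    """
    approx = list(signal)
    details = []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(soft_threshold(detail, thr))
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx

def snr_db(clean, estimate):
    """SNR in decibels of an estimate relative to a known clean reference."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((c - e) ** 2 for c, e in zip(clean, estimate))
    return 10.0 * math.log10(signal_power / noise_power)

# Synthetic example: a slow sine wave with additive Gaussian noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * i / 256) for i in range(512)]
noisy = [c + rng.gauss(0.0, 0.3) for c in clean]
denoised = denoise(noisy, levels=3, thr=1.0)
print(f"SNR before: {snr_db(clean, noisy):.2f} dB, after: {snr_db(clean, denoised):.2f} dB")
```

Comparing wavelet families as in Table 1 would amount to repeating the last step with a different transform and keeping the family with the highest SNR.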

The next preprocessing step is beat segmentation, an important step before classification with the DL algorithm. In this step, we detect the R-peak from the reference annotation that marks the location of each true fetal QRS-complex (ground truth). From the R-peak, we segment 0.25 s to the left and 0.45 s to the right [33]. For QRS-complex segmentation, we use a 0.10 s window, as the normal QRS width is around 70–100 ms: 0.05 s to the left of the R-peak (QRS-onset to R-peak) and 0.05 s to the right (R-peak to QRS-offset). The segmentation process is shown in Fig. 3. The labelling process, with the classes fetal QRS-complex and non-fetal QRS-complex, was also carried out in this step, because labels are required for supervised learning of the proposed DL model. As a result, 317,000 input data points were labelled as fetal QRS-complex and 126,800 as non-fetal QRS-complex.

Fig. 3 The segmentation process of beat and QRS-complexes
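
The windowing around each annotated R-peak can be sketched as follows. This is one possible reading of the segmentation step, not the authors' code. It assumes the 1000 Hz sampling rate of the dataset, a QRS half-width of 0.05 s on each side of the R-peak (so that the window totals the stated 0.10 s), a per-sample label mask with 0 for fetal QRS-complex and 1 for non-fetal QRS-complex, and that beats whose window leaves the record are skipped. The names `window` and `segment_beats` are hypothetical.

```python
FS = 1000  # sampling rate in Hz

def window(r_peak, left_s, right_s, n_samples, fs=FS):
    """Return (start, stop) sample indices around r_peak, or None if out of range."""
    start = r_peak - round(left_s * fs)
    stop = r_peak + round(right_s * fs)
    if start < 0 or stop > n_samples:
        return None
    return start, stop

def segment_beats(signal, r_peaks, fs=FS):
    """Cut one beat per R-peak and build a per-sample label mask for each beat.

    Beat window: 0.25 s before to 0.45 s after the R-peak.
    QRS window:  0.05 s on each side of the R-peak (assumed half-width).
    Labels: 0 = fetal QRS-complex, 1 = non-fetal QRS-complex.
    """
    beats, labels = [], []
    for r in r_peaks:
        beat = window(r, 0.25, 0.45, len(signal), fs)
        if beat is None:
            continue  # skip beats too close to the record boundaries
        start, stop = beat
        qrs_start = r - round(0.05 * fs)
        qrs_stop = r + round(0.05 * fs)
        beats.append(signal[start:stop])
        labels.append([0 if qrs_start <= i < qrs_stop else 1
                       for i in range(start, stop)])
    return beats, labels

# Toy example: a 2 s dummy signal with three annotated R-peaks.
signal = [0.0] * 2000
beats, labels = segment_beats(signal, r_peaks=[100, 600, 1800])
print(len(beats), len(beats[0]), labels[0].count(0))
```

In the toy example the first and last R-peaks are dropped because their beat windows extend beyond the record.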

2.3 Hyperparameter Tuning DL Model

The one-dimensional forward propagation of CNN can be expressed as follows [34]:

$$x_{k}^{l} = b_{k}^{l} + \sum\limits_{i = 1}^{N_{l - 1}} \mathrm{conv1D}\left( w_{ik}^{l - 1}, s_{i}^{l - 1} \right)$$
(1)

where \(x_{k}^{l}\) is the input, \(b_{k}^{l}\) is the bias of the \(k^{th}\) neuron at layer l, \(s_{i}^{l - 1}\) is the output of the \(i^{th}\) neuron at layer l – 1, and \(w_{ik}^{l - 1}\) is the kernel from the \(i^{th}\) neuron at layer l – 1 to the \(k^{th}\) neuron at layer l.
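
Eq. (1) can be traced with a small sketch in plain Python. It assumes a "valid" sliding dot product without padding or kernel flipping (the cross-correlation form used by most DL libraries), since the source does not state these details. The names `conv1d` and `conv_layer_forward` are illustrative.

```python
def conv1d(kernel, signal):
    """'Valid' 1-D sliding dot product of kernel over signal (no padding)."""
    n_out = len(signal) - len(kernel) + 1
    return [sum(kernel[j] * signal[t + j] for j in range(len(kernel)))
            for t in range(n_out)]

def conv_layer_forward(prev_outputs, kernels, biases):
    """Compute x_k^l = b_k^l + sum_i conv1D(w_ik^{l-1}, s_i^{l-1}) for every k.

    prev_outputs[i] : s_i^{l-1}, output of neuron i at layer l-1
    kernels[i][k]   : w_ik^{l-1}, kernel from neuron i (layer l-1) to neuron k (layer l)
    biases[k]       : b_k^l
    """
    outputs = []
    for k, bias in enumerate(biases):
        acc = None
        for i, s in enumerate(prev_outputs):
            y = conv1d(kernels[i][k], s)
            acc = y if acc is None else [a + v for a, v in zip(acc, y)]
        outputs.append([bias + a for a in acc])
    return outputs

# Two input channels, two output neurons, kernels of length 2.
prev_outputs = [[1, 2, 3, 4], [0, 1, 0, 1]]
kernels = [[[1, 0], [1, 1]],    # from input 0 to outputs 0 and 1
           [[0, 1], [1, -1]]]   # from input 1 to outputs 0 and 1
biases = [0.5, 0.0]
print(conv_layer_forward(prev_outputs, kernels, biases))
```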

The BiLSTM can be expressed as follows:

$$h_{t}^{f} = \tanh\left( W_{xh}^{f} x_{t} + W_{hh}^{f} h_{t - 1}^{f} + b_{h}^{f} \right)$$
(2)
$$h_{t}^{b} = \tanh\left( W_{xh}^{b} x_{t} + W_{hh}^{b} h_{t + 1}^{b} + b_{h}^{b} \right)$$
(3)
$$y_{t} = W_{hy}^{f} h_{t}^{f} + W_{hy}^{b} h_{t}^{b} + b_{y}$$
(4)

where \(x_{t}\) is the input at time step t, and the forward hidden layer \(h_{t}^{f}\) and the backward hidden layer \(h_{t}^{b}\) are combined to generate the output \(y_{t}\).
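
The bidirectional recurrence of Eqs. (2)–(4) can be sketched in plain Python. The sketch follows the equations as written, i.e., a simple tanh recurrence without the LSTM gates, and assumes zero initial hidden states. As is usual for bidirectional networks, the backward pass is run over the reversed sequence and then re-aligned with the forward pass. The names `matvec`, `rnn_pass` and `bidirectional_forward` are illustrative.

```python
import math

def matvec(matrix, vector):
    """Matrix-vector product for nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_pass(xs, w_xh, w_hh, b_h):
    """Apply h = tanh(W_xh x + W_hh h_prev + b_h) over xs in the given order."""
    h = [0.0] * len(b_h)  # assumed zero initial state
    states = []
    for x in xs:
        pre = [a + b + c for a, b, c in zip(matvec(w_xh, x), matvec(w_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

def bidirectional_forward(xs, fwd, bwd, w_hy_f, w_hy_b, b_y):
    """Combine forward and backward hidden states into outputs y_t (Eq. 4).

    fwd and bwd are (W_xh, W_hh, b_h) tuples for the two directions.
    """
    h_f = rnn_pass(xs, *fwd)
    h_b = rnn_pass(xs[::-1], *bwd)[::-1]  # process reversed input, then re-align
    return [
        [f + b + c for f, b, c in zip(matvec(w_hy_f, hf), matvec(w_hy_b, hb), b_y)]
        for hf, hb in zip(h_f, h_b)
    ]

# Scalar toy example: input size 1, hidden size 1, output size 1.
params = ([[0.5]], [[0.1]], [0.0])
xs = [[1.0], [0.0], [-1.0]]
ys = bidirectional_forward(xs, params, params, [[1.0]], [[1.0]], [0.0])
print(ys)
```

Each output depends on the whole sequence: the forward state summarizes the samples up to t and the backward state the samples from t onwards.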

The performance of a DL model depends on its hyperparameters. There are no specific rules or general ways to choose the best hyperparameters, so they are set by trial and error to minimize the validation error. In this investigation, we generated three DL models. All models share the same learning rate of \(10^{-5}\), batch size of 2, 300 epochs, the Adam optimizer and binary cross-entropy as the loss function; we are first concerned with tuning the number of convolution layers (Table 2). The first model is a CNN. The second model is a hybrid of one convolution layer and LSTM. The third model is a hybrid of one convolution layer and BiLSTM. With this experiment, we investigated the effect of CNN, LSTM and BiLSTM on performance.

Table 2 The hyperparameter tuning DL model

3 Results and Discussion

Of the 75 fECG records, 68 were used to generate the fetal QRS-complex classification model. The remaining seven records are not correctly annotated and were therefore excluded from DL model generation. The 68 records were divided into 50, 13 and five records for the training, validation and testing (unseen) sets, respectively. To avoid data leakage, we split the fECG records by patient (record-based). Table 3 presents the performance of the DL models on the validation set. Models 2 and 3 reach 100% in accuracy, sensitivity, specificity, precision and F1-score. To assess the bias and variance of the classification task, we plotted the learning curves (refer to Fig. 4). A learning curve shows the progress of a specific metric (here accuracy and loss) over the course of training. The training curve indicates how well the model fits the training data, while the validation curve indicates how well the model fits new data. Learning curves can also reveal high bias (underfitting) and high variance (overfitting): a biased model that does not take relevant information into account underfits, whereas a model that captures the training data well but performs poorly on new data cannot generalize and overfits.

Table 3 The performance results of DL model comparison using validation set

Fig. 4 The plot of accuracy curves (a, c, e) and loss curves (b, d, f) of DL models
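
The record-based split described above can be sketched as follows. The source does not state how records were assigned to the three sets, so the seeded shuffle is an assumption, and the identifiers `rec01` to `rec68` are placeholders, not the actual record names. The point of the sketch is that whole records, never individual beats, are assigned to a set, so no record contributes to more than one set.

```python
import random

def split_records(record_ids, n_train=50, n_val=13, n_test=5, seed=0):
    """Assign whole records to train/validation/test sets (record-based split)."""
    if n_train + n_val + n_test != len(record_ids):
        raise ValueError("split sizes must add up to the number of records")
    ids = sorted(record_ids)
    random.Random(seed).shuffle(ids)  # assumed assignment strategy
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# Placeholder identifiers for the 68 usable records.
records = [f"rec{i:02d}" for i in range(1, 69)]
split = split_records(records)
print({name: len(ids) for name, ids in split.items()})
```

Segmentation into beats would then be applied within each set separately, after the split.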

Figure 4 shows the accuracy curves (Fig. 4a, c and e) and the loss curves (Fig. 4b, d and f). Figure 4a and b show a good fit for Model 1, with no sign of overfitting or underfitting: the model captures both the training and validation sets well. The same holds for Models 2 and 3 in Fig. 4c–f. The training and validation curves converge by epoch 300. However, the learning curves of Model 2 fluctuate from epoch 0 to 50, where the training and validation curves are noisy, which may indicate that the model has not yet generalized from the training set to the validation set. In contrast, Model 3 fits well from the initial epoch. Hence, Model 3 is proposed as the model for fetal QRS-complex classification.

To validate Model 3 as the proposed model, we tested it on the testing (unseen) set. The unseen set is not used for DL model generation, so it provides a fair measure of how well the proposed model performs on new data. Five fECG records were used, consisting of 34,000 fetal QRS-complex and 13,600 non-fetal QRS-complex input data points. The results are shown in Fig. 5, which presents the confusion matrix (CM) for the unseen set. The CM is a performance measurement for classification tasks with two or more output classes; it tabulates predicted against actual values. The proposed model (Fig. 5c) classifies fetal QRS-complex and non-fetal QRS-complex well: all actual instances of label 0 (fetal QRS-complex) are predicted correctly, as are those of label 1 (non-fetal QRS-complex). Fig. 6 visualizes the actual (expert annotation) and predicted (Models 1–3) classes, with fetal QRS-complex in green and non-fetal QRS-complex in yellow. The classification results of the proposed model are almost the same as the expert annotation.

Fig. 5 The plot of CM for unseen set for the best model (Models 1–3)

Fig. 6 The sample plot of fetal QRS-complex classification between expert annotation and DL predicted models 1–3
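
The five reported metrics follow directly from the entries of a binary confusion matrix, as the sketch below illustrates. It treats label 0 (fetal QRS-complex) as the positive class, in line with the labelling used above, and assumes that both classes occur in the data so that no denominator is zero. The toy labels and the names `confusion_counts` and `metrics` are illustrative only.

```python
def confusion_counts(y_true, y_pred, positive=0):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, precision and F1-score from CM counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }

# Toy labels: 0 = fetal QRS-complex, 1 = non-fetal QRS-complex.
y_true = [0, 0, 0, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
print(counts, metrics(*counts))
```

A confusion matrix with no off-diagonal entries, as reported for the proposed model on the unseen set, gives a value of 1.0 for every metric.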

We compared our proposed model with existing DL approaches to fetal QRS-complex classification based on the same dataset (PhysioNet/Computing in Cardiology Challenge 2013) (see Table 4). Krupa et al. [2] proposed Internet of Things (IoT)-based DL that uses time–frequency images of abdominal signals combined with the pre-trained models MobileNet and ResNet18 to detect fetal QRS-complexes. Their DL-based fetal QRS-complex classification model (segmentation, training and classification) runs on the IoT cloud, and the output is communicated remotely to the experts at the hospital. Overall, they achieved above 89% in all performance metrics. Zhong et al. [4] and Lee et al. [11] explored CNNs for fetal QRS-complex classification. Zhong et al. [4] proposed three convolution blocks (64, 128 and 256 filters) and a dense block that takes 100 ms long fECG time series as input. They compared the proposed CNN model with K-nearest neighbours (KNN), naïve Bayes (NB) and support vector machine (SVM); the CNN outperformed these conventional algorithms with 77.38% accuracy. Unlike Zhong et al. [4], Lee et al. [11] proposed seven convolutional layers with two fully connected layers (SoftMax). They achieved an average sensitivity of 89.06% on a test set.

Table 4 The benchmark studies of QRS-complex classification based on DL

Among the benchmark studies above, our proposed model performs best, with 100% on the unseen set. Although the results are promising, a limitation of this study is that it used only a single fECG database; we have not yet evaluated generalization to other fECG databases. In addition, the proposed architecture can be extended to complete fetal P-QRS-T wave classification. This additional insight could be valuable for understanding overall fetal ECG patterns and could aid in diagnosing various fetal heart conditions, such as arrhythmia. Therefore, future work requires extended exploration to generate a robust DL model for fetal P-QRS-T classification.

4 Conclusion

We proposed a DL model obtained through hyperparameter tuning. We generated three DL models and compared their results to obtain the best model for fetal QRS-complex classification. The proposed model successfully classified fetal QRS-complex and non-fetal QRS-complex with 100% accuracy, sensitivity, specificity, precision and F1-score. In this study, we also handled the noise and interference in fECG records using the DWT method. A sequence-to-sequence DL framework using a stacked convolutional layer and BiLSTM is a strong candidate for implementation in noninvasive fECG monitoring in clinical practice.