Multimodal Deep Learning for Predicting Adverse Birth Outcomes Based on Early Labour Data

Cardiotocography (CTG) is a widely used technique to monitor fetal heart rate (FHR) during labour and assess the health of the baby. However, visual interpretation of CTG signals is subjective and prone to error. Automated methods that mimic clinical guidelines have been developed, but they have failed to improve the detection of abnormal traces. This study aims to classify CTGs with and without severe compromise at birth, using routinely collected CTGs from 51,449 births at term and the first 20 min of FHR recordings. Three 1D-CNN- and LSTM-based architectures are compared. We also transform the FHR signal into 2D images using time-frequency representation with spectrogram and scalogram analyses; the 2D images are then analysed using 2D-CNNs. In the proposed multimodal architecture, the 2D-CNN and the 1D-CNN-LSTM are connected in parallel. The models are evaluated in terms of the partial area under the curve (PAUC) between 0 and 10% false-positive rate, and sensitivity at 95% specificity. The 1D-CNN-LSTM parallel architecture outperformed the other models, achieving a PAUC of 0.20 and a sensitivity of 20% at 95% specificity. Our future work will focus on improving the classification performance by employing a larger dataset, analysing longer FHR traces, and incorporating clinical risk factors.


Introduction
Cardiotocography (CTG) is a continuous and simultaneous measurement of the fetal heart rate (FHR) and maternal uterine contraction signals. CTG is commonly performed during or preceding labour to assess fetal wellbeing and reduce fetal mortality and morbidity [1]. Interpretation of CTG patterns requires a trained clinician to assess the FHR baseline, variability, accelerations, and decelerations. However, due to the complexity of CTG signals, visual interpretation is often challenging and imprecise [2], leading to missed diagnoses [3,4]. In the United Kingdom (UK), each year between 2015 and 2018, on average 125 intrapartum stillbirths, 154 neonatal deaths, and 854 severe injuries were registered [5]. These adverse outcomes frequently lead to litigation. In England, in 2020/2021, over £4.1 billion was spent on settling obstetric claims, 59% of which were clinical negligence payments [6]. Enhancing the accuracy of CTG interpretation could enable clinicians to intervene earlier, thereby potentially preventing some of these adverse outcomes and, in turn, alleviating the substantial financial burden on the healthcare system. Globally in 2019, an estimated 2 million babies were stillborn [7]. Most of these adverse outcomes occurring during the intrapartum period are potentially preventable with CTG monitoring and appropriate interventions.
CTG remains at the centre of the decision-making process in intrapartum fetal monitoring despite its limitations, as no other technology or method has been shown to perform better. Overall, previous studies on the analysis of CTG using deep learning approaches focused on the last hour of recordings, mostly using small datasets.
In this study, we present three deep learning models for the prediction of birth outcome using FHR traces recorded around the onset of labour, in both the time domain, implementing a combination of 1D-CNNs and LSTMs, and the frequency domain, employing 2D-CNNs [33,39]. The models are trained to classify newborns with and without severe compromise at birth. To our knowledge, this study represents a pioneering effort in the application of deep learning techniques to analysing CTG traces during early labour. Given the absence of published results from computer-based methods, we compared our findings with the existing standards of clinical care. We hypothesise that DL methods trained on a large clinical dataset of CTGs from around the onset of labour could ultimately assist clinicians in identifying foetuses who are already compromised or vulnerable at labour onset and may thus be at high risk of further injury during labour.

Description of the Dataset
This was a retrospective cohort study of infants delivered at the John Radcliffe Hospital in Oxford, UK, using a clinical data collection system between 1993 and 2012. The study received ethical approval from the Newcastle & North Tyneside 1 Research Ethics Committee, Reference 11/NE0044 (data before 2008), and from the South Central Ethics Committee, Reference 13/SC/0153 (for data after 2008). Informed consent by the participants was not required.
The clinical protocol has been to administer intrapartum CTG only to pregnancies deemed 'high-risk'. Of these, 51,449 CTG tracings correspond to births at gestation ≥36 weeks, are longer than 1 h, and have no second-stage trace in the first hour (Figure 1). In these records, 452 are births with a severe compromise, a composite outcome of intrapartum stillbirth, neonatal death, neonatal encephalopathy, seizures, and resuscitation followed by over 48 h in the neonatal intensive care unit. The rest of the cohort samples are labelled as no severe compromise.

Signal Cleaning
A bespoke algorithm is applied to remove artefacts from the CTG signal, for example, erroneous maternal heart rate capture and extreme outliers (FHR measurements >230 or <50 beats per minute (bpm)). The start time of the signal is adjusted to ensure adequate signal quality: the CTG is analysed using a sliding 5-min window (with a one-minute stride) to ensure that the signal loss is less than 50% in the first five minutes. The extracted cleaner 20-min FHR tracing has less than 50% signal loss, i.e., any sample with signal loss greater than the threshold is discarded. The cleaning process reduced the signal loss of the no severe compromise group (mean, ±std) from 26.9% (26.7, 27.2%) to 12.3% (12.2, 12.4%); and for the severe cases from 29.1% (26.6%, 31.5%) to 13.4% (12.3, 14.5%).
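The outlier rule and the sliding-window start-point search described above can be sketched as follows. This is a minimal illustration, not the bespoke algorithm itself (whose artefact-removal heuristics, e.g., for maternal heart rate capture, are not detailed here); missing samples are assumed to be encoded as NaN, and the 0.25 Hz sampling rate gives 75 samples per 5-min window and a 15-sample one-minute stride:

```python
import numpy as np

def clean_fhr(fhr, fs=0.25, max_loss=0.5):
    """Remove implausible FHR values and find a start point whose first
    5-min window has less than 50% signal loss.

    fhr: 1-D array in bpm; missing samples encoded as np.nan.
    Returns the signal from the first acceptable window onwards, or None.
    """
    fhr = fhr.astype(float).copy()
    # Extreme outliers (>230 or <50 bpm) are treated as signal loss.
    fhr[(fhr > 230) | (fhr < 50)] = np.nan

    win = int(5 * 60 * fs)      # 5-min window -> 75 samples at 0.25 Hz
    stride = int(60 * fs)       # one-minute stride -> 15 samples
    for start in range(0, len(fhr) - win + 1, stride):
        window = fhr[start:start + win]
        if np.mean(np.isnan(window)) < max_loss:
            return fhr[start:]
    return None
```

A subsequent pass (not shown) would then discard any extracted 20-min tracing whose overall signal loss still exceeds the 50% threshold.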

Gap Imputation
Signal noise and loss are common in CTG tracings, resulting in both short (few seconds) and long (many minutes) gaps in the signal [40]. Following efficient noise removal, reliable gap imputation is an important task in the data analysis and pre-processing phases, which is expected to improve the classifier performance at the later learning stage. A number of techniques have been proposed for inferring and imputing the gaps in FHR signals recorded by CTG, including linear interpolation [37], cubic spline interpolation [41], sparse representation with dictionaries [42], Gaussian processes (GP), and others [43,44]. We compared the performance of the linear, GP, and autoregressive (AR) imputation techniques (example shown in Figure 2) based on their effect on the performance of the CNN, evaluated on the testing set. Since AR imputation consistently outperformed the linear and GP techniques and achieved the highest CNN classification accuracy, we used it to impute the gaps in the FHR signals of our dataset (results of the comparison between the gap imputation techniques are reported in [45]).
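As an illustration of the AR approach, the sketch below fits a least-squares AR model on the valid samples preceding each gap and runs it forward to fill the gap. This is a simplified stand-in for the comparison reported in [45]: the model order, fitting method, and any backward/bidirectional pass used in the study may differ.

```python
import numpy as np

def ar_impute(signal, order=5):
    """Fill NaN gaps with a forward AR(order) prediction.

    For each gap, an AR model is fitted by least squares on the preceding
    valid samples and iterated forward over the gap.
    """
    x = signal.astype(float).copy()
    n = len(x)
    i = 0
    while i < n:
        if not np.isnan(x[i]):
            i += 1
            continue
        gap_start = i
        while i < n and np.isnan(x[i]):
            i += 1
        history = x[:gap_start]
        valid = history[~np.isnan(history)]
        if len(valid) <= order:
            # Too little history: fall back to last-value carry-forward.
            x[gap_start:i] = valid[-1] if len(valid) else 0.0
            continue
        # Lagged design matrix: predict v[t] from v[t-1] ... v[t-order].
        y = valid[order:]
        A = np.column_stack([valid[order - 1 - k: len(valid) - 1 - k]
                             for k in range(order)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        buf = list(valid[-order:])            # most recent history
        for j in range(gap_start, i):
            pred = float(np.dot(coef, buf[::-1]))
            x[j] = pred
            buf = buf[1:] + [pred]
    return x
```

On a locally stationary or trending segment this extrapolates the recent dynamics into the gap, which is the property that distinguishes AR imputation from plain linear interpolation.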

Pre-Processing
The CTG datasets are obtained with standard fetal monitors at a 4 Hz sampling rate for the FHR and 2 Hz for the uterine contraction signals. Consistently with our prior work [37], the FHR signals are downsampled to 0.25 Hz. We apply a two-stage pre-processing procedure to deal with the noise and missingness: signal cleaning and gap imputation.
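Block averaging is one plausible way to realise the 4 Hz to 0.25 Hz reduction (a factor of 16); the exact decimation method used in the study is not specified here, so this is an illustrative sketch:

```python
import numpy as np

def downsample_fhr(fhr_4hz, factor=16):
    """Downsample a 4 Hz FHR trace to 0.25 Hz by NaN-aware block averaging.

    Each non-overlapping block of `factor` samples is reduced to its mean;
    a ragged tail shorter than one block is dropped.
    """
    n = len(fhr_4hz) // factor * factor
    blocks = np.asarray(fhr_4hz[:n], dtype=float).reshape(-1, factor)
    return np.nanmean(blocks, axis=1)
```

A 20-min trace at 4 Hz (4800 samples) thus becomes a 300-sample trace at 0.25 Hz, matching the input length used by the 1D models below.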


Transformation of the FHR Signal to a 2D Image
The raw FHR signals (each sample 20 min long) are transformed into time-frequency using Fourier and Wavelet transforms, and the resulting images are analysed using the wellestablished 2D-CNNs. The spectrogram of Short-Time Fourier Transform (STFT) represents the normalised, squared magnitude of short-time Fourier transform coefficients [46]. To convert the input 1D FHR signal using spectrograms, the time domain signals are divided into shorter segments (windows), and Fourier transform is computed for each segment to obtain the frequencies. We used a 1 Hz FHR signal in the spectrogram and scalogram (Wavelet transform) analysis since it produced better accuracy in our preliminary experiment. The spectrogram is the STFT of each short signal segment, computed by sliding the window with a constant stride and an overlap through the entire record. In this work, we investigate the effect of different window strides and overlapping sizes on the classifier's performance. The FHR signals are converted into spectrogram images by applying STFT given with (1).
X(n, ω) = Σ_{m=0}^{L−1} x(n + m) w(m) e^{−jωm}        (1)

where x(n) is the input FHR signal, w(m) is the window, and L is the window length. X(n, ω) is the STFT of the windowed data centred at time point n, and the log values of X(n, ω) are represented as spectrogram (128 × 128) images. Since the window length L is a hyperparameter, we investigated the effect of different window sizes on the classification performance. In our initial experiments, the 128 × 128 spectrogram images led to better classifier performance than 64 × 64 or 256 × 256 image sizes. Thus, we used the 128 × 128 image size for the rest of our analysis.
In addition to the STFT spectrograms, we also investigated the wavelet scalograms. A scalogram is a time-frequency representation with a wavelet basis instead of sinusoidal functions. Like the Fourier spectrogram, the scalogram analysis computes the coefficients using sliding windows called wavelets, i.e., the input signal is multiplied with the wavelet at different time locations. The process is repeated by increasing the scale of the wavelet (also known as the mother wavelet). This dilation and contraction operation captures long- and short-time events from the input, where the dilated wavelet is sensitive to long-time events, and the contracted wavelet to short-time events [47]. The wavelet transform of a signal x(t) is defined as the integration of x(t) with the shifted and scaled shapes ψ_{a,b}(t) of a mother wavelet ψ, as shown in (2).

WT_x(a, b) = (1/√a) ∫ x(t) ψ*((t − b)/a) dt        (2)

where a is a scale parameter, b is a translation parameter, and ψ(t) is the mother wavelet function. By using different scale factors, the wavelet transform WT_x computes wavelet coefficients of the signal at different scales. The absolute values of these continuous coefficients define the scalogram (in our case, represented as a 128 × 128 image). The choice of the mother wavelet is important as the time-frequency analysis represents the match between the wavelet and the FHR signal. Here, we compare the Gaussian of order 8 ('gaus'), Morlet ('morl'), Shannon ('shan'), and Mexican Hat ('mexh') wavelets [47].
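The spectrogram computation of (1) can be sketched in a few lines of numpy. The window length, stride, and Hann window here are illustrative choices, not the tuned hyperparameters of the study; a 20-min segment of the 1 Hz FHR signal gives 1200 input samples:

```python
import numpy as np

def fhr_spectrogram(x, win_len=64, stride=8):
    """Log of the squared-magnitude STFT of a 1 Hz FHR segment, per Eq. (1).

    Returns one row per window position and one column per rfft frequency
    bin; the study resized the result to a 128 x 128 image for the 2D-CNN.
    """
    x = np.asarray(x, dtype=float)
    w = np.hanning(win_len)                           # window w(m)
    starts = range(0, len(x) - win_len + 1, stride)
    frames = np.stack([x[s:s + win_len] * w for s in starts])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # squared magnitude
    return np.log(spec + 1e-10)                       # log values -> image
```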

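Similarly, a minimal scalogram per (2) can be computed by convolving the signal with scaled copies of a mother wavelet. The Mexican hat ('mexh') wavelet is used here because it has a simple closed form; the wavelet support length (ten times the scale) and the scale range are illustrative assumptions:

```python
import numpy as np

def ricker(points, a):
    """Mexican hat ('mexh') wavelet of scale a, sampled at `points` points."""
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def scalogram(x, scales):
    """|CWT| of x: one row of absolute wavelet coefficients per scale.

    Each row is the convolution of the signal with the wavelet dilated to
    that scale; the study rendered the result as a 128 x 128 image.
    """
    x = np.asarray(x, dtype=float)
    rows = [np.convolve(x, ricker(min(10 * a, len(x)), a), mode='same')
            for a in scales]
    return np.abs(np.stack(rows))
```

Small scales respond to short-time events and large scales to long-time events, which is the dilation/contraction behaviour described above.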
Data Augmentation
The dataset used in this study is substantially imbalanced: there are far fewer positive samples (n = 384) than negative ones (n = 43,293). Class imbalance can make the learning process very challenging for any classifier and usually leads to poor prediction performance [48]. In our initial model training, we observed signs of overfitting. To mitigate this problem, we augmented the data of the severe compromise class, which we expected to act as a regulariser and improve the generalisation performance of the classifier. Several data augmentation methods have been proposed for time series signals, such as flipping the signal, adding noise, or masking segments of the signal [49]. We developed a simple data augmentation approach, tailored to our specific task, which involved extracting additional 20-min FHR segments from the first 1 h of FHR data with 50% overlap, thereby increasing the number of positive samples by a factor of 4. Only the 20-min segments with less than 50% signal loss were augmented. The positive instances were then further oversampled by a factor of 2, which led to their overall increase by a factor of 8 in the training dataset. In CTG analysis, under-sampling of the negative samples is a common practice [12,30], but in our experiments, these techniques did not improve the generalisation performance of the trained models.
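The overlapping-segment extraction can be sketched as follows. At 0.25 Hz, a 1-h trace is 900 samples and a 20-min segment is 300 samples, so a 50% overlap yields up to five candidate segments per trace before the signal-loss filter is applied; the exact bookkeeping that gives the quoted factor of 4 is not spelled out in the text, so treat this as an illustration:

```python
import numpy as np

def augment_segments(fhr_1h, fs=0.25, seg_min=20, overlap=0.5, max_loss=0.5):
    """Extract 20-min FHR segments from a 1-h trace with 50% overlap.

    Segments whose NaN fraction (signal loss) is 50% or more are discarded,
    mirroring the augmentation rule described in the text.
    """
    seg = int(seg_min * 60 * fs)          # 20 min -> 300 samples
    step = int(seg * (1 - overlap))       # 50% overlap -> 10-min step
    out = []
    for start in range(0, len(fhr_1h) - seg + 1, step):
        window = fhr_1h[start:start + seg]
        if np.mean(np.isnan(window)) < max_loss:
            out.append(window)
    return out
```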

Deep Learning Architectures
The proposed architectures primarily consist of 1D-CNNs and LSTMs. Three variations are investigated: (i) a 5-layer 1D-CNN network, in which the encoder is composed of five 1D-CNN layers and two fully connected (FC) layers; (ii) a CNN-LSTM sequential architecture, in which the 5-layer 1D-CNN is followed by a 2-layer LSTM component and two FC layers; and (iii) a CNN-LSTM parallel architecture, in which a 5-layer 1D-CNN and a two-layer LSTM network are connected in parallel, followed by two FC layers.
The architectures of the three models are shown in Figure 3. We also employed a 2D-CNN to analyse the spectrograms and scalograms of the FHR signal. The network architecture comprises 2D-CNNs (with ReLU activation), along with five residual blocks, followed by an average pooling layer. The residual blocks are 2D-CNNs with skip connections, a widely used network architecture for various image recognition tasks [28]. Each residual block has a 2D-CNN component, followed by a sequence of batch normalisation, dropout, and max pooling operations (Figure 4). In addition, we investigated the effect of different kernel sizes on classification performance.
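Variant (iii), the parallel topology, can be sketched with the Keras functional API. The filter counts, kernel sizes, and LSTM units below are illustrative placeholders, not the tuned values of Table 1:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_parallel_cnn_lstm(t_steps=300, n_feat=1):
    """Sketch of the 1D-CNN-LSTM parallel architecture (variant iii).

    A five-block 1D-CNN branch and a two-layer LSTM branch read the same
    FHR input; their features are concatenated and passed to two FC layers.
    """
    inp = layers.Input(shape=(t_steps, n_feat))

    # CNN branch: five 1D conv blocks, then global pooling.
    x = inp
    for filters in (32, 32, 64, 64, 128):
        x = layers.Conv1D(filters, 5, padding='same', activation='relu')(x)
        x = layers.MaxPooling1D(2)(x)
    cnn_out = layers.GlobalAveragePooling1D()(x)

    # LSTM branch: two stacked LSTM layers.
    y = layers.LSTM(64, return_sequences=True)(inp)
    lstm_out = layers.LSTM(64)(y)

    # Concatenate the two branches, then two FC layers.
    z = layers.concatenate([cnn_out, lstm_out])
    z = layers.Dense(64, activation='relu')(z)
    out = layers.Dense(1, activation='sigmoid')(z)  # class probability
    return Model(inp, out)
```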
We split the data randomly into training (85%) and test (15%) sets, while pres the class ratio in each subset. Ten-fold cross-validation was performed using the t dataset, in which 90% of the samples are used for training the model and the rem 10% for validation. During each fold, the model is trained for a maximum of 400 with early stopping to mitigate overfitting (with a window size of 50 epochs) by m ing the partial area under the receiver characteristic curve (PAUC). After the o  The combined 1D-CNN-LSTM and 2D-CNN deep learning network architecture is a two-channel network: one channel to analyse the input 1D-FHR signal using 1D-CNN-LSTM parallel topology, while the other channel uses a 2D-CNN with skip connections (Figure 4) to extract spectral features from the sample. This makes the architecture capable of analysing and capturing the signal's temporal and spectral characteristics. The input FHR signal (1D signal) and the corresponding spectrogram or scalogram (2D) are simultaneously fed to the respective channels. Subsequently, the output of the two channels is concatenated and fed to an FC layer.
The input to each of the 1D-CNN-LSTM based models is prepared as (B, T, F), where B, T, and F represent the batch size, the number of time steps, and the signal dimension, respectively. In a 20-min FHR signal sampled at 0.25 Hz, there are 300 time steps (15 per minute; T = 300, F = 1) in each CTG sample. Similarly, the input to the 2D-CNN network is arranged as (B, H, W, C), where B is the number of samples or the batch size, while W (128), H (128), and C (3) are the width, height, and number of channels of the image, respectively. A sigmoid activation function is used at the last layer of all the networks to obtain the class probability prediction for each sample.
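As a hedged illustration, the two tensor layouts described above can be sketched in NumPy; the batch size and the data themselves are made up for the example:

```python
import numpy as np

# Illustrative sketch: a batch of 20-min FHR traces (0.25 Hz, 300 samples
# each) arranged in the (B, T, F) layout expected by the 1D models, and
# dummy spectrogram/scalogram images in the (B, H, W, C) layout for the 2D-CNN.
B = 4                                    # batch size (illustrative)
fhr_batch = np.random.rand(B, 300)       # raw 1D FHR traces
x_1d = fhr_batch.reshape(B, 300, 1)      # (B, T=300, F=1)

images = np.random.rand(B, 128, 128, 3)  # (B, H=128, W=128, C=3)

print(x_1d.shape, images.shape)
```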

Training Procedure
We split the data randomly into training (85%) and test (15%) sets, while preserving the class ratio in each subset. Ten-fold cross-validation was performed using the training dataset, in which 90% of the samples are used for training the model and the remaining 10% for validation. During each fold, the model is trained for a maximum of 400 epochs with early stopping to mitigate overfitting (with a window size of 50 epochs) by monitoring the partial area under the receiver characteristic curve (PAUC). After the optimal model parameters are obtained, the PAUC is evaluated on the test set. We report the average performance of the ten models evaluated on the test set.
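A minimal NumPy sketch of the stratified 85/15 split described above (the paper does not state which implementation was used; the labels here are toy data):

```python
import numpy as np

# Hedged sketch of an 85/15 split that preserves the class ratio,
# done per class with plain NumPy (illustrative, not the authors' code).
rng = np.random.default_rng(0)
labels = np.array([0] * 95 + [1] * 5)   # toy imbalanced labels

train_idx, test_idx = [], []
for cls in np.unique(labels):
    idx = rng.permutation(np.where(labels == cls)[0])
    n_train = int(round(0.85 * len(idx)))
    train_idx.extend(idx[:n_train])
    test_idx.extend(idx[n_train:])

# Both subsets keep (approximately) the original class ratio.
print(len(train_idx), len(test_idx))
```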
Since our class distribution is unbalanced, a weighted binary cross-entropy loss is used for training, where the weight is based on the inverse class frequency, i.e., during training, the loss function penalises the misclassification of severely compromised cases more heavily (by a factor of 14). As mentioned above, the network is trained for up to 400 epochs using the Adam optimiser with an initial learning rate of 0.001, decayed by a factor of 2 every 50 epochs. The batch size is set to 128, and batch normalisation is used with the default parameters recommended in [50]. Furthermore, we used dropout with a probability of 0.3 in all layers and early stopping based on the PAUC. The tuned hyperparameter values, including the CNN module's filter size, the number of filters used in each CNN layer, and the number of units in the LSTM modules, are summarised in Table 1. The model is implemented using TensorFlow on an NVIDIA GTX 2080 Ti 12GB GPU machine.
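The effect of the class weighting can be illustrated with a small NumPy sketch (the paper trains in TensorFlow; `weighted_bce` and the toy values below are illustrative assumptions, not the authors' code):

```python
import numpy as np

def weighted_bce(y_true, y_pred, pos_weight=14.0, eps=1e-7):
    """Binary cross-entropy that up-weights the rare positive
    (severe compromise) class, as described above. Pure-NumPy sketch."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    loss = -(pos_weight * y_true * np.log(y_pred)
             + (1 - y_true) * np.log(1 - y_pred))
    return loss.mean()

# A missed positive is penalised 14x more than an equally wrong negative.
miss_pos = weighted_bce(np.array([1.0]), np.array([0.1]))
miss_neg = weighted_bce(np.array([0.0]), np.array([0.9]))
print(miss_pos / miss_neg)
```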
To ensure that every feature in the data has the same level of importance, features are standardised using the z-score: z_i = (x_i − µ)/σ, where x_i is the original value of sample i in the dataset, z_i is the standardised value, µ is the mean, and σ is the standard deviation.
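A one-line NumPy illustration of this standardisation (the values are toy FHR readings in bpm):

```python
import numpy as np

# z-score standardisation: z_i = (x_i - mu) / sigma
x = np.array([120.0, 135.0, 150.0, 140.0, 155.0])  # toy FHR values (bpm)
z = (x - x.mean()) / x.std()

# The standardised feature has zero mean and unit standard deviation.
print(z.mean().round(10), z.std().round(10))
```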

Performance Metrics
The performance of the models is evaluated using the partial area under the receiver operating characteristic curve (PAUC) and the true positive rate (TPR = TP/(TP + FN)) at a 5% false-positive rate (FPR). The PAUC is the AUC between 0 and 10% false-positive rates (FPR = FP/(FP + TN)). These metrics are selected to assess the accuracy of the model only at very low false-positive rates, i.e., to detect adverse birth outcomes as accurately as possible while minimising the rate of false positives. It is crucial to minimise unnecessary interventions, particularly early in labour. The ROC curve is a commonly accepted graphical plot that shows the performance of a binary classifier across all classification thresholds. It is based on TPR and FPR values, which are calculated from true positive (TP), false positive (FP), true negative (TN), and false negative (FN) counts. The AUC measures the area underneath the entire ROC curve from (0, 0) to (1, 1). We also considered the Precision, Recall, and F1-score values of the three 1D-CNN and LSTM based models.
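A possible NumPy sketch of "sensitivity at a fixed FPR" (the function `tpr_at_fpr` and the toy scores are assumptions for illustration; the paper does not specify its implementation):

```python
import numpy as np

def tpr_at_fpr(y_true, scores, max_fpr=0.05):
    """Sensitivity (TPR) at a given false-positive rate, obtained by
    thresholding at the corresponding score quantile of the negatives.
    Illustrative sketch only."""
    neg_scores = scores[y_true == 0]
    # threshold such that at most `max_fpr` of negatives exceed it
    thr = np.quantile(neg_scores, 1 - max_fpr)
    return (scores[y_true == 1] > thr).mean()

rng = np.random.default_rng(0)
y = np.r_[np.zeros(1000), np.ones(1000)]
s = np.r_[rng.normal(0, 1, 1000), rng.normal(2, 1, 1000)]  # separable toy scores
print(tpr_at_fpr(y, s))  # TPR at 5% FPR
```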

Hyperparameter Tuning
Optimal model hyperparameters are tuned using the combined Bayesian Optimisation and Hyperband (BOHB) optimiser [51]. The selected values for the batch size, the number of filters in each CNN layer, the kernel size, the number of layers, and the optimisation functions are all summarised in Table 1.

Performance of the Proposed Models
The performance of the three 1D-CNN and LSTM based models is shown in Figure 5. The 1D-CNN-LSTM parallel architecture achieved higher sensitivity and PAUC on the testing set than the 1D-CNN-LSTM sequential architecture. The difference between the PAUC values of the 1D-CNN-LSTM parallel and 1D-CNN-LSTM sequential models is statistically significant (Mann-Whitney U test, two-tailed, U = 10, p = 0.002). All other differences between the three models in terms of PAUC and sensitivity at 0.95 specificity are not statistically significant. The comparison of the three models in terms of Precision, Recall, and F1-score is shown in Table 2. The best performing 1D-CNN and LSTM based models (from the 10 models trained using 10-fold cross-validation) are shown in Figure 6.
Table 3 shows the performance of the 2D-CNN and the multimodal architecture (combined 1D-CNN-LSTM and 2D-CNN). The 2D-CNN using scalogram analysis produced slightly better results than the one using spectrograms. However, the performance of the 2D-CNN (alone) and the multimodal architecture was inferior to that of the 1D-CNN-LSTM parallel architecture. This indicates that using the raw FHR (temporal representation) as input can lead to better classification performance than relying on the time-frequency representations. Table 3. Classification performance (mean of the 10-fold cross-validation on the test set) of the 2D-CNN alone and when combined with the 1D-CNN-LSTM parallel architecture. The performance metrics when varying the window size of the spectrograms and the kernel sizes of the 2D-CNN model are given in Table 4. Three different kernel sizes are considered for the comparison: 3 × 3, 5 × 5, and 7 × 7. The rest of the hyperparameters, such as batch size, number of layers, and number of filters, are selected using cross-validation. The highest PAUC and sensitivity at a specificity of 0.95 are achieved using a window size (overlap size) of 64 (32) with a 3 × 3 kernel. This indicates that the two classes can be better separated when the frequency content of the input FHR signals is analysed using a window size of about 1 min. Nevertheless, the performance of the spectrogram analysis is inferior to that of the 1D-CNN and LSTM based models. Table 4. Impact of the window length on the classification performance of the spectrograms. The results are performances of the best models from the 10-fold cross-validation, evaluated on the test set (the best classification performances are given in bold). Table 5 demonstrates the performance metrics and the influence of the different kernel sizes on the scalograms generated using a variety of wavelet functions.
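The spectrogram settings discussed above (a 64-sample window with 32-sample overlap on a 300-sample trace) can be sketched with SciPy; the synthetic trace and the exact parameter mapping are illustrative assumptions:

```python
import numpy as np
from scipy import signal

# Hedged sketch: spectrogram of a 300-sample (20-min, 0.25 Hz) FHR trace
# with a 64-sample window and 32-sample overlap, as in the settings above.
rng = np.random.default_rng(0)
fhr = 140 + 5 * rng.standard_normal(300)  # toy FHR trace (bpm)

f, t, Sxx = signal.spectrogram(fhr, fs=0.25, nperseg=64, noverlap=32)

# 33 frequency bins (nperseg//2 + 1) by 8 time segments.
print(f.shape, t.shape, Sxx.shape)
```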
The proposed 2D-CNN achieved the highest classification results using a 3 × 3 kernel on scalograms generated with the Mexican hat wavelet function. This indicates a greater similarity between the input FHR signal and the Mexican hat, making it a preferable wavelet. The scalogram analysis showed slightly better separation between the two classes than the spectrograms. However, the wavelets' performances were inferior to those of the 1D-CNN and LSTM based models. Table 5. Classification performance of the different wavelet functions and kernel sizes. The results are performances of the best models from the 10-fold cross-validation, evaluated on the test set (the best classification performances are given in bold).

Comparison with Clinical Practice and OxSys
We compared the 1D-CNN-LSTM parallel model to OxSys 1.5 [3] and a clinical benchmark [11]. OxSys uses two FHR features and two clinical risk factors to analyse the entire FHR trace with a 15-min sliding window. In clinical practice, clinicians consider not only the findings of CTG interpretation but also clinical risk factors when making the diagnosis. The TPR of detecting severe adverse outcomes by clinical practice, OxSys 1.5, and our model is presented in Table 6. The TPR in clinical practice reported in [3] is defined as the number of emergency deliveries, based on a clinical decision for "presumed fetal compromise", as a proportion of the total number of babies with compromise (5/162) within 2 h of the start of the CTG recording. The FPR is the number of emergency deliveries, based on a clinical decision for "presumed fetal compromise" where there was no compromise, as a proportion of the total number of normal cases (108/27,652). The results show that OxSys 1.5, which is based on the entire CTG and clinical risk factors, achieved the highest sensitivity. We also compared the sensitivity of our 1D-CNN and LSTM based models and clinical practice at a similar FPR value of 0.4%. The results, presented in Figure 7, indicate that the sensitivity of the 1D-CNN-LSTM parallel model falls slightly below the sensitivity achieved in clinical practice (2.4% vs. 3.1%). However, it is important to interpret these findings with caution, as the clinical practice results are based on clinical risk factors and the initial 2-h CTG recording, whereas our analysis is based solely on the initial 20 min of FHR recording.

Effect of the Pre-Processing on the Model's Output
We investigated the relationship between the model's output and the location and/or magnitude of signal loss within CTG segments. Table 7 shows Spearman's correlation coefficient between the model predictions (range between 0 and 1) and the percentage of signal loss, number of gaps in the signal, the longest gap length, and its location. The signal loss summary statistics are computed before the gap imputation. The results show weak correlation between the model's outputs and the signal loss summary statistics, indicating that our model predictions are largely independent of the magnitude and the location of signal loss. Table 7. Spearman correlation between the 1D-CNN-LSTM network predictions and signal loss (testing set data).
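An illustrative SciPy sketch of the kind of correlation reported in Table 7, using synthetic, independent data (so the correlation is expectedly near zero):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy version of the Table 7 analysis: Spearman correlation between model
# predictions and a signal-loss statistic. The data here are synthetic and
# independent by construction, mimicking the "weak correlation" finding.
rng = np.random.default_rng(0)
preds = rng.uniform(0, 1, 200)         # model outputs in [0, 1]
signal_loss = rng.uniform(0, 50, 200)  # % signal loss, independent of preds

rho, p = spearmanr(preds, signal_loss)
print(round(rho, 3))  # near zero for independent variables
```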

Post-Hoc Analysis of the Models' Prediction
We utilised concept attribution as a technique to enhance the explainability of our deep learning model [52]. Concept attribution seeks to identify the key features or concepts in the input data that exert the greatest influence on the model's decision-making. In this case, we investigated whether the predictions generated by the deep learning model were related to the clinical features of the FHR used for evaluating initial FHR traces. In the initial stages of labour, the FHR baseline and variability play a crucial role in the interpretation of CTG [11]. Table 8 shows the relationship between the predictions of the model (1D-CNN-LSTM parallel architecture) and the standard clinical features of the FHR, such as the FHR baseline and FHR short-term variability (STV). The testing set samples are divided into three groups, based on the quartiles of the predicted values from the 1D-CNN-LSTM network: low (≤25th percentile), medium (>25th and <75th percentile), and high (≥75th percentile). The samples with clinically low STV (STV ≤ 3 ms) are more likely to have high DL predictions signifying increased risk (39.1% vs. 12.1%), which is in line with the current clinical understanding that diminished STV is a significant risk factor for fetal compromise. On the other hand, the model's predictions do not appear to be associated with the FHR baseline.
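The quartile-based grouping can be sketched in NumPy (the predictions are synthetic; the thresholds follow the low/medium/high definition above):

```python
import numpy as np

# Illustrative sketch: split test-set samples into low / medium / high
# groups by the quartiles of the network's predicted values.
rng = np.random.default_rng(0)
preds = rng.uniform(0, 1, 400)               # toy model outputs in [0, 1]
q25, q75 = np.percentile(preds, [25, 75])

groups = np.where(preds <= q25, "low",
                  np.where(preds >= q75, "high", "medium"))
print({g: int((groups == g).sum()) for g in ("low", "medium", "high")})
```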

Discussion
This study investigates the potential of different deep learning architectures for predicting births with and without severe compromise, using the first 20 min of the FHR signals of more than 51,000 births, recorded as part of routine clinical practice in a UK hospital between 1993 and 2012. Of the designed and proposed DL architectures, the 1D-CNN-LSTM parallel topology achieved superior classification performance compared to the other two developed models: the 2D-CNN and the multimodal architecture (combined 1D-CNN-LSTM and 2D-CNN). The suboptimal performance of the 2D-CNNs could be attributed to a lack of informative features in the time-frequency representations of the FHR signal. The post-hoc analysis also indicates that the performance of the 1D-CNN-LSTM model is not biased by signal loss, and that its predictions are related, to a degree, to low STV values, which aligns with the clinical expectation of what is important in the initial FHR [11].
The sensitivity of CTG in detecting severely compromised births based on the initial hour is not well established, and there is limited evidence available. Lovers et al. [11] reported that the sensitivity of admission CTG is approximately 3.1% at a 0.4% FPR (Figure 7). Our best model, the parallel 1D-CNN-LSTM, achieved a slightly lower sensitivity of 2.4% at a 0.4% FPR. This outcome is promising, particularly considering that clinical sensitivity in practice relies on evaluating the initial 2 h of CTG data and incorporates various clinical risk factors. Nonetheless, the findings imply that our model has the potential to serve as a valuable aid for clinicians in identifying fetal distress and averting adverse birth outcomes.
Previous studies have also explored the potential of data-driven approaches for detecting abnormalities in CTG traces, focusing on the last hour of the CTG recording. For instance, Petrozziello et al. [37] demonstrated that a 1D-CNN model trained on more than 35,000 CTGs of the last hour of recording can achieve a higher TPR in predicting birth acidemia (pH < 7.05) than the clinical diagnosis (53% vs. 31% at about 15% FPR). Other studies [12,30,31], using much smaller datasets and pH < 7.15 as the abnormal outcome, also implemented 1D CNNs to classify FHR signals. However, these works focused on detecting birth acidemia based on the last hour of the CTG recording. When similar outcome groups are investigated (as in this work: with and without severe compromise), OxSys 1.5 [3] achieved a slightly higher TPR (43% at 14% FPR) than the clinical diagnosis (35% at 16% FPR) and our 1D-CNN-LSTM model (35% at 16% FPR) on a dataset of more than 22,000 CTGs. Nevertheless, this relatively higher accuracy results from analysing the entire FHR trace (in our case, the analysis is based on the first 20 min only).
The main contributions of our work are: proposing and implementing DL models based on a uniquely large and detailed dataset, allowing their successful training, validation, and testing; the clinically relevant definition of a rare severe compromise; and the focus on the first 20 min of the FHR, seeking an early warning for those fetuses that are unlikely to sustain the stress of labour due to pre-existing vulnerability. We capped the false positive rate at 5% and achieved a sensitivity of 20%, an encouraging result given that most infants who sustain severe compromise are expected to do so later in labour, and given that the false positive rate cannot be precisely defined, owing to the routine nature of our data, which includes high rates of clinical intervention and censoring. The achieved performance, as compared to the clinical benchmark (as shown in Figure 7), is highly encouraging. This is particularly noteworthy because predicting adverse outcomes in clinical practice relies not only on CTG patterns but also on various risk factors such as abnormal fetal growth, antepartum haemorrhage, prolonged rupture of membranes, and meconium staining of the amniotic fluid [9]. Consequently, a model that provides an objective assessment of the CTG without imposing a significant computational burden (a trace is classified in under a second with the trained model) can serve as an integral part of a clinical decision support tool. By doing so, it could contribute to optimising the allocation of clinical resources, allowing clinicians to focus on other crucial responsibilities. Finally, in our future work, we expect to further improve the model accuracy by incorporating clinical risk factors into the analysis.
Some limitations of our approach are also worth noting. While data augmentation has demonstrated some effectiveness in addressing label imbalance and reducing model overfitting, it is important to consider the potential risk of amplifying label noise within the dataset. Thus, it is necessary to acknowledge that data augmentation alone may not offer a complete solution to the challenges associated with learning from imbalanced datasets. This limitation is evident in the variance of the ten cross-validated models, as depicted in Figure 5. Future work should address this by employing other techniques, perhaps creating synthetic data using generative adversarial networks [53]. The classification performance should also be improved by analysing longer traces and incorporating uterine contraction signals and clinical risk factors (such as fetal gestation and maternal co-morbidities) into the model. Finally, our approach does not explain which segment of the input leads to a particular prediction. Therefore, future work will consider applying an attention layer to provide better explainability [54].

Conclusions
We developed and evaluated three different deep neural network architectures to classify a 20-min FHR segment, recorded around the onset of labour, to investigate their potential for providing very early warning and triaging women in labour into high- or low-risk groups for further monitoring and/or review. We achieved the best performance using the proposed 1D-CNN-LSTM parallel architecture: the best model achieved a sensitivity of 20% at 95% specificity. The results are clinically encouraging, considering that the majority of compromised babies are not expected to have demonstrated problems at the onset of labour and that, where present, such problems are challenging to detect. It is important to note that there is also room to improve the model's classification performance by analysing the entire FHR trace and incorporating clinical risk factors.
Although the proposed DL architecture achieved encouraging results on the holdout test set, it should also be tested on an external dataset to further validate its generalisation performance. In addition, the investigated models lack explainability; future work will incorporate an attention mechanism to address this.

Institutional Review Board Statement:
The study received ethical approval from the Newcastle & North Tyneside 1 Research Ethics Committee, Reference 11/NE0044 (data before 2008), and from the South Central Ethics Committee, Reference 13/SC/0153 (for data after 2008).

Informed Consent Statement:
Informed consent by the participants was not required.