Introduction

Within many fields of medical imaging, methods such as convolutional neural networks (CNNs) are being used to improve visual diagnostics1. In maternal–fetal medicine, CNNs have been developed to recognize and optimize standard planes for fetal weight estimation, echocardiography, and neurosonography2,3,4,5.

One area of fetal ultrasound imaging that remains difficult to evaluate systematically is the placenta, making it an obvious target for neural network analysis. The placenta supplies the fetus with oxygen and nutrients and is central to several pregnancy complications, including placenta previa, gestational hypertensive disorders, fetal growth restriction, and intrauterine fetal death. Placental volume is the only known morphological factor that can predict adverse events: a small placenta at the first-trimester scan is associated with an increased risk of impaired fetal growth and preeclampsia6,7. However, estimating placental volume is time-consuming, and operator dependency is a persistent problem.

For placental volume estimation, semi-automatic 3D placenta segmentation has been performed successfully in previous studies. Yang et al. used a CNN to segment the placenta, amniotic fluid, and fetus automatically8. Looney et al. trained a deep neural network for automated 3D segmentation of first-trimester placentas for volume estimation9; the reference placentas were segmented semi-automatically using a random walker algorithm after initial seeding by two operators. This was further improved in a recent study applying a multiclass CNN10. Anterior placentas are generally easier to segment, and classification accuracy improves when the CNN is trained on either anterior or posterior placentas11. Studies examining 2D ultrasound placenta segmentation are scarce. Among these, Hu et al. used manual segmentation as ground truth and showed that adding a shadow-detection layer improves performance in images with acoustic shadows12.

Most existing studies on placenta classification and segmentation have used structured protocols for image acquisition, which may inflate accuracy at the expense of robustness and generalizability. Few studies have examined how placenta classification and segmentation models perform across different patient populations and datasets, as most models are based on single-institution data. Finally, there are minimal data on how these models compare to clinicians with different levels of ultrasound competence.

To fill this gap, we aimed to develop a model using CNNs to identify the location of placental tissue and perform segmentation of the placenta based on images acquired during routine ultrasound examinations. To examine how well our model performed with different populations, we trained, tested, and validated the model using data from several hospitals across two different European countries, and we determined baseline performance for different groups of clinicians.

Results

Results based on the entire test set

Image classification

The overall image classification accuracy on the entire test set, across trimesters, was 81.42%. Accuracy, precision, recall, and F1 for each trimester are included in Table 1.

Table 1 Performance measures and IoU for the model on the entire test set across trimesters.

Classification performance improved with gestational age and was stable from the second trimester onward. Most first-trimester errors were failures to detect placental tissue (false negatives), as precision remained stable in the first trimester.

Image segmentation

For test images in which placental tissue was detected, the IoU between the prediction and the annotated placenta is reported in Table 1. Performance increased after the first trimester and was stable across the second and third trimesters.

Classification of anterior/posterior placental tissue

The model for distinguishing between anterior and posterior placental tissue yielded an accuracy of 71% (Fig. 1). The model performed better at recognizing the anterior placenta, which is to be expected based on previous research13.

Figure 1 Confusion matrix for classification of anterior/posterior placentas.

Comparison with different groups of clinicians on a data subset

For image classification, where the model detected whether the image contained placental tissue or not, Table 2 compares the model’s accuracy with the performance of different groups of clinicians for a subset of 24 randomly selected ultrasound images. Table 3 demonstrates the segmentation performance of the model and different groups of clinicians on segmentations where both the ground truth and the prediction/clinician indicated that the image contained placental tissue. These comparisons are summarized visually in the violin plots in Fig. 2. Example segmentations along with the ground truth are shown in Fig. 3.

Table 2 Classification accuracy on a subset of 24 randomly selected ultrasound images: the model compared with the mean and standard deviation of the classification accuracy distributions for different groups of clinicians across trimesters. P-values of a one-sided, one-sample t-test are reported in parentheses.
Table 3 Segmentation performance measured by IoU across trimesters for the subset cases where images were correctly classified as containing placental tissue by the model or clinician.
Figure 2 Violin plots of classification accuracy distributions (first row) and distributions of segmentation performance measured by IoU (second row) for different groups across trimesters.

Figure 3 Six annotation examples, including model prediction (red), ground truth (blue), and their overlap (green).

For the classification task of detecting whether or not an image contained placental tissue, the model obtained an accuracy comparable to that of the midwives. While this might seem low, the results in Table 3 and Fig. 2 show that when the model detects placental tissue, it consistently segments it better and more robustly than all groups of experts (Supplementary Information).

Cross-country generalization

Finally, we evaluated the generalization of the final model to a dataset from another country by testing it externally on 100 annotated images from a publicly available dataset from Barcelona, Spain. Accuracy was slightly lower than on the Copenhagen dataset, but F1 values were similar (Table 4).

Table 4 Cross-validation performance on a dataset from Barcelona containing 100 second-trimester ultrasound images.

For the 25 images in which both the prediction and the ground truth indicated placental tissue, the IoU was 0.68 ± 0.20.

Discussion

We developed a model to identify and segment the placenta in ultrasound images. Our model identified the presence of placental tissue with an accuracy of 81%, and in images with correctly identified placental tissue, it reached an average IoU of 0.78.

We benchmarked the model's performance against obstetric staff with different levels of ultrasound experience: fetal medicine experts, obstetrician trainees, and midwives. The model performed at the level of midwives for the classification task (Table 2) and consistently yielded better and more robust segmentations of the placenta than all groups of clinicians (Table 3).

Finally, identification of placental location on the anterior or posterior uterine wall reached an accuracy of 71%. In alignment with other studies13, we found that the model performed better on anterior placentas (Fig. 1).

We have shown that a deep neural network for placenta analysis can perform as well as, and in some respects better than, certain groups of clinicians. As such, it is possible to automate the most time-consuming part of placenta segmentation: the annotation process. This will facilitate future research on placenta-mediated disorders by enabling efficient analysis of large datasets and, in turn, image-based study of placental disease.

This is important because it may help improve the prediction of adverse pregnancy outcomes. In particular, hypertensive pregnancy disorders, compromised fetal growth, and fetal death often stem from placental dysfunction, caused either by insufficient initial placentation or by diminishing placental function with increasing gestational age14,15. However, these conditions, which compromise both maternal and fetal well-being, present without warning, most often too late for preemptive measures. A considerable effort has been put into researching predictors of these conditions, yielding promising but preliminary results. Biomarkers in the maternal bloodstream have been investigated extensively, and associations have been found between certain biomarkers and the development of preeclampsia, impaired fetal growth, and stillbirth16,17,18. Recently, placental microRNAs coding for genes related to angiogenesis, growth, and immunomodulation have been found to be measurable in maternal blood, offering yet another potential source of early detection of placenta-related pathology19,20. However, these findings are limited by small sample sizes and expensive methodology.

Ultrasound can be used to measure Doppler velocimetry in the uterine artery, which is associated with the development of preeclampsia and placental pathology21,22, but reliable morphological markers of impaired placental function are scarce. It is well documented that placental volume is often smaller in hypertensive pregnancies and in cases of growth restriction23,24. Looney et al. have already demonstrated that placental volume can be estimated in the first trimester using deep learning and that a low placental volume by this method predicts low birth weight9. Our results add an interesting perspective to this. In cases of placental insufficiency and preeclampsia, placental tissue often shows histological and macroscopic changes25. These structural changes are not visible to the naked eye on ultrasound, but computers may be able to identify such minuscule changes and predict disorders that manifest much later in pregnancy. Indeed, Gupta et al. found that AI-based image texture analysis of the placenta differed significantly between hypertensive and non-hypertensive mothers26, and in a much larger study, Hu et al.12 achieved an accuracy of 81% when classifying placentas as normal or abnormal, defined by the presence of preeclampsia or fetal growth restriction at birth, with placenta segmentation performed by a CNN.

Our findings add to these existing studies and may lead to a novel approach to early prediction of adverse pregnancy outcomes. CNNs alone or in combination with biomarkers may ensure that preventive measures and increased surveillance can be established in time to avoid adverse outcomes. This would revolutionize prenatal care, and our study paves the way to more studies on AI-based prediction of placenta-related disorders using automated segmentation of routine ultrasound images.

Another important future application of our model lies in low-resource settings where the use of ultrasound is limited due to suboptimal access to equipment and qualified personnel. For example, our model can help medical staff identify patients with placenta previa. If this condition is not recognized prior to the onset of labor, both mother and fetus are likely to die from excessive blood loss, as safe delivery is only possible by caesarean section. Our model may enable screening in such settings and save lives. However, studies are needed to determine how best to apply models such as ours for real-time ultrasound imaging using a variety of ultrasound equipment and users with different levels of ultrasound experience.

Strengths and limitations

Our results are based on a large dataset of 7500 annotated ultrasound images. Images were obtained as part of routine ultrasound examinations and were highly heterogeneous. A particular challenge was that many images were acquired to present an anatomical structure within the fetus, and the placental tissue visualized was often an incidental finding; in other words, most placentas were visible only in a small area of the image. In addition to the inherent complexities of ultrasound images, such as acoustic artifacts, image optimization settings, and differences between mothers and fetuses, the examinations were performed by a number of different sonographers using different ultrasound machines.

However, we chose to include all images in our analysis regardless of image quality. Higher model performance could have been achieved by including more high-quality images dedicated to visualizing placental tissue, but such training data would have limited the generalizability of the model to the 'messy' data of real-life applications. Consequently, the heterogeneity of our dataset also represents a strength.

Moreover, the large variance in image quality also exposed how difficult placenta identification and segmentation were for clinicians, including experts, when presented with images of suboptimal quality. It is a strength that we obtained data on the performance of clinicians with a wide range of clinical experience, which allowed us to benchmark our model more thoroughly. However, the small number of images in both the cross-validation test set and the dataset presented to the clinicians (24 images) represents a limitation.

Finally, a major strength of our study was the use of cross-validation using a dataset from another country. This cross-validation confirmed that our model was able to perform consistently with different patient populations as well as different groups of clinicians.

Conclusion

We developed and applied CNNs for placenta identification, classification, and segmentation. Our model performed slightly worse than experienced clinicians at the classification task but yielded better and more consistent segmentations than all groups of clinicians.

Our findings have implications for the use of ultrasound diagnostics by untrained health care providers and future research in AI-based placental pathology detection, which may be of paramount importance for predicting a wide range of maternal–fetal placenta-related diseases.

Methods

Image annotation

A total of 7500 ultrasound images were used, sampled randomly from obstetric ultrasound examinations recorded in Copenhagen, Denmark, from January 1 to December 31, 2018. The sample size was based on feasibility. The images were taken as part of the Danish screening program, comprising first-trimester aneuploidy screening, second-trimester screening for fetal malformations, and other examinations performed on indication. All ultrasound images are routinely stored locally in the Astraia database (Astraia Software GmbH, Munich, Germany), and we had access to all of these examinations. Images selected for annotation were anonymized and stored on a secure server.

The presence or absence of placenta in each image was recorded. Then, for images containing placenta, the annotation tool Labelme27 was used to outline the placenta and create the ground-truth dataset. This manual annotation was performed by the lead author (LA) for all 7500 ultrasound images.
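To illustrate this step, the sketch below rasterizes the polygon outlines stored in a Labelme JSON file into a binary mask. It relies on Labelme's standard shapes/points schema; the file name in the usage comment is hypothetical.

```python
import json
import numpy as np
from PIL import Image, ImageDraw

def labelme_to_mask(json_path):
    """Rasterize Labelme polygon annotations into a binary placenta mask."""
    with open(json_path) as f:
        ann = json.load(f)

    height, width = ann["imageHeight"], ann["imageWidth"]
    mask = Image.new("L", (width, height), 0)        # single-channel canvas
    draw = ImageDraw.Draw(mask)
    for shape in ann["shapes"]:
        if shape.get("shape_type", "polygon") != "polygon":
            continue                                  # skip non-polygon shapes
        points = [tuple(p) for p in shape["points"]]  # [(x, y), ...]
        draw.polygon(points, outline=1, fill=1)
    return np.asarray(mask, dtype=np.uint8)           # 1 = placenta, 0 = rest

# Hypothetical usage on one annotation file:
# mask = labelme_to_mask("placenta_0001.json")
```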

In addition, in 500 randomly selected images, the placental position was labeled as belonging to either the anterior or posterior uterine wall.

To compare model performance to that of clinical staff, five fetal medicine experts, ten OB-GYN residents, and ten midwives annotated 24 randomly selected ultrasound images representing all trimesters (8/9/7 images from the first/second/third trimester, respectively). Participation was voluntary, and all participants gave written informed consent. Fetal medicine experts were specialists in fetal medicine; OB-GYN residents were employed in an OB-GYN department but held no specialist degree; and midwives were authorized midwives without prior ultrasound experience. There was no subject overlap between this additional test set and the training set.

Ethics

This study was approved by the Danish Data Protection Agency (protocol no. P-2019-310) and the Danish Patient Safety Authority (protocol no. 3-3031-2915/1). Under Danish law, the Danish Patient Safety Authority may grant permission to use data from patient charts for research purposes without consent from the individual patient. This permission was obtained prior to data acquisition.

This study has been conducted in accordance with the relevant guidelines and regulations.

Model architecture and training

The Mask R-CNN model proposed by He et al.28 was used in this study due to its wide use in instance segmentation across many domains. The model was initialized with a backbone pre-trained on the ImageNet database and trained on our data using PyTorch29. During training, the model was supplied with images and corresponding binary masks encoding the location of each region of placental tissue on a per-pixel basis. The masks were extracted from the polygons created during the annotation process. During inference, the model located all regions predicted to contain placental tissue and output a mask for each region. In addition, the model output a confidence score between 0 and 1 for the presence of placental tissue in each predicted region.
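A minimal sketch of this configuration, using torchvision's Mask R-CNN implementation as one plausible realization (not necessarily the authors' exact code); the image size, bounding box, and mask below are placeholder values:

```python
import torch
import torchvision
from torchvision.models import ResNet50_Weights

# Mask R-CNN with a ResNet-50 FPN backbone pre-trained on ImageNet; the
# detection and mask heads are randomly initialized for our two classes.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    weights=None,                                     # no pre-trained heads
    weights_backbone=ResNet50_Weights.IMAGENET1K_V1,  # ImageNet backbone
    num_classes=2,                                    # background + placenta
)

# One training step: the torchvision detection API takes a list of image
# tensors plus a list of target dicts holding boxes, labels, and binary
# per-pixel masks (in our case extracted from the annotation polygons).
images = [torch.rand(3, 600, 800)]                    # placeholder image
masks = torch.zeros(1, 600, 800, dtype=torch.uint8)
masks[0, 150:350, 100:400] = 1                        # placeholder region
targets = [{
    "boxes": torch.tensor([[100.0, 150.0, 400.0, 350.0]]),  # x1, y1, x2, y2
    "labels": torch.tensor([1]),                             # 1 = placenta
    "masks": masks,
}]

model.train()
loss_dict = model(images, targets)   # classification, box, and mask losses
loss = sum(loss_dict.values())
```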

Images and partitions

Of the 7500 images that were screened, 2130 were found to contain placental regions during the annotation process. These were randomly split into a training set, a validation set, and a test set containing 1704, 208, and 218 images, respectively. In addition, 208 and 218 images with no placental regions were added to the validation and test sets to estimate the robustness of the model to false positives. The training/validation/test partitions were created to ensure that no subject's images appeared in more than one partition.
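The mechanics of such a subject-grouped split are not specified in the text; one common approach is scikit-learn's GroupShuffleSplit, sketched here with hypothetical subject identifiers (the split fractions are illustrative, not the exact partition sizes above):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical per-image subject identifiers; images from the same subject
# share an ID and must end up in the same partition.
image_ids = np.arange(2130)
subject_ids = np.random.randint(0, 1500, size=2130)

# First split off roughly 20% of subjects for validation + test ...
outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, holdout_idx = next(outer.split(image_ids, groups=subject_ids))

# ... then split the holdout in half, again grouped by subject.
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
val_rel, test_rel = next(inner.split(holdout_idx, groups=subject_ids[holdout_idx]))
val_idx, test_idx = holdout_idx[val_rel], holdout_idx[test_rel]
```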

Of the 500 images screened and annotated for anterior or posterior placental position, 164 showed placental tissue. These were divided into training, validation, and test partitions containing 116, 24, and 24 images, respectively. The validation and test sets contained equal numbers of anterior and posterior placental regions. No images without placental regions were added in this experiment.

The validation sets were used to find optimal hyperparameters, including the acceptance threshold, the training duration, and the amount of random augmentation to apply.

An additional test was performed on 100 annotated images from a publicly available dataset from Barcelona. These images were acquired using Voluson E6, Voluson S8, and Voluson S10 systems (GE Medical Systems, Zipf, Austria) and Aloka systems (Aloka Co., Ltd.), while the Danish dataset was recorded on Voluson E6, Voluson S8, and Voluson S10 systems (GE Medical Systems, Zipf, Austria).

Random augmentation of training data

The model was trained both with and without random augmentation, which was added to the training data to make the model more robust. Geometry-transforming augmentations such as shearing and rotation were applied to both the images and the corresponding masks, whereas per-pixel augmentations such as brightness changes or color inversions were applied only to the images. The amount of augmentation constitutes a hyperparameter, adjusted by tuning the number of random augmentations applied to each image.

For each image in the training partition, a number of randomly augmented copies were created and added to the set; the non-augmented version was kept in the training set as well.
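The sketch below illustrates the geometric-versus-per-pixel distinction, assuming the image and mask are torchvision tensors; the parameter ranges are illustrative and not the values tuned in the study:

```python
import random
import torch
import torchvision.transforms.functional as TF

def random_augment(image: torch.Tensor, mask: torch.Tensor):
    """Geometric transforms hit image AND mask; per-pixel ones image only."""
    # Sample one set of affine parameters, then apply it identically to
    # both tensors so the mask stays aligned with the image.
    angle = random.uniform(-15, 15)
    shear = random.uniform(-10, 10)
    image = TF.affine(image, angle=angle, translate=(0, 0), scale=1.0, shear=[shear])
    mask = TF.affine(mask, angle=angle, translate=(0, 0), scale=1.0, shear=[shear])

    # Per-pixel augmentations change intensities, not geometry, so the
    # mask is left untouched.
    image = TF.adjust_brightness(image, brightness_factor=random.uniform(0.7, 1.3))
    if random.random() < 0.5:
        image = TF.invert(image)

    return image, mask
```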

Training

The model was trained on the training set for 100 epochs using the Adam optimization algorithm. Model performance was computed on the validation set after each epoch, and the best-performing model was saved as the final output. Training was performed on an Nvidia Quadro RTX 6000 24 GB GPU with a batch size of 5 and a training time of approximately 6 min per epoch.
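A condensed version of this loop, assuming the model sketched above together with hypothetical train_loader, val_loader, and evaluate helpers (the learning rate is likewise an assumption, as it is not reported):

```python
import copy
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is assumed

best_score, best_state = -1.0, None
for epoch in range(100):
    model.train()
    for images, targets in train_loader:       # hypothetical DataLoader
        loss = sum(model(images, targets).values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    score = evaluate(model, val_loader)        # e.g. mean validation IoU
    if score > best_score:                     # keep the best checkpoint
        best_score = score
        best_state = copy.deepcopy(model.state_dict())

torch.save(best_state, "best_model.pt")
```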

Model prediction processing

During inference, the output of the model was processed in two steps:

Confidence

All predicted placental regions with a confidence score below a set threshold (the acceptance threshold = 0.5) were discarded.

Overlapping predictions

If the model predicted multiple placental regions in an image and one region was located entirely inside another, the region with the lower confidence was discarded.
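The sketch below applies both steps to a torchvision-style prediction dict (scores and soft masks); the pixel-subset containment test is our reading of "located entirely inside", not a documented implementation detail:

```python
import torch

def postprocess(pred: dict, threshold: float = 0.5) -> dict:
    """Step 1: drop low-confidence regions. Step 2: drop nested regions."""
    keep = pred["scores"] >= threshold         # acceptance threshold
    scores = pred["scores"][keep]
    masks = pred["masks"][keep] > 0.5          # binarize soft masks

    # torchvision returns detections sorted by descending score, so any
    # region fully contained in an earlier (higher-confidence) region is
    # the lower-confidence one and can be discarded.
    kept = []
    for i in range(len(scores)):
        inside = any(
            torch.logical_and(masks[i], masks[j]).sum() == masks[i].sum()
            for j in kept
        )
        if not inside:
            kept.append(i)

    return {"scores": scores[kept], "masks": masks[kept]}
```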

Evaluation

The model was evaluated on the validation and test sets using two different approaches:

Image classification

The model was evaluated on its ability to detect whether an image contained placental tissue, without considering the correspondence between the predicted placental regions and the ground-truth regions. We approached the task as a binary image classification problem in which the segmentation map generated for a given image, by either the model or an annotator, is classified as empty or non-empty.
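Under this formulation, the image-level label is simply whether a segmentation map contains any placental pixels; a minimal sketch with toy masks standing in for real annotations:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def contains_placenta(segmentation_map: np.ndarray) -> int:
    """Map a per-pixel segmentation map to an image-level binary label."""
    return int(segmentation_map.any())   # 1 = non-empty, 0 = empty

# Toy ground-truth and predicted masks, shape (H, W) each.
gt_masks = [np.zeros((4, 4), dtype=np.uint8) for _ in range(3)]
gt_masks[0][1:3, 1:3] = 1                # only the first image has placenta
pred_masks = [m.copy() for m in gt_masks]

y_true = [contains_placenta(m) for m in gt_masks]
y_pred = [contains_placenta(m) for m in pred_masks]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
```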

In addition to quantifying the model's performance against the ground-truth annotation, we compared the model's accuracy with human performance.

Image segmentation

Next, the model was evaluated on its ability to locate the placental regions accurately. The overlap between the annotation and the prediction was quantified as the Intersection over Union (IoU).
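For binary masks, the IoU reduces to a ratio of pixel counts; a minimal numpy sketch:

```python
import numpy as np

def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over Union between two binary masks of equal shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return float("nan")              # undefined when both masks are empty
    return np.logical_and(pred, truth).sum() / union
```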

Next, we compared the model's performance with that of the three groups of clinicians with different levels of ultrasound experience (5 fetal medicine experts, 10 OB-GYN residents, and 10 midwives) on the 24 annotated test images: we compared the model's IoUs with the ground truth to the annotators' IoUs with the ground truth. For the IoU comparisons, we included only images containing placenta.

All analyses were further split across the three trimesters.

Cross-country generalization

To determine how well the final model performed with data from a different population, we used an open-access dataset5 from Barcelona, Spain, as the final cross-validation test. We selected 100 second-trimester ultrasound images and manually annotated them using the same procedure as described above. Model performance was then compared across datasets.

Reporting

The results are reported in accordance with the TRIPOD guidelines30.