
Effect of dataset size, image quality, and image type on deep learning-based automatic prostate segmentation in 3D ultrasound


Published 29 March 2022 © 2022 The Author(s). Published on behalf of Institute of Physics and Engineering in Medicine by IOP Publishing Ltd
Focus on Machine Learning Models in Medical Imaging. Citation: Nathan Orlando et al 2022 Phys. Med. Biol. 67 074002. DOI: 10.1088/1361-6560/ac5a93


Abstract

Three-dimensional (3D) transrectal ultrasound (TRUS) is utilized in prostate cancer diagnosis and treatment, necessitating time-consuming manual prostate segmentation. We have previously developed an automatic 3D prostate segmentation algorithm involving deep learning prediction on radially sampled 2D images followed by 3D reconstruction, trained on a large, clinically diverse dataset with variable image quality. As large clinical datasets are rare, widespread adoption of automatic segmentation could be facilitated with efficient 2D-based approaches and the development of an image quality grading method. The complete training dataset of 6761 2D images, resliced from 206 3D TRUS volumes acquired using end-fire and side-fire acquisition methods, was split to train two separate networks using either end-fire or side-fire images. Split datasets were reduced to 1000, 500, 250, and 100 2D images. For deep learning prediction, modified U-Net and U-Net++ architectures were implemented and compared using an unseen test dataset of 40 3D TRUS volumes. A 3D TRUS image quality grading scale with three factors (acquisition quality, artifact severity, and boundary visibility) was developed to assess the impact on segmentation performance. For the complete training dataset, U-Net and U-Net++ networks demonstrated equivalent performance, but when trained using split end-fire/side-fire datasets, U-Net++ significantly outperformed the U-Net. Compared to the complete training datasets, U-Net++ trained using reduced-size end-fire and side-fire datasets demonstrated equivalent performance down to 500 training images. For this dataset, image quality had no impact on segmentation performance for end-fire images but did have a significant effect for side-fire images, with boundary visibility having the largest impact. Our algorithm provided fast (<1.5 s) and accurate 3D segmentations across clinically diverse images, demonstrating generalizability and efficiency when employed on smaller datasets, supporting the potential for widespread use, even when data is scarce. The development of an image quality grading scale provides a quantitative tool for assessing segmentation performance.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Prostate biopsy is the current clinical standard for prostate cancer (PCa) diagnosis, but the conventional two-dimensional (2D) transrectal ultrasound (TRUS)-guided biopsy has been reported to have a false negative rate up to 30% (Leite et al 2009). Improved tumour sampling can be achieved with the addition of three-dimensional (3D) TRUS imaging using a magnetic resonance imaging (MRI)-3D TRUS fusion guided biopsy approach, which utilizes the superior soft-tissue contrast of MRI to identify suspicious tissue regions for targeting with real-time TRUS guidance (Cool et al 2016). For PCa treatment, high-dose-rate (HDR) brachytherapy is a common treatment modality for intermediate and high-risk localized PCa (Yamada et al 2012, Yoshioka et al 2016). 3D TRUS imaging provides spatial context through visualization of the anatomy in 3D, in addition to improving intraoperative needle tip identification and patient-specific dose optimization (Hrinivich et al 2016). While 3D TRUS imaging offers many benefits, it also necessitates accurate segmentation of the prostate to perform the biopsy and brachytherapy clinical tasks. Segmentations are often completed manually by the physician during the procedure, which can be time-consuming and highly variable, extending procedure times and increasing risk due to anesthesia exposure (Reich et al 2005).

Multiparametric MRI is quickly becoming a standard of care imaging modality for both biopsy and brachytherapy procedures based on the results of recent randomized controlled clinical trials (Kasivisvanathan et al 2018, Alayed et al 2019, Eklund et al 2021). While MRI offers high soft-tissue contrast, limitations, including high cost, restrict widespread adoption. For MRI-guided brachytherapy, patient movement to the MRI scanner after needle implant may cause needle shifts (Holly et al 2011), while in-bore procedures require highly specialized magnet-safe tools. In hospitals where MRI is utilized for MRI-3D TRUS fusion-guided biopsy or brachytherapy, registration between the modalities is required, often utilizing a surface-based approach, which requires accurate segmentation of the prostate in both image modalities. Thus, even with an MRI-based approach, accurate and fast prostate segmentation in 3D TRUS images is critical.

With the increasing prevalence of deep learning in medicine (Piccialli et al 2021), and specifically of convolutional neural networks (CNNs) for medical imaging tasks, many deep learning-based automatic prostate segmentation approaches have been proposed for TRUS imaging, promising reduced procedure time and similar performance compared to manual approaches (Anas et al 2018, Ghavami et al 2018, Karimi et al 2019, Lei et al 2019, Wang et al 2019). Recently, Lei et al proposed an anchor-free mask CNN for multi-organ segmentation in 3D TRUS volumes, trained using data from 83 PCa patients with five-fold cross-validation (Lei et al 2021). They reported prostate segmentation accuracy with a Dice similarity coefficient (DSC) of 0.93 ± 0.03 and 95% Hausdorff distance (HD95) of 2.27 ± 0.79 mm. van Sloun et al proposed a U-Net based approach for zonal prostate segmentation trained on a large multi-institutional dataset of 436 3D TRUS volumes from 181 men, reporting a median (95% confidence interval) accuracy of 98 (95–99)%, Jaccard index of 0.93 (0.80–0.96), and Hausdorff distance (HD) of 3.0 (1.3–8.7) mm (van Sloun et al 2021).

We previously proposed an automatic segmentation algorithm involving deep learning prediction with a modified U-Net on 2D TRUS images radially sliced from 3D TRUS volumes followed by reconstruction into a 3D surface (Orlando et al 2020). The algorithm was trained on a large dataset with nearly 7000 2D images resliced from 206 clinically variable 3D TRUS images from various acquisition methods, procedure types (biopsy and HDR brachytherapy), ultrasound machines, and ultrasound transducers. On a testing set of 40 unseen 3D TRUS volumes from different acquisition methods, we demonstrated high performance with a median [quartile 1, quartile 3] DSC, mean surface distance (MSD), and HD of 94.1 [92.6, 94.9]%, 0.89 [0.73, 1.09] mm, and 2.89 [2.37, 4.35] mm, respectively. This algorithm outperformed a fully 3D V-Net and state-of-the-art methods in the literature (Orlando et al 2020).

However, large and diverse medical image datasets are rare, especially for ultrasound, with recent papers reporting utilization of datasets with 2238 2D TRUS images (Anas et al 2018), and 40 (Wang et al 2019), 44 (Lei et al 2019), 86 (Lei et al 2021), and 109 (Ghavami et al 2018) 3D TRUS volumes. Generation of large clinical datasets is time-consuming and costly, so generalizable and accurate automatic segmentation approaches suitable for small datasets are critical for the widespread integration of deep learning in minimally invasive PCa procedures. The U-Net++ architecture, an evolution of the standard U-Net, has recently been proposed and may help accomplish this goal, introducing multiple CNN backbones as well as nested, dense skip connections (Zhou et al 2018, 2020). These redesigned skip connections attempt to reduce the semantic gap between feature maps in the encoder and decoder sections of the network, resulting in an easier optimization problem and thus higher performance with small training datasets.

Image quality is highly variable between 3D TRUS volumes, with differences arising from the acquisition method, image acquisition artifacts, patient anatomy artifacts (such as gas, calcifications, and catheters), and prostate boundary visibility. These factors may influence prostate segmentation results, so a 3D TRUS image quality grading scale is required to aid in comparing results from different datasets and to identify key image quality factors that influence segmentation performance.

Our work aims to demonstrate that our 2D radial deep learning plus 3D reconstruction approach offers efficient utilization of training data and thus high segmentation performance when trained with smaller datasets and datasets split based on 3D TRUS acquisition type (end-fire and side-fire). To assess the impact of image quality on segmentation performance, we propose an image quality grading scale containing three distinct image quality factors. By rigorously evaluating our deep learning segmentation approach in the context of image quality, training dataset size, and 3D TRUS acquisition type (end-fire and side-fire), we intend to provide a widely accessible, robust, generalizable, and efficient prostate segmentation algorithm. This approach may allow for reduced clinical procedure time and increased efficiency for minimally invasive PCa procedures, allowing for optimization to a clinic's local preferences, without requiring access to large ultrasound image datasets.

2. Materials and methods

2.1. Complete ultrasound dataset

The complete ultrasound dataset consisted of 246 3D TRUS volumes of the prostate (Fenster et al 2011). This dataset contained 104 end-fire 3D TRUS volumes, obtained from clinical prostate biopsy procedures, and 142 side-fire 3D TRUS volumes, obtained from clinical prostate brachytherapy procedures. Patient clinical information such as age, stage of prostate cancer, and Gleason score were not recorded. The methods to acquire 3D TRUS volumes have been described previously, but are briefly summarized here (Tong et al 1996, Bax et al 2008, 2011). To generate these images, a TRUS transducer was mechanically rotated using a motorized fixture about its long axis. 2D TRUS images were acquired at set angular intervals, which were then reconstructed to generate 3D TRUS volumes. The choice of TRUS transducer leads to geometrically variable images: the end-fire transducer used for prostate biopsy was rotated 180° while 2D TRUS images were acquired at 1.0° intervals and reconstructed into a 3D volume; the side-fire transducer used for prostate brachytherapy was rotated 140° while 2D TRUS images were acquired at 0.5° intervals and reconstructed into a fan-shaped 3D TRUS volume.

These 3D TRUS volumes were acquired with three transducers used with three different ultrasound systems of different ages from two manufacturers. Specifically, an 8848 transducer was used with the Profocus 2202 ultrasound system (BK Medical, Peabody, MA, USA), C9-5 and BPTRT9-5 transducers were used with the ATL HDI-5000 ultrasound system (Philips, Amsterdam, the Netherlands), and a C9-5 transducer was used with the iU22 ultrasound system (Philips, Amsterdam, the Netherlands). 3D TRUS image sizes ranged from [448 × 350 × 448] to [692 × 520 × 692] voxels with dimensions of [0.115 × 0.115 × 0.115] to [0.190 × 0.195 × 0.190] mm³/voxel for end-fire images and from [300 × 400 × 784] to [408 × 441 × 870] voxels with dimensions of [0.094 × 0.154 × 0.154] to [0.183 × 0.186 × 0.186] mm³/voxel for side-fire images. Manual prostate segmentations in the 3D TRUS volumes, excluding the seminal vesicles, were completed by an observer experienced with 3D TRUS imaging (IG). 20 end-fire 3D TRUS volumes and 20 side-fire 3D TRUS volumes were randomly selected from the complete dataset and reserved as a testing dataset, and thus were not included during training.

As outlined in Orlando and Gillies et al, the complete training dataset of 206 3D TRUS volumes was resliced at randomized axial, sagittal, coronal, radial, and oblique image planes, resulting in a final training dataset of 6761 2D TRUS images with matched manual segmentations (Orlando et al 2020). This reslicing allowed for more efficient use of the TRUS data, demonstrating improved performance compared to a fully 3D V-Net approach (Milletari et al 2016, Orlando et al 2020). 2D images were resampled to 256 × 256 pixels with no other applied preprocessing. The complete training dataset of 2D TRUS images was split for deep learning, with 80% (5409 images) used for training and 20% (1352 images) used for validation.
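The preprocessing and data split described above can be summarized with a short sketch. The snippet below is illustrative only: it assumes the resliced 2D images and matched masks are already available as NumPy arrays, and the resize settings, variable names, and random seed are assumptions rather than the authors' exact implementation.

```python
# Illustrative preprocessing: resample 2D slices to 256 x 256 and apply an 80/20 split.
# Assumes images/masks are lists of 2D NumPy arrays resliced from the 3D TRUS volumes.
import numpy as np
from skimage.transform import resize
from sklearn.model_selection import train_test_split

def preprocess_slices(slices_2d, is_mask=False):
    """Resample each 2D slice to 256 x 256 pixels; masks use nearest-neighbour interpolation."""
    order = 0 if is_mask else 1
    resized = [resize(s, (256, 256), order=order, preserve_range=True,
                      anti_aliasing=not is_mask) for s in slices_2d]
    return np.stack(resized)[..., np.newaxis]  # add a channel axis for Keras

# x = preprocess_slices(images)
# y = preprocess_slices(masks, is_mask=True)
# x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=0)
```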

2.2. Reduced-size datasets

To evaluate our method's efficiency in utilizing the training data, we generated smaller datasets by splitting and reducing the complete dataset of 6761 2D TRUS images. In all smaller datasets, we maintained the 80/20 training/validation split for deep learning.

2.2.1. Split end-fire and side-fire datasets

We first split the complete dataset into an end-fire training dataset of 2738 2D TRUS images and a side-fire training dataset of 4023 2D TRUS images (table 1). This allowed for an assessment of generalizability by training two sets of parameters and testing on the opposite 3D TRUS acquisition type, which was unseen during training.

Table 1. Breakdown of the clinical 3D TRUS training dataset of 206 volumes resliced into 2D images for training 2D neural networks. Rows from top to bottom show the complete dataset, the split side-fire and end-fire datasets, and the reduced-size side-fire, end-fire, and mixed datasets.

Complete dataset: 6761 2D TRUS images
Split datasets [images]: 4023 Side-fire | 2738 End-fire | 4023 Mixed | 2738 Mixed
Smaller datasets [images]:
1000 Side-fire | 1000 End-fire | 1000 Mixed
500 Side-fire | 500 End-fire | 500 Mixed
250 Side-fire | 250 End-fire | 250 Mixed
100 Side-fire | 100 End-fire | 100 Mixed

2.2.2. Smaller end-fire, side-fire, and mixed datasets

Smaller datasets were generated by reducing the split end-fire and side-fire datasets to assess how segmentation performance depends on the size of the dataset used for training, with the aim of finding the smallest dataset that still maintains high segmentation performance. Using the split end-fire and side-fire datasets, images were removed at random to create training datasets with 1000, 500, 250, and 100 2D TRUS images of each acquisition type (table 1). These smaller datasets were generated by reslicing from 36, 18, 9, and 4 3D TRUS volumes, respectively. Thus, variation in image quality and anatomical features, as determined by the 3D TRUS volumes, was similarly reduced. This resulted in eight reduced-size datasets (four end-fire and four side-fire).

Similarly, smaller mixed datasets were generated by reducing the complete 2D TRUS dataset. Images were removed at random to create training datasets with 4023, 2738, 1000, 500, 250, and 100 mixed 2D TRUS images (table 1), resliced from 119, 86, 36, 18, 9, and 4 3D TRUS volumes, respectively. The segmentation performance of a network trained using 4023 mixed acquisition images was compared to a network trained using 4023 side-fire images; similarly, the segmentation performance using a training dataset of 2738 mixed images and 2738 end-fire images were compared. In all reduced-size mixed datasets, the ratio between end-fire and side-fire images matched the complete dataset, with 40.5% end-fire images and 59.5% side-fire images. This resulted in six reduced-size mixed datasets.
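As a concrete illustration of the reduction procedure for the mixed datasets, the sketch below randomly subsamples end-fire and side-fire image pools while preserving the complete dataset's 40.5%/59.5% end-fire/side-fire ratio. The function and variable names, and the fixed seed, are hypothetical; the authors' sampling code is not described beyond random removal of images.

```python
# Illustrative reduction of a mixed dataset that preserves the end-fire/side-fire ratio
# of the complete dataset (~40.5% end-fire, 59.5% side-fire). Names and seed are assumptions.
import numpy as np

def reduce_mixed(end_fire_imgs, side_fire_imgs, target_size, end_fire_frac=0.405, seed=0):
    rng = np.random.default_rng(seed)
    n_end = round(target_size * end_fire_frac)
    n_side = target_size - n_end
    keep_end = rng.choice(len(end_fire_imgs), size=n_end, replace=False)
    keep_side = rng.choice(len(side_fire_imgs), size=n_side, replace=False)
    return ([end_fire_imgs[i] for i in keep_end],
            [side_fire_imgs[i] for i in keep_side])

# e.g. end_subset, side_subset = reduce_mixed(end_fire_2d, side_fire_2d, target_size=1000)
```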

2.3. Image quality assessment

3D TRUS image quality varies across patients and 3D acquisition methods, and so it is expected to impact segmentation performance. To explore this effect, an experienced interventional and genitourinary radiologist (DC) developed a 3D TRUS image quality grading scale, provided in table 2. To ensure the scale was not biased and was generalizable, it was developed before the physician viewed our 3D TRUS dataset. Image quality was graded using three factors: acquisition quality, artifact severity, and prostate boundary visibility. Acquisition quality rated the quality of the 3D TRUS acquisition itself, ignoring anatomy artifacts and boundary visibility, ranging from 1 (poor) to 5 (ideal). Examples of poor acquisition quality included image shadowing due to inadequate transducer contact, transducer translation during 3D TRUS acquisition causing anatomy distortion, and issues with ultrasound gain or depth. Artifact severity estimated the degree of image degradation caused by artifact-generating items within the prostate gland, such as calcifications, gas, urinary catheters, and brachytherapy seeds, ranging from 1 (major artifacts) to 5 (no artifacts at all). Prostate boundary visibility graded the clarity of the prostate boundary with the adjacent periprostatic soft tissue, a key factor in the prostate segmentation task, ranging from 1 (more than 75% of the boundary is indistinguishable) through 3 (40% of the boundary is indistinguishable) to 5 (the entire boundary is clearly visible). The test dataset of 20 end-fire and 20 side-fire 3D TRUS volumes was graded by the same radiologist, who was blinded to the qualitative and quantitative segmentation performance. Only the test dataset was graded; as the test dataset was randomly selected from the complete dataset, its image quality distribution was representative of the complete dataset. Five-point numerical grading allowed for a quantitative comparison between end-fire and side-fire 3D TRUS volumes, including the calculation of means and statistical testing.

Table 2. Image quality grading scale for 3D TRUS images of the prostate.

Image quality factor | Description | Scale
Acquisition quality | Quality of the 3D TRUS image acquisition regardless of anatomy | 1 (poor)–5 (ideal)
Anatomy artifacts | Severity of anatomy artifacts (calcification, gas, catheter, etc) | 1 (major artifacts)–5 (no artifacts)
Prostate boundary | Visibility/clarity of the prostate boundary | 1 (>75% of boundary indistinguishable)–3 (40% of boundary indistinguishable)–5 (entire boundary visible)

2.4. 3D segmentation algorithm

Our radial prostate segmentation algorithm was first described in Orlando and Gillies et al and will be briefly summarized here (figure 1). This method utilized a radial segmentation approach, first proposed by Qiu et al for a prostate segmentation algorithm based on convex optimization with shape priors (Qiu et al 2015). In this approach, a 3D TRUS volume is resliced radially about the approximate center of the prostate gland at 15° intervals, generating 12 2D TRUS images. The extracted 2D TRUS images appear very similar, as each plane passes through the mid-gland of the prostate, resulting in similar prostate size and shape regardless of the 3D TRUS acquisition method. This radial approach has been shown to improve segmentation performance in the apex and base of the prostate compared to alternative approaches such as transverse reslicing (Qiu et al 2015). The 12 radial 2D TRUS images were automatically segmented using neural networks trained with the 2D datasets described in sections 2.1 and 2.2 to generate 12 segmented prostate boundaries, which were used to reconstruct the 3D surface of the prostate (figure 1).
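To make the reslicing step concrete, the following sketch extracts 2D planes at 15° intervals about an axis through the approximate prostate centre using linear (trilinear) interpolation. It assumes an isotropic NumPy volume and a rotation axis aligned with the volume's first dimension; the paper's exact reslicing geometry and centre-selection step are not reproduced here, so treat this as an illustration of the idea rather than the published code.

```python
# Sketch of radial reslicing: 12 planes at 15-degree spacing through the gland centre.
import numpy as np
from scipy.ndimage import map_coordinates

def radial_reslice(volume, n_planes=12, angle_step_deg=15.0):
    """Return an array of shape (n_planes, nz, nr) of radial 2D slices."""
    nz, ny, nx = volume.shape
    cy, cx = (ny - 1) / 2.0, (nx - 1) / 2.0          # assumed centre of rotation
    nr = max(ny, nx)
    r = np.linspace(-nr / 2.0, nr / 2.0, nr)          # in-plane radial coordinate
    z = np.arange(nz)                                 # coordinate along the rotation axis
    zz, rr = np.meshgrid(z, r, indexing="ij")
    slices = []
    for k in range(n_planes):
        theta = np.deg2rad(k * angle_step_deg)
        yy = cy + rr * np.sin(theta)
        xx = cx + rr * np.cos(theta)
        coords = np.stack([zz, yy, xx])               # (3, nz, nr) sampling grid
        slices.append(map_coordinates(volume, coords, order=1, mode="constant"))
    return np.stack(slices)
```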

Figure 1. 3D prostate segmentation workflow using an example end-fire 3D TRUS volume. The input 3D TRUS volume was resliced radially at 15° spacing to generate 12 2D TRUS images with similar size and shape. A trained 2D neural network was used to predict the prostate boundary locations in 2D binary masks, which were used to reconstruct the 3D prostate surface.


2.5. 2D neural networks

Two neural network architectures were used in this work, which were trained with identical 2D TRUS datasets (See sections 2.1 and 2.2). Detailed network diagrams are provided in figures A1 and A2 in appendix A for the modified U-Net and U-Net++, respectively. Data augmentation using random combinations of horizontal flips, shifts up to 20%, rotations up to 20°, and zooms up to 20% were applied to double the training datasets. A personal computer with an i7-9700 K central processing unit (CPU) at 3.60 GHz (Intel Corporation, Santa Clara, CA, USA), 64 GB of RAM, and a 24 GB NVIDIA TITAN RTX (NVIDIA Corporation, Santa Clara, CA, USA) graphics processing units (GPU) was used for training all 2D neural networks and for subsequent prediction on unseen testing data.
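One plausible Keras implementation of this augmentation, shown below, pairs two ImageDataGenerator streams with a shared seed so that images and masks receive identical transforms. The parameter values mirror the ranges listed above, but the generator-based setup itself is an assumption, not necessarily the authors' code.

```python
# Paired image/mask augmentation with identical random seeds (a common Keras pattern).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug_args = dict(horizontal_flip=True,     # random horizontal flips
                width_shift_range=0.2,    # shifts up to 20%
                height_shift_range=0.2,
                rotation_range=20,        # rotations up to 20 degrees
                zoom_range=0.2)           # zooms up to 20%

image_datagen = ImageDataGenerator(**aug_args)
mask_datagen = ImageDataGenerator(**aug_args)

# x_train, y_train: arrays of shape (N, 256, 256, 1)
# image_flow = image_datagen.flow(x_train, batch_size=10, seed=1)
# mask_flow = mask_datagen.flow(y_train, batch_size=10, seed=1)
# train_generator = zip(image_flow, mask_flow)  # yields (image_batch, mask_batch) pairs
```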

2.5.1. Modified U-net

A five-layer deep modified version of the widely prevalent U-Net (Ronneberger et al 2015) was implemented using Keras with TensorFlow (Abadi et al 2016). First, 50% dropouts were applied at the last block on the contracting section of the network and at every block on the expansion section of the network to increase regularization and prevent overfitting (Orlando et al 2020). In addition, transpose convolutions were applied in the expansion section of the network instead of the standard upsampling followed by convolution (upconvolution), as this allowed for improved performance (Orlando et al 2020). Padding and ReLU activation were applied in each (3 × 3) convolution operation, with sigmoid activation used in the final (1 × 1) convolution operation. Additional hyperparameter selection based on preliminary experiments included the use of an Adam optimizer, a learning rate of 0.0001, a Dice-coefficient loss function, 100 epochs, and 200 steps per epoch.
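A compact sketch of this architecture is given below: five resolution levels, 50% dropout on the deepest contracting block and on every expansion block, transpose convolutions for upsampling, and a Dice-coefficient loss with the Adam optimizer at a 0.0001 learning rate. Filter counts and the exact placement of dropout within each block are assumptions, so this should be read as an approximation of the published network rather than its definitive implementation.

```python
# Approximate sketch of the modified U-Net (filter counts are assumed, not reported).
from tensorflow.keras import layers, models, optimizers
import tensorflow.keras.backend as K

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def dice_loss(y_true, y_pred, smooth=1.0):
    y_true_f, y_pred_f = K.flatten(y_true), K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    dice = (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)
    return 1.0 - dice

def modified_unet(input_shape=(256, 256, 1), base_filters=32):
    inputs = layers.Input(input_shape)
    skips, x = [], inputs
    # Contracting path: four pooled levels plus a bottleneck (five levels total)
    for i in range(4):
        x = conv_block(x, base_filters * 2 ** i)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.5)(conv_block(x, base_filters * 16))  # dropout on deepest block
    # Expansion path: transpose convolutions, skip concatenation, 50% dropout per block
    for i in reversed(range(4)):
        x = layers.Conv2DTranspose(base_filters * 2 ** i, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skips[i]])
        x = layers.Dropout(0.5)(conv_block(x, base_filters * 2 ** i))
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-4), loss=dice_loss)
    return model

# model = modified_unet()
# model.fit(train_generator, epochs=100, steps_per_epoch=200)  # hyperparameters as reported
```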

2.5.2. U-Net++

A state-of-the-art U-Net++ architecture (Zhou et al 2018, 2020) was also implemented using Keras with TensorFlow (Abadi et al 2016). We used a standard ResNet-50 architecture (He et al 2016) with batch normalization and a batch size of 10 as our CNN backbone, as it balanced the number of parameters and overfitting risk for the scale of our training datasets. As described in section 2.5.1, the convolution operations and hyperparameters matched the modified U-Net implementation, including the use of transpose convolutions, Adam optimizer, 0.0001 learning rate, Dice-coefficient loss function, and number of epochs.
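For readers unfamiliar with the U-Net++ topology, the sketch below illustrates the nested, dense skip connections of Zhou et al using plain convolutional blocks in place of the ResNet-50 backbone and simple upsampling in place of transpose convolutions, purely to keep the example short. It is a structural illustration under those simplifying assumptions, not the implementation used in this work.

```python
# Structural sketch of U-Net++ nested, dense skip connections (node notation X[i, j]).
from tensorflow.keras import layers, models

def conv_block(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def up(x):
    return layers.UpSampling2D(2)(x)  # transpose convolutions could be substituted here

def unet_plus_plus(input_shape=(256, 256, 1), base_filters=32, depth=4):
    inputs = layers.Input(input_shape)
    X = {}                                       # X[(i, j)]: level i, skip-path column j
    x = inputs
    for i in range(depth):                       # encoder / backbone column, j = 0
        x = conv_block(x, base_filters * 2 ** i)
        X[(i, 0)] = x
        if i < depth - 1:
            x = layers.MaxPooling2D(2)(x)
    for j in range(1, depth):                    # nested decoder columns, j >= 1
        for i in range(depth - j):
            dense_inputs = [X[(i, k)] for k in range(j)] + [up(X[(i + 1, j - 1)])]
            X[(i, j)] = conv_block(layers.concatenate(dense_inputs),
                                   base_filters * 2 ** i)
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(X[(0, depth - 1)])
    return models.Model(inputs, outputs)
```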

2.6. Evaluation and comparison

All trained models were evaluated using a testing dataset consisting of 20 end-fire and 20 side-fire 3D TRUS volumes unseen by the networks during training. The evaluation metrics included the Dice similarity coefficient (DSC), recall, precision, absolute volume percent difference (VPD), mean surface distance (MSD), and Hausdorff distance (HD), computed for both the 2D radial slice and reconstructed 3D segmentations for each prostate. Computation times were recorded for 2D slice segmentation, 3D reconstruction, and overall 3D segmentation. We have previously demonstrated significantly improved performance with a 2D radial deep learning plus 3D reconstruction approach compared to fully 3D CNNs; consequently, no 3D CNNs were used for comparison in this work. A detailed list of comparisons and corresponding statistical tests is provided in table 3.
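The overlap and volume-based metrics can be computed directly from binary masks, as in the hedged sketch below. The Hausdorff distance is shown using SciPy's directed_hausdorff on boundary voxel coordinates scaled by the voxel spacing, which is one plausible implementation rather than the authors' exact code; the MSD computation is omitted for brevity.

```python
# Illustrative metric computation for binary segmentation masks.
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def overlap_metrics(pred, truth):
    """DSC, recall, precision, and absolute volume percent difference (VPD)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    dsc = 2.0 * tp / (pred.sum() + truth.sum())
    recall = tp / truth.sum()
    precision = tp / pred.sum()
    vpd = abs(int(pred.sum()) - int(truth.sum())) / truth.sum() * 100.0
    return dsc, recall, precision, vpd

def hausdorff_mm(pred_boundary_pts, truth_boundary_pts, spacing_mm):
    """Symmetric Hausdorff distance; boundary points are (N, 3) voxel index arrays."""
    a = pred_boundary_pts * spacing_mm
    b = truth_boundary_pts * spacing_mm
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])
```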

Table 3. List of comparisons and corresponding statistical tests. Parametric tests are shown with corresponding nonparametric alternative tests presented in parentheses.

Comparison | Statistical test
Complete dataset (6761 2D images)
Modified U-Net versus U-Net++ with complete training dataset | Paired t-test (Wilcoxon matched-pairs signed-rank test)
Segmentation performance versus prostate volume | Pearson (Spearman) correlation coefficient
Split end-fire and side-fire datasets (table 1)
End-fire training dataset: U-Net versus U-Net++ tested on both end-fire and side-fire images | Paired t-test (Wilcoxon matched-pairs signed-rank test)
Side-fire training dataset: U-Net versus U-Net++ tested on both side-fire and end-fire images | Paired t-test (Wilcoxon matched-pairs signed-rank test)
Split end-fire or side-fire network versus complete training dataset | Paired t-test (Wilcoxon matched-pairs signed-rank test)
Smaller end-fire, side-fire, and mixed datasets (section 2.2.2) a
Performance with reduced-size datasets versus complete dataset | Repeated measures one-way ANOVA with Tukey's correction (Friedman test with Dunn's correction)
Mixed training dataset versus end-fire or side-fire dataset of equal size | Paired t-test (Wilcoxon matched-pairs signed-rank test)
Image quality (section 2.3)
End-fire versus side-fire image quality grades for each factor | Unpaired t-test (Mann-Whitney U test)
Segmentation performance for U-Net and U-Net++ versus image quality grade in each factor and overall | One-way ANOVA with Tukey's correction (Kruskal-Wallis test with Dunn's correction)
Correlation between segmentation performance and image quality | Pearson (Spearman) correlation coefficient

a Due to superior performance of the U-Net++ when trained using split datasets, only the U-Net++ was used for reduced-size dataset experiments.

Statistical calculations were performed in GraphPad Prism 9.2 (Graphpad Software, Inc., San Diego, CA, USA). The Shapiro–Wilk test was used to evaluate the normality of distributions. Failure of the Shapiro–Wilk test led to the use of nonparametric statistical tests and the reporting of median [quartile 1, quartile 3] results. The significance level for statistical analysis was chosen such that the probability of making a type I error was less than 5% (p < 0.05), with statistically significant differences denoted simply as 'significant' for the remainder of this manuscript.
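Although the analysis was performed in GraphPad Prism, the normality-gated choice between a parametric and nonparametric paired test can be mirrored in Python, as in the sketch below; the helper name and return format are illustrative.

```python
# Choose paired t-test or Wilcoxon signed-rank test based on Shapiro-Wilk normality.
from scipy import stats

def compare_paired(metric_a, metric_b, alpha=0.05):
    diffs = [a - b for a, b in zip(metric_a, metric_b)]
    _, shapiro_p = stats.shapiro(diffs)
    if shapiro_p > alpha:                               # differences look normally distributed
        _, p = stats.ttest_rel(metric_a, metric_b)      # paired t-test
        return "paired t-test", p
    _, p = stats.wilcoxon(metric_a, metric_b)           # matched-pairs signed-rank test
    return "Wilcoxon signed-rank", p
```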

3. Results

3.1. Complete dataset

Example U-Net, U-Net++, and manual segmentations for median end-fire and side-fire cases are shown in figure 2. The evaluation metric results comparing the modified U-Net to the U-Net++ when trained using the full dataset of 6761 images are shown in table 4. No significant differences were observed between the U-Net and U-Net++ for any metric for the full testing dataset. When considering the side-fire and end-fire test datasets separately, no significant differences were observed for the end-fire testing images, while only the precision and recall metrics were significantly different for side-fire testing images, with the U-Net demonstrating higher precision and the U-Net++ demonstrating higher recall. The mean computation time per 2D segmentation was 0.028 s for the modified U-Net and 0.088 s for the U-Net++. The mean 3D reconstruction time was 0.27 s, resulting in a total 3D segmentation time of 0.61 s for the modified U-Net and 1.33 s for the U-Net++.

Figure 2. Example end-fire (top row) and side-fire (bottom row) prostate segmentation results comparing manual (red), modified U-Net (blue), and U-Net++ (yellow) 3D surfaces for median cases based on DSC. The columns from left to right show the prostate surface in the axial plane, sagittal plane, and an oblique radial plane, respectively.


Table 4. Median [Q1, Q3] 3D results showing the modified U-Net and U-Net++ trained using the complete dataset and tested on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS volumes.

Test dataset | Segmentation | DSC (%) | Recall (%) | Precision (%) | VPD (%) | MSD (mm) | HD (mm)
End-fire | U-Net | 94.3 [93.1, 95.2] | 96.0 [93.2, 98.7] | 94.6 [88.8, 95.8] | 5.18 [1.62, 11.2] | 1.01 [0.80, 1.21] | 3.44 [2.50, 4.65]
End-fire | U-Net++ | 94.5 [92.8, 95.5] | 96.0 [93.3, 98.1] | 94.1 [90.4, 96.5] | 3.54 [1.80, 9.28] | 0.93 [0.77, 1.20] | 3.39 [2.81, 4.97]
End-fire | p-value | 0.2904 a | 0.9932 a | 0.4749 | 0.0696 | 0.4284 a | 0.8695
Side-fire | U-Net | 93.5 [91.1, 94.6] | 96.2 [92.5, 98.4] | 91.6 [87.8, 94.8] | 5.89 [3.17, 11.9] | 0.86 [0.71, 0.97] | 2.73 [2.42, 4.28]
Side-fire | U-Net++ | 93.3 [91.7, 95.0] | 97.7 [95.5, 98.4] | 89.8 [86.0, 94.1] | 6.27 [3.89, 12.8] | 0.84 [0.69, 1.03] | 3.19 [2.69, 4.33]
Side-fire | p-value | 0.9563 | 0.0153 | 0.0495 a | 0.3884 | 0.7562 | 0.1536
Overall | U-Net | 94.1 [92.6, 94.9] | 96.0 [93.1, 98.5] | 93.2 [88.8, 95.4] | 5.78 [2.49, 11.5] | 0.89 [0.77, 1.10] | 3.14 [2.49, 4.47]
Overall | U-Net++ | 94.0 [92.2, 95.1] | 96.7 [94.1, 98.3] | 92.9 [87.1, 95.1] | 4.79 [2.93, 10.9] | 0.90 [0.73, 1.14] | 3.27 [2.71, 4.33]
Overall | p-value | 0.6179 | 0.0641 | 0.2265 | 0.6656 | 0.5657 a | 0.2317

DSC, Dice similarity coefficient; VPD, volume percent difference; MSD, mean surface dist.; HD, Hausdorff dist. a Normal distribution = paired t-test; Bolded metrics highlight the method with significantly reduced relative error.

Of note, a comparison of segmentation performance relative to prostate volume for the U-Net and U-Net++ demonstrated significant correlations between prostate size and the DSC and VPD metrics. The DSC metric showed a Spearman r coefficient of 0.58 and 0.61 for the U-Net and U-Net++, respectively, while the VPD metric showed a Spearman r coefficient of −0.44 and −0.51 for the U-Net and U-Net++, respectively.

3.2. Split end-fire and side-fire datasets

Qualitative segmentation results comparing the modified U-Net and U-Net++ to manual segmentations for networks trained with only end-fire and only side-fire images are shown in figures 3 and 4, respectively, and the corresponding quantitative comparisons are shown in tables 5 and 6. Plots showing DSC for the modified U-Net and U-Net++ trained using only end-fire and only side-fire datasets are shown in figure 5. For both the end-fire and side-fire networks evaluated on the complete testing dataset, which included images from both acquisition methods, the U-Net++ significantly outperformed the modified U-Net for all metrics except VPD for the side-fire networks. When evaluated on the end-fire and side-fire testing datasets separately, the U-Net++ also significantly outperformed the U-Net for all metrics aside from VPD when tested on the same image type it was trained on.

Figure 3. Example end-fire (top row) and side-fire (bottom row) median DSC prostate segmentation results comparing manual (red), modified U-Net (blue), and U-Net++ (yellow) 3D surfaces for networks trained only using end-fire images. The columns from left to right show the prostate surface in the axial plane, sagittal plane, and an oblique radial plane, respectively.


Figure 4. Example side-fire (top row) and end-fire (bottom row) median DSC prostate segmentation results comparing manual (red), modified U-Net (blue), and U-Net++ (yellow) 3D surfaces for networks trained only using side-fire images. The columns from left to right show the prostate surface in the axial plane, sagittal plane, and an oblique radial plane, respectively.


Table 5. Median [Q1, Q3] 3D results showing the modified U-Net and U-Net++ trained using only end-fire images and tested on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS volumes.

Test dataset | Segmentation | DSC (%) | Recall (%) | Precision (%) | VPD (%) | MSD (mm) | HD (mm)
End-fire | U-Net | 86.8 [82.6, 89.9] | 89.3 [82.7, 90.8] | 84.5 [80.5, 89.4] | 7.80 [2.30, 12.2] | 1.96 [1.63, 2.39] | 6.43 [5.25, 9.11]
End-fire | U-Net++ | 94.2 [92.4, 95.0] | 95.8 [92.1, 98.3] | 93.0 [89.5, 96.2] | 5.31 [3.56, 8.48] | 1.00 [0.85, 1.21] | 3.78 [2.95, 5.20]
End-fire | p-value | <0.0001 | <0.0001 a | <0.0001 a | 0.3118 | <0.0001 a | <0.0001 a
Side-fire | U-Net | 75.7 [63.4, 82.6] | 91.7 [83.7, 96.0] | 64.8 [50.0, 78.1] | 35.9 [12.1, 91.9] | 3.36 [1.97, 4.49] | 12.4 [8.35, 16.1]
Side-fire | U-Net++ | 87.9 [78.2, 92.0] | 93.1 [89.8, 96.6] | 84.0 [71.6, 90.5] | 18.2 [5.40, 34.9] | 1.75 [1.17, 2.89] | 10.0 [5.89, 11.4]
Side-fire | p-value | <0.0001 | 0.2611 | <0.0001 a | 0.0005 | 0.0003 | 0.0363 a
Overall | U-Net | 82.6 [75.5, 87.4] | 89.5 [83.7, 94.6] | 78.9 [64.2, 86.4] | 12.0 [4.24, 37.3] | 2.19 [1.69, 3.43] | 8.92 [6.15, 12.7]
Overall | U-Net++ | 92.3 [87.7, 94.3] | 93.6 [91.2, 97.4] | 89.8 [82.9, 95.0] | 6.84 [4.35, 18.2] | 1.18 [0.90, 1.78] | 5.53 [3.61, 10.1]
Overall | p-value | <0.0001 | 0.0002 | <0.0001 | 0.0006 | <0.0001 | <0.0001

DSC, Dice similarity coefficient; VPD, volume percent difference; MSD, mean surface dist.; HD, Hausdorff dist. a Normal distribution = paired t-test; Bolded metrics highlight the method with significantly reduced relative error.

Table 6. Median [Q1, Q3] 3D results showing the modified U-Net and U-Net++ trained using only side-fire images and tested on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS volumes.

Test dataset | Segmentation | DSC (%) | Recall (%) | Precision (%) | VPD (%) | MSD (mm) | HD (mm)
End-fire | U-Net | 81.4 [77.8, 86.3] | 74.4 [68.3, 85.3] | 90.2 [86.3, 93.9] | 17.4 [9.87, 26.3] | 2.48 [1.79, 3.38] | 8.57 [6.80, 11.8]
End-fire | U-Net++ | 84.5 [66.9, 90.3] | 75.3 [51.0, 86.9] | 96.5 [95.7, 99.2] | 23.3 [13.0, 48.3] | 2.44 [1.50, 4.72] | 8.90 [7.18, 13.8]
End-fire | p-value | 0.9854 | 0.2679 a | <0.0001 | 0.0121 | 0.5459 | 0.2436 a
Side-fire | U-Net | 84.9 [79.2, 87.8] | 85.9 [78.7, 92.3] | 82.2 [77.5, 85.9] | 5.16 [4.13, 13.0] | 1.81 [1.43, 2.37] | 7.77 [6.29, 8.68]
Side-fire | U-Net++ | 92.4 [91.8, 95.1] | 95.2 [91.8, 97.9] | 91.6 [87.2, 95.5] | 7.11 [3.34, 12.4] | 0.95 [0.66, 1.05] | 3.21 [2.58, 4.75]
Side-fire | p-value | <0.0001 | <0.0001 a | <0.0001 a | 0.4091 | <0.0001 | <0.0001
Overall | U-Net | 82.8 [78.4, 87.4] | 80.4 [71.4, 87.6] | 86.3 [80.7, 90.6] | 11.9 [5.03, 20.4] | 2.26 [1.54, 2.83] | 7.98 [6.62, 9.60]
Overall | U-Net++ | 91.5 [84.3, 92.7] | 89.3 [75.3, 96.0] | 95.6 [88.0, 96.9] | 13.0 [6.78, 23.4] | 1.25 [0.95, 2.51] | 5.46 [3.18, 9.48]
Overall | p-value | 0.0035 | 0.0394 | <0.0001 | 0.0972 | 0.0192 | 0.0436

DSC, Dice similarity coefficient; VPD, volume percent difference; MSD, mean surface dist.; HD, Hausdorff dist. a Normal distribution = paired t-test; Bolded metrics highlight the method with significantly reduced relative error.

Figure 5. Plot of median DSC for the modified U-Net and U-Net++ trained using only end-fire (left) and only side-fire (right) images and tested on both end-fire and side-fire images shown on the left and right half of each graph, respectively.


Comparing the results shown in tables 4 and 5 of the U-Net++ trained with the full 6761 image dataset to the U-Net++ trained using only end-fire images, use of the full dataset only demonstrated a significant improvement for the HD metric (0.4 mm) when tested on end-fire images. Similarly, comparing the U-Net++ trained with the full dataset to one trained using only side-fire images, only the precision metric was significantly different (1.8%) when tested on side-fire images. When tested on the unseen acquisition type, the use of the full dataset demonstrated improved performance for every metric in both cases.

3.3. Smaller end-fire, side-fire, and mixed datasets

Example segmentation results for the U-Net++ trained with varying-sized end-fire, side-fire, and mixed datasets are shown in figure 6. Plots of DSC as a function of training dataset size are shown in figure 7, highlighting the high performance of the U-Net++ when trained with small datasets. Corresponding quantitative comparisons are provided in appendix tables B1, B2, and B3.

Figure 6. Example results for the U-Net++ trained with varying-size end-fire (top row), side-fire (middle row), and mixed (bottom row) datasets. The full dataset corresponds to 2738 images for the end-fire network, 4023 images for the side-fire network, and 6761 for the mixed network.


Figure 7. Median DSC for the U-Net++ trained using varying-sized end-fire (left), side-fire (middle), and mixed (right) datasets, and tested on both end-fire (solid red) and side-fire (dashed line) images. The number of training images is provided along the x-axis. The dotted line denotes the maximum performance achieved by the U-Net++ trained with the full 6761 image dataset.


As shown in table B1, for the U-Net++ trained with reduced-size end-fire datasets and tested on end-fire images, significant differences relative to the full (2738 images) end-fire training dataset were observed for the 250 and 100 image sets for the DSC and MSD metrics, and for all reduced-size image sets for the HD metric.

When the U-Net++ was trained with reduced-size side-fire datasets and tested on side-fire images (table B2), multiple comparisons tests showed significant differences for the DSC, MSD, and HD metrics between the full (4023 images) side-fire training dataset and the 500 and 100 image sets.

As shown in table B3, for the U-Net++ trained with reduced-size mixed datasets and tested on end-fire images, multiple comparisons tests showed significant differences between the full (6761 images) mixed training dataset and the 500, 250, and 100 image sets for the DSC, MSD, and HD metrics. When tested on side-fire images, significant differences were observed between the full mixed training dataset and the 1000 through 100 image sets for the DSC, MSD, and HD metrics.

Comparing the U-Net++ trained with 2738 mixed images to the U-Net++ trained with 2738 end-fire images, no significant differences were observed when tested on end-fire images, but when tested on side-fire images, use of the mixed training dataset demonstrated significantly improved performance for all metrics. Similarly, for the U-Net++ trained with 4023 mixed images compared to the network trained with 4023 side-fire images, only the precision and recall metrics were significantly different when tested on side-fire images, with all metrics except precision significantly improved with use of the mixed training dataset when tested on end-fire images.

3.4. Image quality

A comparison of average image quality grading results for side-fire and end-fire 3D TRUS images of the prostate is shown in table 7. There were no significant differences between end-fire and side-fire image quality for any image quality factor or for the total averaged image quality.

Table 7. Mean ± standard deviation image quality grading results for side-fire and end-fire 3D TRUS images of the prostate. Mean grades are provided out of five for each individual image quality factor and for the total image quality.

Image quality factor | Description | End-fire 3D TRUS | Side-fire 3D TRUS
Acquisition quality | Quality of the 3D TRUS image acquisition regardless of anatomy | 4.4 ± 0.7 | 4.2 ± 1.2
Anatomy artifacts | Severity of anatomy artifacts (calcification, gas, catheter, etc) | 4.0 ± 0.7 | 4.1 ± 1.0
Prostate boundary | Visibility/clarity of the prostate boundary | 4.2 ± 0.7 | 4.2 ± 1.0
Total | Mean of three individual image quality factors | 4.2 ± 0.5 | 4.1 ± 0.9

A graph of DSC as a function of grade for each individual image quality factor is shown in figure 8. For end-fire testing images, image quality grade did not have a significant effect on segmentation performance in any metric. For side-fire testing images, only the boundary visibility grade had a significant effect for the modified U-Net, while all image quality factors except anatomy artifact grade had a significant effect on the DSC metric for the U-Net++.

Figure 8. Mean DSC as a function of acquisition quality (left), anatomy artifact severity (middle), and boundary visibility (right) for the modified U-Net (top row) and U-Net++ (bottom row) trained on the full dataset and tested on end-fire and side-fire images, shown on the left and right side of each graph, respectively.


Analysis of plots of DSC as a function of total image quality grade for the U-Net and U-Net++ (figure 9) showed no significant correlation for the end-fire testing dataset for any metric, with Spearman r coefficients less than 0.4. For the side-fire testing images, the modified U-Net showed a significant correlation between total image quality grade and DSC, recall, and HD metrics, with Spearman r coefficients of 0.60, 0.61, and −0.56, respectively, while the U-Net++ showed a significant correlation for the DSC and recall metrics with Spearman r coefficients of 0.46 and 0.55, respectively.

Figure 9. DSC as a function of total averaged image quality for the modified U-Net (left) and U-Net++ (right) trained on the full dataset and tested on end-fire (red) and side-fire (black) images. A linear regression is provided for each dataset.


4. Discussion

4.1. Complete dataset

To provide a baseline maximum performance level, we first compared the segmentation accuracy of the modified U-Net to the U-Net++ for both networks trained on the complete dataset. The results shown in table 4 demonstrate the nearly equivalent performance of the networks. This highlights that with a large training dataset of nearly 7000 2D images, the more advanced U-Net++ network with significantly more parameters did not offer any improvement in performance, motivating the experiments described in sections 3.2 and 3.3 focused on reduced-size datasets. Using the same 24 GB NVIDIA TITAN RTX GPU, the modified U-Net demonstrated a segmentation time that was three times faster, with speeds of 0.028 s per 2D slice compared to 0.088 s per slice for the U-Net++. After reconstruction of the 2D predictions into a 3D prostate surface, the total segmentation time was 0.61 s for the modified U-Net, which was half of the 1.33 s for the U-Net++. While this is a large relative difference, in a clinical setting the difference is inconsequential, as both present a significant reduction in segmentation time relative to manual segmentations, which can take 10–20 min.

Correlations between segmentation performance and prostate size were only significant for the DSC and VPD metrics. This is an expected result due to the nature of these metrics, as absolute differences that would be readily apparent for smaller prostate volumes would be reduced for large volumes when considering these overlap and volume-based metrics. As expected, boundary-based metrics showed no correlation with prostate size. The correlations we did observe were still weak, however, with Spearman coefficients of roughly r = 0.6 for DSC and r = −0.4 to −0.5 for VPD, highlighting the general robustness of our approach to prostate size differences.

Recent work by Lei et al and van Sloun et al report state-of-the-art methods for automatic prostate segmentation in 3D TRUS volumes. Lei et al report best results for DSC, MSD, and HD95 of 93.0 ± 3.0%, 0.57 ± 0.20 mm, and 2.27 ± 0.79 mm, respectively (Lei et al 2021). Our proposed 3D segmentation method using the modified U-Net and the U-Net++ demonstrated median [Q1, Q3] DSC of 94.1 [92.6, 94.9]% and 94.0 [92.2, 95.1]%, showing improved performance. Our proposed networks demonstrated median MSD [Q1, Q3] of 0.89 [0.77, 1.10] mm and 0.90 [0.73, 1.14] mm, with Lei et al showing improved performance for this metric. We do not report an HD95 metric, but our networks demonstrated a median HD [Q1, Q3] of 3.14 [2.49, 4.47] mm and 3.27 [2.71, 4.33] mm, which, while higher, is a more stringent metric showing the full HD, not excluding the worst five percent. van Sloun et al report a median (95% confidence interval) accuracy of 98 (95–99)%, Jaccard index of 0.93 (0.80–0.96), and HD of 3.0 (1.3–8.7) mm (van Sloun et al 2021). While not reported above, our modified U-Net and U-Net++ demonstrated a median [Q1, Q3] accuracy of 98.8 [98.4, 99.2]% and 98.9 [98.4, 99.2]%, and a median HD [Q1, Q3] of 3.14 [2.49, 4.47] mm and 3.27 [2.71, 4.33] mm, both of which are similar, with our approach reporting higher accuracy and van Sloun et al reporting lower HD. Our results were obtained with considerably less data, however, as van Sloun et al utilized a training dataset of 436 3D TRUS volumes compared to our complete training dataset of 206 3D TRUS volumes.

4.2. Split end-fire and side-fire datasets

Segmentation performance of the modified U-Net and U-Net++ trained with only end-fire or only side-fire images (figure 5 and tables 5 and 6) showed that the U-Net++ significantly outperformed the modified U-Net in nearly all cases. When trained using side-fire images and tested on end-fire images, no difference was observed, but the U-Net++ did have higher median performance, countered by a larger variation. These differences highlight the generalizability and efficiency of the U-Net++ in utilizing small training datasets. The modified U-Net had boundary errors due to shadowing artifacts, even when tested on the same image type as seen in the top row of figure 3. When tested on the image type not seen during training of the network, the U-Net++ still performed better, although it also had difficulties with shadowing artifacts (e.g. the bottom row of figure 3, with the heavily shadowed region seen near the top of the prostate). The modified U-Net had a depth of five layers compared to 50 for the U-Net++. This reduction in depth and number of parameters for the U-Net compared to the U-Net++ may alleviate the overfitting problem, which is important as training dataset size is reduced.

When assessing how the U-Net++ trained with only end-fire or only side-fire images compared to one trained with the full dataset, we found that there was little difference when tested on the same TRUS acquisition type the networks were trained with. This highlights a potentially practical finding that the presence of other image types in the training dataset does not add a significant benefit to the segmentation performance when only one image type is required to be segmented. However, when the U-Net++ trained with only end-fire or only side-fire images was tested on the TRUS acquisition type it had never seen before, use of the full dataset significantly improved performance. This demonstrates the necessity of including all image types in the training dataset, especially when generalizability and widespread application are important. DSC performance in these cases was still in the range of 85%–89% for the U-Net++, however, demonstrating the generalizability of our approach.

Differences between end-fire and side-fire images, including image quality and artifact prevalence, may explain the observed segmentation performance differences between TRUS image types. The differences in acquisition method between end-fire and side-fire 3D TRUS may result in artifacts in side-fire images such as air gaps due to lack of transducer contact or distal shadowing due to transducer distance from the prostate. Due to the nature of end-fire image acquisition, the radial plane used for deep learning segmentation closely matches the acquisition plane, resulting in improved segmentation accuracy. For side-fire images, only one of the twelve radial planes is the acquisition plane and the other eleven are interpolated slices with reduced resolution, potentially explaining some of the observed differences in segmentation performance. In HDR brachytherapy procedures where side-fire 3D TRUS is utilized, urinary catheters are commonly used, which create artifacts that are not seen in end-fire images used for prostate biopsy. The appearance of other organs such as the rectum and bladder also differs between end-fire and side-fire images, leading to increased prostate segmentation error where the algorithm included parts of the rectum or bladder when tested on the 3D TRUS type unseen by the network. Furthermore, due to differences in patient selection and the prevalence of hormone therapy prior to HDR brachytherapy treatment, the prostates of patients presenting for end-fire TRUS-guided biopsy are typically larger than those of patients undergoing side-fire TRUS-guided HDR brachytherapy. This led to underpredictions for side-fire networks tested on end-fire images and overpredictions for end-fire networks tested on side-fire images, limiting generalizability and necessitating the presence of both 3D TRUS types in the training dataset so the network can learn differences in size and shape.

4.3. Smaller end-fire, side-fire, and mixed datasets

For training datasets with as few as 500 end-fire images, just over 7% of the full dataset, DSC performance was within 1% of that of the U-Net++ trained with the full dataset of 6761 images. Results were similar for the U-Net++ trained with reduced-size side-fire datasets and tested on side-fire images. Networks trained with end-fire images performed better when tested on side-fire images compared to networks trained with side-fire images and tested on end-fire images, suggesting the features the network learns from end-fire images are more generalizable to side-fire images. As expected, mixed training datasets had high segmentation performance when tested on both image types even as the dataset size was reduced. This improved performance and generalizability is apparent in figure 7, highlighting the benefit of including all image types in the training dataset.

For a segmentation task involving only one image type, performance plateaus at a training dataset size of 1000 2D training images of that type, which were obtained from approximately 36 3D volumes. A dataset of this size is achievable at even small hospitals or research centers, showing that deep learning segmentation in 3D may be possible even with limited data by utilizing organ symmetry and a radial approach. The reduced training data requirement reduces the amount of manual segmentation required, a key benefit as accurate manual segmentation is a difficult and time-consuming process that is often a bottleneck in supervised machine learning. These results also show that for a segmentation task involving multiple image types, the presence of all image types in the training dataset is critical. Segmentation performance for mixed training datasets also plateaus at approximately 1000 training images, suggesting that deep learning segmentation in two image types is possible even if data is scarce.

4.4. Image quality

We developed a 5-point image quality grading scale based on three factors specifically for 3D TRUS prostate images. This grading scale helps provide transparency regarding the image quality of our clinical dataset, helping to contextualize our results. A numerical scale with clearly defined image quality factors rated from one to five may enable an easier comparison of segmentation performance between networks trained using different datasets. Designing the image quality grading scale independently of our dataset should allow it to be successfully applied to 3D TRUS datasets of varying quality.

Mean image quality grades for each individual factor provided in table 7 highlight the overall high quality of our dataset and the general similarity in image quality between end-fire and side-fire images, with no statistically significant differences observed and a maximum difference in mean of only 0.2. Side-fire images did have an increased standard deviation for each individual factor, highlighting the larger range of image qualities, including the presence of grades of 2 in each factor, which was not seen in the end-fire images. Our dataset contained no images with a grade of 1.

For end-fire images in our testing dataset, image quality had no significant effect on segmentation performance. In contrast, for side-fire images in our testing dataset, the boundary visibility grade significantly impacted segmentation performance for the U-Net, while the acquisition quality, boundary visibility, and total averaged image quality grades significantly impacted performance for the U-Net++. Boundary visibility proved to be a key factor in the algorithm's ability to accurately segment the prostate boundary for both networks, as expected. These results were further confirmed by the correlation analysis shown in figure 9, highlighting the significant effect of image quality on segmentation performance for side-fire images, but not for end-fire images. Correlations were not strong, however, with Spearman r coefficients in the range of 0.46–0.60 for the DSC metric for both the U-Net and U-Net++.

The lack of significant differences observed when comparing how segmentation performance varies with image quality, especially considering the end-fire images, may be attributed to the high mean image quality and subtle variation between the poorest quality image and the highest quality image. A dataset with more variation in image quality may better demonstrate the dependence of segmentation performance on image quality. In addition, due to the testing set size of 20 end-fire and 20 side-fire 3D TRUS volumes, some individual image quality grades had a very small sample size, which likely factored into the lack of significant differences observed for some of the image quality factors.

The differences in image quality and its effect on segmentation performance for end-fire compared to side-fire images may be explained in part by the nature of image acquisition. Ultrasound transducer orientation is one critical component; during end-fire image acquisition, the transducer contacts the rectal wall at the end of the transducer, pointing towards the prostate. During side-fire acquisition, however, the transducer is positioned horizontally inside the rectum, requiring a much larger contact area, which can result in an increased prevalence of air gaps due to lost contact, reducing image quality. Furthermore, due to differences in transducer position based on the intended application, the side-fire transducer is further away from the prostate, leading to hypoechoic regions away from the transducer due to issues with time-gain compensation.

4.5. Limitations and future work

Only one observer provided manual gold standard segmentations, thus inter- and intra-observer variability were not directly assessed; however, these considerations were addressed in Orlando et al (2020). In addition, only one observer defined the image quality grading scale and graded the testing dataset, which did not assess the impact of inter- or intra-observer variability. Future work will include validation of our image quality grading scale and its reliability, including an assessment of inter- and intra-observer variability. Image quality of the training dataset may play a critical role in segmentation performance, and although image quality of the testing dataset should have been representative of the training dataset, direct grading of the training images would allow for confirmation of this assumption. As shown in table 7, our 3D TRUS dataset was of high quality on average. A wide range in image quality is important for algorithm generalizability. Future work should investigate our segmentation approach when trained and tested with a lower quality dataset, ideally from a different center.

Patient clinical information, such as age, stage of prostate cancer, and Gleason score was not recorded for our dataset, and thus an assessment of how segmentation performance is impacted by these measures could not be completed. While this has not been assessed in previous work to our knowledge, differences in such measures could manifest as differences in image quality, potentially captured by our image quality grading scale as anatomy artifact severity for example. Future work could explicitly investigate the influence of patient clinical information on segmentation quality.

For our U-Net++ implementation, only one type of CNN backbone (ResNet) was used. Future work will utilize a U-Net++ ensemble network with results from multiple CNN backbones combined into one segmentation result using a method such as averaging, majority vote, or the STAPLE algorithm (Warfield et al 2004). Finally, a leave-one-out vendor study examining the impact of ultrasound machine vendor on segmentation performance would offer a strong assessment of generalizability, which is critical for widespread clinical translation.

5. Conclusions

This study investigated the effect of training dataset size, image quality, and image type on prostate segmentation in 3D TRUS volumes using a 2D radial plus 3D reconstruction approach, comparing a modified U-Net to a U-Net++ architecture. Beginning with a large, clinically diverse dataset of TRUS images, smaller training datasets were generated by splitting and reducing the dataset. Segmentation performance for the U-Net++ plateaued at end-fire, side-fire, or mixed training dataset sizes of 1000 2D images, resliced from approximately 36 3D volumes. This high performance with small datasets highlights the potential for widespread use of our approach or similar methods, even if data is scarce, demonstrating the possibility for increased access to automated segmentation methods. The development of an image quality grading scale specifically for 3D TRUS imaging provides a quantitative tool for assessing segmentation performance, with an aim to increase transparency regarding dataset quality and aid in comparison between segmentation methods trained using different datasets.

Acknowledgments

The authors are grateful for the funding support from the Ontario Institute of Cancer Research (OICR), the Canadian Institutes of Health Research (CIHR), and the Natural Sciences and Engineering Research Council of Canada (NSERC). This work was also supported by the London Regional Cancer Program's Catalyst Grants program using funds raised by the London Health Sciences Foundation. N Orlando was supported in part by the Queen Elizabeth II Graduate Scholarship in Science and Technology. The authors would also like to thank Dr Ashley Mercado for his assistance in collecting images during prostate biopsy procedures and Dr Aaron Ward for his thoughtful discussion surrounding this work.

Disclosures

The authors have no relevant conflicts of interest to disclose.

Appendix A

Figure A1. Network diagram for the modified U-Net.


Figure A2. Network diagram for the U-Net++.


Appendix B

Table B1. Median [Q1, Q3] 2D results for the U-Net++ trained using end-fire datasets of varying size, from 2738 (full end-fire set) to 100 images. The networks were evaluated on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS images of the prostate.

Test dataset | End-fire training images | DSC (%) | Recall (%) | Precision (%) | MSD (mm) | HD (mm)
End-fire | 2738 | 94.5 [93.5, 95.1] | 93.7 [91.0, 95.9] | 95.8 [93.6, 97.4] | 1.18 [0.98, 1.42] | 3.79 [2.97, 4.58]
End-fire | 1000 | 93.7 [91.7, 94.8] | 93.8 [89.8, 95.3] | 95.9 [94.0, 97.2] | 1.35 [1.20, 1.97] | 4.76 [3.89, 6.19]
End-fire | 500 | 93.9 [92.0, 94.8] | 95.1 [90.3, 96.2] | 94.5 [91.8, 97.1] | 1.38 [1.06, 1.96] | 4.39 [3.54, 5.60]
End-fire | 250 | 93.2 [91.9, 94.9] | 93.1 [90.1, 95.8] | 95.4 [93.1, 97.2] | 1.52 [1.13, 1.70] | 4.66 [3.86, 5.67]
End-fire | 100 | 90.1 [88.5, 92.6] | 87.5 [84.0, 92.2] | 95.1 [92.0, 96.8] | 2.04 [1.46, 2.91] | 6.24 [4.55, 7.60]
Side-fire | 2738 | 88.8 [80.3, 92.7] | 89.7 [84.7, 93.4] | 90.8 [82.4, 94.5] | 2.00 [1.46, 3.64] | 6.38 [5.01, 11.1]
Side-fire | 1000 | 86.4 [74.1, 90.7] | 92.6 [89.2, 95.2] | 83.2 [63.2, 91.2] | 3.81 [1.96, 6.17] | 12.4 [6.72, 17.5]
Side-fire | 500 | 83.8 [73.9, 90.4] | 82.1 [67.1, 90.9] | 88.9 [81.8, 94.8] | 2.89 [1.89, 4.89] | 9.83 [6.14, 13.0]
Side-fire | 250 | 82.1 [75.0, 88.2] | 85.1 [70.8, 91.1] | 85.9 [73.7, 92.9] | 3.62 [2.64, 4.99] | 12.0 [8.83, 14.5]
Side-fire | 100 | 75.3 [64.4, 84.4] | 86.3 [78.6, 92.5] | 71.3 [59.3, 82.9] | 5.29 [3.57, 6.84] | 16.1 [11.7, 18.6]
Overall | 2738 | 93.3 [88.5, 94.7] | 92.0 [88.5, 95.1] | 94.0 [90.6, 96.2] | 1.44 [1.07, 2.03] | 4.78 [3.47, 6.94]
Overall | 1000 | 91.3 [86.2, 94.2] | 93.2 [89.5, 95.3] | 93.3 [82.9, 96.0] | 1.95 [1.34, 4.55] | 6.40 [4.26, 14.1]
Overall | 500 | 91.5 [83.3, 93.9] | 90.5 [81.8, 95.3] | 92.7 [86.9, 95.7] | 1.94 [1.24, 2.93] | 5.83 [4.08, 9.83]
Overall | 250 | 89.2 [81.7, 93.2] | 90.6 [82.2, 93.8] | 93.1 [83.8, 96.5] | 1.98 [1.48, 3.75] | 6.60 [4.63, 12.0]
Overall | 100 | 87.7 [74.8, 90.3] | 87.0 [82.5, 92.3] | 87.3 [70.8, 95.2] | 3.01 [2.03, 5.32] | 9.13 [5.88, 16.7]

DSC, Dice similarity coefficient; MSD, mean surface dist.; HD, Hausdorff dist. Bolded metrics denote significant differences compared to the full (2738 images) dataset.

Table B2. Median [Q1, Q3] 2D results for the U-Net++ trained using side-fire datasets of varying size, from 4023 (full side-fire set) to 100 images. The networks were evaluated on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS images of the prostate.

Test dataset | Side-fire training images | DSC (%) | Recall (%) | Precision (%) | MSD (mm) | HD (mm)
End-fire | 4023 | 83.5 [71.8, 90.7] | 74.8 [58.3, 85.6] | 97.6 [96.5, 98.8] | 3.07 [2.15, 4.89] | 8.56 [7.32, 12.4]
End-fire | 1000 | 77.7 [71.1, 84.2] | 64.5 [57.2, 74.2] | 98.8 [97.8, 99.7] | 3.83 [3.13, 5.44] | 10.9 [8.47, 14.2]
End-fire | 500 | 74.4 [66.0, 81.7] | 62.5 [50.0, 71.3] | 98.0 [95.8, 99.4] | 4.05 [3.33, 6.15] | 12.4 [9.31, 16.5]
End-fire | 250 | 73.8 [69.7, 81.1] | 60.2 [57.1, 69.8] | 98.6 [95.2, 99.6] | 4.59 [3.76, 5.57] | 12.3 [10.2, 14.6]
End-fire | 100 | 65.3 [58.0, 71.7] | 49.2 [52.6, 59.1] | 97.0 [94.8, 99.6] | 6.28 [4.19, 7.28] | 14.8 [10.8, 16.3]
Side-fire | 4023 | 93.3 [91.5, 95.0] | 92.1 [88.5, 94.6] | 95.1 [93.7, 97.7] | 1.13 [0.90, 1.48] | 3.50 [2.96, 4.67]
Side-fire | 1000 | 90.3 [89.0, 94.9] | 93.8 [87.4, 95.5] | 93.7 [87.9, 95.1] | 1.59 [0.95, 2.13] | 4.95 [3.59, 6.89]
Side-fire | 500 | 90.3 [84.0, 92.7] | 86.6 [78.9, 90.5] | 95.7 [91.1, 98.0] | 1.68 [1.20, 2.79] | 5.84 [3.88, 8.99]
Side-fire | 250 | 89.2 [85.4, 92.6] | 87.2 [84.7, 91.8] | 91.8 [89.0, 95.9] | 2.21 [1.20, 2.98] | 8.44 [4.17, 10.1]
Side-fire | 100 | 81.0 [73.1, 83.7] | 75.9 [67.0, 85.2] | 85.4 [81.9, 95.9] | 3.46 [2.88, 4.27] | 9.76 [8.80, 11.6]
Overall | 4023 | 91.0 [83.3, 93.6] | 86.7 [74.5, 92.8] | 97.0 [94.1, 98.1] | 1.65 [1.09, 3.08] | 5.30 [3.48, 8.60]
Overall | 1000 | 85.9 [76.2, 92.9] | 84.6 [63.7, 94.4] | 95.9 [92.6, 99.0] | 2.35 [1.26, 4.59] | 7.43 [4.12, 11.8]
Overall | 500 | 82.4 [70.9, 90.5] | 74.4 [58.3, 87.2] | 97.1 [94.3, 98.3] | 2.85 [1.59, 4.79] | 9.32 [5.54, 14.3]
Overall | 250 | 83.6 [73.4, 89.5] | 80.4 [60.1, 87.5] | 95.5 [91.0, 98.6] | 3.13 [1.85, 4.70] | 10.2 [6.89, 12.7]
Overall | 100 | 72.1 [62.3, 81.3] | 63.9 [48.8, 77.2] | 93.5 [84.7, 97.1] | 4.25 [3.26, 6.76] | 11.5 [9.45, 15.5]

DSC, Dice similarity coefficient; MSD, mean surface dist.; HD, Hausdorff dist. Bolded metrics denote significant differences compared to the full (4023 images) dataset.

Table B3. Median [Q1, Q3] 2D results for the U-Net++ trained using mixed (end-fire and side-fire) datasets of varying size, from 6761 (full dataset) to 100 images. The networks were evaluated on an unseen test dataset of 20 end-fire and 20 side-fire 3D TRUS images of the prostate.

Test dataset | Mixed training images | DSC (%) | Recall (%) | Precision (%) | MSD (mm) | HD (mm)
End-fire | 6761 | 94.9 [93.8, 95.5] | 94.6 [92.7, 95.6] | 96.3 [94.0, 97.2] | 1.10 [0.95, 1.38] | 3.68 [2.87, 4.54]
End-fire | 4023 | 94.4 [93.4, 95.6] | 96.2 [93.3, 97.6] | 94.0 [90.7, 96.3] | 1.31 [1.04, 1.50] | 4.34 [3.60, 5.33]
End-fire | 2738 | 94.8 [93.2, 95.5] | 94.1 [90.4, 95.6] | 96.4 [95.5, 98.1] | 1.17 [0.99, 1.41] | 3.71 [3.30, 4.68]
End-fire | 1000 | 94.3 [92.0, 95.0] | 95.2 [91.4, 96.6] | 94.4 [91.8, 97.3] | 1.29 [1.05, 1.95] | 4.05 [3.45, 5.87]
End-fire | 500 | 92.0 [89.6, 93.8] | 90.0 [84.5, 94.6] | 96.4 [93.4, 98.0] | 1.58 [1.38, 2.07] | 5.03 [4.14, 6.65]
End-fire | 250 | 89.5 [84.3, 91.1] | 84.3 [74.2, 88.9] | 96.5 [93.3, 98.2] | 2.11 [1.82, 3.42] | 6.16 [5.10, 9.08]
End-fire | 100 | 75.2 [63.6, 80.2] | 60.5 [48.9, 69.0] | 99.6 [97.1, 99.9] | 4.90 [3.88, 6.99] | 11.8 [8.51, 14.9]
Side-fire | 6761 | 94.6 [92.3, 95.1] | 95.7 [91.3, 96.5] | 94.2 [92.2, 96.4] | 1.00 [0.83, 1.32] | 3.27 [2.85, 4.19]
Side-fire | 4023 | 93.6 [91.0, 95.2] | 94.4 [90.8, 96.6] | 93.7 [90.1, 95.9] | 1.09 [0.91, 1.80] | 3.84 [3.10, 5.72]
Side-fire | 2738 | 93.1 [89.6, 94.9] | 92.5 [89.3, 93.6] | 95.5 [91.3, 96.7] | 1.23 [0.88, 1.69] | 4.29 [3.12, 5.69]
Side-fire | 1000 | 91.8 [88.2, 94.6] | 92.5 [88.8, 95.6] | 94.2 [88.0, 96.4] | 1.44 [1.00, 2.33] | 4.96 [3.63, 7.14]
Side-fire | 500 | 90.5 [86.4, 93.6] | 93.8 [88.3, 95.5] | 90.9 [83.8, 95.2] | 1.73 [1.21, 2.74] | 5.61 [3.78, 8.69]
Side-fire | 250 | 85.2 [81.4, 90.8] | 92.2 [88.6, 94.9] | 80.9 [74.1, 89.1] | 3.05 [1.73, 4.56] | 8.85 [5.85, 15.6]
Side-fire | 100 | 84.3 [77.4, 86.0] | 80.9 [76.4, 87.1] | 85.6 [80.2, 92.5] | 2.74 [2.52, 3.88] | 8.48 [6.66, 11.6]
Overall | 6761 | 94.7 [92.6, 95.4] | 95.1 [92.7, 96.4] | 95.7 [92.9, 96.8] | 1.07 [0.90, 1.33] | 3.60 [2.87, 4.45]
Overall | 4023 | 94.2 [92.6, 95.4] | 95.2 [92.8, 97.1] | 93.9 [90.6, 96.1] | 1.21 [1.00, 1.52] | 4.00 [3.50, 5.33]
Overall | 2738 | 94.1 [92.1, 95.2] | 92.9 [90.2, 94.8] | 96.1 [92.4, 97.4] | 1.19 [0.98, 1.56] | 3.98 [3.30, 4.76]
Overall | 1000 | 93.2 [89.7, 94.9] | 94.6 [89.4, 95.9] | 94.3 [88.3, 96.5] | 1.32 [1.01, 2.03] | 4.45 [3.52, 6.15]
Overall | 500 | 91.8 [88.4, 93.8] | 92.0 [87.4, 94.9] | 95.1 [87.5, 96.7] | 1.63 [1.35, 2.40] | 5.14 [4.10, 7.99]
Overall | 250 | 87.0 [82.9, 91.0] | 88.9 [82.9, 92.5] | 91.4 [80.8, 96.5] | 2.41 [1.82, 3.56] | 7.84 [5.46, 10.6]
Overall | 100 | 79.6 [67.7, 85.8] | 73.1 [58.0, 81.8] | 93.2 [83.4, 99.7] | 3.88 [2.61, 5.81] | 9.86 [7.28, 13.4]

DSC, Dice similarity coefficient; MSD, mean surface dist.; HD, Hausdorff dist. Bolded metrics denote significant differences compared to the full (6761 images) dataset.
