Ultrasound image segmentation based on Transformer and U-Net with joint loss

Background Ultrasound image segmentation is challenging due to the low signal-to-noise ratio and poor quality of ultrasound images. With deep learning advancements, convolutional neural networks (CNNs) have been widely used for ultrasound image segmentation. However, due to the intrinsic locality of convolutional operations and the varying shapes of segmentation objects, segmentation methods based on CNNs still face challenges with accuracy and generalization. In addition, Transformer is a network architecture with self-attention mechanisms that performs well in the field of computer vision. Based on the characteristics of Transformer and CNNs, we propose a hybrid architecture based on Transformer and U-Net with joint loss for ultrasound image segmentation, referred to as TU-Net.
Methods TU-Net is based on the encoder-decoder architecture and includes encoder, parallel attention mechanism and decoder modules. The encoder module is responsible for reducing dimensions and capturing different levels of feature information from ultrasound images; the parallel attention mechanism is responsible for capturing global and multiscale local feature information; and the decoder module is responsible for gradually recovering dimensions and delineating the boundaries of the segmentation target. Additionally, we adopt joint loss to optimize learning and improve segmentation accuracy. We use experiments on datasets of two types of ultrasound images to verify the proposed architecture. We use the Dice scores, precision, recall, Hausdorff distance (HD) and average symmetric surface distance (ASD) as evaluation metrics for segmentation performance.
Results For the brachial plexus and fetal head ultrasound image datasets, TU-Net achieves mean Dice scores of 79.59% and 97.94%; precisions of 81.25% and 98.18%; recalls of 80.19% and 97.72%; HDs (mm) of 12.44 and 6.93; and ASDs (mm) of 4.29 and 2.97, respectively. Compared with the other six segmentation algorithms, TU-Net improves the mean Dice score, precision, recall, HD and ASD by approximately 3.41%, 2.62%, 3.74%, 36.40% and 31.96%, respectively.


INTRODUCTION
As a medical imaging modality, ultrasound imaging has been widely applied in clinical screening, diagnosis and treatment. Accurately segmenting ultrasound images is very important for making subsequent diagnoses. Unlike computed tomography (CT) and magnetic resonance imaging (MRI), ultrasound imaging is portable, cost-effective and uses nonionizing radiation. Nevertheless, due to their coherent nature, ultrasound images are impacted by speckle noise, missing boundaries and low signal-to-noise ratios (SNR). Therefore, ultrasound images are more difficult to segment than other medical images (Fiorentino et al., 2023; Wang et al., 2021). Some algorithms used in traditional image segmentation have been applied to ultrasound images, but they have not improved segmentation accuracy. With deep learning advancements, some approaches based on convolutional neural networks (CNNs) have been widely used in the field of ultrasound imaging. In particular, architectures based on the encoder-decoder architecture, such as U-Net, have improved segmentation accuracy (Malhotra et al., 2022; Ronneberger, Fischer & Brox, 2015). The encoder module is responsible for reducing dimensions and capturing different levels of feature information. The decoder module is responsible for gradually recovering dimensions and delineating boundaries of the segmentation target. In addition, the skip connection between the encoder and decoder can compensate for the loss of feature information caused by successive convolutions and pooling. However, due to the intrinsic locality of convolution operations, these approaches are limited in their ability to capture global context. The attention mechanism can be used as a resource allocation scheme, which is the main method for addressing information overload and is applied in the computer vision field (de Santana Correia & Colombini, 2022). Hence, some researchers have combined attention mechanisms with CNNs to capture global feature information.
Transformers can capture global feature information with their long-range dependency capabilities. Therefore, the Transformer-based model has achieved state-of-the-art performance in natural language processing (NLP). The subsequently proposed Vision Transformer (ViT) was applied to image recognition and improved results. ViT takes image patches as input and uses self-attention to learn the global information of all image patches. Some approaches based on ViT have been applied to image segmentation and have improved performance. However, ViT focuses on global feature information and lacks localization information. Similarly, some approaches based on the encoder-decoder architecture apply successive convolutions and pooling, resulting in a lack of global feature information and spatial information. Because the objects in ultrasound images have various shapes, sizes and blurry boundaries, segmenting them is difficult. Considering these issues, we propose a hybrid Transformer and U-Net with joint loss (TU-Net) for the segmentation of ultrasound images. The contributions of this work can be summarized as follows: (1) TU-Net can integrate local and global feature information with CNNs and Transformers. (2) We adopt the parallel attention mechanism in the proposed TU-Net. Among the attention mechanisms, the one based on CNNs can extract multiscale local feature information of targets with varying shapes, and the one based on Transformer can capture global feature information. (3) For segmentation objects with blurry boundaries, we propose using the Dice and TopK joint loss to improve prediction accuracy. We validate the effectiveness of our proposed method; the proposed method outperforms state-of-the-art methods on ultrasound images of the brachial plexus (BP) and fetal head.
The remainder of the article is organized as follows. The related works are shown in "Related Work". The proposed method and implementation details are explored in "Materials and Methods". Extensive experiments are conducted to evaluate our proposed methods in "Experimental Details". In "Results" and "Discussion", we discuss these results and conclude this article.

RELATED WORK

Segmentation networks based on CNNs
With the advancement of deep learning, some approaches based on CNNs are widely used in the field of ultrasound imaging segmentation. Based on U-Net, Zhou et al. (2019) proposed using a series of dense skip pathways to capture more feature information from images. This method can compensate for the loss of feature information caused by successive convolutions and pooling. Due to the intrinsic locality of convolution operations, approaches based on U-Net are limited in their ability to capture global context. Therefore, some studies have adopted multiple-channel convolution to solve this problem. Mehta & Sivaswamy (2017) proposed a novel network that used CNNs to combine and represent 3D context information for brain structure segmentation (M-Net). Javaid, Dasnoy & Lee (2018) proposed dilated convolution with U-Net to extract global feature information for breast image segmentation (dilated U-Net). Zhang et al. (2020) proposed a multiple-channel network with large-kernel convolution for ultrasound image segmentation (MA-Net). When segmenting objects with various shapes and sizes, these approaches based on CNNs have weak generalization.

Segmentation networks based on Transformer
The Transformer model was originally designed for NLP, and the subsequently proposed ViT was applied to image recognition and achieved better results. Meanwhile, some approaches based on ViT have been applied in the field of image segmentation. Wang et al. (2022) proposed a novel mixed transformer module for simultaneous intra- and inter-affinity learning for medical image segmentation. Shen et al. (2022) applied Transformer with residual axial attention for breast structure segmentation of ultrasound images. Gao, Zhou & Metaxas (2021) proposed an efficient self-attention mechanism along with relative position encoding for medical image segmentation.

MATERIALS AND METHODS
The proposed TU-Net includes encoder, parallel attention mechanism and decoder modules. The structure of TU-Net is shown in Fig. 1. First, the ultrasound image is input into a successive encoder module to obtain high-dimensional feature maps. Next, these feature maps are input into the parallel attention mechanism to obtain global and multiscale local feature information. Finally, the different level feature information in the encoder module is connected to the decoder module by a skip pathway to generate the segmentation mask.

Encoder and decoder modules
Networks based on the encoder-decoder architecture have been widely used in image segmentation. The encoder module is responsible for reducing dimensions and capturing different levels of feature information of ultrasound images. The decoder module is responsible for gradually recovering dimensions and delineating the boundaries of the segmentation target. In the proposed TU-Net, ResNet-50 is used as the encoder to capture the feature information of the input image. The decoder module consists of an upsampling layer and two successive convolution layers, which are used to recover spatial dimensions and boundary information. Due to the loss of feature information caused by successive convolutions and pooling, we adopt skip pathways to deliver feature information captured by the encoder to the decoder (Drozdzal et al., 2016).

Parallel attention mechanism module
Many networks based on the attention mechanism have been widely applied in the field of image segmentation. The attention mechanism can avoid using multiple similar feature maps and focus on the most salient and informative features without additional supervision. Recently, ViT has achieved excellent performance in many computer vision tasks using a self-attention mechanism. However, it focuses on global feature information and neglects localization information. Similarly, some approaches based on the encoder-decoder architecture apply successive convolutions and pooling, resulting in a lack of global feature information and a loss of spatial information. Therefore, we adopt the self-attention mechanism of Transformer to capture the global feature information and use a series of atrous convolution and pyramid pooling to capture multiscale local feature information.
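To illustrate why atrous (dilated) convolution helps capture multiscale local information, the following is a minimal one-dimensional sketch in plain Python. It is not the TU-Net implementation: the function names `atrous_conv1d` and `receptive_field`, the toy signal and the kernel are all hypothetical, and the dilation rates actually used in TU-Net are not stated in this text. The sketch only shows that increasing the rate widens the receptive field without adding kernel weights.

```python
def atrous_conv1d(signal, kernel, rate):
    """Valid 1D atrous convolution (cross-correlation form).

    Kernel taps are spaced `rate` samples apart, so a kernel of size k
    covers (k - 1) * rate + 1 input samples.
    """
    span = (len(kernel) - 1) * rate
    return [
        sum(kernel[j] * signal[i + j * rate] for j in range(len(kernel)))
        for i in range(len(signal) - span)
    ]


def receptive_field(kernel_size, rate):
    """Number of input samples seen by one output of an atrous kernel."""
    return (kernel_size - 1) * rate + 1


# Hypothetical toy data: a ramp signal and a simple difference kernel.
toy_signal = [1, 2, 3, 4, 5, 6, 7]
toy_kernel = [1, 0, -1]

dense_output = atrous_conv1d(toy_signal, toy_kernel, rate=1)
dilated_output = atrous_conv1d(toy_signal, toy_kernel, rate=2)
```

With the same three weights, the rate-2 kernel compares samples four positions apart instead of two, which is the property that a set of parallel atrous branches with different rates exploits to respond to targets of different sizes.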
In traditional segmentation networks based on ViT, Transformer is directly used as an encoder to extract feature information from images. As shown by some experiments and studies by other researchers, this method cannot improve segmentation accuracy (Chen et al., 2021). Therefore, we utilize the high-dimensional feature maps from the successive encoder module as the input of the transformer module. We first reshape the input feature maps into a sequence of flattened two-dimensional (2D) patches. Each patch is of size $P \times P$ ($P = 16$), and the number of image patches is $N = (H \times W)/P^2$ ($H$ and $W$ specify the dimensions of feature maps). Second, we use a trainable linear projection to map the patches into a latent D-dimensional embedding. To retain the positional information of the patches, we use position embeddings that are added to the embeddings of the patches. The self-attention mechanism of TU-Net consists of $L$ ($L = 12$) transformer modules. Each transformer module consists of a multihead self-attention (MSA) layer and a multilayer perceptron (MLP) layer (Dosovitskiy et al., 2020). These layers are connected in turn. The output of each transformer module can be written as follows:

$$Z'_l = \mathrm{MSA}(\mathrm{LN}(Z_{l-1})) + Z_{l-1} \tag{1}$$

$$Z_l = \mathrm{MLP}(\mathrm{LN}(Z'_l)) + Z'_l \tag{2}$$

where $\mathrm{LN}(\cdot)$ denotes the layer normalization operator, and $Z_l$ is the transformer image representation. The structure of the transformer module is illustrated in Fig. 2.
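The patch-flattening step described above can be sketched in plain Python as follows. This is an illustrative sketch only: the helper name `split_into_patches` and the toy 4 × 4 map with P = 2 are assumptions chosen for readability, whereas the text uses P = 16 on encoder feature maps. The sketch checks the relation N = (H × W)/P² on a small example.

```python
def split_into_patches(feature_map, patch_size):
    """Split an H x W grid (list of rows) into flattened P x P patches.

    Patches are returned in row-major order; each patch is a flat list
    of length P * P, ready for a linear projection to an embedding.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                feature_map[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical toy map: values 0..15 laid out row by row.
toy_map = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = split_into_patches(toy_map, patch_size=2)
```

Here H = W = 4 and P = 2, so the sequence contains N = 16/4 = 4 patches, each of length 4. In a full model, each flattened patch would then be projected to a D-dimensional embedding and summed with its position embedding before entering the transformer modules.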
To address boundary information loss and segmentation objects with various sizes, we adopt a series of atrous convolution and pyramid pooling modules to capture multiscale

Joint loss
Image segmentation is used to determine whether a pixel belongs to the foreground or background of an image. The Dice coefficient is the most commonly used segmentation evaluation metric in the field of image segmentation. This coefficient represents the overlapping region of the ground truth and the prediction. The Dice coefficient is given as:

$$\mathrm{Dice} = \frac{2|X \cap Y|}{|X| + |Y|} \tag{3}$$

where $|X|$ is the number of ground-truth pixels, $|Y|$ is the number of predicted pixels and $|X \cap Y|$ is the number of pixels shared by $X$ and $Y$. The Dice loss function is defined as follows:

$$L_{Dice} = 1 - \mathrm{Dice} \tag{4}$$

However, the sizes of the regions of interest in different ultrasound images vary greatly. Thus, the learning process can become trapped in local minima of the loss function, resulting in the predictions of the network being strongly biased towards the background and missing or only partially detecting the foreground. Meanwhile, most labelled pixels can easily be discriminated from the others, and further learning on these pixels will not improve the accuracy. Therefore, we add TopK loss in the process of training. TopK loss is a variant of cross entropy, which is used to force networks to focus on hard samples and automatically balance biased training data during processing (Ma et al., 2021).
The TopK loss function is defined as:

$$L_{TopK} = -\frac{1}{N}\sum_{i \in K}\sum_{c=1}^{C} g_i^c \log s_i^c \tag{5}$$

where $g_i^c$ and $s_i^c$ denote the ground truth and predicted segmentation of voxel $i$ for class $c$, respectively. $C$ is the number of classes, and $N$ is the number of voxels. $K$ is the set of the $k$% worst pixels. In TU-Net, the TopK loss with $k = 10$% is the default setting. In the proposed TU-Net, we adopt joint loss to optimize learning and improve prediction accuracy. $L_{total}$ is the sum of the Dice and TopK loss functions. The joint loss function is given as:

$$L_{total} = L_{Dice} + L_{TopK} \tag{6}$$

where $L_{Dice}$ and $L_{TopK}$ are the Dice and TopK loss functions, respectively.
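The joint loss can be illustrated with a small binary-segmentation sketch in plain Python. Several choices here are assumptions rather than details from the text: the function names are hypothetical, the Dice term uses the common soft (probability-weighted) form with a small smoothing constant `eps`, the cross entropy is the two-class form, and the TopK term is normalized by the total number of pixels N, following the description of N as the number of voxels. Some implementations instead average over the retained pixels only, which differs by a constant factor.

```python
import math


def dice_loss(probs, labels, eps=1e-7):
    """Soft Dice loss: 1 - 2|X ∩ Y| / (|X| + |Y|) on foreground probabilities."""
    intersection = sum(p * g for p, g in zip(probs, labels))
    total = sum(probs) + sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (total + eps)


def topk_loss(probs, labels, k_percent=10.0, eps=1e-7):
    """Cross entropy summed over the k% hardest pixels, divided by N."""
    pixel_losses = []
    for p, g in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        pixel_losses.append(-(g * math.log(p) + (1 - g) * math.log(1.0 - p)))
    keep = max(1, int(len(pixel_losses) * k_percent / 100.0))
    hardest = sorted(pixel_losses, reverse=True)[:keep]
    return sum(hardest) / len(pixel_losses)


def joint_loss(probs, labels, k_percent=10.0):
    """Unweighted sum of the Dice and TopK losses."""
    return dice_loss(probs, labels) + topk_loss(probs, labels, k_percent)


# Hypothetical toy example: ten foreground pixels, one predicted badly.
toy_probs = [0.9] * 9 + [0.1]
toy_labels = [1] * 10
toy_total = joint_loss(toy_probs, toy_labels)
```

In the toy example, only the single worst pixel (probability 0.1 for a foreground label) contributes to the TopK term, so the easy pixels add nothing to that part of the gradient, while the Dice term still accounts for the overall overlap.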

Evaluation metrics
We also use precision, recall, Hausdorff distance (HD) and average symmetric surface distance (ASD) as evaluation metrics in addition to the Dice scores. Precision and recall are defined according to Eqs. (7) and (8), respectively.

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$

where TN, TP, FN and FP are the true-negative, true-positive, false-negative and false-positive values, respectively (Chang et al., 2009). Because there are considerable noise and outliers in the ultrasound images, we also include metrics based on the surface distance.
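As a worked example of the overlap-based metrics, the sketch below counts TP, FP and FN on flattened binary masks and derives precision, recall and the Dice score from them. The function name `overlap_metrics` and the toy masks are hypothetical, and returning 0.0 for an empty denominator is an assumed convention that the text does not specify.

```python
def overlap_metrics(prediction, truth):
    """Return (precision, recall, dice) for two flattened binary masks."""
    tp = sum(1 for p, t in zip(prediction, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(prediction, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(prediction, truth) if p == 0 and t == 1)

    # Assumed convention: a metric with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, dice


# Hypothetical toy masks with TP = 2, FP = 1, FN = 1.
toy_prediction = [1, 1, 1, 0, 0, 0]
toy_truth = [1, 1, 0, 1, 0, 0]
toy_precision, toy_recall, toy_dice = overlap_metrics(toy_prediction, toy_truth)
```

For binary masks, 2TP/(2TP + FP + FN) is the same quantity as 2|X ∩ Y|/(|X| + |Y|), since |X| = TP + FN and |Y| = TP + FP.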
The HD represents the maximum value of misalignment between two objects, which is mainly used to evaluate the structural difference between two targets. Smaller HD values represent higher segmentation accuracy (Zhang et al., 2020).
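The sets of surface points of A and B are $S(A)$ and $S(B)$, respectively. The shortest distance of $S(A)$ to an arbitrary point $v$ is given by:

$$d(v, S(A)) = \min_{s_A \in S(A)} \lVert s_A - v \rVert \tag{9}$$

The shortest distance of $S(B)$ to an arbitrary point $v$ is given by:

$$d(v, S(B)) = \min_{s_B \in S(B)} \lVert s_B - v \rVert \tag{10}$$

where $\lVert \cdot \rVert$ represents the Euclidean distance. The HD is defined as:

$$\mathrm{HD}(A, B) = \max\left\{ \max_{s_A \in S(A)} d(s_A, S(B)),\ \max_{s_B \in S(B)} d(s_B, S(A)) \right\} \tag{11}$$

The ASD represents the average value of surface distances of A to B and B to A. Similarly, smaller ASD values represent higher segmentation accuracy. The ASD is defined as:

$$\mathrm{ASD}(A, B) = \frac{1}{|S(A)| + |S(B)|}\left( \sum_{s_A \in S(A)} d(s_A, S(B)) + \sum_{s_B \in S(B)} d(s_B, S(A)) \right) \tag{12}$$

The compared algorithms, including TransUNet (Chen et al., 2021), are implemented as described in their original works, and the hyperparameters used for training are given in Table 1. In addition, the Friedman test is applied as a statistical analysis to evaluate the segmentation performance of the algorithms. In this analysis, the significance level is 5%.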
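The surface-distance metrics defined in this section (HD and ASD) can be sketched on small point sets as follows. This is a brute-force illustration under stated assumptions: the function names are hypothetical, the surfaces are given directly as lists of coordinate tuples (extracting boundary points from a mask is not shown), and distances are in the same units as the coordinates rather than millimetres.

```python
import math


def distance_to_set(point, surface):
    """Shortest Euclidean distance from a point to any point of a surface."""
    return min(math.dist(point, s) for s in surface)


def hausdorff_distance(surface_a, surface_b):
    """Largest of all nearest-neighbour distances, taken in both directions."""
    a_to_b = max(distance_to_set(p, surface_b) for p in surface_a)
    b_to_a = max(distance_to_set(p, surface_a) for p in surface_b)
    return max(a_to_b, b_to_a)


def average_symmetric_surface_distance(surface_a, surface_b):
    """Mean nearest-neighbour distance over the points of both surfaces."""
    total = sum(distance_to_set(p, surface_b) for p in surface_a)
    total += sum(distance_to_set(p, surface_a) for p in surface_b)
    return total / (len(surface_a) + len(surface_b))


# Hypothetical toy surfaces on a line.
toy_surface_a = [(0.0, 0.0), (1.0, 0.0)]
toy_surface_b = [(0.0, 0.0), (4.0, 0.0)]
toy_hd = hausdorff_distance(toy_surface_a, toy_surface_b)
toy_asd = average_symmetric_surface_distance(toy_surface_a, toy_surface_b)
```

In the toy example, the point (4, 0) of B lies 3 units from its nearest neighbour in A, which sets the HD, whereas the ASD averages the four nearest-neighbour distances 0, 1, 0 and 3. This shows why the HD is far more sensitive to a single outlying boundary point than the ASD.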

RESULTS
To demonstrate the superiority of TU-Net segmentation performance, we compare it to other segmentation algorithms, including U-Net, U-Net++, M-Net, dilated U-Net, MA-Net and TransUNet. We use the mean value and standard deviation of Dice scores, precision, recall, HD and ASD to evaluate algorithm performance. Finally, SPSS 23.0 is used for statistical analysis of the above algorithms.

Brachial plexus datasets
Because BP is an important motor and sensory nerve of the upper limb, blocking BP can relieve much pain in upper limb surgery. Therefore, accurately segmenting the structure of BP is very important for anaesthesia during upper limb surgery. The BP datasets of ultrasound images are taken from the 2016 Kaggle competition (Montoya et al., 2016; Zhang & Zhang, 2022). This dataset includes segmentation objects with various sizes. Table 2 shows the mean and standard deviation of five evaluation metrics for different segmentation algorithms. TU-Net achieves a value of 79.59 ± 0.16 for Dice (%), 81.25 ± 0.17 for precision (%), 80.19 ± 0.18 for recall (%), 12.44 ± 11.23 for HD (mm) and 4.29 ± 2.57 for ASD (mm). The segmentation accuracy of our proposed method is better than that of the other methods. The segmentation results of the seven networks are shown in Fig. 4.

Fetal head datasets
Genetic factors and nutrition intake can affect the growth of the fetus. Obstetricians generally monitor fetal health and development by measuring the head circumference of fetuses from ultrasound images. Therefore, the accuracy of head structure segmentation affects the accuracy of head circumference measurement. In the subsequent experiments, the fetal head image datasets are from the medical image segmentation challenge (van den Heuvel et al., 2018b). It contains 788 ultrasound images of fetal heads at various stages. These images are randomly divided into training and test datasets. The training datasets include 718 samples, and the test datasets include 70 samples. Because the number of training samples is small, we apply horizontal flipping, vertical flipping, random scaling and rotation to increase the number of training samples. Ultimately, the number of training samples is 3,590. Meanwhile, all samples are cropped to 320 × 320 for our experiments.
Table 3 shows the mean and standard deviation of five evaluation metrics for different segmentation algorithms. TU-Net achieves a value of 97.94 ± 0.01 for Dice (%), 98.18 ± 0.01 for precision (%), 97.72 ± 0.01 for recall (%), 6.93 ± 2.15 for HD (mm) and 2.97 ± 0.94 for ASD (mm). Our proposed method significantly improves segmentation accuracy. The segmentation results of the seven networks are shown in Fig. 5.

Ablation study
To evaluate the effectiveness of each module in TU-Net, we conducted ablation studies on the two types of datasets. Dice scores are used as the evaluation metric of the following experiments.

Joint loss
Using different loss functions with the same network architecture greatly impacts segmentation performance. Therefore, we adopt joint loss to optimize learning efficiency and improve segmentation accuracy. We test different combinations of loss functions to determine the optimal joint loss in the experiments (e.g., Dice, Dice & CE, Dice & Focal, Dice & CE & TopK and Dice & TopK). Figure 6 displays boxplots for the mean Dice scores of different loss functions in TU-Net. These plots indicate that our proposed method has the highest median. Meanwhile, we observe that segmentation performance is strongly dependent on the combination loss function and that the Dice-related compound loss function has better segmentation accuracy than the other functions. Therefore, the compound loss function of Dice and TopK is the optimal loss function.
Compared with other loss functions, our proposed method can improve the mean Dice values of the BP and fetal head datasets by approximately 1.08% and 0.17%, respectively.

Parallel attention mechanism module
TU-Net adopts a parallel attention mechanism to capture the global and multiscale local feature information of ultrasound images. The parallel attention mechanism consists of a transformer module and a series of atrous convolution and pyramid pooling modules. To demonstrate the advantages of the parallel attention mechanism in TU-Net, we conduct the following experiments on the BP and fetal head datasets. First, we remove the parallel attention mechanism from TU-Net (U-Net). Similarly, we remove the transformer module from the parallel attention mechanism in TU-Net (AU-Net). Finally, we remove the atrous convolution and pyramid pooling module from the parallel attention mechanism in TU-Net (T-Net). Figure 7 displays boxplots for the mean Dice scores of different attention mechanism modules. Figure 7 indicates that our proposed method has the highest median and can improve segmentation accuracy. These experiments indicate that the parallel attention mechanism module can improve the mean Dice scores of the BP and fetal head datasets by approximately 0.80% and 0.13%, respectively.

Input feature
Traditionally, ViT takes image patches as the input of Transformer to learn the global relation of all image patches. In TU-Net, we first obtain high-dimensional feature maps from the successive encoder module. Next, these feature maps are input into the parallel attention mechanism to obtain global and multiscale local feature information. Finally, the

Statistical analysis
We apply statistical analysis to evaluate the performance of different networks. Since the Dice scores do not follow a Gaussian distribution, we use the nonparametric Friedman test to evaluate segmentation performance (Friedman, 2012). The mean rank and p-value are shown in Table 4. A p-value less than 0.05 was considered to indicate a significant difference across the compared algorithms. Higher mean rank values indicate higher segmentation performance. These results show that TU-Net significantly improves segmentation performance compared with other algorithms.
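The mean ranks and the Friedman statistic can be sketched in plain Python as follows. This is an illustration, not the SPSS procedure used in the study: the function name and the toy scores are hypothetical, ties receive average ranks but no tie correction is applied to the statistic, and the p-value (obtained from a chi-square distribution with k − 1 degrees of freedom) is not computed here.

```python
def mean_ranks_and_friedman(scores):
    """scores[i][j] is the Dice score of algorithm j on test image i.

    Returns the mean rank of each algorithm (rank 1 = lowest score in a row)
    and the Friedman chi-square statistic without tie correction.
    """
    n = len(scores)        # number of test images (blocks)
    k = len(scores[0])     # number of algorithms (treatments)
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j])
        ranks = [0.0] * k
        pos = 0
        while pos < k:
            end = pos
            # Extend the run while the next score ties with the current one.
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            average_rank = (pos + end) / 2.0 + 1.0
            for idx in order[pos:end + 1]:
                ranks[idx] = average_rank
            pos = end + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    mean_ranks = [s / n for s in rank_sums]
    statistic = (12.0 / (n * k * (k + 1))) * sum(s * s for s in rank_sums) \
        - 3.0 * n * (k + 1)
    return mean_ranks, statistic


# Hypothetical Dice scores: four images, three algorithms.
toy_scores = [
    [0.70, 0.75, 0.80],
    [0.60, 0.72, 0.79],
    [0.65, 0.70, 0.81],
    [0.68, 0.74, 0.78],
]
toy_mean_ranks, toy_statistic = mean_ranks_and_friedman(toy_scores)
```

Because the third algorithm has the highest Dice score on every toy image, it receives the highest mean rank, matching the convention above that higher mean ranks indicate better segmentation performance.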

DISCUSSION
In this article, we propose a hybrid Transformer and U-Net with a joint loss algorithm for the segmentation of ultrasound images. The proposed algorithm is based on the encoder-decoder architecture and includes encoder, parallel attention mechanism and decoder modules. Meanwhile, we adopt a compound loss function with Dice and TopK to optimize learning efficiency and improve segmentation accuracy. Finally, we use comparison experiments and ablation studies to verify our proposed algorithm on two types of ultrasound image datasets.
In the comparison experiments, we compare two types of segmentation algorithms: one is segmentation algorithms based on CNNs, and the other is segmentation algorithms based on hybrid CNNs and Transformer. Figure 9 shows the bar plots of the mean metric scores of the seven algorithms. It can be observed that the hybrid CNNs and Transformer algorithms perform better than other algorithms based on CNNs because CNNs and Transformers can capture local and global feature information, respectively. This fusion of local and global feature information can improve segmentation accuracy. Meanwhile, in the ablation studies, we find the key components for improved segmentation performance. First, the high-dimensional feature maps obtained by the encoder module can be used as the input of the parallel attention mechanism to improve segmentation accuracy. Second, the parallel attention mechanism can capture the global and multiscale local feature information of ultrasound images. This method can address segmentation targets with various sizes, improving the generalization of the algorithm. Third, the Dice-related compound loss function has better segmentation accuracy than the other loss functions. However, although our proposed method outperforms other methods on the segmentation tasks of ultrasound images, the number of trainable parameters significantly increases. The bar plots of various algorithm parameter sizes are shown in Fig. 10. Therefore, we require more computational resources for training the algorithms. However, we believe this problem can be alleviated as modern GPUs rapidly develop.

CONCLUSIONS
In this article, we propose a hybrid Transformer and U-Net with joint loss to segment ultrasound images. TU-Net is based on the encoder-decoder architecture and includes three modules: the encoder, parallel attention mechanism and decoder modules. We use two types of ultrasound image datasets to verify our proposed method. As shown by comparative experiments, TU-Net significantly improves segmentation performance. Compared with the other six algorithms, on average, TU-Net shows improvements of approximately 3.89%, 1.03%, 4.31%, 39.97% and 32.06% for Dice scores, precision, recall, HD and ASD, respectively. In addition, we verify the performance of different algorithms using the Friedman test of nonparametric statistical analysis. TU-Net obtains the best mean rank in this test. This result indicates that the different algorithms significantly differ. Meanwhile, we conducted a series of ablation studies to verify the effectiveness of TU-Net. These experimental results show that the parallel attention mechanism, joint loss and input of feature maps can effectively improve segmentation accuracy.

Figure 4 Segmentation results of different algorithms on BP datasets. DOI: 10.7717/peerj-cs.1638/fig-4

Figure 6 Boxplots of Dice scores for different joint losses on BP and fetal head datasets. The central mark indicates the median, the bottom and top edges of the box indicate the interquartile range and the whiskers indicate the minimum and maximum. DOI: 10.7717/peerj-cs.1638/fig-6

Figure 7 Boxplots of Dice scores for different attention mechanisms on BP and fetal head datasets. The central mark indicates the median, the bottom and top edges of the box indicate the interquartile range and the whiskers indicate the minimum and maximum. DOI: 10.7717/peerj-cs.1638/fig-7

Figure 9 Bar plots of evaluation metrics for seven segmentation algorithms. The coloured bar represents the mean Dice scores, precision, recall, HD and ASD of each algorithm on the BP and fetal head datasets. DOI: 10.7717/peerj-cs.1638/fig-9

EXPERIMENTAL DETAILS
PyTorch 1.7.0 was applied as the framework to train TU-Net. The optimizer adopts minibatch stochastic gradient descent (SGD) with a weight decay of 0.0001 and momentum of 0.9. The initial learning rate was set to 0.01 and fine-tuned every 100 epochs. The workstation used to train TU-Net was equipped with a 2080Ti graphics card with 11 GB of memory. The proposed TU-Net is compared with other image segmentation algorithms, including U-Net (Ronneberger, Fischer & Brox, 2015), U-Net++ (Zhou et al., 2019), M-Net (Mehta & Sivaswamy, 2017), dilated U-Net (Javaid, Dasnoy & Lee, 2018), MA-Net (Zhang et al., 2020) and TransUNet (Chen et al., 2021).

For the BP datasets, because the test datasets were not released in the competition, the collected training datasets are randomly divided into training datasets and test datasets in our experiments. The training dataset includes 1,710 samples, and the test dataset includes 448 samples. Because the number of training samples is small, we apply horizontal flipping, vertical flipping, random scaling and rotation to increase the number of training samples. Ultimately, the number of training samples is 8,550. Meanwhile, all samples are cropped to 320 × 320 for our experiments.

Cai et al. (2023), PeerJ Comput. Sci., DOI 10.7717/peerj-cs.1638

Table 1
The hyperparameters of algorithms.

Table 2
The mean and standard deviation of five evaluation metrics for BP datasets.
Note: The best result is highlighted with bold.

Table 3
The mean and standard deviation of five evaluation metrics for the fetal head datasets.
Note: The best result is highlighted with bold.

Table 4
The mean rank of the Dice scores of different networks on BP and fetal head datasets.
Note: The best result is highlighted with bold.