DSCA-Net: A depthwise separable convolutional neural network with attention mechanism for medical image segmentation

Abstract: Accurate segmentation is a basic and crucial step for medical image processing and analysis. In the last few years, U-Net and its variants have become widely adopted models in medical image segmentation tasks. However, the large number of training parameters in these models leads to high computational complexity, which is impractical for further applications. In this paper, by introducing depthwise separable convolution and attention mechanisms into a U-shaped architecture, we propose a novel lightweight neural network (DSCA-Net) for medical image segmentation. Three attention modules are created to improve its segmentation performance. First, a Pooling Attention (PA) module is utilized to reduce the loss caused by consecutive down-sampling operations. Second, to capture critical context information, we propose a Context Attention (CA) module, based on attention mechanisms and convolution operations, in place of simple concatenation. Finally, a Multiscale Edge Attention (MEA) module is used to emphasize representative multi-scale edge features for the final prediction. The number of parameters in our network is 2.2 M, which is 71.6% less than U-Net. Experimental results across four public datasets show its potential; the Dice coefficient is improved by 5.49% for ISIC 2018, 4.

DeepLabV3+ [31] applied depthwise separable convolution to the ASPP module, which yielded a faster and more powerful network for semantic image segmentation. X-Net [32] adopted it to scale the network size down and performed well. MobileNetV3-UNet [33] created a lightweight encoder and decoder architecture based on depthwise separable convolution, which achieved high accuracy on medical image segmentation tasks.
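Since depthwise separable convolution underpins all of these designs as well as ours, the following minimal PyTorch sketch (not code from any of the cited works) illustrates the parameter savings it brings: a standard convolution is factorized into a per-channel (depthwise) filter plus a 1 × 1 (pointwise) mixing step.

```python
import torch.nn as nn

def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

# standard 3x3 convolution, 64 -> 128 channels
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)

# depthwise separable equivalent: per-channel 3x3 filter + 1x1 channel mixing
separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                       # pointwise
)

print(count_params(standard))   # 73,856
print(count_params(separable))  # 8,960 -> roughly 8.2x fewer parameters
```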
Combining the advantages of attention mechanisms and depthwise separable convolution in a U-shaped architecture, a lightweight DSCA-Net is proposed in this paper for medical image segmentation. Three novel attention modules are proposed and integrated into the encoder and decoder of U-Net. The chief contributions of our work are summarized as follows:
1) A Pooling Attention module is proposed to reduce the feature loss caused by down-sampling.
2) A Context Attention module is designed to exploit the concatenated feature maps from the encoder and decoder; it combines spatial and channel attention mechanisms to focus on useful position features.
3) To make better use of multi-scale information from different stages of the decoder, a Multiscale Edge Attention module is proposed to process the combined features for the final prediction.
4) We integrate all proposed modules into DSCA-Net for medical image segmentation, with all convolution operations implemented as depthwise separable convolutions. The proposed network was evaluated on four public datasets, and the experimental results reveal that it outperforms previous state-of-the-art frameworks.
The remainder of our paper is structured as follows: Section 2 details the proposed DSCA-Net architecture, Section 3 describes the experimental settings and results, and discussions and conclusions are given in Sections 4 and 5.

By combining the attention mechanism and depthwise separable convolution with the architecture of U-Net, we propose DSCA-Net, which is shown in Figure 1. The network is composed of an encoding part, a decoding part, and a multiscale edge part. First, we replace the stacked 3 × 3 convolution layers of U-Net with the DC module. The depth of the encoder is 128, which enables our model to extract abundant features while reducing the parameter count. Second, to reduce feature loss, the PA module is embedded in place of each maximum pooling layer, with almost no effect on the number of parameters. Then, long-range skip connections transfer feature maps from the encoder to the symmetrical decoder stage after passing through the CA module, which fuses and recalibrates context information at five resolution levels. Finally, the MEA module re-emphasizes salient scale information in the concatenated multiscale feature maps, which makes the last CNN layer aware of the segmentation target's edge.

Dense convolution module

Recent studies show that extending the network depth leads to better segmentation performance [28,34]. Based on the depthwise separable convolution operation [28] and DenseNet [35], we propose the dense convolution (DC) module. We utilize it in the encoder to extract high-dimensional feature information and in the decoder to recover segmented target details. As shown in Figure 2, every depthwise separable convolution layer is followed by one group normalization [36] and a LeakyReLU [37], which improves the nonlinear expression capability of the model. For convenience, we denote the input as x ∈ ℝ^{C×H×W}, where C, H, W denote channel, height, and width, respectively. At the beginning, one 1 × 1 convolution layer f^{1×1} doubles the channel number. Then, residual connections from all former layers are summed and fed to each subsequent layer, which consists of two consecutive 3 × 3 convolution layers f^{3×3}. The element-wise summations fuse the extracted information without adding parameters. The DC module is described by the following equations:

x_0 = f^{1×1}(x),
x_l = f^{3×3}(f^{3×3}(x_0 ⊕ x_1 ⊕ ⋯ ⊕ x_{l−1})), l = 1, …, L,

where x ∈ ℝ^{C×H×W} denotes the input feature map and x_l ∈ ℝ^{2C×H×W} represents the feature maps of layer l.
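For concreteness, below is a minimal PyTorch sketch of a DC module consistent with this description; the number of dense layers and the group-norm group count are our assumptions, not the authors' released code.

```python
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable 3x3 convolution followed by GroupNorm and
    LeakyReLU, as described for the DC module (group count assumed)."""
    def __init__(self, ch, groups=8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),  # depthwise
            nn.Conv2d(ch, ch, 1),                        # pointwise
            nn.GroupNorm(groups, ch),
            nn.LeakyReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class DenseConvModule(nn.Module):
    """Sketch of the DC module: a 1x1 convolution doubles the channels,
    then each layer applies two 3x3 separable convolutions to the
    element-wise sum of all preceding layer outputs."""
    def __init__(self, in_ch, num_layers=3):
        super().__init__()
        ch = in_ch * 2
        self.expand = nn.Conv2d(in_ch, ch, 1)
        self.layers = nn.ModuleList(
            nn.Sequential(DSConv(ch), DSConv(ch)) for _ in range(num_layers)
        )
    def forward(self, x):
        outputs = [self.expand(x)]
        for layer in self.layers:
            outputs.append(layer(sum(outputs)))  # dense residual summation
        return outputs[-1]
```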

Pooling attention module
Consecutive pooling operations in the encoder enlarge the receptive field of convolution but lose certain features. Therefore, we rethink SE-Net [22] and ECA-Net [23] and propose the PA module to replace the original pooling layer, as shown in Figure 3. The PA module adopts a two-branch structure: one branch obtains a channel attention vector and the other rescales the height and width of the feature maps. First, a 1D convolution f^{1D} with a shared kernel of size 5 extracts richer feature information after adaptive maximum pooling P_max and adaptive average pooling P_avg layers. Then, the two vectors are summed element-wise and activated by the Sigmoid function σ. Finally, the result is multiplied with the rescaled feature maps x̃. The PA module can be expressed as follows:

PA(x) = σ(f^{1D}(P_max(x)) ⊕ f^{1D}(P_avg(x))) ⊗ x̃,

where x ∈ ℝ^{C×H×W} denotes the input feature maps, and ⊕ and ⊗ denote element-wise summation and element-wise production, respectively.
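A possible PyTorch realization of the PA module under these definitions is sketched below; the choice of average pooling as the rescaling operator and the tensor layout fed to the shared 1D convolution are our assumptions.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Sketch of the PA module: one branch builds a channel-attention
    vector from max- and average-pooled descriptors passed through a
    shared 1D convolution (kernel size 5, ECA style); the other branch
    rescales the feature map to the pooled resolution."""
    def __init__(self):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=5, padding=2, bias=False)
        self.sigmoid = nn.Sigmoid()
        self.rescale = nn.AvgPool2d(2)  # halves H and W, like the pooling it replaces

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel descriptors: (B, C) -> (B, 1, C) for the shared 1D conv
        max_d = torch.amax(x, dim=(2, 3)).unsqueeze(1)
        avg_d = torch.mean(x, dim=(2, 3)).unsqueeze(1)
        attn = self.sigmoid(self.conv1d(max_d) + self.conv1d(avg_d))  # (B, 1, C)
        attn = attn.transpose(1, 2).reshape(b, c, 1, 1)
        return attn * self.rescale(x)  # element-wise product with rescaled maps
```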

Context attention module
In the process of context information extraction, the simple concatenation of U-Net is not sufficient to gradually restore the needed information. Drawing lessons from dynamic weight similarity calculation, we propose the CA module to fuse context information, as shown in Figure 4.
Let e ∈ ℝ^{C×H×W} and d ∈ ℝ^{C×H×W} represent feature maps from the encoder and decoder, respectively. First, we obtain f ∈ ℝ^{2C×H×W} by concatenating e and d from the upper decoder layer. Then, to capture detailed context information, the CA module adopts a three-branch structure consisting of a spatial attention branch, a channel attention branch, and a convolution branch, whose outputs have the same dimensions as e and d. The learned feature maps from the spatial attention branch f_s ∈ ℝ^{C×H×W} and the channel attention branch f_c ∈ ℝ^{C×H×W} multiply the convolutional feature maps f_conv ∈ ℝ^{C×H×W}, separately. Finally, the two products are concatenated and one 1 × 1 convolution f^{1×1} reconstructs the output y ∈ ℝ^{C×H×W}:

f = [e, d],
y = f^{1×1}([f_s ⊗ f_conv, f_c ⊗ f_conv]),

where [•] denotes concatenation along the channel dimension, and P_avg and P_max denote the adaptive average and maximum pooling operations used within the attention branches. To process the bottom feature information, we use a deformable CA module that captures context information with a single input.
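The following sketch shows one way to realize the CA module in PyTorch; the branch internals (a 7 × 7 convolution for the spatial branch, global average pooling for the channel branch) are our assumptions, and the single-input deformable variant is not sketched.

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Sketch of the CA module: encoder and decoder maps are concatenated,
    then processed by spatial-attention, channel-attention, and plain
    convolution branches; each attention map reweights the convolutional
    features, the two results are concatenated, and a 1x1 convolution
    restores the channel count."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Conv2d(2 * ch, ch, 3, padding=1)
        # spatial attention: compress channels to a single H x W weight map
        self.spatial = nn.Sequential(nn.Conv2d(2 * ch, 1, 7, padding=3), nn.Sigmoid())
        # channel attention: global pooling to a per-channel weight vector
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid()
        )
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, enc, dec):
        f = torch.cat([enc, dec], dim=1)          # (B, 2C, H, W)
        conv = self.conv(f)                       # (B, C, H, W)
        s = self.spatial(f) * conv                # spatially reweighted
        c = self.channel(f) * conv                # channel reweighted
        return self.fuse(torch.cat([s, c], dim=1))
```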

Multiscale edge attention module
U-Net uses the decoder to restore the category of each pixel. However, segmented objects with large scale variation and blurred edges increase the difficulty of accurate segmentation. The pixel positions of target edges differ slightly across the decoder's feature-map scales, and the high-level feature maps in the decoder contain richer edge information. It is therefore desirable to learn scale-dynamic weights over all fused feature-map pixels to calibrate the object edge. To utilize multiscale feature maps, we propose the MEA module, as shown in Figure 5. First, we use bilinear up-sampling layers with different scale factors (s = 1, 2, 4, 8) to unify the feature maps obtained from the decoder to the final output size, and concatenate them. Then, to learn scale-dynamic features, one 1 × 1 convolution and the Sigmoid function generate calibration weights, which are multiplied with the original input to obtain y ∈ ℝ^{C×H×W}. The MEA module can be described as follows:

x_cat = [U_1(x_1), U_2(x_2), U_4(x_3), U_8(x_4)],
y = σ(f^{1×1}(GN(x_cat))) ⊗ x_cat,

where U_s(⋅) denotes the resampling function with scale factor s, x_cat denotes the concatenated feature map, and GN denotes group normalization.
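A compact sketch of the MEA module under these equations follows; the per-stage channel widths and the group-norm group count are placeholders, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleEdgeAttention(nn.Module):
    """Sketch of the MEA module: decoder outputs at four resolutions are
    bilinearly upsampled (scale factors 1, 2, 4, 8) to the output size,
    concatenated, and a 1x1 convolution + Sigmoid produces pixel-wise
    calibration weights that reweight the fused features."""
    def __init__(self, chs=(16, 32, 64, 128)):
        super().__init__()
        total = sum(chs)
        self.norm = nn.GroupNorm(8, total)
        self.weight = nn.Sequential(nn.Conv2d(total, total, 1), nn.Sigmoid())

    def forward(self, feats):
        # feats: decoder maps, finest first, matching scale factors (1, 2, 4, 8)
        scales = (1, 2, 4, 8)
        up = [
            f if s == 1 else F.interpolate(f, scale_factor=s, mode="bilinear",
                                           align_corners=False)
            for f, s in zip(feats, scales)
        ]
        fused = self.norm(torch.cat(up, dim=1))
        return self.weight(fused) * fused  # scale-dynamic recalibration
```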

Experiments and results
To assess our proposed network, we validated DSCA-Net and compared it with other state-of-the-art methods on four public datasets: the ISIC 2018 dataset [5,38], a thyroid gland segmentation dataset [39], a lung segmentation (LUNA) dataset, and a nuclei segmentation (TNBC) dataset [17]. Each dataset poses its own challenge, and corresponding samples are shown in Figure 6. On each task, we compared the results with state-of-the-art networks and conducted ablation studies to demonstrate the effectiveness of our modules, as discussed in Sections 3.3–3.6.

Experiment setup
All models in this paper were implemented in PyTorch, and the experimental platform ran Ubuntu 18.04 with an Intel Xeon CPU @ 2.30 GHz, 27 GB RAM, and a 16 GB Nvidia Tesla P100-PCIE GPU. The Adam optimizer [40] was used with weight decay, and the learning rate was decayed by a factor of 0.5 every 100 epochs. We utilized the soft Dice loss for model training and kept the model with the best result on the validation set; quantitative results were obtained on the test set.
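For reference, here is a minimal soft Dice loss together with an optimizer and schedule matching this setup; the smoothing constant, the sigmoid/binary formulation, and the placeholder model are our assumptions (the paper does not state the initial learning rate here).

```python
import torch
import torch.nn as nn

def soft_dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss for binary segmentation (our formulation)."""
    probs = torch.sigmoid(logits)          # probabilities from raw outputs
    dims = (1, 2, 3)                       # sum over channel and spatial dims
    intersection = (probs * targets).sum(dims)
    union = probs.sum(dims) + targets.sum(dims)
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()

# Adam with the stated step schedule (0.5 decay every 100 epochs):
model = nn.Conv2d(3, 1, 3, padding=1)      # placeholder for DSCA-Net
optimizer = torch.optim.Adam(model.parameters())
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)
```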
To maximize the use of the GPU, the batch sizes were set to 8, 4, 12, and 2 for the ISIC, thyroid gland, LUNA, and TNBC datasets, respectively. To better fit the data, the number of training epochs was 500 for the TNBC dataset and 300 for the others; training stopped automatically after the maximum epoch. We used five-fold cross-validation to assess the stability and effectiveness of DSCA-Net. Every input image was normalized from [0, 255] to [0, 1]. During model training, random rotation with an angle in (−π, π) and random flipping, each applied with a probability of 0.5, were used for data augmentation.
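A paired-augmentation sketch matching this description is shown below; the torchvision-based implementation is our assumption, not the authors' pipeline.

```python
import math
import random
import torchvision.transforms.functional as TF

def augment(image, mask, p=0.5):
    """Apply the same random flip and random rotation (angle drawn from
    (-pi, pi), converted to degrees) to an image and its mask, each with
    probability p, mirroring the setup described above."""
    if random.random() < p:
        image, mask = TF.hflip(image), TF.hflip(mask)
    if random.random() < p:
        angle = math.degrees(random.uniform(-math.pi, math.pi))
        image = TF.rotate(image, angle)
        mask = TF.rotate(mask, angle)
    return image, mask
```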

Evaluation metrics
In this paper, the Dice coefficient (Dice), Intersection over Union (IoU), accuracy (Acc), specificity (Spec), sensitivity (Sens), and average symmetric surface distance (ASSD) are used as evaluation metrics:

Dice = 2TP / (2TP + FP + FN),
IoU = TP / (TP + FP + FN),
Acc = (TP + TN) / (TP + TN + FP + FN),
Spec = TN / (TN + FP),
Sens = TP / (TP + FN),

where TP, TN, FP, FN represent the numbers of predicted true positive, true negative, false positive, and false negative pixels, respectively. Assuming A and B are the sets of border points from the prediction result and the corresponding label, respectively, ASSD is defined as:

ASSD = (Σ_{a∈A} d(a, B) + Σ_{b∈B} d(b, A)) / (|A| + |B|),

where d(a, B) = min_{b∈B} ‖a − b‖ represents the shortest Euclidean distance between point a and set B.
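The overlap-based metrics follow directly from the pixel counts; a small NumPy sketch is given below (ASSD, which additionally requires border-point extraction, is omitted).

```python
import numpy as np

def confusion_counts(pred, label):
    """Binary masks -> TP, TN, FP, FN pixel counts."""
    pred, label = pred.astype(bool), label.astype(bool)
    tp = np.sum(pred & label)
    tn = np.sum(~pred & ~label)
    fp = np.sum(pred & ~label)
    fn = np.sum(~pred & label)
    return tp, tn, fp, fn

def metrics(pred, label, eps=1e-6):
    tp, tn, fp, fn = confusion_counts(pred, label)
    return {
        "Dice": 2 * tp / (2 * tp + fp + fn + eps),
        "IoU": tp / (tp + fp + fn + eps),
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "Spec": tn / (tn + fp + eps),
        "Sens": tp / (tp + fn + eps),
    }
```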

Skin lesion segmentation

The 2018 skin lesion segmentation dataset has 2594 images with corresponding labels [5,38]. We randomly divided the dataset at a ratio of 7:1:2 into 1815, 261, and 520 images for training, validation, and testing, respectively. The original image sizes vary from 720 × 540 to 6708 × 4439; to facilitate training, all images and corresponding masks were cropped to 256 × 256.
Skin lesion segmentation samples from our proposed network and U-Net are shown in Figure 7. U-Net performs unsatisfactorily compared with DSCA-Net even on regular skin lesion images. When the skin lesion has a color similar to its surroundings or is occluded by hair and tissue fluid, U-Net produces erroneous segmentation results: the more blurred the lesion boundary, the more incorrect U-Net's segmentation. Comparatively, DSCA-Net performs better. To fully confirm the validity of our method, we compared DSCA-Net with U-Net [10], Attention U-Net [27], RefineNet [41], EOCNet [42], CA-Net [11], DeepLabV3+ [31], MobileNetV3-UNet [33], and IBA-U-Net [44] on this dataset. The results are listed in Table 1. Our model achieves an Acc of 0.9532, which is 0.0755 higher than U-Net and 0.0053 higher than the second-place method, MobileNetV3-UNet. Although its Dice is 0.0002 lower than DeepLabV3+, the difference is not significant. Our model has 3.53, 15.85, 24.86, 1.27, 3.77, and 6.32 times fewer parameters than U-Net, Attention U-Net, DeepLabV3+, CA-Net, MobileNetV3-UNet, and IBA-U-Net, respectively, while delivering better overall segmentation performance.
Table 2 lists the ablation results. The lightweight U-Net baseline was obtained by replacing the original stacked convolution layers of U-Net with depthwise separable convolutions, and DSCA-Net is the network with all designed modules added. The quantitative results show that the proposed modules strengthen the feature extraction ability, and every module improves segmentation performance. Moreover, Backbone + DC + PA and Backbone + DC + PA + CA already show better segmentation results than U-Net.

Thyroid gland segmentation
The thyroid public dataset [39] was acquired with a GE Logiq E9 XDclear 2.0 system equipped with a GE ML6-15 ultrasound probe and Ascension driveBay electromagnetic tracking. The data were taken from healthy thyroid records, and the volumes were recorded straight from the ultrasound imaging instrument in DICOM format. The matching labels, produced by a medical expert, include the isthmus as part of the segmented region. To train our model, we split the volumes into 3998 individual labeled slices and randomly used 2798 images for training, 400 for validation, and 800 for testing, a ratio of 7:1:2. Inputs were randomly cropped to 256 × 256.
Figure 8 presents several test segmentation results on the thyroid gland dataset. The edge of the thyroid gland and the background usually contain outliers and visually similar regions that are not related to our interest. Observation shows that U-Net under-segments the thyroid isthmus, while DSCA-Net segments it better.
We also tested DSCA-Net against three methods: SegNet [8], SUMNet [14], and Attention U-Net [27]. Quantitative evaluation results are presented in Table 3. Compared with U-Net, Dice increases from 0.9332 to 0.9727 (by 4.2%), Sens increases from 0.9526 to 0.9873 (by 3.6%), and Spec increases from 0.9169 to 0.9921 (by 8.2%). Our model has 51.05 times fewer parameters than SegNet while performing better on the evaluation metrics.
Additionally, Table 4 presents the quantitative results of the ablation study on thyroid segmentation. DSCA-Net scores the best performance on every metric. Dice increases significantly after adding the CA module, which indicates that the CA module efficiently extracts context information and benefits thyroid segmentation.

Lung segmentation
The lung segmentation task requires segmenting lung structures from CT images of the Lung Nodule Analysis (LUNA) competition. The dataset contains 534 2D CT samples with corresponding labels. The original resolution is 512 × 512, and we randomly cropped the images to 256 × 256. 70%, 10%, and 20% of the dataset (374, 53, and 107 images) are allocated for training, validation, and testing, respectively.
From the visualization results shown in Figure 9, DSCA-Net performs better than U-Net in detailed edge processing. Affected by the noise of lung CT images, U-Net produces some erroneous segmented areas; DSCA-Net shows a greater tolerance to noise. We further demonstrate the validity of our approach by achieving a promising improvement despite the relatively simple task. We compared our network with [4], RU-Net [15], and R2U-Net [15]. Table 5 demonstrates that all methods achieve excellent performance on the four metrics; our network reaches 0.9828 in Dice, 0.9920 in Acc, 0.9836 in Sens, and 0.9895 in Spec, better than U-Net. Despite DSCA-Net's slightly lower Spec compared with R2U-Net, our model has 1.9 times fewer parameters while the other three metric scores are higher. Note that t in Table 5 denotes the recurrent convolution time-step. Table 6 shows the ablation results on the lung segmentation dataset. Adding the designed modules in sequence, each proposed module improves the segmentation performance of DSCA-Net. Backbone + DC + PA + CA exceeds U-Net by 0.0138 in Dice, and DSCA-Net shows the best performance in the Dice, IoU, and ASSD metrics.

Nuclei segmentation

The last application is nuclei segmentation on the Triple-Negative Breast Cancer (TNBC) dataset, which has 50 images of size 512 × 512 from 11 patients. To avoid overfitting during training, we used data augmentation to expand the dataset to a total of 500 images, including random flipping, random cropping, and random rotation with an angle in (−π, π); each augmentation triggers with a probability of 0.5. As before, we adopted a split ratio of 7:1:2, with 350, 50, and 100 images for training, validation, and testing. Figure 10 illustrates comparative prediction results of our designed network and U-Net on the TNBC dataset. DSCA-Net performs better than U-Net, although incorrect results are still obtained in some areas, as shown in the second row. Obscure color-transition areas and overlapping nuclei increase the segmentation difficulty; for relatively easy targets, our network performs better. Additionally, we compared DSCA-Net with other networks: U-Net [10], DeconvNet [17], Ensemble [17], Kang et al. [19], DeepLabV3+ [43], and Up-Net-N4 [16]. The comparison results are shown in Table 7. Although Sens is 0.0456 lower than Ensemble, the combination of attention mechanisms and data augmentation allows DSCA-Net to score higher than state-of-the-art methods in Dice and Acc. Our model has 233.13 and 3.36 times fewer parameters than DeconvNet and Up-Net-N4, respectively.
According to the quantitative evaluation results, Table 8 demonstrates the effectiveness of our proposed modules. After adding the MEA module, the network performs better, which indicates that the segmented edges are closer to the label with less error.

Discussion
To lighten the network's parameters while maintaining performance, we take full advantage of U-Net and integrate the designed modules into DSCA-Net for 2D medical image segmentation. First, the DC module replaces the stacked convolutional layers of U-Net for feature extraction and restoration. Second, the PA module is designed to recover the feature loss from down-sampling. Third, the CA module substitutes the simple concatenation operation in U-Net to extract richer context information. In addition, the MEA module is proposed to refine segmented target edges from multi-scale decoder information for the final prediction. Comparisons with other state-of-the-art networks under multiple evaluation metrics show that DSCA-Net performs better.
Multi-group visualized experimental results are shown in Figures 7–10, from which we can conclude that our model is more robust than U-Net. Even for blurred edge details and occlusions in microscopy images, our network distinguishes the segmentation target correctly. For the most challenging task, TNBC, the similarity of adherent nuclei and subtle color changes combined with large morphological variation increase the segmentation difficulty; our proposed network still achieves better results than the other networks. However, it needs further development.

Conclusions
The target of this study is to lighten the parameters of the network while maintaining good performance. We design a lightweight depthwise separable convolutional neural network with attention mechanisms, named DSCA-Net, for accurate medical image segmentation. Compared with U-Net, our network extracts richer feature information and reduces feature loss during segmentation. We assessed the network on four datasets and compared segmentation results against state-of-the-art networks under various metrics. The visualized and quantitative results show that our network has better segmentation ability. We intend to extend DSCA-Net to 3D image segmentation in the future.