AEDCN-Net: Accurate and Efficient Deep Convolutional Neural Network Model for Medical Image Segmentation

Image segmentation has been significantly enhanced by the emergence of deep learning (DL) methods. In particular, deep convolutional neural networks (DCNNs) have helped DL-based segmentation models achieve state-of-the-art performance in fields critical to human beings, such as medicine. However, the existing state-of-the-art methods often use computationally expensive operations to achieve high accuracy, whereas lightweight networks often fail to produce precise medical image segmentation. Therefore, this study proposes an accurate and efficient DCNN model (AEDCN-Net) based on an elaborate preprocessing step and a resourceful model architecture. The AEDCN-Net exploits bottleneck, atrous, and asymmetric convolution-based residual skip connections in the encoding path, which reduce the number of trainable parameters and floating point operations (FLOPs) while learning feature representations with a larger receptive field. The decoding path employs nearest-neighbor upsampling instead of the computationally expensive transposed convolution operation, which requires an extensive number of trainable parameters. The proposed method attains superior performance in both computational time and accuracy compared to the existing state-of-the-art methods. Benchmarking on four real-life medical image datasets shows that the AEDCN-Net converges faster than the computationally expensive state-of-the-art models while using significantly fewer trainable parameters and FLOPs, which results in a considerable speed-up during inference. Moreover, the proposed method obtains better accuracy in several evaluation metrics compared with the existing lightweight and efficient methods.


I. INTRODUCTION
Image segmentation is a computer vision task that categorizes an input image or a video frame into a pre-defined number of classes by generating non-intersecting and easily interpretable sections of the input that are beneficial for further processing. The image segmentation task is considerably more complex than other computer vision tasks, such as image classification, because image classification categorizes an input by processing the entire image [1], whereas image segmentation generates an output for every single image pixel.
The associate editor coordinating the review of this manuscript and approving it for publication was Anubha Gupta.
Image segmentation has numerous real-life applications, including video surveillance [2], augmented reality [3], and driverless cars [4]. The most beneficial and noteworthy application of image segmentation is in the field of medicine, where it provides a detailed illustration of the human body for anatomical analysis, detects illnesses, and identifies the severity level of a disease, to name a few uses [5]. Medical image segmentation is directly associated with a person's health and life; hence, it must be very accurate to prevent a disease or cure an illness [6], [7], [9], [10].
Based on the input specifications, the image segmentation task can broadly be divided into two distinct groups: binary and multiclass image segmentation. Binary segmentation has two available categories, namely background and foreground. Some of the applications of medical image segmentation belong to this group [9], [10]. On the other hand, multiclass segmentation has more than two classes, as in semantic segmentation for autonomous driving applications [11].
Owing to the notable performance of deep learning (DL) methods, artificial intelligence (AI) systems have been shown to outperform humans in image classification tasks [13], [14]. While a person can compete with an AI system in image classification, doing so is infeasible in image segmentation because of the significantly more complex nature of the task: pixel-by-pixel classification is prohibitively tedious given the enormous quantity of data in modern medical images. Therefore, generating precisely segmented medical images using AI techniques is becoming a research hotspot [16].
Owing to the importance of DL methods for medical image segmentation, extensive research has been conducted in this domain. The most popular DL model architecture in this field is U-Net [17]. Since its introduction in 2015, researchers have proposed DL-based networks that achieve state-of-the-art performance [9], [16], [18]-[22]. However, some of these models [16], [18]-[20], [22] perform complex computations, which makes them unusable on machines with limited computational resources. In addition, these computationally expensive models require an extremely long training time. Some efficient models [9], [21], in turn, cannot attain state-of-the-art performance and cannot generate accurate medical image segmentation. These problems should be addressed to ensure further progress in medical image segmentation. Considering the existing shortcomings, we propose herein an accurate and efficient deep convolutional neural network (DCNN) model, called AEDCN-Net, which alleviates the current issues by reducing the number of trainable parameters and the training/inference time while improving the medical image segmentation accuracy. The contributions of this study are fourfold:
• The AEDCN-Net benefits from bottleneck, atrous, and asymmetric convolution-based skip connections in the encoding path and the nearest-neighbor interpolation method in the decoding path, which significantly reduces the number of trainable model parameters.
• Owing to its carefully designed architecture, AEDCN-Net is, on average, 40% faster than the existing computationally expensive methods that achieve state-of-the-art performance in medical image segmentation.
• Although AEDCN-Net demands fewer trainable parameters and less training time, it attains superior accuracy and generates more precise segmented medical images than its counterparts.
• To the best of our knowledge, no previously proposed model has outperformed the existing methods in both computational efficiency and segmentation accuracy. Therefore, the proposed model can be used as a benchmark for further studies in the medical image segmentation domain.
The rest of this paper is structured as follows: Section II reviews the existing methods in medical image segmentation; Section III explains the proposed methodology in detail; Section IV presents the experimental details; Section V discusses the results of the experiments and the qualitative comparison of the considered models; and finally, Section VI concludes this study and presents future study directions.

II. RELATED WORK
This section summarizes the currently available methods used in medical image segmentation. Based on their characteristics, these techniques can broadly be categorized into computationally expensive and powerful models and lightweight and efficient models.

A. COMPUTATIONALLY EXPENSIVE AND POWERFUL MODELS FOR SEMANTIC SEGMENTATION
After the introduction of convolutional neural network (CNN) models in computer vision tasks, considerable progress has been observed in medical image segmentation accuracy. The most notable DL-based network is U-Net [17], a fully convolutional encoder-decoder architecture for biomedical image segmentation. The existing DL-based methods attaining state-of-the-art performance in medical image segmentation have a model architecture similar to that of U-Net [23]; more precisely, they are enhanced U-Net variants. For example, Zhou et al. proposed a novel encoder-decoder architecture that uses blocks of nested, dense skip connections [20]. These pathways reduce the semantic gap between the feature maps of the encoder and decoder sub-networks, which helped the model significantly outperform the existing methods. Isensee et al. developed a robust and self-adapting framework on the basis of the original U-Net architecture [19]. The network benefits from the leaky rectified linear unit activation function and instance normalization to achieve a performance better than that of the original U-Net. Li et al. improved the U-Net architecture with residual connections by increasing the network depth and adding strong dropout to extract finer features, which allowed state-of-the-art performance in fundus image segmentation [18]. Similarly, Jha et al. developed the ResUNet++ model architecture using a conditional random field and test-time augmentation, which achieved superior performance compared with the existing DL-based networks on various polyp segmentation datasets [22]. Although these models exhibit superior accuracy and precision in medical segmentation, they require an enormous number of trainable parameters; therefore, they are computationally expensive.

B. EFFICIENT AND LIGHTWEIGHT MODELS FOR MEDICAL IMAGE SEGMENTATION
To devise efficient DL-based models, Mehta et al. introduced a lightweight network that employs group point-wise and depth-wise dilated separable convolutions to achieve state-of-the-art performance in semantic segmentation [21]. Similarly, [24] and [25] used compression techniques, such as vector quantization, to increase the speed of semantic segmentation models. Punn et al. also presented an inception U-Net architecture [26] inspired by [27]. This network illustrates the model perception of target segmentation images using activation maximization and filter map visualization techniques and attains superior accuracy. Gadosey et al. developed a modified version of U-Net based on bottleneck layers for devices with low computational power [28]. They used depth-wise separable convolutions in the entire network. In addition, the model benefited from a weight standardization algorithm with the group normalization method. These modifications made the model computationally efficient and lightweight. Similarly, Olimov et al. presented a fast U-Net (FU-Net) model relying on bottleneck convolution layers in the encoding and decoding paths of the model, which allowed medical image segmentation on devices with limited computational power and memory [9]. Although these models address the problem of efficient computation, they do not provide highly accurate segmented images.
VOLUME 9, 2021
FIGURE 1. Graphical illustration of the proposed methodology.

III. PROPOSED METHODOLOGY
This section presents AEDCN-Net in detail. Figure 1 shows an overview of the proposed methodology. AEDCN-Net has three distinct stages: data preprocessing, data learning, and inference.

A. DATA PREPROCESSING
In data preprocessing, raw medical images are prepared for training the DCNN model. First, the images are resized to 256 × 256 to match the network input size. The image ranges are preserved, and the outside boundary pixels are filled with a constant value of 0 [9]. After obtaining images of the same size, their colors are transformed from three channels (i.e., red, green, and blue) to a single-channel grayscale mode. This process reduces the computational complexity of the DCNN model with almost no impact on its accuracy: because grayscale images are used for training, the number of trainable parameters in the first convolutional layer is reduced by a factor of three. After obtaining the grayscale images, we standardize the data so that they have zero mean and unit variance. For this purpose, we employ (1), where $X$ and $X_{std}$ are the original and standardized data, respectively, and $i$ and $M$ are the index of a particular data point and the total number of instances, respectively.
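The standardization step can be illustrated with a minimal sketch. It assumes that a single mean and standard deviation are computed over all training values and then reused for unseen data; the function names and the small `eps` guard are illustrative choices, not part of the original method.

```python
import math

def fit_standardizer(train_values):
    """Return (mean, std) computed over all M training values."""
    m = len(train_values)
    mean = sum(train_values) / m
    variance = sum((x - mean) ** 2 for x in train_values) / m
    return mean, math.sqrt(variance)

def standardize(values, mean, std, eps=1e-8):
    """Shift and scale values with statistics fitted on the training data.

    eps is an assumed safeguard against division by zero.
    """
    return [(x - mean) / (std + eps) for x in values]

# Toy grayscale intensities standing in for flattened training images.
train = [0.0, 64.0, 128.0, 255.0]
mean, std = fit_standardizer(train)
train_std = standardize(train, mean, std)

# Unseen data reuse the training statistics instead of their own.
test_std = standardize([32.0, 200.0], mean, std)
```

Fitting the statistics once and reusing them keeps the data seen at inference on the same scale as the data seen during training.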
Most medical image databases suffer from data scarcity problems [9]. To alleviate this issue, we applied data augmentation based on the characteristics of the medical image data after completing the data standardization process. The data augmentation techniques should be chosen carefully based on the dataset image characteristics; otherwise, they can result in a low performance of the DCNN model in the data learning stage. The data augmentation is a part of the preprocessing stage and is pre-computed before starting the data learning stage. The data augmentation is conducted only once before the training stage, and every epoch in the learning phase uses the same augmented images. We used the following data augmentation techniques:
• Horizontally flipping the images;
• Randomly shifting the image dimensions in the range of integer value x;
• Zooming the images in the range of random integer value x;
• Randomly changing the angle of images by an integer value of y.
In the proposed method, we used x values ranging from -10% to 10% and y values ranging from -5% to 5% because they resulted in the best performance of AEDCN-Net in the conducted experiments. Since we used four augmentation techniques, the proposed model uses four times more images per epoch in comparison to the original number of images in the datasets. Each epoch in the training process uses slightly different versions of the original images in the dataset, which results in better generalizability of the model.
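Two of the listed augmentations, horizontal flipping and random shifting with zero fill, can be sketched as follows. This is a simplified illustration on nested lists rather than the pipeline used in the experiments; zooming and rotation are omitted, and applying the identical geometric transform to the image and its mask is an assumption that any segmentation pipeline needs in order to keep the pair aligned.

```python
import random

def horizontal_flip(img):
    """Mirror every row of a 2-D image."""
    return [row[::-1] for row in img]

def shift(img, dx, dy, fill=0):
    """Translate by (dx, dy) pixels; uncovered pixels receive `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            nr, nc = r + dy, c + dx
            if 0 <= nr < h and 0 <= nc < w:
                out[nr][nc] = img[r][c]
    return out

def augment_pair(image, mask, rng, max_frac=0.10):
    """Apply the same random flip and shift to an image and its mask."""
    if rng.random() < 0.5:
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    h, w = len(image), len(image[0])
    dx = round(rng.uniform(-max_frac, max_frac) * w)  # within +/-10% of width
    dy = round(rng.uniform(-max_frac, max_frac) * h)  # within +/-10% of height
    return shift(image, dx, dy), shift(mask, dx, dy)

image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)      # [[3, 2, 1], [6, 5, 4]]
shifted = shift(image, dx=1, dy=0)    # [[0, 1, 2], [0, 4, 5]]
```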

B. DATA LEARNING
After obtaining the preprocessed medical images from the first stage of the proposed methodology, we use them to train a DCNN model. Figure 2 shows the AEDCN-Net model architecture, which is similar to the original U-Net. However, several modifications enhance the performance of the proposed model architecture. Specifically, it comprises atrous-asymmetric convolution (ATAS) blocks, max-pooling, concatenation, and upsampling operations. The ATAS blocks are responsible for learning useful features from the preprocessed medical images. Table 1 presents the details of the ATAS blocks and shows their two branches, namely the main and secondary branches. First, an input medical image is fed into the main branch, where it passes through bottleneck, atrous, and asymmetric convolution operations.
Detailed description of the ATAS blocks: ks, a, p, and s stand for kernel size, atrous convolution factor, padding, and stride, respectively. Each block employs batch normalization (BN) and weight initialization-based rectified linear unit (WIB-ReLU) activation function [29].

1) BOTTLENECK CONVOLUTION
The bottleneck convolutional layer uses fewer convolution filters than the number of input channels, each of which measures 1 × 1. This reduces the computational complexity by decreasing the number of channels that the subsequent operations must process. Specifically, given a medical input image $I$ ($I \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ are the image height, width, and channels, respectively) and a convolutional filter $F$ ($F \in \mathbb{R}^{T \times X \times Y \times C}$, where $T$, $X$, $Y$, and $C$ are the total number of output filters, the filter height, the filter width, and the number of input filters, respectively), the number of required trainable weights and floating point operations (FLOPs) for a given original and bottleneck convolution layer can be computed using (2), where $l$ denotes the $l$th convolutional layer of the network and $b$ is the bottleneck convolution parameter. For the proposed model, we set $b$ to 4 because it provided the best results in the ablation studies (refer to Section V-C). The bottleneck convolution layer reduces the number of trainable parameters and FLOPs by nearly a factor of two.
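The effect of a bottleneck on the weight and FLOP counts can be illustrated with a rough sketch. It assumes one particular layout, namely a 1 × 1 convolution that reduces C input channels to C / b channels followed by an X × Y convolution producing T filters; this layout and the function names are assumptions for illustration, and the actual reduction depends on how the block is arranged. FLOPs are approximated as one operation per weight per output pixel.

```python
def conv_params(t, x, y, c):
    """Weights of a convolution with T filters of size X x Y over C channels."""
    return t * x * y * c

def conv_flops(h, w, t, x, y, c):
    """Approximate FLOPs on an H x W map: one operation per weight per pixel."""
    return h * w * conv_params(t, x, y, c)

def bottleneck_params(t, x, y, c, b):
    """Assumed layout: 1 x 1 reduction to C // b channels, then an X x Y convolution."""
    reduced = c // b
    return conv_params(reduced, 1, 1, c) + conv_params(t, x, y, reduced)

plain = conv_params(t=64, x=3, y=3, c=64)               # 36864 weights
slim = bottleneck_params(t=64, x=3, y=3, c=64, b=4)     # 1024 + 9216 = 10240 weights
plain_flops = conv_flops(256, 256, 64, 3, 3, 64)
slim_flops = 256 * 256 * slim
```

Because the FLOP estimate is proportional to the weight count at a fixed resolution, any saving in weights carries over directly to computation.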

2) ATROUS CONVOLUTION
The atrous convolution uses an atrous factor of $a$ and is defined in (3), where $A$ and $f$ are the function and the convolution filter, respectively. The atrous convolution enlarges the receptive field of the convolution kernel without any additional memory space or computational power. Moreover, enlarging the receptive field in this way does not reduce the image resolution and causes no loss of coverage. Considering these advantages of the atrous convolution, we exploited this technique in all convolution operations in AEDCN-Net to obtain a computationally and memory efficient model.
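The behavior of the atrous factor can be illustrated with a one-dimensional sketch. It follows the common deep learning convention of sliding the filter without flipping it and uses no padding; the function names are illustrative and not taken from the original work.

```python
def atrous_conv1d(signal, kernel, a):
    """Valid 1-D atrous convolution: kernel taps are spaced a samples apart."""
    k = len(kernel)
    span = a * (k - 1)  # distance between the first and last tap
    return [
        sum(signal[i + a * j] * kernel[j] for j in range(k))
        for i in range(len(signal) - span)
    ]

def receptive_field(kernel_size, a):
    """Input samples covered by one output value."""
    return a * (kernel_size - 1) + 1

signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]  # still only three weights, whatever the atrous factor

dense = atrous_conv1d(signal, kernel, a=1)    # [-2, -2, -2, -2, -2]
dilated = atrous_conv1d(signal, kernel, a=2)  # [-4, -4, -4]
```

With a = 2, the same three weights cover five input samples instead of three, which is how the receptive field grows without extra parameters.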

3) ASYMMETRIC CONVOLUTION
This study aims to develop an accurate and efficient network. Specifically, the high cost of the convolution operation can be alleviated by introducing the asymmetric convolution operation in (5), where $t_h$ and $t_w$ define the asymmetric convolution filters convolving with the height and the width of an input image, respectively, whereas $\hat{t}_{ac}$ represents the output of the asymmetric convolution operation. With this convolution type, the trainable parameters are reduced to (X × C × TH + Y × TW × T) and the FLOPs decrease to (H × W) × (X × C × TH + Y × TW × T). Moreover, the asymmetric convolution conducts two convolution operations using different filters; thus, it can learn more non-linear functions and extract more useful features from the input images.
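The parameter and FLOP counts quoted above can be checked numerically with a short sketch. It assumes that TH is the number of filters in the X × 1 convolution and TW is the number of channels entering the 1 × Y convolution; this reading of the symbols, along with the function names, is an assumption made for illustration.

```python
def square_conv_params(x, y, c, t):
    """Weights of a regular X x Y convolution from C channels to T filters."""
    return x * y * c * t

def asymmetric_conv_params(x, y, c, th, tw, t):
    """Weights of an X x 1 convolution followed by a 1 x Y convolution."""
    return x * c * th + y * tw * t

def flops(h, w, params):
    """FLOPs on an H x W map, counted as (H x W) x weights."""
    return h * w * params

regular = square_conv_params(x=3, y=3, c=64, t=64)                    # 36864
asym = asymmetric_conv_params(x=3, y=3, c=64, th=64, tw=64, t=64)     # 24576
saving = 1 - asym / regular                                           # one third
```

For a 3 × 3 kernel with equal channel counts, the pair of asymmetric filters uses 6 weights per channel pair instead of 9.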

4) MODEL ARCHITECTURE
We progressively increased the number of filters in the encoding path. The first convolution layer contained 64 filters with a size of 3 × 1 and an atrous factor, padding, and stride of 1. In every subsequent ATAS block, the number of convolution filters and the atrous convolution factor increased by a factor of 2 in the encoding path and decreased by the same ratio in the decoding path. Each convolution operation was followed by batch normalization [30] and the WIB-ReLU activation function [29]. Regarding the secondary branch, the input data passed through a regular convolution operation with a kernel size of 1 × 1 and a padding and stride of 1, followed by a batch normalization layer. The outputs of the two branches were then added and passed through the WIB-ReLU activation function. Inspired by [13], we used skip connections in the ATAS block to alleviate the vanishing gradient problem. These skip connections ensured that the information from the earlier layers is connected with the subsequent layers, allowing a more effective training of the DCNN model.
Moreover, the max-pooling operation decreased the spatial dimension of the images by a factor of two, reducing the computational complexity. The upsampling operation progressively recovered the original image size by increasing the spatial dimension of the ATAS block output in the decoding path by a factor of two. In the proposed model architecture, we used the nearest-neighbor interpolation method to recover the original image size, as in [9]. We chose this operation because it has no trainable parameters and therefore reduces the number of parameters to train, which is consistent with our objective of developing an accurate and efficient DCNN model. Finally, the concatenation operation connected the output of the ATAS blocks in the encoding path to the corresponding output of the upsampling operation in the decoding path. The concatenation helped alleviate the problem of feature loss resulting from the max-pooling and upsampling operations.
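The two resolution-changing operations can be sketched on a single-channel feature map. This is a minimal illustration on nested lists that assumes a 2 × 2 pooling window with a stride of 2; neither function holds any trainable weight.

```python
def max_pool_2x2(fm):
    """Halve both spatial dimensions by keeping the maximum of each 2 x 2 window."""
    h, w = len(fm), len(fm[0])
    return [
        [max(fm[r][c], fm[r][c + 1], fm[r + 1][c], fm[r + 1][c + 1])
         for c in range(0, w - 1, 2)]
        for r in range(0, h - 1, 2)
    ]

def upsample_nearest_2x(fm):
    """Double both spatial dimensions by repeating every pixel; no weights involved."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

feature_map = [[1, 2, 3, 4],
               [5, 6, 7, 8],
               [9, 10, 11, 12],
               [13, 14, 15, 16]]

pooled = max_pool_2x2(feature_map)       # [[6, 8], [14, 16]]
restored = upsample_nearest_2x(pooled)   # 4 x 4 again, with repeated values
```

The restored map has the original size but not the original detail, which is why the concatenation with encoder features is needed to recover fine structure.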
In the end, the output of the ATAS blocks passed through a 1 × 1 convolution operation with a sigmoid activation function to generate a segmented image with an object in the foreground and black pixels in the background.

5) LOSS FUNCTION
We used the sum of two loss functions, namely the cross-entropy loss and the dice loss, as the objective to minimize. The loss function is formulated in (6), where $M$, $N$, and $P$ are the total number of images, classes, and pixels, respectively, and $y$ and $\hat{y}$ are the ground truth and the predicted masks for the segmentation, respectively.
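A combined cross-entropy and dice objective for the binary case can be sketched as follows. This is a simplified single-image illustration on flattened masks; the clipping constant `eps` and the `smooth` term are common numerical safeguards assumed here and are not taken from the original formulation.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy over pixels; predictions are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def dice_loss(y_true, y_pred, smooth=1.0):
    """One minus the soft dice overlap between the mask and the prediction."""
    intersection = sum(y * p for y, p in zip(y_true, y_pred))
    denominator = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

def combined_loss(y_true, y_pred):
    """Sum of the two terms, used as the value to minimize."""
    return binary_cross_entropy(y_true, y_pred) + dice_loss(y_true, y_pred)

mask = [1, 0, 1, 0]
perfect = combined_loss(mask, [1.0, 0.0, 1.0, 0.0])    # close to 0
uncertain = combined_loss(mask, [0.5, 0.5, 0.5, 0.5])  # log(2) + 0.4
```

The cross-entropy term scores every pixel independently, whereas the dice term scores the overlap of the whole foreground region, so the two penalize different kinds of errors.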

C. INFERENCE
After completing the data learning stage and obtaining a trained DCNN model, we can employ this model to generate segmented medical images in the inference stage. In this step, the raw data should pass through the same preprocessing operations as in the training stage, except for data augmentation. A test set of a dataset or real-life medical images are resized, transformed into grayscale, and standardized using (1). For standardization, X must be the training data, i.e., the same data that was used in the training and validation stages, to ensure that the data in the inference stage follow the same distribution. The images are then input into the trained model, which consequently generates segmented medical images.

IV. EXPERIMENTS AND RESULTS
This section describes the conducted experiments and their results and presents a comparison of the performances of the proposed method and the existing state-of-the-art models.

A. EXPERIMENT DATASETS
For the experiments, we employed four publicly available and widely used medical image datasets, namely the 2018 Liver Tumor Segmentation challenge dataset containing abdominal computed tomography (CT) scans [31], 2018 Data Science Bowl (DSB) challenge dataset containing a large number of segmented nuclei images [32], Kvasir-SEG dataset containing polyp images [33], and International Skin Imaging Collaboration (ISIC) 2018: Skin Lesion Analysis Toward Melanoma Detection challenge dataset containing dermoscopic images [34]. Real-life medical image datasets often experience a problem of limited data for training and validation [35], [36]; therefore, we used various datasets that have limited (2018 LiTS: 331) and ample (ISIC 2018: 2594) training images to test the performance of the proposed method from different angles. Table 2 presents the details of these datasets.

B. BASELINE MODELS
We selected five recent medical image segmentation DCNN models that attain state-of-the-art performance for comparison with the proposed method: FU-Net [9], nnU-Net [19], UNet++ [20], ESPNetv2 [21], and ResUNet++ [22]. We have provided a detailed summary of these models in Section II; hence, we do not repeat their specifications here.

C. TRAINING SETUP
We implemented the baseline and proposed methods using Python version 3.6.9 and the TensorFlow library version 2.4.0. We initialized the weight parameters based on a standard normal distribution with a mean and a standard deviation of 0 and 1, respectively, to follow the standards of the WIB-ReLU activation function [29]. We did not use bias parameters because they are canceled out when the batch normalization method is used. We used the combined cross entropy and dice loss functions as the function for minimization (refer to Section III-B5) and an Adam optimizer [37] with learning rate η = 3e−3, the exponential decay rate for the first moment β1 = 9e−1, and the exponential decay rate for the second moment β2 = 9e−3 to update the trainable parameters. The experiments were conducted using a 32 GB NVIDIA Tesla V100-SXM2 GPU with CUDA 10.0 and a mini-batch size of 4 for 2018 LiTS, 16 for 2018 DSB and Kvasir-SEG, and 32 for the ISIC 2018 datasets. The models required approximately 100 epochs to converge; therefore, we trained them only for this number of epochs because further training did not improve their performance.

D. EVALUATION METRICS
We assessed the performance of the baseline and proposed methods using several evaluation metrics, including pixel accuracy (PA), dice coefficient (DC), and mean intersection over union (mIoU). The formulas of these evaluation metrics are given in (7), where $\hat{y}$ and $y$ are the predicted and target values, respectively; $P$ and $M$ are the total number of pixels in an image and the total number of instances, respectively; and TP, TN, FP, and FN stand for true positive, true negative, false positive, and false negative, respectively.
TABLE 3. Comparison of the baseline and proposed models in terms of accuracy and speed*.
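The three metrics can be sketched from the confusion counts of binary masks. This is a minimal illustration using the standard definitions; averaging the IoU over images to obtain mIoU and returning 1.0 when both masks are empty are assumptions made here.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for two flattened binary masks."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def pixel_accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def dice_coefficient(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 1.0

def iou(y_true, y_pred):
    tp, _, fp, fn = confusion_counts(y_true, y_pred)
    denominator = tp + fp + fn
    return tp / denominator if denominator else 1.0

def mean_iou(masks_true, masks_pred):
    """Average the per-image IoU over all M images (assumed averaging scheme)."""
    scores = [iou(t, p) for t, p in zip(masks_true, masks_pred)]
    return sum(scores) / len(scores)

target = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 1]   # TP = 2, TN = 2, FP = 1, FN = 1

pa = pixel_accuracy(target, predicted)    # 4 / 6
dc = dice_coefficient(target, predicted)  # 4 / 6
score = iou(target, predicted)            # 2 / 4
```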

V. DISCUSSION
This section discusses the results of the conducted experiments in terms of computational and memory efficiency and shares the results of the ablation studies. Moreover, it presents a qualitative comparison of the baseline and proposed methods and enumerates the limitations of the proposed method.

A. EXPERIMENT RESULTS
Table 3 summarizes the experimental results of the considered models on the test sets of the aforementioned datasets. As shown in the table, the proposed model was fast in both training and inference and significantly outperformed the existing computationally expensive models, such as ResUNet++ and nnU-Net, achieving a nearly 3× speed-up. AEDCN-Net was also faster than the lightweight and efficient models, ESPNetv2 and FU-Net. The proposed model was approximately 38% and 15% quicker in training (data augmentation process time is included in training time per epoch) and inference than the ESPNetv2 and FU-Net models, respectively.
In the case of the accuracy-related metrics, the proposed model considerably outperformed the baseline networks on the datasets with a limited number of medical images, such as 2018 LiTS and 2018 DSB, primarily because the computationally expensive models with a large number of trainable parameters experienced overfitting and could not generalize well to the unseen test data. However, in the experiments on datasets with 1000 or more images, such as Kvasir-SEG and ISIC 2018, the nnU-Net and ResUNet++ models attained better performances than the lightweight models owing to their large number of computations and parameters. AEDCN-Net still largely outperformed the lightweight models and achieved at least the second-best result in terms of the PA, DC, and mIoU metrics on the considered datasets.

B. COMPUTATIONAL AND MEMORY EFFICIENCY
We also compared the considered models in terms of trainable parameters, model size, and FLOPs. Table 4 presents the evaluation results.
In Table 4, AEDCN-Net required nearly seven and 15 times fewer trainable parameters than the lightweight and computationally expensive models, respectively. Moreover, the size of the proposed model was considerably smaller than those of the baseline networks. Finally, AEDCN-Net was computationally efficient, requiring the lowest number of FLOPs to produce the medical image segmentation.

C. ABLATION STUDIES
Table 5 analyzes the effect of different components in the proposed method on the accuracy-related evaluation metrics and the number of trainable parameters. To reduce the computational cost of the experiments, we conducted the ablation studies on the datasets with the fewest and the largest number of images.
As shown in Table 5, the asymmetric convolution operation with the kernel sizes of 3 × 1 and 1 × 3 always performed better than that with 5 × 1 and 1 × 5 in both datasets. Moreover, the progressive increase of the atrous factor followed by a progressive decrease (2, 4, 8, 4, 2) resulted in the highest scores in the evaluation metrics when compared with the other options. In addition, the AEDCN-Net with seven blocks worked better on the dataset with a limited number of images, while a more complex network with nine blocks performed well on ISIC 2018, which has a large number of training images. Although AEDCN-Net with nine blocks attained the most accurate medical image segmentation there, it increased the number of trainable parameters by nearly six times and resulted in a longer training and inference time; therefore, we employed AEDCN-Net with seven blocks by default.
TABLE 5. Effect of different components in the ATAS blocks, where ↑a is a progressive increase, ↓a is a progressive decrease; and a* is an increase followed by a decrease of the atrous factor.

D. QUALITATIVE COMPARISON OF THE CONSIDERED MODELS
After finishing the training and evaluating the model performance on the considered datasets, we show herein the generated segmented images using the baseline and proposed methods. Figure 3 depicts the input medical images, ground truth masks, and generated segmentation masks by the considered methods. The most efficient baseline model, FU-Net, failed to generalize well on the test images. Particularly, the model's inferior performance was noticeable in the segmented images from the 2018 DSB dataset. In addition, nnU-Net produced lower-quality segmentation masks in the Kvasir-SEG and ISIC 2018 datasets. Notably, the proposed method could produce more detailed and precise segmented medical images than baseline methods in all considered datasets.

E. LIMITATIONS OF THE PROPOSED METHOD
The results of the conducted experiments using four medical image datasets and the comparison with the existing state-of-the-art models showed that the proposed AEDCN-Net outperformed the baseline models in terms of speed, memory efficiency, and accuracy. However, the proposed method has several limitations. First, some datasets used in the experiments have a limited number of training images, which cannot fully demonstrate the performance difference between the proposed method and the more powerful and computationally expensive networks. Second, the datasets considered in the experiments exhibit only binary (foreground and background) output. Although the proposed method can easily be employed for multiclass segmentation by slightly altering the activation function in its final output layer, this change can increase the computational complexity.

VI. CONCLUSION AND FUTURE WORK
This study investigated medical image segmentation using DL-based techniques. Based on an extensive literature review, we found that the currently available state-of-the-art methods in this field are computationally inefficient and slow. Moreover, the lightweight and efficient models cannot generate precise segmented images. Therefore, we proposed the AEDCN-Net model, which benefits from carefully designed preprocessing and a computationally efficient DCNN model that uses skip connection-based bottleneck, atrous, and asymmetric convolution operations in the encoding path and the nearest-neighbor interpolation upsampling technique in the decoding path. In the conducted experiments using four open-source medical image datasets, the proposed method showed superior performance in terms of computational efficiency, memory, and accuracy compared with the counterpart models. Moreover, the AEDCN-Net significantly outperformed the efficient models by achieving better results when assessed using several evaluation metrics.
In future work, we will focus on increasing the accuracy of the proposed model and attempt to interpret the predicted segmented medical images based on the severity level of illness.