CRF-EfficientUNet: an improved UNet framework for polyp segmentation in colonoscopy images with combined asymmetric loss function and CRF-RNN layer

Colonoscopy is considered the gold-standard investigation for colorectal cancer screening. However, the polyp miss rate in clinical practice is relatively high due to a range of factors. This presents an opportunity to use AI models to automatically detect and segment polyps, helping clinicians reduce the number of polyps missed. Inspired by the success of UNets, a popular strategy for solving medical image segmentation tasks, this article proposes a novel framework for polyp segmentation called CRF-EfficientUNet, which enhances UNet with an EfficientNet encoder, a combined asymmetric loss function, and a Conditional Random Field as a Recurrent Neural Network (CRF-RNN) layer on top. We propose a novel loss function that combines pixel-wise cross-entropy loss and asymmetric similarity loss to address the unbalanced imaging data problem. Training the proposed network with this loss function achieves a considerably higher Dice score and better polyp segmentation predictions. In addition, we add the CRF-RNN layer to the proposed framework to improve the quality of semantic segmentation. Experimental results on popular benchmark datasets show that CRF-EfficientUNet achieves state-of-the-art accuracy compared to existing methods. When trained and tested on the CVC-ClinicDB dataset, the model achieves 95.55% Dice and 92.23% IoU. In the cross-dataset setting, with Kvasir-SEG as the training set and CVC-ColonDB as the test set, it achieves 85.59% Dice and 76.19% IoU. These results indicate that the proposed method has high generalization capability and learning ability, and it can be a compelling choice for practical applications with considerable data variations.


I. INTRODUCTION
Colorectal cancer (CRC) is one of the most common causes of cancer-related death in the world for both men and women, with 576,858 deaths (accounting for 5.8% of all cancer deaths) worldwide in 2020 [1]. CRC usually arises from abnormal polyp growth inside the colon, although polyps grow slowly and may take years to become cancer. According to anatomical findings, the structure of polyps is distinguished from normal mucosa by color, size, and surface type. The surface of polyps can be flat, elevated, or pedunculated based on a change in the gastrointestinal tract [2]. Though not all polyps lead to CRC, all CRC starts with polyps that become cancerous over time. While the advanced stages of colorectal cancer have a poor five-year survival rate of 10%, early diagnosis has shown a significantly more favorable five-year survival rate of 90% [3]. Therefore, accurate detection, investigation, and analysis of the types, patterns, and structures of polyps are important to reduce the spread of CRC. Nowadays, colonoscopy is considered the primary method for colon screening and preventing polyps from becoming cancerous. However, colonoscopy suffers from human errors because it depends on highly skilled endoscopists and a high level of eye-hand coordination. Moreover, some of the rare types of polyps are visually difficult to distinguish due to their flat nature, which demands the experience and expertise of endoscopists. Previous studies confirmed that 22%-28% of polyps are missed in patients undergoing colonoscopy [4]. Segmenting out polyps from the normal mucosa can help endoscopists reduce segmentation errors and subjectivity.
FIGURE 1. Some of the challenges presented by colonoscopy images: (a) Varying shapes and textures of polyps, (b) Small polyps, (c) Blurriness, intestinal contents, flares, or low-quality images.
This study focuses on the polyp segmentation problem using deep learning methods. Precise segmentation of the polyp regions is particularly complicated because polyps have different shapes, sizes, colors, and appearances [5]. In addition, image artifacts such as specularity, saturation, artifact, bubbles, and instrument [6], as well as intestinal contents and low-quality images, can cause errors during segmentation. Figure 1 shows some of the challenges presented by colonoscopy images. Over the past years, researchers have made several efforts to develop Computer-Aided Diagnosis (CADx) prototypes for automated polyp segmentation. Most of the prior polyp segmentation approaches were based on analyzing polyp color, texture, shape, or edge information to segment polyp regions. More recently, deep neural networks have been widely used to solve medical image segmentation problems, including polyp segmentation. A CADx system that automatically segments polyps from normal mucosa in colonoscopy images can be an effective clinical tool that helps endoscopists screen faster and with higher accuracy [5]. To build a powerful polyp segmentation CADx system that could be used in clinical settings, it is necessary to address two common challenges: (i) Robustness (i.e., the ability of the system to perform well on both easy and challenging images), and (ii) Generalization (i.e., a system trained on a dataset from a specific hospital should generalize across different hospitals) [7]. To address these challenges, the overall goal of this article is to develop a novel deep learning framework for polyp segmentation with high generalizability and learning ability, so that it can be an effective choice for practical applications.
Among various deep learning models, UNet [8] and its variants have demonstrated impressive performance in biomedical image segmentation. Motivated by the success of UNet, in this work, we propose a novel polyp segmentation method based on the UNet architecture. We adapt the UNet model for polyp segmentation and evaluate the model with different encoders (MobileNet [9], ResNet [10], and EfficientNets [11]). We choose the EfficientNetB7 encoder for our model because it yields the highest performance. One of the challenges in training networks for polyp segmentation is unbalanced data, i.e., polyp pixels are often far fewer than non-polyp pixels. Networks trained on unbalanced data may make predictions with high precision and low recall. These predictions are severely biased toward the non-polyp class, which is particularly undesirable because the consequences of false negatives are more serious than those of false positives. Therefore, we propose a novel loss function that combines pixel-wise cross-entropy loss and asymmetric similarity loss for training polyp segmentation models to address this problem. By training models with the proposed loss function, we found that the models make predictions with a better trade-off between precision and recall, yielding more accurate polyp segmentation. Moreover, one central issue in polyp segmentation is the limited capacity of deep learning techniques to delineate polyp objects. To solve this problem, we use a deep network that fully integrates Conditional Random Fields (CRFs) [12] probabilistic graphical modeling with a CNN, making it possible to train the whole deep network end-to-end with the back-propagation algorithm and avoiding offline post-processing methods for object delineation [13].
Finally, we perform experiments on a range of recent public datasets for polyp segmentation, i.e., Kvasir-SEG [14], CVC-ClinicDB [15], CVC-ColonDB [16], and ETIS-Larib [17], with different scenarios of using training and test data, to evaluate our proposed method and compare it with state-of-the-art (SOTA) approaches.
This article is an extension of our work originally presented in the 2020 RIVF International Conference on Computing and Communication Technologies (RIVF) [18]. We extend the previous work by (i) modifying the model architecture, removing the ensemble step and adding a CRF-RNN layer, (ii) using EfficientNetB7 instead of EfficientNetB5 as the encoder, and (iii) conducting comprehensive experiments with multiple datasets and multiple experiment settings for comparison with recent SOTAs in polyp segmentation, together with an ablation study. In summary, this article makes the following key contributions:
1) We present a novel neural network architecture for automatic polyp segmentation, called CRF-EfficientUNet, extended from the UNet architecture with an EfficientNet encoder and a CRF-RNN layer on top. Moreover, we use the transfer learning method on the proposed network architecture to achieve better performance.
2) We propose a loss function, called the combined asymmetric loss function, that combines pixel-wise cross-entropy loss and asymmetric similarity loss for training polyp segmentation networks. The combined asymmetric loss function can effectively boost the performance of polyp segmentation networks: training our polyp segmentation model with the proposed loss function results in better performance.
3) We train and validate the proposed method on four popular benchmark datasets, i.e., Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS-Larib, with different scenarios of using training and testing data. The results show that our model is robust enough to detect small polyps that are frequently missed during colonoscopy and also performs well on easy images. Moreover, our network CRF-EfficientUNet outperforms all SOTAs across unseen polyp datasets; this demonstrates that our proposed method has better generalizability than existing methods. The experimental results indicate that the proposed model can be a compelling choice for practical applications with considerable data variations.
The rest of the paper is organized as follows. Section II reviews related research on polyp segmentation. Section III describes the proposed method for polyp segmentation in detail. Section IV outlines our experiment settings. The experimental results are presented and discussed in Section V. Finally, Section VI summarizes and concludes this work.

II. RELATED WORK
Many methods have been proposed that focus on accurate polyp segmentation. The existing research works in polyp segmentation can be roughly grouped into two main approaches: those using image processing segmentation and traditional machine learning methods, and those using deep learning methods. The image processing segmentation methods analyze either the polyp's edge or its color and texture for polyp segmentation. Bernal et al. [16] proposed to use the "depth of valleys" of an image to segment polyps. They use the watershed algorithm to segment images into polyp candidate regions and then classify each region into polyp and non-polyp. This classification is based on region information and the "depth of valleys" in each region. Ganz et al. [19] proposed a method based on the Hough transform to detect the region of interest (ROI), with specular reflection suppression by exemplar-based image inpainting as a preprocessing step. Then, they use an algorithm called shape-UCM for image segmentation; shape-UCM works based on image gradient contours and spectral clustering. Traditional machine learning methods are based on hand-crafted features for image representation. These methods use color, texture, shape, or edge information as extracted features and train a classifier to distinguish polyps from the surrounding normal mucosa. Tajbakhsh et al. [20] proposed a feature extraction method that extracts sub-patches with a 50% overlap and calculates their average vertically, resulting in one-dimensional signals. After that, they use DCT coefficients as a feature for each extracted patch. Finally, they use a two-stage random forest classifier to label each patch.
The deep learning-based approach for polyp segmentation has gained much attention in recent years because its automatic feature extraction process segments polyp regions with unprecedented precision. In addition, public databases of polyp images have facilitated further research on the use of deep learning models for polyp segmentation. Qadir et al. [21] proposed using Mask-RCNN incorporated with traditional CNN-based feature extractors to provide bounding boxes of the polyp regions. Kang et al. [22] used Mask-RCNN, which relies on ResNet50 and ResNet101, as a backbone structure for automatic polyp detection and segmentation. For obtaining pixel-level segmentation, a fully convolutional neural network (FCN) was used. The authors in [23] showed that FCN architectures could be refined and adapted to recognize polyp structures. Zhang et al. [24] used FCN-8S to segment polyp region candidates, and texton features computed from each region were used by a random forest classifier for the final decision. Fan et al. [25] proposed PraNet, which enhances an FCN-like model using a parallel partial decoder and reverse attention modules for medical image segmentation. UNet was proposed as an alternative to the single encoder of the traditional FCN architecture; it increases the performance of FCN considerably and has established itself as a popular choice in medical image segmentation. UNet is an encoder-decoder-based structure that uses skip connections to concatenate the features from the encoding and decoding layers. Inspired by the success of UNet, several variants were proposed for polyp segmentation and yielded promising results. Jha et al. [26] presented DoubleU-Net, which combines two UNets. The first UNet uses a pre-trained VGG-19 as the backbone. The second UNet is added at the bottom of the first UNet to capture more semantic information efficiently. They also adopt Atrous Spatial Pyramid Pooling (ASPP) to capture contextual information within the network. Zhou et al. [27] proposed UNet++, a deeply supervised encoder-decoder network, which connects UNets through a series of nested, dense skip pathways. Jha et al. [28] also proposed ResUNet++, which takes advantage of residual blocks, squeeze and excitation units, ASPP, and the attention mechanism. Similar to UNet, another deep convolutional encoder-decoder architecture, SegNet [29], is also used for polyp segmentation. Wang et al. [30] used the SegNet architecture to detect polyps in real-time and with high sensitivity and specificity. Afify et al. [31] presented an improved framework for polyp segmentation based on image preprocessing and two types of SegNet architecture. Mahmud et al. [32] proposed PolypSegNet, a modified SegNet architecture for automated polyp segmentation from colonoscopy images with several sequential depth dilated inception (DDI) blocks, deep fusion skip modules (DFSM), and a deep reconstruction module (DRM). Additionally, there are several recent studies on polyp segmentation [33]-[36]. They are useful steps toward building an automated polyp segmentation system.
From the presented related works, we observe that work on polyp segmentation problems is becoming mature. Researchers are conducting a variety of studies with many different methods for precise polyp segmentation. However, the main drawback in the field is that very few works test the generalizability of models with cross-dataset tests. Most of the current works have proposed algorithms tested on single, often small, imbalanced, and explicitly handpicked datasets. Besides, many challenging polyps are usually missed during colonoscopy examinations and can develop into cancer if they are not detected early. Moreover, one of the significant challenges in the medical domain is the lack of large training datasets, and the obtained datasets are often imbalanced. These challenges make it harder to build robust and generalizable systems for precise polyp segmentation. Toward addressing these challenges, in this work, we aim to develop an algorithm that achieves high performance on different datasets. We have done extensive experiments on various colonoscopy images. Furthermore, we have trained the proposed model on datasets from multiple clinical settings and tested it on other diverse unseen datasets to achieve the goal of building generalizable and robust models.

A. OVERVIEW OF THE PROPOSED METHOD
The overall architecture of our proposed network, CRF-EfficientUNet, is depicted in Figure 2. First, we evaluate the performance of the UNet architecture for polyp segmentation with different CNN encoders. We select the EfficientNet B7 encoder for the UNet architecture because it gives the highest performance. Next, we extend the UNet architecture with the EfficientNet B7 encoder and a CRF-RNN layer on top. In addition, we propose a novel loss function, called the combined asymmetric loss function, that combines pixel-wise cross-entropy loss and asymmetric similarity loss. Training the networks with the combined asymmetric loss and the transfer learning method can effectively boost the network's segmentation performance. The CRF-RNN layer is integrated on top of UNet as follows. First, EfficientUNet is trained. Once the UNet network's parameters have been trained, they are fixed and set to untrainable. Next, the softmax layer is left out, and the CRF-RNN layer is integrated on top. Finally, the CRF-EfficientUNet is trained end-to-end once again.

B. UNETS WITH DIFFERENT ENCODERS FOR POLYP SEGMENTATION
The UNet architecture was developed by Olaf Ronneberger et al. for biomedical image segmentation [8]. UNet has two symmetric paths. The first path, called the encoder, is used to capture the context in the image. The encoder consists of convolutional and max-pooling layers. The second path, called the decoder, is used to enable precise localization using transposed convolutions. Moreover, UNet has skip connections between the encoder and the decoder that carry the higher-level features learned by the encoder, which could otherwise be lost during the decoding process. That means the outputs of the encoding layers are passed directly to the decoding layers so that all the important pieces of information can be preserved. We adopt a transfer learning approach with the UNet architecture for polyp segmentation. We use UNet with a CNN model pre-trained on the ImageNet dataset as the encoder. The choice of the encoder is essential because the CNN architecture, the number of parameters, and the type of layers directly affect the speed, memory usage, and, most importantly, the performance of the UNet. In this work, we select three architectures to compare and evaluate their performance in polyp segmentation: MobileNet [9], ResNet [10], and EfficientNet [11]. MobileNet is a family of mobile-first computer vision models from Google. They are designed to maximize accuracy while being mindful of the restricted resources of an on-device or embedded application. ResNet is a residual learning framework that makes deep networks easier to train. With ResNet, we can benefit from deeper CNN networks to obtain an even higher level of essential features for challenging tasks such as polyp segmentation. EfficientNets are the latest family of image classification models from Google, which achieve state-of-the-art accuracy on ImageNet. Mingxing Tan and Quoc V. Le proposed the EfficientNets based on AutoML and compound scaling. In particular, they use the AutoML MNAS Mobile framework to develop a mobile-size baseline network named EfficientNet-B0. Then, they use the compound scaling method to scale up this baseline to obtain EfficientNet-B1 to EfficientNet-B7. The accuracy of the networks increases steadily from EfficientNet-B0 to EfficientNet-B7 while the networks remain relatively small. This study conducts an ablation study on different encoders, including the EfficientNet family from EfficientNet-B0 to EfficientNet-B7, ResNet-50, ResNet-101, and MobileNetV2. Our experiments show that UNet with the EfficientNet B7 encoder gives the highest performance.

C. COMBINED ASYMMETRIC LOSS FUNCTION
We present the combined asymmetric loss function, a novel loss function that combines two existing loss functions, cross-entropy loss and asymmetric similarity loss, with hyper-parameters to boost segmentation results. Pixel-wise cross-entropy loss was used by Ronneberger et al. in [8] for the task of image segmentation. This loss examines each pixel individually, comparing the class predictions, defined as a depth-wise pixel vector, to the target vector. The cross-entropy loss function is defined in terms of $p_{i,j}$, the predicted segmentation probability, and $g_{i,j}$, the ground truth at image pixel $(i, j)$. The cross-entropy loss function assesses every single pixel. In medical imaging applications, such as polyp segmentation, polyp pixels are far fewer than non-polyp pixels. Hence, a segmentation network trained with a cross-entropy loss function is biased towards the background rather than the object itself. Furthermore, as the foreground region is often missing or only partially detected, it is not easy for the model to see the object.

The Dice score coefficient (DSC) is an overlap index widely used to assess segmentation maps in the medical community. The Dice similarity coefficient between the set of predicted binary labels (denoted as $P$) and the set of ground truth binary labels (denoted as $G$) is defined as:

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|} \tag{2}$$

The Dice loss function is formulated based on the Dice score [37] and is used to improve the training of UNet and other segmentation networks. Simply put, the Dice score is 2 × the area of overlap between $P$ (predicted area) and $G$ (ground truth area) divided by the total number of pixels in $P$ and $G$. Figure 3 illustrates the Dice score.

FIGURE 3. Dice score: TP is true positives, FP is false positives, and FN is false negatives, P is the set of predicted binary labels, G is the set of ground truth binary labels.

The Dice score can be calculated as:

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN} \tag{3}$$

where TP is true positives, FP is false positives, and FN is false negatives. In this equation, the Dice score weighs false positives (FPs) and false negatives (FNs) equally. When data is class-imbalanced, positive (polyp) pixels are often far fewer than negative (non-polyp) pixels. A network trained with Dice loss on imbalanced data may make predictions severely biased towards the negative (non-polyp) class. That is particularly undesirable in colonoscopy applications, where false negatives are more serious than false positives. On the other hand, precision and recall are defined as:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{5}$$

Combining Equations (3), (4), and (5), we have:

$$\mathrm{DSC} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{6}$$

As Equation (6) shows, the Dice score is the harmonic mean of precision and recall. A network trained with Dice loss on unbalanced data may make predictions with high precision and low recall. In some fields, however, such as medical image segmentation, the data are highly unbalanced, and detecting the small number of pixels in the positive class is important. Thus, it is necessary to better balance precision and recall when training segmentation networks on unbalanced data. The asymmetric similarity loss function was proposed in [38] for training segmentation networks to achieve a better balance between precision and recall. The asymmetric similarity loss function is based on the $F_\beta$ score and is used to replace the Dice loss function. The $F_\beta$ score is defined as:

$$F_\beta = (1 + \beta^2)\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

By changing the hyperparameter $\beta$, we can control the trade-off between precision and recall. Equation (7) can be written as:

$$F_\beta = \frac{(1 + \beta^2)\,|P \cap G|}{(1 + \beta^2)\,|P \cap G| + \beta^2\,|G \setminus P| + |P \setminus G|} \tag{8}$$

where $|P \setminus G|$ is the difference of $P$ and $G$. Therefore, the $F_\beta$ score can be calculated as follows:

$$F_\beta = \frac{(1 + \beta^2)\,TP}{(1 + \beta^2)\,TP + \beta^2\,FN + FP} \tag{9}$$

The $F_\beta$ score with the hyper-parameter $\beta$ generalizes the Dice similarity coefficient and the Jaccard (IoU) index. When $\beta = 1$, the $F_\beta$ score is the Dice score, $\beta = 2$ generates the F2 score, and $\beta = 0$ transforms the score to precision. When the hyper-parameter $\beta$ is larger, the weight of recall is higher than the weight of precision, and the false negatives are more emphasized.
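The relationships among the Dice score, precision, recall, and the F β score described above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation; the function names and the toy label lists are hypothetical and chosen only for illustration.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN for two equally long lists of binary labels (0 or 1)."""
    tp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, truth) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, truth) if p == 0 and g == 1)
    return tp, fp, fn


def precision(tp, fp, fn):
    return tp / (tp + fp)


def recall(tp, fp, fn):
    return tp / (tp + fn)


def dice(tp, fp, fn):
    """Dice score from counts: FP and FN are weighted equally."""
    return 2 * tp / (2 * tp + fp + fn)


def f_beta(tp, fp, fn, beta):
    """F-beta score from counts: FN is weighted by beta^2 relative to FP."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Hypothetical toy example: 1 = polyp pixel, 0 = non-polyp pixel.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
truth = [1, 1, 0, 1, 1, 0, 0, 0]
tp, fp, fn = confusion_counts(pred, truth)

print("Dice     :", dice(tp, fp, fn))
print("F1       :", f_beta(tp, fp, fn, 1.0))   # same value as Dice
print("F0       :", f_beta(tp, fp, fn, 0.0))   # same value as precision
print("F2       :", f_beta(tp, fp, fn, 2.0))   # penalizes the two missed pixels more
print("Precision:", precision(tp, fp, fn))
print("Recall   :", recall(tp, fp, fn))
```

With β greater than 1, a prediction that misses polyp pixels scores lower than one with the same number of spurious pixels, which is the behavior the asymmetric similarity loss relies on.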
In this work, we propose a combined asymmetric loss function that combines cross-entropy loss and asymmetric similarity loss for training networks to boost polyp segmentation results. The proposed loss function is built from two terms: $L_{CE}$, the cross-entropy loss, and $L_{Asym} = 1 - F_\beta$, the asymmetric similarity loss based on the $F_\beta$ score. The hyperparameter $\alpha$ controls the contribution of the cross-entropy loss term in the loss function. Because the polyp segmentation problem is also a pixel classification problem, we use the cross-entropy loss term to verify each pixel individually. However, cross-entropy loss assesses every single pixel, and in colonoscopy images, polyps usually have a small surface area. Hence, a segmentation network trained with a cross-entropy loss function alone is biased towards the background rather than the polyp objects. Like Dice loss, asymmetric similarity loss can handle the input class-imbalance problem, e.g., segmenting small polyps from a large background. Moreover, asymmetric similarity loss allows training networks that achieve a better balance between precision and recall. By combining cross-entropy loss and asymmetric similarity loss for training networks, we can leverage the asymmetric similarity loss term to handle the input class-imbalance problem and control the trade-off between precision and recall. At the same time, we can force networks to learn better parameters by penalizing false positives and false negatives using the cross-entropy loss term. In the proposed loss function, appropriate values of the hyperparameters $\alpha$ and $\beta$ can be chosen based on the class imbalance ratios of the dataset. Our experimental results show that the combined asymmetric loss function is more robust than the cross-entropy loss function and the Dice loss function.

Table 1 lists recent polyp segmentation work that used different loss functions for training models. As reported in the table, none of the current loss functions can explicitly handle all the main challenges in the polyp segmentation problem. These challenges are handling class imbalance, the trade-off between precision and recall, and penalizing false positives and false negatives. Some studies attempted to deal with class imbalance by using variants of cross-entropy loss and Dice loss: Sánchez-Peralta et al. [40] use a loss function that combines binary cross-entropy and Jaccard index loss; Nguyen et al. [41] use an adaptive weighted loss function, which is a weighted cross-entropy loss; Mahmud et al. [32] use the Modified Focal Tversky loss (MFTL) function for training PolypSegNet, where MFTL increases the focus on hard training samples by utilizing the Tversky index (a generalization of the Dice score). However, these methods do not handle the trade-off between precision and recall on polyp segmentation datasets. Nguyen et al. in [39] also proposed a loss function that combines the binary cross-entropy and the Dice loss. Their loss function could penalize false positives and false negatives, but the trade-off between precision and recall could not be dealt with on all polyp segmentation test sets. In this article, we propose a combined asymmetric loss function that combines cross-entropy loss and asymmetric loss to train our polyp segmentation model. When the proposed loss function is used as the optimization objective, the polyp segmentation model can handle class imbalance and the trade-off between precision and recall, and it penalizes false positives and false negatives. The experimental results in Section V-A2 show that training our CRF-EfficientUNet model with the combined asymmetric loss function significantly improves the polyp segmentation accuracy.
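To make the two loss terms concrete, the following Python sketch computes a cross-entropy term and an asymmetric similarity term on a flat list of predicted probabilities. It is an illustration under stated assumptions, not the authors' implementation: the binary form of the cross-entropy, the soft (probability-weighted) TP/FP/FN counts, the convex combination `alpha * L_CE + (1 - alpha) * L_Asym`, and the default values of `alpha` and `beta` are all assumptions made here, and every function name is hypothetical.

```python
import math

EPS = 1e-7  # numerical guard against log(0) and division by zero


def cross_entropy(probs, truth):
    """Mean binary cross-entropy over pixels (assumed form of L_CE)."""
    total = 0.0
    for p, g in zip(probs, truth):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(probs)


def asymmetric_similarity_loss(probs, truth, beta):
    """L_Asym = 1 - F_beta, with soft counts so the loss stays differentiable."""
    tp = sum(p * g for p, g in zip(probs, truth))
    fp = sum(p * (1 - g) for p, g in zip(probs, truth))
    fn = sum((1 - p) * g for p, g in zip(probs, truth))
    b2 = beta * beta
    f_beta = ((1 + b2) * tp + EPS) / ((1 + b2) * tp + b2 * fn + fp + EPS)
    return 1.0 - f_beta


def combined_asymmetric_loss(probs, truth, alpha=0.5, beta=1.5):
    """Assumed combination: alpha weights the cross-entropy term."""
    l_ce = cross_entropy(probs, truth)
    l_asym = asymmetric_similarity_loss(probs, truth, beta)
    return alpha * l_ce + (1.0 - alpha) * l_asym


# Hypothetical 4-pixel example: the first two pixels are polyp.
truth = [1, 1, 0, 0]
missed_polyp = [0.9, 0.1, 0.1, 0.1]    # one false negative
spurious_polyp = [0.9, 0.9, 0.9, 0.1]  # one false positive
print(combined_asymmetric_loss(missed_polyp, truth, beta=2.0))
print(combined_asymmetric_loss(spurious_polyp, truth, beta=2.0))
```

In this toy case the cross-entropy term is the same for both predictions, so the difference in the combined loss comes entirely from the asymmetric term, which charges more for the missed polyp pixel when β is greater than 1.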

D. INTEGRATING CRF AS RNN LAYER ON TOP OF THE POLYP SEGMENTATION NETWORKS
Using a fully connected Conditional Random Field (CRF) in conjunction with a deep segmentation model is a popular approach for semantic segmentation. The idea behind this is that the segmentation model plays the role of a feature extractor that produces a coarse segmentation, and the CRF then refines the resulting segmentation. The input of the CRF includes the segmentation probability produced by the network and the original input image. Unlike a convolution layer that implements local filters, the fully connected CRF considers every possible pair of pixels in the image. Each pair is called a clique. In the CRF graphical model, the clique is defined by the spatial distance and color distance between pixels. This makes segmentations produced by the CRF much sharper than those produced by the original segmentation model. Thus, the receptive field of a CRF is the entire image. However, when using a CRF to improve the quality of a segmentation model, the CRF has to be trained separately after the base model has been trained. Hence, in [13], the authors propose the CRF mean-field approximation as a Recurrent Neural Network (RNN) that can be added on top of a CNN so that the whole system is trained end-to-end.

TABLE 1. Loss functions used in recent polyp segmentation works, rated Yes/No against three criteria: handling class imbalance, the trade-off between precision and recall, and penalizing false positives and false negatives.

Method | Loss function | Ratings
DoubleU-Net [26] | Binary cross-entropy loss | No, No, Yes
CDED-net [39] | Combination of binary cross-entropy and Dice loss | Yes, No, Yes
Sánchez-Peralta et al. [40] | Combination of binary cross-entropy and Jaccard index | -, Yes, Yes
MED-Net [41] | Adaptive weighted loss function | No, Yes, Yes
A-DenseUNet [35] | Binary cross-entropy loss | No, No, Yes
PolypSegNet [32] | Modified Focal Tversky loss | No, Yes, No
Debesh Jha et al. [34] | Binary cross-entropy loss | No, No, Yes
Efficient U-Net multi-scale attention [42] | Binary cross-entropy loss | No, No, Yes
Proposed method | Combination of cross-entropy and asymmetric loss function | Yes, Yes, Yes
In the fully connected pairwise CRF model, the image segmentation problem is solved as an optimization problem by minimizing an energy function [12]:

$$E(y) = \sum_{i=1}^{N} \Phi(y_i^u) + \sum_{i<j} \Psi(y_i^u, y_j^v) \tag{11}$$

The term $\Phi(y_i^u)$ measures the cost of assigning label $u$ to pixel $i$, $N$ is the number of pixels in the image, and the pairwise potential $\Psi(y_i^u, y_j^v)$ measures the cost of assigning labels $u$ and $v$ jointly to pixels $i$ and $j$. It is defined as:

$$\Psi(y_i^u, y_j^v) = \mu(u, v) \sum_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j) \tag{12}$$

where $\mu(u, v)$ indicates the compatibility of labels $u$ and $v$, $K = 2$ is the number of Gaussian kernels, $k^{(m)}$ is a Gaussian kernel, $w^{(m)}$ is a weight for the Gaussian kernel, and $f_i$, $f_j$ denote the feature vectors of pixels $i$ and $j$, respectively. The two kernels are:

$$k^{(1)}(f_i, f_j) = \exp\left(-\frac{\|s_i - s_j\|^2}{2\theta_\alpha^2} - \frac{\|e_i - e_j\|^2}{2\theta_\beta^2}\right), \qquad k^{(2)}(f_i, f_j) = \exp\left(-\frac{\|s_i - s_j\|^2}{2\theta_\gamma^2}\right) \tag{13}$$

where $e_i$, $e_j$ denote the intensity and $s_i$, $s_j$ denote the spatial coordinates of pixels $i$ and $j$, respectively; $\theta_\alpha$, $\theta_\beta$, $\theta_\gamma$ are parameters of the Gaussian kernels. The fully connected CRF predicts the probability of assigning label $u$ to pixel $i$ ($q_i^u$) by minimizing Equation (11). $\{q_i^u\}$ can be calculated using a mean-field iteration algorithm, which is formulated as a Recurrent Neural Network. In this way, CNNs and the fully connected CRF are integrated into one deep network that can be trained using back-propagation [13].
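As an illustration of the pairwise term, the Python sketch below evaluates the two Gaussian kernels of the fully connected CRF formulation in [12] for a single pair of pixels and combines them into a pairwise potential. It is a sketch under assumptions rather than the authors' code: the kernel weights and the Potts-style label compatibility used here are placeholders (in the proposed model, w and µ are learned), and all function names are hypothetical.

```python
import math

# Kernel parameters as reported in this article.
THETA_ALPHA, THETA_BETA, THETA_GAMMA = 160.0, 3.0, 3.0


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def appearance_kernel(s_i, s_j, e_i, e_j):
    """Large when two pixels are close in both position and intensity."""
    return math.exp(-squared_distance(s_i, s_j) / (2 * THETA_ALPHA ** 2)
                    - squared_distance(e_i, e_j) / (2 * THETA_BETA ** 2))


def smoothness_kernel(s_i, s_j):
    """Large when two pixels are close in position, regardless of color."""
    return math.exp(-squared_distance(s_i, s_j) / (2 * THETA_GAMMA ** 2))


def potts_compatibility(u, v):
    """Placeholder label compatibility: penalize only differing labels."""
    return 0.0 if u == v else 1.0


def pairwise_potential(u, v, s_i, s_j, e_i, e_j, weights=(1.0, 1.0)):
    """Cost of jointly assigning label u to pixel i and label v to pixel j."""
    kernels = (appearance_kernel(s_i, s_j, e_i, e_j), smoothness_kernel(s_i, s_j))
    return potts_compatibility(u, v) * sum(w * k for w, k in zip(weights, kernels))


# Two neighboring pixels with nearly the same RGB color: assigning them
# different labels is expensive, while assigning the same label costs nothing.
print(pairwise_potential(0, 1, (10, 10), (10, 11), (120, 80, 60), (121, 80, 61)))
print(pairwise_potential(1, 1, (10, 10), (10, 11), (120, 80, 60), (121, 80, 61)))
# Same positions but very different colors: the appearance kernel vanishes,
# so a label change across this color edge is cheaper.
print(pairwise_potential(0, 1, (10, 10), (10, 11), (120, 80, 60), (20, 200, 10)))
```

The comparison between the first and last call shows why the CRF sharpens boundaries: label changes are discouraged between similar nearby pixels and tolerated across strong color edges.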
This article presents a deep learning model that integrates UNet and CRF-RNN for polyp segmentation. Figure 4 shows the network structure of CRF-RNN in our proposed model. In Figure 4, G1 and G2 are two gating functions defined in terms of the following quantities: $Q = \{q_i^u\}$, where $q_i^u$ denotes the probability of assigning label $u$ to pixel $i$; $Q_{in}$ denotes the input of one mean-field iteration; $Q_{out}$ denotes the output $Q$ of one mean-field iteration; $Q_{final}$ denotes the final prediction results of CRF-RNN; $P$ denotes the output of UNet; $P_{norm}$ denotes $P$ after the softmax operation; $t$ represents the $t$-th mean-field iteration; and $T$ is the total number of mean-field iterations. A mean-field iteration [13] is considered as a stack of CNN layers that includes these steps: Message Passing, Re-Weighting, Compatibility Transform, Adding Unary Potentials, and Normalization. In our study, the term $\Phi(y_i^u)$ in Equation (11) is the output of UNets, and the term $\Psi(y_i^u, y_j^v)$ is computed based on the feature vectors of pixels $i$ and $j$, which are derived from image features such as spatial location and RGB values. The parameters of the Gaussian kernels are set to $\theta_\alpha = 160$, $\theta_\beta = 3$, $\theta_\gamma = 3$, while $w$ and $\mu$ are learned in the training phase. The RNN iteration count $T$ is set to $T = 10$ at test time and $T = 5$ at training time, according to [13].
By using a UNet that fully integrates CRF-RNN as a layer on top, and making the whole network trainable end-to-end with the back-propagation algorithm, we can improve the polyp segmentation accuracy without offline post-processing for object delineation. The experimental results in Section V-A4 show a considerable increase in Dice score when using the CRF-RNN layer on top of all experimented networks.
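The five mean-field steps named above can be traced on a tiny example. The Python sketch below follows the general CRF-RNN recipe of [13] with a naive pairwise loop over a one-dimensional strip of pixels; it is a didactic sketch, not the layer used in the proposed model. The fixed kernel weights, the Potts compatibility matrix, the use of raw network scores as unary terms, and the gating (softmax of the network output at t = 0, the previous iteration's output afterwards) are assumptions taken from that general recipe, and all names are hypothetical.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_kernels(feat_i, feat_j, theta_alpha=160.0, theta_beta=3.0, theta_gamma=3.0):
    """Appearance and smoothness kernels for two (position, intensity) features."""
    (s_i, e_i), (s_j, e_j) = feat_i, feat_j
    d_s = sum((a - b) ** 2 for a, b in zip(s_i, s_j))
    d_e = sum((a - b) ** 2 for a, b in zip(e_i, e_j))
    appearance = math.exp(-d_s / (2 * theta_alpha ** 2) - d_e / (2 * theta_beta ** 2))
    smoothness = math.exp(-d_s / (2 * theta_gamma ** 2))
    return appearance, smoothness


def mean_field_iteration(q_in, logits, feats, weights, mu):
    n, n_labels = len(q_in), len(q_in[0])
    q_out = []
    for i in range(n):
        # Steps 1 and 2: message passing from every other pixel, re-weighted per kernel.
        weighted = [0.0] * n_labels
        for j in range(n):
            if j == i:
                continue
            k = sum(w * kv for w, kv in zip(weights, gaussian_kernels(feats[i], feats[j])))
            for label in range(n_labels):
                weighted[label] += k * q_in[j][label]
        # Step 3: compatibility transform.
        penalty = [sum(mu[label][other] * weighted[other] for other in range(n_labels))
                   for label in range(n_labels)]
        # Steps 4 and 5: add unary scores, then normalize with softmax.
        q_out.append(softmax([logits[i][label] - penalty[label] for label in range(n_labels)]))
    return q_out


def crf_rnn(logits, feats, weights=(1.0, 1.0), mu=((0.0, 1.0), (1.0, 0.0)), num_iterations=5):
    q = [softmax(row) for row in logits]      # t = 0: start from the normalized network output
    for _ in range(num_iterations):           # t > 0: feed back the previous output
        q = mean_field_iteration(q, logits, feats, weights, mu)
    return q                                  # output after the last iteration


# Five pixels in a row with the same intensity; scores are [background, polyp].
# The network is unsure about the middle pixel and leans towards background.
feats = [((float(x),), (100.0,)) for x in range(5)]
logits = [[0.0, 3.0], [0.0, 3.0], [0.5, 0.0], [0.0, 3.0], [0.0, 3.0]]
refined = crf_rnn(logits, feats, num_iterations=5)
print([round(q[1], 3) for q in refined])  # polyp probability per pixel after refinement
```

In this toy strip, the middle pixel is pulled towards the polyp label by its similar-looking neighbors, which is the kind of delineation improvement the CRF-RNN layer is meant to provide. A practical layer replaces the quadratic pairwise loop with efficient high-dimensional filtering.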

IV. EXPERIMENTAL METHOD

A. DATASET
Several publicly available benchmark datasets are used for training and evaluating the proposed method. Example images from the datasets are shown in Figure 5. The datasets are summarized below:
- The CVC-ClinicDB dataset [15] consists of 612 images from 31 different types of polyps along with the corresponding ground truth masks of the polyp regions. The ground truth masks were manually annotated by experts. All images have a resolution of 384 × 288.
- The Kvasir-SEG dataset [14], released by Simula Research Laboratory, includes 1000 polyp images with sizes varying from 332 × 482 to 1920 × 1072 and their corresponding ground truth masks, manually annotated by expert endoscopists from Oslo University Hospital (Norway).
- The ETIS-Larib dataset [16] contains 36 different types of polyps in 196 images with a resolution of 1225 × 966. These images were extracted from colonoscopy videos, and the ground truth masks were annotated by experts. This dataset was provided as the test set in the 2015 MICCAI automatic polyp detection sub-challenge.
- The CVC-ColonDB dataset [17] was contributed by the Machine Vision Group (MVG). It consists of 300 polyp images, extracted from 15 video sequences, and their corresponding pixel-level annotated polyp masks. The images have a resolution of 574 × 500.
These datasets were obtained with different imaging systems. Each dataset contains binary ground truth masks that indicate the location of the polyps in each image. All ground truth polyp regions were annotated by expert endoscopists from the associated clinical institutions. A dataset may contain similar image frames. However, the datasets differ in the number of images, image resolution, availability, capture devices, and the accuracy of the segmentation masks. In this work, we conduct experiments with different training and testing scenarios to compare the performance of the proposed model with state-of-the-art (SOTA) approaches.

B. DATA AUGMENTATION
One of the challenges in training polyp segmentation models is the insufficient amount of training data. Because endoscopy involves a moving camera and color calibration is not consistent, the appearance of endoscopy images changes significantly across laboratories. The data augmentation step extends the endoscopy images into a space that covers these variations. Augmenting the training data also reduces over-fitting. Figure 6 shows examples of the data augmentation methods applied to the original polyp image (a). The augmentation methods used in our work include vertical flipping, horizontal flipping, random rotation between -10 and 10 degrees, random scaling from 0.5 to 1.5, random shearing between -5 and 5 degrees, random Gaussian blurring with a sigma of 3.0, random contrast normalization by a factor of 1 to 1.5, random brightness change by a factor of 1 to 1.5, and random cropping and padding by 0-5% of the height and width.
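As an illustration, the augmentation ranges above can be written down as a small parameter sampler. This is a sketch using only the Python standard library; the dictionary keys, the function names, and the 0.5 probabilities for flipping and blurring are assumptions made for the example and are not specified in the text. The sketch also makes one practical point explicit: geometric transforms must be applied identically to an image and its mask, whereas photometric transforms (blur, contrast, brightness) apply to the image only.

```python
import random

# Ranges copied from the text; the key names are illustrative.
AUGMENTATION_RANGES = {
    "rotation_deg": (-10.0, 10.0),
    "scale": (0.5, 1.5),
    "shear_deg": (-5.0, 5.0),
    "contrast_factor": (1.0, 1.5),
    "brightness_factor": (1.0, 1.5),
    "crop_pad_fraction": (0.0, 0.05),
}
GAUSSIAN_BLUR_SIGMA = 3.0

# Transforms that move pixels and therefore must also be applied to the mask.
GEOMETRIC_KEYS = {
    "rotation_deg", "scale", "shear_deg", "crop_pad_fraction",
    "flip_vertical", "flip_horizontal",
}

def sample_augmentation(rng, flip_prob=0.5, blur_prob=0.5):
    """Draw one random set of augmentation parameters (probabilities are assumed)."""
    params = {name: rng.uniform(lo, hi) for name, (lo, hi) in AUGMENTATION_RANGES.items()}
    params["flip_vertical"] = rng.random() < flip_prob
    params["flip_horizontal"] = rng.random() < flip_prob
    params["blur_sigma"] = GAUSSIAN_BLUR_SIGMA if rng.random() < blur_prob else 0.0
    return params

def mask_parameters(params):
    """Keep only the geometric parameters, which are shared by image and mask."""
    return {name: value for name, value in params.items() if name in GEOMETRIC_KEYS}

rng = random.Random(0)
params = sample_augmentation(rng)
mask_params = mask_parameters(params)
```

The sampled `params` would then be passed to whichever image library performs the actual warping and filtering, with `mask_params` applied to the ground truth mask.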

C. EVALUATION METRICS
For the evaluation of polyp segmentation, we use the Dice coefficient as the main metric. Furthermore, to provide a general view of the effectiveness of our method, we also employ intersection over union (IoU), recall (Re), which is also known as sensitivity, and precision (Pre). The evaluation metrics are calculated as follows:

$$\mathrm{Dice} = \frac{2\,|PR \cap GT|}{|PR| + |GT|} \tag{15}$$

$$\mathrm{IoU} = \frac{|PR \cap GT|}{|PR \cup GT|} \tag{16}$$

$$\mathrm{Re} = \frac{|TP|}{|TP| + |FN|} \tag{17}$$

$$\mathrm{Pre} = \frac{|TP|}{|TP| + |FP|} \tag{18}$$

where PR represents the prediction result, GT is the ground truth, TP is the set of true positives, FP is the set of false positives, and FN is the set of false negatives. The metrics are computed for every image and then averaged over all images in the dataset.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2021.3129480, IEEE Access
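The per-image computation and the dataset-level averaging can be sketched as follows. This is a minimal NumPy example; the function names and the small `eps` term (which makes an empty prediction on an empty ground truth score 1 instead of dividing by zero) are assumptions of the sketch rather than details given in the text.

```python
import numpy as np

def segmentation_metrics(pred, gt, eps=1e-7):
    """Dice, IoU, recall, and precision for one pair of binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()    # polyp pixels found
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as polyp
    fn = np.logical_and(~pred, gt).sum()   # polyp pixels missed
    return {
        "dice": (2 * tp + eps) / (2 * tp + fp + fn + eps),
        "iou": (tp + eps) / (tp + fp + fn + eps),
        "recall": (tp + eps) / (tp + fn + eps),
        "precision": (tp + eps) / (tp + fp + eps),
    }

def dataset_metrics(preds, gts):
    """Compute the metrics per image, then average over all images."""
    per_image = [segmentation_metrics(p, g) for p, g in zip(preds, gts)]
    return {name: float(np.mean([m[name] for m in per_image])) for name in per_image[0]}

# One toy image: TP = 1, FP = 1, FN = 1.
pred = [[1, 1, 0], [0, 0, 0]]
gt = [[1, 0, 0], [1, 0, 0]]
scores = segmentation_metrics(pred, gt)
mean_scores = dataset_metrics([pred, gt], [gt, gt])
```

Note that $|PR| + |GT| = 2|TP| + |FP| + |FN|$, so the Dice expression in the code matches the set form of the definition.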

D. TRAINING SETUP
The proposed models are implemented using Keras with the TensorFlow backend. All models have been implemented and trained on a PC with a GeForce GTX 1080 Ti GPU. The encoders are initialized with weights pre-trained on ImageNet. The encoders are unfrozen, and the entire network is updated with the Adam optimizer using a learning rate of 1e-4 and a maximum of 500 epochs. The proposed loss function, the combined asymmetric loss, is used for training the models. The training data is divided into mini-batches of size four. The model from the epoch with the highest Dice score on the validation set is used as the final model.
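The setup above can be summarized as a small configuration object together with the model-selection rule. This is a framework-agnostic sketch: `TrainConfig`, its field names, and `select_best_epoch` are hypothetical names introduced for illustration, and the tie-breaking rule (earliest epoch wins) is an assumption.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in the text; field names are illustrative."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    max_epochs: int = 500
    batch_size: int = 4
    encoder_weights: str = "imagenet"
    freeze_encoder: bool = False

def select_best_epoch(val_dice_per_epoch):
    """Return (epoch_index, dice) for the highest validation Dice.

    Ties are resolved in favor of the earliest epoch (an assumption).
    """
    best = max(range(len(val_dice_per_epoch)), key=val_dice_per_epoch.__getitem__)
    return best, val_dice_per_epoch[best]

config = TrainConfig()
best_epoch, best_dice = select_best_epoch([0.80, 0.91, 0.89, 0.91])
```

In a Keras training loop, the same rule is typically realized with a checkpoint callback that monitors the validation Dice and keeps only the best weights.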

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. ABLATION STUDY
To analyze the effect of each component of the proposed model on segmentation performance, we perform an ablation study with model variants. To keep all ablation experiments comparable, we conduct them on the CVC-ClinicDB dataset. The dataset is split 80/10/10 for training, validation, and testing.

1) Performance evaluation on CNN pre-trained encoders
We first evaluate UNets with different encoders. Several encoders are selected to evaluate their performance in polyp segmentation: the EfficientNet family from B0 to B7, MobileNetV2, and ResNet variants, including ResNet18, ResNet34, and ResNet101.

2) The effect of combined asymmetric loss function
Next, we evaluate the effect of the proposed loss function on the models' performance and compare it with basic loss functions for polyp segmentation. We conducted experiments using three backbones, UNet-MobileNetV2, UNet-ResNet101, and UNet-EfficientNetB7; the models are called UNet1, UNet2, and UNet3, respectively. We trained these models using binary cross-entropy loss (BCE loss), Dice loss, asymmetric loss, and our proposed loss, i.e., combined asymmetric loss. The hyperparameters of the loss function are set to α = 0.4 and β = 1.6, which gave the best results. The improvements in the performance metrics are reported in Table 3. This table demonstrates that our proposed loss function achieves a better balance between precision and recall than the other loss functions, which increases the performance of the models trained with it. The largest improvements are obtained relative to binary cross-entropy loss: UNet1 (MobileNetV2 encoder) improves Dice by 6.23% and IoU by 4.75%; UNet2 (ResNet101 encoder) improves Dice by 3.37% and IoU by 1.6%; UNet3 (EfficientNetB7 encoder) improves Dice by 3.37% and IoU by 1.6%. Although the precision may decrease, the proposed loss function trades precision for recall in a way that increases the Dice score. Figure 7 illustrates the Dice scores of models trained with cross-entropy loss, Dice loss, asymmetric loss, and the proposed loss. The models trained with the proposed loss function achieve higher Dice scores than the others.
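The precision-recall trade-off described above can be illustrated with a small numerical sketch. The exact form of the combined asymmetric loss is defined earlier in the paper and is not repeated in this section, so the code below is an assumption-laden stand-in: it adds pixel-wise binary cross-entropy to one minus a Tversky-style similarity in which α weights false positives and β weights false negatives. The function name and this particular weighting are hypothetical; only the values α = 0.4 and β = 1.6 come from the text.

```python
import numpy as np

def combined_asymmetric_loss(p, g, alpha=0.4, beta=1.6, eps=1e-7):
    """BCE plus an asymmetric similarity term (assumed Tversky-style form).

    p: predicted polyp probabilities, g: binary ground truth (same shape).
    """
    p = np.clip(np.asarray(p, dtype=float).ravel(), eps, 1.0 - eps)
    g = np.asarray(g, dtype=float).ravel()
    bce = -np.mean(g * np.log(p) + (1.0 - g) * np.log(1.0 - p))
    tp = np.sum(p * g)              # soft true positives
    fp = np.sum(p * (1.0 - g))      # soft false positives
    fn = np.sum((1.0 - p) * g)      # soft false negatives
    similarity = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return bce + (1.0 - similarity)

g = [1, 1, 1, 1, 0, 0, 0, 0]
p_miss = [0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]    # two polyp pixels missed
p_extra = [0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.1, 0.1]   # two background pixels added
loss_miss = combined_asymmetric_loss(p_miss, g)
loss_extra = combined_asymmetric_loss(p_extra, g)
```

Both predictions make two pixel errors and have the same cross-entropy, but with β > α the prediction that misses polyp pixels receives the larger loss. Under this assumed form, training is pushed toward higher recall at some cost in precision, which is consistent with the behavior reported in Table 3.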
Moreover, Figure 8 compares the effects of the proposed loss function (combined asymmetric loss) and the cross-entropy loss function on the network's learning progress. The validation loss fluctuates less during training when the model is trained with our proposed loss function than when it is trained with the cross-entropy loss function.

3) The effect of transfer learning
This work adopts a transfer learning approach with the UNet architecture for polyp segmentation by using CNN models pre-trained on the ImageNet dataset as the encoder. To evaluate the effect of transfer learning, we train UNet from scratch and compare the results with those of the UNet trained with transfer learning. We conducted experiments using six backbones: UNet-MobileNetV2, UNet-ResNet50, UNet-ResNet101, UNet-EfficientNetB5, UNet-EfficientNetB6, and UNet-EfficientNetB7. The comparison of polyp segmentation performance between the UNets trained from scratch and those trained with transfer learning is reported in Table 4. The table demonstrates that the models trained with transfer learning perform significantly better than those trained from scratch. In addition, the deeper the model, the greater the improvement.

4) The effect of Conditional Random Fields as Recurrent Neural Network layer
We conducted experiments to test whether a CRF-RNN layer on top of the polyp segmentation networks improves the segmentation quality. These experiments compare the performance with and without a CRF-RNN layer on top of the segmentation network. The underlying network architectures are UNets with different backbones, including UNet-MobileNetV2, UNet-ResNet101, and UNet-EfficientNetB7. The results are presented in Table 5. This table shows a considerable increase in Dice score when a CRF-RNN layer is used on top of all tested networks. More specifically, UNet-EfficientNetB7 with a CRF-RNN layer on top achieves the largest improvement of 1.83% in Dice score, and UNet-MobileNetV2 with CRF-RNN achieves the smallest improvement of 0.92%, as can be calculated from the average metrics in Table 5. These improvements demonstrate the advantage of using a CRF-RNN layer on top of segmentation networks. Moreover, Figure 9 compares the Dice scores of UNets with and without a CRF-RNN layer on top. This figure also shows improvements in Dice score for all tested networks when a CRF-RNN layer is used. Finally, some examples of segmentations produced by the model variants are depicted in Figure 10. The figure shows that the UNet model with the EfficientNetB7 backbone and CRF-RNN, trained with the combined asymmetric loss function, recognizes the polyp mask more accurately than the other models. It also shows that our model is robust: it detects polyps in challenging images (e.g., blurry or low-quality images in Figure 10(a) and 10(b), small polyps in Figure 10(e)) and performs well on easier images (e.g., Figure 10(c) and 10(d)).

B. COMPARISON TO EXISTING METHODS
This section compares our proposed CRF-EfficientUNet with several recent SOTA methods for polyp segmentation. Results for the compared models are taken from their respective papers. Based on the preceding ablation study, we select UNet-EfficientNetB7 with the combined asymmetric loss function and the CRF-RNN layer as the model for this comparison. We then conduct experiments with different training and testing data scenarios. The hyperparameters of the asymmetric loss function are chosen empirically, with α = 0.4, β = 1.6 on the CVC-ClinicDB dataset and α = 0.3, β = 1.3 on the Kvasir-SEG dataset. We compare the proposed method with existing methods in terms of learning ability on the same dataset and generalization capability across datasets.

1) Results on the same datasets
We conduct two experiments to validate the model's learning ability when the training and test sets come from the same dataset. The first experiment uses the CVC-ClinicDB dataset, and the second uses the Kvasir-SEG dataset. These experiments are conducted with a five-fold cross-validation scheme, in which four folds are used for training and the remaining fold is used to evaluate performance. The training and evaluation processes are repeated five times, and the mean values of the evaluation metrics are reported. The results are compared with several SOTA methods. Table 6 and Table 7 show the quantitative comparisons on CVC-ClinicDB and Kvasir-SEG, respectively. As shown in these tables, our method achieves the highest Dice score on both datasets and the highest IoU on CVC-ClinicDB. Specifically, Table 6 shows that our proposed method achieves the best performance on the CVC-ClinicDB dataset with a Dice of 95.12% and an IoU of 91.85%, outperforming the second-best method, ResUNet++ CRF, by 3.09% in Dice and 2.87% in IoU. In Table 7, on the Kvasir-SEG dataset, our proposed method also obtains the highest Dice of 92.72% and the second-highest IoU of 87.69% (the highest is Efficient U-Net multi-scale attention with an IoU of 88.69%). These results demonstrate that our model has a strong learning ability to segment polyps effectively.

[Table 7 fragment: PraNet [25]: Dice 89.8, IoU 84.0; Efficient U-Net multi-scale attention [42]: Dice 87.85 ± 0.11, IoU 88.69 ± 0.63; recall and precision n/a.]
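The five-fold scheme described above can be sketched with the standard library alone. The function names, the shuffling seed, and the `evaluate_fold` callback (which would train on four folds and return a metric on the fifth) are assumptions of this sketch; the actual fold assignment used in the experiments is not specified in the text.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]    # k nearly equal folds
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, evaluate_fold, k=5, seed=0):
    """Average the metric returned by evaluate_fold(train_idx, test_idx) over k folds."""
    scores = [evaluate_fold(tr, te) for tr, te in k_fold_indices(n_samples, k, seed)]
    return sum(scores) / len(scores)

# Example with the 612 images of CVC-ClinicDB.
splits = list(k_fold_indices(612, k=5, seed=0))
```

Every image appears in exactly one held-out fold, so each image contributes to the reported mean exactly once.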

2) Results on cross-dataset
We carry out experiments with training and testing across different datasets to measure the generalization capability of the proposed method. Since different polyp datasets have different image properties and feature distributions, a model needs to generalize well to perform well. In this section, we train models on CVC-ClinicDB, Kvasir-SEG, and a mixture of Kvasir-SEG and CVC-ClinicDB, respectively, and use the other independent datasets, ETIS-Larib and CVC-ColonDB, for testing. We then compare the results with current works that use the same training and testing data scenarios. The results are reported in comparison tables, where 'n/a' denotes unavailable results and '*' indicates results generated using the released code.

First, we train the model on the CVC-ClinicDB dataset. [...] [32], which is the second-highest method. In addition, Figure 11 presents examples of segmentations produced by the proposed model with challenging [...]

Next, we train the model on the Kvasir-SEG dataset. Table 9 shows the results and the comparison with other polyp segmentation models. As in the previous experiment with CVC-ClinicDB, our proposed method outperforms all other methods on both test sets. On the ETIS-Larib test set, we obtain the best segmentation performance with 78.53% Dice and 66.95% IoU. On the CVC-ColonDB test set, the proposed method achieves the best results with 85.59% Dice, 76.19% IoU, 88.07% recall, and 86.78% precision, outperforming the second-highest method, ResUNet++ TTA [33], by 29.63%.

Finally, we use 1450 images, comprising 900 images from Kvasir-SEG and 550 images from CVC-ClinicDB, to train the models. Table 9 presents the cross-dataset generalizability of the methods. The table shows that our proposed method achieves the highest results on both test sets, with 78.35% Dice on ETIS-Larib and 86.04% Dice on CVC-ColonDB. We have compared the results with existing works that used the same training and test data scenarios. Our method outperforms all of them in both Dice and IoU. In particular, on the CVC-ColonDB test set, our Dice score is 15.14% higher than that of the second-highest method, PraNet [25].

[Table fragment, ETIS-Larib test set: UNet [8]* 57.25; UNet++ [27]* 55.12; ResUNet++ [28] 40. Further rows without a recoverable test-set label: UNet++ [27]* 61.85; ResUNet++ [28] 54; PolypSegNet [32] 74.7. The remaining columns are n/a.]
In this section, we conducted experiments to measure the generalization capability of the proposed method. Generalization capability reflects how useful a model is across different datasets coming from different hospitals. A model that generalizes well could be a significant step toward developing a good clinical system. The proposed method outperforms all compared SOTA methods on the independent test sets in terms of Dice. These results indicate that the proposed method generalizes better than existing methods and can be a compelling choice for practical applications with considerable data variations.

VI. CONCLUSION
This paper proposes CRF-EfficientUNet, an improved UNet framework for polyp segmentation. We present a novel UNet-based architecture that extends UNet with the EfficientNetB7 encoder and a CRF-RNN layer on top. A novel loss function is proposed for training CRF-EfficientUNet to address the unbalanced data problem and achieve better performance. In addition, we use transfer learning to train and validate the proposed method on various datasets, i.e., Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB, and ETIS-Larib, with different training and test data scenarios. Moreover, we assess the generalization capability of the proposed method by training the model on Kvasir-SEG and CVC-ClinicDB and testing it on other independent datasets: ETIS-Larib and CVC-ColonDB. The proposed method outperforms all compared SOTA methods both on the same dataset and across datasets. These results indicate that our proposed method has better generalizability and learning ability than the others. In the future, we will focus on reducing the network size while improving performance, to build a model that is an effective choice for practical automated polyp segmentation. In addition, the proposed method can be converted to 3D models and applied to other screening modalities such as CT and MRI.

[Table fragment: ResUNet++ [28]: n/a, n/a, 49.74, 68.0; ResUNet++ TTA [34]: n/a, n/a, 50.84, 68.59; PraNet [25]: 62 (row truncated). The column labels are not recoverable.]