
To decrease the colon polyp miss rate during colonoscopy, a real-time detection system with high accuracy is needed. Recently, there have been many efforts to develop models for real-time polyp detection.


Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer in men and women globally and the second leading cause of cancer-related death overall (Bray et al., 2018). CRC most often begins as growths of glandular tissue in the mucosal layer of the bowel. Most cases of CRC start as non-cancerous growths called polyps. However, if polyps are left untreated, they may develop into malignant and potentially life-threatening cancer (Arnold et al., 2017). Thus, early detection and removal of precancerous polyps in the colon are crucial for prevention.
Colonoscopy is the most sensitive method for colon screening. It is effective for detecting colonic lesions and polyps of any size and allows removal of lesions during the procedure. However, colonoscopy is an operator-dependent procedure and prone to human error: the polyp miss rate is reported to be as high as 22%-28% in certain cases (Leufkens et al., 2012). A number of supportive systems have been proposed to help clinicians detect polyps and tumors during colonoscopy, thus reducing the polyp miss rate and optimizing the screening procedure.
Deep learning-based detection models which adopt pre-trained deep CNN networks have been successfully applied for automatic polyp detection (Shin et al., 2018; Sornapudi et al., 2019; Wang et al., 2019a; 2019b; Zhang et al., 2019). Most of these models are slow (Yu et al., 2016; Pogorelov et al., 2018; Bernal et al., 2017; Shin et al., 2018; Kang and Gwak, 2019) or have difficulty detecting ambiguous types of polyps such as flat-shaped or small polyps (Bernal et al., 2012; Tajbakhsh et al., 2013). A highly accurate supportive system may be crucial to help endoscopists reduce the polyp miss rate during colonoscopy. Moreover, a detection system can only be used if it is fast enough for real-time deployment. Most studies have focused on improving detection performance rather than on real-time aspects. In recent years, researchers have become increasingly interested in developing real-time polyp detection systems (Zhang et al., 2018; Mohammed et al., 2018; Wang et al., 2019a; 2019b; Zhang et al., 2019; Liu et al., 2019).
In the colon, there are many polyp-like structures with strong edges, including colon folds, blood vessels, specular lights, luminal regions, air bubbles, etc. This is one of the main challenges in the automatic polyp detection task (Shin et al., 2018). When a model is trained to segment polyps from the background, binary masks are used as the ground-truth images, and these have very strong outer edges. During training, the binary masks may lead the model to learn edges as one of the strongest features for distinguishing polyps. Therefore, such models tend to produce many false positives (FP) (Shin et al., 2018). Most of the CNN-based encoder-decoder models commonly used for object segmentation can be implemented for real-time applications (Ronneberger et al., 2015), because they are designed to predict a binary mask in a single-shot feed-forward fully convolutional neural network (F-CNN), meaning there is no need for a second stage or anchor proposals (Ren et al., 2015; Liu et al., 2016). However, these models can only predict a pixel-wise confidence value, and a threshold is applied to produce the final output binary masks. For object detection, a more explicit mechanism is needed to predict the confidence value for the whole object (Ronneberger et al., 2015). The confidence value is important because a threshold can be set on the detection confidence to eliminate FP outputs, which tend to have low detection confidence values (Shin et al., 2018).
In this paper, we aim to use CNN-based encoder-decoder network variants for polyp detection. To tackle the two problems discussed above, we propose to use two-dimensional (2D) Gaussian masks as the ground-truth masks for polyp regions instead of the binary masks normally used to train these types of CNN networks for object segmentation. In this way, we force the CNN networks to predict 2D Gaussian shapes for polyp regions. We propose that 2D Gaussian masks are more efficient than binary masks at reducing the impact of the outer edges during training, because a 2D Gaussian shape has smaller values on the tails compared to the values around the mean. This property gives less importance to the outer edges and forces the models to learn surface patterns more efficiently than binary masks do. The strength of the predicted 2D Gaussian shapes can be used as the confidence value of the detection to further reduce FP outputs.

Polyp detection as a 2D Gaussian shape
Fig. 1 presents our approach to detecting polyps in a one-shot manner. Instead of generating a binary output, we force a CNN-based encoder-decoder network to predict a 2D Gaussian shape, Ŷ(x, y). The output 2D Gaussian shape Ŷ(x, y) has the same resolution as the input image I(x, y), i.e., downsampling is not applied to the ground-truth mask Y(x, y) during training. In contrast to (Zhou et al., 2019), this elimination of downsampling allows us to ignore: • computation of the loss for a local offset prediction, as there is no need to recover the discretization error; • regression for the polyp size, as it is calculated from the predicted 2D Gaussian shape Ŷ(x, y), which has the same size as the input image I(x, y), using the size-adaptive standard deviations σx and σy (Law and Deng, 2018; Zhou et al., 2019) described in Section 2.4.

Binary masks to 2D Gaussian masks conversion
Usually, for a dataset of polyp images, binary masks f(x, y) ∈ {0, 1}^(W×H×1) are provided as the ground-truth images to indicate the location of the polyps. These binary masks are drawn and confirmed by expert clinicians. In the masks, white pixels (1's) correspond to the polyp regions whereas black pixels (0's) correspond to the background. Fig. 2(b) shows a binary mask provided for the polyp shown in Fig. 2(a). We use a 2D elliptical Gaussian kernel, expressed in Eq. (1), to convert all the binary masks f(x, y) in the training dataset to 2D Gaussian masks: g(x, y) = A exp(−(a(x − x_o)² + 2b(x − x_o)(y − y_o) + c(y − y_o)²)), (1) where A is the amplitude located at the center of mass, (x_o, y_o), of the binary image f(x, y). To rotate the output 2D Gaussian masks according to the orientation, θ, of the polyp mask in f(x, y), we set a = cos²θ/(2σx²) + sin²θ/(2σy²), b = −sin(2θ)/(4σx²) + sin(2θ)/(4σy²), and c = sin²θ/(2σx²) + cos²θ/(2σy²), where σx and σy are the polyp size-adaptive standard deviations (Law and Deng, 2018; Zhou et al., 2019). We compute the orientation, θ, of the mask in f(x, y) from its central second-order moments as θ = ½ arctan(2μ₁₁/(μ₂₀ − μ₀₂)). Similar to (Zhou et al., 2019), we set the coefficient A = 1 and use it as the confidence value of the detection at inference time. If two Gaussians overlap, we take the element-wise maximum (Cao et al., 2017). Fig. 2(c) shows a 2D Gaussian mask obtained from Fig. 2(b) using the equations presented above.
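As a concrete illustration, the conversion above can be sketched in NumPy. The size-adaptive standard deviations here are a simple assumption (the square roots of the mask's central second-order moments), not necessarily the exact choice of (Law and Deng, 2018; Zhou et al., 2019), and the helper name `binary_to_gaussian` is ours, not from the paper:

```python
import numpy as np

def binary_to_gaussian(mask, A=1.0):
    """Convert a binary polyp mask to a rotated 2D Gaussian mask."""
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    # Center of mass of the binary region.
    x0, y0 = xs.mean(), ys.mean()
    # Central second-order moments give orientation and spread.
    mu20 = ((xs - x0) ** 2).mean()
    mu02 = ((ys - y0) ** 2).mean()
    mu11 = ((xs - x0) * (ys - y0)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    sx, sy = np.sqrt(mu20) + 1e-6, np.sqrt(mu02) + 1e-6
    # Coefficients of the rotated elliptical Gaussian, Eq. (1).
    a = np.cos(theta) ** 2 / (2 * sx ** 2) + np.sin(theta) ** 2 / (2 * sy ** 2)
    b = -np.sin(2 * theta) / (4 * sx ** 2) + np.sin(2 * theta) / (4 * sy ** 2)
    c = np.sin(theta) ** 2 / (2 * sx ** 2) + np.cos(theta) ** 2 / (2 * sy ** 2)
    X, Y = np.meshgrid(np.arange(W), np.arange(H))
    g = A * np.exp(-(a * (X - x0) ** 2 + 2 * b * (X - x0) * (Y - y0)
                     + c * (Y - y0) ** 2))
    return g

# A toy 20x20 mask with a square "polyp" region.
m = np.zeros((20, 20))
m[5:15, 5:15] = 1
g = binary_to_gaussian(m)
print(g.max(), g[0, 0])  # peak near 1.0 at the center, ~0 in the corner
```

When two polyps are present, the resulting Gaussians would be merged with an element-wise maximum, as stated in the text.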

F-CNN models for polyp detection
To prove our concept, we evaluate several F-CNN-based encoder-decoder models: UNet (Ronneberger et al., 2015), EncDec, Hourglass (Newell et al., 2016), MDeNet, and MDeNetplus (our proposed model). We compare these models on two tasks: 1) polyp segmentation, using binary masks as the ground-truth images for training, and 2) polyp detection, using 2D Gaussian masks as the ground-truth images to force the models to predict 2D Gaussian shapes for polyp regions.
Typically, these models consist of two parts: 1) a contracting path (the encoder) to capture context, and 2) an expanding path (the decoder(s)) that enables precise localization (see Fig. 1). The encoder follows the typical architecture of a CNN, with alternating convolution and pooling operations to progressively downsample the resolution and increase the depth of the feature maps at every layer. In this study, we use ResNet34 (He et al., 2016) pre-trained on the ImageNet database (Deng et al., 2009) as the encoder network for all the models. The decoder(s) gradually up-sample the feature maps at each layer to increase their resolution and predict an output of the same size as the input RGB image, I(x, y).
UNet (Ronneberger et al., 2015): UNet was developed for medical image segmentation and has proven very useful when only a limited amount of data is available for training. This network combines up-sampled feature maps in the decoder part with the corresponding high-resolution feature maps from the encoder part via skip connections. This feature combination enables precise localization (Ronneberger et al., 2015). For our UNet model, we use AlbuNet34, proposed by Shvets et al. (2018) for angiodysplasia detection.
EncDec: For the Encoder-Decoder (EncDec) model, we use the same architecture as AlbuNet34 but without the skip connections.
Hourglass: To build our Hourglass model, we stack two AlbuNet34 models. The hourglass network is well known for its strong key-point estimation performance (Newell et al., 2016).
MDeNet: MDeNet was proposed for semi-automatic polyp annotation. It consists of an encoder and multiple decoder paths. As in the other models, ResNet34 is used as the encoder to extract different levels of features. At each layer of the encoder, the extracted features are decoded by a separate decoder. The multiple decoders are meant to increase contextual and semantic information by utilizing features from different scales and receptive fields, which helps to segment polyps of different sizes more precisely (Pinheiro et al., 2016; Yu et al., 2018). The final output is predicted from the outputs of the decoders after concatenating them into a single layer.
MDeNetplus: Our MDeNetplus, shown in Fig. 1, is similar to MDeNet with some modifications. Unlike MDeNet, MDeNetplus has feedback connections from the decoders of deeper layers to the decoders of previous layers. The feedback connections sum the activation maps of corresponding layers of different decoders. We prefer summing the activations rather than concatenating them into a single layer in order to build a smaller network with fewer parameters, helping to make the network suitable for real-time implementation. This model is based on the concept of layer aggregation: to acquire rich representations that span levels from low to high, scales from small to large, and resolutions from fine to coarse, the feature hierarchy is merged iteratively and hierarchically, yielding a model with better accuracy (Yu et al., 2018).
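The design choice of summing rather than concatenating decoder activations can be illustrated with a minimal NumPy sketch; the array shapes are illustrative, not the actual layer sizes of MDeNetplus:

```python
import numpy as np

# Two decoders emit activation maps with the same layout (C, H, W).
dec_a = np.random.rand(64, 32, 32)
dec_b = np.random.rand(64, 32, 32)

# Concatenation doubles the channel count, so every subsequent
# convolution needs twice as many input weights...
concat = np.concatenate([dec_a, dec_b], axis=0)

# ...whereas the element-wise sum used by the feedback connections
# keeps the channel count (and hence the parameter count) unchanged.
summed = dec_a + dec_b

print(concat.shape, summed.shape)  # (128, 32, 32) (64, 32, 32)
```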

From 2D Gaussian shape prediction to bounding boxes and confidence values
At inference time, we use the peaks of the predicted 2D Gaussian shapes as the confidence values of the detections. We calculate the two size-adaptive standard deviations (σx and σy) to obtain the size of the detection. Fig. 3 shows an example in which the 2D Gaussian shape obtained using Eq. (1) is projected back onto the original image as a bounding box calculated from σx and σy, together with a confidence value (the coefficient A). This process allows us to generate all outputs directly from the predicted 2D Gaussian shapes without any post-processing such as IoU-based non-maximum suppression (NMS) (Zhou et al., 2019). This is important for making polyp detection fast enough for real-time implementation.
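A minimal sketch of this inference step: the peak value serves as the confidence A, and the box half-sizes are taken as k standard deviations estimated from the second-order moments of the predicted shape. The scale factor k and the helper name `gaussian_to_detection` are our assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def gaussian_to_detection(pred, k=2.0):
    """Turn a predicted 2D Gaussian map into (bounding box, confidence)."""
    conf = pred.max()
    y0, x0 = np.unravel_index(pred.argmax(), pred.shape)
    # Treat the normalized prediction as a 2D distribution and
    # measure its spread along each axis (size-adaptive sigma_x, sigma_y).
    w = pred / pred.sum()
    ys, xs = np.indices(pred.shape)
    sx = np.sqrt((w * (xs - x0) ** 2).sum())
    sy = np.sqrt((w * (ys - y0) ** 2).sum())
    # Box extends k standard deviations on each side of the peak.
    box = (x0 - k * sx, y0 - k * sy, x0 + k * sx, y0 + k * sy)
    return box, conf

# Synthetic prediction: axis-aligned Gaussian at (20, 15), sigma = (3, 2).
X, Y = np.meshgrid(np.arange(40), np.arange(40))
pred = np.exp(-(((X - 20) / 3.0) ** 2 + ((Y - 15) / 2.0) ** 2) / 2)
box, conf = gaussian_to_detection(pred)
print(round(conf, 2), [round(v) for v in box])
```

Because no anchor proposals or NMS are involved, this step is essentially free compared to the network's forward pass.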

Public datasets
To train the models and evaluate their performance, we use three publicly available datasets of polyp images and videos: CVC-ClinicDB, ETIS-LARIB, and CVC-ColonDB. In our experiments, we use CVC-ClinicDB for training the models, while ETIS-LARIB and CVC-ColonDB are used for the performance evaluation. All three datasets come with ground-truth images in the form of binary masks provided by clinical experts. The ground-truth masks indicate the polyp pixels in the images, and the masks are drawn as exact boundaries around the polyp regions.

Augmentation strategies and preprocessing
We apply several simple pre-processing steps to the input images before they are used for training the models: 1. Image cropping is applied to remove the canvas around the informative part of the images (see Fig. 4). 2. The input images are resized to 512 × 512 because the pre-trained ResNet34 accepts this image resolution. 3. We re-scale the input images from [0, 255] to [0, 1] and normalize them using the mean and standard deviation calculated from the ImageNet dataset.
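A minimal NumPy sketch of steps 1 and 3; the crop box is assumed to be known per dataset, and the 512 × 512 resize (step 2) is omitted since it would rely on an external image library's interpolation:

```python
import numpy as np

# ImageNet channel statistics commonly used for normalization.
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(img, crop_box):
    """Crop the canvas, rescale to [0, 1], and normalize (H, W, 3 input)."""
    x0, y0, x1, y1 = crop_box
    img = img[y0:y1, x0:x1]               # 1. remove the canvas
    img = img.astype(np.float64) / 255.0  # 3a. rescale to [0, 1]
    return (img - IMAGENET_MEAN) / IMAGENET_STD  # 3b. normalize

# Example: an all-white frame with a 10-pixel canvas on every side.
out = preprocess(np.full((100, 100, 3), 255, dtype=np.uint8),
                 (10, 10, 90, 90))
print(out.shape)  # (80, 80, 3)
```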
To improve model generalization during training, we apply several image augmentation methods on the fly, such as random affine transformations (e.g., rotation, vertical and horizontal flips), random zoom-in (up to 25%) and zoom-out (up to 50%), and color augmentations in HSV space. Unlike for zoom-out, we apply zoom-in only up to 25% to keep the balance between large and small polyps, because the training dataset contains more large polyps than small ones.

Training the models
We randomly split the training dataset using 5-fold cross-validation to train the models and choose hyper-parameters. We only use images that contain polyps for training. To prevent the models from over-fitting due to the shortage of training data, ResNet34 was initialized with ImageNet pre-trained weights, and the up-sampling layers were randomly initialized. We use the Adam optimizer to train the models for 60 epochs with a learning rate of 0.0001 (chosen using cross-validation) and a batch size of 2 (due to GPU memory restrictions).

Loss functions
The loss function plays an important role in the performance of deep learning models. There are many loss functions to choose from, and it can be challenging to decide which to pick to obtain the best performance. In this study, we evaluate three loss functions: 1) mean absolute error (L1 loss), 2) mean square error (L2 loss), and 3) generative adversarial network (GAN) loss: L1 = (1/N) Σ |Y − Ŷ|, L2 = (1/N) Σ (Y − Ŷ)², and L_GAN = E[log D(concat(I, Y))] + E[log(1 − D(concat(I, G(I))))], where N is the number of samples in the epoch, concat is a simple concatenation of I with either Y or Ŷ, D is the discriminator network, and G is the generator network. For the GAN loss, we use VGG16 (Simonyan and Zisserman, 2014) as the D network to evaluate the output of the G network, which can be any of the models discussed in Section 2.3.
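The three objectives can be sketched as follows. The GAN term is written as a generic discriminator objective on `concat(I, Y)` versus `concat(I, Ŷ)`, with `d_real` and `d_fake` standing in for the discriminator's output probabilities; this is a simplification of the full adversarial training loop, in which D (VGG16 here) and G are updated alternately:

```python
import numpy as np

def l1_loss(Y, Y_hat):
    # Mean absolute error over all pixels.
    return np.abs(Y - Y_hat).mean()

def l2_loss(Y, Y_hat):
    # Mean squared error over all pixels.
    return ((Y - Y_hat) ** 2).mean()

def gan_loss(d_real, d_fake, eps=1e-8):
    # Discriminator objective: reward D(concat(I, Y)) -> 1 and
    # D(concat(I, G(I))) -> 0; near zero when D is confident and right.
    return -(np.log(d_real + eps) + np.log(1 - d_fake + eps))

Y = np.zeros((4, 4))
Y_hat = np.ones((4, 4))
print(l1_loss(Y, Y_hat), l2_loss(Y, Y_hat))  # 1.0 1.0
```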

Evaluation metrics
To clinically evaluate a computer-aided diagnosis (CAD) system, it is important to compute the following quantities: True Positive (TP): a true detection output where the centroid of the detection is located within the polyp mask. Only one TP is counted if there are multiple overlapping detection outputs for the same polyp.
True Negative (TN): a correct output where no detection is produced for a negative image (an image without polyps).
False Positive (FP) : This is a false alarm where a wrong detection output is provided for a negative region.
False Negative (FN): a missed polyp in a positive image (an image with a polyp). We use these quantities to evaluate the performance of the models in terms of: Sensitivity (Recall): the ratio of true detection outputs to the total number of polyps in the test dataset. This metric shows the detection ability of a model. Sensitivity (Sen) = TP/(TP + FN) × 100. Precision: the ratio of true detection outputs to the total number of predicted outputs, including false alarms. This metric shows the ability of a model to make correct predictions. Precision (Pre) = TP/(TP + FP) × 100. F1 score: this metric is clinically important because it shows the balance between sensitivity and precision: F1 = 2 × Pre × Sen/(Pre + Sen).
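The three metrics follow directly from the counts above; the counts in the example are illustrative, not from our experiments:

```python
def detection_metrics(tp, fp, fn):
    """Sensitivity, precision, and F1 score (all in %) from detection counts."""
    sen = tp / (tp + fn) * 100          # Sen = TP/(TP + FN) x 100
    pre = tp / (tp + fp) * 100          # Pre = TP/(TP + FP) x 100
    f1 = 2 * pre * sen / (pre + sen)    # harmonic mean of Pre and Sen
    return sen, pre, f1

# Example: 90 true detections, 10 false alarms, 15 missed polyps.
sen, pre, f1 = detection_metrics(90, 10, 15)
print(round(sen, 2), round(pre, 2), round(f1, 2))  # 85.71 90.0 87.8
```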

Performance comparison of binary and Gaussian masks
We used the ETIS-LARIB dataset and the L1 loss to compare Gaussian and binary ground-truth masks across the different models. Table 1 shows that the Gaussian ground-truth is more efficient and effective than the binary ground-truth. When Gaussian masks were used to train the models to predict 2D Gaussian shapes, all the models were able to detect more TPs and eliminate a number of FPs. These results indicate that our hypothesis of using Gaussian ground-truth is valid. Many FPs could be removed from the final results because the confidence values (coefficient A) of the predicted masks were below the threshold value, which we set to 0.5. Many other FPs were eliminated because Gaussian masks successfully reduced the effect of the outer edges during training.
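The confidence-based filtering can be sketched as follows; the detections and confidence values are illustrative, not taken from our results:

```python
# Detections as (bounding box, confidence A) pairs, where the
# confidence is the peak of the predicted 2D Gaussian shape.
detections = [((40, 30, 90, 80), 0.92),    # strong response: likely polyp
              ((5, 5, 20, 18), 0.31),      # weak response, e.g. on a fold
              ((100, 60, 130, 95), 0.48)]  # weak response, e.g. on a vessel

THRESHOLD = 0.5  # detections below this confidence are discarded as FPs
kept = [d for d in detections if d[1] >= THRESHOLD]
print(len(kept))  # 1
```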
It can be concluded from Table 1 that MDeNetplus experienced the largest performance improvement with 2D Gaussian masks, especially in terms of precision. The main reason for this superiority is that MDeNetplus hierarchically merges the feature hierarchies to better fuse semantic and spatial information for more accurate detection. This outcome is in line with previously obtained results (Yu et al., 2018). MDeNetplus was also able to produce fewer FPs because feature aggregation across different layers helps to improve inference of what and where (Yu et al., 2018), making the model well constructed to precisely predict the 2D Gaussian shapes for the polyp regions. However, this method of feature fusion might not be suitable for binary masks, because edge information may dominate the features in every decoder of the expanding path, leading to more FP outputs. When the network is trained on 2D Gaussian masks, the impact of the edges is reduced, and the network more efficiently decodes other types of features, producing fewer FP detection outputs and precisely detecting more polyps. Fig. 5 presents two examples showing that MDeNetplus trained on Gaussian masks could precisely predict the location of the polyp without producing FPs, while the same model trained on binary masks produced two FPs along with one correct detection. As can be seen, the two FPs were generated at two locations bounded by rounded edges in the image.
We ran our tests on an NVIDIA GeForce GTX 1080 Ti to investigate the inference speed of our models. The EncDec model is the fastest, requiring only 28 ms to process a single frame. Compared to the other models, the EncDec model has no skip connections and fewer parameters, making it the smallest model. MDeNetplus is the slowest (mean processing time, MPT = 39 ms) but the best-performing model, and is still fast enough for real-time implementation on videos at 25 frames per second.

Performance evaluation of 2D Gaussian and binary masks on different polyp morphologies
In this section, we compare the performance of 2D Gaussian and binary masks in detecting different types of polyps. Based on morphological shape, the Paris classification divides polyps into several categories: pedunculated (0-Ip), sessile (0-Is), slightly elevated (0-IIa), flat (0-IIb), slightly depressed (0-IIc), and excavated (0-III) (see Fig. 6). The ETIS-LARIB dataset contains only pedunculated (0-Ip), sessile (0-Is), and slightly elevated (0-IIa) polyps. Sessile and pedunculated polyps are the most common types (Vleugels et al., 2017). Sessile and slightly elevated polyps lie flat against the surface of the colon's lining, making them harder to detect in CRC screening, while pedunculated polyps are mushroom-like tissue growths with a long and thin stalk (Vleugels et al., 2017).
In Table 1, we can see that 16 additional polyps were detected with 2D Gaussian masks compared to binary masks. Table 2 details how many more 0-Is and 0-IIa polyps were detected by 2D Gaussian masks. As can be seen, the 2D Gaussian masks detected 4 additional sessile and 12 additional slightly elevated polyps. The same 0-Ip polyps were missed by both types of masks. This outcome shows that the 2D Gaussian ground-truth was helpful for detecting more flat-shaped polyps. Fig. 7 presents two 0-IIa polyps (barely noticeable to the human eye) detected successfully by our MDeNetplus model trained on 2D Gaussian masks, whereas the same model trained on binary masks missed them.

Table 3 shows the performance of MDeNetplus when trained using different loss functions. As seen in the table, the GAN loss is more effective than the L1 and L2 losses at forcing the model to predict 2D Gaussian shapes. We surmise this is because the GAN is not only computing the loss between Y and Ŷ, but can also assess the quality of the predicted Gaussian shapes. If the model predicts an output with an irrelevant Gaussian shape, the GAN loss becomes large, forcing the model to predict more precise shapes.

Comparison with other methods on ETIS-LARIB
We followed the same dataset guidelines recommended by the Endoscopic Vision Challenge at MICCAI 2015 to train and evaluate our detection models: CVC-ClinicDB is used for training, whereas the ETIS-LARIB dataset is used for testing. In Table 4, we compare the performance of our best model, MDeNetplus trained with the GAN loss, against several state-of-the-art models on the ETIS-LARIB dataset. MDeNetplus outperforms the other methods, including Faster R-CNN, the state-of-the-art object detector, in terms of sensitivity (86.54%) and F1 score (86.33%). AFP-Net (Wang et al., 2019a) has 2.42% better precision (88.89%) than our model (86.12%). We surmise this is because they utilized more data to train their model: they used CVC-ClinicVideoDB, which comprises 18 videos with a total of 11,954 frames, of which 10,025 frames contain at least one polyp. Table 4 also shows the inference time of the models per frame. The fastest model is AFP-Net, with an MPT of only 19 ms per frame. However, we must mention that they ran their model on an NVIDIA GeForce RTX 2080 Ti, which is faster than our NVIDIA GeForce GTX 1080 Ti. Nevertheless, we expect that our MDeNetplus would also run faster on an NVIDIA GeForce RTX 2080 Ti.

Comparison with other methods on CVC-ColonDB
In this experiment, we used CVC-ColonDB to further compare our results with other methods. Table 5 shows that our MDeNetplus trained with the GAN loss was able to produce fewer FP outputs and thus achieved the highest precision (88.35%) and F1 score (89.65%). RCNN-Mask has the highest sensitivity (95.67%), whereas our MDeNetplus has the second highest (91%) compared to all other methods. However, our MDeNetplus is much faster than RCNN-Mask and needs only 39 ms to process an image. Fig. 8 presents two images from CVC-ColonDB. Our method successfully detected a very difficult polyp, as shown in the first row of Fig. 8, and even predicted the polyp orientation in the image, as shown in the second row. We also encountered FP detection outputs, shown in Fig. 9. The first row of Fig. 9 shows that MDeNetplus detected the polyp in the input image but also produced an FP output. The second row of Fig. 9 shows that the model missed the polyp and generated an irregular Gaussian shape in a normal region.

Effect of resizing the 2D Gaussian and binary masks on the performance
In this experiment, we resized the 2D Gaussian and binary masks to evaluate the effect of smaller and larger masks on model performance. Fig. 10 shows that when smaller 2D Gaussian masks (< σ) are used for training, sensitivity is low and precision is high: with smaller masks, less weight is given to the polyp outer edges during training, so fewer FPs are generated for folds and other objects with strong edges. When larger 2D Gaussian masks are used, sensitivity increases while precision decreases. From Fig. 10, it can be concluded that the polyp outer edge: a) is an important feature for detecting more polyps, and b) contributes to the majority of FP outputs. Fig. 11 demonstrates the effect of different sizes of binary masks on model performance. The figure shows that using smaller binary masks (< actual polyp region) is not as effective as using 2D Gaussian shapes to reduce the effect of polyp edges. This is because when smaller binary masks are used, unlike with 2D Gaussian masks, part of the polyp region, including the outer edges, is totally excluded from training. It seems that edges cannot be ignored, because they are important parts of polyp features, and this way of training may confuse the model and make it difficult to distinguish between polyp and background. In contrast, 2D Gaussian masks do not totally ignore the edges but reduce their importance by giving them lower weights during training.

Fig. 10. Effect of resizing 2D Gaussian masks on the model performance. Fig. 11. Effect of resizing binary masks on the model performance.

Conclusion
In this paper, we proposed a method for real-time automatic polyp detection with good accuracy. Instead of binary masks, we used 2D Gaussian masks as the ground-truth images to train several convolutional neural network-based encoder-decoder variants, which are usually used for object segmentation. We showed that 2D Gaussian masks are more effective and efficient than binary masks for detecting more polyps and reducing the number of false positives.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.