A Polyp Detection Method Based on FBnet

Abstract: The incidence of colorectal cancer (CRC) in China has increased in recent years. The mortality rate of CRC has become one of the highest among all cancers, and CRC increasingly affects people's health and quality of life. However, due to the insufficiency of medical resources in China, the workload on medical doctors has further increased. In the past few decades, adult CRC mortality and morbidity have dropped sharply, mainly because of CRC screening and the removal of adenomatous polyps. However, because polyps themselves differ and endoscopists vary in skill, the polyp detection rate varies greatly. In this paper, we adopt an anchor-free mechanism and introduce a better method to factorize the process of bounding box regression. First, we regress the shape of the object using a variant of the bounding box regression in Faster RCNN. Second, we redefine the target function for the location of the object. The experimental results show that our method achieves an mAP of 55.8%, which outperforms other state-of-the-art methods by at least 11.9%. This will greatly help to reduce missed diagnoses by clinicians during endoscopy and treatment, and will support the early diagnosis, early treatment and prevention of CRC.


Introduction
Colorectal cancer (CRC) is one of the most common malignant tumors in China. With rising living standards and changing dietary habits, the incidence and mortality of colorectal cancer have kept rising in recent years [Society and Society (2018)], seriously endangering people's health and quality of life. CRC has become a major public health problem due to its high morbidity and mortality. According to statistics, CRC is the second and third leading cause of death in men and women, respectively [Society and Society (2018)]. In addition, a recent study reported a significant increase in the annual percentage of CRC incidence among young people [Bailey, Hu, You et al. (2015)]. In clinical diagnosis, colonoscopy plays an important role in CRC screening [Rex, Boland, Dominitz et al. (2017)]. Colonoscopy reduces CRC mortality and incidence mainly because it can detect polyps/adenomas [Brenner, Chang-Claude, Jansen et al. (2014)] and remove them by resection [Doubeni, Corley, Quinn et al. (2018); Brenner, Chang-Claude, Jansen et al. (2014)]. In addition, there is evidence that every 1.0% increase in the adenoma detection rate (ADR) reduces the risk of interval CRC by 3.0% [Corley, Jensen, Marks et al. (2014); Kaminski, Regula, Kraszewska et al. (2010)]. In the past few decades, adult CRC mortality and morbidity have dropped sharply (by 51% and 32%, respectively), mainly because of CRC screening and the removal of adenomatous polyps [Burke, Kaul and Pohl (2017)]. However, because polyps themselves differ and endoscopists vary in skill, the polyp detection rate varies greatly [Shaukat, Oancea, Bond et al. (2009)], and the rate of missed diagnosis can be as high as 27% [Mahmud, Cohen, Tsourides et al. (2015); Ahn, Han, Bae et al. (2012)].
Thus, polyps that go unrecognized within the field of view during colonoscopy are an important problem [Mahmud, Cohen, Tsourides et al. (2015)]. Some studies have shown that a second observer increases the polyp detection rate (PDR), but whether such strategies improve the adenoma detection rate (ADR) is still controversial [Aslanian, Shieh, Chan et al. (2013); Buchner, Shahid, Heckman et al. (2011)]. At present, the medical industry is incorporating more advanced technology, such as computer science and sensor technology, making medical services more intelligent and precise. With the latest breakthroughs in artificial intelligence, especially the development of deep learning (DL), computer-aided diagnosis (CADx) of polyps during colonoscopy has attracted wide attention [Chen, Lin, Lai et al. (2018); Byrne, Chapados, Soudan et al. (2019); Fang, Cai, Sun et al. (2018)]. The deep belief network studied by Wan et al. [Wan, Chen, Kong et al. (2019)] has been adopted to help doctors detect early intestinal cancer. The ultimate goal of a real-time automatic polyp detection system is to assist the endoscopic detection of polyp lesions. Although several automated polyp detection systems have been developed over the past decade [Tajbakhsh, Gurudu and Liang (2015); Misawa, Kudo, Mori et al. (2018)], these systems still lack the ability to locate and track polyps in clinical practice during live colonoscopy. This paper proposes an anchor-free, two-step decomposition method to improve the detection rate of polyps/adenomas. This will greatly help to reduce missed diagnoses by clinicians during endoscopy and treatment, and will support the early diagnosis, early treatment and prevention of CRC.

Description of the problem
In this paper, to model the correlation between polyps and bounding boxes on a small dataset, we propose a novel method that greatly reduces the difficulty of learning and the risk of overfitting. The whole polyp detection process is shown in Fig. 1.

Anchor-free mechanism
In recent years, the mainstream literature has usually used anchor-based methods to regress bounding boxes (i.e., researchers regress the deltas between an anchor box and a ground-truth box instead of directly regressing the ground-truth box). Although anchor-based methods reduce the difficulty of regression, they introduce some drawbacks. First, to improve recall as much as possible, a large number of anchor boxes are defined to ensure that all ground-truth boxes in an image are captured. As a result, most anchor boxes have a low IoU with the ground-truth boxes, which creates a severe class imbalance between positive and negative anchor boxes. Second, researchers must use prior knowledge to design the anchor boxes, including their scales and aspect ratios, and it is hard to choose an appropriate set of anchor boxes for a given dataset. In our experiment, because the sizes and aspect ratios of the ground-truth boxes vary drastically, we would have no choice but to define more anchor boxes. Unfortunately, this greatly increases the number of parameters and ultimately leads to over-fitting. To alleviate this, we employ an anchor-free mechanism, which means that we directly regress ground-truth boxes without the help of anchor boxes. However, as YOLO-v1 [Brenner, Chang-Claude, Jansen et al. (2016)] shows, directly regressing ground-truth boxes performs poorly. To fix this problem, the next section introduces two steps for bounding box regression that reduce the difficulty of learning without increasing the number of parameters.
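To make the contrast concrete, the following minimal sketch compares the standard anchor-based delta encoding used by Faster RCNN with a naive direct encoding. The direct encoding (normalizing by image size), the function names, and the example boxes are assumptions made for illustration only; they are not the regression used by FBnet.

```python
import math


def encode_anchor_deltas(gt, anchor):
    """Encode a ground-truth box relative to an anchor box.

    Boxes are (cx, cy, w, h) tuples in pixels. This is the usual
    Faster RCNN parameterization of the regression targets.
    """
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return ((gx - ax) / aw, (gy - ay) / ah,
            math.log(gw / aw), math.log(gh / ah))


def encode_direct(gt, img_w, img_h):
    """Hypothetical anchor-free encoding: normalize the box by the image size."""
    gx, gy, gw, gh = gt
    return (gx / img_w, gy / img_h, gw / img_w, gh / img_h)


gt_box = (120.0, 80.0, 64.0, 32.0)      # illustrative ground-truth box
anchor_box = (112.0, 96.0, 32.0, 32.0)  # illustrative anchor box

deltas = encode_anchor_deltas(gt_box, anchor_box)
direct = encode_direct(gt_box, 224, 224)
```

With a well-matched anchor the deltas stay small, whereas the direct targets span the full range of object sizes and positions, which is one way to see why direct regression is harder to learn.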

Two steps of bounding box regression
Given the problems described above, we factorize bounding box regression into two steps. First, we regress the shape of the object using a variant of the bounding box regression in Faster RCNN. Second, we redefine the target function for locating objects.

Object shape regression
To reduce the learning difficulty, we adopt a variant of the bounding box regression in RCNN [Ren, He, Girshick et al. (2015)], where w, h and P_w, P_h denote the width and height of the ground-truth and the predicted boxes, respectively, and λ is a normalizing factor.
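As a minimal sketch of shape regression with a normalizing factor, assume that the network output P_w encodes the width as w = λ·exp(P_w), and likewise for the height. This specific form is an assumption made for illustration and may differ from the formulation actually used; the function names are hypothetical.

```python
import math


def encode_shape(w, h, lam):
    """Assumed target: log of the box size divided by the normalizing factor lam."""
    return math.log(w / lam), math.log(h / lam)


def decode_shape(p_w, p_h, lam):
    """Inverse of encode_shape under the same assumption: size = lam * exp(prediction)."""
    return lam * math.exp(p_w), lam * math.exp(p_h)


# Illustrative values only: a 64x32 box with lam = 4.
t_w, t_h = encode_shape(64.0, 32.0, lam=4.0)
w_rec, h_rec = decode_shape(t_w, t_h, lam=4.0)
```

Under this assumed form, setting λ = 1 reduces the target to the plain logarithm of the box size, which resembles the Faster RCNN target with a unit-sized anchor.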

Object location regression
Notably, it is undesirable to couple object shape (i.e., h, w) with object location (i.e., x, y) as in RCNN [Ren, He, Girshick et al. (2015)]: if the object shape varies greatly, the regression of the object location inherits the same variation. To address this issue, we decompose location regression into two steps. First, we predict for each position whether or not it contains the center point. Second, we predict the location offsets relative to the center point. We convert a location (x, y) in an image to (x/s, y/s) in a feature map, where s is the downsampling factor. However, x/s and y/s are generally not integers. Hence, we have two choices. In the first choice, we simply map (x/s, y/s) to (⌊x/s⌋, ⌊y/s⌋). Then, we predict the offset relative to (x/s, y/s), where (x, y) is the center point of an object in the heatmap.
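The first choice can be sketched as follows. The offset definition (the fractional part of x/s and y/s), the downsampling factor s = 16, and the 14×14 map size are assumptions made for illustration, not values confirmed by the text.

```python
import math


def center_to_cell(x, y, s):
    """Map an image-space center (x, y) to its feature-map cell and fractional offset."""
    fx, fy = x / s, y / s
    cx, cy = math.floor(fx), math.floor(fy)
    # Assumed offset: distance from the floored cell to the exact position.
    return (cx, cy), (fx - cx, fy - cy)


def center_heatmap(cell, width, height):
    """Binary map with 1.0 at the cell that contains the center point."""
    heat = [[0.0] * width for _ in range(height)]
    cx, cy = cell
    heat[cy][cx] = 1.0
    return heat


# A 224x224 input with s = 16 gives a 14x14 feature map (assumed values).
cell, offset = center_to_cell(100.0, 37.0, 16)
heat = center_heatmap(cell, width=14, height=14)
```

Here the first step corresponds to predicting the heatmap, and the second step corresponds to predicting the offset at the cell where the heatmap is active.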

Loss function
The loss function of FBnet contains two parts (i.e., the losses of the two steps of bounding box regression). In our experiment, we simply treat both parts as regression problems. For the first step, we apply the Smooth-L1 loss to the object shape, as in RCNN [Ren, He, Girshick et al. (2015)], because it is robust to outliers. For the second step, we adopt the MSE loss for both object categories and object locations. The total loss combines these terms, using an indicator that denotes whether an object appears in a cell and two scaling factors.
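One way such a loss could be assembled is sketched below. How the indicator gates the terms and where the two scaling factors are applied are assumptions, as are the dictionary keys and the names w_shape and w_loc; the sketch only illustrates combining Smooth-L1 and MSE terms per cell.

```python
def smooth_l1(pred, target):
    """Smooth-L1: quadratic for small errors, linear for large ones."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def mse(pred, target):
    """Squared error for a single value."""
    return (pred - target) ** 2


def total_loss(cells, w_shape=1.0, w_loc=1.0):
    """Sum the per-cell losses.

    Each cell is a dict with 'obj' (0 or 1) and 'p_obj'; cells with an object
    also carry 'off'/'p_off' (offsets) and 'shape'/'p_shape' (shape targets).
    """
    loss = 0.0
    for c in cells:
        # The center (objectness) term is evaluated for every cell.
        loss += mse(c["p_obj"], c["obj"])
        if c["obj"]:
            # Location and shape terms only count where an object is present.
            loc = sum(mse(p, t) for p, t in zip(c["p_off"], c["off"]))
            shape = sum(smooth_l1(p, t) for p, t in zip(c["p_shape"], c["shape"]))
            loss += w_loc * loc + w_shape * shape
    return loss


example_cells = [
    {"obj": 1, "p_obj": 0.5, "off": (0.25, 0.5), "p_off": (0.25, 0.0),
     "shape": (1.0, 2.0), "p_shape": (1.5, 4.0)},
    {"obj": 0, "p_obj": 0.5},
]
example_loss = total_loss(example_cells)
```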

Training details
In this paper, we take VGG-16 [Simonyan and Zisserman (2014)] as our feature extractor.

Iterative training
We train our network in two steps: (1) freeze all layers in VGG-16 [Simonyan and Zisserman (2014)] and train the custom layers with the total loss; (2) freeze the front layers in VGG-16 [Simonyan and Zisserman (2014)] and fine-tune the remaining layers.

Hyper-parameter setting
Redmon et al. [Redmon and Farhadi (2017)] observed in YOLO-v2 that a low-resolution classifier cannot extract robust features from high-resolution images. For convenience, the inputs in all experiments are resized to 224×224.
In the first step, we freeze all layers in VGG-16 [Simonyan and Zisserman (2014)] and add a detection head whose parameters are randomly initialized. We use Adam to train the custom layers for 20 epochs with an initial learning rate of 5e-3, which is divided by 10 at epochs 5 and 17. In the second step, we initialize all layers with the weights from Step 1 and use Adam to train the network for 3 epochs with a very small learning rate of 1e-5.
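The learning-rate schedule described above can be written out as a short sketch. The 0-indexed epoch convention and the function and constant names are assumptions; the rates, milestones, and epoch counts are those stated in the text.

```python
def stage1_lr(epoch, base_lr=5e-3, milestones=(5, 17)):
    """Learning rate for a 0-indexed epoch: divided by 10 at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (10 ** drops)


STAGE1_EPOCHS = 20   # detection head only, VGG-16 frozen
STAGE2_EPOCHS = 3    # fine-tuning initialized from Step 1
STAGE2_LR = 1e-5

# One learning rate per epoch across both training steps.
schedule = [stage1_lr(e) for e in range(STAGE1_EPOCHS)] + [STAGE2_LR] * STAGE2_EPOCHS
```

In a deep learning framework, the first stage would typically be expressed with a multi-step scheduler rather than a hand-written function.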

Data augmentation
The only data augmentation methods we finally use are random cropping and horizontal flipping. In this experiment, we tried many kinds of data augmentation, but most of them reduced mAP, and some even prevented convergence. When we inspected the resulting feature maps, it was hard to find any meaningful pattern in them. The environment inside the human body is simple and structured; for example, colonoscopy images are almost entirely red. Color jittering may generate blue or otherwise unnaturally colored training images, which introduces extraneous noise, confuses the network, and makes convergence difficult.
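Horizontal flipping must also mirror the bounding boxes. The sketch below shows this on a toy image stored as a list of rows; the box format (x_min, y_min, x_max, y_max), the flip probability, and the function names are assumptions made for illustration.

```python
import random


def hflip(image, boxes):
    """Flip an image (list of pixel rows) and its boxes horizontally.

    Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    new_boxes = [(width - x_max, y_min, width - x_min, y_max)
                 for (x_min, y_min, x_max, y_max) in boxes]
    return flipped, new_boxes


def maybe_hflip(image, boxes, rng, p=0.5):
    """Apply hflip with probability p, using the given random.Random instance."""
    if rng.random() < p:
        return hflip(image, boxes)
    return image, boxes


toy_image = [[1, 2, 3, 4],
             [5, 6, 7, 8]]
toy_boxes = [(0, 0, 1, 2)]
flipped_image, flipped_boxes = hflip(toy_image, toy_boxes)
```

Because flipping leaves the color distribution untouched, it avoids the unnatural images that color jittering can produce.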

Inference details

Inference post-processing
(1) Discard all boxes whose confidence is lower than α.
(2) Discard all boxes whose area is smaller than β.
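The two filtering rules can be sketched as a single pass over the predicted boxes. The box format (x_min, y_min, x_max, y_max, confidence) and the threshold values in the example are assumptions; the text does not state the values of α and β.

```python
def postprocess(boxes, alpha, beta):
    """Keep boxes whose confidence is at least alpha and whose area is at least beta.

    Each box is (x_min, y_min, x_max, y_max, confidence).
    """
    kept = []
    for x_min, y_min, x_max, y_max, conf in boxes:
        area = max(0.0, x_max - x_min) * max(0.0, y_max - y_min)
        if conf >= alpha and area >= beta:
            kept.append((x_min, y_min, x_max, y_max, conf))
    return kept


predictions = [
    (0, 0, 10, 10, 0.90),   # kept
    (0, 0, 2, 2, 0.95),     # removed: area too small
    (0, 0, 20, 20, 0.10),   # removed: confidence too low
]
filtered = postprocess(predictions, alpha=0.5, beta=25)
```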

Evaluation metrics
The results of our methods are measured with Mean Average Precision (mAP) as in Faster RCNN [Ren, He, Girshick et al. (2015)].
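For reference, a minimal sketch of average precision for a single class is given below; with one class (polyp), mAP equals this value. The IoU threshold of 0.5, the all-point interpolation, and the data layout are assumptions, so the numbers it produces may differ slightly from a specific benchmark protocol.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truths, iou_thr=0.5):
    """detections: list of (image_id, confidence, box).
    ground_truths: dict mapping image_id to a list of boxes."""
    n_gt = sum(len(v) for v in ground_truths.values())
    matched = {img: [False] * len(v) for img, v in ground_truths.items()}
    tp_flags = []
    # Visit detections from most to least confident.
    for img, _, box in sorted(detections, key=lambda d: -d[1]):
        best, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths.get(img, [])):
            overlap = iou(box, gt)
            if overlap > best:
                best, best_j = overlap, j
        # A detection is a true positive if it hits a not-yet-matched ground truth.
        if best >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True
            tp_flags.append(1)
        else:
            tp_flags.append(0)
    precisions, recalls, tp = [], [], 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt if n_gt else 0.0)
    # Make precision non-increasing, then integrate it over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


toy_gt = {"a": [(0, 0, 10, 10)], "b": [(0, 0, 10, 10)]}
toy_det = [("a", 0.9, (0, 0, 10, 10)),
           ("a", 0.8, (50, 50, 60, 60)),
           ("b", 0.7, (0, 0, 10, 10))]
toy_ap = average_precision(toy_det, toy_gt)
```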

Dataset
The polyp dataset contains 201 colonoscopy images. As shown in Fig. 2, each image contains one to three objects. We use 150 images for training and 51 for testing. To prevent data leakage, all images from one patient belong to either the training set or the testing set.
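A patient-level split of this kind can be sketched as follows. The image-to-patient mapping, the test fraction, and the seed are hypothetical; the sketch only illustrates assigning whole patients, rather than individual images, to a subset.

```python
import random


def split_by_patient(image_to_patient, test_fraction=0.25, seed=0):
    """Split image ids so that every patient's images land in exactly one subset."""
    patients = sorted(set(image_to_patient.values()))
    random.Random(seed).shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [img for img, p in image_to_patient.items() if p not in test_patients]
    test = [img for img, p in image_to_patient.items() if p in test_patients]
    return train, test


# Hypothetical mapping: 12 images from 4 patients, 3 images each.
toy_mapping = {f"img{i}": f"p{i // 3}" for i in range(12)}
train_ids, test_ids = split_by_patient(toy_mapping)
```

Splitting at the image level instead would let near-duplicate frames of the same polyp appear in both subsets and inflate the test score.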

Hyper-parameter sensitivity experiment
First, we verify the effectiveness of Step 1 with different values of the hyper-parameter λ. All experimental results are shown in Tab. 1. In particular, the scores are very close to each other when λ is set to 3, 4, 5, 6, and 7. To verify the effect of λ in the first step, all models in Tab. 1 share the same architecture and use the same shape regression, without the participation of the second step. Notably, when λ=1, the regression is similar to the target function of Faster RCNN [Ren, He, Girshick et al. (2015)]. As λ increases, mAP improves slightly. When λ=4, FBnet achieves its best result, with an mAP of 0.466. This indicates that λ reduces the difficulty of learning and ultimately improves performance. Then we test Eqs. (3) and (4) under different settings of λ. All results are shown in Tab. 2. To check the effect of combining the two steps, all models in Tab. 2 use the same regression, in which the location is predicted as an offset added to the center point and the shape is regressed as in Step 1. Nevertheless, the center point is still converted to a discrete position on the feature map. Even though the point changes only slightly, it has a completely different meaning in Eqs. (3) and (4). As shown in Tab. 2, Eq. (4) performs better than Eq. (3).
Tab. 2 shows that FBnet improves the mAP by 12.7%. In addition, as observed in the experiment, this decoupling makes our network converge more quickly and localize more accurately. In fact, if we remove the decoupling, the method is equivalent to YOLO-v1 [Brenner, Chang-Claude, Jansen et al. (2016)]. YOLO-v1 was creative in introducing the anchor-free mechanism, but it lacks effective techniques to stabilize training. To address these problems, we introduce the two-step factorization. As shown in Tab.

Comparisons with state-of-the-art methods
Finally, we compare our method with several classic methods, including Faster-RCNN [Ren, He, Girshick et al. (2015)], Yolo-v2 [Redmon and Farhadi (2017)], SSD [Liu, Anguelov, Erhan et al. (2016)], and RetinaNet [Lin, Goyal, Girshick et al. (2017)]. All results are shown in Tab. 3. For a fair comparison and to demonstrate the feasibility of our network, the hyper-parameters of all the state-of-the-art methods are tuned to their best settings. All details are given in Appendix A. As shown in Tab. 3, with VGG-16 as the backend, our FBnet is superior to the other state-of-the-art methods.

Conclusion
In this paper, we adopt an anchor-free mechanism and introduce a better method to factorize the process of bounding box regression. First, we regress the shape of the object using a variant of the bounding box regression in Faster RCNN [Ren, He, Girshick et al. (2015)]. Second, we redefine the target function for the location of the object. The experimental results show that our method achieves an mAP of 55.8%, which outperforms other state-of-the-art methods by at least 11.9%. This will greatly help to reduce missed diagnoses by clinicians during endoscopy and treatment, and will support the early diagnosis, early treatment and prevention of CRC.

Appendix A
All models use their optimal number of training epochs, learning rate and optimizer. In addition, each model has its own decisive parameters, which we also fine-tune in this paper. All results are shown in Tabs. 4-6.
For Faster RCNN, we mainly tune the decisive parameters, including the prior anchors, train-time region proposals, test-time region proposals, input resolution and data augmentation. For the prior anchors, we adopt the principle in YOLO-v2 [Redmon and Farhadi (2017)]: we run the k-means algorithm on the training set to cluster k (k>0) prior anchors without human intervention. For the train/test-time region proposals, we fix the train-time proposals and search for the best test-time proposals. We also vary other parameters, including the backend network. For Yolo-v2 and SSD, we mainly focus on the prior anchors, backend network, input resolution and data augmentation. For RetinaNet, we focus on the parameters of the focal loss, backend network, input resolution and data augmentation. After tuning these parameters, the best results of all models are shown in Tab. 3.
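The anchor clustering step can be sketched as follows. YOLO-v2 clusters box dimensions with k-means using 1 − IoU as the distance; the initialization, iteration cap, handling of empty clusters, and the toy box sizes below are assumptions made for illustration.

```python
import random


def wh_iou(a, b):
    """IoU of two boxes given as (w, h) and aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(sizes, k, iters=50, seed=0):
    """Cluster (w, h) pairs with distance 1 - IoU and return k anchor sizes."""
    centers = random.Random(seed).sample(sizes, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for s in sizes:
            # Highest IoU is the same as smallest 1 - IoU distance.
            best = max(range(k), key=lambda j: wh_iou(s, centers[j]))
            groups[best].append(s)
        new_centers = []
        for j, g in enumerate(groups):
            if g:
                new_centers.append((sum(w for w, _ in g) / len(g),
                                    sum(h for _, h in g) / len(g)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


# Toy training-set box sizes: three small squares and three wide boxes.
toy_sizes = [(10, 10), (12, 12), (11, 11), (100, 50), (104, 52), (96, 48)]
toy_anchors = kmeans_anchors(toy_sizes, k=2)
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering, which is the motivation given for this choice in YOLO-v2.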