SegT: Separated Edge-guidance Transformer Network for Polyp Segmentation

Accurate segmentation of colonoscopic polyps is a fundamental step in medical image analysis and surgical interventions. Many recent studies build on the encoder-decoder framework, which can effectively segment diverse polyps. Such improvements mainly aim to enhance local features by using global features and applying attention methods. However, relying only on the global information of the final encoder block can result in the loss of local regional features from the intermediate layers. In addition, determining the edges between benign regions and polyps is a challenging task. To address these issues, we propose a novel separated edge-guidance transformer (SegT) network that aims to build an effective polyp segmentation model. A transformer encoder is applied, which learns a more robust representation than existing CNN-based approaches. For precise segmentation of polyps, we utilize a separated edge-guidance module consisting of separator and edge-guidance blocks. The separator block is a two-stream operator that highlights edges between the background and foreground, whereas the edge-guidance block lies behind both streams to strengthen the understanding of edges. Lastly, an innovative cascade fusion module fuses the refined multi-level features. To evaluate the effectiveness of SegT, we conducted experiments on five challenging public datasets, and the proposed model achieved state-of-the-art performance.


Introduction
According to the reports published by Globocan 2020, colorectal cancer (CRC) is the second most prevalent cancer type worldwide in terms of mortality and the third most pervasive disease across the globe [1]. Colorectal polyps are abnormal tissue growths in the lining of the colon and are a precursor to CRC. Over 10 to 15 years, untreated polyps can turn into cancer. The best way to lower the prevalence of CRC is early detection and effective treatment. Colonoscopy is the gold-standard method for examining the gastrointestinal tract; it is used to find polyps and remove them before they turn into cancer. However, colonoscopy is a highly operator-dependent procedure, and one in four polyps may be missed during a single colonoscopy owing to human factors, such as clinician skill or subjectivity [2]. In addition, there is evidence that absent or incomplete resection of the tumor are two key factors in the development of cancer after colonoscopy [3]. Therefore, an automatic and accurate polyp segmentation method is needed to help doctors locate polyps.
Despite significant progress in deep learning [4, 5], automatically segmenting polyps remains a formidable challenge. Polyps, which result from abnormal cell growth in the human colon, are strongly related to their surrounding environment; they vary in shape, size, texture, and color, making their appearance highly diverse. A major difficulty in polyp segmentation is that the edges of polyps and the surrounding mucosa are not always clearly distinguishable during colonoscopy. This ambiguity is particularly pronounced under different lighting conditions and when dealing with flat lesions or inadequate bowel preparation. Consequently, these factors introduce significant uncertainty into the learning process, making the task particularly demanding.
In recent years, the rapid development of deep learning has led to an increasing number of deep convolutional neural networks (DCNNs) [6-11] being proposed for polyp image segmentation. Brandao et al. [6] introduced the fully convolutional network (FCN) to polyp region extraction by converting AlexNet, VGG, and ResNets into FCNs. The U-shaped [9, 11] architecture, containing an encoder and a decoder built from convolutional layers, is widely used for segmentation tasks with impressive performance. However, convolutional neural networks (CNNs) have a limited receptive field; the model obtains only local information and disregards spatial context and global information. In addition, CNNs behave similarly to a series of high-pass filters, favoring high-frequency information. The transformer [12-16] is a recently proposed deep neural network architecture. Compared with CNNs, the self-attention layer in a transformer is similar to a low-pass filter and can effectively capture long-term dependencies. Therefore, combining the advantages of convolutional and self-attention layers can improve the representation ability of deep networks.
Although these methods have substantially improved accuracy and generalization ability compared with traditional methods, locating the edges of polyps remains a challenge for them, as shown in Figure 1. The color and texture of polyps are markedly similar to those of the surrounding tissues; low contrast provides polyps with powerful camouflage properties [18] and makes them difficult to identify. Previous studies [17, 19-22] have explored fusing low-scale boundary and high-scale semantic information to better preserve boundary details. Takikawa et al. [23] and Zhen et al. [24] designed a boundary stream and coupled the tasks of boundary and semantics modeling. PraNet [8] generated a global map as the initial guidance region and thereafter used a reverse attention module to reveal more complete objects. However, these endeavors seldom consider simulating how humans detect polyps whose boundaries blend into the background. To address these issues, we propose a new deep learning model called the separated edge-guidance transformer network (SegT) for polyp segmentation. Our model generates high-quality segmentation maps in a human-like manner and demonstrates remarkable performance in various challenging scenarios. When people search for potential polyp targets in colonoscopy images, they first look for the possible polyp region; thereafter, they outline the precise edge of the polyp area by comparing the difference between the foreground and background. Inspired by this observation, we propose an effective separated edge-guidance for polyp segmentation.
The key contributions of our work are as follows:
• We propose a novel framework called the separated edge-guidance transformer network (SegT) for polyp segmentation, which adopts the pyramid vision transformer (PVT) as the encoder, rather than the existing CNN-based backbones, to extract features.
• We design a separated edge-guidance (SEG) module composed of two parts: separator (SE) and edge-guidance (EG) blocks. Their purpose is to simulate how humans detect polyp targets. In particular, the SE block highlights the object's edges between an image's background and foreground, and the EG block embeds edge information into the feature map, which significantly addresses the "ambiguous" edge problem.
• We present a cascade fusion module (CFM), which collects the semantic and location information of polyps from the features through progressive integration to obtain refined segmentation results.

Polyp segmentation
Traditional methods. Computer-aided detection [25-27] is an effective alternative to manual detection, and hand-engineered methods are widely used in polyp detection. Such polyp segmentation schemes are mainly based on low-level features, such as texture and geometric features. In the method proposed by Gonzalez et al. [25], the shape, color, and curvature features of edges are utilized for polyp segmentation. Figueiredo et al. [27] proposed a unified bottom-up and top-down saliency method for polyp detection that considers shape, color, and texture information. However, these methods have a high risk of missed or false detections owing to the high similarity between polyps and the surrounding tissues.
Deep learning-based methods. Owing to the powerful feature expression and analysis capabilities of deep learning models [6, 8, 9, 28, 29], many deep learning-based methods have been proposed for polyp segmentation tasks. Brandao et al. [6] introduced the FCN to polyp region extraction by converting classification networks. However, this fully convolutional architecture lacks detailed semantic features, and the segmentation results are not ideal. Encoder-decoder models, such as U-Net [9] and UNet++ [11], have become important frameworks in this direction with excellent performance. U-Net [9] introduced incremental up-sampling of feature maps combined with the corresponding scales of low-level feature maps through "skip connections." UNet++ [11] included additional layers and dense connections to reduce the gap between low- and high-level features. With the increasing importance of polyp segmentation, attention mechanisms [30] have been designed specifically for polyp datasets in recent years. PraNet [8] utilizes a reverse attention module to establish the relationship between region and boundary cues, recovering a clear boundary between a polyp and its surrounding mucosa. However, solely using reverse attention may lead to false detections and introduce unnecessary noise. Inspired by Chen et al. [31], we adopt a separate attention mechanism that combines reverse and normal attention to focus on the background and foreground, respectively.

Vision Transformer
The transformer [16] is a markedly influential deep neural network architecture originally proposed for natural language processing. Initially, the transformer architecture was not well suited for image analysis. To apply transformers to computer vision tasks, Dosovitskiy et al. [13] proposed the vision transformer (ViT), the first pure transformer for image classification. ViT splits an image into patches and processes them as a sequence of tokens. This method substantially reduces the computational cost and enables transformers to efficiently process large-scale images. However, ViT requires large-scale datasets to train effectively and is severely limited when trained on small datasets. This property hinders its usage in problems such as medical segmentation, where data are scarce.
Recent studies have attempted to further enhance ViT in several ways. DeiT [32] introduces a data-efficient training strategy combined with a distillation method, which helps improve performance when training on small datasets. HVT [33] is based on a hierarchical progressive pooling method that compresses the sequence length of tokens, reducing redundancy and computation. TNT [34] adopts a transformer suitable for fine-grained image tasks, segmenting each original image patch and performing self-attention calculations on the smaller units; global and local features are extracted simultaneously using external and internal transformers. Previous research has demonstrated that the pyramid structure of convolutional networks is also applicable to transformers and various downstream tasks, as in the Swin Transformer [35], PVT [36], and SegFormer [37]. PVT is less computationally intensive than ViT and uses the classic semantic FPN for semantic segmentation tasks.
In medical image segmentation, the TransUNet [38] and TransFuse [39] models were developed based on transformers for polyp segmentation and have achieved good results. TransUNet uses a transformer-based network with a hybrid ViT encoder and an upsampling CNN decoder. The hybrid ViT stacks a CNN and a transformer together, resulting in high computational costs. TransFuse solves this problem with a parallel architecture. Both models use the attention gate mechanism [40] and the so-called BiFusion module; these components make the network architecture large and highly complex. To efficiently train models on medical images, Polyp-PVT [17] introduces a similarity aggregation module based on graph convolution [41].

Image Edge Segmentation
Locating pixels on the border is considerably difficult, as demonstrated by many previous methods. To address this issue, various edge-aware models (also called boundary-aware models) have been developed to highlight these hard pixels.
Learning edge information has shown excellent performance in many image segmentation tasks in recent years. In early studies on FCN-based semantic segmentation, Bertasius et al. [42] and Chen et al. [43] used boundaries for post-processing to refine the results at the end of the network. SFANet [44] applies region-boundary constraints to supervise polyp learning. To compensate for missing object parts, Chen et al. [45] and Fan et al. [8] utilized reverse attention blocks to learn the missing parts and details. However, using only edge information as a shape constraint or in reverse attention may lead to incorrect detections or introduce unnecessary noise. Several recent approaches explicitly parallelize boundary detection as an independent subtask alongside semantic segmentation to achieve cleaner results. Ma et al. [46] explicitly exploited boundary information for context aggregation, further enhancing the semantic representation of the model. Kim et al. [47] went a step further than BlendMask [48] and explored base mask representations and boundary information for instance-specific features.
Although the preceding methods can improve performance, they use boundary information only as a supplementary clue to refine target region segmentation and minimally exploit the complementary relationship between regional and boundary features. In contrast, our proposed method mines deep information from the foreground and background and combines it with boundary information to enhance the features at the junction of the foreground and background, thereby improving the segmentation of polyp targets.

Methodology
The full architecture of the network is shown in Figure 2. For an input I ∈ R^{W×H×3}, where W and H denote the width and height of the image, we use the pyramid vision transformer (PVT) as our backbone to extract the multi-level features En_i, i ∈ {1, 2, 3, 4}. First, we feed En_1, En_2, En_3, and En_4 into the channel-wise feature pyramid (CFP) [49] to obtain features with different receptive fields. Second, we utilize the separated edge-guidance (SEG) module to refine the feature maps. The SEG module consists of separator (SE) blocks and edge-guidance (EG) blocks. The SE blocks contain normal and reverse attention streams that focus on the foreground and background, respectively. The foreground maps F_i, i ∈ {1, 2, 3, 4}, are supervised by the ground truth. Furthermore, we utilize an effective edge extractor module (EEM) to obtain the edge map, which is exploited in the EG blocks. Thereafter, we obtain the edge-refined maps, marked as EG_i, i ∈ {1, 2, 3, 4}, and feed them to the cascade fusion module (CFM) to fuse the refined feature maps, leading to a final feature map P. We choose the sum of EG_1 and P as the final output in the inference stage.

Transformer encoder
Some recent works [37, 50] report that vision transformers [36, 51] have stronger performance and robustness to input disturbances, such as noise, than CNNs [5, 52]. Inspired by this, we choose a vision transformer as our backbone network to extract features for polyp segmentation. Compared with [13, 35], the PVT [36] is a pyramid architecture whose spatial-reduction attention reduces the consumption of computing resources. For the segmentation task, we design a polyp segmentation head on top of the four multi-level feature maps (i.e., En_1, En_2, En_3, En_4). Among these feature maps, En_1 captures detailed low-level appearance information, while En_2, En_3, and En_4 provide high-level semantic features.

Edge extractor module
Object detection can benefit from a good edge prior in segmentation and localization [53]. Although low-level features contain rich edge details, they also introduce non-object edge information. To explore edge features associated with the polyp area, high-level semantic or location information is required. As illustrated in Figure 2, we combine the high-level feature (En_4) and the low-level feature (En_1) in this module to model object-related edge information. First, the channels of En_1 and En_4 are changed to 32 and 256, respectively, using two 1×1 convolution layers. Second, the feature En'_1 and the up-sampled feature En'_4 are combined by concatenation. Lastly, we generate the edge feature f_e using two 3×3 convolution layers and one 1×1 convolution layer. The difference between the produced edge map and its edge ground-truth label is measured using the binary cross-entropy loss L_edge = L_BCE(EM_i, E^gt_i), where EM_i denotes the produced edge map of the i-th image after upsampling the edge feature f_e, and E^gt_i denotes the edge ground-truth map. In our model, the Canny edge detection method is used to extract E^gt_i. Moreover, our EEM provides the edge-enhanced representation f_e to guide detection in the separated edge-guidance module. In addition, f_e is cascaded to multiple supervisions to enhance the ability of feature representation.
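The wiring of the EEM described above can be sketched in PyTorch as follows. This is a minimal sketch under stated assumptions: the encoder channel widths (64 and 512), the intermediate width of the two 3×3 layers, and the single-channel edge output are our own choices; only the 32/256 channel reductions and the layer counts come from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeExtractorModule(nn.Module):
    """Sketch of the EEM: fuse the low-level feature En1 and the high-level
    feature En4 into an object-related edge feature f_e."""

    def __init__(self, c1=64, c4=512):
        super().__init__()
        self.reduce1 = nn.Conv2d(c1, 32, kernel_size=1)   # En1 -> 32 channels
        self.reduce4 = nn.Conv2d(c4, 256, kernel_size=1)  # En4 -> 256 channels
        self.fuse = nn.Sequential(                        # 2x 3x3 conv + 1x1 conv
            nn.Conv2d(32 + 256, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),
        )

    def forward(self, en1, en4):
        x1 = self.reduce1(en1)
        # bring the coarse high-level feature up to the low-level resolution
        x4 = F.interpolate(self.reduce4(en4), size=en1.shape[2:],
                           mode='bilinear', align_corners=False)
        return self.fuse(torch.cat([x1, x4], dim=1))      # edge feature f_e
```

In use, `en1` and `en4` would be the first and last backbone feature maps; the returned map is upsampled to image resolution before the BCE supervision against the Canny-derived edge label.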

Separated edge-guidance module
As shown in Figure 3, the original input images have blurred boundaries that disguise polyp areas and make them difficult to segment. To address this issue, a separated edge-guidance (SEG) module is proposed. The module integrates the SE and EG blocks. The SE block contains forward and reverse streams that focus on the foreground and background, respectively. The EG block integrates edge information from the EEM into the feature space to enhance the model's sensitivity to edges.
Separator. When delineating the polyp area in colonoscopic images, information at the boundary between the foreground and background is an important cue. The human visual system perceives border information effectively because it fuses information from the background and from the interior of the object. Inspired by [31], we adopt the SE block, which contains two streams that focus on the foreground and background, respectively. The first stream erases the internal details of objects to focus on the background, while the second stream recovers the internal information of the object to focus on the foreground. The operating mechanism of the separator is to highlight the boundary through the synergy between foreground and background information. The separator can be written as follows:

F_i = Conv(C_i ⊗ expand(σ(Up(F_{i+1})))),
B_i = Conv(C_i ⊗ expand(1 − σ(Up(F_{i+1})))),

where C_i denotes the i-th layer of the feature map produced by the CFP module [49], Up denotes upsampling of the coarse map of the (i+1)-th layer to the resolution of the i-th layer, σ denotes the sigmoid function, Conv is a 1 × 1 convolution, ⊗ indicates element-wise multiplication, and expand() expands the channel dimension of the attention maps to match C_i. The foreground attention map in the i-th layer is thus σ(Up(F_{i+1})), and the background attention map is its complement, 1 − σ(Up(F_{i+1})). Out_i is the coarse output map of the i-th layer, which is supervised by the ground-truth map.

Edge-guidance. We use F_i and B_i as the input to a Channel Attention Module (CAM) [54], which is beneficial for representing different scales of features in a more general way; the attention that obtains the weight of the feature maps on a global and local scale can be written as W(F_i + B_i). After the attention module, we add an edge guider to strengthen the model's understanding of edges so that they are more prominent after the two streams are merged; in particular, we integrate the information of the two streams by simple addition. The mechanism of the edge guider is similar to a conditional normalization module with the edge map as prior knowledge. We consider the edge prediction as our condition, and the module embeds this spatial information into the feature maps F_i and B_i, which allows the feature maps to learn better edge features. The operation is defined as follows:

EGF_i = BN(F_i) ⊗ Conv_{3×3}(f_e) + Conv_{3×3}(f_e),
EGB_i = BN(B_i) ⊗ Conv_{3×3}(f_e) + Conv_{3×3}(f_e),
EG_i = SAM(EGF_i + EGB_i),

where EGF_i and EGB_i are the output results of the foreground and background streams, respectively, EG_i is the output of the SEG module, f_e is the edge feature map, BN denotes batch normalization, and Conv_{3×3} denotes a 3×3 convolutional layer that encodes the edge map and enlarges its channel dimension to match the coarse feature maps. Thereafter, the shuffle attention module (SAM) [55] is utilized to make the model focus on the informative channels.
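A minimal PyTorch sketch of the two blocks is given below, under stated assumptions: the conditional-normalization parameterization of the edge guider (two 3×3 convolutions producing a scale and a shift from the edge map) and all layer names are our own reading of the description, and the CAM and SAM attention modules are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Separator(nn.Module):
    """SE block sketch: split the feature C_i into foreground and background
    streams using the upsampled coarse map of the next level as a prior."""

    def forward(self, c_i, f_next):
        # sigma(Up(F_{i+1})): upsample the coarser prediction, squash to (0, 1)
        prior = torch.sigmoid(F.interpolate(f_next, size=c_i.shape[2:],
                                            mode='bilinear', align_corners=False))
        fg = c_i * prior.expand_as(c_i)        # foreground (normal) stream
        bg = c_i * (1 - prior).expand_as(c_i)  # background (reverse) stream
        return fg, bg

class EdgeGuidance(nn.Module):
    """EG block sketch: conditional-normalization-style modulation of a stream
    by the edge feature f_e."""

    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.scale = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.shift = nn.Conv2d(1, channels, kernel_size=3, padding=1)

    def forward(self, x, f_e):
        # encode the edge map and broadcast it over the stream's channels
        f_e = F.interpolate(f_e, size=x.shape[2:], mode='bilinear',
                            align_corners=False)
        return self.bn(x) * self.scale(f_e) + self.shift(f_e)
```

The two guided streams would then be summed and passed through SAM to obtain EG_i.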

Cascade fusion module
A multi-level feature fusion strategy has been applied and verified to improve segmentation performance. Feature fusion has an immense influence on the quality of the segmentation result, so we design a cascade fusion module (CFM) to fuse the output features more effectively. The CFM takes the four edge-guided maps of the first-round predictions, marked as {EG_i, i = 1, ..., 4}. Each lower-level feature map is aggregated with the result of the previous fusion step. The process can be summarized as Fusion(f_i, EG_i) = Concat(f_i ⊗ EG_i, EG_i). The four levels of fusion features are stacked as shown in Figure 2. The four levels of fusion features {f_i, i = 1, ..., 4} are computed using Equation 3.7, and the final output P is computed by aggregating the f_i.
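The cascade can be sketched as follows. The Fusion(f, EG) = Concat(f ⊗ EG, EG) step follows the formula above, while the fixed channel width and the 3×3 reduction convolution after each concatenation are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadeFusion(nn.Module):
    """CFM sketch: fuse the edge-guided maps from the coarsest level downward,
    aggregating each lower-level map with the previous fusion result."""

    def __init__(self, channels=32):
        super().__init__()
        # one reduction conv per fusion step, restoring the channel width
        self.reduce = nn.ModuleList(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
            for _ in range(3))

    def forward(self, egs):  # egs = [EG1..EG4], finest to coarsest resolution
        f = egs[-1]                                   # start from EG4
        for k, eg in enumerate(reversed(egs[:-1])):   # then EG3, EG2, EG1
            f = F.interpolate(f, size=eg.shape[2:], mode='bilinear',
                              align_corners=False)
            # Fusion(f, EG) = Concat(f * EG, EG), then reduce channels
            f = self.reduce[k](torch.cat([f * eg, eg], dim=1))
        return f                                      # final fused map P
```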

Loss function
Binary cross-entropy loss is widely used in many polyp segmentation tasks. However, it has a clear shortcoming: it leads to poor performance when the number of foreground pixels is considerably smaller than that of background pixels. Inspired by [56], we combine two loss functions as the total loss for supervision:

L = L_wIoU + L_wBCE,

where L_wIoU and L_wBCE denote the weighted IoU loss and the weighted BCE loss for global and local restrictions, respectively. Note that L_wIoU increases the weights of hard pixels to highlight their importance, and L_wBCE focuses more on hard pixels rather than treating all pixels equally. Moreover, our model includes six supervised outputs: four foreground maps (F_1, F_2, F_3, F_4), one feature fusion map P, and one edge map f_e. Each map (i.e., F_1, F_2, F_3, F_4, P) is up-sampled to the same size as the ground-truth map G. Thus, the final total loss function can be represented as follows:

L_total = Σ_{i=1}^{4} L(F_i, G) + L(P, G) + L_edge.

Mathematical Biosciences and Engineering, Volume 19, Issue x, xxx-xxx

Experiments
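For reference, the combined weighted BCE and weighted IoU supervision described in the Loss function section can be sketched in PyTorch, assuming the formulation popularized by [56] (F3Net), including its 31 × 31 local window for the hard-pixel weight:

```python
import torch
import torch.nn.functional as F

def structure_loss(pred, mask):
    """Weighted BCE + weighted IoU loss; `pred` holds logits and `mask` is a
    binary ground-truth map. Pixels whose 31x31 neighborhood disagrees with
    them (i.e., boundary pixels) receive a higher weight."""
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    # weighted BCE, normalized by the total weight per image
    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction='none')
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    # weighted IoU on the sigmoid probabilities
    pred = torch.sigmoid(pred)
    inter = ((pred * mask) * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```

The total loss would then sum this term over the up-sampled maps F_1..F_4 and P, plus the BCE edge supervision on f_e.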

Implementation Details
Our method is implemented in the PyTorch framework and runs on an NVIDIA GeForce RTX 3090 GPU. Considering the differences in the sizes of the polyp images, each input image is simply resized to 352 × 352, and we then adopt a multi-scale training strategy [8, 62, 63]. The network is trained end-to-end with the AdamW [64] optimizer. The learning rate is set to 1e-4, and the weight decay is likewise set to 1e-4. The batch size is set to 16.
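The stated setup can be sketched as follows. The stand-in module and the placeholder objective are not the SegT network; only the optimizer choice, learning rate, weight decay, batch size, and 352 × 352 input size come from the text, and the multi-scale jitter factors commonly used with this recipe (e.g., 0.75/1.0/1.25) are an assumption.

```python
import torch
import torch.nn.functional as F

# Stand-in module: the real network is SegT, which is not reproduced here.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)

images = torch.rand(16, 3, 384, 384)        # dummy batch of 16 (stated batch size)
# resize inputs to 352 x 352; multi-scale training would vary this size
# per iteration, e.g., by a factor in {0.75, 1.0, 1.25} (assumption)
batch = F.interpolate(images, size=(352, 352), mode='bilinear',
                      align_corners=False)

optimizer.zero_grad()
loss = model(batch).mean()                   # placeholder objective
loss.backward()
optimizer.step()
```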

Result
We first evaluate the segmentation performance of our proposed SegT model on the seen datasets. As summarized in Table 1, our model is compared to four recently published CNN-based neural networks: U-Net [9], UNet++ [11], PraNet [8], and CaraNet [65]. Note that our proposed model outperforms the other models on the seen datasets, as shown in Table 1. In the first two rows of the table, we compare two classic medical image segmentation networks (U-Net and UNet++). The SegT network achieves over 10% gains in mDice and mIoU on the Kvasir and ClinicDB datasets. In the 3rd and 4th rows, we compare state-of-the-art models for the polyp segmentation task. Our proposed model outperforms both in mDice, mIoU, and MAE on the seen datasets. In Table 1, we also report the results of three transformer-based methods (TransUNet [38], TransFuse [39], and Polyp-PVT [17]) alongside our proposed framework. Although TransFuse is close to the performance of our proposed model on the ClinicDB dataset, our model is more stable in terms of overall performance. Furthermore, without considering model complexity, our model achieves a 1% improvement in the mIoU metric on the challenging Kvasir dataset. Compared with the other two transformer-based models, the advantage in metric performance is more obvious. We further evaluate the generalization capability of our model on unseen datasets (i.e., ETIS, ColonDB, EndoScene). Table 2 shows that our model outperforms the existing medical segmentation baselines on the unseen datasets. Concretely, the performance gains over the best contender built on a CNN-based backbone network (i.e., CaraNet) are (4.1%, 4.3%, 0.016) for the metrics (mDice, mIoU, MAE) on the ColonDB dataset and (6.3%, 6%, 0.004) on the ETIS dataset. Besides, when compared with transformer-based backbone networks on the challenging ETIS dataset, our SegT surpasses the best competing method (i.e., Polyp-PVT) by 2.3% and 2.5% in the mDice and mIoU metrics,
respectively. However, our evaluation on the EndoScene dataset shows that our method does not demonstrate a significant performance advantage over other approaches. This outcome is mainly because the test set consists of only 60 images, which hardly supports definitive conclusions about the superiority of the different methods.
The 1st row presents examples with low-contrast backgrounds. While most competing methods cannot identify the boundary area, our SegT segments almost all polyp regions correctly. The 2nd row is an example of an occluded polyp; our SegT produces accurate results, while other methods tend to generate results with poor accuracy. The 3rd and 4th rows are examples with relatively large and small targets, respectively; our SegT correctly identifies all the targets, whereas other methods tend to miss several details of the boundaries. The 5th row is an example of brightness interference; our SegT not only accurately segments the target but also eliminates the salient distraction. In summary, SegT produces high-quality prediction maps under various challenging scenarios. It is worth noting that in these examples, the texture of the polyps is almost identical to that of the surrounding environment, which demonstrates that SegT effectively locates the targets by leveraging edge cues.

Ablation Study
We describe in detail the effectiveness of each component of the overall model. The training, testing, and hyper-parameter settings are the same as described in the implementation details. We evaluate module effectiveness by removing components from the complete SegT on three datasets, and we choose mDice, mIoU, and MAE for evaluation. To better explain the relationships between the models, we label the experimental models a to e. Model a is composed of the backbone network and the channel-wise feature pyramid (CFP) module. Since the separator between the foreground and the background, that is, the boundary of the polyp area, contains few pixels, we exploit the Edge-guidance to embed additional edge information into the features and strengthen the model's understanding of the boundary. With the help of the Edge-guidance, the predicted result can maintain a clear edge structure of the object. Effectiveness of CFM. Similarly, we test the effectiveness of the CFM module by removing it from the overall model and replacing it with an element-wise addition operation, which is called Model c. Compared with SegT, the performance of Model c drops on all three datasets by a large margin. This degradation demonstrates that the CFM helps effectively integrate refined feature information at every stage. Comparing Model a with Model d, the baseline model with the CFM module also performs better on most of the evaluation metrics.
The visual results are given in Figure 5. Green and red indicate regions that are not detected and regions that are accurately detected, respectively. Evidently, our designed modules obtain significant results in the edge detection of both small and large target regions. We observe that the SEG module refines fine-grained ambiguous boundaries, and the CFM module significantly improves the accuracy of object detection and target localization.

Conclusion
We have proposed a new polyp image segmentation framework called SegT. SegT is inspired by the way humans observe objects with blurred boundaries: by comparing the foreground and background, the outline of the object can be depicted. This research therefore argues that boundary information enhances polyp segmentation. On the basis of the preceding observations, we first utilize a PVT backbone as an encoder to extract more powerful and robust features. Thereafter, we propose a SEG module composed of two blocks (i.e., SE and EG blocks). The SE block separates two streams: one focuses on the foreground and disregards the background, while the other focuses on the background and erases the foreground. After each stream, edge information is embedded into the features using the EG block, and the two streams are fused to enhance the model's ability to detect object boundaries. Lastly, the CFM is used to obtain more accurate features. Extensive experiments show that SegT consistently outperforms competing methods on five challenging datasets without any pre-/post-processing.
Further work. Although the SegT model provides a powerful and effective solution for the polyp segmentation task, some limitations deserve further exploration. First, in the current work, boundary information is collected explicitly to guide the foreground and background streams toward the subtle differences that depict the boundary of polyps. For humans, in contrast, this information extraction and integration is implicitly included in the knowledge-learning process. Moreover, this design brings additional inference cost. In future work, we will further simplify the inference structure to make it more consistent with the actual human decision-making process. In addition, the backbone of the SegT model was pre-trained on ImageNet, whose natural images differ substantially from medical images. In future work, we will use pre-training that is more suitable for medical image segmentation and adapt the model structure for 3-D medical image segmentation.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1 .
Figure 1. The main challenge of polyp segmentation is the object boundary, which is ambiguous because the color and texture of polyps are very similar to the surrounding tissues. Segmentation examples of our model and Polyp-PVT [17] on different datasets show that our model performs better on the border. Red indicates the correctly detected polyp region; green indicates the missed polyp region.

Figure 2 .
Figure 2. Overview of the proposed SegT, which consists of three key components: the edge extractor module (EEM), the separated edge-guidance (SEG) module, and the cascade fusion module (CFM).

Figure 3 .
Figure 3. Illustration of the proposed separated edge-guidance (SEG) module, consisting of two key components: separator (SE) block and edge-guidance (EG) block.
Model a is composed of the PVTv2 backbone and the channel-wise feature pyramid (CFP) module; Model b adds the separator (SE) block to Model a; Model c adds the separated edge-guidance (SEG) module to Model a; Model d is the final model without the cascade fusion module; and Model e is our final model. We evaluate these models on three benchmark datasets. Quantitative experimental results are shown in Table 3.

Figure 5 .
Figure 5. Effectiveness of each component.Green indicates regions that are not detected.Red represents regions that are accurately detected.

Table 1.
Quantitative comparison of different methods on the Kvasir and ClinicDB (seen) datasets to validate our model's learning ability. ↑ denotes the higher the better and ↓ denotes the lower the better.

Table 2.
Quantitative comparison of different methods on the ColonDB, ETIS, and EndoScene (unseen) datasets to validate our model's generalization capability. ↑ denotes the higher the better and ↓ denotes the lower the better.

Table 3.
Ablation study for SegT on the Kvasir, ETIS, and ColonDB datasets. ↑ denotes the higher the better and ↓ denotes the lower the better.

Effectiveness of SEG. By comparing Model a with Model b, we observe that Model b outperforms Model a in terms of all the evaluation metrics, which means that adding the separator block improves performance. The apparent improvement in the evaluation metrics shows that the separator can highlight the boundaries of objects by focusing on the foreground and background information separately, thereby improving the accuracy of polyp segmentation. To validate the effectiveness of the Edge-guidance, we compare the results of Model b and Model c. After adding the Edge-guidance (EG), the performance of Model c increases compared with Model b. Moreover, we further investigate the contribution of the SEG by removing it from the overall model, labeled as Model d; the performance without the SEG drops sharply on all three datasets. Compared with Model d, Model e shows an improvement, which demonstrates that the two blocks in the SEG module work effectively.