Automated polyp segmentation based on a multi-distance feature dissimilarity-guided fully convolutional network

Abstract: Colorectal malignancies often arise from adenomatous polyps, which typically begin as solitary, asymptomatic growths before progressing to malignancy. Colonoscopy is widely recognized as a highly effective clinical method for polyp detection, offering valuable visual data that facilitates precise identification and subsequent removal of these polyps. Nevertheless, accurately segmenting individual polyps is considerably difficult because polyps exhibit intricate and changeable characteristics, including shape, size, color, number and growth context, across different stages. Similar contextual structures around polyps significantly hamper the ability of commonly used convolutional neural network (CNN)-based automatic detection models to capture valid polyp features, and CNN models with large receptive fields often overlook the details of small polyps, which leads to false and missed detections. To tackle these challenges, we introduce a novel approach for automatic polyp segmentation, the multi-distance feature dissimilarity-guided fully convolutional network. This approach comprises three essential components, i.e., an encoder-decoder, a multi-distance difference (MDD) module and a hybrid loss (HL) module.


Introduction
Colorectal cancer (CRC), the third most common cancer in the world, posed a significant global health challenge in 2020, with over 1.9 million new cases and 930,000 deaths reported [1]. Clinically, the identification of colon cancer plays an important role in cancer treatment. Nevertheless, current methodologies for colon cancer detection remain limited in scope. One frequently employed strategy is colonoscopy, which examines the patient's gastrointestinal tract (i.e., the bowel) in order to identify and remove colonic polyps, which are often considered potential precursors of colon cancer. Unfortunately, accurately recognizing polyps in colonoscopy images is challenging in three major respects. First, polyps vary in color, shape and appearance, and these characteristics can evolve over time. Second, the context in which polyps appear in different regions of the colon is complicated and varied. Third, the subtle color differences and low contrast between polyps and healthy tissue make it difficult to distinguish their characteristic features. Consequently, physicians often struggle to precisely delineate polyp boundaries from the surrounding healthy tissue. These limitations leave a substantial number of cases undetected, affecting approximately 10% of patients who are misdiagnosed with metastatic colon cancer; the resulting delay in diagnosis reduces the likelihood of patient survival [2]. Hence, developing automated computer-assisted techniques for highly accurate polyp segmentation is of paramount importance for the effective prevention and treatment of colorectal cancer.
Recently, with the advancement of deep learning, earlier polyp segmentation methods that rely on manually extracted and constructed features are gradually being phased out. In their place, polyp segmentation methods based on convolutional neural networks (CNNs), a deep learning model commonly used in image analysis, have proven promising and efficient compared with traditional segmentation techniques [3,4].
Moreover, fully convolutional networks (FCNs) [5,6] are commonly used in medical image segmentation and can refine feature representation and learning. There are three reasons for applying the FCN framework to automatic colon polyp segmentation: 1) Semantic segmentation capability: polyp segmentation is essentially a semantic segmentation task that aims to accurately delineate polyps and distinguish them from the background in endoscopic images, and the FCN architecture is well suited to this purpose because it captures detailed spatial information to produce pixel-level segmentation maps. 2) End-to-end learning: an FCN learns feature extraction and segmentation simultaneously, which facilitates capturing complex and changing polyp features. 3) Efficiency: FCNs are known for their computational efficiency and can process an entire image in a single forward pass, which is critical for fast and accurate polyp detection and segmentation during colonoscopy.
Notably, U-Net [7], a type of FCN designed for semantic segmentation, and its variants [8][9][10] are widely used for this purpose and have brought new advances in polyp segmentation. The advantages of U-Net for automated polyp segmentation are threefold: 1) Fully convolutional architecture: U-Net builds on the FCN framework, so it can handle input images of arbitrary size and output pixel-level segmentation results. This flexibility is crucial in polyp segmentation, where the size and location of polyps vary significantly. 2) Downsampling and upsampling paths: U-Net consists of a downsampling path that captures global features and an upsampling path that preserves local details. This dual-path design allows U-Net to process a wide range of features and fine details simultaneously, which is particularly beneficial for tasks such as polyp segmentation. 3) Skip connections: U-Net connects features from the downsampling path to the upsampling path. These connections propagate information between layers and enable a better contextual understanding of discriminative features, which is especially important when dealing with the complex structures of polyps.
Nevertheless, existing FCN approaches, including U-Net-based methods, often fall short of the accuracy needed to segment tiny polyps. Such polyps occur frequently, are small in volume, closely resemble their surroundings and suffer from significant background interference. Theoretically, the primary reason for this shortcoming is that most FCN models have a limited ability to mine features powerful enough to capture small, barely perceptible polyps, and to exploit contextual knowledge that suppresses interference from adjacent tissues or complex backgrounds.
To tackle these issues, we propose a novel automatic polyp segmentation approach based on a multi-distance feature dissimilarity-guided FCN. In essence, low-level features within a network preserve intricate details and contours, but also background noise. In contrast, high-level features cannot discern sharp boundaries but offer consistent semantic properties and a wealth of contextual information. Recognizing the complementary nature of high-level and low-level features, we design a multi-distance difference (MDD) module that enhances the feature representation through multi-layer feature subtraction (MLFS) operations, thus facilitating more comprehensive feature learning. More importantly, we specifically analyze the information variations between neighboring layers at short distances, cross-layer relationships at short distances and cross-layer connections over longer distances, thus effectively overcoming the challenges posed by the polyp's contextual environment. The final polyp segmentation result is generated by merging the differential information from both adjacent layers and cross-layers, which feeds the distinctive attributes of each layer into the decoder for convolution and upsampling. In addition, we leverage a hybrid loss (HL) module, which requires no training and is capable of supervising the feature maps of each decoded layer. Overall, this paper makes the following four major contributions:
1) We present a novel network architecture for polyp segmentation that incorporates a multi-distance feature dissimilarity approach to capture feature information spanning from high-level to low-level layers. Feature subtraction operations eliminate duplicate information between short-range and long-range feature layers. This process ensures a comprehensive representation of the differential information between adjacent layers and across layers, while emphasizing the complementary nature of the feature maps from the lower to the higher layers. Accordingly, we acquire more discriminative feature information for each pyramid layer, thus enhancing the recognition of polyps.
2) We expand the perceptual range of the network by extracting features from various receptive fields. To achieve this, we develop a bi-directional feature complementation technique. In particular, this technique augments high-level features with detail-rich low-level features in a bottom-up manner, and complements low-level features with high-level features that carry meaningful semantic and contextual information in a top-down manner. As a result, this strategy is particularly suitable for segmenting small polyps that are heavily affected by background interference.
3) We introduce a hybrid loss module for efficient feature supervision. This module uses learnable weights to dynamically balance the different losses and incorporates a hierarchical difference loss that improves the optimization of details in segmented polyps by focusing on finer features.
4) Comprehensive experimental comparisons on four benchmark polyp segmentation datasets validate our network model. The results demonstrate its superior performance under six evaluation metrics, particularly on the challenging Kvasir dataset.

Related works
In an era of rapid advances in computer technology, especially in machine learning, computer-aided cancer detection has become crucial in helping medical practitioners identify and diagnose cancer. For colorectal cancer, the third most prevalent cancer worldwide, techniques for automated detection have been continuously innovated. In this section, we specifically review methods based on CNNs and FCNs.

CNN-based methods
Traditional polyp segmentation methods that involve manual extraction and design of features have proven inadequate for current requirements owing to numerous limitations, e.g., dependence on high-quality medical imaging, long processing times and sensitivity to parameter selection. By contrast, with the innovation of deep learning techniques, polyp segmentation based on CNNs has emerged as a promising approach. For instance, Tajbakhsh et al. [11] presented a polyp detection approach based on a unique three-way image representation and a CNN, which dramatically reduced false positives. Shin et al. [12] employed a region-based CNN to automatically detect polyps in images and videos acquired during colonoscopy, using an image enhancement strategy to alleviate the shortage of training polyp images. Nisha et al. [13] introduced a dual-path CNN for detecting polyps in colonoscopy images. Although these methods have made valuable contributions to automated polyp segmentation, their detection performance is limited when polyps and surrounding tissue appear homogeneous. Detecting the precise structure and boundary details of polyps through convolutional operations alone remains challenging.

FCN-based methods
To improve the accuracy and efficiency of automatic segmentation, end-to-end FCNs have been increasingly employed to refine the representation and learning of polyp features in polyp segmentation tasks. Brandao et al. [5] converted three well-established networks into an FCN architecture to recognize and segment polyps in colonoscopy images. Ji et al. [14] explored a progressive normalized self-attention network based on the FCN framework, which can effectively learn feature representations from polyp videos in real time. Further, Wen et al. [15] presented a simple yet effective polyp segmentation strategy that combines FCN-based segmentation with CNN-based classification. Sanderson and Matuszewski [16] proposed a novel architecture that integrates an FCN with a transformer to improve polyp segmentation in colonoscopy images. Nevertheless, these FCN-based methods often ignore spatial coherence and pixel relationships, leading to blurred content and incomplete boundaries in the predictions.
To address these issues, the U-Net structure [7] supplements the transfer of feature information from the encoder to the decoder through long skip connections, which accurately maintains the complex details of target tissues and has made U-Net very popular in biomedical image segmentation. Building on U-Net, the variant UNet++ [17] captures fine-grained details of foreground objects more efficiently through nested and dense skip connections. Similarly, Jha et al. [18] used the ResUNet++ structure to segment polyps, taking full advantage of residual blocks, attention blocks, atrous spatial pyramid pooling, etc. Recent studies have concentrated on calibrating the misalignment between polyp regions and boundaries in order to detect ambiguous boundaries. For example, Fan et al. [19] established associations between region and boundary cues by exploiting a reverse attention module, and Zhao et al. [20] generated differential features at neighboring layers with subtraction units. More recently, Wu et al. [21] introduced a multiscale transformer attention mechanism for high-precision polyp segmentation that embeds transformer blocks into a U-shaped encoder-decoder framework to realize multi-scale attention for adaptive feature integration. Lewis et al. [22] proposed a dual encoder-decoder network for polyp segmentation that combines a transformer encoder with an enhanced dilated transformer decoder to improve overall segmentation capability. However, despite the notable advances of these U-Net-based methods, they remain unsatisfactory for segmenting tiny polyps that closely resemble their surroundings and lie in cluttered backgrounds with severe interference.
In contrast to conventional FCN models, our research introduces a multi-distance difference module that enhances the representativeness of features at each pyramid layer and seamlessly integrates high-level and low-level features. This integration allows our model to capture the fine-grained polyp information essential for accurately identifying small polyps. Our innovation stands out in three key aspects: 1) Elimination of background interference: traditional FCN approaches struggle to capture polyp context information and are susceptible to background interference, which leads to misdetection. To overcome this, we eliminate redundant background information and emphasize complementary polyp features through feature subtraction, which exploits the difference information between adjacent or cross-layer features. This significantly strengthens the network's ability to distinguish polyps. 2) Extending the perceptual range: standard U-Net models may struggle to recognize small targets because the relative receptive field grows during downsampling. Our solution is a bi-directional feature complementation technique that extracts features from various receptive fields, extending the network's perceptual range and enhancing the synergy between high-level and low-level features, which is crucial for identifying small polyps. 3) Optimizing fine polyp details: to address insufficient detail in traditional CNN segmentation results, our supervision process incorporates a hierarchical difference loss. This loss prioritizes the refinement of finer polyp features, improving the quality of segmented polyps by capturing the necessary intricate details.
In summary, our research introduces the multi-distance difference module and the hybrid loss module to enhance polyp segmentation by addressing background interference, extending the network's ability to identify small polyps and optimizing the details of segmented polyps. Together, these innovations contribute to the accurate identification and segmentation of polyps in medical images. The proposed multi-distance feature dissimilarity-guided FCN model is illustrated in Figure 1 and consists of three key components: an encoder-decoder, a multi-distance difference (MDD) module and a hybrid loss (HL) module. 1) Encoder-decoder architecture: the foundation of our model is an encoder-decoder framework. The encoder consists of a five-layer pyramid structure that extracts essential features from the input image; through a series of hierarchical convolution operations, the raw image is transformed into feature maps of varying dimensions. In the decoder, the received features are further processed by convolution and upsampling operations, and a final Sigmoid function yields the prediction results. 2) MDD module: the MDD module is the core innovation of our approach. It enhances the model's performance by aggregating and disaggregating features from adjacent layers or across layers, thus highlighting their complementarities and differences. To achieve this, we developed a bi-directional feature complementation strategy that extends the network's perceptual range. This strategy facilitates the integration of high-level and low-level features, which is critical for polyp recognition, particularly for polyps with intricate characteristics. Notably, the features of each layer within the MDD module are fused and propagated to the corresponding pyramid layer of the decoder. 3) HL module: another essential component is the HL module, which explicitly accounts for losses from hierarchical differences and plays a central role in supervising training. During training, it compares the prediction outcomes with the ground truths, and the network's parameters are updated by propagating the prediction errors through backpropagation. This dynamic learning process enhances the model's ability to generate precise segmentation results.

The proposed approach
Although our underlying framework follows the standard FCN encoder-decoder structure, our contributions go beyond it. We introduce two pivotal components, the multi-distance difference module and the hybrid loss module, which are decisive for precise polyp segmentation. These components are particularly important for accurately identifying polyps, including the challenging task of detecting small polyps amid intricate background interference.

Overview of the proposed model
The encoder consists of five encoding layers that use Res2Net [23] as the backbone network; the feature maps generated by each layer are denoted as $E_i$, $i \in \{1,2,3,4,5\}$. The original image is progressively passed to the next layer through downsampling operations, and the feature maps of each pyramid layer are propagated to the corresponding layer of the MDD module through a Conv-BN-ReLU combination operation. Let $M_i$, $i \in \{1,2,3,4,5\}$, denote the five layers of features within the MDD module (as illustrated in Figure 1). Each layer contains the following features:
$$M_i = \{ m_i^{1}, m_i^{2}, \dots, m_i^{n} \} \tag{1}$$
where $i$ indicates the layer number and $n$ denotes the depth of the features extracted by each layer, i.e., the number of features. The first features $m_i^{1}$ of all five feature layers in the MDD module are extracted from the corresponding five layers of the encoder, which can be represented as:
$$m_i^{1} = \mathrm{ReLU}\big(\mathrm{BN}(\mathrm{Conv}_{k}(E_i))\big) \tag{2}$$
where ReLU, BN and Conv denote the activation, batch normalization and convolution operations, respectively, and $k$ denotes the different convolution kernels. For the feature maps in the MDD module, we calculate the dissimilarity between adjacent-layer features and between cross-layer features by applying the multi-layer feature subtraction procedure (MLFS; see Figure 2 for details). In particular, to adequately capture the differences between the features of each layer, we perform MLFS on three categories of features: short-range adjacent-layer features, short-range cross-layer features and long-range cross-layer features. The specific sets of features on which MLFS is performed are summarized in Table 1. This configuration allows the MDD module not only to account for the perceptual differences between features with different receptive field sizes, but also to retain the information differences between multi-scale features, enabling the network to explore the details of the overall polyp structure. Moreover, the complementarity of high- and low-level features yields a more comprehensive knowledge representation while mitigating a large amount of redundant information. This, in turn, improves the precision of polyp localization and the clarity of boundaries for polyps of varying sizes.
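The MLFS operation itself is only shown in Figure 2, so the following Python sketch is a hedged illustration rather than the authors' implementation. It assumes that MLFS resembles the core of the subtraction unit in MSNet [20], i.e., an element-wise absolute difference between two feature maps that have already been resized to the same resolution (the learned convolution that follows the subtraction is omitted). It also groups the layer pairs of a five-level pyramid by distance. The helper names `mlfs` and `group_layer_pairs` and the distance thresholds (1 for adjacent, 2 for short-range cross-layer, 3 or more for long-range cross-layer) are hypothetical and do not reproduce Table 1.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Grid = List[List[float]]  # one single-channel feature map


def mlfs(a: Grid, b: Grid) -> Grid:
    """Hypothetical MLFS: element-wise absolute difference |a - b|.

    Assumes both maps were already resized to a common resolution; the
    learned convolution that would normally follow is omitted.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same shape")
    return [[abs(x - y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def group_layer_pairs(num_layers: int = 5) -> Dict[str, List[Tuple[int, int]]]:
    """Group layer pairs (i, j), i < j, by their distance j - i.

    The thresholds below are illustrative assumptions, not Table 1.
    """
    groups: Dict[str, List[Tuple[int, int]]] = {
        "short_adjacent": [],  # neighbouring layers, j - i == 1
        "short_cross": [],     # skip one layer, j - i == 2
        "long_cross": [],      # skip two or more layers, j - i >= 3
    }
    for i, j in combinations(range(1, num_layers + 1), 2):
        gap = j - i
        if gap == 1:
            groups["short_adjacent"].append((i, j))
        elif gap == 2:
            groups["short_cross"].append((i, j))
        else:
            groups["long_cross"].append((i, j))
    return groups
```

For a five-level pyramid, `group_layer_pairs()` yields four adjacent pairs, three short-range cross-layer pairs and three long-range cross-layer pairs; each pair would then be passed through `mlfs` after resizing the deeper map.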
Finally, after the features of each pyramid layer are subtracted and combined along the vertical direction in the MDD module, the resulting feature maps of each layer are propagated to the decoder layer at the same level through horizontal aggregation. The decoder feature maps are denoted as $D_i$, $i \in \{1,2,3,4,5\}$ (see Figure 1 for details), and each $D_i$ aggregates the features of the MDD layer at the same level, whose number is $n$ (see Eq (1)). All decoder layer features are activated by the Sigmoid function and fused to generate the final prediction. In particular, we use element-wise addition or concatenation in the decoder to progressively fuse the features at all levels. Compared with traditional methods, this approach involves fewer parameters in backpropagation, which makes the proposed network relatively easy to train.
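To make the horizontal aggregation and progressive fusion more concrete, the sketch below sums the features within each MDD level, fuses the levels from deep to shallow by nearest-neighbour 2× upsampling and element-wise addition, and applies a Sigmoid at the end. It is a simplified, hypothetical illustration: the helper names are illustrative, the learned decoder convolutions are omitted, each level is assumed to have half the resolution of the level above it, and a single Sigmoid on the fused map stands in for the per-layer activation described in the text.

```python
import math
from typing import List

Grid = List[List[float]]  # one single-channel feature map


def add_maps(maps: List[Grid]) -> Grid:
    """Element-wise sum of same-sized maps (horizontal aggregation)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(cols)] for r in range(rows)]


def upsample2x(m: Grid) -> Grid:
    """Nearest-neighbour 2x upsampling: repeat every value in a 2x2 block."""
    out: Grid = []
    for row in m:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def sigmoid_map(m: Grid) -> Grid:
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in m]


def decode(mdd_layers: List[List[Grid]]) -> Grid:
    """mdd_layers[0] is the shallowest (largest) level; each deeper level is
    assumed to have half the resolution. Returns a probability map."""
    fused = add_maps(mdd_layers[-1])            # start at the deepest level
    for level in reversed(mdd_layers[:-1]):     # move towards shallow levels
        fused = add_maps([add_maps(level), upsample2x(fused)])
    return sigmoid_map(fused)
```

Element-wise addition keeps the channel count fixed at every step, which is one way to read the remark that this fusion adds few parameters; concatenation would instead require a convolution to reduce the channels again.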

Hybrid loss module
To optimize the network's predictions, we employ a hybrid loss module for supervised learning between the prediction results and the ground truth (GT). The hybrid loss function $\mathcal{L}$ is expressed as follows:
$$\mathcal{L} = \alpha\,\mathcal{L}_{wIoU} + \beta\,\mathcal{L}_{BCE} + \left|\mathcal{L}_{hd}\right|$$
where $|\cdot|$ represents the mean calculation, $\alpha$ and $\beta$ denote the corresponding learnable weights for adaptive loss distribution, $\mathcal{L}_{wIoU}$ and $\mathcal{L}_{BCE}$ represent the weighted intersection over union (IoU) loss and the binary cross entropy (BCE) loss, respectively, and $\mathcal{L}_{hd}$ denotes the loss derived from hierarchical differencing, which is used to correct the loss function. $\mathcal{L}_{hd}$ can be calculated as follows:
$$\mathcal{L}_{hd} = \{ d_1, d_2, d_3, d_4, d_5 \}$$
where $d_i$, $i \in \{1,2,3,4,5\}$, denotes the Euclidean distance between the prediction results $F_{map}^{i}$ obtained after multi-scale feature extraction from the five decoder layers and the corresponding GTs $G_i$. $d_i$ can be defined as:
$$d_i = \left\| F_{map}^{i} - G_i \right\|_2$$
The lowest layer of $F_{map}$ contains a large amount of boundary information, while the higher layers contain rich location knowledge. Thus, including $\mathcal{L}_{hd}$ enhances the model's ability to generalize to various types of polyps.
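A minimal sketch of the hybrid loss described above, written in plain Python over flattened maps. It assumes that the IoU and BCE terms are computed on the final prediction, that $\alpha$ and $\beta$ are passed in as fixed numbers (in the model they are learnable), and that the mean is taken over the five per-layer distances $d_i$. Plain (unweighted) soft-IoU and BCE losses are swapped in for the weighted versions, whose pixel weighting is not specified here, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

EPS = 1e-7  # numerical guard for log(0) and division by zero


def bce_loss(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Mean binary cross-entropy; pred holds probabilities in [0, 1]."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, EPS), 1.0 - EPS)  # clamp to avoid log(0)
        total += -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
    return total / len(pred)


def soft_iou_loss(pred: Sequence[float], gt: Sequence[float]) -> float:
    """1 - soft IoU (unweighted stand-in for the weighted IoU loss)."""
    inter = sum(p * g for p, g in zip(pred, gt))
    union = sum(p + g for p, g in zip(pred, gt)) - inter
    return 1.0 - (inter + EPS) / (union + EPS)


def euclidean(pred: Sequence[float], gt: Sequence[float]) -> float:
    """d_i: Euclidean distance between one decoder output and its GT."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))


def hybrid_loss(final_pred: Sequence[float], final_gt: Sequence[float],
                layer_preds: List[Sequence[float]], layer_gts: List[Sequence[float]],
                alpha: float = 1.0, beta: float = 1.0) -> float:
    """alpha * L_IoU + beta * L_BCE + mean of the per-layer distances d_i."""
    d = [euclidean(p, g) for p, g in zip(layer_preds, layer_gts)]
    l_hd = sum(d) / len(d)  # the mean over hierarchical distances
    return (alpha * soft_iou_loss(final_pred, final_gt)
            + beta * bce_loss(final_pred, final_gt)
            + l_hd)
```

A perfect prediction drives all three terms close to zero, while a uniform 0.5 prediction is penalized by every term.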

Experimental datasets and evaluation metrics
To verify the effectiveness of the proposed model, we evaluated it on four benchmark polyp segmentation datasets: CVC-ColonDB [24], CVC-ClinicDB [25], CVC-T [26] and Kvasir [27]. We assume that these four datasets represent actual clinical cases and contain a diversity of polyp characteristics. A brief description of each dataset and its application areas is provided below: 1) CVC-ColonDB is a dedicated colonic polyp dataset comprising images of polyps of diverse types, sizes and shapes. It is mainly used in medical image processing and computer-aided diagnosis, particularly for the detection and segmentation of colonic polyps during colonoscopy. 2) CVC-ClinicDB is characterized by its diversity, with a wide array of endoscopic images depicting various diseases and clinical scenarios. It is widely used in medical image analysis, covering tasks such as colonic polyp segmentation and the analysis of diverse endoscopic images. 3) Similar to CVC-ClinicDB, the CVC-T dataset contains a variety of endoscopic images depicting different diseases, making it a valuable resource for testing and validating medical image processing algorithms, including colonic polyp segmentation. 4) The Kvasir dataset contains images from different endoscopy devices and clinical scenarios. It is widely adopted in medical image processing research, covering tasks including colonic polyp segmentation, lesion detection and the analysis of other endoscopic images. In summary, these four datasets are widely used and representative in colonic polyp segmentation. They encompass endoscopic images from various clinical scenarios, devices and patient cases, thus providing a wealth of diverse data for our research.
In the training stage, we used the same training set as the polyp segmentation model MSNet [20], that is, about 38% of the images are from the CVC-ClinicDB dataset and 62% from the Kvasir dataset, totaling 1450 images. We also supplemented the training set with selected data from other datasets. For a comprehensive evaluation of the model's performance and more convincing results, we used six evaluation metrics common in object detection and image segmentation, as listed below:
• The mean Intersection over Union (mIoU), which measures the similarity between the predicted segmentation and the actual one. For each image,
$$\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$$
where $P$ and $G$ denote the sets of elements (pixels) of the predicted and actual segmentations, respectively; mIoU is the average over all test images.
• The mean Dice coefficient (mDice), which measures the consistency between the segmented region of interest and the manually segmented one:
$$\mathrm{Dice} = \frac{2\,|P \cap G|}{|P| + |G|}$$
• The mean absolute error (MAE), which computes the absolute error between each predicted value and the corresponding actual value and averages all errors:
$$\mathrm{MAE} = \frac{1}{N}\sum_{k=1}^{N} \left| P_k - G_k \right|$$
where $P_k$ and $G_k$ represent the predicted and corresponding actual values of the $k$-th of $N$ pixels.
• The weighted Dice metric ($F_{\beta}^{w}$) [28], a variant of the Dice coefficient that assigns weights according to demand and computes a weighted average, defined as:
$$F_{\beta}^{w} = \frac{(1+\beta^{2})\,\mathcal{P}^{w} \cdot \mathcal{R}^{w}}{\beta^{2}\,\mathcal{P}^{w} + \mathcal{R}^{w}}$$
where $\mathcal{P}$ and $\mathcal{R}$ denote precision and recall, respectively, and $w$ and $\beta$ indicate the corresponding weight coefficients.
• The S-measure ($S_{\alpha}$) [29], which evaluates the structural similarity between the predicted segmentation and the ground truth from both object and region perspectives, computed as:
$$S_{\alpha} = \alpha\,S_{o} + (1-\alpha)\,S_{r}$$
where $S_{o}$ and $S_{r}$ represent the object-oriented and region-oriented structural similarity measures, respectively, and $\alpha$ denotes the corresponding weight coefficient, set to 0.5 in our experiments.
• The E-measure ($E_{\phi}$) [30], which jointly captures image-level statistics and local pixel-level matching by comparing the segmented result with the ground-truth mask, defined as:
$$E_{\phi} = \frac{1}{W \times H}\sum_{x=1}^{W}\sum_{y=1}^{H} \mathcal{M}(x, y)$$
where $\mathcal{M}$ refers to the enhanced alignment matrix, and $W$ and $H$ denote the width and height of the matrix, respectively.
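The metric definitions above can be sketched in plain Python for flattened binary masks. This is an illustrative approximation, not the evaluation toolbox behind Tables 2-8: `f_beta` takes precision and recall that the caller has already computed (weighted or not), `s_measure` only combines precomputed $S_o$ and $S_r$ scores, and `e_measure` follows the enhanced-alignment formulation of [30] without the special cases that reference implementations apply to all-background or all-foreground ground truths. All function names are hypothetical.

```python
from typing import Sequence

EPS = 1e-8  # avoids division by zero in the alignment matrix


def iou(pred: Sequence[int], gt: Sequence[int]) -> float:
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


def dice(pred: Sequence[int], gt: Sequence[int]) -> float:
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    total = sum(1 for p in pred if p) + sum(1 for g in gt if g)
    return 2 * inter / total if total else 1.0


def mae(pred: Sequence[float], gt: Sequence[float]) -> float:
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)


def f_beta(precision: float, recall: float, beta2: float = 1.0) -> float:
    """F-measure from precision and recall; beta2 is beta squared."""
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom else 0.0


def s_measure(s_object: float, s_region: float, alpha: float = 0.5) -> float:
    return alpha * s_object + (1 - alpha) * s_region


def e_measure(pred: Sequence[float], gt: Sequence[float]) -> float:
    """Enhanced-alignment measure for binary maps of equal length."""
    n = len(pred)
    mu_p, mu_g = sum(pred) / n, sum(gt) / n
    total = 0.0
    for p, g in zip(pred, gt):
        bp, bg = p - mu_p, g - mu_g                       # bias (mean-removed) values
        xi = 2 * bp * bg / (bp * bp + bg * bg + EPS)      # alignment term
        total += (1 + xi) ** 2 / 4                        # enhanced alignment
    return total / n
```

Averaging `iou` and `dice` over all test images gives mIoU and mDice, as described in the list above.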

Implementation details
In the preprocessing phase, all input images are resized to 352 × 352, and a traditional multi-view training method is applied. The proposed method is implemented in PyTorch and run on two NVIDIA GeForce RTX 3090 GPUs. In the training phase, the stochastic gradient descent (SGD) optimizer is configured with a learning rate of 0.05, a batch size of 16, a momentum of 0.9, a weight decay of 0.0005 and 50 epochs. For data augmentation, we adopt random cropping and flipping to prevent overfitting.
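For reference, the optimizer settings above correspond to `torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=0.0005)` in PyTorch. The dependency-free sketch below mirrors that optimizer's update rule (no dampening, no Nesterov momentum) for a list of scalar parameters, and adds a random crop plus horizontal flip applied identically to an image and its mask. The crop size, the 0.5 flip probability and the helper names are assumptions for illustration, since the text does not specify them.

```python
import random
from typing import List, Optional, Tuple

Grid = List[List[float]]

# Training settings stated in the text.
CONFIG = {"image_size": 352, "lr": 0.05, "batch_size": 16,
          "momentum": 0.9, "weight_decay": 0.0005, "epochs": 50}


def sgd_step(params: List[float], grads: List[float], bufs: List[Optional[float]],
             lr: float = CONFIG["lr"], momentum: float = CONFIG["momentum"],
             weight_decay: float = CONFIG["weight_decay"]) -> Tuple[List[float], List[float]]:
    """One SGD step with momentum and L2 weight decay (dampening = 0)."""
    new_params, new_bufs = [], []
    for p, g, b in zip(params, grads, bufs):
        d_p = g + weight_decay * p                    # add L2 penalty gradient
        b = d_p if b is None else momentum * b + d_p  # momentum buffer
        new_params.append(p - lr * b)
        new_bufs.append(b)
    return new_params, new_bufs


def random_crop_flip(image: Grid, mask: Grid, crop: int,
                     rng: random.Random) -> Tuple[Grid, Grid]:
    """Apply the same random crop and horizontal flip to an image and its mask."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop)    # inclusive bounds keep the crop inside
    left = rng.randint(0, w - crop)
    img = [row[left:left + crop] for row in image[top:top + crop]]
    msk = [row[left:left + crop] for row in mask[top:top + crop]]
    if rng.random() < 0.5:            # flip both or neither
        img = [row[::-1] for row in img]
        msk = [row[::-1] for row in msk]
    return img, msk
```

Sharing one random generator call per decision between image and mask keeps the augmented pair spatially aligned, which segmentation supervision requires.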

Quantitative and qualitative experiments
To fully demonstrate the superiority of the proposed method, we perform quantitative and qualitative experiments comparing it with three medical image segmentation models (i.e., PraNet [19], MSNet [20] and Inf-Net [31]) and two state-of-the-art salient object detection models (i.e., DCFM [32] and UMNet [33]). We assume that these architectures, developed in prior research, are applicable to the polyp segmentation task.
Tables 2-5 compare the quantitative evaluation results of the proposed model with those of the other five models. As Tables 2 and 5 show, our model ranks first on all six evaluation metrics on both the CVC-ColonDB and Kvasir datasets. Moreover, Table 3 shows that on the CVC-ClinicDB dataset, our model achieves the best results on all metrics except one, on which it ranks second, only 0.0038 below the first-ranked PraNet model. Similarly, as Table 4 shows, apart from one metric on which it is slightly lower than the top-ranked PraNet model, the proposed model performs best on all other metrics among the compared models. These results show that our model achieves state-of-the-art performance and can comprehensively and accurately segment polyps of varying sizes, which we attribute to the novel multi-distance feature dissimilarity approach it employs. To assess the statistical differences between our model and the comparison models across the four datasets, we computed p-values by comparing the Dice values; the outcomes are detailed in Table 6. P-values less than 0.05 indicate a statistically significant difference between the two groups. According to Table 6, the Dice results of Inf-Net, DCFM and UMNet differ significantly from those of our model on all four datasets, implying that these models perform notably worse than our approach. Furthermore, PraNet and MSNet show no statistically significant differences from our model on the CVC-ClinicDB and Kvasir datasets, but they do differ significantly on the CVC-ColonDB and CVC-T datasets, respectively. Hence, considering the cumulative analysis across all datasets, our model emerges as the more favorable choice for polyp detection. Figure 3 visually compares the segmentation results of the various models. The proposed model achieves accurate detection regardless of whether the polyp lies in a shaded, narrow region (1st row) or in a bright region (2nd row). In addition, for images with cluttered background interference (3rd row) or with tissue resembling polyps in appearance (4th row), our approach accurately localizes the correct polyp region. Notably, our method also detects tiny polyps completely (1st row). In comparison, the Inf-Net, DCFM and UMNet models have difficulty determining polyp regions, and the PraNet model struggles to recognize small polyp regions and is susceptible to background interference. Similarly, the performance of MSNet is strongly affected by background noise.
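The text does not state which statistical test produced the p-values in Table 6. As one hedged option, the sketch below runs a two-sided paired sign-flip permutation test on the per-image Dice scores of two models evaluated on the same test images; a paired t-test or a Wilcoxon signed-rank test (e.g., `scipy.stats.wilcoxon`) are common alternatives. The function name, the number of permutations and the seed are hypothetical choices.

```python
import random
from typing import Sequence


def paired_permutation_pvalue(a: Sequence[float], b: Sequence[float],
                              n_perm: int = 10000, seed: int = 0) -> float:
    """Two-sided paired sign-flip permutation test on per-image scores.

    Under the null hypothesis the sign of each per-image difference is
    exchangeable, so random sign flips generate the null distribution of
    the mean difference.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(flipped) >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

With identical score lists the function returns 1.0, and a consistent per-image advantage over many images drives the p-value well below 0.05.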

Ablation study
To validate the contributions of the proposed MDD and HL modules, we conducted ablation experiments on two datasets, CVC-ColonDB and Kvasir, and quantitatively compared different baseline configurations in terms of five evaluation metrics, as reported in Tables 7 and 8. The configurations are: 1) Baseline 1, consisting of the encoder-decoder backbone network and the summed features F1, denoted as BN+F1; 2) Baseline 2, which adds the feature group F2 to Baseline 1, denoted as BN+F1+F2; 3) Baseline 3, which adds the feature group F3 to Baseline 2, denoted as BN+F1+F2+F3; 4) Baseline 4, which adds the feature group F4 to Baseline 3, denoted as BN+F1+F2+F3+F4; 5) Baseline 5, which adds the last remaining feature to Baseline 4 to form the complete MDD module, denoted as BN+MDD; and 6) Baseline 6, i.e., the full proposed model, which further adds the HL module, denoted as BN+MDD+HL. As shown in Table 7, the segmentation metrics of the BN+F1 configuration are notably low on the CVC-ColonDB dataset. By comparison, BN+F1+F2 performs dramatically better after adding features and performing MLFS operations between neighboring layers. All metrics continue to improve as more MLFS operations are added in BN+F1+F2+F3, BN+F1+F2+F3+F4 and BN+MDD. Finally, adding the HL module (BN+MDD+HL) improves the five metrics by 0.0222, 0.0242, 0.0292, 0.0229 and 0.0385, respectively, compared with BN+MDD. Similar trends are observed on the Kvasir dataset, as presented in Table 8: the BN+MDD+HL configuration again outperforms the other baselines, and performance improves gradually from BN+F1 to BN+MDD+HL. These results further demonstrate the efficacy of the proposed modules.
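The incremental baselines above add difference features between encoder layers, starting with neighboring layers and then adding further MLFS operations. The following minimal Python sketch illustrates one way to group such features by layer distance, under stated assumptions: feature maps are flat lists already resized to a common shape, the MLFS operation (not defined in this section) is replaced by a plain element-wise absolute difference, and the mapping of F3 and F4 to layer distances 2 and 3 is our own hypothetical reading rather than the paper's definition. All names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

FeatureMap = List[float]  # hypothetical: a flattened, already-aligned feature map


def pairs_at_distance(num_levels: int, d: int) -> List[Tuple[int, int]]:
    """All encoder-level index pairs (i, j) with j - i == d."""
    return [(i, i + d) for i in range(num_levels - d)]


def abs_difference(a: Sequence[float], b: Sequence[float]) -> FeatureMap:
    """Stand-in for the MLFS operation: element-wise absolute difference."""
    return [abs(x - y) for x, y in zip(a, b)]


def multi_distance_differences(features: Sequence[Sequence[float]],
                               max_distance: int) -> Dict[int, List[FeatureMap]]:
    """Difference maps grouped by layer distance d = 1..max_distance."""
    n = len(features)
    return {
        d: [abs_difference(features[i], features[j]) for i, j in pairs_at_distance(n, d)]
        for d in range(1, max_distance + 1)
    }
```

With five encoder levels, distances 1 to 4 yield 4, 3, 2 and 1 layer pairs, respectively; enabling the distances one at a time loosely mirrors how the baselines grow from BN+F1+F2 towards BN+MDD.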
Figure 4 visualizes the segmentation results of the different baselines. The polyp region identified by BN+F1 is quite blurry, making it difficult to pinpoint the precise location of the polyp (see Figure 4(c)). With the adoption of MLFS, BN+F1+F2 outlines a rough polyp boundary but still leaves some regions unsegmented owing to interference from relatively high exposure intensity (see Figure 4(d)). The baselines BN+F1+F2+F3 and BN+F1+F2+F3+F4 overcome the high-exposure interference to some extent, but their results are still affected by background noise (see Figure 4(e) and (f)). BN+MDD successfully suppresses the noise, although the segmented polyp region is structurally incomplete (see Figure 4(g)). Finally, by employing the HL module, our complete model produces a prediction that closely matches the ground truth (see Figure 4(h)).

Conclusions
In this paper, we propose a novel network architecture for automatic polyp segmentation, a multi-distance feature dissimilarity-guided FCN composed of three core modules: the encoder-decoder, the MDD module and the HL module. The MDD module effectively mitigates the challenges posed by cluttered backgrounds and by normal tissue areas that closely resemble polyps in appearance. It does so by capturing dissimilarity information between different layers of the encoder, thus delivering more discriminative features to the decoder. In addition, the MDD module strengthens the representation of micro-polyps by supplementing low-level features with semantic and contextual information and high-level features with detailed information, which improves the discrimination of tiny polyps. Based on the difference features at various scales and receptive fields aggregated by the MDD module, the HL module further improves the completeness of the segmented polyps and the clarity of their details by supervising features at multiple scales. Experiments on four challenging datasets show that our model achieves state-of-the-art performance across six evaluation metrics, indicating that the proposed method can facilitate the early-stage identification of colon cancer.
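The HL module is described as supervising features at multiple scales, but its formula is not given in this section. The minimal Python sketch below assumes a hybrid loss that adds binary cross-entropy (a pixel-level term) and a soft IoU loss (a region-level term), computed for each side output and summed with per-scale weights; this combination is common in polyp segmentation but is not confirmed to be the authors' HL. The function names, the equal default weights and the assumption that every side output is already upsampled to the ground-truth resolution are hypothetical.

```python
import math
from typing import Optional, Sequence


def bce(prob: Sequence[float], gt: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over pixels (pixel-level term)."""
    total = 0.0
    for p, g in zip(prob, gt):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(gt)


def soft_iou_loss(prob: Sequence[float], gt: Sequence[int], eps: float = 1e-7) -> float:
    """1 - soft IoU computed on probabilities (region-level term)."""
    inter = sum(p * g for p, g in zip(prob, gt))
    union = sum(prob) + sum(gt) - inter
    return 1.0 - (inter + eps) / (union + eps)


def hybrid_loss(side_outputs: Sequence[Sequence[float]], gt: Sequence[int],
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of (BCE + soft IoU) over side outputs at several scales.

    Hypothetical: each side output is a flattened probability map that has
    already been upsampled to the ground-truth resolution.
    """
    if weights is None:
        weights = [1.0] * len(side_outputs)
    return sum(w * (bce(p, gt) + soft_iou_loss(p, gt))
               for w, p in zip(weights, side_outputs))
```

Combining a pixel-wise term with a region-wise term is one way to reward both correct individual pixels and complete polyp regions, which matches the completeness and detail goals attributed to the HL module above.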

Use of AI tools declaration
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1. Structural diagram of the proposed model, containing an encoder-decoder module, an MDD module and an HL module.

Figure 3. Visual comparison of results from various models. These examples are selected from four different datasets. (a) Input images, (b) ground truths of the corresponding images and (c-h) segmentation results generated by different models.

Figure 4. Visualization results for each baseline on the CVC-ColonDB dataset. (a) Input image, (b) ground truth of the corresponding image and (c-h) segmentation results generated by different baselines.

Table 1. The set of feature maps for performing MLFS operations in the MDD module.

Table 2. Quantitative comparison of various models in terms of six evaluation metrics on the CVC-ColonDB dataset. The best results are highlighted in bold.

Table 3. Quantitative comparison of various models in terms of six evaluation metrics on the CVC-ClinicDB dataset. The best results are highlighted in bold.

Table 4. Quantitative comparison of various models in terms of six evaluation metrics on the CVC-T dataset. The best results are highlighted in bold.

Table 5. Quantitative comparison of various models in terms of six evaluation metrics on the Kvasir dataset. The best results are highlighted in bold.

Table 6. P-values between the Dice scores of the various comparison models and those of the proposed model on the CVC-ColonDB, CVC-ClinicDB, CVC-T and Kvasir datasets. Subgroups with statistically significant differences are highlighted in bold.

Table 7. Quantitative comparison of the baselines on the CVC-ColonDB dataset.

Table 8. Quantitative comparison of the baselines on the Kvasir dataset.