Simultaneous segmentation and classification of colon cancer polyp images using a dual branch multi-task learning network

Accurate classification and segmentation of polyps are two important tasks in the diagnosis and treatment of colorectal cancers. Existing models perform segmentation and classification separately and do not fully exploit the correlation between the two tasks. Furthermore, polyps appear in random regions with varying shapes and sizes, and they often share similar boundaries and backgrounds. Existing models fail to consider these factors and thus are not robust. To address these issues, we developed a multi-task network that performs segmentation and classification simultaneously and copes with the aforementioned factors effectively. The proposed network has a dual-branch structure comprising a transformer branch and a convolutional neural network (CNN) branch. This approach enhances local details within the global representation, improving both local feature awareness and global contextual understanding, and thus helps preserve polyp-related information. Additionally, we designed a feature interaction module (FIM) that bridges the semantic gap between the two branches and integrates their diverse semantic information. This integration enables the full capture of global context information and local details related to polyps. To prevent the loss of edge detail information crucial for polyp identification, we introduced a reverse attention boundary enhancement (RABE) module that gradually enhances edge structures and detailed information.


Introduction
Colorectal cancers are malignant tumors that commonly occur in the colon and rectum. They are among the most prevalent cancer types globally, ranking third in cancer incidence and being a leading cause of cancer-related deaths in the United States. Advanced rectal cancers are difficult to cure, so improving survival rates is key. Early detection and diagnosis play crucial roles in improving survival.
Colon polyps are lumps in the lining of the colon. They have a high probability of turning into cancers and are a leading cause of colon cancers. Thus, the detection and removal of polyps are important in preventing them from developing into colon cancers. The primary tool for screening for colon cancers is colonoscopy. Studies show that the prevalence of rectal cancers can be reduced by as much as 30 percent with regular colonoscopies. Polyps fall into two types based on their shapes: pedunculated polyps and sessile polyps (as shown in Figure 1). The latter account for only 15% and are difficult to detect in images. Incorrect diagnosis carries the risk of bleeding and perforation, so accurately identifying the type of polyp and choosing the treatment require a high degree of concentration and experience from physicians. However, according to [1], about 25% of polyps are missed during routine colonoscopy. Thus, we need an accurate and efficient method for polyp classification and segmentation.
In recent years, deep learning has made significant progress in processing medical images. Chen et al. [2] proposed a gait pattern recognition method for lower limb exoskeletons based on long short-term memory (LSTM) and a convolutional neural network (CNN) to improve recognition accuracy. Tian [3] proposed a new evaluation strategy for artificial neural network models, which was shown experimentally to be closer to the actual biological nervous system. Xu [4] used deep learning methods to predict new cases of the novel coronavirus and achieved high prediction performance. Xie [5] proposed a Physically Constrained Deep Active Learning (P-DAL) framework to model spatiotemporal cardiac electrodynamics; the results showed that P-DAL is significantly better than traditional modeling methods. Guan [6] proposed a texture-constrained multi-channel asymptotic generative adversarial network (TMP-GAN) that adopts multi-channel joint training and effectively avoids the typical shortcomings of current generative methods.
Because of its good performance in image analysis, deep learning has also been used to help endoscopists improve the accuracy and efficiency of diagnosis. In a computer-aided diagnosis (CAD) system for colorectal cancers, polyp segmentation and classification are two important tasks. With the development of CNN-based methods, research on polyp segmentation and classification has made some progress. Zhang [7] presented a transfer learning algorithm for the classification of colorectal polyps and achieved excellent results. Bourne et al. [8] proposed a real-time evaluation model that classifies polyps into two classes: adenomatous polyps and hyperplastic polyps. Their experiments on polyp videos showed that the model obtained an accuracy of 94%. Younas et al. [9] combined the strengths of individual weak learners to form a weighted ensemble model for multiclass polyp classification.
In addition to polyp classification, some work addresses polyp segmentation. U-shaped architectures are the baseline for most medical image segmentation and have therefore also been studied for polyp segmentation. Inspired by U-Net [10], U-Net++ [11] improved U-Net by employing multi-scale nested skip connections and showed high accuracy in polyp segmentation. Jha et al. [12] extended ResUNet++ [13] by incorporating conditional random fields for polyp segmentation. To solve the edge blurring problem caused by the high similarity of polyps to the background, Fan et al. [14] developed a parallel reverse attention network. Their network aggregates high-level information through a parallel partial decoder to generate a global map, which is combined with a reverse attention module to extract boundary information. Zhang [15] designed a network with an attention module that adaptively focuses on different background information. This network alleviates intra-class inconsistency, and experimental results on two datasets validated its accuracy. Ji et al. [16] were the first to study video polyp segmentation (VPS): they introduced a high-quality VPS dataset with frame-by-frame annotations, designed a simple and efficient model (PNS+), and demonstrated the effectiveness and high performance of this baseline through many experiments. Lin et al. [17] proposed a bit-slice contextual attention network for polyp segmentation to improve the extraction of boundary information, together with a dual-path attention link encoder to further improve segmentation performance. Many experiments showed that this method effectively improves polyp segmentation. Zhang et al. [18] built a parallel architecture by adding a transformer to a CNN for polyp segmentation. Their network captures both long-range dependencies and local information, and experiments on several datasets substantiated the effectiveness of the method.
Although the aforementioned methods improve polyp segmentation or classification in comparison with traditional methods, they still face several challenges: 1) Past methods performed polyp segmentation and classification separately, ignoring the intrinsic correlation between the two tasks. Multi-task learning, in contrast, allows the network to share feature representations such as image texture, shape, and boundary, which improves learning efficiency and feature representation. In addition, multi-task learning allows information interaction and co-learning between the polyp segmentation and classification tasks, which improves the robustness and generalization of the model. 2) Existing multi-stage methods [19][20][21] that combine the two tasks are based on CNNs. Due to the limitations of convolutional operations, they can only establish short-distance dependencies and cannot establish relationships between target pixels and global pixels. This often leads to the neglect of a significant amount of global information that is crucial for detecting the location of the targets and necessary for more accurate classification and segmentation of polyps. For this reason, some researchers have enhanced CNNs by extracting global contextual information. Nonetheless, in most cases, this approach fails to yield satisfactory results.
To overcome the limitations of existing methods, we propose a multi-task network that enhances the model's performance by training it concurrently for classification and segmentation. Furthermore, we develop a dual-branch network structure that combines a transformer encoder and a CNN encoder, inheriting the advantages of both. The transformer branch enhances the model's ability to capture global context information by learning long-range dependencies among inter-pixel features, aiding in the localization of polyp regions. Meanwhile, the CNN branch excels at capturing feature representations with spatial information (especially local information), such as edge information, which benefits the segmentation of small targets. In addition, to fully use the advantages of the two branches, we propose a feature interaction module (FIM) for information fusion and a RABE module to enhance the extraction of fuzzy boundaries. In summary, our contributions are as follows:
1) We propose a multi-task model for simultaneous segmentation and classification of colon polyps. The proposed network uses an end-to-end architecture with a task-sharing encoder to strengthen the correlation between the task networks. In this network, we adopt a dual-branch structure that incorporates a transformer to extract global features from colon images while retaining the advantages of the CNN. This approach enables the network to learn more meaningful feature information and significantly improves the segmentation and classification results for polyps.
2) We propose a feature interaction module that eliminates the semantic gap between the transformer and the CNN. It also fully integrates the global contextual information and the local detail information of polyps extracted by the two branches. This approach reduces the loss of polyp location and detail information.
3) We design a RABE module to further extract boundary information by establishing relationships between the targets and the boundaries. This enhancement improves the network's performance in detecting polyps with ambiguous target boundaries.
4) We evaluate the proposed method extensively on several benchmark datasets. The experimental results demonstrate that our approach outperforms other state-of-the-art methods in both polyp segmentation and classification.
The paper is structured into five sections. The introductory section provides background knowledge relevant to the paper's focus. The second section reviews research pertinent to the methodology employed in this study. The third section describes the principles and structure of the proposed method. The fourth section substantiates the method's effectiveness through a series of experiments. Lastly, the fifth section analyzes and summarizes the paper's findings.

Multitask learning
Multi-task learning (MTL) is a machine learning paradigm that uses the useful information contained in multiple related tasks to help improve the performance of all tasks. It has had a great impact in many fields, such as natural language processing and computer vision. Compared with single-task learning, it shares the common features of multiple tasks, accomplishes multiple tasks at the same time, and generalizes well, which makes it an important application of deep learning. Because annotation by radiologists is time-consuming and labor-intensive and medical images are label-intensive, it is necessary to analyze medical images comprehensively by means of multiple related tasks. Chen et al. [22] proposed a multi-task learning network for the segmentation and classification of atria. Their results showed that, by sharing features between related tasks, the multi-task network obtains additional anatomical information about the atria and achieves more accurate atrial segmentation. Zhang et al. [23] proposed a multi-task relational learning network for the segmentation, localization, and identification of vertebrae. It uses the relationship between vertebrae and the correlation of the three tasks to train the network, and its effectiveness was demonstrated on an MRI dataset. Zhou et al. [24] proposed a multi-task learning framework for joint classification and segmentation of tumors in ultrasound images. The framework includes a network for segmentation and a multi-scale network for classification. Experiments on three clinical datasets using an iterative training strategy demonstrated that the multi-task framework performs better than the single-task learning framework. Liu et al. [25] proposed a multi-task learning method for processing data stored in different locations. This method transforms the original centralized computing framework into a distributed framework that can be computed in parallel, thereby enhancing both learning performance and efficiency.
In summary, previous studies have demonstrated the effectiveness of multi-task learning networks. However, these multi-task models still overlook the importance of global features. Therefore, we propose to leverage the transformer architecture to construct a two-branch network for capturing global features.

Transformer
In earlier studies, various CNN-based network models were developed for polyp classification or segmentation, and they achieved some level of effectiveness. However, these methods often overlook the global features of the targets due to the limitations of convolution operations, which hinders further improvement. In recent years, transformers have proven to be excellent models for extracting global features from targets, primarily through the self-attention mechanism. A large body of research, some of it predating transformers [26,27], has confirmed that self-attention can enhance the performance of CNNs in many applications. Inspired by self-attention, many transformer-based models have been proposed. Dosovitskiy et al. [28] applied a transformer to image classification and achieved good performance. Carion et al. [29] proposed DETR, a model for object detection, whose results on the COCO dataset outperformed Faster R-CNN. Due to the superior performance of transformers, many studies have combined transformers with other models for vision tasks. Chen et al. [30] combined a CNN and a transformer for medical image segmentation and achieved promising results. Transformer-based methods have also shown great potential in colon polyp detection and classification. Wang et al. [31] proposed a multilayer fusion network that uses a hierarchical guided strategy to aggregate information. The network combines a transformer encoder and a CNN encoder to extract deep semantic information and shallow localized spatial features for polyp detection, and it yielded reliable results. Huang et al. [32] explored a joint technique that combines transformers and CNNs to address the challenges of polyp segmentation. They introduced interaction modules for the identification and fusion of information from both sources, resulting in a more robust model than existing methods. Park et al. [33] proposed SwinE-Net for polyp segmentation, and extensive experiments on five public datasets demonstrated the model's generalizability and scalability. In contrast to the work mentioned above, our goal is to develop a multi-task model that combines transformers and CNNs for diagnosing colon cancer from colonoscopy images.

Network architecture
The structure of our proposed network for the segmentation and classification of polyps is shown in Figure 2(a). The network is composed of a dual-branch encoder, a classification module, and a decoder. The dual-branch encoder consists of a transformer branch and a CNN branch. The transformer branch gathers high-level semantic information about polyps and captures long-range dependencies. The CNN branch learns the localized detailed texture features of polyps. We designed a feature interaction module that fuses global and local features more effectively by eliminating the semantic gap between the two branches and learning more useful information from the fused features. The decoder consists of a partial decoder (PD) and a RABE module. The PD combines features from multiple levels for decoding and aggregates an initial pre-segmentation result. This combined information is then used together with the RABE module for level-by-level decoding, resulting in a series of feature mappings $D_i$, $i \in \{0, 1, 2, 3\}$, which are used to facilitate network learning. The decoding process extracts boundary information of the polyps, capturing structural details and minimizing segmentation errors at the boundaries. In addition, the segmentation and classification tasks share the dual-branch encoder. The classification task uses the shared encoder in Figure 2(a) and the classification module in Figure 2(b) to classify the polyp images. The classification module consists of four fully connected layers, three normalization layers, two dropout layers, and a sigmoid activation function. In the classification module, we set the first dropout rate to 0.7 and the second to 0.5. The output assigns each colonoscopy image to one of two polyp classes.

Dual branch encoder structure
The shapes, sizes, and locations of polyps vary significantly across images, especially between large and small target regions. Consequently, inaccurate segmentation and classification may occur, primarily due to the absence of contextual information. Many existing segmentation and classification networks use encoder-decoder structures, with the encoder playing a pivotal role in information extraction. The encoder learns the mapping relationships between pixels and their corresponding topology and projects the learned salient features onto the pixel space. Hence, the design of the encoder directly affects the robustness of the extracted features and thereby the overall performance of the network. A traditional CNN-based encoder learns through convolutional parameter sharing, which makes it more sensitive to noise in the input. To address this issue, we integrate a transformer into the network because transformers can obtain more robust information through long-range dependency modeling. The combination inherits the advantages of both CNNs and transformers and provides richer encoded information and semantic features for polyp segmentation and classification.
For the transformer branch, we use patch partition to divide the input polyp image $X \in \mathbb{R}^{H \times W \times 3}$ into a set of non-overlapping image patches. The feature dimension of each patch is $\frac{H}{4} \times \frac{W}{4} \times 3$, and the number of patches is $4 \times 4$. These patches are then used as input to the transformer branch. We employ a Mix Transformer (MiT) [34] (as shown in Figure 2(c)). Compared with the models in [29], MiT generates multi-scale features and can improve the performance of semantic segmentation. It preserves both coarse- and fine-grained features, enabling more accurate region classification and a more complete set of edges.
MiT has several advantages. First, to reduce the complexity of self-attention computation, MiT employs a sequence reduction operation known as efficient self-attention, which reduces the computational complexity by decreasing the sequence length. Specifically, for each input patch, the Q, K, and V values are calculated through linear transformations, and the attention weights of the multi-head attention are calculated from these three values. The attention outputs of all heads are then concatenated to obtain the final attention representation used to update the model. In the original multi-head self-attention process, the Q, K, and V of each head have the same dimensions $N \times C$, where $N = H \times W$ is the length of the sequence. This process is expressed by the formula
$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_{head}}}\right)V.$$
To reduce the amount of calculation, the K matrix with input dimension $N \times C$ is transformed as follows:
$$\hat{K} = \mathrm{Reshape}\left(\frac{N}{R}, C \cdot R\right)(K), \qquad K' = \mathrm{Linear}(C \cdot R, C)(\hat{K}),$$
where $K$ is the sequence to be reduced, $K'$ is the reduced sequence with dimension $\frac{N}{R} \times C$, and $R$ is the reduction ratio. $d_{head}$ is a scalar used to scale the attention weights, which mitigates the vanishing gradient problem of the softmax function when the inner product is too large. We set the value of $d_{head}$ to 4 and set the $R$ values for the four transformer blocks to $[64, 16, 4, 1]$.
Second, to enhance the representation capability of the model, Mix-FFN is added to improve the self-attention model. A depthwise separable convolution and a multilayer perceptron convey position information in Mix-FFN to ensure local continuity. This not only reduces computational complexity and parameter requirements but also greatly aids in localizing the polyp region. This process is expressed by the formula
$$x_{out} = \mathrm{MLP}(\mathrm{GELU}(\mathrm{Conv}_{3 \times 3}(\mathrm{MLP}(x_{in})))) + x_{in},$$
where MLP represents the multilayer perceptron, GELU$(\cdot)$ represents the activation function, and $x_{in}$ represents the output of efficient self-attention. Finally, MiT includes an overlap patch merging module, which reduces the feature map size while increasing the number of channels.
However, MiT still has some shortcomings: it does not effectively handle continuity information between blocks, which can lead to segmentation results with lost boundaries or details. These issues can be mitigated by the CNN branch. The convolutional encoder in the CNN branch preserves shallow high-resolution features for better characterization of local information. The CNN branch employs ResNet18 as the backbone and adopts small $3 \times 3$ convolution kernels. Small kernels effectively learn the relative relationships between neighboring pixels and thereby extract texture and detail information.
The process in the network is as follows. For a given input image $X \in \mathbb{R}^{H \times W \times 3}$, the CNN branch first performs a convolution operation to obtain the feature map $F_0 \in \mathbb{R}^{H_0 \times W_0 \times C_0}$. It then applies four ResBlock operations, resulting in feature maps $F_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, where $C_i \in \{64, 128, 256, 512\}$ and $i \in \{1, 2, 3, 4\}$. In parallel, the transformer branch extracts four layers of feature maps $T_i \in \mathbb{R}^{H_i \times W_i \times C_i}$, where $C_i \in \{64, 128, 256, 512\}$. Finally, the outputs $F_i$ and $T_i$ of each layer from the two branches are jointly input into the feature interaction module to combine the information of the two branches.
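The sequence reduction in efficient self-attention can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the helper names (`reduce_sequence`, `efficient_attention`), the single attention head, the random projection weights, and the toy sizes are all assumptions made for illustration. The value matrix is shortened in the same way as the key matrix so that the attention weights and the values have matching lengths.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def reduce_sequence(x, ratio, weight):
    """Reshape an (N, C) matrix to (N/ratio, C*ratio), then project back to C channels."""
    n = len(x)
    assert n % ratio == 0, "sequence length must be divisible by the reduction ratio"
    # Concatenate every `ratio` consecutive tokens into one longer token.
    reshaped = [sum(x[i:i + ratio], []) for i in range(0, n, ratio)]
    return matmul(reshaped, weight)  # weight has shape (C*ratio, C)

def efficient_attention(q, k, v, ratio, w_k, w_v):
    """Single-head attention in which K and V are shortened by `ratio` (illustrative)."""
    d_head = len(q[0])
    k_red = reduce_sequence(k, ratio, w_k)  # (N/ratio, C)
    v_red = reduce_sequence(v, ratio, w_v)  # (N/ratio, C)
    scores = [[sum(a * b for a, b in zip(q_row, k_row)) for k_row in k_red] for q_row in q]
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    return matmul(weights, v_red), weights

# Toy sizes (assumptions): N = 8 tokens, C = 4 channels, reduction ratio R = 4.
random.seed(0)
N, C, R = 8, 4, 4
rand = lambda rows, cols: [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
q, k, v = rand(N, C), rand(N, C), rand(N, C)
out, attn = efficient_attention(q, k, v, R, rand(C * R, C), rand(C * R, C))
print(len(out), len(out[0]))   # output keeps the (N, C) shape
print(N * N, N * (N // R))     # number of attention scores without and with reduction
```

With these toy sizes, each query attends to N/R = 2 reduced positions instead of 8, which is where the saving in the attention computation comes from.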

Feature interaction module
Because the learning mechanisms and the semantic information acquired by the transformer branch and the CNN branch are distinct, seamless fusion of information from both branches is crucial for exploiting the advantages of both encoders. We propose a feature interaction module (FIM) to achieve this goal. The FIM employs an interactive fusion approach to integrate the local features with the global representation, effectively eliminating the semantic gap between the two branches. The proposed FIM is shown in Figure 3.
To narrow the semantic gap between the CNN and the transformer, we first employ convolutional kernels of various sizes to expand the receptive field of the feature map $F_i$ obtained from the CNN branch, which captures more contextual information. The results of the various convolutional operations are then concatenated into a new feature map denoted as $\hat{F}_i$. Subsequently, we perform an element-wise summation between $\hat{F}_i$ and the feature $T_i$ produced by the transformer branch. The whole process can be represented by the following equations:
$$\hat{F}_i = \mathcal{R}(F_i), \qquad M_i = \hat{F}_i \oplus T_i,$$
where $\mathcal{R}(\cdot)$ denotes the convolutional operations applied to the feature map $F_i$ to enlarge the receptive field, followed by concatenation, and $\oplus$ denotes element-wise summation. $\mathcal{R}(\cdot)$ is composed of four convolutional branches. The first level of each branch employs a $1 \times k$ convolutional kernel, where $k \in \{1, 3, 5, 7\}$. Except for the first branch, each branch is composed of three convolutional levels. For the second level of the second to fourth branches, the convolution employs $3 \times 1$, $5 \times 1$, and $7 \times 1$ kernels, respectively. For the third level of the second to fourth branches, $3 \times 3$ dilated convolutions are employed. Each branch outputs a feature map $B_k$ ($k = 1, 2, 3, 4$), so that
$$\hat{F}_i = \mathrm{Concat}(B_1, B_2, B_3, B_4).$$
The feature maps $\hat{F}_i$ and $T_i$, which have different receptive fields, are fused to obtain $M_i$.
In addition, we want $F_i$ and $T_i$ to learn useful features from their fused feature $M_i$ while retaining their respective original features. According to the attention mechanism [35], spatial attention extracts the spatial relationships of features and focuses on the regions with key information in the image, thus improving the perception of local details. Meanwhile, channel attention learns the relationships between feature channels and ultimately selects effective features. Therefore, we first obtain the important information of the fused feature $M_i$ in the spatial and channel dimensions: the spatial attention map $A_i^{s}$ and the channel attention map $A_i^{c}$. Then, we pass the information in $A_i^{s}$ and $A_i^{c}$ to the original input features $F_i$ and $T_i$ so that they learn the effective features of $M_i$. To maintain the original features of each branch in this process, we perform a Split operation on the channel attention map to select different channel weights for $F_i$ and $T_i$. The features $F_i$ and $T_i$ are first multiplied with the spatial attention map at the spatial level and then multiplied with their respective channel weights at the channel level. Finally, the fused feature map $Z_i$ is obtained by combining the results $\tilde{F}_i$ and $\tilde{T}_i$ produced by the two branches. This process can be expressed as
$$A_i^{s} = \mathrm{SA}(M_i), \qquad A_i^{c} = \mathrm{CA}(M_i), \qquad (w_i^{F}, w_i^{T}) = \mathrm{Split}(A_i^{c}),$$
$$\tilde{F}_i = F_i \otimes A_i^{s} \otimes w_i^{F}, \qquad \tilde{T}_i = T_i \otimes A_i^{s} \otimes w_i^{T},$$
where SA and CA denote the spatial and channel attention operations, $\otimes$ denotes spatial-level or channel-level multiplication, and FC represents the fully connected operation used in computing the attention maps.
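The reweighting step of the FIM can be sketched on tiny nested lists. This is only an illustration under stated assumptions: the attention maps are taken as given rather than computed, the channel attention is assumed to hold one weight per channel for each branch so that it can be split in half, and the two reweighted features are combined by element-wise summation. The function names `scale` and `fuse` are hypothetical.

```python
def scale(feature, spatial_map, channel_weights):
    """Multiply a (C, H, W) feature by an (H, W) spatial map, then by per-channel weights."""
    return [
        [[value * spatial_map[r][c] * w for c, value in enumerate(row)]
         for r, row in enumerate(channel)]
        for channel, w in zip(feature, channel_weights)
    ]

def fuse(cnn_feat, trans_feat, spatial_map, channel_attention):
    """Reweight both branch features with a shared spatial map and split channel weights."""
    c = len(cnn_feat)
    assert len(channel_attention) == 2 * c, "expects one weight per channel per branch"
    # Split: the first half weights the CNN branch, the second half the transformer branch.
    w_cnn, w_trans = channel_attention[:c], channel_attention[c:]
    cnn_out = scale(cnn_feat, spatial_map, w_cnn)
    trans_out = scale(trans_feat, spatial_map, w_trans)
    # Combine the two results by element-wise summation (an assumption of this sketch).
    return [
        [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(cnn_out, trans_out)
    ]

# One channel, one row, two columns, with made-up attention values.
cnn_feat = [[[1.0, 2.0]]]
trans_feat = [[[3.0, 4.0]]]
spatial_map = [[0.5, 1.0]]
channel_attention = [0.2, 0.8]
fused = fuse(cnn_feat, trans_feat, spatial_map, channel_attention)
print(fused)  # approximately [[[1.3, 3.6]]]
```

Splitting the channel weights lets each branch keep its own emphasis, while the shared spatial map points both branches at the same image regions.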

The partial decoder
The good performance of U-Net-based image segmentation relies on the aggregation of multilevel features extracted from the encoder. For example, U-Net aggregates all the hierarchical features extracted from the encoder, and many network variants build on U-Net, such as U-Net++ [11] and ResUNet [36]. However, research has found that low-level features contribute less to performance than high-level features, while the computational cost is high when both low-level and high-level features are used. Thus, to use the features more efficiently and reduce the computational cost, we developed a PD module in the decoder path, as shown in Figure 4. In the PD module, we use only the three high-level feature mappings $Z_i$, $i \in \{2, 3, 4\}$. The specific steps are as follows. We first bring the three feature mappings to the same channel size using $1 \times 1$ convolution. Then, we resample the resulting feature mappings to the same spatial resolution and concatenate them. The final feature map $P = \mathrm{PD}(Z_2, Z_3, Z_4)$ is obtained using convolution, batch normalization, and ReLU operations. Our PD module uses a small number of parameters to preserve multi-scale contextual information for localizing the approximate position of polyps.

The reverse attention boundary enhancement module
The network with the dual-branch encoder and the FIM can only obtain the approximate positions of polyps and lacks refinement of the polyp structure. More generally, networks for polyp segmentation lack detailed boundary information and structural details. To address this issue, [14] introduced the reverse attention module for polyp segmentation and obtained some improvements in preserving the boundaries of polyps. Inspired by [14], we added a reverse attention boundary enhancement (RABE) module in the decoder, as shown in Figure 5. The RABE module consists of reverse attention and spatial attention. We first use the reverse attention mechanism to focus on the details of the polyp boundary. This helps the network identify the edge information between the target and the background, making the boundary of the segmentation result clearer and more accurate. The result is then gradually incorporated into the decoder to obtain the global segmentation feature map, as expressed by the following formula:
$$R_i = 1 - \mathrm{Sigmoid}(\mathrm{Sample}(D_i)),$$
where $D_i$ denotes the feature map obtained from the decoder, and $R_i$ denotes the output reverse attention feature. Here, $i \in \{0, 1, 2, 3\}$. The operation Sample$(\cdot)$ denotes the sampling operation corresponding to the input of $D_i$ to the module, as shown in Figure 1(a).
We employed spatial attention to extract polyp location information from the initially segmented feature maps. When a segmentation network locates polyp boundaries in the feature map, extreme binarization of polyp regions and other regions can easily lose boundary details. To address this issue, we put more weight on the initially located non-polyp regions in the spatial direction while reducing the weights of the polyp regions. The spatial attention module can capture rich boundary information of polyps and enhance the performance of polyp segmentation. Let the output of the spatial attention module be $S_i$; it can be computed from $R_i$ by
$$S_i = \mathrm{Conv}_2(\mathrm{ReLU}(\mathrm{Conv}_1(R_i))),$$
where Conv represents a convolution operation, and ReLU represents an activation function operation.
Because $S_i$ can capture detailed boundary information, we combine it with the feature map $D_i$ to enrich the boundary details of polyps in the initial predicted segmentation map and finally obtain the refined feature map $\hat{D}_i$.
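The effect of reverse attention on boundary regions can be seen with a few numbers. The sketch below follows the standard reverse attention formulation from [14], in which the weight is one minus the sigmoid of the coarse prediction; the logit values and helper names are illustrative assumptions, and the sampling step is omitted.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def reverse_attention(logits):
    """Map coarse segmentation logits (H x W) to reverse attention weights 1 - sigmoid(x)."""
    return [[1.0 - sigmoid(v) for v in row] for row in logits]

def apply_attention(feature, weights):
    """Element-wise product of one (H x W) feature channel with the attention weights."""
    return [[f * w for f, w in zip(f_row, w_row)] for f_row, w_row in zip(feature, weights)]

# One row with a confident polyp pixel, an uncertain boundary pixel, and confident background.
coarse_logits = [[6.0, 0.0, -6.0]]
weights = reverse_attention(coarse_logits)
feature = [[1.0, 1.0, 1.0]]
reweighted = apply_attention(feature, weights)
print([round(w, 4) for w in weights[0]])
```

Pixels already predicted as polyp with high confidence receive a weight close to zero, so the following layers concentrate on the uncertain boundary and the not-yet-detected regions.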

Loss function
We employed cross-entropy loss for the classification task, which is defined as follows:
$$L_{cls} = -\sum_{q=1}^{Q} y_q \log(p_q),$$
where $Q$ is the number of classes, $y_q$ represents the true label of class $q$, and $p_q$ represents the predicted label. For colonoscopy polyp segmentation, Dice loss is usually used. However, when the polyps are small, it can cause large changes in the network gradients. One way to mitigate this issue is to use binary cross-entropy (BCE) loss to guide Dice loss and keep the gradient reasonably small. Therefore, we combine the two losses for the segmentation task. These losses are defined as in [37]. In addition, we performed deep supervision on the outputs $D_i$ of the four decoders. Before calculating the deeply supervised losses, we up-sampled them to the same size as the ground truth (GT). Therefore, the total segmentation loss is
$$L_{seg} = L(G, Y) + \sum_{i=0}^{3} L(G, D_i^{up}),$$
where
$$L = L_{Dice} + L_{BCE},$$
$G$ denotes the GT, $D_i^{up}$ denotes the results after up-sampling to the original image size, which are used for deep supervision, and $Y$ denotes the final prediction result. The total loss for classification and segmentation is as follows:
$$L_{total} = L_{cls} + L_{seg}.$$
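As a rough illustration of how the loss terms fit together, the sketch below uses textbook forms of cross-entropy, BCE, and Dice loss on flattened lists. The paper defers the exact definitions to [37], so the smoothing constant, the clamping value `eps`, and the unweighted sums are assumptions of this sketch rather than the authors' settings.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy over flattened probability and label lists."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)

def dice_loss(pred, target, smooth=1.0):
    """Soft Dice loss; `smooth` (an assumption) avoids division by zero on empty masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + smooth) / (sum(pred) + sum(target) + smooth)

def seg_loss(pred, target):
    """Combined segmentation loss: Dice plus BCE."""
    return dice_loss(pred, target) + bce_loss(pred, target)

def total_seg_loss(final_pred, side_preds, target):
    """Deep supervision: loss of the final map plus the loss of every up-sampled side output."""
    return seg_loss(final_pred, target) + sum(seg_loss(s, target) for s in side_preds)

def cross_entropy(probs, one_hot, eps=1e-7):
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))

# Toy example: a 4-pixel mask, two side outputs, and a two-class prediction.
target = [1, 0, 1, 0]
final_pred = [0.9, 0.1, 0.8, 0.2]
side_preds = [[0.7, 0.3, 0.6, 0.4], [0.8, 0.2, 0.7, 0.3]]
loss_seg = total_seg_loss(final_pred, side_preds, target)
loss_cls = cross_entropy([0.5, 0.5], [1, 0])
loss_total = loss_cls + loss_seg
print(round(loss_cls, 4), round(loss_seg, 4))
```

A perfect prediction drives both segmentation terms to approximately zero, while the Dice term keeps small polyps from being dominated by the much larger background.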

Experiments and results
The proposed segmentation and classification network was implemented using the PyTorch framework on an NVIDIA RTX 3090 Ti GPU. We trained the model using the Adam optimizer with a momentum of 0.9 and a weight decay of 1e-4. The initial learning rate was set to 0.01 and then halved every 30 epochs. The batch size was set to 8, and the number of training epochs was set to 100. Our network was pre-trained on the ImageNet dataset to accelerate training. To evaluate the effectiveness of the proposed method, we conducted experiments on the segmentation and classification tasks using five publicly available datasets: KvasirSEG [38], CVC-ClinicDB [39], CVC-ColonDB [40], ETIS-LaribPolypDB [41], and CVC-EndoSceneStill [42]. In these datasets, images were annotated by a specialized endoscopist. We adopted the dataset division criterion of [9]: 900 images from KvasirSEG and 550 images from CVC-ClinicDB were used as the training set, while the remaining images from KvasirSEG and CVC-ClinicDB plus all the images from the other three datasets (ETIS, CVC-ColonDB, and CVC-300) were used as the test set. Many existing methods have used this criterion, so for a fair comparison we used it as well. Tables 1 and 2 show the dataset divisions for the two tasks, respectively.
Table 1. Division of the data set for the polyp segmentation task.
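The learning rate schedule described in the training setup (an initial rate of 0.01, halved every 30 epochs, over 100 epochs) can be written as a small function. The function name and the step-decay form are assumptions; the paper does not say which scheduler was used, although this behavior matches a step scheduler with a step size of 30 and a decay factor of 0.5.

```python
def learning_rate(epoch, base_lr=0.01, step=30, gamma=0.5):
    """Step decay: multiply the base rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Learning rate for each of the 100 training epochs (epochs counted from 0).
schedule = [learning_rate(e) for e in range(100)]
for start in (0, 30, 60, 90):
    print(start, schedule[start])
```

Under this schedule the rate takes four values during training: 0.01, 0.005, 0.0025, and 0.00125.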

Evaluation metric
For the classification task, we evaluated the performance using four metrics: specificity (Spe), recall (Rec), accuracy (ACC), and the area under the curve (AUC). According to [43], sessile polyps are at a higher risk of complications such as perforation or hemorrhage during the treatment process. Therefore, we considered the class of sessile polyps as positive samples, and the class of pedunculated polyps as negative samples. TP denotes the instances where the class of sessile polyps is correctly predicted. Spe, Rec, and ACC can then be represented by the following formulas, and AUC is the area under the receiver operating characteristic (ROC) curve:

$$Spe = \frac{TN}{TN + FP}, \qquad Rec = \frac{TP}{TP + FN}, \qquad ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
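As a small worked illustration of the classification metrics, the sketch below computes Spe, Rec, and ACC from confusion counts and computes AUC with the pairwise-ranking formulation. The function names and the example numbers are hypothetical and are not taken from the paper's experiments; sessile polyps are treated as the positive class (label 1), as described above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (specificity, recall, accuracy) from confusion counts."""
    spe = tn / (tn + fp)
    rec = tp / (tp + fn)
    acc = (tp + tn) / (tp + fp + fn + tn)
    return spe, rec, acc


def auc_score(scores, labels):
    """AUC as the probability that a randomly chosen positive sample
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical counts: 8 TP, 1 FP, 2 FN, 9 TN.
spe, rec, acc = classification_metrics(tp=8, fp=1, fn=2, tn=9)
```

With these hypothetical counts, Spe = 9/10, Rec = 8/10, and ACC = 17/20.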
For the segmentation task, we used three evaluation metrics: Dice similarity coefficient (DSC), which was used to evaluate the accuracy of the segmentation; intersection over union (IoU), which was used to assess the internal consistency of the segmented objects; and Hausdorff distance (HD), which was used as a similarity metric. These three metrics are defined as follows:

$$DSC = \frac{2TP}{2TP + FP + FN}, \qquad IoU = \frac{TP}{TP + FP + FN}$$

$$HD(A, B) = \max\left\{ \max_{a \in A} \min_{b \in B} d(a, b),\ \max_{b \in B} \min_{a \in A} d(a, b) \right\}$$

where $A$ and $B$ denote the point sets of the predicted and GT regions, $d(a, b)$ is the Euclidean distance between points $a$ and $b$, and TP, FP, FN, and TN represent true positive, false positive, false negative, and true negative, respectively. True positive means that the model correctly labels the pixels or regions in the image that belong to the polyps to be segmented. The DSC and IoU metrics have a range of [0, 1], with higher values representing better segmentation results. For the HD metric, on the other hand, lower values represent better results.
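The three segmentation metrics can be illustrated on a tiny pair of binary masks. This is a minimal sketch under stated assumptions: masks are lists of lists of 0/1, both masks are non-empty, and the Hausdorff distance is computed over all foreground pixel coordinates (implementations often use boundary pixels only). The function names and masks are hypothetical.

```python
import math


def confusion_counts(pred, gt):
    """Pixel-wise TP, FP, FN between two binary masks of equal shape."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1
            elif p and not g:
                fp += 1
            elif g and not p:
                fn += 1
    return tp, fp, fn


def dsc(pred, gt):
    tp, fp, fn = confusion_counts(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)


def iou(pred, gt):
    tp, fp, fn = confusion_counts(pred, gt)
    return tp / (tp + fp + fn)


def hausdorff(pred, gt):
    """Symmetric Hausdorff distance between foreground pixel sets."""
    a = [(r, c) for r, row in enumerate(pred) for c, v in enumerate(row) if v]
    b = [(r, c) for r, row in enumerate(gt) for c, v in enumerate(row) if v]

    def directed(x, y):
        # Largest distance from a point in x to its nearest point in y.
        return max(min(math.dist(p, q) for q in y) for p in x)

    return max(directed(a, b), directed(b, a))


gt_mask = [[0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
pred_mask = [[0, 1, 1, 1],
             [0, 1, 0, 0],
             [0, 0, 0, 0]]
```

For these masks there are 3 true-positive pixels, 1 false positive, and 1 false negative, giving DSC = 6/8 = 0.75 and IoU = 3/5 = 0.6, while the Hausdorff distance is 1 pixel.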

Segmentation results
For segmentation of colon polyps, we compared the proposed network with several state-of-the-art models. These comparison models include CNN networks that are widely used for segmentation tasks, namely Unet [10] and Unet++ [11]. We also compared the proposed network with networks that are specifically designed for polyp segmentation, including PraNet [14], EMS-Net [44], BDG-Net [45] and BSCA-Net [17]. In addition, we compared with transformer-based segmentation models, including TransUNet [30] and TransFuse [18]. Tables 3-7 present the results of the proposed model for each metric on each of the five test datasets. In addition to the standard evaluation metrics, we included three additional metrics inspired by PraNet to provide a more comprehensive evaluation of the model's performance. These additional metrics are a weighted F-measure ($F_\beta^w$) [46] that combines recall and precision, an S-measure ($S_\alpha$) [47] that evaluates the similarity between predicted and true values, and an E-measure ($E_\phi$) [48] that assesses similarity at both the pixel and global level. In the tables, the optimal results are highlighted in bold, while the second-best results are highlighted in blue font. Except for HD, higher values indicate better results.

Next, we visually compared the segmentation results obtained with different models. The visualization results clearly demonstrated the superiority of our model for polyp segmentation. Figure 6(a) is an image with easily identifiable polyps, and we found that most methods produced good segmentation results. However, our method exhibited slightly better performance in capturing finer details.
Figure 6(b) is an image with small and densely distributed polyps. Some methods, such as Unet, failed to identify the correct polyp regions. Others, like Unet++ and EMS-Net, identified only a few polyp regions and produced incorrect polyp locations. BDG-Net and BSCA-Net identified polyp regions in more locations than the other methods but still produced incorrect predictions. Our method effectively suppressed noise outside the regions of interest, accurately localized polyps and correctly identified most polyp regions. Figure 6(c) is an image with large and irregularly shaped polyps, and our method demonstrated strong scale adaptation. In contrast, all other methods were negatively affected to some extent and struggled to accurately segment the polyps. Figure 6(d)-(f) are images of polyps with blurred backgrounds whose borders are difficult to distinguish from the adjacent normal intestinal wall. Many methods struggled to correctly detect the edge region of the polyps. Unet and Unet++ performed the worst and almost failed to recognize the polyp regions. PraNet, TransUNet and EMS-Net had a small number of incorrectly detected regions and missed many target regions that resemble the background. BDG-Net and BSCA-Net performed slightly better but still had a small number of under-segmented regions. In contrast, our method excelled in the polyp edge region and accurately detected the polyp boundary, giving the best segmentation result.
In summary, the above observations demonstrate that our method outperformed other methods in capturing global context information and local detail information. It performed well on both the large polyps in Figure 6(c) and the small target polyps in Figure 6(b), and it achieved the best detection of edge regions.

Classification results
For polyp classification, we compared our proposed model with several powerful and effective classification methods based on our dataset classification criteria. These methods included Inceptionv3 [49], MobileNetv3 [50], DenseNet [51], ViT [28], ResNet-50 [52], EfficientNet [53], TransUNet [30], and FusionM4Net [54]. Table 8 presents the average classification results for each metric evaluated in our experiments using the five test datasets. When combining the results from all five datasets, our model achieved the optimal performance with AUC of 0.915, Spe of 0.901, Rec of 0.934, and ACC of 0.937. These results represent 3%, 2%, 3%, and 3% improvements, respectively, over the second-best results. This indicates that our model possesses strong learning and generalization capabilities. Upon reviewing Table 8, it is evident that the results achieved by other methods on polyp classification were not very satisfactory. For instance, the AUC of each classification model ranged from approximately 0.813 to 0.855. This suggests that existing classification methods were not effective for recognizing polyps. In particular, the ViT model and the other transformer-based model performed relatively poorly compared to the other models. This implies that transformer-based classification models struggled to process polyp images with distinct local features despite their advantage in extracting global feature information. However, according to Table 8, our model overcame the performance bottleneck of traditional classification models and achieved more accurate classification results by leveraging the strengths of both CNN and transformer architectures.

For a clearer understanding of the classification results, we provide a confusion matrix in Figure 7. The horizontal axis represents the predicted classes, which are sessile polyps and pedunculated polyps, while the vertical axis represents the actual classes. The figure displays, for each class, the number of polyp images that were misclassified as the other class. In the confusion matrix, we observed 36 instances where sessile polyps were misclassified as pedunculated polyps. On the other hand, there were 13 instances where pedunculated polyps were misclassified. Therefore, sessile polyps were more prone to misclassification than pedunculated polyps.

Ablation experiment
To demonstrate the effectiveness of our proposed model for polyp segmentation and classification, we selected three challenging datasets, KvasirSEG, CVC-ClinicDB and CVC-ColonDB, for ablation experiments. These experiments aimed to showcase the effectiveness of each individual module in our model. The results are presented in Tables 9-11. For the baseline model, we used only the transformer encoder (TE) and a simple U-Net decoder. Subsequently, we added the CNN branch (CB) to form a two-branch network. We then incorporated the FIM module and the RABE module into the network in sequence to assess the effectiveness of each module. The tables display the results obtained in the various cases. It is evident that the model's performance gradually improved as each module was added to the network. Specifically, the inclusion of CB helped in learning local information, resulting in 2%, 1%, and 1% improvements in the segmentation index (DSC) and 4%, 2%, and 2% improvements in the classification index (AUC) for the KvasirSEG, CVC-ClinicDB, and CVC-ColonDB datasets, respectively. The RABE module enhanced the model's performance by preserving edge detail information, and the FIM module aided learning by fusing information from the two branches. The data presented in the tables demonstrate that each module of our model is effective and contributes to an improvement in model performance.
Figure 8 visualizes the segmentation results of the ablation experiments under different settings of the dual-branch network. From left to right, new modules were added to the baseline dual-branch network one by one. The results clearly show that the localization of the polyps and the local segmentation quality gradually improved. After the information of the two branches was fused through FIM, the polyp information acquired by the network became richer, which benefited the segmentation of fine details and of small targets. In addition, with the addition of the RABE module, the network's ability to detect the boundary region of polyps improved, and the polyp edge segmentation became more accurate. For the classification task, the first and fourth rows show pedunculated polyps, which the model classified correctly in every setting. In the second and third rows, several intermediate settings produced incorrect class predictions, but with the addition of our proposed modules, the learning effect of the network improved considerably, and the correct classification results were finally obtained. The last row is a typical example of misclassification of polyps.

Time complexity and efficiency analysis
The computational complexity and efficiency of deep learning models are crucial indicators for evaluating their prospects in clinical applications. The number of parameters (Param) and the number of floating-point operations (FLOPs) serve as metrics for computational complexity, while frames per second (FPS) is a measure of inference efficiency. Smaller values for Param and FLOPs indicate lower computational and time complexity, while higher FPS values indicate a faster model. Table 12 displays the values of Param, FLOPs, and FPS for each comparison method. As observed in the table, the Param of our method was approximately 42.4 M, which was lower than that of Unet++ [11] and BSCA-Net [17]. Furthermore, in terms of FPS, our model was notably efficient, surpassed only by EMS-Net [44] and BSCA-Net [17]. Additionally, our model achieved higher polyp detection accuracy, offering a commendable trade-off between efficiency, time complexity, and accuracy.
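To make the three quantities concrete, the sketch below counts the parameters and FLOPs of a single 2D convolution layer and measures FPS for an arbitrary inference callable. It is an illustration under stated assumptions rather than the procedure used for Table 12: the function names are hypothetical, one multiply-accumulate is counted as two FLOPs (conventions differ), and the timed callable here is a stand-in for a real forward pass.

```python
import time


def conv2d_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters in a k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs of a k x k convolution producing an h_out x w_out map.

    Assumes 1 multiply-accumulate = 2 FLOPs and ignores the bias add.
    """
    macs = k * k * c_in * c_out * h_out * w_out
    return 2 * macs


def measure_fps(infer, n_frames=50):
    """Average frames per second of calling `infer` n_frames times."""
    start = time.perf_counter()
    for _ in range(n_frames):
        infer()
    elapsed = time.perf_counter() - start
    return n_frames / elapsed


# A 3x3 convolution from 3 to 64 channels on a 224x224 output map.
params = conv2d_params(3, 64, 3)
flops = conv2d_flops(3, 64, 3, 224, 224)
# Stand-in workload in place of a real model forward pass.
fps = measure_fps(lambda: sum(range(1000)), n_frames=20)
```

Here `params` is 1,792 and `flops` is 173,408,256. For a full network, the per-layer values are summed, and FPS is usually measured after a few warm-up passes on the target hardware.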

Conclusions and discussion
Polyp segmentation and classification have important applications in the diagnosis of colorectal cancers. Traditional image segmentation methods [55,56] and traditional image classification methods [57,58] offer limited accuracy. Thus, we proposed a multi-task network for polyp segmentation and classification that can better handle randomly located polyps with varying sizes and confusing edges and backgrounds. The main structure of the network is a dual-branch encoder, which employs a CNN and a transformer as its two branches. We also designed several modules to make the model more effective. One is the feature interaction module (FIM), which aims to eliminate the semantic gap between the two branches and to better fuse the information obtained by the dual encoder while retaining the information of each branch. Another is the RABE module, which helps the model extract boundary information and enhances the segmentation performance, particularly for small targets and images with fuzzy boundaries.
We performed experiments on five public datasets. Experimental results show that the multi-task network proposed in this paper has high segmentation and classification accuracy and good reliability. However, the network still has potential for improvement in two key areas. First, small polyps occupy a small target area that is difficult to locate and capture accurately, resulting in inaccurate segmentation. Second, sessile polyps do not have obvious pedicle features and are not easy to identify, resulting in misclassification. Another direction for future work is to improve the inference speed of our network on devices with low computational power. In our future research, we will focus on addressing these challenges by optimizing the network architecture and reducing redundant parameters. By doing so, we aim to better meet the requirements of high-precision and real-time clinical applications.

Figure 2. (a) Overall structure of the network. (b) Classification block, where the blue blocks represent the fully connected layers, the green block represents the Normalization operation, and the yellow and gray blocks represent Dropout and sigmoid, respectively. (c) Transformer block.

Figure 6. Visualization results for each comparison method, where the green line represents GT, and the blue line represents the segmentation result.

Figure 7. Confusion matrix for classification results.

Figure 8. Visualization results of a multi-task ablation study, where the green line represents GT, and the blue line represents the segmentation result.

Table 3. Segmentation results on the KvasirSEG dataset.

Table 8. Average classification results for five datasets.

Table 9. Ablation experiments on the KvasirSEG dataset.

Table 12. Time complexity and efficiency analysis of each method.