AGCNet: A Precise Adaptive Global Context Network for Real-Time Colonoscopy

Colonoscopy is the gold standard for detecting colorectal polyps and colorectal cancer. Polyps are a major predisposing factor for colorectal cancer, and their precise diagnosis during colonoscopy depends heavily on the skill of the physician. With the development of deep learning, semantic segmentation methods have recently been applied to polyp detection, but they still suffer from insufficient accuracy and segmentation speed. To this end, we propose a precise adaptive global context network (AGCNet) for real-time colonoscopy. First, to handle the large scale variation of polyps, we designed a multi-scale semantic fusion module (MSFM), which enhances representation capability by using a variety of filters to collect contextual information at different scales, adapting the network to polyps of widely varying sizes, especially smaller ones. In addition, modelling long-range dependency directly over spatial pixels tends to introduce more background noise and increase computation. We therefore designed a context-aware pyramid aggregation module (CPAM) with a novel dual attention mechanism: CPAM aggregates feature information across different regions to boost the network's ability to exploit global context, while the dual attention reinforces the features of important regions and efficiently suppresses those of non-important regions. CPAM also applies multi-level pooling to the input features to extract multi-scale contextual information from the image and uses attention to selectively highlight the image regions most relevant to the segmentation task, fusing the pooled features with the attention maps to produce enhanced feature representations that capture both global and local information.
AGCNet thereby achieves precise polyp segmentation while remaining real-time. We conducted extensive experiments on the Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB and ETIS-LaribPolypDB datasets. Specifically, AGCNet achieved an IoU of 87.40% and a Dice score of 92.63% on the Kvasir-SEG dataset, producing accurate segmentation results faster than many current state-of-the-art models.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Yongqiang Cheng.)

Early diagnosis of colorectal cancer (CRC) improves patients' survival rate. Most CRCs start as adenomatous polyps: surface protrusions on the lining of the colon and rectum. Over time, they grow into malignant tumours and spread to surrounding organs. The survival rate drops from 95% in the first stage to 35% in advanced stages [1]. Early screening and removal of polyps can increase the survival rate. Colonoscopy is the standard screening method for CRC, but even so, accurate identification of polyps remains challenging due to (1) the large variation in scale between polyps (Fig.1); (2) the blurring of polyp border information (Fig.1(c)-(d));
(3) the low contrast between polyps and the gastrointestinal tract;
(4) the endoscopist's skill [2], [3]. The adenoma detection rate (ADR) measures physician quality and indicates the percentage of polyps diagnosed in patients after a complete colonoscopy. ADR varies from 7% to 53%, yet statistically there is a 25% probability that a polyp will be missed in each patient's diagnosis [4]. It is worth mentioning that there is a causal relationship between ADR and reduced CRC mortality: according to the data, a 1% increase in ADR is associated with a 3% decrease in interval cancer [5]. The main factors affecting ADR are cecal intubation rates, withdrawal times, and the quality of bowel preparation, all of which depend on human intervention by endoscopists. However, this reliance on human skill brings great uncertainty and unreliability to diagnostic results, so there is an urgent need for an automated polyp segmentation method to reduce misdiagnosis caused by human factors.
Polyp segmentation during colonoscopy screening can help prevent colorectal cancer (CRC), which mainly develops from polyps. Polyps are surface protrusions on the lining of the colon and rectum that can grow into malignant tumours and spread to surrounding organs. Segmentation is challenging because some polyps are flat, with low contrast at the mucosal border [6]. Reducing the rate of missed detections during colonoscopy therefore requires the expertise of endoscopists. Most CRC patients (91% to 94%) do not undergo endoscopy before the disease develops; the others (6% to 9%) undergo endoscopy but have flat polyps misdiagnosed [7]. Even with timely colonoscopy, the leading cause of cancer is the high rate of misdiagnosis of flat polyps, which are therefore also the focus of segmentation.
Subsequent efforts have been made to develop appropriate methods to address the many challenges of polyp segmentation in colonoscopy. Previous studies used hand-crafted methods [8], [9] to train classifier models based on shape, color, texture and appearance. These methods could not segment heterogeneous polyps well, resulting in low accuracy and performance [10]. Later studies used deep learning methods to extract features automatically [11]. Although these approaches improve on traditional manual segmentation, using bounding boxes alone can only yield image-wise results that cannot distinguish the boundary between the polyp and the mucosa [12]. FCN was proposed to segment polyps using pre-trained weights and produce pixel-wise results. However, this method lost spatial information at low resolutions and did not calibrate deep semantic information with shallow information. Inspired by FCN, UNet, with its encoder-decoder U-shaped architecture, was proposed [13], [14] and has been widely used in medical image segmentation since its introduction. However, the UNet model had limitations, such as generating redundant information and increasing computation with traditional convolutions, which left room for further performance improvements.
Since the encoding process of UNet directly uses pooling operations to compress the resolution, it easily loses some spatial information. In addition, in the skip architecture, directly concatenating encoder feature maps into the decoder easily increases the amount of redundant information. MRUNet provides a multi-scale residual scheme: replacing the traditional convolutions of the UNet encoder and decoder with multi-parallel, multi-scale convolutions effectively reduces the semantic information gap, while the skip connections are replaced by residual models. However, MRUNet only models the skip architecture by pairing the encoding feature maps of a given level with the corresponding decoder to reduce their semantic gap. UNet++ takes into account the semantic differences between the encoding and decoding layers and designs a series of nested, dense skip paths in the skip architecture, allowing the decoding layers to exploit richer contextual information from the encoding layers [15]. However, this approach introduces more complex computation and makes optimization and backpropagation more difficult [16]. To further enhance context extraction and computational efficiency, CPFNet abandons single-stage context modelling and instead uses a double pyramid module to extract global context information [17]. Besides, PraNet applies region and boundary cues in a parallel reverse attention mechanism that corrects some misaligned predictions. Considering the uncertain regions among the salient features of polyp segmentation, UACANet proposed an uncertainty-enhanced contextual attention model [18]. However, these methods cannot bridge the semantic information gap between different levels and exploit global information to achieve excellent segmentation while maintaining real-time performance.
In this paper, we propose an adaptive global context architecture named AGCNet, equipped with two new approaches, multi-scale semantic extraction and dual attention, that address the current challenges of segmenting polyps in colonoscopy video. Compared with existing methods, AGCNet models long-range dependency using contextual information at different scales, without introducing extra computational cost, giving the network confidence in identifying polyps with large variations in scale and shape. In addition, a novel dual attention mechanism is proposed to effectively suppress background noise without employing sophisticated non-local modelling techniques. The contributions of this work can be summarized as follows:
• We propose a novel MSFM module to enhance the network's multi-scale representation at a finer granularity level and aggregate multi-scale contextual information to model long-range global dependency, making the network self-adaptive to scale-variant polyps.
• To address the interference of background noise in the gastrointestinal tract, we further designed the CPAM module, which extracts more discriminative features by suppressing irrelevant information through a dual-channel attention approach.
• Extensive experimental studies on four publicly available datasets confirm that AGCNet produces more competitive results than other state-of-the-art network models.

II. METHOD
Colorectal endoscopic images from four publicly available datasets, Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB and ETIS-LaribPolypDB, are preprocessed and passed through our AGCNet for feature extraction and segmentation of the polyp region. The whole network can be divided into a classical symmetric encoding region and decoding region. The network framework is shown in Fig.2; it employs two novel and effectively validated components, the Multi-scale Semantic Fusion Module (MSFM) and the Context-aware Pyramid Aggregation Module (CPAM). To capture multi-scale contextual information, we use the proposed MSFM in the encoding part of each layer, where features are processed by multi-scale convolutions and then passed through a residual unit to obtain more discriminative representations that fit polyps at different scales. At the bottom of the network, to mitigate the interference of background noise on the high-level semantic information of polyps, we further employ the CPAM module, which enhances the target region and weakens the background region by means of pyramid aggregation and a dual-channel attention mechanism that exploits global contextual information. The operation of the different modules in AGCNet is discussed in detail in the following subsections.

A. MULTI-SCALE SEMANTIC FUSION MODULE
Due to the diversity of polyps and their different scales, the network tends to lose boundary information during downsampling, making it unable to identify variable polyps accurately. If the contextual information carried by shallow representations is used appropriately, it helps the network identify polyps of different shapes and scales. Inspired by Res2Net [19], we propose a Multi-scale Semantic Fusion Module (MSFM). Most existing methods use input feature maps at different resolutions to improve multi-scale representation ability, but reducing the resolution easily loses fine-grained boundary information. We instead use convolution kernels at different scales to extract features at a finer granularity level, increasing the network's receptive field while maintaining its multi-scale representation ability, and then concatenate to produce the final feature map. As shown in Fig.3, we first transform the input feature X ∈ R^{C×H×W} by a 1 × 1 convolution, Batch Normalization (BN) and a ReLU activation. Keeping the original input scale constant, we obtain a new feature map X′ ∈ R^{C×H×W}, where C, H and W denote the number of channels, height and width of the feature map, respectively. We then split X′ along the channel dimension into four feature maps with an equal number of channels, X″ = [X_0, X_1, X_2, X_3] ∈ R^{(C/4)×H×W}, where X_1, X_2 and X_3 are transformed via W_2(·); note that W_2(·) consists of a 3 × 3 convolution and BN. We concatenate X_0 with the transformed maps W_2(X_1), W_2(X_2) and W_2(X_3) along the channel dimension in turn, as shown in the following equation:

X_cat = Concat[X_0, W_2(X_1), W_2(X_2), W_2(X_3)]
From the above equation, we obtain contextual information at different scales from the different convolutional kernels, thus increasing the receptive field of the layer. Finally, we fuse the resulting feature output with the original input features via a residual connection, as defined by the following equation:

Y = W_3(X_cat) ⊕ X
where ⊕ represents the pixel-level addition operation and W_3(·) represents a 1 × 1 convolution followed by BN and a ReLU nonlinear activation. To summarize the advantages of MSFM: first, unlike methods that vary the resolution of the feature map to enhance the representational power of the network, MSFM enhances representational power at a finer granularity level. Designing different convolutional kernels for feature extraction and then concatenating allows the network to obtain multi-scale contextual information, increasing its receptive field. Finally, the residual connection models the long-range dependence between the fused features and the original input image, allowing the network to retain sufficient spatial detail information.
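As a concrete illustration, the MSFM computation described above can be sketched in PyTorch. This is our own minimal reading of the module, not the authors' reference implementation; the class and variable names are ours, and details such as per-branch kernel sizes may differ from the original.

```python
import torch
import torch.nn as nn


class MSFM(nn.Module):
    """Sketch of the Multi-scale Semantic Fusion Module:
    1x1 CBR -> split into 4 channel groups -> 3x3 conv+BN on three groups
    -> concatenate -> 1x1 CBR (W_3) -> residual addition with the input."""

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        c4 = channels // 4
        # Initial 1x1 convolution + BN + ReLU producing X'
        self.w1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # W_2: 3x3 convolution + BN applied to three of the four splits
        self.w2 = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c4, c4, 3, padding=1, bias=False),
                          nn.BatchNorm2d(c4))
            for _ in range(3)])
        # W_3: 1x1 convolution + BN + ReLU fusing the concatenated splits
        self.w3 = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, x):
        xp = self.w1(x)
        x0, x1, x2, x3 = torch.chunk(xp, 4, dim=1)
        # X_cat = Concat[X_0, W_2(X_1), W_2(X_2), W_2(X_3)]
        x_cat = torch.cat([x0, self.w2[0](x1), self.w2[1](x2), self.w2[2](x3)], dim=1)
        return self.w3(x_cat) + x  # residual fusion (the ⊕ in the text)
```

The output keeps the input shape, so the module can be dropped into any encoder stage.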

B. CONTEXT-AWARE PYRAMID AGGREGATION MODULE
In order to reduce the misjudgment of foreground and background information, we need to exploit a wider range of contextual information, which requires not only modelling the global context to capture long-range dependencies more efficiently but also deepening the network to guide it to focus on the region of interest. Current state-of-the-art approaches model complex long-range dependencies mainly by correlating pixels or channels, which not only increases the computational effort but also introduces some unavoidable background noise, reducing the segmentation accuracy of the network. Inspired by ECA-Net [20] and PSPNet [21], we propose a Context-aware Pyramid Aggregation Module (CPAM), which adopts a more efficient form of context modelling to establish long-range dependencies and effectively enhances cross-channel interactions. The procedure is divided into a Context-Aware Fusion operation and an Attention Calibration operation.
The operation of Context-Aware Fusion is shown in Fig.4. Given an input feature map D ∈ R^{C×H×W}, we first extract four feature maps D_0 ∈ R^{C×6×6}, D_1 ∈ R^{C×3×3}, D_2 ∈ R^{C×2×2} and D_3 ∈ R^{C×1×1} at different resolutions, with the number of channels unchanged, using multi-scale pooling operations (MPA), and then reduce the dimensionality of these four feature maps by CBR:

D′_i = β_i · Up(CBR(D_i)), i = 0, 1, 2, 3
where D′_i ∈ R^{(C/4)×H×W}, i = 0, 1, 2, 3, Up(·) denotes bilinear interpolation up-sampling, and β_i is a correlation coefficient. Through the context-aware fusion operation, we obtain rich contextual information sufficient for subsequent long-range dependency modelling, which enhances feature differentiation. We then concatenate the obtained feature maps along the channel dimension to form D_Cat.
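A minimal PyTorch sketch of this fusion step, under our own assumptions: the coefficients β_i are taken as 1, and we assume the original map D is concatenated alongside the four up-sampled pyramid branches (the text does not state this explicitly; PSPNet-style modules commonly do so).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAwareFusion(nn.Module):
    """Sketch of CPAM's Context-Aware Fusion: adaptive average pooling to
    6x6/3x3/2x2/1x1, a CBR channel reduction to C/4, bilinear up-sampling,
    then channel-wise concatenation with the input."""

    def __init__(self, channels: int):
        super().__init__()
        self.bins = (6, 3, 2, 1)
        c4 = channels // 4
        self.cbr = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, c4, 1, bias=False),
                          nn.BatchNorm2d(c4), nn.ReLU(inplace=True))
            for _ in self.bins])

    def forward(self, d):
        h, w = d.shape[-2:]
        feats = [d]  # assumption: keep the original map in the concat
        for bin_size, cbr in zip(self.bins, self.cbr):
            p = F.adaptive_avg_pool2d(d, bin_size)       # MPA pooling branch
            p = cbr(p)                                   # reduce to C/4 channels
            p = F.interpolate(p, size=(h, w), mode="bilinear",
                              align_corners=False)       # Up(.)
            feats.append(p)                              # beta_i taken as 1
        return torch.cat(feats, dim=1)                   # D_Cat, 2C channels here
```

With the original map included, the output has 2C channels (C from D plus four C/4 branches).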
The Attention Calibration operation is designed to further enhance the network's representation capability. In this part, dual attention mechanisms are designed to model long-range dependency, as shown in Fig.5. In the spatial attention mechanism, we first use a convolution to reduce the channel dimension, obtain the attention weights through a Sigmoid activation function, and then apply attention matrix multiplication to reweight the original input feature map, as follows:

D_s = α · (σ(S_0(D_Cat)) ⊗ D_Cat)
where ⊗ is the attention matrix multiplication, σ(·) is the Sigmoid activation function that produces the attention weight map, S_0(·) is a 1 × 1 convolution operation, and α is the correlation coefficient of S_0. In addition, to explicitly model the relationship between context and channels, the other branch uses an efficient channel attention mechanism, expressed as follows:

D_c = θ · (σ(F_Adaptive(G(D_Cat))) ⊗ D_Cat)
where F_Adaptive(·) enables local cross-channel information interaction, i.e., it determines how many neighbouring channels participate in predicting a channel's attention, and adapts its kernel size accordingly. G(·) denotes global average pooling, which generates channel-wise statistics, and θ is the correlation coefficient of G(·). Finally, pixel-level multiplication is used to recalibrate the channel weights of D_Cat. We then aggregate the feature maps produced by the spatial attention branch (D_s) and the channel attention branch (D_c):

D_out = D_s ⊕ D_c
where ⊕ denotes the pixel-level addition operation that achieves feature fusion. Our proposed multi-scale context module, based on a dual-channel attention mechanism, does not need to build complex pixel-level and channel-level long-distance dependencies, yet it is equally capable of capturing different contextual information and obtains the same ability to suppress background noise. In addition, it enhances the model's discriminative power across different features, enabling precise segmentation between the background and foreground information of polyps.
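The dual attention calibration above can be sketched as follows. This is our hypothetical reading: the spatial branch is a 1 × 1 convolution to a single-channel sigmoid map, the channel branch follows the ECA-style adaptive-kernel 1D convolution over globally pooled statistics, and the coefficients α and θ are taken as 1.

```python
import math
import torch
import torch.nn as nn


class DualAttention(nn.Module):
    """Sketch of CPAM's Attention Calibration: a spatial-attention branch
    (1x1 conv -> sigmoid map -> reweight) added to an ECA-style
    channel-attention branch (GAP -> adaptive-kernel 1D conv -> sigmoid)."""

    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)  # S_0(.)
        # ECA-style adaptive kernel size: grows with the channel count
        k = int(abs((math.log2(channels) + b) / gamma))
        k = k if k % 2 else k + 1
        self.eca = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, d):
        # Spatial branch: per-pixel sigmoid weights reweight every channel
        d_s = torch.sigmoid(self.spatial(d)) * d
        # Channel branch: G(.) global average pooling -> 1D conv -> sigmoid
        y = d.mean(dim=(2, 3))                               # (B, C)
        w = torch.sigmoid(self.eca(y.unsqueeze(1)).squeeze(1))
        d_c = w.unsqueeze(-1).unsqueeze(-1) * d              # recalibrate channels
        return d_s + d_c                                     # D_out = D_s ⊕ D_c
```

Both branches preserve the input shape, so their pixel-level sum is well defined.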
59006 VOLUME 11, 2023 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

III. EXPERIMENT

A. LOSS FUNCTION
The loss function is essential for polyp image segmentation in colorectal endoscopic scenes. Since there is a severe imbalance between foreground and background information in polyp images, choosing an appropriate loss function helps suppress background noise and accelerates the convergence of the network. The two mainstream loss functions are the binary cross-entropy loss and the Dice loss, formulated as follows [22], [23]:

L_BCE = −(1 / (H·W)) · Σ_i [p_i · log(p̂_i) + (1 − p_i) · log(1 − p̂_i)]
L_Dice = 1 − (2 · Σ_i p_i · p̂_i + α) / (Σ_i p_i + Σ_i p̂_i + α)
where p_i and p̂_i denote the label value and the predicted value of the polyp region at pixel i = (h, w), respectively, and α represents the Laplace smoothing factor used to accelerate the convergence of the network; we set it to 1e-8. We also set up ablation experiments to find the best loss function for training our network model. Table 1 shows the segmentation results under three different loss functions. The optimal segmentation results are achieved with the combined loss, mainly because of the varying scales of polyps and the large amount of background noise under the colorectal endoscope. When a single loss function is used for optimization, the gradient of the polyp region is affected by the gradients of the background regions, increasing the difficulty of network training and reducing the accuracy of the final results. Combining the two loss functions allows targeted learning and optimization of the polyp regions during backpropagation, alleviating the severe imbalance between foreground and background information. Finally, we obtain the final loss function:

L_total = λ_1 · L_BCE + λ_2 · L_Dice
where λ_1 and λ_2 denote the weight coefficients of the two loss terms; we set them to 0.6 and 0.4, respectively.

B. DATASET AND EVALUATION

This work uses four publicly available polyp datasets: Kvasir-SEG, CVC-ClinicDB, CVC-ColonDB and ETIS-LaribPolypDB. CVC-ColonDB has a resolution of 574 × 500 and was extracted from 15 video sequences, each containing one polyp. Across the datasets, image resolutions range from 332 × 482 to 1920 × 1072; in our experiments, all images are resized to a unified resolution of 512 × 512. Following PraNet's [6] dataset setup, we use 900 and 550 images for training from Kvasir-SEG and CVC-ClinicDB, respectively, keeping 100 and 62 images for testing. To effectively validate the model, we also use datasets developed by different medical centers, CVC-ColonDB and ETIS-Larib, to test its generalization ability. To validate the performance of AGCNet from multiple perspectives, we use four main evaluation metrics that are widely used in medical image segmentation: Recall, Precision, the Dice score and the Jaccard similarity coefficient, defined as follows:

Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
Dice = 2·|A ∩ B| / (|A| + |B|)
Jaccard = |A ∩ B| / |A ∪ B|
TP and TN represent true positives and true negatives, indicating the network's ability to correctly segment the foreground and background pixels of the polyp. Similarly, FP and FN are false positives and false negatives, representing misclassified foreground and background pixels. A represents the set of predicted polyp pixels and B refers to the set of ground-truth polyp label pixels. Because Recall and Precision focus only on the distribution of independent pixels, which may bias the final evaluation, we additionally report the area under the curve (AUC) beyond the four main metrics to enable a more complete evaluation. GFLOPs denotes giga floating-point operations, which we use as a measure of a model's computational cost. Parameters(M) is the number of parameters the model learns during training, measured in millions. Finally, considering the inherent need for real-time performance in the colorectal endoscopy scenario, we report FPS (Frames Per Second) and execution time to evaluate real-time performance against clinical needs.
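The four pixel-wise metrics follow the standard confusion-matrix definitions; a small self-contained sketch (ours, for illustration) over flattened binary masks:

```python
# Compute TP/FP/FN/TN over flattened 0/1 prediction and ground-truth masks,
# then derive Recall, Precision, Dice, and Jaccard from them.
def confusion(pred, gt):
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    tn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 0)
    return tp, fp, fn, tn


def recall(pred, gt):
    tp, fp, fn, tn = confusion(pred, gt)
    return tp / (tp + fn)


def precision(pred, gt):
    tp, fp, fn, tn = confusion(pred, gt)
    return tp / (tp + fp)


def dice(pred, gt):
    # Equivalent to 2|A ∩ B| / (|A| + |B|)
    tp, fp, fn, tn = confusion(pred, gt)
    return 2 * tp / (2 * tp + fp + fn)


def jaccard(pred, gt):
    # Equivalent to |A ∩ B| / |A ∪ B|
    tp, fp, fn, tn = confusion(pred, gt)
    return tp / (tp + fp + fn)
```

For example, with pred = [1, 1, 0, 0] and gt = [1, 0, 1, 0], each cell of the confusion matrix is 1, giving Recall = Precision = Dice = 0.5 and Jaccard = 1/3.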

1) IMPLEMENTATION DETAILS
The choice of hyperparameters is of great importance for network performance. For the two main current optimizers, SGD [26] and Adam [27], we conducted comparative experiments as shown in Table 2; the SGD optimizer proved more suitable for AGCNet, and we set the initial learning rate to 1e-3. We set the batch size to 4 and ran the experiments on an NVIDIA RTX 3090 24GB graphics card. We also used several data augmentations: RandomRotate, HorizontalFlip and RandomBrightnessContrast. The network becomes stable after about 80 epochs.
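The training configuration above can be summarized in a short PyTorch sketch. The model here is a trivial stand-in (AGCNet itself is not reproduced), and the combined loss uses the λ_1 = 0.6, λ_2 = 0.4 weighting and Laplace smoothing α = 1e-8 reported in Section III-A; everything else (SGD, lr = 1e-3, batch size 4) follows the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def combined_loss(logits, target, lam1=0.6, lam2=0.4, alpha=1e-8):
    """Weighted sum of binary cross-entropy and Dice loss;
    alpha is the Laplace smoothing factor."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1.0 - (2.0 * inter + alpha) / (prob.sum() + target.sum() + alpha)
    return lam1 * bce + lam2 * dice


# Stand-in model and the optimizer settings from the text: SGD with lr = 1e-3.
model = nn.Conv2d(3, 1, kernel_size=1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# One training step on a batch of size 4 of random stand-in data.
images = torch.randn(4, 3, 64, 64)
masks = torch.randint(0, 2, (4, 1, 64, 64)).float()

optimizer.zero_grad()
loss = combined_loss(model(images), masks)
loss.backward()
optimizer.step()
```

In practice the stand-in convolution would be replaced by the full AGCNet and the random tensors by the augmented dataset batches.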

2) ABLATION STUDIES
To demonstrate the performance of AGCNet, this paper conducts ablation experiments on the Kvasir-SEG dataset to evaluate the effectiveness of the MSFM and CPAM modules. As shown in Fig.6, our ablation experiments compare some representative visually challenging cases. As shown in the third column of Fig.6, the baseline model shows a mismatch of semantic information. After MSFM extracts the contextual information of multi-scale polyps and calibrates the semantic information of the encoding layer, we obtain clearer polyp segmentation images, as shown in the fourth column of Fig.6. In addition, in the comparison on the Kvasir-SEG dataset shown in Table 3, the Baseline+MSFM approach achieves 77.69%, 82.06% and 94.88% for IoU, Dice and Precision, respectively, ahead of the Baseline by 7.45%, 3.99% and 5.37%. This result also means that MSFM can use rich semantic information to adapt the network to multi-scale polyps at a finer granularity level. CPAM suppresses background noise in deep semantic information through a hybrid attention mechanism, which effectively enhances the model's ability to discriminate between the target region and background tissue. Compared with the Baseline, Baseline+CPAM better suppresses background noise and obtains better polyp segmentation results, as shown in the fifth column of Fig.6. The advantage of CPAM in processing polyp images with low contrast between foreground and background can also be seen in Table 3, where CPAM achieves 71.53%, 79.25% and 89.66% for IoU, Dice and Precision, respectively. Finally, we seamlessly integrate MSFM and CPAM into AGCNet. In this way, we not only obtain feature information at a finer granularity level in the encoder and a wider range of contextual information through multi-level feature fusion, but also strengthen the feature information of important regions and weaken the interference of background noise before decoding starts.
AGCNet not only allows richer contextual information transfer between two adjacent layers in the coding layer but also enhances the ability of features to suppress the background noise of high-dimensional semantic information, as shown in the sixth column of Fig.6.

C. COMPARISON WITH STATE-OF-THE-ARTS
To further validate the segmentation performance of the proposed AGCNet, we compared it with several current state-of-the-art models, including U-Net [14], UNet++ [16], MRUNet [15], PraNet [6], CPFNet [17], and UACANet [28]. For a fairer comparison, all models were run in the same computational environment with the same data augmentation. It is worth noting that all models were trained from scratch without loading pre-trained weights. The main problems faced by current polyp segmentation can be seen in the visualizations of the different competitors on challenging cases (Fig.7 to Fig.10): the irregular, variable scale of polyp shapes and excessive background noise interfere with the model's sensitivity to foreground information. Simply stacking network depth with simple convolution and pooling operations, as in UNet, cannot cope with these challenging cases. MRUNet utilizes multi-scale and multi-parallel convolutions to obtain richer contextual information and improves polyp segmentation results. By optimizing architectural connections to reduce the difference in semantic information between the encoding and decoding layers, UNet++ obtains more accurate segmentation results than UNet. Similarly, PraNet effectively addresses the blurring between the target region and the background tissue with a parallel reverse attention mechanism that makes the boundaries of the polyp region more sensitive. CPFNet progressively develops and incorporates rich contextual information by modelling a global pyramid guide module and also obtains better segmentation results. UACANet constructs an improved version of the UNet architecture, using enhanced contextual information to capture salient features that are easily overlooked.
The methods described above nevertheless remain deficient in polyp segmentation because they do not fully exploit multi-scale contextual information during feature extraction, which tends to weaken the features' ability to discriminate the target region from the gut tissue. Our proposed AGCNet, with its MSFM and CPAM modules, addresses these problems and surpasses the above-mentioned approaches.
As shown in the ninth column of Fig.8, AGCNet produces segmentation results that are closest to the ground truth and can effectively cope with multi-scale polyps and suppress background noise. Since the polyp segmentation task has an inherent need for real-time performance, AGCNet combines accurate segmentation with guaranteed real-time performance.

D. RESULTS ON CROSS-DATASET
In addition, we conducted a statistical comparison for better quantitative analysis by collecting metric data for IoU, Dice, Precision, Recall, AUC, FPS, GFLOPs, Parameters(M) and Time(ms). As described above, we used the two datasets Kvasir-SEG and CVC-ClinicDB to train all the models. Since different polyp datasets have different feature distributions, a model needs good generalization ability to obtain excellent segmentation results. In this part, we use the Kvasir-SEG and CVC-ClinicDB datasets for testing and introduce the ETIS-LaribPolypDB and CVC-ColonDB datasets to test the generalization ability of the model. Table 4 and Table 5 show the comparisons of quantitative results on the Kvasir-SEG and CVC-ClinicDB datasets. As shown in Table 4, our proposed AGCNet achieves excellent performance on the Kvasir-SEG dataset: it outperforms the second-best UACANet with an IoU of 87.40% and a Dice of 92.63%, leading by 5.01% and 6.19%, respectively. It is worth mentioning that our model surpasses the benchmark model UNet by a large margin, achieving 17.16% and 14.56% higher IoU and Dice, respectively. The SOTA model PraNet achieves the highest FPS of 28 and the shortest execution time of 166.09 ms, but its Dice and Jaccard scores are unsatisfactory. In Table 5, our proposed AGCNet achieves scores of 83.82%, 86.95%, 93.23%, and 88.93% on IoU, Dice, Precision, and Recall, respectively, surpassing all state-of-the-art models reported in Table 5. The extensive experiments described above verify the validity of AGCNet and its ability to cope with the main problems faced by current polyp segmentation: the irregular shape and multiple scales of polyps, the slight difference between foreground and background information, and the inherent real-time needs of clinical applications.
We also test the model's generalization ability across different datasets. In Table 6, on the ETIS-LaribPolypDB dataset, our proposed AGCNet outperforms the other models in the table, achieving 80.55%, 74.95% and 86.25% on IoU, Dice and Recall, respectively. As shown in Table 7, PraNet and UACANet both lead U-Net by 11% to 16% on average in the IoU and Dice metrics; however, our proposed AGCNet is still ahead of them in IoU and Dice. These comparison results show that AGCNet maintains high accuracy while ensuring strong generalization ability.
The superiority of AGCNet over current state-of-the-art models originates from its two submodules, MSFM and CPAM, each fulfilling a distinct purpose. Specifically, MSFM enhances the representation of target features at a finer granularity level, while CPAM suppresses background noise in the deep network and highlights polyp features through a dual attention mechanism. AGCNet has the following advantages: (1) multi-level pooling: AGCNet performs multi-level pooling on the input features to extract multi-scale contextual information from the image; (2) dual attention: AGCNet utilizes two integrated attention mechanisms to selectively focus on the image regions most relevant to the segmentation task; (3) feature fusion: AGCNet fuses the multi-level pooled features with the attention maps to generate enhanced feature representations that capture global and local information. Overall, AGCNet first obtains richer semantic information at a finer level of granularity, and then strengthens the targets while weakening the background, offering an effective solution for segmentation tasks with blurred boundaries between different categories.

IV. CONCLUSION
In this paper, we propose a novel convolutional neural network, AGCNet, which extracts multi-scale contextual information to bridge the semantic information gap between different layers and effectively suppresses the interference of background noise in deep semantic information through dynamic modelling of long-range dependency. AGCNet includes two novel modules: a multi-scale semantic fusion module (MSFM) and a context-aware pyramid aggregation module (CPAM). The MSFM enhances representation capability by collecting contextual information at different scales to adapt to the large variation in polyp size. The CPAM aggregates feature information across different regions to boost the network's ability to utilize global context, and models long-range dependency through dual attention. More importantly, AGCNet maintains excellent real-time performance while preserving segmentation accuracy, which is of great significance for clinical practice. We conducted extensive experiments on the Kvasir-SEG, CVC-ClinicDB, ETIS-LaribPolypDB and CVC-ColonDB datasets to confirm the effectiveness of AGCNet. Our future work will focus on training AGCNet with more extreme cases to strengthen its learning ability, enhance its robustness, and integrate it into the colonoscopy procedure.