Self-Adaptive Teacher-Student framework for colon polyp segmentation from unannotated private data with public annotated datasets

Colon polyps have become a focal point of research due to their heightened potential to develop into colorectal cancer, one of the leading causes of cancer-related mortality worldwide. Although numerous colon polyp segmentation methods have been developed using public polyp datasets, they tend to underperform on private datasets due to inconsistencies in data distribution and the difficulty of fine-tuning without annotations. In this paper, we propose a Self-Adaptive Teacher-Student (SATS) framework to segment colon polyps from unannotated private data by utilizing multiple publicly annotated datasets. SATS trains multiple teacher networks on public datasets and then generates pseudo-labels on private data to assist in training a student network. To enhance the reliability of the pseudo-labels from the teacher networks, SATS includes a newly proposed Uncertainty and Distance Fusion (UDFusion) strategy. UDFusion dynamically adjusts the pseudo-label weights based on a novel reconstruction similarity measure, bridging the gap between private and public data distributions. To ensure accurate identification and segmentation of colon polyps, SATS also incorporates a Granular Attention Network (GANet) architecture for both the teacher and student networks. GANet first identifies polyps roughly from a global perspective by encoding long-range anatomical dependencies and then refines this identification through multi-scale background-foreground attention to remove false-positive areas. The SATS framework was validated on three public datasets and one private dataset, achieving 76.30% IoU, 86.00% Recall, and 7.01 pixels HD. These results outperform five existing methods, indicating the effectiveness of this approach for colon polyp segmentation.

Comment 1. In the experimental results section, it seems that the distribution bias among the public datasets leads to unstable prediction results. Did the authors decide the selection of the three public datasets with any specific criteria in mind, or were any pre-processing methods used to enhance the stability of pseudo-label generation?

RESPONSE:
We sincerely thank you for raising these questions.
• The selection of the three public datasets was based on the following specific criteria:
1. Public availability and usage in the community. The three selected datasets, CVC-ClinicDB, CVC-ColonDB, and Kvasir, are widely recognized and used in the research community, facilitating comparison with other methods and ensuring reproducibility.
2. Annotation quality. We prioritized datasets with high-quality, expert-annotated ground truth to ensure reliable training of the teacher networks, thereby preventing cumulative errors in the student network.

• Distribution bias is addressed via our proposed UDFusion strategy rather than by pre-processing methods. The core innovation of our method is to leverage the distribution biases among different datasets to generate pseudo-labels of varying quality. Instead of preprocessing to enhance pseudo-label quality, we embrace these differences, as they provide a rich source of information that our framework can utilize. Specifically, our UDFusion strategy dynamically adjusts the weights of the pseudo-labels generated by the teacher networks, which effectively integrates the varying levels of reliability from the different datasets and improves the overall robustness and accuracy of the student network.
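The dynamic weighting described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the softmax-style normalization, the `temperature` parameter, and all function names are assumptions for the sketch.

```python
import numpy as np

def udfusion_weights(recon_errors, temperature=1.0):
    """Turn per-teacher reconstruction errors into fusion weights.

    A lower reconstruction error means the private image is closer to
    that public dataset's distribution, so that teacher's pseudo-label
    receives a higher weight (softmax over negated errors, an assumed
    normalization for this sketch).
    """
    errors = np.asarray(recon_errors, dtype=float)
    logits = -errors / temperature      # small error -> large logit
    logits -= logits.max()              # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def fuse_pseudo_labels(teacher_masks, weights):
    """Weighted average of the teachers' soft segmentation masks."""
    masks = np.stack(teacher_masks, axis=0)      # (n_teachers, H, W)
    return np.tensordot(weights, masks, axes=1)  # (H, W)
```

For example, reconstruction errors of `[0.1, 0.5, 0.9]` yield monotonically decreasing weights that sum to one, so the teacher whose public dataset best matches the private image dominates the fused pseudo-label.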
Manuscript changes - We have added the specific criteria for public dataset selection to our manuscript to make it more comprehensive; please see page 10, lines 302-304. We have also added a statement that we specifically designed the UDFusion strategy to utilize pseudo-labels of different qualities, rather than using any preprocessing methods, to address the distribution bias; please see page 11, lines 319-323.
Comment 2. In line 176, it is mentioned that the UDFusion module uses u_i to assess the reliability of pseudo-labels through contextual information. How is this contextual information represented? Is there a direct connection between formula (1) and the unstable distribution in the final visualized results?

RESPONSE:
We sincerely thank you for these good questions.

• As explained in Eqs. (4) and (5) in the manuscript, the contextual information for u_i is represented by the distribution distances between a private image and the public datasets. More specifically, the UDFusion module pretrains three encoder-decoder networks on the three public datasets. By inputting a private image into these pretrained networks and evaluating the quality of the restored images, the UDFusion module obtains the distribution distance between the input private image and each of the three public datasets. Because this distribution distance evaluation focuses on the overall contextual information shared between the input image and the restored image, we state that the UDFusion module uses u_i to assess the reliability of pseudo-labels through contextual information.
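This reconstruction-based distance can be sketched as follows; plain MSE stands in for the restoration-quality measure (an assumption for the sketch, since the manuscript's Eqs. (4) and (5) define the actual measure), and the function names are illustrative.

```python
import numpy as np

def reconstruction_distance(private_img, restored_img):
    # Mean squared error between the input private image and the image
    # restored by an encoder-decoder pretrained on one public dataset.
    # A larger error suggests the private image lies farther from that
    # dataset's distribution.
    a = np.asarray(private_img, dtype=float)
    b = np.asarray(restored_img, dtype=float)
    return float(np.mean((a - b) ** 2))

def distances_to_public_sets(private_img, pretrained_nets):
    # One distance per pretrained encoder-decoder; each element of
    # `pretrained_nets` is a callable mapping an image to its restoration.
    return [reconstruction_distance(private_img, net(private_img))
            for net in pretrained_nets]
```

A network that restores the private image faithfully produces a near-zero distance, indicating that the corresponding public dataset's teacher should be trusted more for this image.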
• Yes, there is a direct connection between formula (1) and the unstable distribution in the final visualized results. Formula (1) represents the process of building the pseudo-labels that are used for training the student module. Due to the limited number of public datasets, the pseudo-labels are unstable, which further causes the unstable predictions in the final visualized results.
Manuscript changes - We have added an explanation of how the contextual information for u_i is represented to make the manuscript clearer; please see page 6, lines 178-179. We have also added a discussion of the connection between formula (1) and the unstable distribution in the final visualized results; please see page 19, lines 513-515.
Comment 3. In line 189, there are two colons ":". Is this a typographical error?

RESPONSE:
We sincerely thank you for pointing out this typo. We have double-checked every sentence to ensure there are no remaining typos.