Entropy and distance-guided super self-ensembling for optic disc and cup segmentation

Segmenting the optic disc (OD) and optic cup (OC) is crucial for accurately detecting changes in glaucoma progression in the elderly. Recently, various convolutional neural networks have emerged to deal with OD and OC segmentation. Due to the domain shift problem, achieving high-accuracy segmentation of OD and OC across datasets from different domains remains highly challenging. Unsupervised domain adaptation has attracted extensive attention as a way to address this problem. In this work, we propose a novel unsupervised domain adaptation method, called entropy and distance-guided super self-ensembling (EDSS), to enhance the segmentation performance of OD and OC. EDSS is comprised of two self-ensembling models, and Gaussian noise is added to the weights of the whole network. Firstly, we design a super self-ensembling (SSE) framework, which combines two self-ensembling modules to learn more discriminative information from images. Secondly, we propose a novel exponential moving average with Gaussian noise (G-EMA) to enhance the robustness of the self-ensembling framework. Thirdly, we propose an effective multi-information fusion strategy (MFS) to guide and improve the domain adaptation process. We evaluate the proposed EDSS on two public fundus image datasets, RIGA+ and REFUGE. Extensive experimental results demonstrate that the proposed EDSS outperforms state-of-the-art segmentation methods with unsupervised domain adaptation; e.g., the Dice_mean scores on the three test sub-datasets of RIGA+ are 0.8442, 0.8772 and 0.9006, respectively, and the Dice_mean score on the REFUGE dataset is 0.9154.


Introduction
Glaucoma is a neurodegenerative disease of the optic nerve and one of the leading causes of blindness in the elderly [1]. A larger cup-to-disc ratio (CDR) indicates a higher probability of developing glaucoma and vice versa, hence monitoring the CDR is important [2]. Segmenting the OD and OC is a crucial step in accurately measuring the CDR and detecting changes in glaucoma progression over time. Recently, various deep learning models have been developed to accurately segment the OD and OC from fundus images [2][3][4]. However, a segmentation model well trained on one dataset usually fails to deliver satisfactory performance on an unseen dataset due to the problem of domain shift (see Fig. 1) [5], which greatly limits the application of segmentation models in practice. Therefore, the need to address the domain shift problem has become increasingly urgent, and the design of effective and robust segmentation methods has attracted much attention in clinical medicine and computer vision.
Domain shift refers to a situation where the statistical distribution of the source domain used for training a machine learning model does not align with the statistical distribution of the target domain used for testing, which usually degrades the performance of the trained model. Recently, researchers have proposed various unsupervised domain adaptation techniques to address the performance degradation caused by domain shift in OD and OC segmentation tasks [6][7][8][9]. These techniques aim to improve segmentation performance on a target domain with no labeled data. For instance, Javanmardi et al. [10] designed an adversarially trained domain discriminator that enforces domain invariance in the feature representations; as a result, segmentation accuracy is improved when the testing data differ from the training data. Similarly, Wang et al. [11] designed an unsupervised domain adaptation method that combines boundary and entropy information for fundus image segmentation. Moreover, Lei et al. [8] presented a space alignment approach based on an adversarial network to learn domain-invariant feature maps for OD and OC segmentation. These studies demonstrate that unsupervised domain adaptation methods can improve semantic segmentation performance. However, most approaches adopt adversarial domain adaptation to learn domain-invariant feature maps, which usually makes the training process unstable and difficult to optimize. This is because adversarial training involves a delicate balance between the feature extractor and the domain classifier, which can easily be disrupted by small changes in the input data or the model parameters.
To alleviate this problem, many researchers are exploring alternative methods that are more reliable and robust. Recently proposed unsupervised domain adaptation methods based on the self-ensembling model have attracted more and more attention [12][13][14]. Self-ensembling [15] usually consists of one student network and one teacher network, where the weights of the student network are exponentially averaged to obtain the weights of the teacher network, ensuring that the teacher network learns more reliable weight information from the student network. Methods based on self-ensembling usually incorporate adversarial training to further align the feature distributions of the source and target domains, or introduce consistency loss terms that encourage the student network to produce consistent predictions across different perturbations of the input data. For instance, Zuo et al. [16] proposed a category-level adversarial framework, which focuses on aligning the features of the source and target domains. In another recent work, Xu et al. [17] designed a self-ensembling attention network to address domain shift, which is employed to promote the computation of the consistency loss on the unlabeled domain. Furthermore, Perone et al. [18] expanded upon a previously developed self-ensembling method for medical image segmentation, which effectively improves the generalization of the models. These self-ensembling-based methods can alleviate domain shift to some extent; however, their effectiveness is limited because they use only one self-ensembling module and do not fully exploit the consistency of the feature maps and the weight disturbance between the teacher and student networks. Consistency of the feature maps means that the feature maps extracted by the student and teacher networks carry consistent feature information. The student and teacher networks receive the same input images, so in the ideal case their segmentation results should be identical. If the student and teacher networks produce consistent feature maps, the model has learned domain-invariant features. Thus, a well-trained self-ensembling model can produce better segmentation results.
To overcome this limitation, it is important to develop an effective unsupervised domain adaptation method that improves both the consistency of the feature maps and the generalization of the model. In this paper, we propose a novel self-ensembling network, called entropy and distance-guided super self-ensembling (EDSS), to enhance the precision of OD and OC segmentation. Three main ideas underlie EDSS. First, a single self-ensembling module leads to ambiguous and noisy predictions because it may not have enough capacity to effectively capture the consistency regularity between the unlabeled image feature maps from the target domain. Hence, we construct a novel super self-ensembling framework that contains two self-ensembling modules to learn more discriminative features. Second, since the student network in self-ensembling is prone to overfitting during training, the teacher network may in turn produce ambiguous predictions on target domain images. Therefore, we add noise to the weights of the proposed super self-ensembling to enhance the generalization of the model. Third, to learn more domain-invariant features, we design a multi-information fusion strategy, which fuses the entropy map, the signed distance map and the initially predicted mask as the input of the second self-ensembling module to improve the accuracy of OD and OC segmentation.
The contributions of EDSS can be summarized as follows: (1) A novel super self-ensembling (SSE) model is developed, which combines two self-ensembling modules to learn more domain-invariant information.
(2) Exponential moving average with Gaussian noise (G-EMA) is proposed to improve the robustness of each self-ensembling module, and experiments prove that it produces more accurate OD and OC segmentation.
(3) A simple yet effective multi-information fusion strategy (MFS) is designed to integrate entropy maps, signed distance maps and the initially predicted masks into self-ensembling, which is effective for guiding feature alignment.
(4) We conduct extensive experiments on two public fundus image datasets; the results demonstrate the effectiveness of EDSS as a whole and of each of its components.

Methodology
In this section, we introduce the proposed EDSS in detail. Figure 2 shows the architecture of EDSS for OD and OC segmentation. EDSS is comprised of two self-ensembling models, and Gaussian noise is added to the weights of the whole network. The second self-ensembling network takes a concatenated input consisting of the predicted masks, Shannon entropy maps and signed distance maps, and is used to improve domain-adaptive performance. Specifically, SSE is designed to effectively capture the consistency of features and reduce noise in the segmentation results predicted by the teacher network. SSE contains two self-ensembling modules, each of which utilizes UNet as the backbone network. The first self-ensembling module produces the initial predictions of the optic disc and cup, while the second learns more domain-invariant information to obtain accurate segmentation results. In each self-ensembling module, we use the exponential moving average with Gaussian noise (G-EMA) of the student network's weights to obtain the teacher network's weights, which increases the parameter disturbance and improves the stability of SSE. Specifically, the first self-ensembling network generates predicted masks, from which the Shannon entropy map and signed distance map are calculated; we then fuse the predicted masks, Shannon entropy map and signed distance map and employ them as the input of the second self-ensembling model. In each self-ensembling module, the student network is optimized by minimizing the consistency loss L_con(X_T) and the supervised loss L_seg(X_S), while the teacher network is updated by G-EMA of the student network.

Super self-ensembling
An ensemble of models often performs better than a single model because it averages out individual models' biases and reduces variance. Inspired by this idea, French et al. [15] proposed a widely used ensemble method named self-ensembling, which adopts the exponential moving average of the weights of the student network f_SN as the weights of the teacher network f_TN to improve the stability of the model. Recently, self-ensembling has proven effective for unsupervised domain adaptation, improving the generalization of the model and its robustness to noise, and achieving better performance on small datasets [19,20]. Hence, many unsupervised domain adaptation methods employ self-ensembling as the backbone. However, if only one self-ensembling module is applied to guide the consistency regularity between predicted results on the target domain, the predictions output by f_TN may be unreliable and somewhat noisy.
A recent study in unsupervised domain adaptation suggests that incorporating an additional consistency loss between predicted masks, in addition to using the segmentation result, can greatly enhance the network's ability to learn domain-invariant information [21], which in turn improves accuracy in semantic segmentation tasks. Inspired by this idea, we design a novel super self-ensembling named SSE, which consists of two self-ensembling structures to obtain a consistent relationship between feature maps of the target domain. In the proposed SSE, UNet (four layers) and light-UNet (two layers) are employed as the backbone networks of the first and second self-ensembling modules, respectively. Since the input of the second self-ensembling module consists of the predicted results, entropy map and signed distance map, a light-UNet backbone can already extract semantic features well. In addition, it is worth noting that the first and second self-ensembling modules have different inputs, allowing the different modules to extract different discriminative features.
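As a concrete illustration, the two-stage forward pass described above can be sketched as follows. This is a minimal sketch under assumptions, not the authors' implementation: `stage1` and `stage2` stand in for the four-layer UNet and the two-layer light-UNet, and `entropy_fn`/`sdm_fn` are placeholders for the entropy-map and signed-distance-map computations described in the multi-information fusion strategy.

```python
import torch
import torch.nn as nn

class SSE(nn.Module):
    """Minimal sketch of the two-stage super self-ensembling forward pass.

    `stage1` and `stage2` are placeholders for the 4-layer UNet and the
    2-layer light-UNet backbones; any modules with compatible channel
    counts will do for illustration.
    """
    def __init__(self, stage1: nn.Module, stage2: nn.Module):
        super().__init__()
        self.stage1 = stage1
        self.stage2 = stage2

    def forward(self, x, entropy_fn, sdm_fn):
        # First self-ensembling: initial OD/OC probability maps.
        p1 = torch.sigmoid(self.stage1(x))
        # Multi-information fusion: concatenate the initial prediction
        # with its entropy map and signed distance map.
        fused = torch.cat([p1, entropy_fn(p1), sdm_fn(p1)], dim=1)
        # Second self-ensembling: refined OD/OC prediction.
        return torch.sigmoid(self.stage2(fused))
```

Note that the channel count of `stage2`'s input must equal the sum of the channels of the three fused maps.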
The total loss function, which is used to optimize the weights of the network, contains two parts: a supervised loss L_seg(X_S) and an unsupervised loss L_con(X_T). Specifically, the dice loss is very effective in scenarios where the number of labeled samples from different classes is seriously unbalanced; thus, it is utilized as L_seg(X_S) to optimize the parameters of SSE. Furthermore, since the segmentation results on the unlabeled target domain usually contain some noise (irrelevant or misleading information), we choose the mean squared error loss as L_con(X_T) to adjust the learning bias between the target domain predictions generated by f_SN and f_TN, respectively. Finally, the total loss L_total(X_S, X_T) is a weighted fusion of L_seg(X_S) and L_con(X_T):

L_total(X_S, X_T) = L_seg(X_S) + λ L_con(X_T),

where X_S and X_T represent the source and target images, respectively, λ is the trade-off weight, and L_seg(X_S) and L_con(X_T) are the supervised and unsupervised losses. The supervised loss L_seg(X_S) is a dice loss that computes the segmentation error of SSE on the source images:

L_seg(X_S) = 1 − (2 Σ_i f_i g_i) / (Σ_i f_i + Σ_i g_i),

where X_S are the source images, f_i is the predicted probability of pixel i belonging to the class given by the ground truth, and g_i ∈ {0, 1} indicates the corresponding class label in the ground truth. The unsupervised loss L_con(X_T) is a mean squared error that measures the consistency of the segmentation results output by f_SN and f_TN for the same target input X_T under different disturbances:

L_con(X_T) = E_{x_T ∼ X_T} ‖ s(f_SN(x_T)) − s(f_TN(x_T)) ‖²,

where E_{x_T ∼ X_T} denotes the mathematical expectation, x_T ∼ X_T means x_T obeys the probability distribution of X_T, and s is the softmax activation function used to compute the probability of the prediction maps.
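Under the definitions above, the two loss terms can be sketched in PyTorch as follows. This is an illustrative sketch: the trade-off weight `lam` and the mean reduction are assumptions, since the text only specifies a weighted fusion of the two terms.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Supervised loss L_seg on labeled source images: soft dice loss,
    robust to class imbalance between OD/OC and background pixels."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def consistency_loss(student_logits, teacher_logits):
    """Unsupervised loss L_con on target images: mean squared error
    between the softmaxed student and teacher predictions."""
    s = torch.softmax(student_logits, dim=1)
    t = torch.softmax(teacher_logits, dim=1)
    return ((s - t) ** 2).mean()

def total_loss(pred_s, target_s, student_logits_t, teacher_logits_t, lam=1.0):
    """L_total = L_seg(X_S) + lam * L_con(X_T); `lam` is an assumed
    trade-off weight for the fusion of the two terms."""
    return dice_loss(pred_s, target_s) + lam * consistency_loss(
        student_logits_t, teacher_logits_t)
```

Only the student network's parameters receive gradients from this loss; the teacher is updated by the moving-average rule instead.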
It should be mentioned that self-ensembling is trained by forward propagation of the inputs, computation of the loss function, and backpropagation of gradients to update the weights of the student network. The weights of the teacher network are updated to a weighted average (Gaussian noise EMA; details are introduced in Section 2.2) of their current values and the student network's new weights. Thus, the teacher network represents a smoothed, regularized version of the student network, which often generalizes better on unseen data. In the test step, the output of the teacher network is taken as the final result.
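The teacher update described above amounts to an exponential moving average over the student's weights, applied outside of backpropagation. A minimal sketch of the plain EMA step (without the Gaussian noise discussed in Section 2.2):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Set each teacher parameter to a weighted average of its current
    value and the corresponding student parameter. The teacher is never
    updated by gradient descent, only by this smoothing step."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)
```

Calling `ema_update(teacher, student)` once per training iteration, after the student's optimizer step, realizes the smoothing behavior described above.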

Gaussian noise EMA
In domain adaptation methods that utilize consistency regularization, perturbation of either the data or the network parameters plays a crucial role. Several methods [20,22,23] point out that, under different disturbances, the network can learn meaningful features and improve its generalization ability.
Data disturbance generally includes random rotation, random gamma, horizontal flip, vertical flip, and so on. Many existing methods adopt such data disturbances to perturb unlabeled data in image segmentation tasks. Parameter disturbance means adding noise to the parameters of the network, a form of network disturbance like the dropout operation [24]. For instance, Wan et al. [25] designed a regularization approach using DropConnect, which shows that adding noise to the weights of a neural network can improve its generalization performance. Moreover, Navid et al. [26] proposed a noisy activation function that introduces noise into the network, improving the network's performance and robustness. Although parameter disturbance is effective, few studies have considered adding Gaussian noise to self-ensembling to alleviate domain shift.
The framework of self-ensembling is a semi-supervised learning method [15], which integrates previous parameters of f_SN into the parameters of f_TN through an exponential moving average. Inspired by parameter disturbance methods, introducing parameter disturbance into self-ensembling can make the results predicted by f_TN and f_SN more robust, but this has been ignored by related research. Therefore, we design a novel Gaussian noise EMA (G-EMA) that adds noise to the student network's parameters in SSE. It should be mentioned that existing methods commonly employ Gaussian noise to disturb the input data; in contrast, our approach adds noise to the student network's parameters at each iteration. This is because parameter disturbance of the student network encourages SSE to learn more robust, domain-invariant features and prevents overfitting. After adding noise to the parameters of f_SN, the parameters of f_TN in the proposed SSE are calculated by the G-EMA formula:

t_noise^i = α t_noise^{i−1} + (1 − α) s_noise^i,

where s_noise^i and t_noise^i are the student and teacher networks' weights with Gaussian noise at training step i, respectively. α is the exponential moving average decay factor (the updating rate); it is set to 0.999 based on empirical studies showing it works well in many cases.
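The G-EMA update can be sketched as follows. This is a hedged sketch: the noise scale `sigma` is an assumed hyperparameter, as the text does not specify the variance of the Gaussian noise.

```python
import torch

@torch.no_grad()
def g_ema_update(teacher, student, alpha=0.999, sigma=1e-3):
    """G-EMA sketch: perturb each student parameter with zero-mean
    Gaussian noise, then fold it into the teacher via the exponential
    moving average  t_i = alpha * t_{i-1} + (1 - alpha) * s_i_noisy."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        s_noisy = s_p + sigma * torch.randn_like(s_p)
        t_p.mul_(alpha).add_(s_noisy, alpha=1.0 - alpha)
```

With `sigma=0` this reduces to the plain EMA update; the noise term is what provides the parameter disturbance discussed above.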

Multi-information fusion strategy
After obtaining the initial predictions from the first self-ensembling module in the proposed EDSS, we find that the OD predictions are good enough but the OC segmentation is less satisfactory, especially around the boundary. Therefore, we propose to utilize the entropy map and signed distance map computed from the initial predictions to further enhance the domain-adaptive performance of EDSS. The reasoning behind this approach is that high entropy (under-confidence) typically indicates significant uncertainty in the segmentation. Previous research [27] has shown that the signed distance map contains more domain-invariant features. By encoding the distance from each pixel to the nearest object boundary, the signed distance map distinguishes pixels inside the object from those outside it, which is useful for segmentation. Moreover, the sign of the signed distance map encodes the direction of the nearest boundary, which can improve the accuracy of boundary segmentation and feature extraction, even when the object has a complex shape or is partially occluded. Since the OD and OC are usually occluded by capillaries, using a signed distance map helps to segment them effectively.
In detail, given a predicted mask probability map p, the entropy map E is calculated by the Shannon entropy:

E(h, w) = − Σ_{n=1}^{N} p_n(h, w) log p_n(h, w),

where N is the total number of classes of each pixel, and H and W represent the height and width of the entropy map, respectively, with h ∈ {1, …, H} and w ∈ {1, …, W}. The signed distance map ϕ(x_t) is defined as:

ϕ(x_t) = − inf_{y_t ∈ ∂Ω} ‖x_t − y_t‖₂, if x_t ∈ Ω_OD ∪ Ω_OC;  ϕ(x_t) = + inf_{y_t ∈ ∂Ω} ‖x_t − y_t‖₂, if x_t ∈ Ω̄_{OC∪OD},

where x_t is a pixel in the image, y_t is a pixel on the boundary ∂Ω of the OD region, Ω_OD represents the region of OD, Ω_OC denotes the region of OC, and Ω̄_{OC∪OD} denotes the image region that does not contain the regions of OD and OC. After obtaining the above results, we concatenate the entropy map, signed distance map and initially predicted masks as the input of the second self-ensembling module to obtain more refined segmentation results for OD and OC.
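Both maps can be computed from the first-stage prediction with a few lines of NumPy/SciPy. This is an illustrative sketch (using `scipy.ndimage.distance_transform_edt` to realize the infimum of Euclidean distances), not the authors' exact implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def entropy_map(prob):
    """Pixel-wise Shannon entropy of an (N, H, W) class-probability map:
    E = -sum_n p_n * log(p_n), clipped for numerical stability."""
    p = np.clip(prob, 1e-8, 1.0)
    return -(p * np.log(p)).sum(axis=0)

def signed_distance_map(mask):
    """Signed distance to the region boundary for a binary (H, W) mask:
    negative inside the region, positive outside, computed with two
    Euclidean distance transforms."""
    mask = mask.astype(bool)
    inside = distance_transform_edt(mask)    # distance to background
    outside = distance_transform_edt(~mask)  # distance to foreground
    return outside - inside
```

The initial mask, entropy map and signed distance map are then concatenated along the channel axis as the input of the second self-ensembling module.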

Datasets
In the experiments, following Ref. [9], we first utilize the RIGA+ [9] dataset to evaluate the proposed EDSS. RIGA+ is a mixed dataset with domain shift, comprising five sub-datasets: Magrabia, BinRushed, BASE1, BASE2 and BASE3. Magrabia and BinRushed are used as the source domain, while BASE1, BASE2 and BASE3 are employed as target domains 1, 2 and 3, respectively. BinRushed and Magrabia consist of 195 and 95 labeled retinal fundus images, respectively, for supervised training. BASE1, BASE2 and BASE3 contain both labeled and unlabeled retinal fundus images; the unlabeled images are used for unsupervised training, while the labeled images are used to test the segmentation accuracy of the network. Specifically, BASE1 includes 227 unlabeled and 35 labeled images, BASE2 includes 238 unlabeled and 30 labeled images, and BASE3 includes 252 unlabeled and 27 labeled images. The samples in RIGA+ are provided by the official dataset as ROIs (OD regions) of 800 × 800 pixels cropped from the original images. Table 1 provides details of the RIGA+ dataset. Besides RIGA+, we also utilize the REFUGE dataset [28] to evaluate the proposed EDSS. REFUGE includes 1200 original images with resolutions from 1634 × 1634 to 2124 × 2056. In our experiments, the training set of REFUGE (400 labeled images) is used as the source domain, the validation set (400 unlabeled images) is used as the target domain, and the test set is used as test data. Since OD and OC segmentation is performed on ROIs of the original images in REFUGE, we first detect the center of the OD with a popular pre-trained disc detection model [4]; then, following Ref. [29], which crops an 800 × 800 pixel ROI from each REFUGE image, we crop the same 800 × 800 ROIs to ensure a fair comparison. Moreover, each image is resized to 400 × 400 as the network input. A detailed description of the REFUGE dataset is shown in Table 2.

Implementation details
To prove the domain adaptation performance of EDSS, we only adopt UNet as the backbone network, without any pre-trained model. Following Ref. [9], we perform segmentation on the ROIs (OD regions) of fundus images. Data augmentation techniques include random flip, random brightness contrast, adding Gaussian noise, transposing, changing hue and saturation, etc. To fit EDSS's receptive field, the ROIs are resized to a compact dimension of 400 × 400. In all experiments, we set the initial learning rate to 1 × 10⁻⁴ and the weight decay to 5 × 10⁻⁴, and Adam [30] is utilized to optimize the parameters. Each model is implemented in PyTorch 1.4 on an NVIDIA GeForce GTX 1080. The Dice coefficient (Dice) and the absolute error of the vertical cup-to-disc ratio (δ_CDR) are used as evaluation metrics to quantitatively compare the segmentation results of the different methods. Higher values of Dice_OD, Dice_OC and Dice_mean and a lower value of δ_CDR indicate better OD and OC segmentation results. These evaluation metrics are expressed as:

Dice = 2TP / (2TP + FP + FN),
δ_CDR = | VD^p_cup / VD^p_disc − VD^g_cup / VD^g_disc |,

where TP, TN, FP and FN correspond to true positives, true negatives, false positives and false negatives, respectively. VD^p_cup and VD^p_disc represent the predicted vertical diameters of the cup and disc, and VD^g_cup and VD^g_disc represent the corresponding ground-truth vertical diameters.
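For concreteness, the two metrics can be implemented directly from these definitions; a plain-Python sketch operating on flattened binary masks:

```python
def dice_score(pred, gt):
    """Dice = 2*TP / (2*TP + FP + FN) for flat binary label sequences."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    return 2 * tp / (2 * tp + fp + fn)

def delta_cdr(vd_cup_p, vd_disc_p, vd_cup_g, vd_disc_g):
    """Absolute error of the vertical cup-to-disc ratio between the
    predicted and ground-truth vertical diameters."""
    return abs(vd_cup_p / vd_disc_p - vd_cup_g / vd_disc_g)
```

Dice is computed separately for the OD and OC masks; Dice_mean is their average.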
To be specific, UNet is an encoder-decoder network and a widely used baseline in medical image segmentation. CyCADA is an input-level domain adaptation model that enforces cycle consistency and is broadly applicable to segmentation problems on general images. BEAL and pOSAL are output-level domain adaptation methods for retinal fundus image segmentation. DSBN is a feature-level domain adaptation model that uses domain-specific batch normalization parameters; it has made impressive progress in image segmentation tasks such as prostate, OD and OC segmentation. DoCR is a Fourier transform-based unsupervised domain adaptation method that utilizes auxiliary high-frequency reconstruction modules to produce high-quality segmentation results, and it performs substantially better than many unsupervised domain adaptation methods. The other domain adaptation methods include U-D4R, FSM and Zhou et al. [39], which enhance segmentation accuracy through source-free unsupervised domain adaptation. Tables 3-5 give the comparative results of the recent comparison methods, where the results of U-D4R, FSM and ProSFDA are provided by Ref. [38], and the results of the other compared methods are provided by Ref. [9]. As shown in Tables 3-5, UNet, the backbone model of EDSS, is trained only on the source domain (Source only) in a supervised manner. Although UNet is inferior to other backbones without domain adaptation, e.g., DeepLabv3+, the segmentation performance of EDSS is still better than that of pOSAL, which adopts DeepLabv3+ as the backbone; this indicates that our proposed domain adaptation strategy is very effective. We also compare EDSS with fully supervised methods trained on data from all domains (all labeled samples of one source and one target domain). The results of the fully supervised methods are utilized as the upper bound (benchmark) to further evaluate the performance of the domain adaptation methods, as shown in Tables 3-5.

2) Segmentation results on BASE2
Table 4 presents the segmentation performance of the proposed EDSS and the comparison methods on BASE2. EDSS gains the highest overall performance, and the Dice_mean score obtained by EDSS is 0.0021 higher than the second-highest Dice_mean score, obtained by Zhou et al. [39]. In addition, most unsupervised domain adaptation methods achieve relatively high scores compared to UNet because they adapt the segmentation model to the data distribution of the target images. The results of UNet and U-D4R are not very good, especially the Dice_OC score, which demonstrates that they are not effective on the BASE2 dataset.
3) Segmentation results on BASE3
We also compare the proposed EDSS with nine comparison methods on the BASE3 dataset, as seen in Table 5. Table 5 shows that UNet achieves a Dice_mean score of 0.8643, while the other compared methods obtain higher Dice_mean scores ranging from 0.8866 to 0.9108. Our proposed EDSS improves the Dice_OD, Dice_OC and Dice_mean scores by 0.0048, 0.0077 and 0.0062, respectively, compared with the best comparison method, DoCR, on the BASE3 dataset.

4) Segmentation results analysis
Comparing the segmentation results in Tables 3-5, our proposed EDSS shows better results than the compared methods under the metrics Dice_OD, Dice_OC and Dice_mean. Specifically: 1) UNet, a method without domain adaptation, is trained only on the source domain in a supervised manner and cannot effectively learn domain-invariant features, so its segmentation results are inferior to those of the methods with domain adaptation strategies. From Tables 3-5, we can also find that the segmentation results of the proposed EDSS are close to the upper bound, which indicates that EDSS has good domain adaptation ability. 2) The performance of CyCADA is generally better than that of UNet on the three target domain test sets; this can be ascribed to the source-like images generated by CyCADA, which effectively minimize the appearance dissimilarity between source and target images. 3) BEAL achieves a great performance improvement over the previous two methods (UNet, CyCADA) because it utilizes additional boundary features to further improve segmentation accuracy. 4) pOSAL obtains better segmentation results than BEAL, possibly because pOSAL is based on a morphology-aware segmentation loss, whereas BEAL's boundary-aware loss requires more training data and is not as stable. 5) DSBN outperforms pOSAL; one reason may be that DSBN uses an adaptive loss that learns both local and global spatial information more effectively than the loss of pOSAL. 6) DoCR achieves better segmentation results than UNet, CyCADA, BEAL, pOSAL and DSBN because it reconstructs the high-frequency image and adopts a domain-specific module to remove low-frequency domain-sensitive information, boosting domain-invariant feature extraction. 7) Since U-D4R and FSM employ new source-free unsupervised domain adaptation methods based on the UNet architecture, they obtain segmentation results roughly comparable to CyCADA, BEAL, pOSAL and DSBN. 8) ProSFDA and Zhou et al. [39] outperform U-D4R and FSM because they enhance the effectiveness of domain adaptation by explicitly reducing the differences between domains. 9) Our proposed EDSS is superior to the other comparison methods, which indicates that super self-ensembling guided by entropy and distance is more effective in improving the consistency of the feature maps and the generalization ability of the model.
A good network should achieve effective and stable segmentation performance on different datasets simultaneously. From Tables 3-5, we can easily find that: 1) the Dice_OD of EDSS is increased by 0.93%, 1.32% and 0.28% compared to the best-compared method on the respective datasets, which proves that EDSS has better stability than the other comparison methods; 2) the domain adaptation methods overall demonstrate superior and more stable results than UNet because they can effectively tackle the issue of domain shift; 3) U-D4R and FSM achieve good segmentation results on BASE1, but their performance declines on BASE2 and BASE3, which could be caused by relatively large differences in data distribution between the source and target domains; 4) all methods generally obtain better performance on the BASE1 test set than on the BASE2 and BASE3 test sets. One possible reason is that feature alignment between source and target domains is much harder for BASE2 and BASE3 than for BASE1, making OD and OC segmentation on BASE2 and BASE3 more challenging.
To demonstrate that the observed differences in Dice coefficients are meaningful, we perform a statistical analysis of the data presented in Tables 3-5 by calculating the mean and standard deviation of the results of EDSS and each compared method over the three target domain datasets (BASE1, BASE2 and BASE3). These statistics, which allow us to compare the stability of the different methods, are shown in Table 6. From Table 6, we can observe that the segmentation results of EDSS are higher than those of all compared methods, which indicates that EDSS produces better and more stable OD and OC segmentation results.

5) Qualitative results and analysis
In this section, we present visualizations of the segmentation results of different methods to further prove the effectiveness of EDSS. From Fig. 3, we can see that EDSS distinguishes well between the OD and OC regions (target regions) and the background, and the boundaries of the OD and OC are very clear and very close to those in the ground truth. Taking the first row in Fig. 3 as an example, UNet, DeepLabv3+ and BEAL cannot segment the target region well because the boundary in this retinal fundus image is too blurry, while the segmentation results of EDSS remain good and close to the ground truth. This indicates that EDSS obtains better segmentation results by extracting many more domain-invariant features.

1) Quantitative results and analysis
To further demonstrate the effectiveness of our proposed method on challenging OD and OC segmentation tasks, we conduct a validation on another widely used dataset, the REFUGE dataset. Additionally, we perform a comparative analysis with other state-of-the-art methods, including UNet (source only) [31], DeepLabv3 [40], DeepLabv3+ [32], CycleGAN [41], Pix2Pix [42], SynSeg-Net [43], SIFA [44], AdaptSegNet [45], BEAL [11], ESS-Net [46], BBUDA [47], pOSAL (one model) [34], pOSAL* (an ensemble of five models) [34], IOSUDA [6] and He et al. [29], to highlight the advantages of our approach. We select these methods as comparison methods because they have shown promising performance on the REFUGE dataset. Table 7 lists the comparison results, where the results of pOSAL, pOSAL* and IOSUDA are provided by Ref. [6] (since Ref. [34] only reports the results of pOSAL*), and the results of the other compared methods are provided by Ref. [29]. It should be mentioned that pOSAL* proposed in Ref. [34] is slightly superior to EDSS because its results are calculated from an ensemble of five models; when compared only with the single-model pOSAL [34], EDSS shows better performance on OD and OC segmentation. Moreover, from Table 7, we can draw the same conclusion as from Tables 3-6, which further proves the effectiveness of our proposed method.

2) Qualitative results and analysis
We visualize the segmentation results of different methods on the REFUGE dataset and the corresponding ground truth, as shown in Fig. 4. From Fig. 4, we can see that: 1) our proposed method is superior to UNet, DeepLabv3 and DeepLabv3+, because it can capture more domain-invariant information across different domains; 2) compared to UNet, our proposed method substantially improves the segmentation results, especially in making the predicted boundary close to the ground truth, which also demonstrates that the proposed modules of EDSS are valid.
3) Verifying the performance of EDSS on different domains
We take the REFUGE dataset as an example to further validate that the entropy and distance-guided super self-ensembling model (EDSS) can improve segmentation performance when the problem of domain shift exists. Specifically, we first divided all images of the REFUGE training set into 320 labeled training samples and 80 labeled test samples, aiming to test the segmentation performance of EDSS on the source domain. Then, we adopted the 320 labeled images from the REFUGE training set and 320 unlabeled images from the REFUGE validation set as training samples, and randomly selected 80 images from the REFUGE test set as test samples, aiming to test the segmentation performance of EDSS on the target domain.
The experimental results are shown in Table 8. From Table 8, we can observe that the results of EDSS on the target domain are very close to those obtained on the source domain. This indicates that EDSS can alleviate the domain shift problem and achieve relatively good OD and OC segmentation performance, which verifies our hypothesis well.

Ablation study analysis
In this section, we evaluate the contribution of each component within EDSS, including SSE, G-EMA and MFS. UNet is utilized as the baseline. Table 9 presents the experimental results on the BASE2 test set of RIGA+. From Table 9, we can summarize the following points.
1) Self-ensembling performs better than UNet, which can be attributed to the student-teacher network structure that makes self-ensembling more robust to domain shift problems in medical images.
2) As shown in Table 9, the proposed SSE outperforms the original self-ensembling. 3) When G-EMA is added to update the weights of the self-ensembling models in SSE, the results (SSE+G-EMA in Table 9) show that almost all evaluation metrics increase, which indicates that more domain-invariant features are captured with the help of G-EMA. In particular, Dice OC increases by 2.21% on the BASE2 dataset compared to SSE without G-EMA.
4) When the MFS is further introduced into the framework of SSE+G-EMA to test its effectiveness, the results prove that MFS can effectively improve the segmentation performance.
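The G-EMA update evaluated above can be sketched as a plain EMA teacher update with an additive Gaussian perturbation on the weights. This is an illustrative NumPy sketch, not the paper's implementation: the momentum `alpha`, the noise scale `sigma`, and the exact point where the noise is injected are assumptions, as the excerpt does not specify them.

```python
import numpy as np

def g_ema_update(teacher, student, alpha=0.99, sigma=1e-3, rng=None):
    """EMA with Gaussian noise (G-EMA), sketched with assumed hyperparameters.

    teacher/student: dicts mapping parameter names to NumPy weight arrays.
    """
    if rng is None:
        rng = np.random.default_rng()
    for name, w_s in student.items():
        w_t = teacher[name]
        # Gaussian perturbation added to the exponentially averaged weights
        noise = rng.normal(0.0, sigma, size=w_t.shape)
        teacher[name] = alpha * w_t + (1.0 - alpha) * w_s + noise
    return teacher
```

With `sigma=0` this reduces to the standard EMA teacher update used in self-ensembling.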
Our simple yet effective MFS fuses the entropy map, the signed distance map and the initially predicted mask, which ensures that the network considers both segmentation uncertainty and spatial distance to further refine the segmentation results. Moreover, we also visualize the ablation results of different methods on the BASE2 test set of RIGA+. From Fig. 5, we can intuitively draw the same conclusion as from Table 9. Specifically, our proposed EDSS demonstrates a remarkable enhancement in OD segmentation compared with the baseline (blue bar in Fig. 5). In addition, gradually adding the proposed components results in a noticeable improvement in OD segmentation performance (gray bar in Fig. 5), which proves the validity of each proposed component of our method.
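The three inputs fused by the MFS can be sketched as follows. This is an illustrative NumPy version only: the entropy map is the standard pixel-wise Shannon entropy of the softmax output, and the signed distance map is computed here by a brute-force nearest-opposite-pixel search suitable only for toy-sized masks (a real implementation would use e.g. `scipy.ndimage.distance_transform_edt`); the sign convention (negative inside, positive outside) and the channel-stacking fusion are assumptions.

```python
import numpy as np

def entropy_map(prob):
    # prob: (C, H, W) softmax probabilities; pixel-wise Shannon entropy
    eps = 1e-8
    return -np.sum(prob * np.log(prob + eps), axis=0)

def signed_distance_map(mask):
    # Brute-force signed distance for small binary masks: distance to the
    # nearest pixel of the opposite class, negative inside the object
    fg = np.argwhere(mask == 1)
    bg = np.argwhere(mask == 0)
    if len(fg) == 0 or len(bg) == 0:
        return np.zeros(mask.shape)
    sdm = np.zeros(mask.shape)
    for (i, j), v in np.ndenumerate(mask):
        opp = bg if v == 1 else fg
        d = np.sqrt(((opp - [i, j]) ** 2).sum(axis=1)).min()
        sdm[i, j] = -d if v == 1 else d
    return sdm

def fuse(prob, mask):
    # Stack entropy map, signed distance map and the initial mask as channels
    return np.stack([entropy_map(prob), signed_distance_map(mask), mask], axis=0)
```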

Discussion and conclusion
In this study, a novel entropy and distance-guided super self-ensembling framework is designed to address OD and OC segmentation in fundus images. We first conduct a brief literature review, then introduce the details of our work, and finally design comprehensive experiments to verify that the proposed EDSS outperforms other unsupervised domain adaptation methods for OD and OC segmentation. In conclusion, this research advances self-ensembling network design and improves OD and OC segmentation results.
Although the proposed EDSS achieves good OD and OC segmentation performance, it still has a minor limitation. Specifically, when the target domain dataset has an intra-domain gap (i.e., images within the same target domain dataset have dissimilar appearance distributions), the performance of EDSS degrades. Therefore, in future work, we will design a strategy that minimizes both the inter-domain and intra-domain gaps to extract domain-invariant features more effectively. In addition, segmentation models trained on comprehensive datasets (covering different camera manufacturers and models, lighting conditions, operator skills, etc.) will likely perform robustly in clinical practice without frequent re-training. Periodically re-training models with added data from all domains can ensure continuous adaptation to changing clinical environments. However, these approaches usually require many high-quality labeled medical images, which are time-consuming and tedious to obtain. In contrast, our proposed EDSS can reduce the burden of manually labeling a large number of medical images. In future work, we aim to combine the above-mentioned training strategies with EDSS to further promote its robustness.
Funding. Science and Technology Development Plan Project of Jilin Province, China (20240101382JC); National Natural Science Foundation of China (62272096); Education Department of Jilin Province (JJKH20241463SK, JJH20221328SK).

Fig. 1 .
Fig. 1. Illustration of domain shift. The images in different rows are from the source domain (training images), target domain (validation images) and target domain (test images) of the REFUGE dataset, respectively.

Fig. 2 .
Fig. 2. Overview of the proposed EDSS. EDSS has three main parts: the SSE, the G-EMA and the MFS.

Fig. 3 .
Fig. 3. Visualization of segmentation results obtained by the different methods on the BASE2 dataset.

Fig. 4 .
Fig. 4. Visualization of segmentation results obtained by the different methods on the REFUGE dataset.

Fig. 5 .
Fig. 5. Comparison results of different methods on the BASE2 test set.

Table 3 presents the quantitative comparison results of our EDSS against nine other comparison methods on the BASE1 dataset. It shows that all methods except UNet achieve good accuracy in terms of the Dice Mean score. For example, the Dice Mean score achieved by UNet is 0.8359, while the other methods increase the Dice Mean score by 0.0493 to 0.0837 compared with UNet. Moreover, our proposed EDSS obtains a Dice Mean score of 0.9196, which significantly improves the segmentation performance compared to all comparison unsupervised domain adaptation methods on BASE1.

As shown in Table 9, the Dice Mean obtained by the original self-ensembling is 0.8912, while the Dice Mean obtained by the proposed base model SSE, which is constructed from U-Net, is 0.8989; that is, SSE improves the Dice Mean by 0.0077. When introducing our proposed domain adaptation strategy into SSE, the Dice Mean obtained by the final EDSS (SSE + G-EMA + MFS) is 0.9184, i.e., the domain adaptation strategy improves the Dice Mean by a further 0.0195. From the above analysis, it can be seen that, compared with SSE, the domain adaptation strategy plays the greater role in improving performance. In this part, only the predicted masks are used as the input of the second self-ensembling model of SSE, in order to isolate and verify the effectiveness of SSE.