GDCSeg-Net: general optic disc and cup segmentation network for multi-device fundus images

Abstract: Accurate segmentation of the optic disc (OD) and optic cup (OC) in fundus images is crucial for the analysis of many retinal diseases, such as the screening and diagnosis of glaucoma and atrophy segmentation. Due to the domain shift between different datasets caused by different acquisition devices and modes, and the inadequate training caused by small-sample datasets, existing deep-learning-based OD and OC segmentation networks generalize poorly across fundus image datasets. In this paper, adopting a mixed training strategy based on different datasets for the first time, we propose an encoder-decoder based general OD and OC segmentation network (named GDCSeg-Net) with the newly designed multi-scale weight-shared attention (MSA) module and densely connected depthwise separable convolution (DSC) module, to effectively overcome these two problems. Experimental results show that the proposed GDCSeg-Net is competitive with other state-of-the-art methods on five different public fundus image datasets: REFUGE, MESSIDOR, RIM-ONE-R3, Drishti-GS and IDRiD. Ablation results further show that the proposed DSC and MSA modules are beneficial for OC segmentation as well.


Introduction
The optic disc (OD) and optic cup (OC), retinal vessels and the macula are the three most salient features in retinal fundus images. The accurate segmentation of OD and OC in fundus images (as shown in Fig. 1) is crucial for the analysis of many retinal diseases; e.g., the cup-to-disc ratio derived from OD and OC segmentation is one of the main criteria for the clinical screening and diagnosis of glaucoma [1]. Because of their similar color and adjacent location, OD segmentation also greatly influences atrophy segmentation, especially peripapillary atrophy segmentation [2]. Figure 2 shows typical fundus images from 5 public datasets, including MESSIDOR [3], Drishti-GS [4], IDRiD [5], RIM-ONE-R3 [6] and REFUGE [7]. As can be seen from Fig. 2, the appearance differences between images from different datasets, known as domain shift [8], are obvious because of the different acquisition devices and modes, which is one of the major reasons for the poor generalization ability of deep-learning-based OD and OC segmentation methods. In addition, the sample sizes of the public datasets vary considerably. For example, the REFUGE and MESSIDOR datasets both contain 1200 fundus images, while Drishti-GS, RIM-ONE-R3 and IDRiD contain only 101, 159 and 81 images, respectively. With such insufficient training data, segmentation networks trained on a single small-sample dataset usually show poor segmentation performance and generalization.
The segmentation of OD and OC has been one of the popular research topics in fundus image analysis for years. Existing methods can be classified into traditional algorithms and deep-learning-based algorithms. Among traditional algorithms, Mittapalli et al. presented a glaucoma expert system based on the segmentation of OD and OC, in which the OD was segmented with an implicit region-based active contour model and the OC was segmented based on its structural and gray-level properties [9]. Morales et al. 
proposed a mathematical morphology and principal component analysis based method for the extraction of OD contour [10]. Aquino et al. used morphology and edge detection techniques and Hough circle transform to approximate a circular OD boundary [11]. Joshi et al. presented an automatic OD parameterization technique based on the segmentation of OD and OC in monocular retinal images [12]. In their following work, they proposed a depth discontinuity based approach to estimate OC boundary [13]. Cheng et al. proposed a super-pixel classification based OD and OC segmentation method for glaucoma screening [14].
In deep-learning-based algorithms, many convolutional neural network (CNN) based methods, such as the fully convolutional network (FCN) [15], U-Net [16] and its variants, and adversarial learning based networks, have been proposed for OD and OC segmentation in fundus images. Mohan et al. presented a CNN-based network named Fine-Net for OD segmentation, in which full-resolution residual networks (FRRN) and atrous convolution were adopted for efficient feature extraction [17]. In their following work, they introduced a prior CNN called P-Net, which was cascaded with Fine-Net to generate a more accurate OD segmentation map [18]. Jiang et al. presented an end-to-end region-based convolutional neural network for the joint segmentation of OD and OC [19]. Since Ronneberger et al. proposed U-Net for medical image segmentation, many CNN-based algorithms for OD and OC segmentation have used U-Net as the baseline network and achieved good performances [20][21][22][23]. To learn discriminative representations and produce segmentation probability maps, Fu et al. proposed a multi-scale U-shape convolutional network with a side-output layer, named M-Net, for OD and OC segmentation [20]. Gu et al. proposed a context encoder network (CE-Net) for 2D medical image segmentation [21]. Compared with M-Net, CE-Net performed better in OD segmentation. Shah et al. proposed a parameter-shared branched network (PSBN) to learn the optic disc and cup masks and a weak region-of-interest model-based (WRoIM) segmentation network to jointly segment OD and OC [22]. Shankaranarayana et al. proposed a novel depth estimation guided OD and OC segmentation network [23]. In addition, to overcome the problem of domain shift between different fundus image datasets, adversarial learning based methods have been introduced. Wang et al. presented a patch-based output space adversarial learning framework (pOSAL) to jointly segment OD and OC, which uses the DeepLabv3+ architecture [8]. 
In their following work, they presented an unsupervised boundary and entropy-driven adversarial learning (BEAL) framework to improve OD and OC segmentation performance [24]. Recently, graph convolution has also been applied in image segmentation. Tian et al. proposed a segmentation network based on graph convolution for OD and OC segmentation, which achieved good performances on REFUGE and Drishti-GS datasets [25].
As can be seen from Fig. 2, although the appearance differences, such as image size, image resolution, color gamut and field of view (FOV), between fundus images from different public datasets are obvious, the highlighted, circle-like characteristics of OD and OC are common to all. In this paper, based on a mixed training strategy over different datasets, an encoder-decoder structure based general OD and OC segmentation network is proposed, which can effectively overcome the problems of appearance differences caused by different acquisition devices and inadequate training caused by small-sample datasets. The major contributions of this paper are summarized as follows:
-A mixed training strategy is adopted for the first time to overcome the problems of domain shift caused by different acquisition devices and modes and inadequate training caused by small-sample datasets.
-An encoder-decoder structure based network with multi-scale information fusion and attention mechanism for general OD and OC segmentation in multi-device fundus images is proposed, named as GDCSeg-Net.
-A novel multi-scale weight-shared attention (MSA) module is proposed and embedded into the top layer of the encoder to integrate the multi-scale OD and OC feature information with channel and spatial attention mechanisms.
-A novel densely connected depthwise separable convolution (DSC) module is proposed and embedded as the output layer of the GDCSeg-Net, which fully fuses the multi-scale features extracted by depthwise separable convolution layer-by-layer via dense connections and leads the network to efficiently focus on the targets.
The remainder of the paper is organized as follows. In Section 2, the proposed method is described in detail. In Section 3, experimental results are shown and analyzed, followed by the conclusions and discussions in Section 4.
Figure 3 shows the overall framework of the proposed OD and OC segmentation method, which mainly includes two parts: region of interest (ROI) extraction and the proposed GDCSeg-Net for OD and OC segmentation.

ROI extraction network
Motivated by Ref. [26], we use a pre-trained U-Net to segment the OD roughly and extract the ROI. After the OD is roughly segmented by the pre-trained U-Net, the centroid of the OD is located and a 512×512 ROI is cropped around the centroid, which is taken as the input of GDCSeg-Net.
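The centroid-based cropping step can be sketched as follows; this is a minimal NumPy illustration (not the authors' released code), assuming the rough binary OD mask from the pre-trained U-Net is already available:

```python
import numpy as np

def crop_roi(image, od_mask, size=512):
    """Crop a size x size ROI centred on the OD centroid.

    `od_mask` is assumed to be the rough binary OD mask produced by the
    pre-trained U-Net; the window is clamped so the crop stays inside
    the image.
    """
    ys, xs = np.nonzero(od_mask)
    cy, cx = int(ys.mean()), int(xs.mean())        # OD centroid
    h, w = od_mask.shape
    half = size // 2
    top = min(max(cy - half, 0), max(h - size, 0))
    left = min(max(cx - half, 0), max(w - size, 0))
    return image[top:top + size, left:left + size]
```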

GDCSeg-Net architecture
As shown in Fig. 3, with the U-shape structure, the proposed GDCSeg-Net mainly includes feature encoder, multi-scale weight-shared attention (MSA) module, densely connected depthwise separable convolution (DSC) module and feature decoder. The basic U-shape encoder-decoder model with pre-trained ResNet34 [27] backbone as feature extractor is taken as our Baseline network.

Multi-scale weight-shared attention (MSA) module
As can be seen from CE-Net, CPFNet [28] and DenseASPP [29], multi-scale feature information can improve the performance of semantic segmentation. However, how to further effectively utilize the multi-scale feature information is still worth studying. As shown in Fig. 4, motivated by the recent multi-scale feature and attention mechanism based approaches [31][32][33][34], we propose a novel multi-scale weight-shared attention (MSA) module, which includes the depthwise separable convolution based multi-scale feature extractor and channel and spatial attention modules, to obtain OD and OC feature information effectively.
In the multi-scale feature extractor, we use four parallel depthwise separable convolutions with dilation rates of 1, 3, 5 and 7 to capture multi-scale information. To reduce the model parameters and the risk of overfitting, the four depthwise separable convolutions share their weights. The output multi-scale feature F_D ∈ R^{C×H×W} can be computed as:

F_D = Concat(D_1(F), D_3(F), D_5(F), D_7(F)),

where F ∈ R^{C×H×W} denotes the input feature map; C, H and W are the channel number, height and width of the feature map, respectively; Concat represents the concatenation operation; and D_{2i+1} represents the depthwise separable convolution with dilation rate 2i+1 (i = 0, 1, 2, 3).
In the attention module, channel and spatial attention mechanisms are applied for feature refinement. In the channel attention module, the max-pooled and average-pooled features pass through two fully connected layers, a ReLU and a sigmoid activation function to produce the channel attention map F_C′ ∈ R^{C×1×1}. The output of the channel attention, F_C ∈ R^{C×H×W}, can be computed as:

F_C′ = Sig(f_2(ReLU(f_1(MaxPool(F_D)))) + f_2(ReLU(f_1(AvgPool(F_D))))),
F_C = F_C′ ⊗ F_D,

where Sig denotes the sigmoid function and ⊗ denotes element-wise multiplication. f_1 represents the first fully connected layer (FC_1), which compresses the C channels into C/r ones, where the reduction ratio r is set to 16 in this paper; f_2 represents the second fully connected layer (FC_2), which restores the C channels.
In the spatial attention module, a spatial attention map is produced according to the spatial relationship between features. Similar to the channel attention, max-pooling and average-pooling operations are applied to generate the max-pooled and average-pooled features, respectively. These two feature maps are concatenated and passed through a standard 7×7 convolution to generate the spatial attention map F_S′ ∈ R^{1×H×W}. The output of the spatial attention module, F_S ∈ R^{C×H×W}, can be computed as:

F_S′ = Sig(f^{7×7}(Concat(MaxPool(F_C), AvgPool(F_C)))),
F_S = F_S′ ⊗ F_C,

where Concat represents the concatenation operation, Sig denotes the sigmoid function, and f^{7×7} represents a 7×7 convolution operation. 
The overall MSA module can be summarized as F_MSA = F_S: the input feature F is first transformed into the multi-scale feature F_D, which is then refined sequentially by the channel attention (yielding F_C) and the spatial attention (yielding F_S).
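For concreteness, the channel-attention branch described above can be sketched in NumPy. This is an illustrative sketch only: the weight matrices `W1` and `W2` are hypothetical placeholders for the learned FC_1/FC_2 parameters, and the multi-scale extractor and spatial branch are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel attention over a C x H x W feature map.

    W1 (C x C/r) and W2 (C/r x C) stand in for the shared FC1/FC2 layers;
    the max- and average-pooled descriptors share the same MLP, and the
    resulting C-vector reweights the input channels element-wise.
    """
    C = F.shape[0]
    max_pool = F.reshape(C, -1).max(axis=1)        # per-channel max descriptor
    avg_pool = F.reshape(C, -1).mean(axis=1)       # per-channel mean descriptor

    def mlp(v):                                    # FC1 -> ReLU -> FC2
        return np.maximum(v @ W1, 0.0) @ W2

    attn = sigmoid(mlp(max_pool) + mlp(avg_pool))  # channel attention map F_C'
    return attn[:, None, None] * F                 # refined feature F_C

# Toy example with C = 32 and reduction ratio r = 16 (so C/r = 2)
rng = np.random.default_rng(0)
F = rng.standard_normal((32, 8, 8))
W1 = rng.standard_normal((32, 2)) * 0.1
W2 = rng.standard_normal((2, 32)) * 0.1
F_C = channel_attention(F, W1, W2)
```

Because the sigmoid output lies in (0, 1), the module can only down-weight channels, never amplify them, which matches the gating behaviour described above.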

Densely connected depthwise separable convolution (DSC) module
Generally, in the output layer of most U-shape networks, simple bilinear-interpolation-based upsampling is adopted to produce the final segmentation results [16,21,28]. As shown by the feature maps in Fig. 5(c), this simple upsampling pays little attention to the target. To obtain a stronger response to the segmentation target, a densely connected depthwise separable convolution (DSC) module is presented and embedded as the output layer of GDCSeg-Net, as shown in Fig. 6. In the DSC module, considering the size of the input feature map, four depthwise separable convolutions with different dilation rates (1, 6, 12 and 18) are adopted to capture information at different scales. Through the dense connections, the multi-scale features are fully fused layer by layer. As can be seen from Fig. 5(d), the DSC module focuses precisely on the target features, which improves the segmentation performance.
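The dense-connection wiring can be sketched structurally as follows. This is a hedged sketch of the connectivity pattern only: the placeholder layers stand in for the actual depthwise separable convolutions with dilation rates 1, 6, 12 and 18, which are not reproduced here.

```python
import numpy as np

def dsc_forward(x, layers):
    """Densely connected forward pass (structural sketch of the DSC module).

    Each layer receives the channel-wise concatenation of the input and all
    previous layer outputs, and the final output fuses every intermediate
    feature, so multi-scale information is mixed layer by layer.
    """
    features = [x]
    for layer in layers:
        inp = np.concatenate(features, axis=0)     # concat along channel axis
        features.append(layer(inp))
    return np.concatenate(features, axis=0)        # fused multi-scale output

def make_layer():
    # Hypothetical stand-in for one depthwise separable conv block:
    # maps any input to a fixed 4-channel output.
    return lambda inp: np.tanh(inp[:4])

rng = np.random.default_rng(1)
x = rng.standard_normal((4, 16, 16))
out = dsc_forward(x, [make_layer() for _ in range(4)])   # 4 + 4*4 = 20 channels
```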

Loss function
To effectively solve the data imbalance problem in the training process, the combination of Dice loss and binary cross-entropy (BCE) loss is adopted as the total loss function, which can be defined as follows:

L_total = L_Dice + L_BCE,
L_Dice = 1 − (2 Σ_{i=1}^{N} ŷ_i y_i + ε) / (Σ_{i=1}^{N} ŷ_i + Σ_{i=1}^{N} y_i + ε),
L_BCE = −(1/N) Σ_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)],

where N indicates the batch size, and ŷ_i ∈ [0, 1] and y_i ∈ {0, 1} denote the predicted probability and the ground truth label, respectively. ε is a small smoothing factor.
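The combined loss can be sketched in NumPy as follows; this is an illustrative implementation of the standard Dice + BCE formulation, not the authors' training code:

```python
import numpy as np

def dice_bce_loss(pred, target, eps=1e-6):
    """Combined Dice + BCE loss.

    `pred` holds predicted probabilities in (0, 1), `target` binary labels;
    `eps` is the small smoothing factor from the Dice term.
    """
    pred = np.clip(pred, 1e-7, 1 - 1e-7)           # numerical safety for log
    dice = 1.0 - (2.0 * (pred * target).sum() + eps) / (
        pred.sum() + target.sum() + eps)
    bce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    return dice + bce
```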

Dataset
We carry out extensive validations of the proposed GDCSeg-Net on publicly available datasets including REFUGE, MESSIDOR, RIM-ONE-R3, Drishti-GS and IDRiD. A summary of each dataset, including the number of images, the image resolution and the availability of OD and OC ground truth, together with the data division strategy for the OD and OC segmentation experiments, is shown in Table 1. Due to the high resolution of fundus images and the small-target property of OD and OC, a 512×512 region of interest (ROI) is cropped by the ROI extraction network and taken as the input of the proposed GDCSeg-Net.
The "poly" learning rate policy, lr = base_lr × (1 − iter/total_iter)^power, is adopted, where iter denotes the current number of iterations, total_iter denotes the total number of iterations, and power is set to 0.9. The stochastic gradient descent (SGD) algorithm with an initial learning rate of 0.01, momentum of 0.9 and weight decay of 0.0001 is used to optimize the network. Besides, the batch size is set to 4 and the number of epochs is 80. We have released our code on GitHub [30].
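The "poly" schedule described above can be written as a one-line helper; this is a generic sketch of the standard policy, using the hyperparameters stated in the text:

```python
def poly_lr(base_lr, it, total_iter, power=0.9):
    """'Poly' learning-rate policy: lr = base_lr * (1 - it/total_iter)^power."""
    return base_lr * (1.0 - it / total_iter) ** power

# Example: base_lr = 0.01 as in the paper; the rate decays smoothly to zero.
schedule = [poly_lr(0.01, i, 100) for i in range(101)]
```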

Data augmentation strategies
To improve the generalization of the model and reduce the risk of overfitting, we adopt online data augmentation strategies including left-right flipping, up-down flipping, random rotation (ranging from −30° to 30°) and additive Gaussian noise. For each round of training, 2-5 of these augmentation methods are applied.
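An online augmentation step of this kind can be sketched as below. This is a partial, hedged illustration: only the flips and Gaussian noise are implemented, and the random rotation from the paper is merely noted in a comment (it would typically need an interpolation routine such as `scipy.ndimage.rotate`).

```python
import numpy as np

def augment(image, mask, rng):
    """Randomly apply flip and noise augmentations to an image/mask pair.

    Geometric transforms are applied identically to image and mask;
    photometric noise is applied to the image only. Random rotation
    (-30 to 30 degrees) is omitted here for brevity.
    """
    if rng.random() < 0.5:                         # left-right flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                         # up-down flip
        image, mask = image[::-1, :], mask[::-1, :]
    if rng.random() < 0.5:                         # additive Gaussian noise
        image = image + rng.normal(0.0, 0.01, image.shape)
    return image, mask
```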

Evaluation metrics
To quantitatively evaluate the segmentation performance, two common segmentation evaluation metrics, the Dice coefficient (Dice) and the intersection over union (IoU), are used, which are defined as follows:

Dice = 2TP / (2TP + FP + FN) = 2|Seg ∩ GT| / (|Seg| + |GT|),
IoU = TP / (TP + FP + FN) = |Seg ∩ GT| / |Seg ∪ GT|,

where TP denotes true positives, FP denotes false positives and FN denotes false negatives; Seg and GT denote the segmented mask and the ground truth, respectively. A t-test with α = 0.05 is adopted to evaluate the statistical differences between methods.
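Both metrics follow directly from the confusion counts of the two binary masks, as in this short NumPy sketch:

```python
import numpy as np

def dice_iou(seg, gt):
    """Dice and IoU from binary masks, per the definitions above."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    tp = np.logical_and(seg, gt).sum()             # pixels in both Seg and GT
    fp = np.logical_and(seg, ~gt).sum()            # pixels only in Seg
    fn = np.logical_and(~seg, gt).sum()            # pixels only in GT
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou
```

Note the identity Dice = 2·IoU / (1 + IoU), so the two metrics always rank methods consistently on a single mask pair.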

Optic disc segmentation
For OD segmentation, five datasets including REFUGE, MESSIDOR, RIM-ONE-R3, Drishti-GS and IDRiD are used. According to the mixed training strategy, a total of 2541 fundus images from the five datasets are randomly split into a training set (1340), a validation set (460) and a test set (741). The details of the data division for OD segmentation are listed in Table 1. Comparison experiments and ablation experiments are performed to verify the superiority of the proposed GDCSeg-Net over other state-of-the-art methods and the effectiveness of the proposed MSA and DSC modules, respectively.

(1) Comparison experiments
With the same data split strategy, we compare our method with other excellent CNN-based methods, including FCN [15], U-Net [16], CE-Net [21], CPFNet [28], Attention U-Net [35], U-Net++ [36], Deep ResU-Net [37], ResU-Net++ [38], CS²-Net [39] and SegNet [40]. Table 2 presents the comparison results on REFUGE, MESSIDOR, Drishti-GS, RIM-ONE-R3 and IDRiD. As we can see, the proposed GDCSeg-Net outperforms the mentioned state-of-the-art CNN-based methods. As can be seen from Table 2, all of the 11 networks perform well on the REFUGE dataset. The major reason is that the contrast between OD and the background is obvious, as can be seen from Fig. 2 (e) and (f) and the first row of Fig. 7, making OD segmentation relatively easy. On the MESSIDOR dataset, our method achieves 0.9435 and 0.9700 in IoU and Dice respectively, significantly better than the other methods, except for the Dice index compared with CPFNet (p=0.064). On the Drishti-GS dataset, although the proposed GDCSeg-Net performs slightly worse than CE-Net without statistical significance (the p-values for Dice and IoU are 0.387 and 0.396, respectively), the overall results still show that GDCSeg-Net outperforms CPFNet and the other methods. As can be seen from Fig. 2 (d) and the second row of Fig. 7, the contrast between OD and the background is low in the images of the RIM-ONE-R3 dataset, which increases the difficulty of OD segmentation. Consequently, the performances of Deep ResU-Net, U-Net++, U-Net, Attention U-Net, CS²-Net, SegNet, FCN and ResU-Net++ degrade significantly on RIM-ONE-R3. The proposed GDCSeg-Net significantly outperforms the other methods except CPFNet (p=0.111 and p=0.103 for IoU and Dice, respectively). The IDRiD dataset consists of 81 images with four types of retinal lesions, including hemorrhage (HE), microaneurysms (MA), hard exudates (EX) and soft exudates (SE). These lesions affect OD segmentation to some extent. Besides, the dataset also suffers from low contrast between OD and the background.
Therefore, although the proposed GDCSeg-Net outperforms the other methods, the improvements in the IoU and Dice indexes are not statistically significant compared with FCN, U-Net, Attention U-Net and CE-Net.

(2) Ablation experiments
In order to verify the validity of the proposed MSA module and DSC module, four ablation experiments are conducted. The results of each test are shown in Table 3, in which "Baseline" represents the U-shape encoder-decoder model with pre-trained ResNet34 backbone.
As shown in Table 3, embedding the proposed DSC module (Baseline + DSC) achieves a substantial improvement over the Baseline in the Dice and IoU metrics, especially on the Drishti-GS dataset. Meanwhile, embedding the MSA module (Baseline + MSA) also helps to improve the performance; for example, compared with the Baseline, the Dice index increases by 0.94% and reaches 0.9532 on the RIM-ONE-R3 dataset. As shown in Fig. 7, the proposed method obtains more accurate segmentation results than the Baseline, especially on the IDRiD, Drishti-GS and RIM-ONE-R3 datasets.

Optic cup segmentation
For OC segmentation, we use three datasets: REFUGE, RIM-ONE-R3 and Drishti-GS. According to the mixed training strategy, a total of 1460 fundus images from the three datasets are randomly divided into a training set (600), a validation set (240) and a test set (620). The details of the data division are listed in Table 1. Similar to OD segmentation, comparison experiments and ablation experiments are also performed and analyzed.

(1) Comparison experiments
With the same data split strategy, we compare our method with other excellent CNN-based methods, including FCN, U-Net, CE-Net, CPFNet, Attention U-Net, U-Net++, Deep ResU-Net, ResU-Net++, CS²-Net and SegNet. Table 4 presents the segmentation results on REFUGE, RIM-ONE-R3 and Drishti-GS.
Compared with OD segmentation, OC segmentation is more difficult due to the blurred boundary between OC and OD as well as the smaller OC region. As can be seen from Table 4, on the REFUGE dataset, the proposed GDCSeg-Net significantly outperforms the other methods except CPFNet (p=0.206 and p=0.22 for IoU and Dice, respectively). On the RIM-ONE-R3 dataset, CPFNet achieves the best performance in Dice, while the proposed GDCSeg-Net achieves the best performance in IoU. Through t-test analysis, there are no significant differences among the proposed GDCSeg-Net, CPFNet and CE-Net on the RIM-ONE-R3 dataset. On the Drishti-GS dataset, the proposed GDCSeg-Net significantly outperforms the other methods. Figure 8 shows five OC segmentation results of the different methods, which reveal that the proposed method obtains more accurate segmentation results, especially on the RIM-ONE-R3 and Drishti-GS datasets.

(2) Ablation experiments
In order to verify the validity of the proposed MSA and DSC modules, we also conduct four ablation experiments on OC segmentation. As shown in Table 5, compared with the Baseline, the IoU and Dice of OC segmentation increase markedly with the addition of the DSC and MSA modules. In particular, the IoU and Dice increase from 0.6919 and 0.7975 to 0.7237 and 0.8237 on images from the RIM-ONE-R3 dataset, in which the boundaries between OC and OD are very blurred. These results show that the proposed DSC and MSA modules are beneficial for OC segmentation as well.

Generalization experiments
In order to verify the effectiveness of the mixed training strategy for small-sample datasets such as Drishti-GS, RIM-ONE-R3, IDRiD and an in-house dataset (144 fundus images with myopia from the First People's Hospital Affiliated to Shanghai Jiao Tong University, in which the OD and OC ground truth was annotated under the supervision of an experienced ophthalmologist), we compare the results of single-dataset training ("Single training" in Table 6 and Table 7) with those of mixed training. As shown in Table 6 and Table 7, the mixed training strategy yields a significant improvement in the OD and OC segmentation of small-sample datasets. In particular, the IoU index improves from 0.9302 to 0.9501 for OD segmentation and from 0.7727 to 0.8344 for OC segmentation on the Drishti-GS dataset.

Comparison of the state-of-the-art OD and OC segmentation methods
To further prove the effectiveness of the proposed method, we compare its performance with state-of-the-art OD and OC segmentation methods. As shown in Table 8, the results indicate that the proposed GDCSeg-Net is competitive with other state-of-the-art methods on five different public fundus image datasets. Among them, Tian et al. achieved the best performances on the REFUGE and Drishti-GS datasets, especially in OC segmentation. The possible reasons are as follows: first, the cropped image input for OC segmentation is only 70% of the size of that for OD segmentation during training, which greatly reduces the interference of OD and background. Second, graph convolution can be used to predict object contours with obvious boundaries; as can be seen from Fig. 2 (b), (e) and (f), the boundaries of OC and OD in images from the REFUGE and Drishti-GS datasets are relatively obvious. Shankaranarayana et al. achieved the best OD and OC segmentation performances on RIM-ONE-R3. The possible reason is that they used initial weights obtained from the ORIGA dataset (650 images, no longer available) [41] to further train their network on RIM-ONE-R3. The proposed GDCSeg-Net achieves the best OD segmentation performances on the MESSIDOR, Drishti-GS and IDRiD datasets.

Conclusion and discussions
The OD and OC segmentation in fundus image is an important basis for the analysis of glaucoma.
In this paper, we adopt the mixed training strategy over different datasets for the first time. Based on the U-shape encoder-decoder structure, a general domain-adaptive OD and OC segmentation network (GDCSeg-Net) is proposed, which effectively overcomes the problems of domain shift caused by different acquisition devices and modes and inadequate training caused by small-sample datasets. The proposed MSA module is embedded into the top layer of the encoder to integrate the multi-scale OD and OC feature information with channel and spatial attention mechanisms. The DSC module is embedded as the output layer of GDCSeg-Net, which fully fuses the multi-scale features extracted by depthwise separable convolutions layer by layer via dense connections and leads the network to focus efficiently on the targets. The proposed MSA and DSC modules are effective and universal, and can be easily introduced into other encoder-decoder networks. The comparison experimental results show that the proposed GDCSeg-Net achieves the best overall OD and OC segmentation performance among the compared networks on five fundus image datasets: REFUGE, MESSIDOR, RIM-ONE-R3, Drishti-GS and IDRiD. Although CE-Net achieves comparable performance with the proposed GDCSeg-Net in OD segmentation on the REFUGE, IDRiD and Drishti-GS datasets, it fails to perform well in OC segmentation on the REFUGE, RIM-ONE-R3 and Drishti-GS datasets. Similarly, although CPFNet performs comparably to the proposed GDCSeg-Net in OD segmentation on REFUGE and MESSIDOR, it does not perform well in OD and OC segmentation on the Drishti-GS dataset. These results suggest that the proposed GDCSeg-Net is more general and effective than the state-of-the-art segmentation networks in the OD and OC segmentation task.
In addition, compared with state-of-the-art OD and OC segmentation methods, our method has achieved competitive performance in OD and OC segmentation on five fundus image datasets. As one of our future focuses, we will try to improve the performance of OC segmentation by integrating the newly proposed self-attention-based transformer structure [44,45] into our proposed GDCSeg-Net, which may focus on the blurred boundary between OD and OC. To validate the generality of the proposed GDCSeg-Net, other segmentation tasks such as diabetic retinopathy related fundus lesions and retinal vessel segmentation in fundus images with the proposed GDCSeg-Net will be explored in the future.