Weakly supervised serous retinal detachment segmentation in SD-OCT images by two-stage learning

Automated lesion segmentation is an important task for the quantitative assessment of retinal diseases in SD-OCT images. Recently, deep convolutional neural networks (CNNs) have shown promising advances in automated image segmentation, but they typically depend on large-scale datasets with high-quality pixel-wise annotations. Unfortunately, obtaining accurate annotations is expensive in both human effort and cost. In this paper, we propose a weakly supervised two-stage learning architecture to detect and further segment central serous chorioretinopathy (CSC) retinal detachment with only image-level annotations. Specifically, in the first stage, a Located-CNN is designed to detect the location of lesion regions in whole SD-OCT retinal images and to highlight the distinguishing regions. To generate usable pseudo pixel-level labels, the conventional level set method is employed to refine the distinguishing regions. In the second stage, we customize the active-contour loss function in deep networks to achieve effective segmentation of the lesion area. A challenging dataset is used to evaluate our proposed method, and the results demonstrate that the proposed method consistently outperforms some current

manpower and resources. Hence an automatic segmentation method is highly desired in clinical applications.
Serous retinal detachment, such as pigment epithelial detachment (PED) and neurosensory retinal detachment (NRD), is a prominent characteristic of CSC. It is well known that accurate and automatic segmentation of CSC in SD-OCT images is very challenging, as shown in Fig. 1. There are two main reasons for this difficulty. First, the lesion region closely resembles the background, and the size of the lesion region varies widely, so it is difficult to locate the lesion region precisely. Second, compared with natural images or MRI images, SD-OCT images have low contrast and weak boundaries, which makes accurate segmentation hard to achieve. To address this problem, many studies have developed automatic CSC segmentation methods [5][6][7]. In particular, with the rapid advance of deep learning, recent deep networks have demonstrated successful performance in image segmentation tasks [8][9][10][11]. However, a major bottleneck to good performance in fully supervised learning is the high cost of obtaining high-quality annotations. Because they do not require expensive human effort, weakly supervised image recognition methods [12][13][14] have been extensively studied. Previous work has explored various alternative weak annotations, such as points [15], object bounding boxes [16], and scribbles [17]. Among them, one of the most attractive approaches is to segment images from image-level annotations only [18][19][20], which requires the least human effort. For such methods, the most critical problem is how to locate the target area accurately and densely, so as to obtain high-quality target clues and further improve the training of the segmentation model [21,22]. Such methods have achieved good results, even close to fully supervised learning performance, on the Pascal VOC-07/10/12 object detection datasets [23].
However, weakly supervised learning that uses only image-level labels has not made a significant breakthrough or shown striking improvement on benchmark tasks in the biomedical field. The reason is that, compared with the annotation of natural images, biomedical image segmentation data require professional labeling and a great deal of patience. Many researchers have attempted weakly supervised learning in the medical field [24][25][26][27][28][29], including prostate cancer detection, chest X-ray localization and gastric tumor segmentation. Unfortunately, weakly supervised learning has not yet been applied to SD-OCT images.
In this work, we propose a two-stage weakly supervised learning method to segment CSC accurately and automatically in SD-OCT images using only image-level labels. During the first stage, a discriminative region of the lesion is obtained by our proposed Located-CNN. We then take this region as the initial contour of the level set method and evolve it to produce pseudo pixel-level labels. During the second stage, the generated segmentation labels are used to train the segmentation network, which uses an active-contour loss function. To sum up, the main contributions of this work are three-fold:
• A precise Located-CNN based on a classification network is trained using only image-level labels.
• We customize the active-contour loss function in deep networks to achieve effective segmentation of the lesion area.
• To the best of our knowledge, we are the first to use weakly supervised learning to solve the CSC segmentation problem. Without using pixel-level ground truth anywhere in the segmentation process, our method achieves segmentation results that are competitive with those of methods relying on stronger supervision.

Conventional segmentation methods
For subretinal fluid segmentation, various unsupervised approaches have emerged, ranging from thresholding-based algorithms [30] to more complicated methods such as the enface fundus-driven method (EFD) [5]. Active contour models (ACMs) have shown better performance, as represented by the active contour without edge (ACWE) model [31,32]. In Chan and Vese's work, level set functions are introduced to formulate segmentation as an energy minimization problem that is solved through partial differential equations (PDEs). Then, because of the low contrast and speckle noise in retinal SD-OCT images, semi-supervised learning approaches began to use prior knowledge to overcome these problems. Wang et al. [7] used label propagation and higher-order constraints to segment fluid-associated regions in retinal SD-OCT images, but the performance is heavily influenced by the key slice selected. Wu et al. [6] proposed a three-dimensional continuous max-flow optimization-based approach to segment NRD and PED under the restriction of the selected fluid region. In addition, supervised learning methods, including random forest [33] and K-nearest neighbor [34], have been introduced to separate the fluid region from the background.

CNN-based segmentation methods
Methods Based on Supervised Segmentation: Recent years have witnessed the successful application of deep learning to image segmentation, owing to its ability to extract features automatically. The DeepLab-v1 model [35] brings together atrous convolution and conditional random fields (CRFs) to address the task of pixel-level classification. The DeepLab-v2 model [36] segments objects at multiple scales better by using atrous spatial pyramid pooling (ASPP). The DeepLab-v3 model [37] is general and could be applied to any network. Ronneberger et al. [38] proposed U-Net to segment medical images automatically. More recently, for segmenting subretinal fluid in SD-OCT images, Gao et al. [39] proposed image-to-image double-branched and area-constraint fully convolutional networks (DA-FCN). With strong supervision from pixel-level masks, the above approaches have greatly boosted segmentation performance. However, how to achieve good segmentation performance under weak supervision remains an open problem.
Methods Based on Weakly Supervised Segmentation: There are various forms of weakly supervised segmentation. Among them, the most attractive one is learning to segment images from image-level annotations only. The image-level label, which is easy to obtain, is the simplest supervision for learning to segment. Some works [40][41][42][43][44] use deep activations for localization with image-level labels. This approach aggregates the features of the last convolutional layer to generate discriminative class activation maps (CAM) [45]. However, we observe a critical issue in such solutions: they fail to densely localize the integral regions of the target objects within an image. Using the more discriminative regions found, Wei et al. [46] and Zhang et al. [47] trained extra independent networks to generate class-specific activation maps with the assistance of pre-trained networks in a post-processing step. However, the above methods are not applicable to the medical field, which is characterized by small sample sets and weak boundaries, especially when precise segmentation is to be achieved by weakly supervised learning. For this case, we propose our two-stage learning architecture, which is described in Section 3.

Proposed segmentation method
In this section, we describe the details of the proposed method for automated serous retinal detachment segmentation. First, we present Located-CNN, which is a more precise way of producing object localization maps. Then, the level set method is presented for evolving high-quality pseudo pixel-wise object labels. Finally, the segmentation labels obtained by the above procedure are used as supervision to train the segmentation network with the active-contour loss function. The architecture of our approach is illustrated in Fig. 2. (ii) If the input is judged to be NRD, Located-CNN highlights the lesion region, and we take this salient region as the initial contour of the level set method to obtain pseudo pixel-level labels. Otherwise, the non-NRD images generate empty labels. (iii) Furthermore, the generated pixel-level annotations are employed to train the segmentation model with the active-contour loss function.

Highlighting the lesion area by located-CNN
CAM [45] shed light on how a convolutional neural network can have remarkable localization ability despite being trained only on image-level labels. It shows that the deep neural network really pays attention to object localization during classification. However, this approach struggles to locate the CSC lesion area accurately in SD-OCT images, for two reasons: (1) the lesion region closely resembles the background area; (2) SD-OCT images have low contrast and weak boundaries, which also makes precise location of the lesion area hard. Based on the characteristics of CSC, we customized Located-CNN to highlight the lesion area. Specifically, we made two improvements to CAM: (i) The fully connected layer is discarded to avoid the influence of its weights on the location feature map, because we found that the effect of these weights on the localization map can cause the network's attention to deviate. (ii) To locate the lesion more accurately, a spatial attention module was added to capture the location information of the lesion area. With these two improvements, our method can overcome the two difficulties above.
For a given image, suppose we are given a convolutional neural network whose last convolutional feature maps are denoted as $S \in \mathbb{R}^{K \times H \times W}$, where $H \times W$ is the spatial size of the feature map and $K$ is the number of channels. In our method, we denote the max value of the $k$-th feature map by

$$P_k = \max_{h,w} S^k_{h,w}, \quad k = 1, \dots, K,$$

where $P_k$ is obtained by aggregating the $K \times H \times W$ output over the $H \times W$ different positions of the input window, using a global max-pooling operation, into a single $K \times 1$ vector. Inspired by [48], we generate a spatial attention map by exploiting the inter-spatial relationship of features. The spatial attention focuses on 'where' an informative part is. To compute the spatial attention, we apply a mean operation along the channel axis to generate an efficient feature descriptor, from which the attention map is defined using the sigmoid function $\sigma$. After the above operations, we optimize the classification network by minimizing the binary cross-entropy (BCE) loss:

$$L_{BCE} = -\left[ v \log u + (1 - v) \log (1 - u) \right],$$

where $u$ and $v$ represent the predicted value and the image-level label, respectively. With this modified network, we can obtain the location of the lesion area after training. Specifically, the feature map obtained from the last convolutional layer of the network is defined as follows:

$$F_{H,W} = \sum_{k=1}^{K} S^k_{H,W},$$

that is, all the features $S^k_{H,W}$ are added along the $K$ dimension to get $F_{H,W}$. $F_{H,W}$ is then simply upsampled to the size of the input image and denoted as $F$. After this series of operations, we find that part of the lesion area is accurately highlighted by Located-CNN. It is worth noting that Located-CNN is a classification network: after this process, it can tell whether the input contains a lesion or not. Only when the input is judged to be NRD does Located-CNN highlight the area.
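The pooling, attention and heatmap steps described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`located_cnn_head`, `upsample_nearest`, `bce_loss`) are hypothetical, the attention map is assumed to be a plain sigmoid of the channel-wise mean, and nearest-neighbour upsampling is assumed because the text does not specify the interpolation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def located_cnn_head(S):
    """S: last-layer feature maps with shape (K, H, W)."""
    P = S.max(axis=(1, 2))               # global max pooling, one value per channel
    attention = sigmoid(S.mean(axis=0))  # ASSUMED form of the spatial attention map
    F = S.sum(axis=0)                    # sum over the K channels, shape (H, W)
    return P, attention, F

def upsample_nearest(F, out_h, out_w):
    """Nearest-neighbour upsampling of the heatmap (interpolation is an assumption)."""
    rows = (np.arange(out_h) * F.shape[0]) // out_h
    cols = (np.arange(out_w) * F.shape[1]) // out_w
    return F[np.ix_(rows, cols)]

def bce_loss(u, v, eps=1e-7):
    """Binary cross-entropy between a predicted probability u and a 0/1 label v."""
    u = np.clip(u, eps, 1.0 - eps)       # avoid log(0)
    return float(-(v * np.log(u) + (1 - v) * np.log(1 - u)))
```

How the pooled vector and the attention map are combined into the final classification score is not spelled out in the text, so the sketch leaves that step out.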

Pseudo pixel-level label obtained by using level set
In the medical segmentation field, the quality and quantity of pixel-level labels greatly influence the performance of deep learning. Unfortunately, in our work, we do not have accurate pixel-level labels to train a segmentation network. To obtain pseudo pixel-level labels of the NRD lesion area in SD-OCT images, we therefore use the level set method to segment automatically. However, this method needs an accurate initial contour when evolving toward the object boundary. In level set based image segmentation techniques, initial contours should be generated carefully, since their sizes and locations can affect the segmentation performance. To obtain the initial contour automatically, a suppression mask is designed to select the highlighted regions from Section 3.1: applying hard thresholding on the heatmap $F$ reveals the discriminative region $M$. Then, we use the level set method to evolve the obtained salient seed regions $M$. The Chan-Vese model (C-V model) [31], a region-based level set method, can perform well on medical image segmentation by using the statistical intensity distribution of different regions. The energy functional is defined as follows:

$$E(c_1, c_2, \phi) = \mu \int_{\Omega} \left| \nabla H_{\varepsilon}(\phi(x, y)) \right| dx\,dy + \lambda_1 \int_{\Omega} \left| I(x, y) - c_1 \right|^2 H_{\varepsilon}(\phi(x, y))\, dx\,dy + \lambda_2 \int_{\Omega} \left| I(x, y) - c_2 \right|^2 \left( 1 - H_{\varepsilon}(\phi(x, y)) \right) dx\,dy,$$

where $\mu$, $\lambda_1$ and $\lambda_2$ are positive constants. The first term of the energy functional is the length regularization term, which is used to smooth the evolution region $\Omega$; the second and third terms are fidelity terms, which are responsible for attracting the evolution region $\Omega$ to the expected boundary. $\phi(x, y)$ is a level set function and $I(x, y)$ represents the original image. $c_1$ and $c_2$ are the global average intensities inside and outside the contour, which can be defined as follows:

$$c_1 = \frac{\int_{\Omega} I(x, y)\, H_{\varepsilon}(\phi(x, y))\, dx\,dy}{\int_{\Omega} H_{\varepsilon}(\phi(x, y))\, dx\,dy}, \qquad c_2 = \frac{\int_{\Omega} I(x, y) \left( 1 - H_{\varepsilon}(\phi(x, y)) \right) dx\,dy}{\int_{\Omega} \left( 1 - H_{\varepsilon}(\phi(x, y)) \right) dx\,dy},$$

where $H_{\varepsilon}(\phi(x, y))$ is the Heaviside function:

$$H_{\varepsilon}(z) = \frac{1}{2} \left( 1 + \frac{2}{\pi} \arctan\left( \frac{z}{\varepsilon} \right) \right),$$

where $\varepsilon$ is a tiny positive constant. By minimizing the energy functional of the level set, we obtain the confidence regions as pixel-level labels of the lesion area. In our work, the lesion area can be accurately segmented after 300 iterations.
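To make the thresholding and level set evolution described above concrete, here is a minimal NumPy sketch of a Chan-Vese style gradient-descent update. It is an illustration under stated assumptions rather than the authors' MATLAB code: the threshold fraction `tau`, the step size `dt`, the default `mu`, and `eps=1.0` (chosen so the toy example evolves quickly) are all hypothetical values, and the helper names are invented for this example.

```python
import numpy as np

def seed_mask(F, tau=0.5):
    """Hard-threshold the heatmap F; tau is a hypothetical fraction of its maximum."""
    return F > tau * F.max()

def heaviside(phi, eps=1.0):
    """Regularized Heaviside function of the C-V model."""
    return 0.5 * (1.0 + (2.0 / np.pi) * np.arctan(phi / eps))

def dirac(phi, eps=1.0):
    """Derivative of the regularized Heaviside function."""
    return (eps / np.pi) / (eps ** 2 + phi ** 2)

def curvature(phi):
    """Divergence of the normalized gradient of phi (length regularization)."""
    gy = np.gradient(phi, axis=0)
    gx = np.gradient(phi, axis=1)
    norm = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
    return np.gradient(gx / norm, axis=1) + np.gradient(gy / norm, axis=0)

def chan_vese(image, seed, mu=0.1, lam1=1.0, lam2=1.0, dt=0.5, n_iter=300):
    """Evolve a level set from the seed region; returns a boolean mask (phi > 0)."""
    phi = np.where(seed, 1.0, -1.0)      # positive inside the initial contour
    for _ in range(n_iter):
        H = heaviside(phi)
        c1 = (image * H).sum() / (H.sum() + 1e-8)                  # mean inside
        c2 = (image * (1.0 - H)).sum() / ((1.0 - H).sum() + 1e-8)  # mean outside
        force = (mu * curvature(phi)
                 - lam1 * (image - c1) ** 2
                 + lam2 * (image - c2) ** 2)
        phi = phi + dt * dirac(phi) * force  # gradient-descent step on the energy
    return phi > 0
```

On a toy image with a bright square and a small seed inside it, setting `mu=0.0` leaves only the two fidelity terms, and the contour grows from the seed until it covers the square.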

Segmentation network with active-contour loss function
The pixel-level labels obtained by the above procedure are used as supervision to train a segmentation network. As described in Section 3.2, the closed lesion area can be evolved by the level set method from a salient located seed. Essentially, the level set approach is based on features of the image itself, such as the gray-scale distribution of foreground and background. However, these are only low-level features of the image, for example color and texture, and they cannot capture the high-level semantic features of the target that relate to the object category itself. We therefore consider giving the level set method the ability to incorporate high-level semantic features.
Inspired by [49], we embed the level set into the neural network as its loss function. In deep learning, the deep features are rich in semantic information, and the whole network is optimized iteratively through gradient descent. Furthermore, different loss functions drive the predicted value toward the real value in different ways. In this way, the level set, used as a loss function, is combined with high-level semantic features during network training while retaining its influence on low-level texture and boundary features. The active contour loss function (ACLoss) takes a form analogous to the C-V energy functional. Different from the C-V model above, $d_1$ and $d_2$ represent the gray mean of the inside (foreground) and outside (background). Here, owing to the supervised learning framework, $d_1$ and $d_2$ can simply be defined in advance as the constants $d_1 = 1$ and $d_2 = 0$. $u$ and $v$ represent the output of the network and the pixel-level label, respectively. This loss combines the strength of the level set method in preserving boundaries with the strength of deep learning in extracting high-level features of targets. The framework proposed above, which generates pseudo pixel-level labels and combines them with the segmentation network, is named "Ours".
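As a rough illustration of how such a loss can be evaluated, the sketch below computes a length term plus two region terms for a single 2D prediction. It assumes a form in the spirit of the active contour loss of [49]; the exact discretization and the weights `mu`, `lam1` and `lam2` are assumptions made for this example, and the function name is hypothetical.

```python
import numpy as np

def active_contour_loss(u, v, mu=1.0, lam1=1.0, lam2=1.0, d1=1.0, d2=0.0, eps=1e-8):
    """u: predicted probabilities in [0, 1]; v: pixel-level label; both shaped (H, W)."""
    # Length term: summed gradient magnitude of the prediction (forward differences).
    du_y = u[1:, :-1] - u[:-1, :-1]
    du_x = u[:-1, 1:] - u[:-1, :-1]
    length = np.sqrt(du_x ** 2 + du_y ** 2 + eps).sum()
    # Region terms with the constants d1 = 1 (foreground) and d2 = 0 (background).
    region_in = (u * (v - d1) ** 2).sum()
    region_out = ((1.0 - u) * (v - d2) ** 2).sum()
    return float(mu * length + lam1 * region_in + lam2 * region_out)
```

With `d1 = 1` and `d2 = 0`, both region terms vanish when the prediction equals the label, so a perfect prediction is penalized only through the length of its boundary.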

Implementation details
For the first-stage classification network, we choose a popular CNN, ResNet [50]. Specifically, for ResNet-101, we remove layer4 and layer5 after layer3 and replace them with global max pooling followed by an averaging operation. We use Adam [51] with an initial learning rate of 1e-4 to update the weights. In our classification task, normal and NRD SD-OCT images are labeled as 0 and 1, respectively. For the second learning stage, we use DeepLab [37] as our segmentation network and apply the sigmoid function to the final output. Our backbone network is ResNet-101, and the dilation rates in ASPP are 1, 6, 12 and 18. The generated pixel-level annotations are employed to train the segmentation network with the active-contour loss function. In practice, we set $\lambda_1 = \lambda_2 = 1$. The classification network and the segmentation network are implemented using Keras and PyTorch. The level set method and the statistical analysis of the experiments are implemented in MATLAB R2017b. Both networks are trained on an NVIDIA GeForce GTX 1080Ti with 11 GB of memory.
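To give a feel for what the ASPP dilation rates mean, the following short sketch computes the effective spatial extent of a dilated convolution kernel. The 3×3 kernel size is an assumption made for illustration (the text does not state the kernel sizes), and `effective_kernel_size` is a hypothetical helper.

```python
def effective_kernel_size(kernel_size, rate):
    """Spatial extent covered by a kernel whose taps are spaced `rate` pixels apart."""
    return kernel_size + (kernel_size - 1) * (rate - 1)

rates = [1, 6, 12, 18]  # ASPP dilation rates from the text
sizes = [effective_kernel_size(3, r) for r in rates]
print(sizes)  # [3, 13, 25, 37]
```

Under that assumption, the four branches look at neighbourhoods of 3, 13, 25 and 37 pixels per side, which is how ASPP gathers context at several scales without adding parameters.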

Dataset and performance metrics
Dataset: In our work, 23 volume cases from 12 patients containing NRD fluid from [5,52] make up our dataset. Each case contains 128 images, and the size of each image is 1024×512. This challenging dataset is used to evaluate our proposed method. Note that a patient may have more than one case, which means that cases from the same patient may have similar tissue characteristics. To ensure that patients are independent of each other, we divided the 12 patients into four groups on a patient-by-patient basis for cross-validation. The results of the four groups of experiments were then averaged as the final evaluation index. This work was approved by the Institutional Review Board (IRB) of the First Affiliated Hospital of Nanjing Medical University with informed consent. We obtained two sets of segmentation ground truth labeled by two experienced ophthalmology experts.
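A patient-wise split of this kind can be sketched as below. The round-robin assignment of patients to groups and the case and patient identifiers are assumptions for illustration only; the text states just that the 12 patients were divided into four groups.

```python
def patient_folds(case_to_patient, n_folds=4):
    """Return (train_cases, test_cases) pairs so that no patient spans both sides."""
    patients = sorted(set(case_to_patient.values()))
    groups = [[] for _ in range(n_folds)]
    for i, patient in enumerate(patients):
        groups[i % n_folds].append(patient)   # ASSUMED round-robin grouping
    splits = []
    for held_out in groups:
        test = [c for c, p in case_to_patient.items() if p in held_out]
        train = [c for c, p in case_to_patient.items() if p not in held_out]
        splits.append((train, test))
    return splits

# Hypothetical mapping: 23 cases spread over 12 patients.
case_to_patient = {f"case{i:02d}": f"patient{i % 12:02d}" for i in range(23)}
splits = patient_folds(case_to_patient)
```

Splitting by patient rather than by case keeps all volumes of one patient on the same side of each fold, which is the independence requirement described above.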
Metrics: We define the following evaluation criterion to indicate the accuracy of the localization. Precision is quantified as:

$$\text{Precision} = \frac{|M \cap P|}{|M|},$$

where $M$ is the region highlighted by hard thresholding and $P$ is the ground truth labeled by the experts. The area of the intersection of $M$ and $P$ reflects how accurately the lesion region is located: the larger the value, the more accurate the localization.
Three criteria are used to evaluate the segmentation results: the true positive volume fraction (TPVF), the positive predictive value (PPV) and the Dice similarity coefficient (DSC):

$$\text{TPVF} = \frac{|V_{TP}|}{|V_G|}, \qquad \text{PPV} = \frac{|V_{TP}|}{|V_{TP}| + |V_{FP}|}, \qquad \text{DSC} = \frac{2\,|V_{TP}|}{|V_R| + |V_G|},$$

where $|\cdot|$ denotes volume, $V_R$ is the result predicted by our method, $V_{TP}$ and $V_{FP}$ are the true positive and false positive volumes of the result of our method, respectively, and $V_G$ is the ground truth labeled by the experts.
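The localization precision and the three segmentation criteria can be computed from boolean masks as in the sketch below. It assumes the standard definitions of these quantities and uses hypothetical function names; volumes are simply voxel counts here.

```python
import numpy as np

def location_precision(M, P):
    """Fraction of the highlighted region M that lies inside the ground truth P."""
    return float((M & P).sum() / M.sum())

def segmentation_metrics(pred, gt):
    """Return (TPVF, PPV, DSC) for a boolean prediction and boolean ground truth."""
    tp = (pred & gt).sum()    # true positive volume
    fp = (pred & ~gt).sum()   # false positive volume
    tpvf = tp / gt.sum()
    ppv = tp / (tp + fp)
    dsc = 2 * tp / (pred.sum() + gt.sum())
    return float(tpvf), float(ppv), float(dsc)

# Toy example: 4 ground-truth voxels, 6 predicted voxels, 3 of them overlapping.
gt = np.zeros(10, dtype=bool)
gt[0:4] = True
pred = np.zeros(10, dtype=bool)
pred[1:7] = True
print(segmentation_metrics(pred, gt))  # (0.75, 0.5, 0.6)
```

In the toy example the prediction recovers three quarters of the ground truth (TPVF) but half of what it predicts is wrong (PPV), and DSC summarizes both effects in a single number.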

Location precision and evolution evaluation
To obtain high-quality pseudo pixel-level labels for training the segmentation network, we customized a Located-CNN to highlight NRD lesion regions and then used the level set method to generate pixel-level labels automatically. To illustrate the importance of Located-CNN, we compared its classification accuracy and location precision with those of CAM [45]. From Fig. 3, we can observe that the customized location approach highlights the lesion region better than CAM, which focuses on the retina. The classification accuracies of CAM and Located-CNN were 98.44% and 97.25%, respectively. Although there is little difference between CAM and Located-CNN in classification accuracy, Located-CNN focuses on the lesion region more accurately, and its localization precision reaches 90.25%. These results show that our Located-CNN does not significantly reduce classification performance while ensuring higher localization precision. Furthermore, the region located by our method is more suitable than that of CAM as the initial contour for generating pixel-level labels.
In addition, using the location seeds obtained above, we apply the level set method to evolve the lesion region. Then, we quantitatively evaluate all the test sets, as shown in Fig. 4.

Fig. 4. The quantitative results obtained from the evolution of the level set method. Blue and orange represent the assessment results of the two experts.

Comparison of four loss functions
Here, we use four different loss functions to train a segmentation model on the pseudo pixel-level labels generated in the previous section. Besides the active-contour loss, three other loss functions are compared: MSELoss (MSE), BCELoss (BCE) and DiceLoss (Dice). At the bottom of Table 1, we quantitatively analyze the influence of the different loss functions on the segmentation network. On the whole, the active-contour loss function is better than the other three loss functions. Figure 5 shows the segmentation results of four examples using DeepLab and the four different loss functions. From left to right, the segmentation results by Ours+MSE [53], Ours+BCE, Ours+Dice [54] and Ours+AC [49] are shown, respectively. It can be seen directly from Fig. 5 that the result obtained with ACLoss as the loss function is closer to the ground truth.
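For reference, the three baseline losses can be written in a few lines of NumPy. These are the commonly used textbook forms, given here as an assumption since the paper's own definitions are not available in this text; in particular, Dice losses come in several variants (with or without smoothing or squared terms).

```python
import numpy as np

def mse_loss(u, v):
    """Mean squared error between prediction u and label v."""
    return float(np.mean((u - v) ** 2))

def bce_loss(u, v, eps=1e-7):
    """Mean binary cross-entropy; u is clipped to avoid log(0)."""
    u = np.clip(u, eps, 1.0 - eps)
    return float(-np.mean(v * np.log(u) + (1.0 - v) * np.log(1.0 - u)))

def dice_loss(u, v, eps=1e-7):
    """One minus the soft Dice overlap (ASSUMED variant without squared terms)."""
    return float(1.0 - 2.0 * np.sum(u * v) / (np.sum(u) + np.sum(v) + eps))

u = np.array([0.5, 0.5])   # toy network output
v = np.array([1.0, 0.0])   # toy pixel-level label
```

MSE and BCE score every pixel independently, whereas Dice measures overlap of the whole region, which is one reason the losses can rank differently on TPVF, PPV and DSC.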

Comparison with existing methods
In this paper, eight state-of-the-art methods are used to further verify the validity of our model: label propagation and higher-order constraint (LPHC) [7], a random forest classifier based method (RF) [33], a stratified sampling k-nearest neighbor classifier based algorithm (SS-KNN) [34], an enface fundus-driven method (EFD) [5], a continuous max-flow approach (CMF) [6], fully convolutional networks (FCN) [21], a fuzzy level set with cross-sectional voting (FLSCV) [32], and the double-branched and area-constraint fully convolutional networks (DA-FCN) [39]. As presented in Table 1, the quantitative comparison indicates that our approach achieves more accurate segmentation than the other methods in some aspects. Note that our method is not directly comparable to the other methods: they rely either on pixel-level labels or on the layer information of the lesion area, whereas our method uses only image-level labels from beginning to end. The results show that our proposed method is very close to the segmentation results of fully supervised learning and even obtains the best result on the TPVF index. In addition, we evaluated the qualitative segmentation of the NRD volume. The qualitative analysis in Fig. 6 shows that the proposed method locates and segments NRD regions more accurately than other existing methods. Figure 9 shows the 3D volume segmentation results of our method presented in Table 1. From the figure, we can also see the continuity and integrity of our segmentation result. We use linear correlation analysis and the Bland-Altman reproducibility approach, shown in Fig. 8, to compare our approach with the segmentation results of expert 1 and expert 2, respectively; the analysis indicates that the prediction results show high agreement with the ground truth.
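The agreement analysis mentioned above can be illustrated with a small standard-library sketch that computes a Pearson correlation coefficient and Bland-Altman statistics (bias and 95% limits of agreement). The volume values below are made up purely for illustration and are not results from the paper; the function names are hypothetical.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def bland_altman(x, y):
    """Return (bias, lower limit, upper limit) using bias +/- 1.96 * SD of differences."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Illustrative numbers only: volumes from a method and from one expert.
method = [1.0, 2.0, 3.0, 4.0]
expert = [1.0, 3.0, 2.0, 4.0]
r = pearson_r(method, expert)
bias, lower, upper = bland_altman(method, expert)
```

A bias near zero with narrow limits of agreement, together with a correlation near one, is what "high agreement with the ground truth" means in this kind of analysis.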

Multi-lesions segmentation analysis
At present, our learning framework can effectively segment a single lesion area (NRD only) with image-level annotations. Furthermore, our model is verified on multi-lesion data (NRD and PED), and Fig. 7 shows the segmentation results of two examples containing both NRD and PED using DeepLab and two different loss functions. From left to right, the original SD-OCT image, the ground truth, and the segmentation results by Ours+AC and Ours+Dice are shown, respectively. The results show that our method preserves boundaries well in multi-lesion segmentation, which further demonstrates the robustness of our method.

Discussion
Training an effective segmentation network requires precise pixel-level labels, so generating usable pixel-level labels is critical for weakly supervised learning. In this work, we propose a two-stage weakly supervised learning method to segment CSC accurately and automatically in SD-OCT images using only image-level labels. A challenging dataset is used to evaluate our proposed method. The results demonstrate that the proposed method consistently outperforms some current models trained with different supervision levels and is even competitive with those relying on stronger supervision. Furthermore, it can be applied to segment fluid tissues in OCT images, such as lamellar macular holes and macular pseudoholes. At the same time, this will significantly reduce the labeling workload of professional doctors and provide solutions to related problems.
In our work, the critical question is whether the localization maps lie in the lesion area rather than merely reflecting where the features respond. From Fig. 3, it can be observed that our proposed Located-CNN is a more precise way of producing object localization maps than conventional CAM. Thanks to Located-CNN, our method can evolve the contour automatically with the help of the level set method and obtain high-quality pseudo labels. Figure 4 shows that the evolution result of the level set method achieves usable quantitative results. Although the results are relatively satisfactory, we found that the pseudo pixel-level labels obtained depend on the location precision, and the choice of threshold itself causes some salient regions to be missed. Another reason is that the level set method relies on the gray-level distribution during evolution. Therefore, when the lesion area and the surrounding tissue have similar gray-level distributions, over-segmentation occurs. We define these abnormal samples as noise labels in the pseudo pixel-level labels obtained by the level set method. The problems mentioned above motivate us to address noise labels in the future.
Fortunately, the noise labels described above account for only a small fraction of the training samples. Because of the similarity among samples, the proportion of complete lesion segmentation results is much higher than that of noise labels. The segmentation network therefore leans toward the distribution of complete lesion segmentation results during learning, and in doing so plays a role in correcting some noise labels. Additionally, the loss function is a measure that reflects the gap between the predicted output and the ground truth. Therefore, we adopted four different loss functions to train the segmentation model. It can be seen directly from Fig. 5 that the result obtained with ACLoss as the loss function is closer to the ground truth. At the bottom of Table 1, we quantitatively analyze the influence of the different loss functions on the segmentation network. Although BCE achieves the best performance on the TPVF index, it gives the worst performance in PPV and DSC. In contrast, AC is well balanced and very close to the fully supervised segmentation method. We therefore choose the active-contour loss function for our segmentation network, named Ours+AC. To further verify the validity of our model, eight state-of-the-art methods are compared, as presented in Table 1. The results show that our proposed method is very close to the segmentation results of fully supervised learning. Besides, we can see more intuitively from the last two columns of Fig. 6 that our method is very close to DA-FCN. In the last part of the experiment, we further verify our model's sensitivity and robustness for multiple lesions. From Fig. 7, we can see that Ours+AC achieves a better segmentation effect than Ours+Dice. Although our proposed method achieves encouraging results using weakly supervised learning, it still has some limitations.
Firstly, the noise labels in the pseudo pixel-level labels affect the segmentation accuracy of our model to some extent, and our proposed method does not include a noise label suppression module. Therefore, we would like to design a noise label suppression module in the future to improve the quality of the pseudo pixel-level labels and further improve the performance of the segmentation network. Secondly, a large amount of data benefits the segmentation performance of a deep learning model by ensuring the diversity of the data the model learns from and by overcoming the uncertainty of some test data. Nevertheless, the limitations of our approach lie not only in the need for a larger dataset and in the quality of the data, but also in the difficulty of annotating the data, which requires a professional physician and is a long and costly process. Meanwhile, our proposed method focuses on the segmentation of a single lesion area. It is unable to identify each type of lesion when the SD-OCT image contains multiple lesion areas. In the future, we hope to segment serous retinal detachment efficiently by combining semi-supervised and active learning methods.

Conclusion
To address the problem that obtaining pixel-level labels requires a great deal of time and money in medical image segmentation, a two-stage learning architecture is proposed in this work for weakly supervised SD-OCT retinal image segmentation. Extensive experimental results validate our proposed weakly supervised learning architecture on a highly challenging dataset. Compared with other methods, our segmentation results are even competitive with those relying on stronger supervision. The proposed weakly supervised learning architecture could greatly reduce the cost of obtaining pixel-level labels and ease the limitations of ophthalmic image processing. In addition, we found that, for this task, there is some independence in our method, and we will try to break that independence by mining the potential of deep learning. In the future, we hope to realize automated serous retinal detachment segmentation in a more efficient way.