Multi-scale GCN-assisted two-stage network for joint segmentation of retinal layers and disc in peripapillary OCT images

An accurate and automated tissue segmentation algorithm for retinal optical coherence tomography (OCT) images is crucial for the diagnosis of glaucoma. However, due to the presence of the optic disc, the anatomical structure of the peripapillary region of the retina is complicated and is challenging for segmentation. To address this issue, we developed a novel graph convolutional network (GCN)-assisted two-stage framework to simultaneously label the nine retinal layers and the optic disc. Specifically, a multi-scale global reasoning module is inserted between the encoder and decoder of a U-shape neural network to exploit anatomical prior knowledge and perform spatial reasoning. We conducted experiments on human peripapillary retinal OCT images. The Dice score of the proposed segmentation network is 0.820$\pm$0.001 and the pixel accuracy is 0.830$\pm$0.002, both of which outperform those from other state-of-the-art techniques.


Introduction
Glaucoma is the leading cause of irreversible blindness globally, which affects approximately 64.3 million individuals worldwide [1]. In China, around 13.12 million people were affected by glaucoma in 2015, and the number is projected to reach 25.16 million by 2050 [2]. This will cause a heavy burden on public health. Currently, the most effective way to prevent glaucoma-related vision loss is early diagnosis and early intervention. In particular, detecting small morphological changes of retinal layers, such as the thinning of the retinal nerve fiber layer (RNFL) and ganglion cell layer (GCL), has critical value on the precise diagnosis of glaucoma [3].
Optical coherence tomography (OCT) as a non-invasive three-dimensional imaging modality is commonly used in eye clinics for retinal inspection. Due to the micro-meter-level axial resolution, it provides a unique capability to directly visualize the stratified structure of the retina and access their corresponding thicknesses. OCT-derived thickness of peripapillary RNFL is a common indicator in early-stage glaucoma diagnosis [4]. Therefore, a precise tissue segmentation of retinal OCT images becomes a critical step towards successful early diagnosis of glaucoma.
However, manual segmentation is time-consuming and laborious, while an accurate automated algorithm is desirable by both clinicians and researchers. Numerous automated retinal OCT segmentation techniques have been proposed in the past decades [5][6][7][8][9][10][11][12][13][14][15][16][17]. OCTExplorer is a prominent example for retinal layer boundary extraction, which is based on a conventional graph-theory algorithm [8]. Alonso-Caneiro et al. and Tian et al. designed graph-theory-based boundary detectors to extract choroidal boundary after image pre-processing [6,14]. Mayer et al. proposed an energy-minimization-based algorithm for retinal nerve fiber layer surface segmentation in circular OCT images [10]. Lang et al. presented a random forest boundary classifier to segment eight retinal layers in macular cube images [9]. Chiu et al. reported a kernel regression-based segmentation method for retinal OCT images with diabetic macular edema [7]. Recently, convolutional neural networks (CNN) have been widely applied to segment images obtained from various modalities and thus enable exciting applications [18][19][20][21][22][23][24][25][26]. Fully convolutional networks (FCN) [27] and U-Net [28] are two popular candidates for medical image segmentation. For retinal OCT image segmentation, most state-of-the-art models [29][30][31][32][33][34][35][36][37][38] could be considered variants of the encoder-decoder architecture like FCN and U-Net. Roy et al. proposed ReLayNet for end-to-end segmentation of macular OCT B-scans into retinal layers and fluid masses [34]. This work is among the first deep learning-based methods for automated segmentation of retinal OCT images. Yang et al. designed an attention-guided channel-to-pixel convolution network for retinal layer segmentation with choroidal neovascularization [37]: a channel-to-pixel block and an "edge loss function" were used to segment the retinal layer with blurry boundaries. To address the large morphological variations of the retinal layers, they also employed the attention mechanism. However, these two techniques were mainly targeting macular retinal image segmentation rather than that of peripapillary retinal images. Zang et al. developed an automated segmentation method for peripapillary retinal layer segmentation [38]. The left and right boundaries of the optic disc were first determined based on the estimated position of Bruch's membrane opening in radially resampled B-scans. The retinal layer boundaries were then segmented by combining a convolutional neural network with a multi-weight graph search algorithm. Devalla et al. proposed a dilated-residual U-Net (DRUNET) to facilitate end-to-end segmentation of the individual neural and connective tissues of the optical nerve head [30]. However, they only segmented the retina into five layers and did not fully segment the optic disc from its connected tissues. In summary, most aforementioned techniques perform segmentation based on the textural features of the OCT images, while abundant anatomical priors available in the peripapillary retinal OCT images are not utilized.
In this manuscript, we report our recent study on explicit exploiting the prior knowledge existed in the peripapillary OCT images. We argue that all peripapillary OCT images obtained by following a strict clinical protocol should share a similar anatomical arrangement: the optic nerve head, which is a large structure, is located in the center region of the image, while much thinner retinal layers are stratified on both sides, as shown in Fig. 1. Inspired by Jamal et al.'s work that uses a graph to represent the domain knowledge and the structural relationship of the tissues [39], we designed a novel multi-scale graph convolutional network (GCN)-assisted two-stage network for joint segmentation of retinal layers and optic disc in peripapillary OCT images to fully take advantage of the anatomical priors. To show the efficacy of the proposed framework, experiments were conducted on a collected peripapillary OCT dataset, which consists of a total number of 122 OCT B-scans from 61 patients, and another public dataset [40]. The proposed model demonstrated superior performances on both datasets in comparison with the baselines and the state-of-the-arts. In the future, we plan to integrate the proposed segmentation framework into a diagnostic workflow for early-stage glaucoma detection. The dataset and the source codes are now publicly available online at https://github.com/Jiaxuan-Li/MGU-Net.

Overview of segmentation framework
The schematic diagram of the proposed segmentation framework is given in Fig. 2. The entire framework consists of three components: the optic disc detection network, the retinal layer segmentation network, and the fusion module. An input OCT image is first processed by the optic disc detection network, through which a mask indicating the location of the optic disc and the corresponding feature map will be obtained. We then apply the mask on the input image to generate a disc-free image, which is later fed to the retinal layer segmentation network without being sliced. Similarly, a feature map that delineates the nine retinal layers is also obtained and later concatenated with that of the optic disc from the first stage. Finally, a softmax activation function is used to generate the segmented output based on the concatenated feature map in the fusion module. The entire framework is trained in an end-to-end fashion: two loss functions are defined to penalize both the intermediate disc detection and the final segmentation. Specifically, the optic disc detection network and the retinal layer segmentation network are designed to exploit the anatomical priors of the peripapillary region of the retina and to address the segmentation challenges imposed by the variation of the thicknesses among different retinal layers as illustrated in Fig. 3. In the right panel of Fig. 3, it is observed that the non-disc and disc regions are horizontally arranged in a "non-disc"-"disc"-"non-disc" fashion on the global scale. On the other hand, if we zoom in to the "non-disc" region as shown in the left panel of Fig. 3, we could appreciate that the retinal layers with various thicknesses are instead vertically arranged. Therefore, the optic disc detection and the retinal layer segmentation networks are devised with graph reasoning blocks with different design goals: the former is to perform long-range horizontal spatial reasoning, while the latter is to capture the multi-level vertical structures.

Graph reasoning block
The key module used in both the optic disc detection network and the retinal layer segmentation network is the graph reasoning block.
Inspired by the graph-based global reasoning network [41,42], we devised a graph reasoning block to effectively extract the global features of 9 retinal layers and optic disc. The schematic diagram of the graph reasoning block, which consists of four operations, is depicted in Fig. 4. First of all, a local feature map X ∈ R C×H×W in the latent space is fed to two convolutional layers in parallel to generate two maps: one feature map with reduced dimension and one projection matrix. After that, the reduced dimension feature is reshaped to X r ∈ R C r ×HW , while the projection matrix is reshaped and transposed as X a ∈ R C n ×HW . A matrix multiplication between X r and X a is then performed to obtain a node feature map H ∈ R C r ×C n before its being sent to a GCN block. We further connect X to a convolutional layer to create an inverse projection matrix X d ∈ R C n ×H×W . The output of the GCN block is multiplied by the reshaped X d ∈ R C n ×H×W to transform back to the original latent space, which is then reshaped and undergo another convolutional layer to eventually obtained the feature map M ∈ R C×H×W . Finally, we perform an element-wise addition of M and X to acquire the new feature map Y ∈ R C×H×W It is clear that the new feature map Y contains both the information from the global feature and the original feature, which enables its capability of processing long-range contextual information. Fig. 3. Graph-based representation of peripapillary retinal OCT image. The image possesses a horizontal layout as "non-disc"-"disc"-"non-disc" on the global scale. A zoomed-in view of the "non-disc" region presents a stratified structure instead. Our segmentation framework is designed to exploit the anatomical priors of the peripapillary region of the retina and to address the segmentation challenges caused by the variation of thickness among retinal layers.

Multi-scale global reasoning module
To address the segmentation challenges caused by the large variation of the retinal thicknesses between different layers, we propose a multi-scale global reasoning module (MGRM) to conduct global reasoning on high-level semantic features in all nine retinal layers. MGRM is composed of multi-scale pooling operators and graph reasoning blocks. It uses multiple effective receptive fields to capture and learn the features of the retinal layers with different thicknesses.
The MGRM split the input into four different paths, three of which are equipped with pooling layers of different kernel sizes followed by graph reasoning blocks: inspired by [43], the kernel sizes are set to 2 × 2, 3 × 3 and 5 × 5 to encode the information of retinal layers with different thicknesses to global feature maps. Then, the global features are up-sampled to match the size of the original input feature map by bilinear interpolation. Finally, all four paths are re-combined and the features are concatenated. The entire procedure is illustrated in Fig. 5.

Multi-scale GCN-assisted U-shape network
We use a U-shape network developed on the basis of the classic U-Net [28] as the backbone of the proposed multi-scale GCN-assisted U-shape network (MGU-Net). The schematic diagram of MGU-Net is presented in Fig. 6. MGRM is located in the center of the network to connect the encoding and decoding paths. It captures additional long-range contextual features, which are difficult to acquire in conventional neural network. After several convolution operations and max-pooling operations in the encoder part, the feature map provides rich spatial features, which are informative for aggregating features and extracting nodes in the following MGRM. It should be noted that the sizes of the max-pooling kernels are different for the optic disc detection network and the retinal layer segmentation network: the size of each max-pooling kernel is set to (2,2,2) in the retinal layer segmentation network as in Fig. 6, while that of the optic disc detection network is set to be (2,4,2) to better capture the larger-scale semantic information represented by the optic disc.

Loss function
In our two-stage segmentation framework, two loss functions L Seg 1 and L seg 2 are proposed to supervise these two stage MGU-Nets and to enforce them to segment optic disc and nine retinal layers in an end-to-end fashion more accurately. The total loss L in this study is the sum of the two losses. The total loss function is shown as follow, where weights the two losses. L seg are defined as the sum of Dice loss and Cross-Entropy loss, which can be described as where in which and indicates the ground truth and the probability in prediction of pixel belonging to of class , is the number of classes in the segmentation network. For each subject, 2 radial OCT B-scans were randomly selected to ensure mutual exclusion. Two graders annotated these images manually through ITK-SNAP software [44] into optic disc and nine retinal layers under the supervision of a glaucoma specialist. While one image is only annotated by one grader, the final ground truth is obtained from the consensus of all personnel.

Datasets and implementation details
For the experiment, the data were randomly divided into three mutually exclusive subsets for training, validation, and testing on the patient level. The ratio between these sets was 6:2:2. In order to enlarge the size of our dataset, we performed data augmentation including horizontal flipping, additive Gaussian noises, and contrast adjustment. In addition, we tested our proposed technique on the Duke SD-OCT dataset, which was collected by Chiu et al. using a Spectralis HRA+OCT (Heidelberg Engineering, Heidelberg, Germany) [40]. It consists of 110 OCT B-scans obtained from 10 patients with diabetic macular edema (DME) with a size of 496 × 768 pixels. More details about this dataset could be found in [40].

Experimental details
The proposed method was implemented in PyTorch and trained on NVIDIA Tesla V100 GPUs. During the training, the initial learning rate was 0.001 and was reduced by an order of magnitude after every 20 epochs. The number of epochs was 50. Momentum and weight decay coefficients were set to 0.9 and 0.0001, respectively. We used Adam optimizer to train the model in mini batches of size = 1. Parameter in Eq. (1) is empirically set as 2. To ensure a fair comparison, the training hyperparameters were kept constant to achieve the best performance for all the comparative methods.

Comparisons with the state-of-the-arts on collected dataset
We compared our model with state-of-the-art techniques, including U-Net [28], ReLayNet [34] and DRUNET [30]. U-Net is a popular segmentation network used for medical image. ReLayNet is specially designed for segmenting retinal layers and fluid in macular OCT images. DRUNET is proposed to segment optic nerve head tissues in peripapillary OCT images using a dilated and residual U-shape network. It is worth noting that U-Net has four down-sampling and four up-sampling operators, while ReLayNet and DRUNET both perform three down-sampling and three up-sampling operations.

Comparisons with state-of-the-arts on public dataset
We repeat the experiments on the Duke SD-OCT dataset, which is macular centered. Since the proposed two-stage framework is originally designed for segmenting peripapillary OCT B-scans, we removed the first stage and only use the second stage to compete against the state-of-the-art models including U-Net, ReLayNet, DRUNET and published results on this public dataset [34,35,40,45].

Ablation study
To assess the contribution of each component of the proposed framework, we performed several ablation studies on collected dataset. We compare the performance of the proposed model with that of (1) one-stage baseline, (2) two-stage baseline, (3) without graph reasoning blocks in multi-scale global reasoning module, and (4) with single-scale global reasoning module in the two-stage segmentation framework.

Evaluation Metrics
The Dice score (DSC) and pixel accuracy (PA) between predictions and segmentation references were used for quantitative evaluation of segmentation performance. They are calculated as: and where is the region of prediction and is the region of ground truth. Dice score was used to measure the overlap between the prediction and ground truth. Pixel accuracy was used to calculate the true positive rate of predicted results compared with their ground truth.

Collected dataset
The experimental results on the collected dataset are listed in Table 2 and Table 3. The proposed method outperforms the selected state-of-the-art methods in most optic nerve head tissue categories except for the optic disc (both Dice and pixel accuracy) and RPE (pixel accuracy): the overall average Dice score for the proposed network is 1.6%, 1.5%, and 1.4% higher than that of ReLayNet, U-Net, and DRUNET, respectively. A similar trend is observed in the pixel accuracy results. Fig. 7 shows the segmentation results obtained by various techniques on a normal peripapillary OCT image. The segmented image obtained by the proposed method as shown in Fig.7(f) presents the best visual quality, while various artifacts are visible in the others. We roughly categorize the artifacts into two groups, 1) layer discontinuities, where the stratified retinal layers are unexpectedly terminated in the horizontal direction as observed in ReLayNet (Fig. 7(d)) and DRUNET (Fig. 7(e)) and marked by white stars; 2) scattered labels, where an isolated region enclosed by a certain layer to be recognized as others as pointed out by the yellow arrows in Fig. 7(c)-(d). This type of errors is observed in all but the proposed method.
We suggest that the improved performance could be ascribed to the additional prior knowledge incorporated in the proposed framework. For conventional CNN-based algorithms, the pixel-level classification is often sensitive to the textural details, while we regularize this with extra spatial constraints in the proposed technique: the segmentation results have to comply with the learned spatial layout, which dictates that the retinal layers must be arranged as a horizontally continuous and vertically stratified structure.
A similar observation could be made on a diseased sample with retinal lesion as illustrated in Fig. 8. It is clear that the blurred boundaries and the reduced contrast in the lesion area as circled out in Fig. 8(a) are challenging for conventional CNN-based algorithms. On the other hand, while mis-labelling are also observed in the proposed method, the stratified structure of the retinal layers is well preserved, which again demonstrates the efficacy of the proposed technique. Table 2. Dice score (%) for the segmentation on collected dataset obtained by different methods. The best performance is marked by "*", the second-best performance is indicated by "**". Improvement is defined as the difference between the proposed method and the best performance obtained among other techniques.  Table 3. Pixel accuracy (%) of segmentation results on collected dataset by different methods and improvement in comparison with the best performance from other methods. The best performance is marked by "*", the second-best performance is indicated by "**".

Public dataset
The experimental results obtained on the Duke SD-OCT dataset are reported in Table 4. It is worth mentioning that the Duke dataset is macular-centered, while our segmentation framework is designed for disc-center images. Therefore, only one stage of proposed MGU-Net is used in this experiment. Nonetheless, the proposed model achieved best performance in ONL-ISM layer and second-best performance in four retinal layers. The average Dice score achieved by the proposed MGU-Net is the highest if we do not take the results reported by Roy et al into account, while it does outperform the ReLayNet we reproduced. We also displayed the segmented images in Fig. 9. The proposed MGU-Net manifests better visual quality in comparison with other OCT retinal image segmentation methods. Consistent with the observations we have made in Section 4.1, artifacts such as layer discontinuities, which are marked by white stars, are presented in the image segmented by U-Net and DRUNET in Fig. 9(c) and Fig. 9(e), respectively.

Ablation study on the collected dataset
The quantitative results of the ablation study on our dataset are listed in Table 5 and Table  6. Baseline is the U-shape network with three down-sampling operations, which is one level shallower than the U-Net used in previous section. We first compared the results of two-stage baseline with that of one-stage baseline. If a two-stage network is adopted to segment optic disc and retinal layers separately, the Dice score and pixel accuracy were improved by 0.4% and 1.0% respectively. If we insert single-scale graph reasoning block (GRB) into the two-stage baseline, the Dice scores were greatly increased compared with two-stage baseline. However, pixel accuracy on RNFL dropped slightly. On the other hand, if we apply multi-scale pooling (MSP) to the two-stage baseline, an immediate boost is also observed from all the results except for the pixel accuracy on RNFL. We suspect that the lack of improvement by adding GRB or MSP might be ascribed to the complicated morphology of the image: single-scale GRB could not capture all spatial information at one time, while MSP could not perform spatial reasoning in global level. After introducing the multi-scale global reasoning module into the two-stage network, which is the combination of GRB and MSP, all the Dice score and pixel accuracy were both lifted, achieved an improvement of 0.9% and 1.2% compared with that of two-stage baseline.
The results of ablation study demonstrate the effective of the added modules and indicate that the proposed multi-scale GCN-assisted two-stage U-shape network significantly improved the performance of segmentation in peripapillary OCT image.

The limited size of the collected dataset
In the current study, 122 OCT B-scans from 61 individuals are manually annotated and included in our experiment. While the size of the dataset is relatively small, it should be noted that the Table 4. Dice score for segmentation results on Duke SD-OCT dataset by different methods and expert 2 annotations. The best performance is marked by "*", the second-best performance is indicated by "**".  qualified human data are difficult to acquire and manually annotating 10 layers in one OCT image is very expensive and time-consuming. To partially overcome this issue, we performed data augmentations on the training set including horizontal flipping, additive Gaussian noises, and contrast adjustment.

The impact of imaging artifacts and label noises on the segmentation results
It is well known that the commonly presented artifacts including vessel shadows and retinal lesions might influence the automated segmentation algorithms. Those artifacts often cause Table 5. Ablation study of each parts in our framework through comparing the Dice score (%). The best performance is marked by "*" (%).  Table 6. Ablation study of each parts in our framework through comparing the pixel accuracy (%). The best performance is marked by "*" (%). blurred layer boundaries, diminished tissue texture, and altered image contrast, which could potentially lead to a decrease in the segmentation accuracy. A good illustration is provided in Fig. 8, where the presented retinal lesion is circled out in the right panel of Fig. 8(a). For conventional deep learning-based algorithms such as U-Net, ReLayNet, and DRUNET, labelling errors are visible as shown in Fig. 8(c)-(e). On the other hand, the proposed method is more robust to this perturbation. We believe this might because these algorithms mainly depend on the texture details of the images to perform the pixel-level classifications, while the proposed method explicitly imposes spatial constraints on to the task which regularizes the task and ensures a better visual outcome in this case as illustrated in Fig.  8

Method
It is also worth mentioning that the proposed method might be affected by the label noises. Due to the limited resources, the manual segmentation of the OCT images is performed collaboratively by two graders under the supervision of a retinal specialist, such that one image is only segmented by one grader. Therefore, it is possible that small label noises are introduced, because the two graders might possess different preferences and styles during the annotation [45]. This might slightly impair the performance of the segmentation.
To further address these issues, we plan to perform detection, removal, and inpainting for the artifact regions and tackle the label noise in the future [46,47].

The proposed framework relies on standardized images
One of the potential limitations of the proposed framework is that it requires the input images to be well standardized such that all anatomical assumptions or spatial constraints we have made are valid. Take the first stage, the optic disc segmentation network, as an example, it relies on the presumption that the optic disc region and the non-disc regions are arranged in a proper horizontal order as mentioned in Section 2.1. We could be done by perform a registration process prior to segmentation with a goal of registering the optic disc region with a retinal template in the future.

Conclusion
To address the challenges imposed by the multi-scale features presented in the optic disc and the retinal layers with various thicknesses as well as exploiting the existing anatomical priors, a multi-scale global reasoning module, which is capable of long-range contextual spatial reasoning, is proposed and integrated into a U-Net backbone. Specifically, a two-stage framework is constructed to sequentially segment the optic disc and the retinal layers in peripapillary OCT images. We validated the proposed framework on a collected dataset as well as a public dataset. The experimental results on both datasets showed that the proposed method could considerably improve the segmentation performance of optic nerve head tissues compared with other state-of-the-art techniques. The proposed method achieved 82.0% and 83.0% in terms of Dice score and pixel accuracy on average, which is 1.6% and 2.8% higher than the performance of ReLayNet. More importantly, the visual quality of the segmented images is greatly enhanced, thanks to the anatomical constraints imposed by the multi-scale global reasoning module. In the future, we will incorporate the proposed segmentation network into the workflow of early-stage glaucoma diagnosis. We also believe the proposed architecture could be domain transferred to other biomedical image segmentation tasks where an abundance of anatomical priors is available. To facilitate the progression of the filed, we make our segmentation dataset as well as the codes available. To our best knowledge, this will be the first public peripapillary retinal OCT dataset.

Acknowledgement
The computations in this paper were run on the 2.0 cluster supported by the Center for High Performance Computing at Shanghai Jiao Tong University. We sincerely appreciate reviewers for their precious suggestions which help improve this work substantially.