Robust deep learning method for choroidal vessel segmentation on swept source optical coherence tomography images.

Accurate choroidal vessel segmentation with swept-source optical coherence tomography (SS-OCT) images enables quantitative analysis that advances the understanding of choroid-related diseases. Motivated by the leading segmentation performance of deep learning methods on medical images, we propose the adoption of a deep learning method, RefineNet, to segment the choroidal vessels from SS-OCT images. We quantitatively evaluated RefineNet on 40 SS-OCT images containing ~3,900 manually annotated choroidal vessel regions. We achieved a segmentation agreement (SA) of 0.840 ± 0.035 with clinician 1 (C1) and 0.823 ± 0.027 with clinician 2 (C2). These results were higher than the inter-observer SA between C1 and C2 of 0.821 ± 0.037. Our results demonstrate that the choroidal vessels in SS-OCT images can be segmented automatically using a deep learning method, which provides a new approach towards objective and reproducible quantitative analysis of vessel regions.


Introduction
The choroid is a tissue layer beneath the retina with the most abundant blood flow of all the structures in the eye. It is crucial for the normal function of the retinal pigment epithelium and outer retina in terms of oxygenation and metabolic activity [1]. Structural and functional abnormalities of the choroid are related to many ocular diseases, such as Age-related Macular Degeneration (AMD) [2], Polypoidal Choroidal Vasculopathy (PCV) [3], Choroidal Neovascularisation (CNV), and Multifocal Choroiditis and Panuveitis (MCP). For analyzing the choroid, Swept-Source Optical Coherence Tomography (SS-OCT), an advanced non-invasive scanner for imaging the eye fundus, has demonstrated great potential in its ability to quantify the choroid layer. Compared to the traditional Spectral Domain OCT (SD-OCT), SS-OCT penetrates deeper into the fundus and is therefore capable of acquiring the choroid layer, and the choroidal vessels within it, in greater detail [4].
To quantitatively analyze the choroid layer, it is necessary to segment the individual choroidal vessels, which enables measuring the total volume of the vessels, the distribution of the vessels within the layer, and the volume ratio of the vessels in relation to the background. This analysis facilitates the understanding of choroid-related diseases, such as AMD, PCV and high myopia. However, current approaches to these analyses rely on tedious, time-consuming, and non-reproducible manual processes. Although automated segmentation methods have been developed for SD-OCT and may, with adaptation, be applicable to choroidal vessel quantification from SS-OCT, these methods are optimized for the visual properties of SD-OCT, which differ markedly from the vessel characterization of SS-OCT, and are thus not comparable. This study is the first to present a quantitative analysis of choroidal vessel segmentation on SS-OCT images with performance that is consistent with clinicians' manual analysis.

Related work
We divide related work into three main categories: SS-OCT segmentation methods, conventional choroidal vessel segmentation methods, and deep learning based OCT segmentation methods. This separation reflects the limited number of works on automated segmentation of SS-OCT images. We therefore also include works on the segmentation of SD-OCT, the imaging modality most relevant to SS-OCT, and discuss how these studies may relate to SS-OCT.
For SS-OCT segmentation, Li Zhang et al. [5] proposed using a shape model of the Bruch's membrane with soft-constraint graph-search for segmenting choroidal boundaries. Zhou et al. [6] proposed an attenuation correction approach to denoise the input SS-OCT images, which improved the segmentation of the choroidal boundaries. Although these automated segmentation methods have demonstrated accurate results, they were designed for segmenting layer structures, e.g., by searching for minimum paths. Consequently, it is challenging to adopt these methods for segmenting circular choroidal vessel structures.
Conventional choroidal vessel segmentation methods generally require: i) a priori knowledge of the data, ii) tuning of a large number of parameters, and iii) pre- and/or post-processing techniques to denoise the input image and refine the segmentation results, respectively. Duan et al. [7] and Kajic et al. [8] both applied multi-scale adaptive thresholding methods for choroidal vessel segmentation. Li Zhang et al. [9] proposed fitting 3D tube-like models to the choroidal vessels as a means of vessel detection. The results were then processed by a multi-scale Hessian filter and thresholding to refine the segmented choroidal vessel boundaries. In their experiments, only reproducibility was demonstrated, using repeated scans of the same eye, while segmentation accuracy was not reported. In another work, Srinath et al. [10] proposed to first identify the upper and lower boundaries of the choroid layer and then apply a level set method on the detected choroid layer to iteratively segment the vessels. By isolating the choroid layer, they were able to improve the identification of the vessels and their subsequent segmentation. Unfortunately, in all the above methods, segmentation performance was only demonstrated visually and was not quantitatively evaluated against manually annotated ground truth data. Consequently, it is challenging to determine their accuracy and reproducibility.
Recently, deep learning methods have achieved great success in medical image segmentation tasks [11,12]. This success is primarily attributed to the ability of deep learning methods to leverage large data sets to hierarchically learn the image features that best correspond to the appearance, as well as the semantics, of the images. Motivated by this success, many investigators have attempted to adapt deep learning based methods to OCT images. Fang et al. [13] adopted an 8-layer convolutional neural network (CNN) to classify whether each pixel is located in the retinal layer. However, this patch-based method is inefficient, because accurate segmentation requires a separate prediction for every pixel in the image. In addition, as the patches are independently trained and used in the prediction, spatial context is lost, so neighboring pixels can be labelled inconsistently. To overcome these limitations, fully convolutional network (FCN) based methods have been proposed. FCN was derived from CNN (VGGNet [14]) to provide efficient dense inference, where the classification modules (fully connected layers) were replaced with deconvolutional layers (transposed convolutional layers) to upsample the learned features and output the segmentation results. For instance, Sui et al. [15] embedded segmentation results from three FCNs for choroid layer segmentation. Venhuizen et al. [16] introduced a modified U-shaped FCN (U-net) for retina thickness segmentation. Xu et al. [17] used a dual-stage FCN to progressively segment the pigment epithelium detachment (PED) structures from SD-OCT images. Although these FCN based methods have demonstrated accurate segmentation results for layer structures, they have not been validated for the segmentation of small choroidal vessels.

Our contributions
In this study, we propose the adoption of an FCN variant named RefineNet, proposed by Lin et al. [18], which achieved state-of-the-art segmentation results on natural images. RefineNet differs from the commonly used VGG-FCN (FCN using the VGGNet as the backbone [14]) in two aspects: (1) RefineNet propagates and integrates the multi-scale intermediary results in an end-to-end manner, whereas VGG-FCN propagates at a single scale. The multi-scale capability allows for improved segmentation of regions of various sizes, which are evident in our OCT images; (2) RefineNet uses a deeper ResNet backbone architecture (101 layers) [19] instead of the conventional VGGNet (usually fewer than 20 layers). In our experiments, the deeper layers and the residual connections of the ResNet backbone allowed for the extraction of more meaningful image features, which resulted in improved segmentation. In this study, we introduce the following contributions:
• Development of a deep learning based choroidal vessel segmentation method for SS-OCT for use in choroidal vessel analysis.
• Our choroidal vessel segmentation method is an end-to-end model for training and testing, which means that it does not require any pre- or post-processing steps, e.g., image denoising and filtering, or manual feature / parameter selection. Hence, our approach is practically suitable for choroidal vessels, which usually consist of a large number of regions with complex patterns that are difficult to analyze manually.
• We evaluated our segmentation results against manual annotations from two clinicians, and further compared our results with implementations of conventional vessel segmentation methods. Furthermore, we present the first inter-observer variability analysis of manual choroidal vessel segmentation from SS-OCT images.

Image acquisition and manual annotation
The images used in our study were acquired using SS-OCT (model DRI OCT-1 Atlantis; Topcon) with 12-line radial scan patterns having a resolution of 1024 × 12. Each image is an average of 32 overlapped consecutive scans focused on the fovea and has a resolution of 1024 × 992 pixels, representing an actual area of 12 mm × 2.6 mm. This study was approved by the Institutional Review Board of Shanghai General Hospital, Shanghai Jiao Tong University and was conducted under the tenets of the Declaration of Helsinki. Informed consent was obtained from each participant. Ten subjects were randomly selected: 5 emmetropes and 5 high myopes (refractive error ≤ -5.0). For each subject, 4 slices were selected at 45°, 90°, 135° and 180° from the horizontal view, representing the four main directions of eye fundus scanning. In total, 40 images were selected for manual annotation for use as the ground truth. These images were annotated by two clinicians independently. Each image contained 97.9 ± 28.2 annotated regions, giving approximately 3900 annotated regions in total. Figure 1 shows 3 example images from 3 studies and their annotations (annotated by the two clinicians). Each clinician spent about 10 hours annotating these 40 images.

Fully convolutional networks (FCN)
The traditional FCN architecture [11] was converted from convolutional neural networks (CNNs) (VGGNet [14]) for efficient dense inference. It contains a downsampling path and an upsampling path. The downsampling path has stacked convolutional layers to extract high-level abstract information and has been widely used in CNNs for image classification related tasks [20]. The upsampling path has stacked deconvolutional layers, which are transposed convolutional layers that upsample the feature maps derived from the downsampling path to output the segmentation results.
An FCN network can then be defined as: $Y = U_S(F_S(I; \theta); \phi)$, where $Y$ is the output prediction, $I$ is the input image, $F_S$ denotes the feature map produced by the stacked convolutional layers with a list of downsampling factors $S$, and $U_S$ denotes the deconvolutional layers that upscale the feature map by a list of factors $S$ to ensure that the output $Y$ and the input $I$ have the same size (height and width). θ and φ are the learned parameters. To explore the fine-scaled local appearance of the input image, a skip architecture was employed to combine the output feature maps of the lower and the higher convolutional layers for more accurate inference [11,12].
For training the FCN for choroidal vessel segmentation, the objective can be defined as minimizing the overall loss between the predicted results and the ground truth annotation: $\arg\min_{\theta, \phi} \mathcal{L}(Y, Z)$. Here, $\mathcal{L}$ calculates the loss (per-pixel cross-entropy loss) between the ground truth annotation $Z$ and the predicted results $Y$. The network parameters θ and φ can then be iteratively updated using the stochastic gradient descent (SGD) algorithm. For inference, the FCN takes an image of arbitrary size and outputs a probability map of the same size that indicates the choroidal vessel area.
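The per-pixel cross-entropy loss mentioned above can be illustrated with a minimal sketch. It assumes a binary (vessel versus background) formulation averaged over pixels, which the text does not specify; the function name `pixel_cross_entropy` and the toy values are hypothetical and are not taken from the study.

```python
import math


def pixel_cross_entropy(prob_map, truth_mask, eps=1e-12):
    """Mean per-pixel binary cross-entropy (assumed formulation).

    prob_map:   2-D list of predicted vessel probabilities in [0, 1].
    truth_mask: 2-D list of 0/1 ground-truth labels (1 = vessel).
    """
    total = 0.0
    count = 0
    for prob_row, truth_row in zip(prob_map, truth_mask):
        for p, z in zip(prob_row, truth_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(z * math.log(p) + (1 - z) * math.log(1.0 - p))
            count += 1
    return total / count


# Toy 2 x 2 example with made-up values (not data from the study).
prediction = [[0.9, 0.2], [0.6, 0.1]]
annotation = [[1, 0], [1, 0]]
loss = pixel_cross_entropy(prediction, annotation)
```

The loss shrinks towards zero as the probability map approaches the annotation, which is the quantity that SGD reduces during training.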

RefineNet for choroidal vessels segmentation
RefineNet is another variation of FCN and has two major differences when compared with the traditional FCN (VGG-FCN) architecture. On the downsampling path, the traditional FCN architecture is based on the VGGNet architecture [14] and therefore has limited capacity for additional layers. Experimental data have shown that beyond certain depths, adding extra layers results in higher training errors, which makes it challenging to optimize very deep networks with many layers [21]. In addition, the traditional FCN usually relies on skip connections with deconvolution operations on the upsampling path to produce the final segmentation results. Unfortunately, the deconvolution operations are not able to recover smaller objects, e.g., choroidal vessels, that are lost after the downsampling operations on the downsampling path [18]. Therefore, they are unable to output accurate segmentations of individual small choroidal vessel regions.
To overcome the limitation on the downsampling path, a 101-layer residual network (ResNet) [19] was used for visual feature learning and representation. The ResNet architecture consists of a number of residual blocks with shortcuts that bypass a few convolutional layers at a time (as exemplified in Fig. 2). The ResNet architecture therefore provides multiple downsampling paths and allows optimal results to be derived by averaging the outputs of the different paths. To improve the segmentation results on the upsampling path, multiple RefineNet modules were used to fuse high-level features with low-level features derived from the downsampling path in a step-wise manner (Fig. 2). Compared with skip connections with deconvolution operations, this upsampling and fusion process retains the image resolution while accurately segmenting the choroidal vessels.
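The multi-path reading of residual shortcuts described above can be made explicit by unrolling two residual blocks. This is a sketch based on the standard residual formulation (output equals input plus a learned residual); the symbols $x$, $f_1$, $f_2$, $y_1$ and $y_2$ are introduced here for illustration only and do not appear in the original text.

```latex
y_1 = x + f_1(x), \qquad
y_2 = y_1 + f_2(y_1) = x + f_1(x) + f_2\bigl(x + f_1(x)\bigr)
```

Under this reading, the input reaches $y_2$ through the identity shortcuts alone, through $f_1$ only, through $f_2$ only, or through both, so $n$ stacked blocks yield up to $2^n$ such paths whose contributions are summed.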

Implementation details
There is a scarcity of medical images with annotations for use as training data due to the cost and complexity of the acquisition procedures [22,23]. In contrast to the limited data in the medical domain, general image data are far more widely available [24]. Existing works [22] have shown that the problem of insufficient training data can be alleviated by fine-tuning (continuing to train a model that was trained on general images), where the lower layers of the fine-tuned network are more general filters (trained on general images) while the higher layers are more specific to the target problem (trained with specific medical images) [22,25]. Therefore, we used the pre-trained 101-layer ResNet (trained on ImageNet) for initialization. The implementation was based on the MatConvNet library [26].
We used a 5-fold cross-validation approach to evaluate the proposed method. For each fold, 8 studies (32 images, ~2880 annotated regions) were used as the training data and 2 studies (8 images, ~720 annotated regions) were used as the testing data. Equal numbers of randomly sampled manual annotations from C1 and C2 were used in the training data. We rotated the training and testing data 5 times to cover all 10 studies. Data augmentation techniques, including random crops and flips, were used to further improve robustness [27,28]. Due to limited GPU memory, the input could not be processed at full resolution. Downsampling the whole image, however, would lose the small choroidal vessels. We therefore cropped each input image into smaller 500 × 500 patches, which were then resized to fit the input size of the pre-trained model. It took an average of ~10 hours to train one fold, where each fold was trained for 100 epochs at a learning rate of 0.0005 and a batch size of 1. The framework was implemented in MATLAB R2017a running on a desktop PC with an 11 GB NVIDIA GTX 1080 Ti GPU and an Intel Core i7 2.60 GHz CPU. The average running time of RefineNet per image is 0.72 seconds.
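The study-level rotation and the patch cropping described above can be sketched as follows. This is an illustrative Python sketch rather than the MATLAB implementation used in the study: the helper names `subject_folds` and `tile_origins` are hypothetical, the folds are taken in order instead of being sampled, and the evenly spaced, overlapping patch layout is an assumption, since the text does not state how the 500 × 500 patches were positioned.

```python
def subject_folds(subject_ids, n_folds=5):
    """Split subjects into n_folds disjoint test groups; train on the rest."""
    fold_size = len(subject_ids) // n_folds
    folds = []
    for k in range(n_folds):
        test = subject_ids[k * fold_size:(k + 1) * fold_size]
        train = [s for s in subject_ids if s not in test]
        folds.append((train, test))
    return folds


def tile_origins(length, tile):
    """Start offsets of tiles covering [0, length), last tile flush to the edge.

    Assumed layout: the minimum number of evenly spaced, overlapping tiles.
    """
    if length <= tile:
        return [0]
    n_tiles = -(-length // tile)  # ceiling division
    step = (length - tile) / (n_tiles - 1)
    return [round(i * step) for i in range(n_tiles)]


SLICES_PER_SUBJECT = 4  # 45, 90, 135 and 180 degree slices per subject

folds = subject_folds(list(range(10)), n_folds=5)
train_images = [len(train) * SLICES_PER_SUBJECT for train, _ in folds]
test_images = [len(test) * SLICES_PER_SUBJECT for _, test in folds]

# Patch origins for a 1024 x 992 image and 500 x 500 patches.
x_origins = tile_origins(1024, 500)
y_origins = tile_origins(992, 500)
```

Splitting by subject rather than by image keeps all four slices of one eye on the same side of the train/test boundary, which matches the 8-study / 2-study division in the text.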

Experimental setup
We performed the following experiments: (a) evaluation of the segmentation accuracy of the RefineNet method compared to manual ground truth (from observers); (b) comparison of inter-observer variability; and (c) comparison of our RefineNet with other choroidal vessel segmentation methods. The comparison methods were: (1) VGG-FCN: the traditional fully convolutional network based on the VGGNet architecture [11]; (2) LS: the level set based segmentation method used in [10]; and (3) AT: the adaptive thresholding based segmentation method used in [7].
The VGG-FCN was trained with the same 5-fold cross-validation process as RefineNet, using a model pre-trained on ImageNet [24]. It took an average of ~6 hours to train one fold, where each fold was trained for 200 epochs at a learning rate of 0.0001 (decreased by a factor of 10 every 100 epochs) and a batch size of 20. We used the same patch-based method at both the training and inference stages: the image was cropped into 500 × 500 patches, which were then resized to fit the input size of the pre-trained model. Both the LS and AT methods required a sophisticated choroid layer detection approach as a pre-processing step. To minimize pre-processing errors and to allow a fair comparison, one clinician manually delineated the choroid layer for these two comparison methods.
For evaluation, we used segmentation agreement (SA), mean absolute differences (MAD), intra-class correlation coefficient (ICC) and Bland-Altman plots to measure the segmentation performance and the variability between our method (Ours) and clinician 1 (C1), Ours and clinician 2 (C2), and C1 and C2. SA was defined as: $\mathrm{SA} = (AP + NP) / (AP + NP + FP)$, where $AP$ represents the pixels that both observers agreed to annotate (e.g., annotated by both C1 and C2), $NP$ represents the pixels that both agreed not to annotate (background), and $FP$ represents the pixels on which they disagreed. MAD was defined as: $\mathrm{MAD} = |P_1 - P_2| \times R$, where $P_1$ is the total number of pixels annotated by observer 1 (e.g., clinician 1), $P_2$ is the total number of pixels annotated by observer 2 (e.g., our method), and $R$ is the zoom ratio of the image to the actual tissue. In this study, eye tissue of 12.00 × 2.60 mm was scanned into an image of 1024 × 992 pixels, which equates to a zoom ratio of 12.00 × 2.60/(1024 × 992) $\approx 3.07 \times 10^{-5}$ mm² per pixel. MAD and ICC were used to assess the inter-observer reproducibility of the measurements. Bland-Altman plots were used to graphically derive inter-observer variability measurements [29].

Results
Table 1 shows that our method achieved an average SA of 0.840 ± 0.035 with C1 and 0.823 ± 0.027 with C2. The inter-observer SA between C1 and C2 averaged 0.821 ± 0.037. In comparison, VGG-FCN had an average SA of 0.780 ± 0.057 with C1 and 0.785 ± 0.044 with C2. AT and LS achieved lower SA: AT had an average SA of 0.720 ± 0.051 with C1 and 0.679 ± 0.045 with C2, while LS had an average SA of 0.670 ± 0.113 with C1 and 0.625 ± 0.124 with C2.
Figure 3 shows the results from the various segmentation methods (rows) for three randomly selected patient studies (columns) for qualitative visual assessment. Figure 4 shows the Bland-Altman plots comparing inter-observer variability among the different segmentation approaches.
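The SA and MAD measures defined in the experimental setup can be sketched in a few lines of Python. The sketch assumes that SA is the fraction of pixels on which two annotations agree, (AP + NP)/(AP + NP + FP), and that the per-image absolute difference is |P1 - P2| × R, which is one reading of the definitions; the function names and the toy masks are hypothetical and are not data from the study.

```python
# Zoom ratio R from the text: 12.00 mm x 2.60 mm of tissue per 1024 x 992 pixels.
MM2_PER_PIXEL = 12.00 * 2.60 / (1024 * 992)


def segmentation_agreement(mask_a, mask_b):
    """Fraction of pixels on which two binary masks agree (assumed SA formula)."""
    agreed_vessel = agreed_background = disagreed = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            if a and b:
                agreed_vessel += 1      # AP: both annotated
            elif not a and not b:
                agreed_background += 1  # NP: both left as background
            else:
                disagreed += 1          # FP: annotations differ
    agreed = agreed_vessel + agreed_background
    return agreed / (agreed + disagreed)


def absolute_area_difference(mask_a, mask_b, ratio=MM2_PER_PIXEL):
    """|P1 - P2| x R for one image, in mm^2 (assumed per-image MAD term)."""
    p1 = sum(1 for row in mask_a for v in row if v)
    p2 = sum(1 for row in mask_b for v in row if v)
    return abs(p1 - p2) * ratio


# Toy 2 x 4 masks (made-up values).
observer_1 = [[1, 1, 0, 0], [0, 0, 0, 0]]
observer_2 = [[1, 0, 0, 0], [0, 0, 0, 0]]
sa = segmentation_agreement(observer_1, observer_2)
area_diff = absolute_area_difference(observer_1, observer_2)
```

Averaging `absolute_area_difference` over all images of an observer pair would give a mean absolute difference in physical units rather than in pixel counts.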

Discussion
We presented the results of a deep learning based segmentation analysis, as well as an inter-observer segmentation variability analysis, for choroidal vessels from SS-OCT. Our experiments in Table 1 indicate that our RefineNet method achieved higher segmentation performance for choroidal vessels than the comparison methods. In general, AT and LS performed poorly, with SA ~17% lower than RefineNet, on challenging choroidal vessels that have inhomogeneous variations. This is expected, as they are not able to understand image-wide semantics, e.g., the relationships between the choroidal vessels and the surrounding structures (as exemplified in Fig. 3). It is important to note that AT and LS are not fully automatic: they rely on pre- and post-processing techniques and various parameter selections (i.e., handcrafted features) that restrict their generalizability. In contrast, the deep learning approaches of VGG-FCN and RefineNet achieved higher segmentation accuracy in an end-to-end manner without pre- or post-processing techniques or parameter selections. Table 1 and Figs. 3 and 4 validate that RefineNet achieved higher segmentation accuracy than the VGG-FCN method in both the qualitative and quantitative measurements. We attribute the improvement of RefineNet over VGG-FCN to the fusion of multi-scale intermediary segmentation results, which better captures the small choroidal vessel representations. In contrast, the VGG-FCN's inherent dependence on single-scale processing means that it cannot provide accurate segmentation results for choroidal vessels that vary widely in size. In addition, VGG-FCN only supports a limited number of convolutional layers and is therefore unable to derive the same amount of meaningful feature representations, i.e., high-level semantics, as RefineNet.
We suggest that it is these deeper layers that produced better segmentation on challenging choroidal vessels, e.g., vessels that are closer to the choroid layer boundaries. This effect is exemplified in Fig. 3, which shows several examples where VGG-FCN fails to segment several choroidal vessels (indicated by the arrows). In contrast, RefineNet was able to accurately segment the choroidal vessels and had higher correlation with the segmentation results derived from the two clinicians (as exemplified in Fig. 4).
The results also demonstrate that RefineNet achieved competitive segmentation accuracy with low inter-observer variability compared with the two clinicians (C1 and C2). The results in Table 1 show that the agreement between RefineNet and each clinician was higher than the agreement between C1 and C2. More specifically, the SA of RefineNet was up to ~2% higher (0.840 with C1 and 0.823 with C2, versus 0.821 between C1 and C2), which suggests that RefineNet, as a fully automated method, is able to produce segmentations of the choroidal vessels that are consistent with what clinicians would produce manually. In addition, the Bland-Altman plots in Fig. 4 show that both RefineNet with C1 and RefineNet with C2 have better reproducibility than C1 with C2. We attribute this better reproducibility to the training process, which retains important characteristics of the choroidal vessel structures and results in consistent segmentations.
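The quantities read off a Bland-Altman plot can be sketched as follows. The sketch uses the conventional definition (mean difference as bias, with limits of agreement at bias ± 1.96 standard deviations of the differences); the function name `bland_altman` and the toy measurements are hypothetical and are not values from the study.

```python
import statistics


def bland_altman(measure_a, measure_b):
    """Return (bias, lower limit, upper limit) of agreement for paired data."""
    diffs = [a - b for a, b in zip(measure_a, measure_b)]
    bias = statistics.mean(diffs)
    spread = statistics.stdev(diffs)  # sample standard deviation
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


# Toy paired vessel-area measurements from two observers (made-up values).
observer_1 = [1.0, 2.0, 3.0, 4.0]
observer_2 = [1.1, 1.9, 3.2, 3.8]
bias, lower, upper = bland_altman(observer_1, observer_2)
```

A narrower interval between the two limits indicates better reproducibility between the pair of observers being compared.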
Automatic segmentation of choroidal vessels on SS-OCT, which removes the need for a manual, time-consuming and error-prone annotation process, provides opportunities to visualize the full thickness (e.g., by volumetric rendering) of the choroidal blood flow in a non-invasive manner. Figure 5 exemplifies the segmented choroidal vessels of high myopia and emmetropia. With the segmentation derived from RefineNet, we can clearly see that the abnormal high myopia studies (A) have markedly fewer segmented regions than the normal emmetropia studies (B). Our analysis thus facilitates future research on choroidal vessels, such as clarifying discrepancies about the status of the choroid and choroidal vessels in ocular diseases and creating population-based choroidal disease models.
In addition, our experiments have demonstrated the usefulness and potential broader adoption of SS-OCT for quantitative analysis of choroidal vessels. Currently, researchers have limited imaging modalities from which to obtain information about the choroidal vessels. Traditional indocyanine green angiography (ICGA) is the gold standard in clinical practice for detecting abnormalities in the choroidal vessels. ICGA provides 2D images of the choroidal vasculature, which can show exudation or filling defects. However, ICGA does not provide the 3D choroidal structure or the volume of the whole choroidal vessel network, and ICGA images overlap retinal vessels and choroidal vessels, making it hard to observe and analyze the choroidal vessels independently and quantitatively. OCT Angiography (OCTA) can clearly show the blood flow from the superficial and deep retinal capillary networks, as well as from the retinal pigment epithelium to the superficial choroidal vascular network; however, it cannot show the blood flow in deep choroidal vessels. In this work, we focused on automated choroidal vessel segmentation and therefore did not explore the different visualization options that are possible with the segmentation results, such as en face projection and 3D volume rendering. Our experiments demonstrate that a 2D approach is sufficient for the purpose of segmentation analysis (both quantitative and qualitative). Nevertheless, as future work, we will explore the usefulness of our segmentation with 3D visualizations.

Conclusions
We proposed a RefineNet deep learning method to automatically segment the choroidal vessels on SS-OCT images. Our experiments demonstrated that RefineNet achieved higher segmentation accuracy with low inter-observer variability compared to the measurements between the two clinicians. Further, quantitative comparison with the baseline VGG-FCN showed higher segmentation performance for RefineNet. The RefineNet improvements were attributed to the iterative propagation of multi-scale intermediary segmentation results and the use of a deeper architecture. As future work, we will investigate the adaptation of RefineNet for quantitative analysis as well as potential clinical applications.