ReLayNet: Retinal Layer and Fluid Segmentation of Macular Optical Coherence Tomography using Fully Convolutional Network

Optical coherence tomography (OCT) is used for non-invasive diagnosis of diabetic macular edema by assessing the retinal layers. In this paper, we propose a new fully convolutional deep architecture, termed ReLayNet, for end-to-end segmentation of retinal layers and fluid masses in eye OCT scans. ReLayNet uses a contracting path of convolutional blocks (encoders) to learn a hierarchy of contextual features, followed by an expansive path of convolutional blocks (decoders) for semantic segmentation. ReLayNet is trained to optimize a joint loss function comprising weighted logistic regression and Dice overlap loss. The framework is validated on a publicly available benchmark dataset with comparisons against five state-of-the-art segmentation methods, including two deep learning based approaches, to substantiate its effectiveness.


Introduction
Spectral Domain Optical Coherence Tomography (SD-OCT) is a non-invasive imaging modality commonly used for acquiring high resolution (6µm) cross-sectional scans of biological tissues with sufficient depth of penetration (0.5 − 2 mm) [1,2]. It uses the principle of speckle formation through coherence sensing of photons backscattered within highly scattering optical media like biological soft tissues [3]. It has found application in medical imaging ranging from retinal pathology investigation to skin imaging for monitoring wound healing [4] and intravascular imaging for effective stent placement [5], lumen detection [6] and plaque detection [7]. OCT is the preferred modality for cross-sectional imaging of the retina on account of its high resolution, which favors clear visualization of the various constituent layers of the retina.
Diabetes is a widely occurring chronic metabolic disease with an estimated incidence in about 415 million people (roughly 8.3% of the human adult population) [8]. Diabetic individuals are often at high risk of developing vision-related co-morbidities, which is reported at a significant 28% [9]. The degradation of quality of vision in diabetics is often associated with diabetic retinopathy (DR), which results in damage to retinal blood vessels and accumulation of fluid between the retinal layers [10,11]. Thus, proper monitoring of the retinal layer morphology and fluid accumulation is necessary for diabetic patients to reduce the risk of blindness.
Acquisition of retinal OCT centered at the optic nerve and fovea is highly challenging due to the presence of micro-saccadic eye movements resulting in motion artifacts, variations in tissue inclination with respect to the coherence wave surface and poor signal to noise ratio with increasing imaging depth. The acquisition is also particularly difficult in cases of highly myopic eyes. These inherent challenges associated with the modality make the interpretation of an OCT image difficult and often highly variable among experts, specifically due to the highly diffused nature of the boundaries between two retinal layers. This causes manual annotation of the layer boundaries to be very subjective and time consuming. This has motivated a lot of research on developing automated methods for segmenting different retinal layers from OCT images to aid accurate diagnosis with minimum subject variation in reporting [12][13][14][15][16].
Towards this end, we propose a deep learning based end-to-end learning framework for segmentation of multiple retinal layers and delineation of fluid pockets in eye OCT images, called ReLayNet (short for Retinal Layer segmentation network). To the best of our knowledge, this is the first time a deep learning based fully convolutional end-to-end method has been leveraged towards this application. Figure 1 previews the results of the proposed ReLayNet for two OCT slices (without and with fluid mass).

State of the art
The task of segmenting a retinal OCT scan involves partitioning the image into the constituent retinal layers and delineating the fluid pool (if present). The OCT image is often posed as a graph (termed graph construction (GC)) and layer label assignment is solved with dynamic programming (DP) approaches [12][13][14][15][16]. In particular, Chiu et al. used intensity gradients to estimate the graph edge-weights followed by DP to solve the shortest path problem, thus estimating the layer boundaries [13]. This was further improved in a subsequent method by using hard and soft constraints to add prior information from a learned model within GC [16]. Alternatively, Srinivasan et al. proposed using sparsity based image denoising, support vector machines and heuristic priors in GC [14]. In a relatively recent work, Chiu et al. demonstrated the use of kernel regression based methods to classify the layers and fluid masses followed by refinement with GC and DP [15]. Along similar lines, Karri et al. proposed to reinforce GC by learning layer specific edges using structured random forests [20]. Also, spatial consistency across consecutive OCT frames was improved by incorporating appropriate constraints within the DP paradigm for segmentation [21]. Recently, Fang et al. combined CNNs with graph search methods for automatically segmenting nine retinal layer boundaries. Their approach uses the probabilistic predictions generated by the CNN within a graph search method that delineates the final retinal layer boundaries [22].
Parallel approaches inspired by early developments in image segmentation in computer vision applications have also been investigated for OCT segmentation. These include the use of texture information together with diffusion maps [12], a probabilistic approach for modelling retinal layers using layer boundary specific shape regularizers [23] and the deployment of two parallel active contours acting simultaneously to segment the retinal boundaries [24].
The aforementioned approaches towards retinal layer segmentation are not end-to-end paradigms. Often, heuristics and hand-crafting are employed in choosing the graph weights for GC and the subsequent DP. The segmentation is achieved in multiple stages involving pre-processing stages like denoising followed by post-processing stages of refinement. Though these additional steps do not limit the usability of these methods, it must be noted that they require significant domain knowledge and modeling approximations. Methods exclusively targeted at layer segmentation often did not consider the presence of fluid filled regions, which could lead to potentially erroneous results in pathological settings. In addition to the above limitations, the testing phase of these methods is typically slow (with the graph optimization often being the computational bottleneck). This limits their potential for deployment in time-constrained settings such as interventions. With the main focus of addressing these issues, in this work we propose a deep learning based approach towards generating layer segmentations of a whole B-scan slice in an end-to-end fashion. Deep learning based approaches provide the advantage of learning discriminative representations from the data sans the need for handcrafting features. In particular, we propose a deep learning architecture that falls under the family of fully convolutional neural networks (F-CNN), which are specifically tailored for semantic segmentation and predict the labels for all the image pixels together [17][18][19].
Of late, there has been a considerable amount of work on semantic segmentation using deep learning methods within the computer vision and medical imaging communities. The seminal work on fully convolutional semantic segmentation proposed by Long et al. [17] is particularly relevant in the context of this work. They effectively adapt state of the art networks trained for image classification into fully convolutional networks fine-tuned for segmentation tasks. In particular, they introduced the notion of skip connections that effectively combine higher-order semantic information from deeper coarsely-resolved layers with appearance information from shallow finely-resolved layers to improve segmentation detail. A significant improvement within this family of models was achieved by using an encoder-decoder based framework, termed DeconvNet, and by introducing unpooling layers instead of interpolation to improve the spatial consistency of segmentation [18]. In an alternate work, Chen et al. proposed the concept of using atrous convolutional kernels instead of interpolation to get much smoother feature maps that are better suited for semantic segmentation [25]. Within the medical imaging community, Ronneberger et al. proposed the U-Net architecture that leverages an encoder-decoder architecture and introduces skip connections across them [19]. They demonstrate that such an architecture can be trained effectively in the presence of limited training data when appropriate data augmentation and gradient-weighting schemes are employed. It must be noted that the architecture presented in this work is inspired in part by U-Net [19] and DeconvNet [18].
The salient contributions presented in the paper can be listed as: (i) To the best of our knowledge, this is the first work employing a fully-convolutional deep learning approach for retinal OCT layer and fluid segmentation, (ii) ReLayNet is an end-to-end learning approach that is driven entirely by the OCT data without employing any heuristics or hand-crafting of features, and it has a highly competitive testing time (10 ms per B-Scan), (iii) Our model uses an encoder-decoder configuration which is tailored for the task at hand by incorporation of unpooling stages with skip connections for improved spatial consistency and ease of gradient flow during training, (iv) ReLayNet is trained with a composite loss function comprising a weighted logistic regression loss along with Dice loss for improved segmentation. The weighting scheme employed effectively compensates for imbalanced classes and selectively penalizes misclassification at layer boundaries.
In the rest of the paper, we detail the methodology of the proposed framework in Sec. 3 and the experimental setup in Sec. 4, discuss results analyzing segmentation performance and retinal thickness estimation in Sec. 5, and present concluding remarks in Sec. 6.

Problem statement
Given a retinal OCT image I, the task is to assign each pixel location x = (r, c) to a particular label l in the label space L = {l} = {1, · · · , K } for K classes. We treat the current segmentation task as a K = 10 class classification problem. The tissue classes include 7 retinal layers illustrated in Figure 1, Region above the retina (RaR), Region below RPE (RbR) and accumulated Fluid.

Network architecture
The network architecture of the proposed ReLayNet is illustrated in Figure 2. It consists of a contracting path of encoder blocks followed by an expansive path of decoder blocks, with skip connections relaying the intermediate feature representations from encoder blocks to their matched decoder blocks through concatenation layers, followed by a classification layer. The individual constituent blocks are detailed as follows:

Encoder block
Each encoder block consists of 4 main layers, in sequence: convolution layer, batch normalization layer, ReLU activation layer and max pooling layer. The convolution kernels for all the encoder blocks have a rectangular size of 7 × 3, to be consistent with the OCT image dimensions, and include a bias term. The kernel size is chosen to ensure that the receptive field at the last encoder block encompasses the whole retinal depth. The feature maps are appropriately zero padded so that the dimensions before and after the convolution layer remain the same. A batch normalization layer is introduced after the convolution layer to compensate for covariate shifts and prevent over-fitting during the training procedure [26]. The ReLU activation introduces non-linearity into the network. This is followed by a max pooling layer which reduces the feature map dimensions by half. The pooling indices from this operation are stored and used in the corresponding unpooling stage of the decoder block to preserve spatial consistency.
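To make the receptive-field argument concrete, the following minimal sketch computes how the receptive field grows through a stack of convolution and max pooling layers using the standard recursive rule. Several things here are assumptions rather than details stated above: that the 7-pixel kernel extent lies along the depth axis, that pooling uses a 2 × 2 window with stride 2, and that three encoder blocks are followed by one bottleneck convolution (the 3 − 1 − 3 configuration described later in the paper). The function names are illustrative only.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a layer stack.

    `layers` is a list of (kernel_size, stride) pairs along one image axis.
    """
    rf, jump = 1, 1  # receptive field and cumulative stride so far
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


def encoder_stack(kernel, n_encoders=3):
    """Conv + 2x max pool per encoder, then one bottleneck convolution (assumed)."""
    layers = []
    for _ in range(n_encoders):
        layers.append((kernel, 1))  # convolution, stride 1
        layers.append((2, 2))       # max pooling halves the resolution
    layers.append((kernel, 1))      # bottleneck convolution
    return layers


depth_rf = receptive_field(encoder_stack(kernel=7))  # along image depth (rows)
width_rf = receptive_field(encoder_stack(kernel=3))  # along image width (columns)
print(depth_rf, width_rf)
```

Under these assumptions the receptive field after the bottleneck spans 98 pixels along depth and 38 pixels along width, which illustrates why a taller kernel helps cover the retinal depth with few blocks.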

Decoder block
Each decoder block consists of 5 main layers, in sequence: unpooling layer, concatenation layer, convolution layer, batch normalization layer and ReLU activation layer. The unpooling layer upsamples a coarsely-resolved feature map from the preceding decoder block to a finer resolution by using the saved pooling indices from the matched encoder block and imputes zeros at the rest of the locations (schematically shown in Figure 3). Such an unpooling layer ensures that spatial information remains preserved, in contrast to using interpolation for upsampling [18]. This is of particular importance for accurately segmenting layers near the fovea, as they are often just a few pixels thick and bilinear interpolation could potentially lead to highly diffused boundaries and hence unreliable estimation of layer thickness. The unpooling layer is followed by a skip connection that relays the output feature maps of the matched encoder block, which are in turn concatenated with the unpooled feature maps within the concatenation layer. The advantage of such a skip connection is twofold: (i) it aids the transition to finer resolution by adding an information rich feature map from the encoder part, and (ii) it aids the flow of gradients to the encoder part during training, thus minimizing the risk of vanishing gradients as model depth increases. The concatenated feature map is followed by a convolutional layer, batch normalization and ReLU. These layers in effect densify the sparse unpooled feature maps. The kernel size of the convolution layer is kept constant at 7 × 3 with appropriate padding, similar to the encoder blocks.
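The interplay between max pooling with stored indices and the matching unpooling step can be sketched in plain Python as follows. This is only an illustration on a single-channel toy feature map with even dimensions; the function names are hypothetical, and a real implementation would operate on multi-channel tensors.

```python
def max_pool_2x2(fmap):
    """2x2 max pooling that also returns the (row, col) of each maximum."""
    pooled, indices = [], []
    for r in range(0, len(fmap), 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]), 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool_2x2(pooled, indices, out_shape):
    """Place each pooled value back at its saved position; zeros elsewhere."""
    rows, cols = out_shape
    out = [[0.0] * cols for _ in range(rows)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            out[r][c] = value
    return out


fmap = [[1.0, 2.0, 0.0, 1.0],
        [4.0, 3.0, 5.0, 0.0],
        [0.0, 1.0, 2.0, 2.0],
        [6.0, 0.0, 1.0, 7.0]]
pooled, idx = max_pool_2x2(fmap)          # encoder side: keep values and positions
restored = max_unpool_2x2(pooled, idx, (4, 4))  # decoder side: sparse upsampling
```

The restored map is sparse: each maximum returns to its exact original location, which is the property that keeps thin layer boundaries from being blurred, and the subsequent convolution is what densifies it.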

Classification block
The final decoder block is followed by a convolutional layer with 1 × 1 kernels (used for reducing the channels of the feature map without changing spatial dimensions) to map the 64 channel feature map to a 10 channel feature map (for 10 classes). At the end, a softmax layer estimates the probability of a pixel belonging to each of the 10 classes.
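For a single pixel, the classification block amounts to a linear map across channels followed by a softmax. The sketch below illustrates this with toy sizes (4 input channels and 3 classes instead of 64 and 10); the weights and feature values are made up for illustration and the function names are hypothetical.

```python
import math


def conv1x1(pixel_features, weights, biases):
    """Map one pixel's C_in features to C_out class scores (a 1x1 convolution)."""
    return [sum(w * f for w, f in zip(row, pixel_features)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Numerically stable softmax over the class scores of one pixel."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: 4 input channels and 3 classes (illustrative values only).
features = [0.5, -1.0, 2.0, 0.0]
weights = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0]]
biases = [0.0, 0.0, 0.0]
probs = softmax(conv1x1(features, weights, biases))
predicted_class = probs.index(max(probs))  # class with the highest probability
```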

Loss functions
The ReLayNet is trained by jointly optimizing the following loss functions:
Weighted multi-class logistic loss: Cross-entropy provides a probabilistic similarity between the actual label and the predicted value at the current state of the network. The average cross-entropy of all the classes defines the logistic loss, which penalizes at each pixel location x the deviation of the estimated probability $p_l(x)$ from 1 and is defined as follows:
$$J_{logloss} = -\sum_{x \in \Omega} \omega(x)\, g_l(x)\, \log\big(p_l(x)\big) \qquad (1)$$
where $p_l(x)$ is the estimated probability of pixel x belonging to class l, $\omega(x)$ is the weight associated with pixel x, and $g_l(x)$ is a vector with one for the true label and zero entries for the others, representing the ground truth probability of the pixel at location x belonging to class l. We utilize a weighted version of the logistic loss for our application to compensate for class imbalance and to encourage kernels that are discriminative towards layer transitions.
Dice loss: Along with the multi-class logistic loss function, we use a Dice loss that evaluates spatial overlap with the ground truth. Here, we use a differentiable approximation of the Dice loss, defined as follows [27]:
$$J_{dice} = 1 - \frac{2 \sum_{x \in \Omega} p_l(x)\, g_l(x)}{\sum_{x \in \Omega} p_l^2(x) + \sum_{x \in \Omega} g_l^2(x)} \qquad (2)$$
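The two loss terms can be sketched in plain Python as follows. The sketch assumes a weighted cross-entropy of the form −Σ ω(x) g_l(x) log p_l(x) and a soft Dice term with squared sums in the denominator, with all sums running jointly over pixels and classes; the small constant `eps` and the function names are additions for illustration only.

```python
import math


def weighted_log_loss(probs, onehot, weights, eps=1e-12):
    """Weighted multi-class logistic loss summed over pixels.

    probs, onehot: one list of K class values per pixel; weights: one per pixel.
    """
    loss = 0.0
    for p, g, w in zip(probs, onehot, weights):
        loss -= w * sum(g_l * math.log(p_l + eps) for p_l, g_l in zip(p, g))
    return loss


def dice_loss(probs, onehot, eps=1e-12):
    """Soft Dice loss with squared terms in the denominator (assumed form)."""
    overlap = sum(p_l * g_l for p, g in zip(probs, onehot)
                  for p_l, g_l in zip(p, g))
    p_sq = sum(p_l ** 2 for p in probs for p_l in p)
    g_sq = sum(g_l ** 2 for g in onehot for g_l in g)
    return 1.0 - 2.0 * overlap / (p_sq + g_sq + eps)


# Two pixels, three classes (toy values).
onehot = [[1, 0, 0], [0, 0, 1]]
perfect = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [[1 / 3] * 3, [1 / 3] * 3]
weights = [1.0, 2.0]
print(weighted_log_loss(uniform, onehot, weights), dice_loss(uniform, onehot))
```

Both terms vanish for a perfect prediction; the logistic term grows with the per-pixel weight, while the Dice term depends only on the overall overlap.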

Weighting scheme for loss function
Let $\omega(x)$ be the weight associated with a particular pixel x ∈ Ω in Eq. (1). The pixels proximal to tissue-transition regions, as evidenced from the ground-truth annotations, are often the most challenging to segment accurately, as the tissue boundaries may be diffused owing to speckle noise and limited OCT resolution. To encourage the ReLayNet kernels to be sensitive to these, we boost the gradient contributions from such pixels by weighting them with a factor of $\omega_1$.
The retinal layers are also heavily imbalanced in contrast to the dominant class (background), and the weighting scheme also aims at compensating for this by weighting the contribution of under-represented classes (retinal layers and fluid masses) with a factor of $\omega_2$. Thus, the final weighting scheme is derived as follows (illustrated in Figure 4):
$$\omega(x) = 1 + \omega_1\, I\big(|\nabla l(x)| > 0\big) + \omega_2\, I\big(l(x) \in \mathcal{L}_u\big) \qquad (3)$$
where $l(x)$ is the ground-truth label at pixel x, $\mathcal{L}_u$ denotes the set of under-represented classes, I(logic) is an indicator function which is one if (logic) is true and zero otherwise, and '∇' represents the gradient operator.
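A weight map of this kind can be sketched directly from a ground-truth label image. In the sketch below, every pixel starts at weight 1, pixels at a label transition receive an extra `w1`, and pixels of non-background classes receive an extra `w2`. The forward-difference approximation of the label gradient, the choice of background labels and the values of `w1` and `w2` are all assumptions made for illustration.

```python
def weight_map(labels, w1, w2, background=(0,)):
    """Per-pixel weights: 1 + w1 at label transitions + w2 for non-background pixels.

    labels: 2D list of integer class labels (ground truth).
    background: labels treated as dominant background classes (assumption).
    """
    rows, cols = len(labels), len(labels[0])
    weights = [[1.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            here = labels[r][c]
            # Forward differences approximate a non-zero label gradient.
            down = labels[r + 1][c] if r + 1 < rows else here
            right = labels[r][c + 1] if c + 1 < cols else here
            if down != here or right != here:
                weights[r][c] += w1
            if here not in background:
                weights[r][c] += w2
    return weights


# Toy label image: background (0), one layer (1), background below (2).
labels = [[0, 0, 0],
          [1, 1, 1],
          [1, 1, 1],
          [2, 2, 2]]
w = weight_map(labels, w1=10.0, w2=5.0, background=(0, 2))
```

With forward differences, only the pixel on the upper or left side of a transition is marked; a symmetric gradient would mark both sides.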

Optimization
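While training the ReLayNet, we optimize these losses with an additional weight decay term for regularization, defined as follows:
$$J_{overall} = \lambda_1 J_{logloss} + \lambda_2 J_{dice} + \lambda_3 \|W^{(\cdot)}\|_F^2 \qquad (4)$$
with weight terms $\lambda_1$, $\lambda_2$ and $\lambda_3$, where $\|W^{(\cdot)}\|_F$ represents the Frobenius norm of the weights W of the ReLayNet. The training problem is to estimate the weights and biases $\Theta = \{W^{(\cdot)}, b^{(\cdot)}\}$ associated with all the layers so as to minimize the overall cost function:
$$\Theta^* = \arg\min_{\Theta} J_{overall}(\Theta) \qquad (5)$$
where $\Theta^*$ is the optimal parameter set that minimizes the overall cost. This cost function is optimized using stochastic mini-batch gradient descent with momentum and back propagation. The derivative of the cost function w.r.t. the parameters Θ is estimated via the chain rule by back propagating the gradients:
$$\frac{\partial J_{overall}}{\partial \Theta} = \sum_{x \in \Omega} \frac{\partial J_{overall}}{\partial p_l(x)}\, \frac{\partial p_l(x)}{\partial \Theta} + \lambda_3 \frac{\partial \|W^{(\cdot)}\|_F^2}{\partial \Theta} \qquad (6)$$
The first factor, $\partial J_{overall} / \partial p_l(x)$, is estimated as:
$$\frac{\partial J_{overall}}{\partial p_l(x)} = \lambda_1 \frac{\partial J_{logloss}}{\partial p_l(x)} + \lambda_2 \frac{\partial J_{dice}}{\partial p_l(x)} \qquad (7)$$
The derivative terms of the individual losses are derived as:
$$\frac{\partial J_{logloss}}{\partial p_l(x)} = -\omega(x)\, \frac{g_l(x)}{p_l(x)} \qquad (8)$$
$$\frac{\partial J_{dice}}{\partial p_l(x)} = -2\, \frac{g_l(x) \Big( \sum_{x \in \Omega} p_l^2(x) + \sum_{x \in \Omega} g_l^2(x) \Big) - 2\, p_l(x) \sum_{x \in \Omega} p_l(x)\, g_l(x)}{\Big( \sum_{x \in \Omega} p_l^2(x) + \sum_{x \in \Omega} g_l^2(x) \Big)^2} \qquad (9)$$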
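The combination of the loss terms and a single momentum update can be sketched as follows. The sketch assumes that the weight decay term is the squared Frobenius norm (so that it contributes 2 λ3 w to the gradient of each weight) and uses the classical momentum update; the learning rate, momentum value and all numbers below are illustrative and not taken from the paper.

```python
def overall_cost(j_logloss, j_dice, weights, lam1, lam2, lam3):
    """lam1 * logistic loss + lam2 * Dice loss + lam3 * squared Frobenius norm."""
    frob_sq = sum(w * w for w in weights)
    return lam1 * j_logloss + lam2 * j_dice + lam3 * frob_sq


def sgd_momentum_step(weights, data_grads, velocity, lr, momentum, lam3):
    """One update; weight decay adds 2 * lam3 * w to each data gradient."""
    new_w, new_v = [], []
    for w, g, v in zip(weights, data_grads, velocity):
        grad = g + 2.0 * lam3 * w
        v = momentum * v - lr * grad  # accumulate a decaying average of steps
        new_v.append(v)
        new_w.append(w + v)
    return new_w, new_v


# Toy parameters and back-propagated gradients of the data terms.
weights = [1.0, -2.0]
data_grads = [0.5, 0.0]
cost = overall_cost(1.0, 0.5, weights, lam1=1.0, lam2=0.5, lam3=0.01)
new_weights, velocity = sgd_momentum_step(
    weights, data_grads, velocity=[0.0, 0.0], lr=0.1, momentum=0.9, lam3=0.01)
```

Note that the second weight moves towards zero even though its data gradient is zero, which is the regularizing effect of the weight decay term.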

OCT B-scan slicing and data augmentation
Training of a deep ReLayNet model with full-width OCT images is limited by the available GPU memory. This requires us to train with a smaller batch size, which often leads to very noisy gradients during training, and the loss curve tends to diverge [26]. To address this issue, we used a data slicing approach wherein an OCT B-scan is sliced width-wise into a set of non-overlapping B-scan slices as shown in Figure 5. Further, we augment the sliced data by introducing random horizontal flips, slight spatial translations and cropping. It must be noted that due to the resolution-preserving nature of ReLayNet, at test time we use the whole B-scan image, thus obtaining a seamless segmentation without any slicing induced artifacts as shown in Figure 5.
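The slicing and flipping steps can be sketched as below. The slice width is not specified in the text, so the value used here is arbitrary; columns that do not fill a complete slice are simply dropped in this sketch, and translations and cropping are omitted. Function names are hypothetical.

```python
import random


def slice_bscan(image, slice_width):
    """Split a 2D image (list of rows) into non-overlapping vertical strips.

    Trailing columns that do not fill a whole strip are dropped.
    """
    width = len(image[0])
    return [[row[start:start + slice_width] for row in image]
            for start in range(0, width - slice_width + 1, slice_width)]


def random_flip(image, labels, rng):
    """Flip image and label map together, left to right, with probability 0.5."""
    if rng.random() < 0.5:
        return [row[::-1] for row in image], [row[::-1] for row in labels]
    return image, labels


# Toy 2 x 8 "B-scan" cut into two 2 x 4 strips.
image = [[c + 10 * r for c in range(8)] for r in range(2)]
strips = slice_bscan(image, slice_width=4)
augmented, augmented_labels = random_flip(strips[0], strips[0], random.Random(1))
```

The image and its label map must always be flipped together, otherwise the annotation no longer matches the pixels it describes.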

Dataset
The proposed framework is evaluated on the publicly available Duke SD-OCT dataset for DME patients [15]. The dataset consists of 110 annotated SD-OCT B-scan images of size 512 × 740 obtained from 10 patients suffering from DME (11 B-scans per patient). The 11 B-scans per patient were annotated centered at the fovea, with 5 frames on either side of the fovea (foveal slice and scans laterally acquired at ±2, ±5, ±10, ±15 and ±20 from the foveal slice). These 110 B-scans are annotated for the retinal layers and fluid regions by two expert clinicians. The details of the acquisition process are reported in [15].

Experimental settings
Following the standard convention of splitting this dataset as reported in [15], we divide the data with subjects 1-5 in the training set and subjects 6-10 in the testing set (55 B-scans in each set).
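A subject-wise split of this kind keeps all B-scans of a patient in the same set. The following sketch illustrates it with placeholder records of the form (subject, scan index); the record layout and function name are assumptions for illustration.

```python
def split_by_subject(scans, train_subjects):
    """Split (subject_id, scan_id) records so that no subject spans both sets."""
    train = [s for s in scans if s[0] in train_subjects]
    test = [s for s in scans if s[0] not in train_subjects]
    return train, test


# 10 subjects with 11 B-scans each, as in the dataset description.
scans = [(subject, scan) for subject in range(1, 11) for scan in range(11)]
train, test = split_by_subject(scans, train_subjects={1, 2, 3, 4, 5})
print(len(train), len(test))
```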

Comparative methods and baselines
The performance of the proposed ReLayNet is evaluated against state-of-the-art retinal OCT layer segmentation algorithms, specifically Graph based dynamic programming (GDP) [13] (CM-GDP), Kernel regression with GDP [15] (CM-KR) and Layer specific structured edge learning with GDP [20] (CM-LSE). To contrast ReLayNet with state of the art deep FCN architectures, we include comparisons with the U-Net architecture [19] (CM-Unet) and the FCN architecture proposed in [17] (CM-FCN). Due to limited training data, we reduced the layer depth of CM-Unet in comparison to the original architecture proposed in [19]. For a fair comparison, the depth, kernel size and number of channels are kept identical to ReLayNet. This effectively isolates our incremental contributions (unpooling layers with composite loss for training) as the contributing elements for the contrastive results. In addition to the above comparative methods, several plausible variations of ReLayNet have been set as baselines for comparison, specifically to highlight the importance of each of the proposed contributions. All the baselines are detailed below, with the salient aspects of each baseline listed in Table 1.

Evaluation metrics
The comparative analysis of segmentation performance is done based on 3 standard metrics as reported in [15,20]. These include the Dice overlap score (DS), the estimated contour error for each layer (CE) and the error in the estimated thickness map (MAD-LT) for each layer. The lateral resolution of the OCT B-scans is between 10.9µm and 11.9µm [15]. As the Duke SD-OCT dataset does not report the individual scan resolutions, we resort to reporting our error metrics in pixels as the nearest surrogate.
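Two of these metrics can be sketched on label maps as follows. The exact definitions used in [15,20] are not restated in the text, so the sketch assumes that DS is the usual Dice overlap per class and that MAD-LT is the mean absolute difference of per-column layer thickness in pixels; the contour error is omitted. The toy label maps are illustrative.

```python
def dice_score(pred, truth, label):
    """Dice overlap between predicted and true masks of one class."""
    pred_n = truth_n = both = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            pred_n += p == label
            truth_n += t == label
            both += p == label and t == label
    return 2.0 * both / (pred_n + truth_n) if pred_n + truth_n else 1.0


def layer_thickness(labels, label):
    """Number of pixels of `label` in each image column (A-scan)."""
    return [sum(row[c] == label for row in labels)
            for c in range(len(labels[0]))]


def mad_thickness(pred, truth, label):
    """Mean absolute thickness difference across columns, in pixels."""
    diffs = [abs(a - b) for a, b in zip(layer_thickness(pred, label),
                                        layer_thickness(truth, label))]
    return sum(diffs) / len(diffs)


# Toy label maps with 4 rows and 2 columns; class 1 is the layer of interest.
truth = [[0, 0], [1, 1], [1, 1], [2, 2]]
pred = [[0, 0], [1, 0], [1, 1], [2, 2]]
print(dice_score(pred, truth, 1), mad_thickness(pred, truth, 1))
```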

Qualitative comparison of ReLayNet with comparative methods
We present a qualitative comparison of ReLayNet with the comparative methods for two cases: a pathological OCT B-scan with DME (as shown in Figure 6) and an OCT B-scan sans fluid accumulation distal from the fovea (as shown in Figure 7).
OCT B-scan with fluid accumulation: The foveal scan (presented in Figure 6(a)) is representative of a challenging pathological case due to the existence of accumulated fluid masses and relatively thin retinal layers at the foveal region (as indicated by the yellow arrow in Figure 6). We further observe that a small fluid pool towards the right of the B-scan (shown with a white box) is successfully segmented by ReLayNet, while the CM-Unet, CM-FCN and CM-KR predictions fail to capture this small structure (Figure 6(b-d)). Evaluating segmentation performance of layers proximal to the fovea (indicated by the yellow arrow), the predictions of CM-GDP are observed to be highly smoothened with a lack of detail; in particular, it over-predicts class NFL-IPL and under-predicts the lower retinal layers. In comparison, CM-GDP and CM-LSE result in predictions with greater detail. However, these methods do not consider the presence of fluid while segmenting, and the resultant thickness maps may be erroneous at locations proximal to fluid structures. We also observe that the segmentations of ReLayNet and CM-Unet are of high quality and comparable to that of another human expert, indicating the promising potential of F-CNN based frameworks. We also note that CM-FCN performs very poorly on the fluid class and suffers from high confusion between the Fluid and RbR classes. This highlights the contribution of encoder-decoder based architectures and the use of a weighted loss function, which CM-FCN lacks in comparison to ReLayNet and CM-Unet.
OCT B-scan without fluid accumulation: The frame presented in Figure 7(a) is representative of a fluid-free OCT scan acquired distal from the fovea. We can observe consistent performance across the comparative methods. This observation also indicates that the comparative methods have been trained fairly and that the major distinction arises in the presence of pathology, where such an objective segmentation tool is often needed.

Quantitative comparison of ReLayNet with comparative methods
Towards quantitative evaluation of performance against the comparative methods, we report the three metrics, namely DS, MAD-LT and CE, for each of the retinal layers in Table 2. We also report the DS for the fluid class. From an overall perspective, ReLayNet demonstrates the highest segmentation efficacy in 9 classes out of 10, with DS above 0.9 for RaR, ILM, NFL-IPL, ONL-ISM, ISE and OS-RPE. CM-Unet has the second best performance for 5 classes among all the comparative methods. In the particular case of the ONL-ISM layer, ReLayNet has the second best performance in DS (0.93), which is highly comparable to the best performing comparative method, CM-LSE. Further, the OPL layer is the most challenging retinal layer to segment (evident from the low DS of 0.74 between two expert observers). In this challenging layer, ReLayNet achieves a DS of 0.84, which is a substantial improvement over the other comparative methods, with improvements of 0.17, 0.10 and 0.07 over CM-GDP, CM-KR and CM-LSE respectively. In addition to improved layer segmentation, we also observe substantial improvement in the segmentation of fluid masses and report a DS of 0.77. ReLayNet significantly outperforms CM-Unet and CM-FCN in fluid segmentation by margins of 0.10 and 0.49 respectively in DS. CM-FCN exhibits the worst performance for the fluid class, with a DS of 0.28, in comparison to all the other comparative methods.
In terms of the MAD-LT metric, ReLayNet achieves consistently superior performance for all the constituent layers. Specifically, CM-GDP has the worst performance in thickness estimation for the layers ILM, NFL-IPL, OPL and ONL-ISM. In contrast, CM-KR and CM-LSE exhibit performance comparable to each other and better than CM-GDP due to the GC related improvements incorporated in them. Our overall qualitative and quantitative analyses substantiate that ReLayNet performs better than the comparative methods owing to the introduction of key contributions including (i) the Dice loss function and (ii) the use of unpooling layers instead of convolution transpose layers or interpolation, which differentiates ReLayNet from CM-Unet and CM-FCN respectively. It also demonstrates that ReLayNet is able to estimate layer thickness better than graph-based comparative approaches and exhibits consistency across pathological variations, despite diffused layer boundaries and the presence of speckle noise.
Comparison with additional human expert observer: We also compare the agreement between the two human expert annotations (Expert 1 vs. Expert 2) with ReLayNet performance (ReLayNet vs. Expert 1) and report the observed metrics in Table 2. The low agreement between the two experts, reflected particularly by the low DS in the retinal layers INL (0.79), OPL (0.74), OS-RPE (0.82) and the fluid class (0.58), shows that the task of retinal segmentation is highly subjective and challenging. This substantiates our premise for the need for an objective solution. Comparing the Expert 2 annotations to those predicted by ReLayNet, we observe that ReLayNet has a higher agreement with the ground truth (Expert 1).

Importance of ReLayNet contributions
Importance of skip connections BL-1-3: Contrasting with BL-1 (sans any skip connections), we observe that ReLayNet outperforms it in segmentation performance across all the retinal layers and fluid masses. Particularly, a significant improvement of 0.09 in DS is observed in the fluid class. This is also reflected in MAD-LT and CE of the ONL-ISM layer, where ReLayNet improves over BL-1 by a margin of 0.6 and 0.08 pixels respectively. This observed improvement is owed to the introduction of skip connections, which improve the trainability of deep models and provide additional contextual information derived from encoder feature maps for improved segmentation [28]. To further understand the relative importance of various levels of skip connections within ReLayNet, we contrast with BL-2 (only coarse-resolution skip connections) and BL-3 (only fine-resolution skip connections). In addition to the consistently superior performance of ReLayNet, we particularly observe an increase of 0.08 and 0.11 in Dice score for the fluid class over BL-2 and BL-3 respectively. Specifically for the ONL-ISM layer, an increase in MAD-LT of 1.6 and 1.5 pixels for BL-2 and BL-3 respectively is observed. These observations affirm our premise that skip connections at all levels of resolution are highly contributory and that introducing them induces significant improvements in network performance.

Effect of joint loss functions BL 4-5:
We contrast ReLayNet with BL-4 (only weighted logistic loss) and BL-5 (only Dice loss) to ablatively test the effect of the joint loss. From Table 3, we observe that the loss functions are complementary in nature and improve segmentation performance. Particularly for ILM and OS-RPE, BL-5 is better than BL-4 by a margin of 0.03 and 0.05 DS. These two layers represent the two boundaries of the retina and are very susceptible to being confused with the background classes. A similar observation is made for retinal thickness estimation (increase in error of 1.3 and 1.6 pixels for ILM and OS-RPE respectively) and contour estimation (increase in error of 0.3 and 0.5 pixels for ILM and OS-RPE respectively) comparing BL-4 vs. BL-5. This effect is much more dominant with the joint action of Dice loss along with logistic loss as proposed for ReLayNet.
Effect of depth of network BL 6-7: The choice of network depth is closely related to model complexity. Ideally, we need a network that is sufficiently deep to learn a discriminative hierarchy of task-specific kernels while simultaneously not over-fitting to the training data. In light of this empirically accepted design rule, we explored three plausible architectures with varying depths as comparative baselines to the ReLayNet architecture. These architectures are addressed as x − 1 − x, which symbolizes an architecture with x encoder blocks, 1 low resolution bottleneck convolutional block connecting the encoder and decoder, followed by x matched decoder blocks. The architectures explored are 2 − 1 − 2 (BL-6), 3 − 1 − 3 (ReLayNet) and 4 − 1 − 4 (BL-7). The performance of all these architectures is reported in Table 3. We observe that the DS of BL-6 deteriorates in comparison to ReLayNet. Particularly, a decrease of 0.09 in Dice is observed for the fluid class, with similar trends in MAD-LT and CE. Contrasting with BL-7, we observe that for most of the classes the Dice performance is almost the same as for ReLayNet, except for the classes INL and OPL, with a decrease of 0.03 and 0.02 in Dice score. We can conclude that though this is not a conclusive case of over-fitting, the additional model complexity with increased depth offers limited improvement and the proposed layer configuration (3 − 1 − 3) is an optimal choice.
Importance of pixel-wise weighted loss function BL-8: In Sec. 3.3.2, we motivated the weighting scheme as a way to encourage learning kernels sensitive to layer transitions and to efficiently compensate for class imbalance. To ablatively test such a scheme, we introduce BL-8 sans any weighting (i.e., uniform weights in place of Eq. (3)) and tabulate the observed performance metrics in Table 3. Comparing against BL-4 (network with weighted logistic loss), we observe a fall of 0.06 in DS for the fluid class, which is the most under-represented class within the training data. In terms of MAD-LT, BL-8 exhibits the lowest performance for INL, OPL and ISE amongst all the baselines. For CE, BL-8 reports the lowest performance for INL, OPL, ONL-ISM and ISE amongst all the baselines. This fall in performance reaffirms the importance of a weighting scheme within the loss function, as motivated earlier, for better segmentation at layer boundaries and for compensating for class imbalance.

Folded cross validation
To fully utilize the benchmark dataset, we performed additional k-fold cross-validation by splitting the dataset into non-overlapping subsets of 8 patients for training and 2 held-out patients for testing. Within the training dataset, 8-fold cross-validation was performed, resulting in eight independently trained ReLayNet models. We also report the ensemble performance of these fold-wise models on the test data and compare it against the model trained with the standard 50-50% split (discussed earlier in Sec. 5.2). The fold-wise and ensemble results evaluated with the standard metrics Dice, MAD-LT and CE are tabulated in Table 4. We also contrast the ensemble results with the results of the standard split (tabulated as 50-50 Split in Table 4) and observe a significant increase of 6% in the Dice metric for the Fluid class and consistent improvements across the retinal layers. These observations support that testing with an ensemble of ReLayNet models leads to improved segmentation performance, at the cost of increased testing time.
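The fold construction and ensembling can be sketched as below. The text does not state how folds are formed or how the eight models are combined, so this sketch assumes one held-out training patient per fold and an ensemble that averages the per-pixel class probabilities before taking the argmax; both are assumptions, and the probabilities are toy values.

```python
def leave_one_out_folds(patients):
    """Yield (train, validation) patient lists, holding out one patient per fold."""
    for held_out in patients:
        yield [p for p in patients if p != held_out], [held_out]


def ensemble_predict(per_model_probs):
    """Average per-pixel class probabilities over models, then take the argmax."""
    n_models = len(per_model_probs)
    n_pixels = len(per_model_probs[0])
    labels = []
    for i in range(n_pixels):
        n_classes = len(per_model_probs[0][i])
        mean = [sum(m[i][k] for m in per_model_probs) / n_models
                for k in range(n_classes)]
        labels.append(mean.index(max(mean)))
    return labels


folds = list(leave_one_out_folds(list(range(1, 9))))  # 8 training patients
probs = [
    [[0.90, 0.10], [0.2, 0.8]],  # model A: two pixels, two classes
    [[0.40, 0.60], [0.1, 0.9]],  # model B
    [[0.45, 0.55], [0.4, 0.6]],  # model C
]
labels = ensemble_predict(probs)
```

In this toy example the first pixel is assigned class 0 by probability averaging even though two of the three models individually favor class 1, which shows how averaging differs from majority voting.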

Analysis of ETDRS grid across patients
Given a retinal OCT volume acquired for a specific subject, the thickness of the overall retina is generally reported as an Early Treatment Diabetic Retinopathy Study (ETDRS) grid, providing the average thickness in 9 different spatial zones of the retina [29]. The spatial zones are centered at the fovea and are delineated as shown in Figure 8. The severity of an edema is assessed based on the thickness of these spatial zones by assessing deviations from normal ranges. As the Duke SD-OCT dataset does not contain the local anatomical information (left / right eye) and annotations are made by sparsely sampling the OCT along the azimuthal direction, we assume in our evaluation that all the image volumes are aligned left to right with the temporal to nasal axis. The azimuthal resolution of the frames varies between 118µm and 128µm. We assume it to be 122µm for placing the frames in the grid. It must be noted that due to this assumption, the computed chart is not exact, but it does show the potential use of ReLayNet for computing the grid provided exact resolutions are known. We compute the overall retinal thickness by combining the delineated retinal layers and report the mean absolute difference in overall retinal thickness for the different comparative methods at each of the spatial zones in Table 5. We observe that the performance of ReLayNet is significantly better than all the comparative methods for all the 9 zones. Zone 1, indicating the foveal region (most clinically significant), has a very low error of 0.3 pixels. CM-Unet has the second best performance for 4 out of 9 zones (zones 5, 7, 8 and 9), especially in regions far from the fovea. The consistent performance of ReLayNet in thickness estimation makes it a good tool for estimating the ETDRS grid.
Table 5. Difference in retinal overall thickness (in pixels) for 9 zones in ETDRS grid across testing subjects. The best performance is shown by bold, the second best is shown by and the worst shown by †.
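Placing a measurement into the grid can be sketched as follows. The sketch uses the commonly cited ETDRS geometry (a central circle of 1 mm diameter, an inner ring up to 3 mm and an outer ring up to 6 mm, each ring split into four quadrants); the zone numbering of Figure 8 is not reproduced, so zones are returned by name. The axis orientation (+x nasal, +y superior) and the assumption that B-scans are stacked along the y axis are illustrative choices, and the function names are hypothetical.

```python
import math


def etdrs_zone(x_mm, y_mm):
    """Name of the ETDRS zone for a point relative to the fovea, or None.

    Assumes +x points nasally and +y points superiorly.
    """
    radius = math.hypot(x_mm, y_mm)
    if radius <= 0.5:
        return "central"
    if radius > 3.0:
        return None  # outside the 6 mm grid
    ring = "inner" if radius <= 1.5 else "outer"
    angle = math.degrees(math.atan2(y_mm, x_mm)) % 360.0
    if 45.0 <= angle < 135.0:
        quadrant = "superior"
    elif 135.0 <= angle < 225.0:
        quadrant = "temporal"
    elif 225.0 <= angle < 315.0:
        quadrant = "inferior"
    else:
        quadrant = "nasal"
    return f"{ring} {quadrant}"


def bscan_offset_mm(frame_offset, azimuthal_res_um=122.0):
    """Distance of a B-scan from the foveal slice, given its frame offset."""
    return frame_offset * azimuthal_res_um / 1000.0


# A point on the B-scan 20 frames away from the foveal slice.
zone = etdrs_zone(0.0, bscan_offset_mm(20))
```

With the assumed 122µm spacing, a frame offset of 20 corresponds to about 2.44 mm, so even the outermost annotated B-scans fall inside the outer ring of the grid.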

Conclusion
In this paper, we have proposed ReLayNet, an end-to-end fully convolutional framework for semantic segmentation of retinal OCT B-scans into 7 retinal layers and fluid masses. We train and validate it on a publicly available benchmark of expert annotated OCT B-scans acquired from 10 patients. The training of ReLayNet involves minimization of a combined loss comprising a weighted logistic loss and a Dice loss. ReLayNet is particularly suited for clinical applications owing to its improved test time, in the order of 0.01 seconds to segment a single B-scan. The proposed ReLayNet framework has been compared and validated against five state-of-the-art retinal layer segmentation methods, including ones using graph-based dynamic programming [13,15,20] and deep learning [17,19]. Additionally, comparisons have been reported against eight incremental baselines validating each of the individual contributions. The evaluation was performed on the basis of three standard metrics: Dice overlap score, retinal thickness estimation and deviation from layer contours. We demonstrate that ReLayNet exhibits superior performance in these comparisons and affirm that it can reliably segment even in the presence of a high degree of pathology which severely affects the normal layered structure of the retina. Open questions for future investigation include the extension of ReLayNet to intra-operative scenarios like retinal microsurgeries, which pose challenges of poor spatial resolution and artifacts induced by surgical tools. With increasing training data, one could potentially introduce 3D convolutional kernels to improve inter-frame consistency in volume segmentation.