Deep learning approach for the detection and quantification of intraretinal cystoid fluid in multivendor optical coherence tomography.

We developed a deep learning algorithm for the automatic segmentation and quantification of intraretinal cystoid fluid (IRC) in spectral domain optical coherence tomography (SD-OCT) volumes independent of the device used for acquisition. A cascade of neural networks was introduced to include prior information on the retinal anatomy, boosting performance significantly. The proposed algorithm approached human performance reaching an overall Dice coefficient of 0.754 ± 0.136 and an intraclass correlation coefficient of 0.936, for the task of IRC segmentation and quantification, respectively. The proposed method allows for fast quantitative IRC volume measurements that can be used to improve patient care, reduce costs, and allow fast and reliable analysis in large population studies.


Introduction
Age-related macular degeneration (AMD) is a complex multi-factorial retinal disease where genetic and environmental factors play a large role in the development of the disease. Since the disease is influenced by a wide array of intrinsic and extrinsic risk factors and protective factors, a universal one-fits-all treatment is arguably not the optimal solution as shown by the large percentage of non-responders to available therapies [1]. A common treatment option for exudative AMD is the injection of intravitreal anti-vascular endothelial growth factor (anti-VEGF). This treatment stops the growth of abnormal blood vessels that cause leakage of vascular fluid, blood and lipids in the retina [1,2]. Anti-VEGF treatment is typically administered monthly for an extended period of time, causing a high burden on the patient while not always resulting in improved vision [1,3]. Moreover, as anti-VEGF treatment is expensive, treatment on a monthly basis also causes a significant financial burden to the healthcare system. To improve patient care, and reduce cost, other treatment regimens such as the Pro Re Nata (PRN or "as needed") and the Treat-and-Extend (TE) regimes have been proposed, where re-treatment is only applied in case of re-occurrence of retinal bleeding or fluid accumulation [4]. However, these regimens require a constant assessment of the presence of fluid and its changes over time in order to provide personalized care and reduce unnecessary injections.
Fluid accumulation is best visualized on spectral domain optical coherence tomography (SD-OCT) imaging, an almost indispensable tool in the assessment of AMD treatment success. SD-OCT imaging provides a noninvasive, high resolution, three-dimensional visualization of the retina, where fluid is visible as a hyporeflective area. A clear distinction can be made on SD-OCT imaging between intraretinal fluid (IRF) and subretinal fluid (SRF) based on the relative location in the retina, i.e., inside the sensory retina or below it, respectively. For IRF a distinction can be made between intraretinal cystoid fluid (IRC) or cysts, and diffuse non-cystic IRF. Examples of SRF and both types of IRF are shown in Fig. 1. Detecting areas of fluid accumulation, and the specific type of fluid, in SD-OCT, is important in determining the best treatment option and treatment efficacy [5]. Especially the presence of IRC at baseline has been found to be an important factor related to decreased treatment response [6,7]. During treatment, detection of changes in IRC helps assessing treatment efficacy, especially in the PRN and TE treatment regimens, where retreatment depends on monitoring fluid changes. Accurate analysis of these changes by visual assessment is difficult and subjective, especially if they are small. Manual delineation of IRC does allow for an exact and reproducible objective comparison, but it is extremely time-consuming and often infeasible, especially in large scale population studies or time-constrained clinics. For this reason, computer-aided detection systems capable of automatically detecting and quantifying IRC are highly sought after [8]. In the last years, computer-based algorithms have shown their potential in the automatic analysis of retinal images and, particularly, in SD-OCT volumes [9]. Previously proposed works for automated analysis of SD-OCT volumes have reported good results in the automated segmentation of intraretinal layers, either for healthy or mildly affected retinas [10,11] or retinas with more severe pathology [12][13][14][15][16]. As the image quality and the amount of noise in OCT images can vary strongly, a significant amount of research has also been spent on the development of effective denoising algorithms, either to improve viewing quality or to improve performance of a consecutive segmentation algorithm [17][18][19][20]. Detection and segmentation of specific retinal lesions has also been proposed, e.g., segmentation of SRF [21-23], segmentation of geographic atrophy [24-26] and automated detection of drusen [27][28][29][30][31]. For the segmentation of IRF a distinction can be made in the previously published methods: while some methods focused on IRC [22, 32-36], other methods extend the segmentation to also include diffuse non-cystic IRF [37-41], making a direct comparison between methods difficult. One of the first attempts to automatically segment IRC was approached using a deformable model reporting good results in a small dataset, but requiring manual initialization for each lesion separately [41]. Segmentation of IRC was addressed in a previous work using a three-dimensional curvelet based k-singular value decomposition (K-SVD) and validated in four SD-OCT volumes acquired by a Heidelberg SD-OCT scanner, reaching a Dice overlap coefficient of 0.65 [32]. A method relying upon a transformation to the neutrosophic domain followed by a shortest path algorithm was evaluated on the same four Heidelberg SD-OCT volumes used by the previously stated method, reaching a Dice coefficient of 0.705 [35]. Recently, a modified version of this algorithm was published improving upon their previous results, achieving a Dice coefficient of 0.812 [36]. Unfortunately, evaluation was only performed on the same four Heidelberg SD-OCT volumes. Other authors proposed a filtering and thresholding based algorithm for IRC segmentation and evaluated it in a small dataset of 16 SD-OCT volumes obtained by a Cirrus SD-OCT scanner, showing good results but failing to detect small cysts [33]. A method based on k-means cluster analysis showed good performance, but was only evaluated in a small set of 130 lesions [42]. Other authors proposed a method based on the marker controlled watershed transform, showing good performance in a small set of 4 OCT volumes acquired by a single OCT device [43]. A semi-automatic variational segmentation algorithm has also been proposed for the segmentation of IRC achieving good performance [22]. While the authors claimed the method can be extended to other SD-OCT vendors, the algorithm was only validated on SD-OCT volumes obtained by a Heidelberg SD-OCT scanner. A convolutional neural network (CNN) was proposed for the segmentation of IRC using a multiscale patch based approach [34].
The method was validated in a large dataset of 157 SD-OCT volumes from a single vendor with good results. Recently, this method was extended and evaluated in a large dataset containing cases of AMD, diabetic macular edema (DME), and retinal vein occlusion (RVO) and reported good performance for segmentation and quantification of IRC [44]. Another deep learning based method obtained a Dice coefficient of 0.729, but instead of evaluating on SD-OCT volumes, the analysis was performed in a set of 934 B-scans with varying pathology acquired with a single SD-OCT device [39]. A CNN developed for the segmentation of 7 retinal layers and fluid masses showed good performance [45]. The method was, however, developed and evaluated in a small dataset containing 110 B-scans from 10 patients acquired by a single SD-OCT device. For the undifferentiated segmentation of IRC and diffuse IRF, a method based on kernel regression was proposed and validated in a dataset containing cases with diabetic macular edema (DME) acquired with a Heidelberg SD-OCT scanner [37]. The algorithm showed segmentation performance close to the interobserver variability. Fuzzy level-sets were applied for the segmentation of IRF obtaining good agreement with manual segmentations, but were evaluated in a small set of SD-OCT volumes acquired by a single SD-OCT device [40]. Other authors described a method using probability constrained graph cuts to segment so-called 'symptomatic exudate-associated derangement' or SEADs, which includes IRF, both cystoid and diffuse [38]. While the method showed reasonable performance, it relied heavily on a good initialization of the method. Recently a method for the simultaneous segmentation of retinal fluid and retinal layers was proposed using an auto-context approach [14], obtaining an undifferentiated segmentation of both diffuse and cystoid IRF. Although relatively poor results were reported in a private dataset containing Cirrus SD-OCT volumes, the method achieved comparable results in the DME dataset used by a previously discussed method [37]. Finally, in a recent benchmark study, several methods for IRC segmentation were compared and evaluated in a publicly available dataset [46]. Unfortunately, only data from two of the four SD-OCT devices present in the dataset were included for analysis. As described, the majority of the previously proposed methods were evaluated on SD-OCT data from a single SD-OCT vendor, limiting the assessment and the widespread use of these algorithms in data with different imaging characteristics.
In this work we propose a deep learning algorithm for the detection, segmentation and quantification of IRC in SD-OCT volumes acquired with different SD-OCT devices. The proposed algorithm is based on a fully convolutional neural network (FCNN), where IRC is segmented by performing a semantic segmentation of the SD-OCT volume, i.e., every pixel in an SD-OCT volume is analyzed and given a probability of belonging to IRC. The proposed FCNN implements a multi-scale analysis, providing a large contextual window that allows for accurate segmentation of a wide range of cysts, ranging from small micro-cysts to large intraretinal cysts spanning over a wide area of a B-scan. The large modeling capacity of the proposed FCNN allows it to learn a wide range of different complex features, capable of capturing the large variability in IRC appearance and vendor-dependent SD-OCT characteristics. Segmentation performance is further enhanced by the inclusion of an additional step for retina segmentation as a means to constrain the search space during training.
The algorithm performance is evaluated in 1) a large private database of SD-OCT volumes from AMD patients with advanced AMD with IRC; 2) a dataset of healthy controls; and 3) a publicly available dataset containing SD-OCT volumes with IRC from four different vendors.

Data
For this study a total of 221 SD-OCT volumes from 151 patients (6158 B-scans) with varying presence of IRC were randomly selected from the European Genetic Database (EUGENDA, http://eugenda.org), a large multi-center database for clinical and molecular analysis of AMD [47,48]. Written informed consent was obtained before enrolling patients in EUGENDA. The EUGENDA study was performed according to the tenets set forth in the Declaration of Helsinki, and Institutional Review Board approval was obtained.
SD-OCT volumes were acquired using a Spectralis HRA+OCT (Heidelberg Engineering, Heidelberg, Germany) at a wavelength of 870 nm, a transversal resolution ranging from 6 µm to 14 µm and an axial resolution of up to 3.9 µm. The dimension in the axial resolution was 496 pixels, in the transversal direction the dimensions varied between 512 and 1536 pixels. The number of slices, i.e. the number of B-scans, varied from 19 to 37, corresponding to a B-scan spacing ranging from 240 µm up to 120 µm, respectively. Before processing, to remove the variability in resolution, all B-scans from an SD-OCT volume were resampled to a constant pixel size of 11.5 µm x 3.9 µm.
The data were randomly divided into three sets (approximately 40/10/50 split on patient level): a training set, consisting of 103 SD-OCT volumes (3131 B-scans) from 60 patients, for the development and optimization of the algorithm; a validation set of 19 SD-OCT volumes (540 B-scans) from 16 patients, for monitoring the algorithm training process; and an independent In the training set full volume annotation of IRC and the total retina, i.e., the region between the inner limiting membrane (ILM) and the outer boundary of the retinal pigment epithelium (RPE), was performed. Annotation was performed by an experienced grader using a computer-assisted annotation platform which allows manual pixel level annotation and correction. In order to reduce annotation time, annotations were made using a semi-automated approach. An initial retina and IRC segmentation produced by a preliminary segmentation algorithm was proposed to the human grader, who was provided with tools for manual correction if errors were present in the initial annotation proposed by the system. Examples of annotated B-scans from the EUGENDA database are shown in Fig. 2.
In the first test set annotation of IRC was performed by the human grader using the same computer-assisted annotation platform as used for the training set.
In order to assess the inter-observer variability 50 B-scans containing IRC were randomly selected from the second test set for annotation. These 50 B-scans were annotated by three independent observers, of which one was selected as reference standard. Annotations were made without the support of the computer assisted annotation platform.
To assess the generalizability of the proposed algorithm to multiple vendors an external set was created from a publicly available database (OPTIMA) containing 30 SD-OCT volumes acquired by four different SD-OCT devices [49], namely: • Heidelberg Spectralis HRA+OCT (Heidelberg Engineering, Heidelberg, Germany) • Zeiss Cirrus (Carl Zeiss Meditec, Dublin, CA, USA) Vol. 9, No. 4 | 1 Apr 2018 | BIOMEDICAL OPTICS EXPRESS 1551 Examples of annotated B-scans for each of the four devices are shown in Fig. 3. This external dataset is comprised of a training set of 15 SD-OCT volumes, and a test set also containing 15 SD-OCT volumes. Both the training and test subsets are comprised of four volumes per vendor aside from Nidek with three. For further details and information concerning Fig. 4. Overview of the proposed algorithm for IRC segmentation. The first FCNNs responsible for retina segmentation is visualized in green, while the second FCNN responsible for IRC segmentation is shown in red. The retina segmentation produced by the first FCNN is stacked together with the input B-scan to form a two-channel input for the IRC segmentation.
the inclusion criteria we refer to the paper describing the data set [49]. Manual pixel level IRC annotations made by two independent observers were provided for both the training and test set in the external dataset. For this study we used the intersection of the two observers, i.e., where both observers agree on the presence of IRC, as the reference standard for training the algorithm.

Deep learning algorithm
The proposed deep learning algorithm automatically detects, segments and quantifies IRC in the entire SD-OCT volume. The algorithm produces an IRC volume segmentation by processing every B-scan in an SD-OCT volume individually. After processing all B-scans in a volume the resulting segmentations are combined to form the output volume segmentation. The algorithm consists of a cascade of two FCNNs with two complementary tasks: the first FCNN aims at delimiting the total retinal area; and the second FCNN aims at segmenting IRC by integrating the output of the first FCNN in the optimization process. This allows for the inclusion of a priori knowledge of the retinal anatomy and improves segmentation performance. Figure 4 gives a schematic overview of the entire processing pipeline with the two cascaded FCNNs indicated in green and red. The specifics for each task will be explained in more detail in the two following subsections. The FCNN architecture used for these tasks and their training procedures will be explained in subsections 2.2.3 and 2.2.4, respectively.

First task: Retina segmentation
As IRC is restricted to the neurosensory retina, information about the relative location in the retina can be beneficial in determining if a pixel is part of IRC. For example, fluid closer to the RPE is more likely to be SRF than IRC. This a priori information about the retina anatomy can be extracted by identifying and delimiting the retina, i.e. by segmenting the retina. Besides location information, retina segmentation also allows to focus the learning process within the limits of the retina area, ignoring other areas present in the scan, such as the choroid or the vitreous.
Building upon our previous work [12], we develop an FCNN to obtain a semantic segmentation of the retina, i.e. every pixel in an SD-OCT volume is analyzed and given a probability of belonging to the retina. The architecture of this FCNN is specifically developed for robust segmentation of the retina in the presence of disruptive pathology such as IRC, explained in section 2.2.3. The output of this first step, after thresholding, is a binary image indicating all the pixels between the ILM and the RPE. The weight for background pixels (black) is set to 0, retina pixels (blue) get a weight of 1, and IRC pixels (red) get a weight between 0 and 5.

Second task: Intraretinal cystoid fluid segmentation
For the second task a FCNN is developed to obtain a semantic segmentation of IRC, i.e. a pixelwise classification of the full SD-OCT scan indicating the areas of IRC. In order to introduce a priori information about the retina anatomy, this FCNN integrates the output produced by the previous step using two different approaches: 1) as an additional input, and 2) as a constraint of the network training process. For the first approach, additional input, the binary retina segmentation produced by the retina segmentation step is stacked together with the input B-scan and provided as a two-channel input to the FCNN. This addition allows the network to learn location specific features that can improve segmentation performance. For the second approach, the retina segmentation output is used to create a weight map that give locations within the retina more importance during training, ignoring areas outside the retina, i.e., choroidal tissue and vitreous fluid. Given a B-scan B i from the training set, its obtained retina segmentation R i and the manually annotated IRC S i considered as reference standard (see Fig. 5), the weight map for each location x in B i is defined as: With N r etina the total number of pixels that are part of the retina (excluding IRC) in B i , and N I RC the number of pixels that are part of IRC in B i . This weight map is then introduced in the loss function of the developed FCNN in order to focus the training process on those areas with higher weight, such as IRC, and disregard less important parts of a B-scan such as the vitreous fluid or choroidal tissue. An example of the weight map is shown in Fig. 5(d), where the weight for background pixels is set to 0, retina pixels get a weight of 1, and IRC pixels get a weight between 0 and 5.

FCNN architecture
The FCNN architecture used in both tasks is based on U-net, a fully convolutional deep learning approach specifically designed for dense segmentation tasks [50]. The U-net architecture designed for this problem is visualized in Fig. 6. U-net can be subdivided in a left and right side, where the left side is responsible for obtaining contextual information, i.e., what, while the right side is responsible for accurate localization, i.e., where. Due to the successive application of max pooling operations in the left side of the network, indicated by the orange downward arrows in Fig. 6, the spatial context, or receptive field, increases, allowing the input image to be analyzed at multiple scales. At every consecutive scale the spatial context is doubled while the resolution is halved. To recover a segmentation at full resolution, the right side of the "U" performs a series of upsampling operations indicated by the green upward arrows in Fig. 6. Finally, shortcut connections going from the left part of the "U" to the right part are added to integrate high resolution information from the left side with low resolution information from the right side.
The standard U-net architecture has a receptive field of 140x140 pixels, corresponding to the largest red square in Fig. 7. An increase in contextual information is required to account for large deformations in the retina, and to allow correct segmentation of large IRC lesions. We therefore increase the receptive field by adding two additional downsampling steps. The receptive field size increases to 572x572 pixels, corresponding to the largest green square in Fig. 7, i.e., the context is as large as the entire B-scan.

FCNN training procedure
The network parameters of the defined FCNNs were optimized by training on randomly selected samples from the training data. The two models were individually trained until convergence was reached. Convergence was determined by calculating the performance on the independent validation set at regular intervals during training. Since the proposed network belongs to the subclass of FCNNs, i.e., does not contain any fully connected layers, learning and inference are performed using whole images by dense feedforward computation and backpropagation, increasing processing speed compared to patch-based networks. At every iteration, a small subset of B-scans (mini-batch) and the corresponding IRC annotations were randomly selected from the training data and used to optimize the system parameters. To maximize the throughput of the GPU the mini-batch size was set to process six B-scans at the same time. To increase robustness of the system, we increase the variance in the training dataset by applying an extensive data augmentation strategy. The following data augmentations are randomly applied to every B-scan selected for training: • Random cropping of a 512x512 subimage from a B-scan (additional zero padding if necessary) • Random rotation between -15 and +15 degrees • Random mirroring • Random multiplicative speckle noise with a magnitude between 0 and 0.4 The limits have been selected in such a way that after augmentation the resulting B-scan is a plausible example observable in clinical practice. The data augmentation process is visualized in Fig. 8. Fig. 8. Data augmentation strategy. The training dataset is synthetically increased by data augmentation. The following augmentations are successively applied: random rotation, random cropping, random mirroring and speckle noise addition.
The proposed CNN was trained using stochastic gradient descent with a learning rate starting at 10 −3 up to 10 −6 . The learning rate was manually decreased by a factor of 10 whenever a plateau was reached in the learning curve. Momentum was set to 0.99 in order to include a large amount of previously seen samples in the current update step. He-normal initialization was used for weight initialization of the network. The network architecture and training procedures were implemented in Python 2.7 using the Theano [51] and Lasagne [52] packages for the deep learning framework. Training of the network was performed on a Nvidia titan X GPU using CUDA 7.5 with cuDNN v4 and CNMeM optimizations enabled. Convergence of the network was reached in about 36 hours. When the training procedure is finished, the time required to produce a IRC segmentation is about 5 seconds for a complete OCT volume.

Evaluation design
We perform three different experiments to evaluate the performance of the proposed deep learning algorithm in three different scenarios.

Segmentation of IRC
To assess whether we can accurately segment IRC we compare the segmentation performance of our proposed algorithm to manual annotations by human observers. Segmentation performance is quantified using two different metrics: 1) the Dice similarity coefficient (DC), a statistic that quantifies the overlap between two binary images, e.g., ground truth and segmentation output, and 2) the area segmentation error (ASE), i.e., the difference between the area annotated in the ground truth and the segmented area by the proposed method in mm 2 . The DC is assessed on the total volume, while the area segmentation error is assessed on a B-scan level.
To assess whether the addition of the retina segmentation as a means to include prior information is effective, we perform four different experiments: 1) using the retina segmentation as a weight map during training; 2) using the retina segmentation as an additional input to the system; 3) using both methods to include prior information simultaneously; and 4) without using the retina segmentation. These experiments are performed in the first test set containing 53 SD-OCT volumes, the second test set containing 10 SD-OCT volumes, and the third test set containing 33 OCT volumes of healthy controls. Four different approaches to include prior information are compared, i.e., without using prior information (red), using the retina segmentation as an additional input (yellow), using the retina segmentation as a weight map during training (blue), and finally the proposed method where both techniques to include prior information are used together (green).

Quantification of IRC volume
To investigate the ability of the proposed method to make an accurate quantification of the volume of IRC in the retina, we compare the total segmented volume in µl to the volume annotated by independent human observers. This experiment is performed in the second test set consisting of 50 B-scans from 10 independent SD-OCT volumes. Volume quantification performance is assessed by the intraclass correlation coefficient (ICC) and Pearson's correlation coefficient (ρ), two commonly used summarizing measures to indicate how well the proposed method and the two observers resemble the ground truth IRC volume. Finally, the volume differences are visualized using Bland-Altman plots comparing the proposed method and the two observers to the ground truth. This experiment is performed for all four approaches to include prior information.

Generalization to multiple OCT vendors
To assess the performance of the system to segment SD-OCT volumes acquired by different SD-OCT devices, we compare the output of the system to manual annotations by two human observers in the external dataset consisting of SD-OCT volumes acquired by four different SD-OCT devices. The proposed method was retrained on the EUGENDA dataset extended with the training data from the external dataset (15 OCT volumes, 3-4 cases per vendor) Finally, the analysis is performed on the 15 OCT volumes from the test set in the external dataset.
The output of the proposed system is compared to reference standard 1 (i.e., the intersection of both observers), and reference standard 2 (i.e., the intersection of both observers where pixels from which the observers disagreed were not considered for the evaluation). This means that IRC segmented by the proposed method that overlaps with a region annotated by only a single observer does not contribute to oversegmentation, i.e., are considered as not segmented. This reference standard allows evaluation of the method on a more certain ground truth, and does not penalize segmented IRC in regions for which observers are uncertain.
Using these two reference standards, we report the average and median DC. Additionally, we Table 1. Dice coefficients, i.e., mean ± standard deviation (median) and area segmentation error obtained on the first test set using four different approaches to include prior information, i.e., 1) by using the retina segmentation as a weight map during training; 2) using the retina segmentation as an additional input to the system ; 3) both methods to include prior information used simultaneously; and 4) without using the retina segmentation. include the inter-observer variability between the two observers to provide an indication of the maximum achievable human performance.

Segmentation of IRC
The boxplots visualizing the DC and the ASE obtained when comparing the output of the system to the ground truth for all 53 SD-OCT volumes in the first test set are shown in Fig. 9. The proposed method achieves an average and median DC of 0.775 ± 0.208 and 0.852. When the prior information from the retina segmentation is not used an average and median DC obtained of 0.698 ± 0.237 and 0.777 is obtained, respectively. Similarly for the ASE, the proposed method achieves an ASE of 0.000 812 mm 2 ± 0.032 mm 2 . When the prior information from the retina segmentation is not used an ASE of 0.0187 mm 2 ± 0.0389 mm 2 is obtained. Results obtained when each approach to include prior information is used individually are shown in Table 1.
The DC and segmentation error obtained on the second test set for the proposed method and the human observers are shown in Fig. 10. The proposed method achieves an average and median DC of 0.754 ± 0.136 and 0.778, respectively. The average and median DC obtained without using prior information are 0.701 ± 0.184 and 0.755, respectively. Observer 1 obtained an average and median dice of 0.783 ± 0.103 and 0.805, respectively, while observer 2 obtained an average and median dice of 0.783 ± 0.097 and 0.786, respectively.
When considering the ASE, the proposed method achieves an value of 0.022 mm 2 ± 0.040 mm 2 compared to the observers with an ASE of 0.0046 mm 2 ± 0.030 mm 2 and −0.0185 mm 2 ± 0.020 mm 2 for observer 1 and observer 2 respectively. The ASE obtained without using prior information is 0.061 mm 2 ± 0.066 mm 2 . Results obtained when each approach to include prior information is used individually are shown in Table 2.
Finally, the boxplots visualizing the segmentation error obtained in the control test set are shown in Fig. 11. As no IRC is present in this set, the DC can not be defined. For the ASE, the proposed method, which included both approaches to include prior information, achieves an ASE of 0.000 17 mm 2 ± 0.0011 mm 2 . When no prior information is exploited, an ASE of 0.000 36 mm 2 ± 0.0015 mm 2 is obtained. Results obtained when each approach to include prior information is used individually are shown in Table 3. Figure 12 shows the Bland-Altman plots visualizing the volume difference on the second test set when comparing the output segmentation to the manual ground truth for each of the four different approaches to include prior information. The quantification performance of the two human observers is also included. Furthermore, in Table 4 the performance for each approach to include prior information is shown. Fig. 10. Boxplots showing the distribution of (a) the dice coefficients and (b) the area segmentation errors in test set 2 Four different approaches to include prior information are compared, i.e., without using prior information (red), using the retina segmentation as an additional input (yellow), using the retina segmentation as a weight map during training (blue), and finally the proposed method where both techniques to include prior information are used together (green). The performance of the two human observers is also visualizes (brown, white) Table 2. Dice coefficients, i.e., mean ± standard deviation (median) and area segmentation error obtained on the second test set using four different approaches to include prior information, i.e., 1) by using the retina segmentation as a weight map during training; 2) using the retina segmentation as an additional input to the system ; 3) both methods to include prior information used simultaneously; and 4) without using the retina segmentation. The proposed method obtained an average absolute volume difference (AVD) of 0.0334 µl ± 0.0324 µl, where observer 1 and observer 2 obtained an average AVD of 0.0175 µl ± 0.0253 µl and 0.0202 µl ± 0.0181 µl, respectively. The systematic difference for the proposed method is 0.0221 µl with the 95% limits of agreement at −0.0581 µl and 0.1022 µl, for the lower and upper bound, respectively. The systematic difference for observer 1 and observer 2 was −0.0186 µl and 0.0047 µl, respectively. The lower and upper bound of the 95% limits of agreement for observer 1 were at −0.0572 µl and 0.0200 µl, and at −0.0549 µl and 0.0642 µl for observer 2.

Quantification of IRC volume
Overall agreement with the ground truth reference volume was established using the ICC and Pearson's correlation coefficient (ρ). The proposed method achieved an ICC (ρ) of 0.936 (0.949), where observer 1 and observer 2 obtained an ICC (ρ) of 0.978 (0.990), and 0.975 (0.978), respectively. Finally, in Fig. 13 qualitative results and the corresponding annotations by the two graders are shown for two representative cases. Fig. 11. Boxplots showing the distribution of the area segmentation errors in the control test set. Four different approaches to include prior information are compared, i.e., without using prior information (red), using the retina segmentation as an additional input (yellow), using the retina segmentation as a weight map during training (blue), and finally the proposed method where both techniques to include prior information are used together (green). Table 3. Area segmentation error obtained on the control test set using four different approaches to include prior information, i.e., 1) by using the retina segmentation as a weight map during training; 2) using the retina segmentation as an additional input to the system ; 3) both methods to include prior information used simultaneously; and 4) without using the retina segmentation.

Generalization to multiple OCT vendors
When applying the proposed method trained on EUGENDA data, extended with additional vendor specific training data from the external dataset, an average Dice coefficient of 0.738 ± 0.159 is obtained when comparing against the intersection of the two observers (reference standard 1). When excluding the regions where the two observers disagree from the evaluation (reference standard 2) an average dice coefficient of 0.786 ± 0.174 is obtained. The dice coefficients obtained for the subgroups of vendor specific data are shown in Table 5. Finally, example segmentation results and the corresponding annotations by the two graders are shown for B-scans acquired by a Cirrus, Nidek, Spectralis and Topcon scanner, in Fig. 14, Fig. 15, Fig. 16 and Fig. 17, respectively.

Discussion
In this study we assessed the performance of a deep learning algorithm for the automated detection, segmentation and quantification of IRC in SD-OCT imaging. A novelty of the developed system is the applicability in SD-OCT volumes acquired by SD-OCT devices from different vendors with widely varying imaging quality and image characteristics. Furthermore, the proposed method implements a multiscale analysis of an input B-scan, allowing accurate segmentation of both Fig. 12. Bland altman plots visualizing the volume difference on the second test set using four different approaches to include prior information, i.e., (a) by using the retina segmentation as a weight map during training; (b) by using the retina segmentation as an additional input to the system; (c) by using both methods to include prior information simultaneously; and (d) without using the retina segmentation. The performance of the two human observers is shown in (e) and (f), respectively. Table 4. Absolute volume difference, intraclass correlation coefficient (ICC) an Pearson's correlation coefficient (ρ) on the second test set using four different approaches to include prior information, i.e., 1) by using the retina segmentation as a weight map during training; 2) by using the retina segmentation as an additional input to the system; 3) by using both methods to include prior information simultaneously; and 4) without using the retina segmentation. The performance of the two human observers is also included. small and large intraretinal cysts. Finally, segmentation performance is improved by embedding a priori information about retinal anatomy in the algorithm. The proposed algorithm was trained and validated in a set of 122 SD-OCT volumes, and finally evaluated in three independent test sets of which the first contains 53 semi-automatically annotated SD-OCT volumes, the second contains 10 manually annotated SD-OCT volumes, and the third contains 33 SD-OCT volumes acquired from healthy controls. The SD-OCT volumes were all acquired using a Heidelberg Spectralis SD-OCT device. The proposed algorithm was applied in all test sets using the Dice similarity coefficient and the area segmentation error as  Fig. 9(a) and Fig. 10(a), respectively. Similarly, the area segmentation error is visualized in Figs. 9(b) and 10(b). To assess whether the added retina segmentation step in the proposed algorithm provides additional information to improve segmentation performance, four different experiments were performed: 1) the retina segmentation was used as a weight map during training; 2) the retina segmentation was used as an additional input to the system; 3) both methods to include prior information were used simultaneously; and 4) without using the retina segmentation. Without including the retina segmentation an average Dice coefficient of 0.698 ± 0.237 was obtained in the first test set, whereas the Dice coefficient increased to 0.775 ± 0.208 when both methods to include prior information are used. Similarly, for the second test set, the Dice coefficient increased from 0.701 ± 0.184 to 0.754 ± 0.136, when using both approaches to include prior information. These result supports the hypothesis that the information contained in the retina segmentation can be exploited to improve segmentation performance.
This claim is further supported by the observed decrease in the area segmentation error when including the retina segmentation. In the first test set, the average ASE drops from 0.0187 mm 2 ± 0.0389 mm 2 to 0.000 812 mm 2 ± 0.032 mm 2 when using both methods to include prior information. The same effect can also be observed in the second test set, where the ASE improves from 0.061 mm 2 ± 0.066 mm 2 to 0.022 mm 2 ± 0.040 mm 2 .
When looking at the performance gain for each of the two approaches to include prior information individually, it can be concluded that using the retina segmentation as a weight map during training has a bigger impact on performance compared to adding the retina segmentation as an additional input. However, the best performance is still obtained when both approaches are used simultaneously.
Several outliers with low Dice coefficients are observer in the first test set as indicated in Fig. 9. While the use of the retina segmentation offers some improvement, there are still cases for which the Dice coefficient is unsatisfactory. After visual inspection of the outliers it was found that these cases typically only contained small cysts that were either missed by the proposed method, or for which the output segmentation was slightly different from the ground truth annotation. The Dice coefficient is known to be very sensitive when the object to be segmented is small, strongly penalizing small errors resulting in a low dice coefficient for these specific cases. An interesting outlier is the case for which a dice coefficient of zero was obtained, this case is shown in Fig. 18. The proposed method detected a IRC like structure that was not present in the ground truth annotation. After closer inspection, the method mistakenly segmented a small outer retinal tubulation (ORT), retinal pathology with visual characteristics similar to IRC. This type of error can be explained by the underrepresentation of ORT in the training set. Adding cases with IRC confounders like ORT will likely resolve this issue.
In the second test, in addition to a single reference standard, two additional human graders manually annotated the cases, both achieving an average Dice coefficient of 0.783 ± 0.103 and 0.783 ± 0.097 when comparing them to the reference standard. While human performance is still higher with respect to the proposed method 0.754 ± 0.136, the performance gap is small. Qualitative results showing the segmentation output from the proposed method and the manual annotations by the two human observers are shown in Fig. 13. While segmentation in itself is an important task, fluid quantification might arguably be more clinically relevant as the total volume can be directly used as a (prognostic) biomarker. We therefore compare the total segmented IRC volume for the proposed method and both observers to the annotated ground truth volume. The proposed method achieves an average absolute volume manually annotated ground truth volume. This ICC is comparable to the two observers with ICC (ρ) values of 0.975 (0.978) and 0.978 (0.990), respectively. This suggests that our proposed method can serve as a reliable tool that will produce a fluid quantification similar to human performance in a fraction of the time. This statement is further supported by the Bland-Altman plots in Fig. 12, showing quantification performance that is highly correlated to both human observers. Moreover, the obtained Pearson's correlation coefficient obtained by our proposed method (0.949) compares favorably to the performance reported on IRC quantification in a recently published method [44] (0.86) using a similar deep learning approach and evaluated on data acquired using the same OCT device (Heidelberg Spectralis). In addition to the two private test sets, the performance of the proposed algorithm was also assessed in a publicly available external dataset consisting of SD-OCT volumes acquired by four different SD-OCT devices. When applying the proposed method, which was retrained on EUGENDA data extended with a small amount of vendor specific data (3-4 cases per vendor), an average Dice coefficient of 0.786 ± 0.174 is obtained based on the analysis of the regions for which both observers are in agreement. The results for the individual SD-OCT devices are shown in Table 1. The level of performance obtained is approaching the interobserver variability between the two observers of 0.846 ± 0.049.
It is interesting to note that without retraining, i.e., without adding vendor specific data to the training set, the proposed method already generalizes well across different datasets reaching an average Dice coefficient of 0.721 ± 0.196. It is surprising that without specifically optimizing for data acquired by different SD-OCT devices, the proposed method already produces adequate results. This indicates that without optimizing for data acquired by a specific SD-OCT device the proposed method already generalizes well across different datasets. Moreover, by adding only a few vendor specific SD-OCT volumes the proposed method is able to transfer and apply the knowledge obtained from one vendor to data from another vendor efficiently.
In Table 1, a difference in performance can be observed when assessing the performance of the proposed system on each vendor independently. Especially the large variance in segmentation performance for the Cirrus and Nidek device is apparent, both having cases for which the dice coefficient is substantially lower compared to the average performance. This variability in performance can largely be attributed to the varying imaging quality obtained by the Cirrus and Nidek devices. High noise and low image contrast makes IRC segmentation substantially more difficult. An example Cirrus case is shown in Fig. 19, where poor image contrast caused the proposed algorithm to fail in detecting IRC. However, when image quality is sufficient, the performance is similar or even higher compared to the performance on data from the other SD-OCT devices as indicated by the median Dice coefficients of 0.865 and 0.895 in Table 1 for the Cirrus and Nidek device, respectively. Qualitative results shown in Fig. 14  show a good correspondence to both of the human observers for the Cirrus and Nidek scanner, respectively. It has to be noted that preprocessing steps to remove image noise and improve image quality are typically used to improve performance. Instead of removing noise from the images, we chose to apply noise augmentation during the training phase of the system, i.e., random noise is randomly added to an input image to make the system more robust to variations in noise [53]. A more advanced denoising step might however give rise to an increase in performance. As shown in Table 1 the performance on the Spectralis and Topcon scanners is more consistent, with a lower variance in the obtained dice coefficients, and approaches the interobserver variability. The average Dice coefficient compared to the intersection of both observers in the subset op Topcon cases reached a value of 0.796 ± 0.072, where an average Dice coefficients of 0.775 ± 0.067 was obtained for the Spectralis cases. This compares favorably to the results by two previously published methods that were evaluated on the same external dataset, i.e., a three-dimensional curvelet based K-SVD approach that obtained an average Dice coefficient of 0.652 [32], and a method built around a transformation to the neutrosophic domain with a Dice coefficient of 0.705 [35]. Furthermore, these other methods were not evaluated in SD-OCT volumes from vendors other than the Heidelberg Spectralis. Finally, visual examples of our performance in the Spectralis and Topcon dataset are shown in Fig. 16 and Fig. 17, respectively.
It has to be noted that the performance of the proposed method is only evaluated in a dataset containing IRC as a result of AMD. In future work a more extensive evaluation of the method on IRC as a result of DME and RVO will be included. Nonetheless, considering the performance of the proposed method, which approaches the interobserver variability, a few clinical applications are to be considered or are within reach. One such application would be the automated tracking of fluid volume changes in consecutive SD-OCT volumes. Quantitative volume measures can help guide PRN and TE treatment regimens, where fluid change is the main determining factor in the decision for retreatment. Volume quantification can detect non-response in an early stage, allowing for improved personalized patient care, resulting in better patient outcome and a reduction of the financial burden anti-VEGF treatment places on the healthcare system. Another possible application of the proposed method is in the analysis of genotype-phenotype correlations in large population studies. Selecting homogeneous subgroups based on quantification of IRC, is time-consuming or even infeasible. The proposed method with performance in the range of human graders allows for fast processing of such large datasets.
In conclusion, we developed a fully automated system to detect and segment IRC in SD-OCT volumes independent of the SD-OCT device used for acquisition. The system proved to have excellent performance compared to that of two expert human observers on data from four different SD-OCT vendors. The proposed method allows for fast quantitative volume measurements that can be used to improve patient care, reduce costs, and allow fast and reliable analysis in large population studies. Our automatic approach may be considered to be applied as a fast, reliable and cost-effective method in retinal research and clinical practice.

Acknowledgments
The funding organizations had no role in the design or conduct of this research. They provided unrestricted grants.

Disclosures
The authors declare that there are no conflicts of interest related to this article.