Automated identification of cone photoreceptors in adaptive optics optical coherence tomography images using transfer learning

: Automated measurements of the human cone mosaic requires the identiﬁcation of individual cone photoreceptors. The current gold standard, manual labeling, is a tedious process and can not be done in a clinically useful timeframe. As such, we present an automated algorithm for identifying cone photoreceptors in adaptive optics optical coherence tomography (AO-OCT) images. Our approach ﬁne-tunes a pre-trained convolutional neural network originally trained on AO scanning laser ophthalmoscope (AO-SLO) images, to work on previously unseen data from a diﬀerent imaging modality. On average, the automated method correctly identiﬁed 94% of manually labeled cones when compared to manual raters, from twenty diﬀerent AO-OCT images acquired from ﬁve normal subjects. Voronoi analysis conﬁrmed the general hexagonal-packing structure of the cone mosaic as well as the general cone density variability across portions of the retina. The consistency of our measurements demonstrates the high reliability and practical utility of having an automated solution to this problem.


Introduction
Adaptive optics (AO) techniques have been used to facilitate the visualization of the retinal photoreceptor mosaic in ocular imaging systems by improving the lateral resolution [1][2][3]. AO techniques have been combined with scanning light ophthalmoscope (AO-SLO) [4], flood illumination ophthalmoscopy (AO-FIO) [2,5] and optical coherence tomography (AO-OCT) [6]. Similar to AO-SLO and AO-FIO, AO-OCT systems can provide high-resolution en face images of the photoreceptor mosaic in which the cones appear as bright circles surrounded by dark regions [7][8][9], but with the added benefit of high axial resolution which allows for cross-sectional tomography images in which the different outer retinal layers can be clearly delineated.
High-resolution images acquired using AO-assisted imaging systems have been used to investigate various changes in the appearance of the photoreceptor mosaic in both normal eyes [10-13] and eyes with degenerative retinal diseases, such as cone-rod dystrophy [14-16], retinitis pigmentosa [17][18][19] and occult macular dystrophy [20,21]. Although these images can be qualitatively useful, quantitative analysis of different morphometric properties of the mosaic is generally preferred and requires identification of individual cones.
As manual segmentations are both subjective and laborious, several automated methods for detecting cones in AO images using a variety of traditional image processing techniques such as local intensity maxima detection [22-25], graph-theory and dynamic programming (GTDP) [26], and estimation of cone spatial frequency [27-29] have been developed. Although these mathematical methods work for the specific data for which they were designed, reliance on ad hoc rules and specific algorithmic parameters does not allow for alternative imaging conditions, such as different resolutions, areas within the retina, and imaging modalities.
A new alternative approach is using deep convolution neural networks (CNNs) where features of interest are learned directly from data. This allows for a higher degree of adaptability as the same machine learning algorithm can be re-purposed by using different training images given a sufficiently large training set [30]. Several of these networks have shown high performance for many different image analysis tasks, including ophthalmic applications [31]. For example, CNNs have been used for the segmentation of retinal blood vessels [32,33], and detection of diabetic retinopathy [34] in retinal fundus images, and classification of pathology [35] or segmentation of retinal layers [36] and microvasculature [37] in optical coherence tomography (OCT) images. More recently, a CNN using a large dataset of manually marked images for training, has been developed to identify cones in AO-SLO images [38].
Using supervised deep learning approaches for quantification of the photoreceptors requires manually marked images. Unfortunately, a large dataset of manually marked images from an AO-OCT system does not currently exist and the construction of a model on an inadequate amount of data can have a negative impact on performance by causing overfitting. One method of addressing a small dataset is to use data from a similar domain, a technique known as transfer learning. Transfer learning has proven to be highly effective when faced with domains with limited data [39][40][41]. Instead of training a new network, by fixing the weights in certain layers of a network already optimized to recognize general structures from a larger dataset, and retraining the weights of the non-fixed layers, the model can recognize features with appreciably fewer examples [42]. In this study, we present an effective transfer learning algorithm for retraining a CNN originally trained on manually segmented confocal AO-SLO images in order to detect cones in AO-OCT images with a different field-of-view (FOV). Three different transfer learning techniques were applied and compared against manual raters.

AO-OCT dataset
All AO-OCT subject recruitment and imaging was performed at the Eye Care Centre of Vancouver General Hospital. The project protocol was approved by the Research Ethics Boards at the University of British Columbia and Vancouver General Hospital, and the experiment was performed in accordance with the tenets of the Declaration of Helsinki. Written informed consent was obtained by all subjects. The 1060nm, 200 kHz swept source AO-OCT system used in this study is similar to a previously described prototype [8], but a fixed collimator was used in place of a variable collimator. The axial resolution defined by the −6 dB width was measured to be 8.5µm in air (corresponding to a resolution of 6.2µm in tissue (n = 1.38)), and the transverse resolution with a 5.18mm beam diameter incident on the cornea was estimated to be 3.6µm assuming a 22.2mm focal length of the eye and refractive index of 1.33 for water at 1.06µm. Images were acquired with the system focus placed at the photoreceptor layer using GPU-based real-time OCT B-scan images. Wavefront distortion correction was realized by optimizing the shape of a deformable mirror in the system using a Sensorless AO (SAO) technique.
Four different locations (centered at~3.5 • ,~5 • ,~6.5 • , and~8 • ) temporal to the fovea were imaged in five subjects. For each retinal location, five AO-OCT volumes (1.25 • × 1.25 • FOV) were acquired with 200 × 200 sampling density in a second. From the acquired volumes, a single volume with the least motion artifact was chosen and used for the rest of the analysis. Each AO-OCT image was resized to 400 × 400 pixels using bicubic interpolation to allow for the 'non-cone' space to be at least 1 pixel. An example of the data acquired from one subject is shown in Fig. 1. The center of each cone photoreceptor was manually segmented using a Wacom Intuos 4 tablet and free image processing software [GNU image manipulation program (GIMP)] in all 20 AO-OCT images. One manual rater (Rater A) segmented data from three subjects (12 images) and another manual rater (Rater B) segmented the data from the remaining two subjects (8 images). For analysis of inter-rater agreement, two AO-OCT images from a normal subject not included in retraining the network were segmented by both raters and compared to the CNN output.

AO-SLO dataset
The AO-SLO dataset used to implement the initial conditions for the convolutional neural network was obtained from Ref. [38], and consisted of 840 confocal AO-SLO images acquired at 0.65 • from the center of fixation, as well as the corresponding manual segmentations for the center of each cone. Each of the images within this dataset were extracted from a 0.96 • × 0.96 • FOV image, resulting in a FOV ranging from~0.20 • × 0.20 • to~0.25 • × 0.25 • from 21 subjects (including 20 normal subjects and 1 subject with deuteranopia) [38].

Data pre-processing
The image acquisition protocols and processing strategies of the AO-SLO and AO-OCT systems are quite different. In particular, the data from the AO-OCT system has a 5-6.25 times larger FOV. Therefore, data augmentation was performed on the AO-SLO data used for training the base network in order to improve the similarity between the two datasets. The data was sub-sampled at a rate of 1.5, 2, 2.5 and 3 to be a closer representation of the resolutions used for AO-OCT imaging.
Image patches were then extracted to use as inputs for training the network. As the manual segmentation protocol did not include non-cone locations, these were extracted using the protocol in [38,43].In brief, a Voronoi diagram was constructed using the manually labeled cones as the center of each Voronoi cell. The boundaries were then assumed to be non-cone pixels, and a single point along each boundary was randomly chosen and placed in the non-cone set. A 33x33 window was then placed over each cone and non-cone location and used as input to the network. Locations closer than 16 pixels to the edge were discarded and not used.

Network training methods
The convolutional neural network used in this experiment is a slightly modified Cifar network taken from the AO-SLO cone segmentation paper [38]. The details of the network architecture are given in Table 1. Inputs to this network are 33x33x1 feature maps centered on a cone or non-cone pixel with the corresponding binary label. The final fully connected layer provides a score for each class (cone and non-cone), which are input into a soft-max layer that outputs the probability of the original center pixel belonging to each class. A similar network was also used in [43] to incorporate confocal AO-SLO and split detector AO-SLO image pairs. In brief,separate paths were used for layers 1-15 for the confocal and the split detector images, after which a concatenation layer combined the two 1 x 1 x 64 vectors output from the confocal and split detector paths into a single 1 x 1 x 128 vector that was fed into the rest of the network.
The experiments reported in this paper use two different methods to modify the AO-SLO network to segment cone photorecepters in AO-OCT images: transfer learning and fine-tuning.
In both methods, certain layers of the base network trained on AO-SLO data were set to be non-trainable and the rest of the layers were then retrained using the AO-OCT data. In the first experiment, transfer learning was used so that only the classifier would be retrained and the first 15 layers of the base network were set to be non-trainable. The other two experiments used fine-tuning, where the base network was frozen before the second and third convolutional layers (layer 5 and 9, respectively) and the remaining trainable weights were subsequently retrained using the AO-OCT data. There was an average of 35,494 cone AO-OCT patches and 82,867 non-cone AO-OCT patches used on average for fine-tuning, and 336,280 AO-SLO patches used for the initial training. The batch size for training the base network was set to 100, and the maximum number of epochs was set to be 50 with an early stopping parameter set to when the validation loss hadn't decreased in 4 epochs. For the transfer learning techniques, the batch size was decreased to 32. The learning rate was 0.001 and the weight decay was set at 0.0001 for both initial training and fine tuning.
Five-fold cross-validation on all manually segmented AO-OCT images was performed. The 20 original images were divided into 5 sets, so that all images from the same subject were placed into the same set. Images from four of the subjects were used to train the network, and images from the remaining subject were used to test the network. This procedure was repeated five times with a different test subject each time.
The CNN based detection method was implemented in TensorFlow and the Keras API [44] using Python 3.5.4. We ran the algorithm on a desktop PC with an i7-6700K CPU at 4.0 GHz, 16 GB of RAM, and a GeForce GTX 780 Ti X GPU. The average run time for segmenting a new image after training was 20 seconds.

Performance evaluation methods
A number of quantitative metrics were used to determine the effectiveness of the convolutional neural network. Probability maps were generated by extracting 33x33 pixel patches from each pixel location in the image as inputs to the trained network. The outputs of the network, the probabilities that each pixel was centered on a cone, were then arranged to generate a probability map the same size as the original image. These probability maps were binarized using Otsu's method [45] and the centroid of any 4-connected components were taken to be the centers of cones. Any pixels within half the input size (16 pixels) to the edge of the input image were discarded from analysis. Within this implementation, True Positive (TP) results will indicate that the data was located within 0.5 of the median spacing between manually marked cones to a manually marked cone, False Positive (FP) results will indicate that the automatically detected cones were not matched to a manually detected cone, and False Negative (FN) results will indicate that manually marked cones did not automatically match detected cones. Given these definitions, Dice's coefficient, Sensitivity, and False Discovery Rate are defined in Equations (1-3) respectively.
Dice s Coe f f icient = 2T P T P + F N + T P + FP (1) False Discover y Rate = FP T P + FP

Cone mosaic analysis methods
Voronoi diagrams were automatically constructed from each automated cone mosaic to calculate density, area, and proportion of hexagonal Voronoi domains, which are indicators of the regularity of the cone packing arrangement [46][47][48]. To analyze the regularity of the cone mosaics, the images were grouped together by their retinal eccentricities where Area 1 was closest to the fovea (~3.5 • ), and Area 4 was furthest (~8 • ). Cone density was defined as the ratio of the number of bound Voronoi cells in an image to the summed area of the bound Voronoi cells. To calculate the proportion of 6-sided Voronoi domains, the number of Voronoi cells with six sides was divided by the total number of bound Voronoi cells within an image. The number of neighbours was calculated as the mean number of sides of all bound Voronoi cells in an image. Similarly, the Voronoi cell area was calculated as the mean area of the bound Voronoi cells in an image. An example summary of this analysis is shown in Fig. 2 along with the original AO-OCT image. The Voronoi boundary map (green) and the automated centres of each cone (magenta) are shown in Fig. 2B, the number of neighbours map where each Voronoi cell is shaded depending on the number of neighbours is shown in Fig. 2C and the Voronoi cell area map where each Voronoi cell is shaded based on the cell area is shown in Fig. 2D.

Performance evaluation
The performances of the automated algorithms in comparison to manual grading are summarized in Table 2, including training on a network with randomly initialized weights. For the case of the randomly initialized weights, only AO-OCT images were used to train the network. The results from all four methods are within the standard deviations across all three quantitative measurements. A trend of over-segmenting in the methods which retrained more weights (Fine-Tuning Layer 5 and Layer 9) can be seen in the lower resultant sensitivities and slightly better false discovery rates. These numbers are slightly worse than the original network trained only on confocal AO-SLO data, where the sensitivity was 0.989 ± 0.012, the false discovery rate was 0.008 ± 0.014, and the Dice's Coefficient was 0.990 ± 0.010 [38].  Figure 3 displays the results of the automated algorithms in comparison to manual grading for one of the AO-OCT datasets. In the marked images, a green point indicates an automatically detected cone that was matched to a manually marked cone (true positive), a yellow point indicates a cone missed by the automatic algorithm (false negative), and a red point indicates an automatic marking with no corresponding manually marked cone (false positive). From the images we can see that the methods performed quite well and that the methods that retrained more weights in the network produced more automated cone locations, resulting in fewer missed cones (hence less yellow locations) and more red locations, than the transfer learning method which only retrained the classifier.

Inter-rater agreement
As previously mentioned in Section 2.1, two AO-OCT images from a normal subject not included in training the network were segmented by both raters for inter-rater analysis. For the AO-SLO images in [38], the manual rater segmentation quality was high because the FOV was small, the images highly sampled, and the cones hence had a high contrast. The AO-OCT images used in this report had a larger FOV, and hence a lower sampling density. Consequently, the contrast of the cones was poorer, leading to minor disagreement in segmentation even between the manual graders. The inter-rater performance can be seen in Fig. 4 along with the original AO-OCT image and a comparison of Rater A to the CNN output. As can be seen in Fig. 4b,e, Rater B found more cones than Rater A did in the majority of the image area, with the exception of areas in the blood vessel shadow where they were markedly more conservative. Similarly, the CNN found more cones than Rater A and in fact was more similar to Rater B as shown in Table 3. All inter-rater measurements are within the standard deviation of the automated results, suggesting that the automated segmentation is comparable to that of a human rater.

Cone mosaic analysis
From the performance evaluation of the three different transfer learning methods, we chose the results from the Fine-Tuning (Layer 5) method to further analyze as it had the highest Dice's Coefficient. The results are summarized in Table 4 for all cone analysis parameters. The proportion of hexagonal cells and mean number of neighbours remained relatively constant over the areas imaged, while a general trend of the mosaic becoming less dense further from the fovea can be observed.

Discussion
In this work, we investigated the use of transfer learning techniques using an automatic CNN based method for detecting cone photoreceptors in AO-OCT images. Using manually marked images from a confocal AO-SLO system to initialize the weights of our network, we have demonstrated retraining a CNN to extract features of interest and classify cones in previously unseen images from an AO-OCT imaging system. We tested our method on images of various retinal eccentricities and showed that our method had good agreement with the current gold standard of manual marking. In addition, we used various morphometric cone mosaic measurements to show quantitative agreement with measurements from AO-SLO systems. As shown in Table 2, performance of the CNN based algorithms were comparable to the current gold standard of manual grading which suggests that the automated segmentation is  comparable to that of a human rater. This is highly encouraging, because traditional methods of photoreceptor cone segmentation heavily utilize modality and FOV specific ad hoc rules, which limit their application to other imaging protocols and require further algorithm modification and development for new imaging protocols. All that was needed to adapt the algorithm from confocal AO-SLO to AO-OCT was the corresponding training dataset. As we can see from the Table, the results from only using AO-OCT images and random weight initialization are comparable to using fine-tuning and transfer learning methods. We postulate that this is due to the large number of AO-OCT training data (118,361 patches) and that for a CNN, recognizing simple shapes like high contrast cones is a relatively straightforward task. As such, Fig. 5 shows how the number of training patches affects the performance of the different CNN algorithms. Random initialization performed poorly for 1000 patches, was close to the transfer learning methods at 2000 patches, but ultimately fit within the standard deviation at 3000 patches and above, whereas the transfer learning methods were stable from 1000 patches. This has important implications as we look to use CNNs for pathological images where the signal-to-noise ratio is lower and there are much fewer datasets from which to train. Additionally, the algorithm's output measurements are congruent with AO-SLO data from the Literature [48]. The proportion of hexagonal cells remained relatively constant over the areas imaged, which is consistent with the Literature although the overall values are slightly lower than reported for AO-SLO data (52.6 ± 6.56 at 3.5 • and 50.9 ± 7.32 at 8 • ) [48]. Similarly, the mean number of neighbours also remained relatively constant over the areas imaged, which is consistent with the Literature although the overall values are slightly lower than reported for AO-SLO data (52.6 ± 6.56 at 3.5 • and 50.9 ± 7.32 at 8 • ) [48]. The general trend of the mosaic becoming less dense further from the fovea was also observed and is shown in Fig. 6 which compares our cone density measurements to measurements found in the Literature [49][50][51] for histology and AO-SLO data. Datapoints from Ref. [49] were extracted from the figures in the paper. As shown, our data follows the general trend for the eccentricities imaged.
There is a trend towards overestimating cones in the algorithms where more layers were frozen with the AO-SLO initialized weights (Fine-Tuning (Layer 9) and Transfer Learning) as shown in the higher false discovery rate in Table 2 although the true positive rate was better. This is also reflected in Fig. 3, where the Fine-Tuning (Layer 5) results show less false positive cones. The majority of false positives was in regions below blood vessels. This could be improved by including more training data specifically around regions of vessels. This could either be done by acquiring more data around blood vessels, or using data augmentation techniques to modify the current dataset. Alternatively, the CNN could be combined with other pre-processing steps, such as blood vessel segmentation (for example, using OCT-A as shown in Fig. 7) to identify and remove the regions below vessels from the quantitative analysis. Moreover, poor inter-observer agreement can negatively affect the performance of learning based methods such as CNN. Utilization of datasets graded by multiple observers, for example both Rater A and Fig. 6. Comparison of cone density measurements from the Literature to the cone density measurements from the AO-OCT system. The gold standard of histology [49] is shown with a trendline, as well as measurements from two AO-SLO systems [50,51].
Rater B, could further improve performance.
There is also space for improvement in all of the proposed algorithms mentioned in this work. The network architecture was chosen as it was the one published with the open-source confocal AO-SLO dataset, and was not optimized for the AO-OCT data. Additionally, the hyper-parameters were empirically chosen to provide good performance for both the fine-tuning and transfer learning methods. It is possible these parameters could be further optimized to provide better performance. Additionally, applying further custom pre-processing and post-processing steps such as multi-acquisition registration and averaging [52] may improve the results presented here.
Previously demonstrated in the Literature was confocal AO-SLO to split aperture AO-SLO; although different, there were similarities in field of view. In this report, transfer to a completely different imaging modality was demonstrated. Furthermore, the differences in scale of the PRs across retinal eccentricities while maintaining good sensitivity and false discovery rate is a significant achievement and demonstration of CNN usefulness. In general, for the cone photoreceptor mosaic, the features have a relatively well defined underlying structure, and a regularity for which CNNs are well suited. This implies that this transfer learning technique could be used to analyze images from commercial flood fundus photoreceptor images as well. There exists potential for CNNs to help unite images from different modalities with or without AO enabled imaging techniques for understanding photoreceptor changes in disease.
An example of this kind of multi modality imaging is shown in Fig. 7 where structural OCT, OCT-A and AO-OCT were used. Though the individual AO-OCT images used in this report are currently considered to be large field of view when imaging photoreceptors, being able to view wide field structural images is necessary for locating areas of interest. Further studies looking at various pathologies using wide field structural OCT cross sections and OCT-A to observe microvasculature changes and locate smaller regions of interest to then image with the higher resolution AO-OCT, using the CNN for quantification of the cone mosaic and vasculature [37], would be pertinent for proving clinical utility.

Conclusion
CNNs provide the opportunity to adapt to changing conditions without having to adjust ad hoc rules, but instead retraining the network on a sufficiently large database. In this paper, we experimentally demonstrated three different transfer learning methods to identify the cones in a small set of AO-OCT images using a base network trained on AO-SLO images which all obtained results similar to that of a manual rater. Using the results from the Fine-Tuning (Layer 5) method, we calculated four different cone mosaic parameters which were similar to results found in AO-SLO images showing the utility of our method.