Automatic Detection of Cone Photoreceptors With Fully Convolutional Networks

Purpose To develop a fully automatic method, based on deep learning algorithms, for determining the locations of cone photoreceptors within adaptive optics scanning laser ophthalmoscope images and evaluate its performance against a dataset of manually segmented images. Methods A fully convolutional network (FCN) based on U-Net architecture was used to generate prediction probability maps and then used a localization algorithm to reduce the prediction map to a collection of points. The proposed method was trained and tested on two publicly available datasets of different imaging modalities, with Dice overlap, false discovery rate, and true positive reported to assess performance. Results The proposed method achieves a Dice coefficient of 0.989, true positive rate of 0.987, and false discovery rate of 0.009 on the first confocal dataset; and a Dice coefficient of 0.926, true positive rate of 0.909, and false discovery rate of 0.051 on the second split detector dataset. Results compare favorably with a previously proposed method, but this method provides quicker (25 times faster) evaluation performance. Conclusions The proposed FCN-based method demonstrates that deep learning algorithms can achieve accurate cone localizations, almost comparable to a human expert, while labeling the images. Translational Relevance Manual cone photoreceptor identification is a time-consuming task due to the large number of cones present within a single image; using the proposed FCN-based method could support the image analysis task, drastically reducing the need for manual assessment of the photoreceptor mosaic.


Introduction
Cone photoreceptors are vital for human vision. Specifically, these cells serve daylight and color vision. Diseases such as Stargardt's disease, 1 retinitis pigmentosa, 2 choroideremia, 3 and macular degeneration 4 are characterized by the loss of photoreceptors leading to impaired vision. A way to image the photoreceptor array is using an adaptive optics scanning laser ophthalmoscope (AOSLO). Two common variants of the AOSLO imaging modality are confocal and split detector, each providing slightly different information on photoreceptor structure. 5,6 Regardless of the method use to acquire the images, the cones must be located within the image to create quantifiable information and extract metrics, such as cone density and spacing and packing arrangements. Given the high density of cones within the image, manual cone identification can be timeconsuming and inconsistent. Several automatic or semi-automatic methods have been proposed to create a faster and more consistent cone detection process. Some methods are based on standard image analysis techniques: image histogram analysis, 7 multiscale modelling and normalized cross-correlation, 8 a circular Hough transform, 9 and multiscale circular voting. 10 In recent years, machine learning methods have also been applied to this problem. Cunefare et al. 11 proposed a so-called ''patch-based'' method involving generating a probability map through a sliding window convolutional neural network (CNN) and then postprocessing this probability map to locate cone positions. The CNN, which works on a small window of the entire image, generates a binary (two-class) classification of the image as either the patch centered on a cone or not centered on a cone. By moving the window along different sections of the image, the probability map is obtained. The postprocessing method, which is needed to extract peaks from the probability map (cone locations), contains several steps and several tunable parameters. Heisler et al. 12 investigated the use of transfer learning on the network of Cunefare et al. 11 to enable classifications of previously unseen data collected from a different imaging modality (AO scanning laser ophthalmoscope). Davidson et al. 13 proposed a method using a multidimensional Recurrent Neural Network (RNN), which generates a probability map for the entire image in a single set of computations.
Patch-based and CNNs have commonly been applied to ophthalmic medical images, such as retinal segmentation or classification, and provide state-ofthe-art performance in these areas. [14][15][16] Fully Convolutional Networks (FCNs) are an extension of CNNs. 17 The main benefit of an FCN is the ability to process the entirety of an image at once and provide a per-pixel probability map. FCNs are commonly used for object segmentation, region labeling, or other per-pixel operations 17 and have been used for geographic atrophy segmentation in retinal tomography images. 18 FCNs have been commonly used in medical image processing for problems such as retinal layer segmentation, 19 segmentation of neuronal structures, 20 and cell detection. 21 For per-pixel operations, FCNs are commonly quicker than a patch-based CNN or RNN, as the FCN only passes over the data once, whereas data are repeatedly evaluated in the case of a patch-based CNN and multiple recurrent loops increase the number of operations in the case of an RNN. 22 In this work, we propose the application on a FCN for cone detection in confocal and split detector AOSLO images. We use a previously published method, based on a patch-based technique, 11 as a baseline to assess the benefit that a FCN approach may have in this particular problem.

Methods
The method for finding cones consists of two steps. The first step is the generation of a probability map through an FCN, and the second step is postprocessing the probability map to a collection of cone locations. Training of the neural network and parameter selection of the postprocessing are done separately. The FCN was trained on the given training images for the dataset (details below), for 50 epochs. After training was complete, the training images were processed through the FCN and then generated probability maps were used to optimize the detection parameters of the cone location postprocessing according to a parameter sweep. This trained combination of FCN and detection parameters was then used to segment the test images giving the final performance.

Datasets
The FCN method was trained and evaluated on two publicly available datasets, namely, one consisting of confocal samples, acquired by Garrioch et al., 5 and one consisting of split detector samples acquired by Cunefare et al. 6 ; data associated with this publication and used here for testing of the proposed methods were obtained online (https://github.com/ DavidCunefare/CNN-Cone-Detection).
The confocal set consists a total of 840 images, split into a testing set of 640 images and a training set of 200 images. Full acquisition parameters are given in Garrioch. 5 Given some images had different dimensions and this would complicate setting the network, all images were cropped to a common pixel size, using the central 144 3 144-pixel region, corresponding to an area ranging from 62 to 74 lm 2 . The split detector set consists of 264 images, split into a testing set of 80 images and a training set of 184 images. Full acquisition parameters are given in Cunefare et al. 6 Like the confocal set, all images were cropped to a central 144 3 144-pixel region, corresponding to an area ranging from 56 to 68 lm 2 . Both sets were independently trained and evaluated, with independent networks and detection parameters.

Preprocessing and Augmentation
Given the relative sparseness of the pixels identified as being ''cone location'' in a single image, several modifications to create the ground truth maps (cone labels) were tested. A number of pixels from 0 to 3 were added in a diamond shape around each positive sample to give a larger number of positive samples to aid learning and assess if bias in the sample had an effect on performance. A size of 0 indicates a single pixel used as a mask, as shown in Figure 1.
Additionally, in the case of the confocal dataset, rotational augmentation in steps of 90 degrees was used because the confocal modality is rotation invariant. It is worth noting that split detector images are not rotation invariant due to the scan being created from the difference of two offset channels, and preliminary testing confirmed adding rotational augmentation to the training data reduced performance while testing on nonrotated images.

Fully Convolutional Network
The network used for this method was a modified U-net (Fig. 2) to have only a single convolutional-  ReLU-batch norm block per pooling layer, zero padding to maintain input-output size parity, and a dropout of 0.5 at the bottleneck. All images used as inputs to the network were normalized on a per-image basis to be between 0 and 1 inclusive. A detailed explanation of the original network can be found elsewhere, 20 for completeness; only a brief summary of the method is provided here. The network has a contracting path (encoder) followed by an expanding path (decoder), both of which contain a large number of feature channels giving the network greater capacity to propagate contextual information. To improve pixel-wise localization, ''skips'' were added to concatenate each feature map in the expanding path (decoder) with the corresponding feature map at the same level in the contracting path (encoder). 20 Figure 2 shows the exact structure of the network. Convolutional layers are the main learnable operation within the network, where an input feature map is convolved by a filter of size W3H3F, where W indicates width, H indicates height, and F indicates the number of filters. Rectified linear units provide activations, only allowing an output if the input is above zero. Batch norm layers are used to normalize training, keeping inputs bounded between 0 and 1. Max pooling layers subsample the input feature map by taking the largest value within a W3H region every S pixels in either dimension. Dropout layers commonly improve generalization and prevent overfitting to the training data and randomly turn off individual input pixels in the feature map at rate P. Transposed convolutional layers increase the size of the input feature map by S with a kernel of size W3H3F. Concatenation layers join two or more other layers together by adding extra channels. Softmax layers output the relative probabilities for each channel in the input feature map.

Cone Localization
The output from the FCN is a probability map of the same size as the input image, with a gradient from background prediction to cone prediction over a width of several pixels in practice. In order to reduce this probability map to a list of cones positions, the detection scheme used by Cunefare et al. 11 was applied to the probability maps. This allows the direct comparison of the methods (patch-based versus FCN).
The first step of this method is smoothing the probability map with a Gaussian filter of standard deviation r. The second step is extended-maxima transform to find maximal regions where the maximum intraregion height difference is less than H. The third step is to filter the maximal regions based on the original probability map against a minimal threshold T. Finally, the centroids of the remaining clusters are taken as the cone locations. The parameters r, H, and T were tested over 0 to 2 in steps of 0.1, 0 to 1 in steps of 0.1, and 0 to 0.3 in steps of 0.05, followed by 0.4 and 0.5, respectively.

Testing
To compare the performance of the proposed method to existing methods in the literature the true positive rate (TPR), false detection rate (FDR), and Dice overlap (Dice) are given for all methods. These are based on the number of true positives (T p ), where both automatic methods and truth indicate a cone; false positives (F p ), where the automatic methods indicate a cone, but there is no corresponding cone in the truth; and false negatives (F n ), where a cone is indicated by truth, but is not predicted by an automatic method. The equations for these metrics are shown below in Equations 1.1 to 1.3.
Given the cropping of the original images and the possible confusion around borders where positive samples may lie outside the area of the image, a padding of 2 pixels around the inside of each border was added to a ''free zone,'' with the remainder of the image serving as a ''testing area.'' In the event that a prediction in the testing area had no corresponding true positive in the testing area but it did in the free zone, this was treated as a positive prediction. Positive samples in the free zone that had no corresponding prediction in the testing area were not treated as negatives (Fig. 3).
Patch-based methods can be trained on images of different sizes if a consistent size patch is extracted. However, to ensure a fair comparison between methods, the same central region used for the FCN was extracted from the full segmentation. To assess the effects of the different architectures on performance, a repeated measures analysis of variance was performed to examine the statistical significance of the different metrics associated with these factors.

Results
All proposed methods were trained and tested in MATLAB 2018b (MathWorks, Natick, MA) by using the first party deep learning libraries, on a computer with a Xeon E5-2620v4 and an Nvidia Titan Xp. The patch-based networks 11 were taken directly from an open source implementation for comparison purposes and were trained and tested on MATLAB 2018b with MatConvNet 1.0-beta25, on a computer with a Xeon E5-2620v4 and an Nvidia Titan Xp.
Regarding the computational time, the proposed FCN method generates a probability map for a 144 3 144-pixel image in 0.03 seconds, compared to 0.85 seconds for the patch-based CNN approach with an identical image. As both methods use the same cone location algorithm, both methods have the same running time added. The cone location step took 0.008 seconds on average, so the total for the FCN method is 0.038 seconds, and the total for the patchbased CNN is 0.858 seconds.
To assess the performance and the repeatability of the method, each network was trained from scratch and tested four times, with identical testing and training data sets and with results recorded independently, and the average of three metrics (and their standard deviation) across all networks is presented. No transfer learning is performed. Instead, each network is trained from scratch with weights initialized using small random values. Given the inherent randomness with neural networks, this serves to control for instabilities during initialization and training and gives a more accurate idea of what performance window is to be expected if training this network independently. Tables 1 and 2 summarize the performance on the confocal dataset and split detector dataset, respectively, and provide a comparison with existing methods. Figure 4 provides a visual comparison on the two different imaging modalities. In this work, we present the performance as the mean (and standard deviation) of the four runs to assess both performance and consistency of the network. The small standard deviation across the different tested conditions ensures the mean values provide a representative picture of performance.

Discussion
Across both datasets, the proposed FCN method with the three-wide padded data provides the best performance, in terms of the mean value of the different evaluated metrics. On comparison to the    previously proposed patch-based technique, 11 the margin between existing patch based and the FCN methods is greater in the split detector dataset where the three-wide FCN exhibited the best performance across the three measured parameters. However, none of the three parameters were statistically significantly different to the Cunefare (P . 0.01). For the confocal dataset, all three metrics were statistically significantly different to the Cunefare (P , 0.01); although the differences were small in magnitude, the FCN shows superior performance for Dice and True positive rate. The network was also evaluated against the multidimensional RNN proposed by Davidson and colleagues. 13 However, because this particular network was designed to segment cones in the split detector image, the performance between datasets showed a big disparity and was not compared with the proposed method. The split detector had a good performance (true positive rate ¼ 0.775, false discovery rate ¼ 0.042, Dice ¼ 0.849), whereas the confocal presented poor metrics (true positive rate ¼ 0.233, false discovery rate ¼ 0.007, Dice ¼ 0.376). Changes in the network architecture may be needed to improve performance. A limitation of the proposed method is the twostep approach. It is likely that better performance could be achieved by evaluating the FCN on the exact metric used for performance in the training stages. Finding a way to change the network to directly output a collection of cone positions would be ideal; however, this is not a trivial problem and should be considered in future studies.
Cone photoreceptor imaging modalities can have a small depth of focus, so blur can be present in the image. 23 Understanding how the different models deal with this blur and the impact on performance should be considered in future studies. Also, future studies should evaluate the proposed method on datasets from pathological eyes, as many diseases can change the mosaic structure significantly, 1-4 as well as with other imaging modalities such as dual split and confocal. 24 Whether similar performance can be achieved with datasets from pathological eyes remains to be seen.

Conclusion
FCNs have proven useful for a number of image analysis tasks. In this work, the proposed end-to-end FCN method is able to provide a fast detection of cones in two different image modalities. The overall performance of the method is comparable or superior to previously proposed methods that are patch-based but with the added advantage that it runs in a fraction of the time.