Segmentation of the Optic Nerve Head Based on Deep Learning to Determine its Hemoglobin Content in Normal and Glaucomatous Subjects

Objective: To determine the limits of the optic nerve head (ONH) in color fundus images using Depp learning (DL) for the estimation of its hemoglobin topographic distribution. Also, to evaluate the usefulness of that distribution in glaucoma diagnosis singly or in association with perimetry. Methods: A DL method was trained using 40000 fundus images and applied to 89 normal eyes and 77 confirmed or suspect glaucomas. DL and manual segmentation were compared. The eyes were also examined once with TOP perimetry (Octopus 300) and Spectralis-OCT and twice with Cirrus-OCT and Laguna ONhE, a program which estimates hemoglobin from color photographs, using improved criteria from previous studies. Results: The Sorensen-Dice similarity index between manual and automatic segmentations was 0.993. Intraclass correlation coefficients were similar when comparing the results of the Laguna ONhE indices using the manual and automatic segmentations (confidence intervals: 0.933-0.978). For specificity close to 95%, the GDF index, a factor that measures the distribution of hemoglobin at the nerve, obtained sensitivities between 70.1 and 74.0% (manual vs. automatic segmentations). The retinal nerve fiber layer thickness (RNFLT) of both OCTs provided sensitivities between 67.1 and 68.8% and the BMO-RMW of Spectralis-OCT 69.7%. Associating several normalized indices, e.g. a new visual field harmony index (Threshold Coefficient of Variation, TCV) and GDF, provided 85.7% sensitivity for 97.8% specificity. GDF correlation with Spectralis-OCT BMO-RMW index was similar to that obtained between this index and the RNFLT of the same instrument. For 95% specificity, the diagnostic concordance (kappa value) between both Spectralis-OCT indices was 0.694 and between its BMO-RMW and Laguna ONhE GDF 0.804-0.828. Conclusion: A fully automatic delimitation of the optic nerve head allows the correct, reproducible and efficient use of the Laguna ONhE method, and its effectiveness is greatly increased if associated with a perimetric harmony index.


Introduction
A serious difficulty in assessing conventional color images of normal and glaucomatous optic nerve heads is to establish their anatomical limits in a reproducible manner. Each user will use her/his own criteria, so no uniformity of results can be expected.
Numerous procedures have been designed to automate this task [1][2][3][4], but they usually require some form of user support. Instruments such as the Heidelberg retina tomograph (HRT, Heidelberg Inst., Germany) were commonly used leaving the delimitation of nerve boundaries to the user's discretion, because some attempts at automatic definition did not produce results as consistent as expected [5].
There are several problems in achieving the goal of total automatism. One of them is the great variability in the size and shape of the optic nerve head. Another is the frequent presence of areas of atrophy and/or pigmentation at the edges. Finally, it is known that the real dimension of the internal canal of the nerve is less than its apparent magnitude [6].
For this work we have applied deep learning (DL) techniques to achieve the automatic identification of the position and shape of the optic nerve head (ONH) in fundus color images, and the fully automatic segmentation of its edges. The final objective was to optimize the reproducibility of the Laguna ONhE method described above [7][8][9], for the estimation of the distribution of hemoglobin in normal and glaucomatous optic nerves, avoiding the user's subjective criteria.
Our first paper on this subject [7] described a glaucoma discriminant function (GDF), which combines the slope of the Hb

Journal of Clinical & Experimental Ophthalmology
amount between the center and periphery of the nerve, and that of the sectors that tend to be most frequently affected in this disease. This index has been enriched as information has accumulated, including the relationships between hemoglobin at the rim sectors most frequently affected (upper and lower) and those least affected (nasal and temporal), and associating the result of the evaluation of hemoglobin distribution in optic nerves using machine learning. These networks identify the likelihood that the hemoglobin distribution at the ONH shows the usual characteristics of normal subjects or the peculiar ones in glaucoma. The results of this new formulation of the GDF index, which we could call "globin distribution factor", in terms of sensitivity, specificity and reproducibility, delimiting the ONH manually and automatically, is the ultimate objective of this work.

Methods
The study protocol adhered to the tenets of the Declaration of Helsinki and was approved by the Research Ethics Committee of the University Hospital of the Canary Islands. The participants were informed about the study objectives and signed informed consent was obtained from all of them.
A total of 40000 images including those we have made publicly available, known as RIM-ONE [10] were used to train DL networks in the identification of the position of the ONH and its edges. DL semantic segmentation was chosen in order to obtain pixel level information of the location and context of the ONH in the image. Specifically, a U-Net architecture has been used for this semantic segmentation. This architecture has had not only great results with biomedical image segmentation, as it was originally proposed by Ronneberger et al. [11], in whose paper it can be consulted in detail, but also on many other segmentation projects.
Synthetically, U-Net is encoder-decoder architecture. The encoder part (Max-pooling) gradually reduces the spatial dimension using grouping layers to detect the details to be identified, and the decoder (Up-convolution) gradually recovers the object details and spatial dimension. There are shortcut connections from encoder to decoder to help the decoder recover the object details better. For the purpose of this project, a customized version was used with contracting path filter sizes of 16, 32, 64, 128, 256 and 512 pixels towards a center filter of 1024 pixels with batch normalization [12] at each convolution block and ReLU activation.
The neural network was carried out with Python programming language (Python Software Foundation, https://www.python.org/) and libraries: Keras with Tensorflow background for the implementation of the U-Net architecture, OpenCV for image pre-processing and postprocessing, Numpy for calculations, Pandas for data manipulation and Scikit-learn for splitting the training set between training and validation sets.
Before the images entered the network for training, several preprocesses were carried out. Initially, all images and masks were re-sized to 512 × 512 pixels to fit the neural network's architecture. Also, data augmentation was carried out at random by changing saturation, shifting, scaling and rotating images and flipping them horizontally. The monitor value for the training of the network was Sorensen-Dice loss.
To do this, a single expert (Gonzalez de la Rosa) manually defined the limits of 2000 optic nerve heads, trying to adjust to the inner edge of the Elschnig scleral ring (Figure 1). Two thirds of the sample was used for training and one third for the evaluation of segmentations. Applying the results to a new series of 2000 cases, automatic segmentations that were considered erroneous were manually delimited and reentered into the system to re-train the net. The sample was then progressively expanded in successive iterations to complete the 40000 cases. From then on, a sample of 89 healthy eyes and 77 glaucomatous eyes were consecutively and prospectively selected to compare the manual with the automatic segmentations, as well as the results when applying the Laguna ONhE method. Normal eyes were recruited from patients referred for refraction who underwent routine examination without abnormal ocular findings, hospital staff, and relatives of patients in our hospital.
Eligible subjects had to have a best-corrected visual acuity of 20/40 or better, refractive error within ± 5.00 diopters equivalent sphere, and ± 2 diopters astigmatism, and an open anterior chamber angle. The presence of cataract was not considered a criterion for exclusion a priori. Age and previous cataract and glaucoma surgery were not criteria for exclusion. We excluded patients with any other associated eye disease that could interfere with the interpretation of the results.

Study protocol
Participants underwent a full ophthalmologic examination, including: clinical history, visual acuity, slit-lamp bio-microscopy, intraocular pressure (IOP) measurement, and ophthalmoscopy of the posterior pole.
All control and glaucoma patients had perimetric assessment, having undergone at least two previous examinations. White-on-white TOP-32 strategy was used in an Octopus 300 perimeter (Hagg-Streit AG, Bern, Switzerland). An abnormal perimetry was defined as reproducible glaucomatous visual field loss in the absence of any other abnormalities to explain the defect.
Two photographs of each eye fundus were obtained using a Horus Scope DEC-200 handheld camera (MiiS, Taiwan). Disk boundaries were defined manually and automatically as explained above.
Two examinations of each case with the Cirrus spectral-domain Optical Coherence Tomograph (Cube 200 × 200 acquisition protocol, software version 5.2, Cirrus-OCT; Carl Zeiss Meditec) and one examination with the Glaucoma Module Premium Edition (GMPE) of the Spectralis-OCT (Heidelberg Eng., Germany) were also performed.
All the ophthalmic examinations, perimetry tests, and morphologic evaluations were performed within 1 month from the subject's date of enrolment in the study.
The Laguna ONhE program has been previously described in detail [13]. It used mathematical algorithms for automatic component segmentation to identify the central retinal vessels. Thus, two areas of the ONH were defined: the central retinal vessels and the ONH tissue itself. The program analysed three components of ONH photographs: blue (B), green (G), and red (R) and applied the formula (R-G)/R to the pixels of vessels and tissue. The result obtained for the vessels was used as the reference value for calculating the Hb content in the tissue. The (R-G)/R value was calculated for any area of the tissue, then divided by the (R-G)/R value for the vessels and the result was multiplied by 100. Thus a relative measure (percentage) of the amount of hemoglobin in the tissue was obtained.
The distribution of hemoglobin in some sectors was used to estimate the vertical cup/disc ratio. Similarly, the position of the cup and rim, and hemoglobin present the rim sectors was estimated.
After a first multicenter study demonstrating the usefulness of the procedure, we analysed its reproducibility [14], and it was then applied to ocular hypertensives [15]. New studies have allowed separating rim from cup information, analyzing images obtained with a stereoscopic camera [16], and also associating two-dimensional photographic images with optical disk and cup delimitations provided by OCT [17]. Finally, their results have been analysed with the 360 degrees of the of the retinal nerve fiber thickness in a recent paper [18].

Classification into groups
Healthy eyes had an IOP of less than 21 mmHg, no history of increased IOP, normal optic disc morphology, and normal visual field results. Patients in the "glaucoma" group had characteristic optic nerve defects or suspected disease data, such as intraocular pressure (IOP) greater than 21 mmHg, associated with a family history of glaucoma, a suggestive aspect of the ONH, or a borderline condition in the visual field, for example, a mean defect greater than 2 dB or depressed points in the defect curve. Subjects with IOP greater than 25 mmHg or pressures between 21 and 25 mmHg associated with a corneal thickness (ECC) less than 500 μm were also included, regardless of other types of signs or symptoms, even if the diagnosis of glaucoma had not been fully confirmed.

Statistical analysis
For the development of the Laguna ONhE program, the blue, green, and red components of the images were evaluated using the Matlab image analysis program (The MathWorks Inc., Natick, MA) and its toolbox for image processing.
The coincidence between the areas of optic nerve delimited by the expert and those established by the neural network was evaluated using the Sorensen-Dice index, which consists of dividing twice the area of coincidence of both delimitations by the sum of the two areas.
The Jaccard index was also calculated by dividing the area of coincidence by the sum of the two areas minus the area of coincidence.
The areas under the receiver operating characteristic curves (AUCs) were calculated and sensitivities at a fixed specificity close to 95%, and 98% (5%-2% false positive rate) for some principal parameters of each procedure.
After checking for a normal distribution of the variables, Pearson correlations were also calculated between some structural and Laguna ONhE parameters. Reproducibility was assessed by obtaining the intra-class correlation coefficient for simple measures and the interrate agreement in the diagnostic classification using the kappa index.
A new index was calculated to value the harmony of the visual field, using the subject himself as a pattern [19]. This index, called threshold's coefficient of variation (TCV) is calculated as the Pearson coefficient of variation of the thresholds in symmetrical positions of the visual field.
For data normalization, the function available in Excel was used, which transforms the original values into data of a comparable range, taking into account the average and standard deviation observed in the reference cases.

Results
Demographic and clinical characteristics of the two groups studied are summarized in Table 1. No significant age-related changes were observed in the reference population for the GDF index value, or for the vertical cup/disc ratio estimated by hemoglobin values or the TCV perimetric index (p>0.05). Table 2 shows the result of the ROC analysis of the analysed variables. In all cases, the difference observed between the two groups was statistically significant (Student t-test p<0.0001). The standard error allows estimating the confidence intervals of the areas shown. Table 3 shows the reproducibility results of the indices obtained in two examinations of the same eye, estimated by calculating the intraclass correlation coefficient. Table 4 shows the diagnostic coincidences of the indices obtained, when the cut-off point between normality and pathology was set at the cut-off value specified in Table 2 for specificity close to 95%.  Figure 3 compares this last index with the BMO-RNFLT of the same instrument. It should be noted that, in the latter case, two values were observed that could be considered outliers and that are distinguished, isolated, in the right part of the graph. If both data were deleted, the value of R 2 would be 0.757 (p<0.0001). Table 5 shows the ROC analysis of the result of adding some combinations of morphological, functional and perfusion indices, after normalization of their original value. Figure 4 shows some examples of ROC curves obtained by summing several of these normalized indices. In the case of GDF, automatic nerve delimitation was used.

Discussion
Some authors believe that the segmentation carried out by various experts should be used as a reference [20,21]. Our opinion, on the contrary, is that in order for a result to be reproducible, no variability should be introduced in training the networks, because this variability would be transferred to the result.
Al-Bander et al. [22] have published an extensive review of coincidence indices obtained by various authors when segmenting the ONH. The maximum Jaccard index obtained in the reviewed literature was 0.9398, a much lower figure than ours (0.990). Even lower is the Jaccard index of 0.8289 obtained by these authors in our RIM-ONE series. Sevastopolsky [23] has also used the RIM-ONE database obtaining a Sorensen-Dice index of 0.95, and also with the same series Cerentini et al. [24] reported an accuracy of 87.6%. Other papers using Structured Learning have obtained a similar Dice index of 0.9181 [25].
Our better results seem to depend not only on the use of a much larger sample, but also on the training strategy of the network, incorporating sequentially the manual segmentations of those images that the automatic method segmented in an erroneous way.

Reference group (Average ± SD)
Glaucoma group (Average ± SD) P      Unlike the recent study by Sidong et al. [26] our DL method is used exclusively for the delimitation of the ONH, while diagnosis variables are obtained by calculations made from the RGB values of the image, adding a value of probability obtained through conventional machine learning. On the other hand, Sidong's work confronts normal subjects with confirmed glaucoma, at least in the selection of cases from our RIM-ONE base, which they also use. In our work we have also included cases at risk, but without clear confirmation of glaucoma, trying to simulate a more realistic clinical situation. These differences very likely influence the absolute data of sensitivity and specificity observed.
Zhixi et al. [27] have also applied DL to the diagnosis of glaucoma. In this case, the sample has multiple origins and, as in the previous publication, there is no uniform diagnostic classification. On the other hand, the high sensitivity and specificity obtained contrast with the low index of diagnostic concordance among the experts used as a standard (kappa=0.75). In neither of these two studies is there any clear evidence of the morphological or functional parameters of the cases studied, the comparison of which results could have been more clarifying of their respective advantages. It should be noted that in Li's work, confirmed glaucoma are those with a vertical cup/disc ratio equal to or greater than 0.9.
The ROC analysis of our cases shows the consistency of the sensitivity results of the Laguna ONhE indices when applying the automatic delimitation, with respect to the manual delimitation, which is congruent with the concordance between both segmentations. There are also no significant differences between the results of the GDF index and the morphological indices obtained with both OCTs. Only in the case of the Spectralis BMO-MRW was there a slightly higher ROC area and a slightly higher sensitivity observed for a specificity of 98%, but not for 95%.
There are also no essential differences between the reproducibility of the Laguna ONhE indices and those of the fiber layer thickness measured by Cirrus-OCT. However, the reproducibility of the vertical cup/disc ratio of Cirrus-OCT was significantly lower.
The vertical cup/disc ratio has also been calculated using multi-label deep network [28]. The ROC areas obtained (0.814-0.822) were lower than ours.
The concordance kappa values obtained also guide us on the reproducibility and diagnostic capacity of the indices. Only results higher than 0.9 are observed between the different GDF values, although the one obtained between the two RNFLT exams with the Cirrus-OCT is not very different. It should also be remarked that the correlation between the two Spectralis-OCT indices is similar to the one presented by its BMO-MRW with the Laguna ONhE GDF, as can be seen in Figures 2 and 3.
The association between different indices seems to indicate that adding functional information with morphological or perfusion information, after normalization of their values, is a simple strategy that allows to gain in diagnostic ability. Bearing in mind that some of the cases analysed were not confirmed glaucoma, it is worth highlighting the sensitivity achieved for very high specificities, which are especially desirable for pathology screening tasks. The proposed method seems simpler than other attempts to combine morphology and function [29] including an event-based score that we have recently proposed [19].
Consequently, from the data obtained it appears that it is possible to automatically identify the position of the ONH, with enough reproducibility and agreement with the criterion of an expert. This delimitation facilitates the application of the Laguna ONhE method, contributing with the improvement of its indices to define a useful tool as an isolated procedure, which effectiveness can be enhanced when associated with other morphological and functional methods of examination.
It would be desirable to compare our method in the future with other methods of ocular perfusion measurement, such as laser speckle flowgraphy, to which machine-learning classification [30] methods are also being applied.