Automatic glaucoma classification using color fundus images based on convolutional neural networks and transfer learning

: Glaucoma detection in color fundus images is a challenging task that requires expertise and years of practice. In this study we exploited the application of different Convolutional Neural Networks (CNN) schemes to show the influence in the performance of relevant factors like the data set size, the architecture and the use of transfer learning vs newly defined architectures. We also compared the performance of the CNN based system with respect to human evaluators and explored the influence of the integration of images and data collected from the clinical history of the patients. We accomplished the best performance using a transfer learning scheme with VGG19 achieving an AUC of 0.94 with sensitivity and specificity ratios similar to the expert evaluators of the study. The experimental results using three different data sets with 2313 images indicate that this solution can be a valuable option for the design of a computer aid system for the detection of glaucoma in large-scale screening programs

tests and exam usually inclu photography imaging test (OCT) [11] a [12]. Nowada the optic nerv Color fun algorithms to digital fundus programs exp Nevertheless, glaucomatous To overcome detection algo four main ch cupping, neur [15]. Figure 1 images. To aid in proposed. We [16][17][18] and from the imag extracted feat potential for others we can compressed fe B-splines coe extracted from with a feature SVM classifi [23] or with a minations for udes detailed and a tonome such as scanni and scanning la ays, only OCT ve in glaucoma ndus imaging aid in the de s cameras in plains the int the subjective s signs is a cha e this difficulty orithms based hanges in the ro-retinal rim th 1 shows some . Example of find e optic cup to disc glaucoma) provok disc (b) The neur or region is broa oral region. The alt the detection e can find work numerous glau ge or transform tures could ide better represen n mention glauc eatures extract efficients [19], m preprocessed e extraction ba er [21,22], or an adaptive hist final confirma medical histo etry [8,9], and ing laser tomo aser polarimetr images and ph . has been oft etection and gr primary care terest in scree e interpretation allenging task t y a great effor on image proc retinal structu hinning, retina of the typical ings used to detec ratio (CDR). The kes optic disc cupp roretinal rim usua der than the supe teration of this pat of glaucoma ks that focus on ucoma-detectio med versions of entify or consid ntation o case coma detection ed from the pix or using highe d images and a ased on higher using an emp togram equaliz ation of the dia ory, slit lamp d since the 90 ography (HRT) ry with variabl hotographs are ften used in c rading of eye settings and t ening for gla n of color fun that requires sp rt has been m cessing of colo ures associate al nerve fibre la signs assessed t glaucoma in colo e reduction of the ping, central cup b ally follows a nor erior, broader tha ttern is a suspiciou numerous ima n the localizati on algorithms f the image to t der relevant inf classification n based on a pr xel intensity va er order spectr a Support Vecto order spectra irical wavelet zation convolv agnosis. A com p examination 's it also inclu ) [10], optical le corneal com widely used to combination w diseases [ [20], or orm and a ares SVM processed to create local configuration patterns that feed a k-nearest neighbor (kNN) classifier [24]. The previous methods apply the approach of identifying features in the image in order to train a classifier with all the findings extracted directly from the image or from a transformed version of it (using wavelets, FT, high order spectra analysis…). At the end, the different algorithms explore different aspects and transformations of the ONH to determine patterns that are representative and may identify glaucoma. In this work we applied a different approach to the glaucoma detection problem through the use of Convolutional Neural Networks (CNNs). Convolutional networks, commonly known as one of the most popular deep learning algorithms for image analysis, have become very rapidly a successful alternative for analyzing medical images. These methods could be considered as the evolution of the supervised techniques started at the end of the 1990s, where training data sets of previously classified images are used to develop the system. This strategy supersedes the previous approach based on feature extraction and posterior classification mentioned in previous paragraphs. The new deep learning paradigm implies that computers can perform the feature learning and classification simultaneously. We can usually find in a deep learning algorithm a network (model) formed by many layers that transform an input data (images normally) to outputs (e.g. pathology present/absent). The most successful type of models for medical image analysis is a sub-class of neural networks called convolutional neural networks (CNN) that was introduced in the 1980s [25].
In [26] Litjens et al., provided a thorough review of the current use of these techniques in medical analysis. The study mentions state-of-the art applications of deep learning technology in the main topics of biomedical image processing: classification, object detection, segmentation or registration among others. Shin et al. [27] mentioned three mayor strategies that used CNNs to medical image classification problems: training from scratch, using off-the shell pre-trained CNN features, and conducting unsupervised CNN pre-training with supervised fine-tuning. Training a deep CNN from scratch (or full training) presents relevant limitations. It requires a large amount of labeled data, which in fields like medical imaging could be extremely expensive to collect both in time and budget, especially for images that present pathological findings relevant for diagnosis. Besides, the training of a deep CNN usually requires extensive memory and computational resources and it could be a very time consuming task. Finally, the design of a CNN and the adjustment of the hyper-parameters of the network could be a challenging process that requires dealing with overfitting and other issues that can limit the success of the application of this technology. One alternative to overcome these problems is the use of transfer learning with fine tuning. Transfer learning is a method successfully used in machine learning and data mining for classification, regression and clustering problems. It is generally defined as the capability of the system to utilize the knowledge learned in one domain of interest, to another that shares some common characteristics [28]. The use of state-of-the art performance CNNs on the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [29,30] (with millions of labeled images from 1000 different classes) has been successfully tested in several medical image analysis studies [27,[31][32][33]. This last method is the one we have used in this work: the transfer learning of CNN models pre-trained with different image data sets and fine-tuned to solve a specific medical imaging task, the automatic classification of glaucoma in fundus photographs.
CNNs have been successfully applied to color fundus images in the context of Computer-Aided-Detection (CAD) systems and screening programs for eye diseases, achieving state-ofthe art performance or outperforming previous implementations. We can find since 2015 several studies applying CNN in retinal vessel segmentation [34], image quality assessment [35], segmentation of the optic disc and the optic disc cup [36,37], diabetic retinopathy detection [38], age-macular degeneration detection [13] or hemorrhage detection [39] among others. The CNN architectures successfully exploit both local and global features present in the images, being a proper tool for the detection of glaucoma. Several studies have already tackled this problem in color fundus images using CNN. In [40] the authors applied a six layers architecture to optic disc patches previously segmented. In [41] the authors used CNN to extract features and train a SVM classifier to detect glaucoma. Recently Fu et. al. [42] presented a novel ensemble network based on the application of different CNNs to the global fundus image and to different versions of optic disc region. The assessment of deep learning algorithms with transfer learning has also been addressed in [43][44][45] implementing studies with greater number of images than previous works and achieving expert level accuracy and high sensitivity and specificity. Finally, in OCT there are also recent studies applying CNNs for glaucoma detection [46] or segmentation of layers [47,48]. In Table 1 we present a summary of the methods used for glaucoma detection, describing the data sets used and the results reported. Our work is aimed at developing tools for prescreening in computer aided diagnosis system for the detection of glaucoma in large-scale screening programs. In this paper we assessed the application of different CNN architectures for the classification of glaucoma with fundus color images. We studied the performance of different architectures as well as some transfer learning schemes from pre-trained CNN models. We have ensured that all the architectures that are the basis of the current state-of-the-art methods are represented in our comparisons including an Inception based network such as GogleLeNet, a RestNet based architecture, concretely ResNet50 and a recently proposed network such as DENet. Finally, we will consider the potential benefit of integrating basic clinical data collected in the screening studies of the patients in the final classification.
The rest of the paper is structured in several sections. In section 2 we describe the data sets, the network architectures, the training process and the performance metrics used in the study. In section 3 we report the results of the different experiments. Finally, section 4 contains the summary of the study with the main conclusions and suggestions for further works.

Study data sets
We used 2313 retinal fundus images in this work coming from three different data sets: two publicly available (RIM-ONE [49] and DRISHTI-GS [50]) and one private from a screening campaign performed at Hospital de la Esperanza (Parc de Salut Mar) in Barcelona (Spain). We considered two categories, glaucoma and normal. Glaucoma includes the images classified by specialists as suspect of glaucoma or with glaucoma.
The Open Retinal Image Database for Optic Nerve Evaluation (RIM-ONE) is an ophthalmic image group of databases designed in order to be a reference for the design of optic nerve head segmentation algorithms and in development of computer-aided glaucoma diagnosis. The database was created by the collaboration of three Spanish hospitals: Hospital Universitario de Canarias, Hospital Clínico San Carlos and Hospital Universitario Miguel Servet. Our study used the three releases included by the authors until now. The image set was designed in collaboration of 4 glaucoma experts. The camera used to capture the images was a Nidek AFC-210 background camera with a 21.1-megapixel Canon EOS 5D Mark II body. All the images are centered at the optic disc.
The data set DRISHTI-GS consist of 101 retinal fundus images for optic disc segmentation. All images were collected at Aravind Eye Hospital in Madurai (India). Glaucoma patient selection was done by clinical experts based on findings during examination. The retinal images come from Indian patients of 40-80 years old. The images were taken with the eyes dilated, centered on the optic disk, with a field of view of 30-degrees and of dimension 2996x1944 pixels and PNG uncompressed image format.
Finally the ESPERANZA data set consisted of 1446 color fundus images with a field of view 45 degrees, centered on the macula and including the optic disc from patients with age ranging from 55 to 86 years. The retinal images were provided from the glaucoma detection campaign performed to 1006 different patients. During the examination of the patients, a short clinical history was collected (like age, family history of glaucoma, personal record of glaucoma and glaucoma therapy, among others) and besides the color fundus images, other tests (like the measurement of the intraocular pressure (IOP) in both eyes, the visual acuity and Optical Coherence Tomography images) were performed. Special care was taken in order to create a reference gold standard data set. All the images had a double complete glaucoma evaluation performed by six expert (senior) ophthalmologists and nine non-expert (younger) ophthalmologists using a tele-screening tool. The ophthalmologists with more than five years of experience were considered as experts in this work. In case of disagreement between the two evaluations performed, two glaucoma experts decided the final classification of the image by consensus. Each ophthalmologist evaluated a proportional part of all the images inside its category. The assessment of the images included the evaluation of image quality in four categories (good, enough, bad or not evaluable) and the clinical classification in three categories (normal, glaucoma suspect or glaucoma). The selection of images from the campaign to be included in the final data set used in this work corresponds to the images that were labeled by the evaluators with good or enough quality and in the case of glaucoma positive images we included the ones classified as glaucoma suspect or glaucoma. Table 2 contains information of the retinal images considered in the study from each data set. We considered two classes for the classifications tasks studied in this work because the majority of the data sets used (RIM-ONE r2, r3 and DRISHTI-GS) were defined with only two classes (glaucoma and normal). RIM-ONE r1 is the only one that accounted four classes (normal, early, moderate and deep glaucoma) while ESPERANZA data set defined three classes (normal, glaucoma-suspect and glaucoma). The initial size of the images of the RIM-ONE data sets is very different. In the case of r1 the image sizes vary from 316x342 to 831x869. In the version r2 the sizes vary from 290x290 to 1375x1654. Figure 2    As we confirmed in this work, the use of these architectures offers two valuable benefits. First, the design and architectural improvements of these CNNs can warranty superior performance ratios even in a training from scratch scheme. Second, the use of pre-trained deep CNNs and the subsequent fine-tune of the weights of the network applying the new labeled images, could lead to even better performance metrics and a potential reduction in training resources in terms of time, memory and computational operations. Besides the STANDARD CNN presented in Fig. 3, we selected other commonly used architectures (VGG19, GoogLeNet, and ResNet50) and the recently presented DENet, specifically designed for glaucoma screening detection. We prepared several experiments to analyze quantitatively the contribution of the selection of the architecture, the training scheme, finetuning (VGG19 TL, GOOGLENET TL, RESNET50 TL, DENET DISC TL) versus full training (VGG19, GOOGLENET, RESNET50 and DENET DISC), and the data set used for training.
VGG19 [53] is a publicly available CNN model that includes five stacks, each stack contains between two and four convolutional layer followed by a max-pooling layer, and it ends with three fully connected layers. The main contribution of this architecture was the increasing of the depth of the network (in this work we applied the version with 19 layers) and the use of very small (3x3) convolutional filters. We can find applications of this network in several studies, for the fixed features extraction in CADx for breast cancer detection with different imaging modalities [54] or for classification of 19 different skin diseases [55].
GoogLeNet proposed in [56] accomplished as main contribution the improved utilization of the computing resources inside the network, increasing the depth and width of the network but keeping the computational budget constant. It also proposed a new module called "Inception" which concatenates convolution layers with different kernel sizes and one pooling layer into a single new filter. The complete architecture contains 22 layers including two convolution layers, three pooling layers and nine inception layers. GoogLeNet was successfully used in the detection of lymph node metastases in women with breast cancer [57], in the classification of normal and cancerous lung tissues from CARS (Coherent anti-Stokes Raman scattering) images [58] or retinal pathologies using optical coherence tomography (OCT) images [31].
Residual Networks (ResNet) [59] have been broadly tested in general and medical image classification. The main characteristics of ResNet are the intensive application of batch normalization and the use of "shortcut connections" in order to tackle the low performance issues due to the vanishing divergence and the vanishing gradient problems in deep CNN. ResNet has been successfully adopted in recent works of glaucoma classification [43,45] and it has also been included in the ensemble CNN DENet [42].
The Disc-Aware Ensemble Network (DENet) [42,60] is a glaucoma screening network based on an ensemble of four networks. DENet considers various levels and modules of the fundus images. Two networks exploit global fundus images and are based on ResNet and Ushape convolutional network (U-net). The other two networks are centered on the local optic disc region, previously cropped based on one of the previous networks, and use ResNet to classify the disc region and a polar transformed version of the same disc area. The authors provided the code of the four networks and the trained models using the ORIGA full data set. In this study we used DENet in two different contexts. First, we evaluated the ensemble network using the global fundus images from a subset of our data set (only ESPERANZA and DRISHTI-GS have global images) and the pre-trained models provided by the authors (DENET). Secondly, in order to assess the impact in the performance of the use of transfer learning, fine tuning a pre-trained glaucoma classification model, we used one of the networks of the ensemble, DENet Disc. This model uses segmented optic disc color fundus images and allows us to train the network with all our training data set from scratch (DENET DISC) and with transfer learning initializing it with the pre-trained models provided (DENET DISC TL).
The input in VGG19, GoogleNet, ResNet50 and DENet Disc networks were the preprocessed color fundus images from the data sets of the study with a final size of 224x224x3 and centered at the optic disc. The input images in the case of the standard network were 256x256x3. The region of interest presented to all the networks was the same. We changed the last layer with the softmax classifier in VGG19, GoogleNet and ResNet50 to consider only the two classes of interest in our study (glaucoma and normal).

Performance metrics
To evaluate the performance of the algorithms Receiver Operating Characteristic (ROC) analysis was performed and sensitivity/specificity ratios were calculated. We defined sensitivity, or true positive rate, as the number of true positives (number of images with glaucoma correctly detected) divided by the sum of the number of true positives and false negatives (images incorrectly classified as normal). Therefore, the sensitivity shows the percentage of glaucoma cases correctly identified by the algorithm. We defined specificity as the number of true negatives (number of images normal correctly detected in our case) divided by sum of the number of true negatives and the false positives (images incorrectly classified as glaucoma). The specificity is a ratio that shows the percentage of normal cases correctly identified. We also considered the Balanced Accuracy (BAcc) as the mean of sensitivity and specificity to take into account the imbalance in the number of positive and negative cases in the testing data set. We used the ROC graph for visualizing the performance of the networks [61]. The ROC graph is a two dimensional representation with the sensitivity in the Y axis and 1-specificity in the X axes. We compared the performance of the algorithms using the area under the receiver operating curve (AUC) generated by ROC curve. For the calculation of the optimum threshold we considered the Youden index, defined as the index where the sum of the specificity −1 and sensitivity is maximum [62].

Training and testing processes
The complete data set was randomly divided in three different groups: training, validation and test set with the distribution of images presented in Table 5. The validation set was used to monitor the number of epochs of the training process. The number of epochs with better performance in terms of accuracy in both classes (glaucoma and normal) with the validation set is listed in Table 6. After the selection of the hyper-parameters and the cross validation experiments, the rest of the trainings in the study considered in the training set the images from validation set. The final training set contained 370 glaucoma and 1364 normal images. The test sets were the same during all the experiments (124 glaucoma and 455 normal images) Table 5. Distribution of images in the groups "glaucoma" and "normal".

Group
Training The overfitting it is a well-known issue in CNNs with limited training data. In order to limit this important problem in the training process we applied different strategies commonly used for this purpose. First, the selection of the number of epochs to complete the process was stopped when the performance on the validation set was reduced. Second, we applied dropout, the technique presented by [63] that consists in temporarily removing units along with all its incoming and outgoing connections in a neural network. We included dropout with p = 0.5 in the standard CNN. In the rest of the tested architectures we maintained the regularization schemes designed by the authors. Finally, we applied data augmentation. This technique consists of training and/or testing on systematically transformed images. The transformations used typically have to maintain the classification of the original image. In our study we first balanced the number of images in each group to 1400 in both groups (glaucoma and normal), with horizontal flips in the case of the normal images and with a random combination of flips and translations of 20 pixels for the glaucoma images. We also used data augmentation during training in all the networks. In each iteration, every image included in the batch could be transformed by a random combination of the operations: random flip, random small rotations between −10 degrees and + 10 degrees and random translation of maximum *-20 pixels in the x or y direction of the image.

Implementation
The preprocessing steps were implemented using Matlab R2016b 64-bits (Mathworks, Inc.) on a desktop computer equipped with an Intel Xeon CPU ES31245 and the CNN experiments were implemented in Python (version 2.7.12) using the libraries Lasagne (version 0.2) [64] and Theano (version 0.9) [65] in a desktop computer with a NVIDIA GeForce GTX Titan Pascal 12GB GPU.

Results and discussion
We defined several experiments to evaluate the best solution in terms of performance and to get more insight of the different alternatives tested.

Hyper-parameter selection
As described before we used the validation set to select the number of epochs for all the networks and types of training (full training/fine tuning) used during the study. In Table 6 we present the final values selected. For the STANDARD CNN we used Stochastic Gradient Descent (SDG) updates, a binary cross entropy loss function, a learning rate of 0.005 and a batch size of 64. In the case of the transfer learning networks (VGG19, RESNET50, GOOGLENET and DENET DISC) we selected SDG updates with Nesterov momentum 0.9, a learning rate of 0.0001, the categorical cross entropy loss function and a batch size of 32.

CNN algorithm comparison
After the selection of the hyper-parameters we evaluated all the architectures under study (STANDARD CNN, VGG19, RESNET, GOOGLENET and two versions of DENET) in terms of the performance metrics, using all the data included in the training and validation subsets for training, and for testing the full test set. In Table 7 and Fig. 4 we present the ROC curves and the performance metrics. The two first options were VGG19 TL and RESNET50 TL with an AUC of 0.942 and 0.930 and in the case of RESNET TL the best sensitivity 91.94%. It is also remarkable that DENET DISC TL (based on ResNet50) and fine tuned from ORIGA data set did not outperform RESNET50 TL fine-tuned from ImageNet. The impact from fine-tuning from a general data set (ImageNet) is clear if we consider the difference between the performance of both trainings (full training and transfer learning) in VGG19 and RESNET50. In the case of DENET DISC, that used fine tuning from a targeted model (ORIGA), we noticed the biggest improvement in AUC (from 0.8371 to 0.9142) compared with the rest of the networks that used ImageNet.
VGG19 TL performed with the highest AUC, 0.9420, which is in the range of state of art results of the last published studies (AUCs between 0.91 and 0.985) [42][43][44][45]. The performance of the VGG19 TL was even higher (AUC 0.9640, sensitivity 94.51%, specificity 90.99% and BAcc 92.75%) if we considered only the glaucoma images at any stage of the disease, without the cases represented as glaucoma-suspect present in the ESPERANZA data set.
As we mentioned before, DENET uses global fundus images. As all the images in RIM-ONE data sets were centered in the optic disc, the number of data sets that we considered to evaluate DENET were limited to ESPERANZA and DRISHTI-GS data sets. Taking into account these two data sets we evaluated DENET directly without further training. The performance ratios (AUC 0.7507, sensitivity 70.45%, specificity 70.26% and BAcc 70.35%) were inferior to the performance of all other networks for this sub-test data set (Table 8).
These results demonstrate that CNNs can be trained, using large data sets and without having to specify lesion-based features, to identify glaucoma in retinal fundus images with high sensitivity and high specificity. Besides, the best option achieved state-of-the art performance ratios with images coming from multiples sources and different sizes and formats. According to our results this heterogeneity of the data sets could represent a differential value for the training of this type of classification strategies, and suppose a relevant parameter to consider respect to other issues like the complexity of the network or a great amount of more homogeneous images, where the capacity of the network to learn new

Ten-fold
To confirm ro was performe (Table 5) to performance results showe deviation of A curves and the

CNN/hum
We compare performance o points. The (89.14%) and evaluators (76 performance specificity of considering th ble 9 the perfo erformance rat tween the per slight improv T presented an results sugges entary and thei .

Integration of medical data and CNNs
The final experiment consisted in the integration of clinical history data collected during patient's screening visit with color fundus images. As we have mentioned, CNNs are especially successful analyzing medical images. But the clinical diagnosis usually involves the assessment of a variety of exams that includes medical images but also data coming from different sources, like the medical history of the patient or from tests where the output could be a simple value. From this perspective, the exclusive use of one test could limit the final diagnosis and represent only a part of the global disease complexity. In order to consider a broader view in the diagnosis, some studies have explored the application of CNN integrating different imaging modalities [67,68] but also combining images with raw data, like in [69] where histological images and genomic data were integrated in a single CNN. Following this approach, we designed one experiment with the ESPERANZA data set integrating color fundus images with medical history data collected from the same patient during the screening campaign. The data selected from the screening visit were: the age, the intraocular pressure (IOP) in both eyes, the family history of glaucoma, the personal record of glaucoma and glaucoma related therapy. The selection of the data was based on some of the major risk factors for glaucoma described in the literature [1,70]. In the next tables we present more information of the clinical data used in the experiment. This data was collected during the examination of the patients in the ophthalmological screening campaign described in section 2.1. Color fundus images and the examination data were used together during the training by including the raw data into the last fully connected layer of the network. We present more details of the integration in Fig. 10. The clinical data were used in its raw format without post-processing. The value of the "Family record of glaucoma", "Personal record of glaucoma" and "Personal glaucoma related therapy" had values 0 or 1. All the cases from the training and validation subsets from ESPERANZA were used for training (80 glaucoma and 995 normal) and the ESPERANZA cases from the test data set were used for testing (33 glaucoma and 338 normal). For this experiment we performed a full training and we only used the ESPERANZA data set, because is the only data set that contains information of the examination of the patients. We trained the network for 100 epochs. We performed two different experiments: the first one using all previous data and the second only with the age and the personal record of glaucoma because according with Tables 10 and 11 are the only data which appears different in the glaucoma and normal groups. As we can see in Table 12 and Fig. 11, clinical histor this informati and further t considered.  In this study we have concentrated in a direct classification approach but other strategies and tools could be considered in order to aid the clinicians in the final diagnosis. The integration of clinical data from different sources or considering other imaging modalities could improve the performance of the classification, but other techniques that give insight and help to visualize and understand the features that have a relevant role in the final classification represent a valuable option. In this sense we highlight alternatives like the use of occlusion testing to recognize areas of the image with more impact in the classification [43,71] or saliency maps where the clinicians could be informed of the areas of the images with more influence in the prediction [72].

Conclusion
In this paper, we exploited and evaluated the application of deep convolutional neural networks for glaucoma detection using color fundus images in the context of large screening campaigns. We studied the influence of the architecture, the data set size, the training strategy and the integration of data collected from the clinical history and the patient examination. We used three different data sets, two publicly available DRISHTI-GS and RIM-ONE, and other created from a glaucoma screening campaign to assess the performance of the alternatives. The five architectures tested, standard CNN, VGG19, ResNet50, DENet and GoogLeNet offered good performance ratios in terms of AUC, sensitivity and specificity. The best option for the data set used was VGG19 with transfer learning and fine tuning, with an AUC of 0.94, a sensitivity of 87.01% and a specificity of 89.01%, which showed a similar performance with respect to the expert evaluators of the screening campaign. We confirmed the great influence of the number of images and data sets using CNNs. The AUC of the VGG19 with fine tuning increased from 0.85 with one data set to 0.94 with all the data sets of the study. Finally, we evaluated the performance of the integration of the data from the clinical history and tonometry tests with the color fundus images. The results show a slight improvement in sensitivity and specificity with similar AUCs. Further tests with more data and new architectural approaches should be developed and assessed to confirm this line of work. The good results presented demonstrated that CNNs are a valuable alternative for CAD systems to assess and classify fundus images for glaucoma detection campaigns.